Skip to content

Commit 51ad657

Browse files
committed
refactor: WIP. Module name to filepath optimisation
This is related to #4598. This changes the file to module associating logic done during dependency graph building. Before, each time a module `Foo.Bar` is found, HLS is testing inside all the import path for the existence of a relevant fiel.. It means that for `i` import paths and `m` modules to locate, `m * n` filesystem operations are done. Note also that this involves a lot of complex string concatenation primitive to build the `FilePath`. A module is tested for each `import` for each of the file of the project. We also test for `boot` files, doubling the number of test. In #4598 we have a project with `1100` modules, in more than 250 import paths and we count more than `17000` `import` statments, resulting on over 6 millions test for file existences. This project was blocking for more than 3 minutes during HLS startup. This commit changes the way this is computed: - At startup, a `Map ModuleName FilePath` (the real type is a bit more involved for performance, multiples unit and boot files handling) is built by scanning all the import paths for files representing the different modules. - Directory scanning is efficient and if import path only contains haskell module, this will never do more job that listing the files of the project. - The lookup is now simplify a `Map` lookup. The performance improvement is as follows: - The number of IO operation is dramatically reduced, from multiples millions to a few recursive directories listing. - A lot of the boilerplate of converting path had be removed. - TODO: add an RTS stats before / after with number of allocations - On my project, the graph building time is reduced from a few minutes to 3s. Limitations: - How to rebuild the `Map` if the content of one directory change? - If one directory is filled with millions of files which are not of interested, performance can be damaged. TODO: add a diagnostic during this phase so the user can learn about this issue. Code status: - The `lookup` is not fully restored, especially it does not include the handling of home unit as well as reexport. - The initialisation phase is cached inside a `TVar` stored as a top level identifier using `unsafePerformIO`. This is to be improved. A note about performance Most users won't see the benefits of these change, but I think they apply to everbody: - We are still doing 1 lookup per `import` per module. But the lookup result is not multiples IO, so it should be faster by a large amount. - Most project only have 1 (or a few) import paths so won't benefit as dramatically as me from this. TODO for allocations
1 parent 321e12e commit 51ad657

File tree

5 files changed

+97
-32
lines changed

5 files changed

+97
-32
lines changed

ghcide/ghcide.cabal

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -107,6 +107,7 @@ library
107107
, unliftio-core
108108
, unordered-containers >=0.2.10.0
109109
, vector
110+
, pretty-simple
110111

111112
if os(windows)
112113
build-depends: Win32

ghcide/src/Development/IDE/Core/RuleTypes.hs

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -392,6 +392,9 @@ type instance RuleResult GetModSummary = ModSummaryResult
392392
-- | Generate a ModSummary with the timestamps and preprocessed content elided, for more successful early cutoff
393393
type instance RuleResult GetModSummaryWithoutTimestamps = ModSummaryResult
394394

395+
type instance RuleResult GetModulesPaths = (M.Map ModuleName (UnitId, NormalizedFilePath),
396+
M.Map ModuleName (UnitId, NormalizedFilePath))
397+
395398
data GetParsedModule = GetParsedModule
396399
deriving (Eq, Show, Generic)
397400
instance Hashable GetParsedModule
@@ -494,6 +497,13 @@ data GetModSummaryWithoutTimestamps = GetModSummaryWithoutTimestamps
494497
instance Hashable GetModSummaryWithoutTimestamps
495498
instance NFData GetModSummaryWithoutTimestamps
496499

500+
-- | Scan all the import directory for existing modules and build a map from
501+
-- module name to paths
502+
data GetModulesPaths = GetModulesPaths
503+
deriving (Eq, Show, Generic)
504+
instance Hashable GetModulesPaths
505+
instance NFData GetModulesPaths
506+
497507
data GetModSummary = GetModSummary
498508
deriving (Eq, Show, Generic)
499509
instance Hashable GetModSummary

ghcide/src/Development/IDE/Core/Rules.hs

Lines changed: 58 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
{-# LANGUAGE CPP #-}
55
{-# LANGUAGE DuplicateRecordFields #-}
66
{-# LANGUAGE TypeFamilies #-}
7+
{-# LANGUAGE PartialTypeSignatures #-}
78

89
-- | A Shake implementation of the compiler service, built
910
-- using the "Shaker" abstraction layer for in-memory use.
@@ -93,7 +94,7 @@ import Data.Proxy
9394
import qualified Data.Text as T
9495
import qualified Data.Text.Encoding as T
9596
import qualified Data.Text.Utf16.Rope.Mixed as Rope
96-
import Data.Time (UTCTime (..))
97+
import Data.Time (UTCTime (..), getCurrentTime, diffUTCTime)
9798
import Data.Time.Clock.POSIX (posixSecondsToUTCTime)
9899
import Data.Tuple.Extra
99100
import Data.Typeable (cast)
@@ -173,6 +174,12 @@ import System.Info.Extra (isWindows)
173174

174175
import qualified Data.IntMap as IM
175176
import GHC.Fingerprint
177+
import Text.Pretty.Simple
178+
import qualified Data.Map.Strict as Map
179+
import System.FilePath (takeExtension, takeFileName, normalise, dropTrailingPathSeparator, dropExtension, splitDirectories)
180+
import Data.Char (isUpper)
181+
import System.Directory.Extra (listFilesRecursive, listFilesInside)
182+
import System.IO.Unsafe
176183

177184
data Log
178185
= LogShake Shake.Log
@@ -311,30 +318,21 @@ getParsedModuleDefinition packageState opt file ms = do
311318
getLocatedImportsRule :: Recorder (WithPriority Log) -> Rules ()
312319
getLocatedImportsRule recorder =
313320
define (cmapWithPrio LogShake recorder) $ \GetLocatedImports file -> do
321+
314322
ModSummaryResult{msrModSummary = ms} <- use_ GetModSummaryWithoutTimestamps file
315-
(KnownTargets targets targetsMap) <- useNoFile_ GetKnownTargets
323+
-- TODO: should we reverse this concatenation, there are way less
324+
-- source import than normal import in theory, so it should be faster
316325
let imports = [(False, imp) | imp <- ms_textual_imps ms] ++ [(True, imp) | imp <- ms_srcimps ms]
317326
env_eq <- use_ GhcSession file
318327
let env = hscEnv env_eq
319328
let import_dirs = map (second homeUnitEnv_dflags) $ hugElts $ hsc_HUG env
320329
let dflags = hsc_dflags env
321330
opt <- getIdeOptions
322-
let getTargetFor modName nfp
323-
| Just (TargetFile nfp') <- HM.lookup (TargetFile nfp) targetsMap = do
324-
-- reuse the existing NormalizedFilePath in order to maximize sharing
325-
itExists <- getFileExists nfp'
326-
return $ if itExists then Just nfp' else Nothing
327-
| Just tt <- HM.lookup (TargetModule modName) targets = do
328-
-- reuse the existing NormalizedFilePath in order to maximize sharing
329-
let ttmap = HM.mapWithKey const (HashSet.toMap tt)
330-
nfp' = HM.lookupDefault nfp nfp ttmap
331-
itExists <- getFileExists nfp'
332-
return $ if itExists then Just nfp' else Nothing
333-
| otherwise = do
334-
itExists <- getFileExists nfp
335-
return $ if itExists then Just nfp else Nothing
331+
332+
moduleMaps <- use_ GetModulesPaths file
336333
(diags, imports') <- fmap unzip $ forM imports $ \(isSource, (mbPkgName, modName)) -> do
337-
diagOrImp <- locateModule (hscSetFlags dflags env) import_dirs (optExtensions opt) getTargetFor modName mbPkgName isSource
334+
335+
diagOrImp <- locateModule moduleMaps (hscSetFlags dflags env) import_dirs (optExtensions opt) modName mbPkgName isSource
338336
case diagOrImp of
339337
Left diags -> pure (diags, Just (modName, Nothing))
340338
Right (FileImport path) -> pure ([], Just (modName, Just path))
@@ -624,6 +622,43 @@ getModuleGraphRule recorder = defineEarlyCutOffNoFile (cmapWithPrio LogShake rec
624622
fs <- toKnownFiles <$> useNoFile_ GetKnownTargets
625623
dependencyInfoForFiles (HashSet.toList fs)
626624

625+
{-# NOINLINE cacheVar #-}
626+
cacheVar = unsafePerformIO (newTVarIO mempty)
627+
628+
getModulesPathsRule :: Recorder (WithPriority Log) -> Rules ()
629+
getModulesPathsRule recorder = defineEarlyCutoff (cmapWithPrio LogShake recorder) $ Rule $ \GetModulesPaths file -> do
630+
env_eq <- use_ GhcSession file
631+
632+
cache <- liftIO (readTVarIO cacheVar)
633+
case Map.lookup (envUnique env_eq) cache of
634+
Just res -> pure (mempty, ([], Just res))
635+
Nothing -> do
636+
let env = hscEnv env_eq
637+
let import_dirs = map (second homeUnitEnv_dflags) $ hugElts $ hsc_HUG env
638+
opt <- getIdeOptions
639+
let exts = (optExtensions opt)
640+
let acceptedExtensions = concatMap (\x -> ['.':x, '.':x <> "-boot"]) exts
641+
642+
(unzip -> (a, b)) <- flip mapM import_dirs $ \(u, dyn) -> do
643+
(unzip -> (a, b)) <- flip mapM (importPaths dyn) $ \dir' -> do
644+
let dir = dropTrailingPathSeparator dir'
645+
let predicate path = pure (path == dir || isUpper (head (takeFileName path)))
646+
let dir_number_directories = length (splitDirectories dir)
647+
let toModule file = mkModuleName (intercalate "." $ drop dir_number_directories (splitDirectories (dropExtension file)))
648+
649+
-- TODO: we are taking/droping extension, this could be factorized to save a few cpu cycles ;)
650+
-- TODO: do acceptedextensions needs to be a set ? or a vector?
651+
modules <- fmap (\path -> (toModule path, toNormalizedFilePath' path)) . filter (\y -> takeExtension y `elem` acceptedExtensions) <$> liftIO (listFilesInside predicate dir)
652+
let isSourceModule (_, path) = "-boot" `isSuffixOf` fromNormalizedFilePath path
653+
let (sourceModules, notSourceModules) = partition isSourceModule modules
654+
pure $ (Map.fromList notSourceModules, Map.fromList sourceModules)
655+
pure (fmap (u,) $ mconcat a, fmap (u, ) $ mconcat b)
656+
657+
let res = (mconcat a, mconcat b)
658+
liftIO $ atomically $ modifyTVar' cacheVar (Map.insert (envUnique env_eq) res)
659+
660+
pure (mempty, ([], Just $ (mconcat a, mconcat b)))
661+
627662
getModuleGraphSingleFileRule :: Recorder (WithPriority Log) -> Rules ()
628663
getModuleGraphSingleFileRule recorder =
629664
defineEarlyCutoff (cmapWithPrio LogShake recorder) $ Rule $ \GetFileModuleGraph file -> do
@@ -632,8 +667,12 @@ getModuleGraphSingleFileRule recorder =
632667

633668
dependencyInfoForFiles :: [NormalizedFilePath] -> Action (BS.ByteString, DependencyInformation)
634669
dependencyInfoForFiles fs = do
670+
-- liftIO $ print ("fs length", length fs)
635671
(rawDepInfo, bm) <- rawDependencyInformation fs
672+
-- liftIO $ print ("ok with raw deps")
673+
-- liftIO $ pPrint rawDepInfo
636674
let (all_fs, _all_ids) = unzip $ HM.toList $ pathToIdMap $ rawPathIdMap rawDepInfo
675+
-- liftIO $ print ("all_fs length", length all_fs)
637676
msrs <- uses GetModSummaryWithoutTimestamps all_fs
638677
let mss = map (fmap msrModSummary) msrs
639678
let deps = map (\i -> IM.lookup (getFilePathId i) (rawImports rawDepInfo)) _all_ids
@@ -712,6 +751,7 @@ loadGhcSession recorder ghcSessionDepsConfig = do
712751
IdeGhcSession{loadSessionFun} <- useNoFile_ GhcSessionIO
713752
-- loading is always returning a absolute path now
714753
(val,deps) <- liftIO $ loadSessionFun $ fromNormalizedFilePath file
754+
-- TODO: this is responsible for a LOT of allocations
715755

716756
-- add the deps to the Shake graph
717757
let addDependency fp = do
@@ -1232,6 +1272,7 @@ mainRule recorder RulesConfig{..} = do
12321272
getModIfaceRule recorder
12331273
getModSummaryRule templateHaskellWarning recorder
12341274
getModuleGraphRule recorder
1275+
getModulesPathsRule recorder
12351276
getModuleGraphSingleFileRule recorder
12361277
getFileHashRule recorder
12371278
knownFilesRule recorder

ghcide/src/Development/IDE/Import/FindImports.hs

Lines changed: 27 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,6 @@
55

66
module Development.IDE.Import.FindImports
77
( locateModule
8-
, locateModuleFile
98
, Import(..)
109
, ArtifactsLocation(..)
1110
, modSummaryToArtifactsLocation
@@ -14,9 +13,8 @@ module Development.IDE.Import.FindImports
1413
) where
1514

1615
import Control.DeepSeq
17-
import Control.Monad.Extra
1816
import Control.Monad.IO.Class
19-
import Data.List (find, isSuffixOf)
17+
import Data.List (isSuffixOf)
2018
import Data.Maybe
2119
import qualified Data.Set as S
2220
import Development.IDE.GHC.Compat as Compat
@@ -26,7 +24,8 @@ import Development.IDE.Types.Diagnostics
2624
import Development.IDE.Types.Location
2725
import GHC.Types.PkgQual
2826
import GHC.Unit.State
29-
import System.FilePath
27+
import Data.Map.Strict (Map)
28+
import qualified Data.Map.Strict as Map
3029

3130

3231
#if MIN_VERSION_ghc(9,11,0)
@@ -70,6 +69,7 @@ data LocateResult
7069
| LocateFoundReexport UnitId
7170
| LocateFoundFile UnitId NormalizedFilePath
7271

72+
{-
7373
-- | locate a module in the file system. Where we go from *daml to Haskell
7474
locateModuleFile :: MonadIO m
7575
=> [(UnitId, [FilePath], S.Set ModuleName)]
@@ -94,6 +94,7 @@ locateModuleFile import_dirss exts targetFor isSource modName = do
9494
maybeBoot ext
9595
| isSource = ext ++ "-boot"
9696
| otherwise = ext
97+
-}
9798

9899
-- | This function is used to map a package name to a set of import paths.
99100
-- It only returns Just for unit-ids which are possible to import into the
@@ -110,36 +111,45 @@ mkImportDirs _env (i, flags) = Just (i, (importPaths flags, reexportedModules fl
110111
-- Haskell
111112
locateModule
112113
:: MonadIO m
113-
=> HscEnv
114+
=> (Map ModuleName (UnitId, NormalizedFilePath),Map ModuleName (UnitId, NormalizedFilePath))
115+
-> HscEnv
114116
-> [(UnitId, DynFlags)] -- ^ Import directories
115117
-> [String] -- ^ File extensions
116-
-> (ModuleName -> NormalizedFilePath -> m (Maybe NormalizedFilePath)) -- ^ does file exist predicate
117118
-> Located ModuleName -- ^ Module name
118119
-> PkgQual -- ^ Package name
119120
-> Bool -- ^ Is boot module
120121
-> m (Either [FileDiagnostic] Import)
121-
locateModule env comp_info exts targetFor modName mbPkgName isSource = do
122+
locateModule moduleMaps@(moduleMap, moduleMapSource) env comp_info exts modName mbPkgName isSource = do
122123
case mbPkgName of
123124
-- 'ThisPkg' just means some home module, not the current unit
124125
ThisPkg uid
126+
-- TODO: there are MANY lookup on import_paths, which is a problem considering that it can be large.
125127
| Just (dirs, reexports) <- lookup uid import_paths
126-
-> lookupLocal uid dirs reexports
128+
-> lookupLocal moduleMaps uid dirs reexports
127129
| otherwise -> return $ Left $ notFoundErr env modName $ LookupNotFound []
128130
-- if a package name is given we only go look for a package
129131
OtherPkg uid
130132
| Just (dirs, reexports) <- lookup uid import_paths
131-
-> lookupLocal uid dirs reexports
133+
-> lookupLocal moduleMaps uid dirs reexports
132134
| otherwise -> lookupInPackageDB
133135
NoPkgQual -> do
134136

135137
-- Reexports for current unit have to be empty because they only apply to other units depending on the
136138
-- current unit. If we set the reexports to be the actual reexports then we risk looping forever trying
137139
-- to find the module from the perspective of the current unit.
138-
mbFile <- locateModuleFile ((homeUnitId_ dflags, importPaths dflags, S.empty) : other_imports) exts targetFor isSource $ unLoc modName
140+
---- locateModuleFile ((homeUnitId_ dflags, importPaths dflags, S.empty) : other_imports) exts targetFor isSource $ unLoc modName
141+
--
142+
-- TODO: handle the other imports, the unit id, ..., reexport.
143+
-- - TODO: should we look for file existence now? If the file was
144+
-- removed from the disk, how will it behaves? How do we invalidate
145+
-- that?
146+
let mbFile = case Map.lookup (unLoc modName) (if isSource then moduleMapSource else moduleMap) of
147+
Nothing -> LocateNotFound
148+
Just (uid, file) -> LocateFoundFile uid file
139149
case mbFile of
140150
LocateNotFound -> lookupInPackageDB
141151
-- Lookup again with the perspective of the unit reexporting the file
142-
LocateFoundReexport uid -> locateModule (hscSetActiveUnitId uid env) comp_info exts targetFor modName noPkgQual isSource
152+
LocateFoundReexport uid -> locateModule moduleMaps (hscSetActiveUnitId uid env) comp_info exts modName noPkgQual isSource
143153
LocateFoundFile uid file -> toModLocation uid file
144154
where
145155
dflags = hsc_dflags env
@@ -180,12 +190,15 @@ locateModule env comp_info exts targetFor modName mbPkgName isSource = do
180190
let genMod = mkModule (RealUnit $ Definite uid) (unLoc modName) -- TODO support backpack holes
181191
return $ Right $ FileImport $ ArtifactsLocation file (Just loc) (not isSource) (Just genMod)
182192

183-
lookupLocal uid dirs reexports = do
184-
mbFile <- locateModuleFile [(uid, dirs, reexports)] exts targetFor isSource $ unLoc modName
193+
lookupLocal moduleMaps@(moduleMapSource, moduleMap) uid dirs reexports = do
194+
-- mbFile <- locateModuleFile [(uid, dirs, reexports)] exts targetFor isSource $ unLoc modName
195+
let mbFile = case Map.lookup (unLoc modName) (if isSource then moduleMapSource else moduleMap) of
196+
Nothing -> LocateNotFound
197+
Just (uid, file) -> LocateFoundFile uid file
185198
case mbFile of
186199
LocateNotFound -> return $ Left $ notFoundErr env modName $ LookupNotFound []
187200
-- Lookup again with the perspective of the unit reexporting the file
188-
LocateFoundReexport uid' -> locateModule (hscSetActiveUnitId uid' env) comp_info exts targetFor modName noPkgQual isSource
201+
LocateFoundReexport uid' -> locateModule moduleMaps (hscSetActiveUnitId uid' env) comp_info exts modName noPkgQual isSource
189202
LocateFoundFile uid' file -> toModLocation uid' file
190203

191204
lookupInPackageDB = do

ghcide/src/Development/IDE/Types/HscEnvEq.hs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
{-# LANGUAGE CPP #-}
22
module Development.IDE.Types.HscEnvEq
33
( HscEnvEq,
4-
hscEnv, newHscEnvEq,
4+
hscEnv, newHscEnvEq, envUnique,
55
updateHscEnvEq,
66
envPackageExports,
77
envVisibleModuleNames,

0 commit comments

Comments
 (0)