Sensitivity of step_size parameter
#926
Replies: 1 comment 1 reply
From the reply (excerpt):

> For some clarity on the … Why use the compressed TBasket size (the size on disk) instead of the uncompressed size, or some kind of estimate of the memory space needed by the objects that will be instantiated? These issues complicate the relationship between …
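As a concrete way to see how far apart those two numbers are for a given tree, uproot exposes per-branch compressed and uncompressed byte counts. A minimal sketch (the file path and tree name are placeholders, and the attribute names are those of recent uproot releases):

```python
import uproot

# Placeholder file and tree name; substitute one of the ntuples in question.
with uproot.open("ntuple.root") as f:
    tree = f["nominal"]
    total_comp = total_uncomp = 0
    for name, branch in tree.items():
        comp = branch.compressed_bytes      # TBasket bytes as stored on disk
        uncomp = branch.uncompressed_bytes  # the same baskets after decompression
        total_comp += comp
        total_uncomp += uncomp
        print(f"{name:30s} {comp:>12,d} B on disk  {uncomp:>12,d} B uncompressed")
    if total_comp:
        print(f"overall compression factor: {total_uncomp / total_comp:.1f}x")
```

The overall factor gives a rough sense of how many decompressed bytes an on-disk `step_size` budget turns into, before any further overhead from the Python/Awkward objects built on top.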
Dear Jim, uprooters et al.,
I am running an `uproot` program which reads in `.root` trees, conducts an analysis, and writes trees into output `.root` files. The program uses `utils.make_chunk_events` from `UprootFramework`. Each of the ntuples being read contains a varying number of `.root` files, which amounts to a varying size per ntuple (from ~0.5 GB to ~1.5 TB, corresponding to ~1 to ~350 files per ntuple).
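For concreteness, the kind of chunked loop I mean looks roughly like the sketch below; it uses plain `uproot.iterate` rather than the actual `utils.make_chunk_events` wrapper (whose internals may differ), and the file list, tree name, and branch names are placeholders:

```python
import uproot

# Placeholder inputs: a few ntuple files, each containing a tree named "nominal".
files = {
    "ntuple_000.root": "nominal",
    "ntuple_001.root": "nominal",
}

# step_size is the parameter in question: an integer means "entries per chunk",
# a string such as "100 MB" means a memory budget per chunk.
for chunk in uproot.iterate(files, ["pt", "eta", "phi"], step_size="100 MB", library="ak"):
    # chunk is an Awkward Array holding one chunk's worth of events;
    # the analysis and output writing would happen here.
    n_events = len(chunk)
```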
Recently, it has occurred to me that the program is extremely sensitive to the `step_size` parameter, both in terms of calculation speed and memory usage. For example, when running with the nominal 1.5 GB (as given in the documentation), the chunks (see above) become very large and the memory usage of the program increases dramatically (~O(20 GB)). Conversely, reducing the `step_size` to 50 MB (75 MB and 100 MB give similar results) reduces the chunk size and requires much less memory during the run. Reducing `step_size` too far, to about 2 MB, makes the chunks extremely small and the total running time quite long. Also, the memory usage with a ~2 MB `step_size` is not kept low during a long run: it increases to about 6 GB after an ~hour's run.
It seems there is a non-linear dependence of running time and memory usage on `step_size`, and it is hard to understand what the optimal `step_size` is for a given input-ntuple size. Similarly, it seems that during a run the memory usage increases, and it is not clear whether it ever reaches a steady state. To that effect, I thought setting `parallel=False` might be more memory-conservative, but it is unclear whether it helps, and it is hard to test for a run that is anticipated to take many hours to complete.
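One standard-library way to watch this during a run is to log the process's peak resident memory between chunks, along the lines of the sketch below (Linux-specific: `ru_maxrss` is reported in kilobytes there; the file list and branch names are again placeholders):

```python
import resource
import uproot

files = {"ntuple_000.root": "nominal"}  # placeholder input

for i, chunk in enumerate(uproot.iterate(files, ["pt"], step_size="100 MB")):
    # Peak resident set size of the process so far; on Linux ru_maxrss is in kB.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"chunk {i}: {len(chunk)} events, peak RSS {peak_kb / 1024:.0f} MB")
```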
Could someone kindly provide some information about these issues? Having read the documentation, it is very unclear how to tune `step_size` correctly (and whether `parallel=True/False` has anything to do with restricting memory consumption). Given that this program is meant to run on a very large number of datasets in a batch system, it would be very helpful to understand and hence optimise before committing many computing resources.

Many thanks in advance.
Roy