Hi experts,
Let's say I am trying to access multiple trees from the same set of files, all of which fit into memory; call the set files. The documentation suggests that uproot.concatenate is the way to go for reading sets of files. I can think of several ways to do this:
Simply make two file lists, files_tree1 and files_tree2, where files_treeX = [f+":treeX" for f in files], and pass each list to uproot.concatenate in a separate call.
I think this is inefficient, since the same files will be opened twice. Maybe caching would help, but it feels like this can be done better.
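For concreteness, here is a minimal sketch of this first option. The helper names tree_paths and read_tree are made up for illustration, and the tree and branch names in the usage comment are placeholders; the only uproot call used is uproot.concatenate with a list of "file:tree" strings and a list of branch expressions.

```python
def tree_paths(files, tree_name):
    """Build the "path:tree" strings that uproot accepts for one tree."""
    return [f"{path}:{tree_name}" for path in files]


def read_tree(files, tree_name, branches):
    """Read one tree from every file with a single uproot.concatenate call."""
    import uproot  # third-party; imported here so tree_paths works without it

    return uproot.concatenate(tree_paths(files, tree_name), expressions=branches)


# Usage (placeholder names): one call per tree, so each file is visited twice.
# data1 = read_tree(files, "tree1", ["tree1_branch1", "tree1_branch2"])
# data2 = read_tree(files, "tree2", ["tree2_branch1"])
```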
[Naively] make a combined file list, files_all = files_tree1 + files_tree2, and a combined branch list, all_branches = tree1_branches + tree2_branches, and pass both to one uproot.concatenate call, hoping uproot magic will know what to do. This surprisingly (at least to me) did not crash. Uproot just produced an awkward array that is a union of two awkward arrays, one holding the tree1 branches (N entries) and one holding the tree2 branches (M entries), with ak.type(data) giving:
N+M * union[{"tree1_branch1": var * float32, "tree1_branch2": var * float32}, {"tree2_branch1": int32}]
so calling data[N+M-1] gives the last entry from tree2 and data[0] gives the first entry of tree1. I guess it can be expected behaviour from the global_index that uproot.concatenate() seems to keep track of (or maybe I'm completely off)?
Anyway, I think we still open each file twice, which is non-ideal.
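To spell out the indexing I observed, here is a small sketch in plain Python. It only illustrates the ordering described above (all N tree1 entries first, then all M tree2 entries); it is not a claim about how uproot keeps track of indices internally, and locate is a name I made up.

```python
def locate(global_index, n_tree1, n_tree2):
    """Map an index into the combined array to (tree name, index within that tree).

    Assumes the combined array holds the n_tree1 entries of tree1 first,
    followed by the n_tree2 entries of tree2.
    """
    total = n_tree1 + n_tree2
    if global_index < 0:
        global_index += total  # support negative indices like data[-1]
    if not 0 <= global_index < total:
        raise IndexError("index out of range for the combined array")
    if global_index < n_tree1:
        return ("tree1", global_index)
    return ("tree2", global_index - n_tree1)


# With N = 3 and M = 2: data[0] is tree1 entry 0, data[N+M-1] is tree2 entry 1.
first = locate(0, 3, 2)
last = locate(3 + 2 - 1, 3, 2)
```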
I can do a manual loop over the files and call uproot.open on each of them, then access the keys from the structure we get back. This way each file is opened once.
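A rough sketch of this manual loop, under some assumptions: read_file_once and gather_by_tree are hypothetical helper names, branches_by_tree is a dict I introduce to map each tree name to its branch list, and the per-file read uses uproot.open plus TTree.arrays. The grouping step is kept separate from the I/O so it can be checked without any ROOT files.

```python
from collections import defaultdict


def read_file_once(path, branches_by_tree):
    """Open one ROOT file a single time and read every requested tree from it."""
    import uproot  # third-party; imported here so gather_by_tree has no dependencies

    with uproot.open(path) as root_file:
        return {
            tree: root_file[tree].arrays(branches)
            for tree, branches in branches_by_tree.items()
        }


def gather_by_tree(files, branches_by_tree, read_one=read_file_once):
    """Call read_one once per file and group the per-file chunks by tree name."""
    chunks = defaultdict(list)
    for path in files:
        for tree, chunk in read_one(path, branches_by_tree).items():
            chunks[tree].append(chunk)
    return dict(chunks)
```

The per-tree lists of chunks can then be joined, for example with ak.concatenate(chunks["tree1"]), to get one array per tree.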
My questions are:
What does uproot.concatenate do in the background that makes it more performant (if that's even true) than uproot.open inside a loop over files? What I can see quickly from a skim over the source code is that concatenate loops over the files one by one, opening them as ReadOnlyFile then grabbing the data, but I am probably missing something subtle in the steps.
What do you recommend as best practice for reading multiple trees from many files?