-
Notifications
You must be signed in to change notification settings - Fork 20
fix: remove known scalar optimization from dak.num
#592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #592 +/- ##
==========================================
- Coverage 93.06% 92.38% -0.69%
==========================================
Files 23 24 +1
Lines 3290 3596 +306
==========================================
+ Hits 3062 3322 +260
- Misses 228 274 +46 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This is a heavy change! But it's not just In chat, we talked about explicitly removing the knowledge about the partitions in the case that a uses suspects failures and is using the reporting mechanism. |
Well it's not really that the partitions are wrong. You know a file exists and it has a certain number of entries that you chose to split into partitions in a certain way. These partitions are correct. Now dask-awkward knows this and chooses to report the "length" of this array |
For the "would be length", we also have dask-awkward/src/dask_awkward/lib/core.py Lines 1079 to 1087 in f006fb9
|
Maybe |
It seems to me still, that it would make more logical sense to drop knowledge of the partition edges along with the request for a failure report. What we are saying, is that we expect some partitions to be missing, so the partition edge values are wrong in that case - and this will result in num not having the shortcut available. In this case, num retains the same semantics as other ak operations (as opposed to len, which must evaluate, and could be costly in some cases). |
Ahh so you are suggesting to not even have known partition edges when we request for a report. I don't know the implications of that. I'm not very familiar with the codebase so I don't know if there are other implications elsewhere if such a thing happens. |
This optimization is causing a headache to some users. The reason is that the users typically use
dak.num
on axis=0 to keep track of the number of events they processed. The problem is that if you know the divisions and yet you fail to read a partition due to some file access error (pretty common), you will have failed to read a chunk of events yetdak.num
is going to report a fixed numberarray.defined_divisions[-1]
. This causes users to over estimate the number of events they process.I claim that this optimization is causing more problems than it solves, therefore I propose to disable it.
If a user wants the number without doing any computation, they can just use
array.defined_divisions[-1]
.We also can't add an argument to
dak.num
because the whole point is that the function should have the same. arguments asak.num
(+ automatic dispatching).