fix: remove known scalar optimization from `dak.num` #592

ikrommyd · 2025-08-23T00:54:43Z

This optimization is causing a headache to some users. The reason is that the users typically use dak.num on axis=0 to keep track of the number of events they processed. The problem is that if you know the divisions and yet you fail to read a partition due to some file access error (pretty common), you will have failed to read a chunk of events yet dak.num is going to report a fixed number array.defined_divisions[-1]. This causes users to over estimate the number of events they process.

I claim that this optimization is causing more problems than it solves, therefore I propose to disable it.
If a user wants the number without doing any computation, they can just use array.defined_divisions[-1].
We also can't add an argument to dak.num because the whole point is that the function should have the same. arguments as ak.num (+ automatic dispatching).

codecov-commenter · 2025-08-23T00:56:15Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.38%. Comparing base (8cb8994) to head (5009e45).
⚠️ Report is 270 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #592      +/-   ##
==========================================
- Coverage   93.06%   92.38%   -0.69%     
==========================================
  Files          23       24       +1     
  Lines        3290     3596     +306     
==========================================
+ Hits         3062     3322     +260     
- Misses        228      274      +46

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

martindurant · 2025-08-24T02:14:56Z

This is a heavy change! But it's not just num that wrong, but the partitions in general - is this information not used anywhere, not even by end-user introspection?

In chat, we talked about explicitly removing the knowledge about the partitions in the case that a uses suspects failures and is using the reporting mechanism.

ikrommyd · 2025-08-24T16:47:49Z

Well it's not really that the partitions are wrong. You know a file exists and it has a certain number of entries that you chose to split into partitions in a certain way. These partitions are correct. Now dask-awkward knows this and chooses to report the "length" of this array axis=0 ak.num from the index of the entry where the last partition ends. This is "fine"-ish but it's misleading. It's reported prematurely with this new_known_scalar because you don't actually know if you'll manage to read all of those partitions. Partition 1 may succeed but partition 2 may error due to a network problem when reading across the Atlantic ocean. So if the "length" of this array is meant to be the actual length that you managed to read from disk, then this optimization must be removed. If the "length" should be the would-be length if you managed to read ALL of the partitions, then this optimization is fine. Users think of the "length" as the former though. They want to know the actual length of the array they managed to load into memory and do math with upon computation, not the "would-be" length if HEP data access was perfect (which it isn't). This would-be length they can just always get from the partition sizes even before doing any computation just from pre-processing their files and looking only at metadata (the number of events of each file and the partition steps).

ikrommyd · 2025-08-25T01:09:27Z

For the "would be length", we also have __len__:

dask-awkward/src/dask_awkward/lib/core.py

Lines 1079 to 1087 in f006fb9

    
           def __len__(self) -> int: 
        
               if not self.known_divisions: 
        
                   raise TypeError( 
        
                       "Cannot determine length of collection with unknown partition sizes without executing the graph.\n" 
        
                       "Use `dask_awkward.num(..., axis=0)` if you want a lazy Scalar of the length.\n" 
        
                       "If you want to eagerly compute the partition sizes to have the ability to call `len` on the collection" 
        
                       ", use `.eager_compute_divisions()` on the collection." 
        
                   ) 
        
               return self.divisions[-1]  # type: ignore

ikrommyd · 2025-08-25T15:27:53Z

Maybe __len__ can report back unknown_length if the divisions are not known and the actual length otherwise instead of erroring in the former case? And we can not do the optimization in dak.num and let this one ALWAYS calculate the real length? cc @pfackeldey

martindurant · 2025-08-26T01:56:20Z

It seems to me still, that it would make more logical sense to drop knowledge of the partition edges along with the request for a failure report. What we are saying, is that we expect some partitions to be missing, so the partition edge values are wrong in that case - and this will result in num not having the shortcut available. In this case, num retains the same semantics as other ak operations (as opposed to len, which must evaluate, and could be costly in some cases).

ikrommyd · 2025-08-26T16:10:54Z

Ahh so you are suggesting to not even have known partition edges when we request for a report. I don't know the implications of that. I'm not very familiar with the codebase so I don't know if there are other implications elsewhere if such a thing happens.

ikrommyd added 2 commits August 23, 2025 03:47

Remove known scalar optimization from dak.num

4e9c7e9

We don't need this here

5009e45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: remove known scalar optimization from `dak.num` #592

fix: remove known scalar optimization from `dak.num` #592

Uh oh!

ikrommyd commented Aug 23, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Aug 23, 2025

Uh oh!

martindurant commented Aug 24, 2025

Uh oh!

ikrommyd commented Aug 24, 2025 •

edited

Loading

Uh oh!

ikrommyd commented Aug 25, 2025

Uh oh!

ikrommyd commented Aug 25, 2025

Uh oh!

martindurant commented Aug 26, 2025

Uh oh!

ikrommyd commented Aug 26, 2025

Uh oh!

Uh oh!

fix: remove known scalar optimization from dak.num #592

Are you sure you want to change the base?

fix: remove known scalar optimization from dak.num #592

Uh oh!

Conversation

ikrommyd commented Aug 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Aug 23, 2025

Codecov Report

Uh oh!

martindurant commented Aug 24, 2025

Uh oh!

ikrommyd commented Aug 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ikrommyd commented Aug 25, 2025

Uh oh!

ikrommyd commented Aug 25, 2025

Uh oh!

martindurant commented Aug 26, 2025

Uh oh!

ikrommyd commented Aug 26, 2025

Uh oh!

Uh oh!

fix: remove known scalar optimization from `dak.num` #592

fix: remove known scalar optimization from `dak.num` #592

ikrommyd commented Aug 23, 2025 •

edited

Loading

ikrommyd commented Aug 24, 2025 •

edited

Loading