Skip to content

Conversation

ikrommyd
Copy link
Contributor

@ikrommyd ikrommyd commented Aug 23, 2025

This optimization is causing a headache to some users. The reason is that the users typically use dak.num on axis=0 to keep track of the number of events they processed. The problem is that if you know the divisions and yet you fail to read a partition due to some file access error (pretty common), you will have failed to read a chunk of events yet dak.num is going to report a fixed number array.defined_divisions[-1]. This causes users to over estimate the number of events they process.

I claim that this optimization is causing more problems than it solves, therefore I propose to disable it.
If a user wants the number without doing any computation, they can just use array.defined_divisions[-1].
We also can't add an argument to dak.num because the whole point is that the function should have the same. arguments as ak.num (+ automatic dispatching).

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.38%. Comparing base (8cb8994) to head (5009e45).
⚠️ Report is 270 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #592      +/-   ##
==========================================
- Coverage   93.06%   92.38%   -0.69%     
==========================================
  Files          23       24       +1     
  Lines        3290     3596     +306     
==========================================
+ Hits         3062     3322     +260     
- Misses        228      274      +46     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@martindurant
Copy link
Collaborator

This is a heavy change! But it's not just num that wrong, but the partitions in general - is this information not used anywhere, not even by end-user introspection?

In chat, we talked about explicitly removing the knowledge about the partitions in the case that a uses suspects failures and is using the reporting mechanism.

@ikrommyd
Copy link
Contributor Author

ikrommyd commented Aug 24, 2025

Well it's not really that the partitions are wrong. You know a file exists and it has a certain number of entries that you chose to split into partitions in a certain way. These partitions are correct. Now dask-awkward knows this and chooses to report the "length" of this array axis=0 ak.num from the index of the entry where the last partition ends. This is "fine"-ish but it's misleading. It's reported prematurely with this new_known_scalar because you don't actually know if you'll manage to read all of those partitions. Partition 1 may succeed but partition 2 may error due to a network problem when reading across the Atlantic ocean. So if the "length" of this array is meant to be the actual length that you managed to read from disk, then this optimization must be removed. If the "length" should be the would-be length if you managed to read ALL of the partitions, then this optimization is fine. Users think of the "length" as the former though. They want to know the actual length of the array they managed to load into memory and do math with upon computation, not the "would-be" length if HEP data access was perfect (which it isn't). This would-be length they can just always get from the partition sizes even before doing any computation just from pre-processing their files and looking only at metadata (the number of events of each file and the partition steps).

@ikrommyd
Copy link
Contributor Author

For the "would be length", we also have __len__:

def __len__(self) -> int:
if not self.known_divisions:
raise TypeError(
"Cannot determine length of collection with unknown partition sizes without executing the graph.\n"
"Use `dask_awkward.num(..., axis=0)` if you want a lazy Scalar of the length.\n"
"If you want to eagerly compute the partition sizes to have the ability to call `len` on the collection"
", use `.eager_compute_divisions()` on the collection."
)
return self.divisions[-1] # type: ignore

@ikrommyd
Copy link
Contributor Author

Maybe __len__ can report back unknown_length if the divisions are not known and the actual length otherwise instead of erroring in the former case? And we can not do the optimization in dak.num and let this one ALWAYS calculate the real length? cc @pfackeldey

@martindurant
Copy link
Collaborator

It seems to me still, that it would make more logical sense to drop knowledge of the partition edges along with the request for a failure report. What we are saying, is that we expect some partitions to be missing, so the partition edge values are wrong in that case - and this will result in num not having the shortcut available. In this case, num retains the same semantics as other ak operations (as opposed to len, which must evaluate, and could be costly in some cases).

@ikrommyd
Copy link
Contributor Author

Ahh so you are suggesting to not even have known partition edges when we request for a report. I don't know the implications of that. I'm not very familiar with the codebase so I don't know if there are other implications elsewhere if such a thing happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants