Skip to content

Conversation

daniel-wer
Copy link
Member

@daniel-wer daniel-wer commented Aug 2, 2025

Description:

  • It's a bit unexpected why this was an issue, but here's the gist:
    • In a PR a couple of weeks ago I fixed that if an exception occurs inside a with executor block, shutdown doesn't wait forever for all jobs to finish which caused deadlocks (because the exception might occur before the jobs were scheduled and they won't ever finish).
    • What I didn't realize is that voxelytics relies on the fact that if a job is killed, due to a time or memory limit exception or simply fails due to another error, the cluster tools actually wait for all other jobs to finish before leaving the with block (shutdown with wait=True). This is somewhat obscure and we should think about making that more obvious and deliberate, but I restored that behavior for now.
  • I added a regression test and made sure that it failed before and succeeds with this PR

Issues:

  • fixes that all slurm jobs were canceled if a single job failed

Todos:

Make sure to delete unnecessary points or to check all before merging:

  • Updated Changelog
  • Added / Updated Tests

@daniel-wer daniel-wer self-assigned this Aug 2, 2025
@daniel-wer daniel-wer added the bug label Aug 2, 2025
Copy link

github-actions bot commented Aug 2, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
9328 7863 84% 80% 🟢

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 4551be8 by action🐍

Copy link
Member

@valentin-pinkau valentin-pinkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@daniel-wer
Copy link
Member Author

@valentin-pinkau Did this turn out to be a good fix or did you encounter any other issue with this branch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants