
Difference between shutdown(wait=True) and shutdown(wait=False) #721


Merged
merged 3 commits into main from shutdown_case on Jul 16, 2025

Conversation

jan-janssen
Member

@jan-janssen jan-janssen commented Jul 13, 2025

When shutdown(wait=True) is called - the default - the executor waits until all future objects have completed. In contrast, when shutdown(wait=False) is called, the future objects are cancelled on the queuing system. In addition, when the Jupyter notebook is interrupted, the submitted futures are not deleted from the queuing system.
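
For illustration, a minimal sketch of the two call patterns; the executor class, cache directory, and submitted function are placeholders borrowed from the test snippets further down, not part of this PR:

from executorlib import SlurmClusterExecutor

def foo(x):
    return x

executor = SlurmClusterExecutor(cache_directory="cache_sce")
future = executor.submit(foo, 1)

executor.shutdown(wait=True)  # default: block until all futures have completed
# executor.shutdown(wait=False)  # instead: cancel the futures on the queuing system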

jan-janssen and others added 2 commits July 13, 2025 12:04
When `shutdown(wait=True)` is called - the default - then the executor waits until all future objects have completed. In contrast when `shutdown(wait=False)` is called the future objects are cancelled on the queuing system.
Contributor

coderabbitai bot commented Jul 13, 2025

Important

Review skipped

Draft detected.


Walkthrough

This change removes the terminate_tasks_on_shutdown parameter and all related validation logic from several executors, input-checking utilities, and the file-based task scheduler. Shutdown behavior during task execution now depends only on a "wait" flag, and the tests for the removed parameter are deleted.

Changes

Files/Paths and change summary:

  • executorlib/executor/flux.py, executorlib/executor/slurm.py: Removed the terminate_tasks_on_shutdown parameter from the executors and the related validation logic.
  • executorlib/standalone/inputcheck.py: Deleted the check_terminate_tasks_on_shutdown function.
  • executorlib/task_scheduler/file/task_scheduler.py: Removed terminate_tasks_on_shutdown from create_file_executor and simplified the shutdown logic.
  • executorlib/task_scheduler/file/shared.py: Modified the shutdown logic in execute_tasks_h5 to use a "wait" flag instead of the removed parameter.
  • tests/test_standalone_inputcheck.py: Deleted the test for check_terminate_tasks_on_shutdown and removed its import.
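
For orientation, a hedged sketch of what the simplified branch in execute_tasks_h5 could look like; the task dict with "shutdown" and "wait" keys follows the walkthrough and the review comment below, while the helper names are assumptions, not executorlib internals:

import queue

def wait_for_running_tasks():
    """Assumed helper: block until all running futures have completed."""

def cancel_remaining_jobs():
    """Assumed helper: cancel the remaining jobs on the queuing system."""

future_queue = queue.Queue()
future_queue.put({"shutdown": True, "wait": True})

task_dict = future_queue.get()
if task_dict.get("shutdown"):
    if task_dict["wait"]:
        wait_for_running_tasks()
    else:
        cancel_remaining_jobs()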

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Executor
    participant TaskScheduler

    User->>Executor: Initialize (no terminate_tasks_on_shutdown)
    User->>TaskScheduler: Submit shutdown with {"wait": True/False}
    TaskScheduler-->>TaskScheduler: If "wait"==True, wait for tasks to finish
    TaskScheduler->>TaskScheduler: Terminate tasks (if applicable)
    TaskScheduler->>User: Confirm shutdown

Poem

A rabbit hopped through fields of code,
Where shutdown flags once proudly strode.
With a twitch of nose, and gentle sweep,
Those old parameters fell asleep.
Now tasks await, or not, on cue—
The garden's logic, fresh and new!
🐇✨


@jan-janssen
Member Author

Correction to the changes in #715


codecov bot commented Jul 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.90%. Comparing base (15f648f) to head (86514c8).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #721      +/-   ##
==========================================
- Coverage   96.91%   96.90%   -0.02%     
==========================================
  Files          30       30              
  Lines        1363     1357       -6     
==========================================
- Hits         1321     1315       -6     
  Misses         42       42              


@jan-janssen jan-janssen requested a review from liamhuber July 13, 2025 10:07
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48be119 and ccf6064.

📒 Files selected for processing (6)
  • executorlib/executor/flux.py (0 hunks)
  • executorlib/executor/slurm.py (0 hunks)
  • executorlib/standalone/inputcheck.py (0 hunks)
  • executorlib/task_scheduler/file/shared.py (1 hunks)
  • executorlib/task_scheduler/file/task_scheduler.py (1 hunks)
  • tests/test_standalone_inputcheck.py (0 hunks)
💤 Files with no reviewable changes (4)
  • tests/test_standalone_inputcheck.py
  • executorlib/standalone/inputcheck.py
  • executorlib/executor/slurm.py
  • executorlib/executor/flux.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: unittest_flux_openmpi
  • GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-mpich.yml)
  • GitHub Check: benchmark (ubuntu-latest, 3.13, .ci_support/environment-openmpi.yml)
  • GitHub Check: unittest_mpich (ubuntu-latest, 3.12)
  • GitHub Check: unittest_mpich (ubuntu-latest, 3.11)
  • GitHub Check: unittest_mpich (ubuntu-latest, 3.13)
  • GitHub Check: unittest_mpich (macos-latest, 3.13)
  • GitHub Check: unittest_old
  • GitHub Check: unittest_flux_mpich
  • GitHub Check: unittest_openmpi (ubuntu-latest, 3.12)
  • GitHub Check: unittest_openmpi (ubuntu-latest, 3.11)
  • GitHub Check: notebooks
  • GitHub Check: notebooks_integration
  • GitHub Check: unittest_openmpi (macos-latest, 3.13)
  • GitHub Check: unittest_openmpi (ubuntu-latest, 3.13)
  • GitHub Check: unittest_win
  • GitHub Check: minimal
🔇 Additional comments (1)
executorlib/task_scheduler/file/shared.py (1)

90-100: LGTM – no KeyError risk for "wait"
Verification shows every shutdown task dict (in tests and all queue-put calls across base.py, blockallocation.py, onetoone.py, communication.py, etc.) always includes a "wait" key. The conditional logic in shared.py is safe as written.

Comment on lines +115 to +118
     if execute_function != execute_in_subprocess:
         terminate_function = terminate_with_pysqa  # type: ignore
-    elif terminate_tasks_on_shutdown and execute_function == execute_in_subprocess:
-        terminate_function = terminate_subprocess  # type: ignore
     else:
-        terminate_function = None  # type: ignore
+        terminate_function = terminate_subprocess  # type: ignore
Contributor

💡 Verification agent

🧩 Analysis chain

Good simplification! Note the breaking change.

The simplified logic for determining terminate_function is cleaner and more maintainable. The function selection now depends solely on the execute_function type rather than multiple conditional flags.

This removal of the terminate_tasks_on_shutdown parameter constitutes a breaking change for any external code that was using this parameter.


🏁 Script executed:

#!/bin/bash
# Search for any remaining references to terminate_tasks_on_shutdown parameter
rg -i "terminate_tasks_on_shutdown"

# Check for any imports or function calls that might still expect this parameter
ast-grep --pattern 'create_file_executor($$$, terminate_tasks_on_shutdown=$_, $$$)'

Length of output: 521


🏁 Script executed:

#!/bin/bash
# Locate create_file_executor definition
rg -n "def create_file_executor" -n executorlib/task_scheduler/file/task_scheduler.py

# Check if terminate_tasks_on_shutdown is still referenced in this file
rg -n "terminate_tasks_on_shutdown" -n executorlib/task_scheduler/file/task_scheduler.py

# Print lines around the function signature for context
sed -n '100,140p' executorlib/task_scheduler/file/task_scheduler.py

Length of output: 1549


Breaking Change: Remove terminate_tasks_on_shutdown References

The new create_file_executor no longer accepts or propagates a terminate_tasks_on_shutdown parameter, but several executors and tests still reference it. Please update or clean up these callers to match the simplified API:

• executorlib/executor/slurm.py

  • Remove the terminate_tasks_on_shutdown: bool = True parameter from the SlurmExecutor signature and any internal references.

• executorlib/executor/flux.py

  • Remove the terminate_tasks_on_shutdown: bool = True parameter from the FluxExecutor signature and any internal references.

• tests/test_fluxclusterexecutor.py

  • Drop the terminate_tasks_on_shutdown= arguments passed when instantiating executors.

Fix these to align with the new create_file_executor signature or reintroduce a no-op parameter if backward compatibility is required.
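
Should backward compatibility be preferred, a hedged sketch of the no-op option mentioned above; the shim name, the warning, and the delegation are assumptions, not code from this PR:

import warnings

def create_file_executor_compat(terminate_tasks_on_shutdown=None, **kwargs):
    """Hypothetical shim: accept the removed argument, warn, and ignore it."""
    if terminate_tasks_on_shutdown is not None:
        warnings.warn(
            "terminate_tasks_on_shutdown is deprecated and has no effect",
            DeprecationWarning,
            stacklevel=2,
        )
    # delegate to the real factory with the remaining keyword arguments
    # return create_file_executor(**kwargs)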

🤖 Prompt for AI Agents
In executorlib/task_scheduler/file/task_scheduler.py around lines 115 to 118,
the code and related executors still reference the obsolete
terminate_tasks_on_shutdown parameter which is no longer accepted by
create_file_executor. Remove the terminate_tasks_on_shutdown parameter from the
SlurmExecutor and FluxExecutor class signatures and their internal usage in
executorlib/executor/slurm.py and executorlib/executor/flux.py. Also, update
tests/test_fluxclusterexecutor.py to eliminate any terminate_tasks_on_shutdown
arguments when creating executor instances to align with the simplified API.

@jan-janssen
Member Author

@liamhuber Once this is merged I am happy to release a new version, which hopefully allows you to complete the integration in pyiron_workflow.

@liamhuber
Member

Super! What is here looks good to me, but it only succeeds in two of the three cases: it fails at killing the process and then restarting the same job, where it submits a second slurm job instead of reconnecting.

It looks to me like we need more logic in task_scheduler.file.shared around line 135 for execute_tasks_h5. The problem is clear even to me: there is an "if output h5 file does not exist" clause which then simply clears any existing input h5 file and runs everything again. I think what needs to happen is (sketched after this list):

  • Check also for an input file, if it's not there proceed with existing code as usual
  • Look for the cache identifier (directory+key?) to be identified in the slurm (or flux?) queue -- if it's not there, raise an error
  • Follow the usual registration process for the queue to wait to unpack the result serialized to the output h5 file
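
A hedged sketch of that recovery logic; the helpers, the error message, and the input-file suffix (the "_i.h5" naming appears in a later comment) are assumptions, not executorlib internals:

import os

def job_in_queue(cache_entry):
    """Hypothetical: look up the cache identifier (directory + key) on the
    slurm (or flux) queue and report whether the job is still there."""
    raise NotImplementedError

def register_and_wait(cache_entry):
    """Hypothetical: follow the usual registration process and wait for the
    result serialized to the output h5 file."""
    raise NotImplementedError

def recover_or_resubmit(cache_entry):
    input_file = cache_entry + "_i.h5"
    if not os.path.exists(input_file):
        pass  # no input file: proceed with the existing code as usual
    elif not job_in_queue(cache_entry):
        raise RuntimeError("input written but job not found on the queue")
    else:
        register_and_wait(cache_entry)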

The nuts and bolts are not familiar enough to me to try this myself yet though.

How I tested:

import time

from executorlib import SlurmClusterExecutor

def foo(x):
    time.sleep(x)
    return x

with SlurmClusterExecutor(
    cache_directory="cache_sce", 
    resource_dict={
        "partition": "s.cmfe", 
        "cache_key": "foo",
    }
) as executor:
    future = executor.submit(foo, 30)
    print(future)
    
print(future.result())
print(future, "\n")
  • Python always running
    • Still works fine when I simply let it run
  • Restarting and recovering a finished job
    • When I start the run,
    • see it on the slurm queue,
    • kill the notebook,
    • wait for it to finish on the slurm queue (it persisted there fine),
    • and then re-run the notebook:
    • It runs fine, not blazing fast but way, way faster than my 30s sleep
  • Restarting and recovering a job still in the queue
    • When I start the run,
    • see it on the slurm queue,
    • kill the notebook,
    • restart the notebook before the slurm job is finished,
    • ...it submits a second job to the slurm queue

@jan-janssen
Member Author

@liamhuber Good catch.

To me there are again two cases:

  • somebody submits a number of calculations and then realizes that the resources defined are not sufficient. So in that case the user might want to delete all the jobs and submit them again.
  • the alternative case is the user is happy with the setup of their calculation and they just want to check if they are already done or not.

For the first case it might be helpful to have a function to delete all the jobs for a given project. The necessary information can be gathered from the executorlib.get_cache_data() function https://executorlib.readthedocs.io/en/latest/1-single-node.html#cache but a separate delete function would still need to be implemented.
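
A hedged sketch of what such a delete helper could look like; get_cache_data() is real (see the link above), but the "queue_id" field name and the queue-deletion call are assumptions:

from executorlib import get_cache_data

def delete_job_from_queue(queue_id):
    """Hypothetical: cancel the job with this id on the queuing system."""
    raise NotImplementedError

def delete_project_jobs(cache_directory):
    # Gather the cached job metadata and delete the corresponding queue jobs.
    for entry in get_cache_data(cache_directory=cache_directory):
        queue_id = entry.get("queue_id")  # assumed field name
        if queue_id is not None:
            delete_job_from_queue(queue_id)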

@jan-janssen jan-janssen marked this pull request as draft July 14, 2025 19:14
@liamhuber
Member

@jan-janssen, thanks for the fixes! Working on #732, my test example above works for all three of my cases (stopping running jobs is also of interest to me, but it's not urgent and I agree with your comment that this can be done in a different PR).

In addition to the above, I can handle all three cases (continuous, restart and wait until finished, and restart and re-submit while running) also with an explicitly instantiated and explicitly shut-down executor:

import time

from executorlib import SlurmClusterExecutor

def foo(x):
    time.sleep(x)
    return x

executor = SlurmClusterExecutor(
    cache_directory="cache_sce", 
    resource_dict={
        "partition": "s.cmfe", 
        "cache_key": "foo",
    }
)
future = executor.submit(foo, 10)
print(future)
executor.shutdown()
print(future.result())
print(future)

And using my integration with pyiron_workflow:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor


wf = pwf.Workflow("with_clfe")
wf.n1 = pwf.std.UserInput(30)
wf.n2 = pwf.std.Sleep(wf.n1)
wf.n3 = pwf.std.UserInput(wf.n2)

wf.use_cache = False
wf.n2.use_cache = False

with CacheSlurmClusterExecutor(resource_dict={"partition": "s.cmfe"}) as exe:
    wf.n2.executor = exe
    wf.run()
wf.outputs.to_value_dict()

Interestingly, what stops working for me as of this PR is late and direct instantiation of the executor, which I handle in pyiron_workflow like this:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor


wf = pwf.Workflow("with_clfe")
wf.n1 = pwf.std.UserInput(30)
wf.n2 = pwf.std.Sleep(wf.n1)
wf.n3 = pwf.std.UserInput(wf.n2)

wf.use_cache = False
wf.n2.use_cache = False

wf.n2.executor = (CacheSlurmClusterExecutor, (), {"resource_dict": {"partition": "s.cmfe"}})
wf.run()

wf.outputs.to_value_dict()

On the one hand, I believe it's very likely that I need to go in and be more careful in how the lifetime of the executor is managed in this case. On the other hand, the way it fails is very curious to me, especially given the simplicity of the diff in this PR. What happens is that the _i.h5 file and run_queue.sh content gets written, but no slurm job is ever submitted. Is it possible something in my usage represents a meaningful edge case for executorlib? Again, on main the final example is working fine (at least in the case of keeping the notebook running the whole time), so the culprit is in this limited diff...

@liamhuber
Member

Interestingly, what stops working for me as of this PR is late and direct instantiation of the executor

This was resolved by being more responsible on my end: I delay the instantiation until later in my process, and, since such executors are guaranteed to be used in a one-shot fashion by my code, I add a shutdown call as a callback on the one-and-only future such executors produce. All my tests pass locally and all three cases (run, shutdown and wait, shutdown and reconnect while running) are working beautifully with SLURM on the cluster.
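
For illustration, a minimal sketch of that one-shot pattern, reusing the foo and executor setup from the earlier snippets; the wait=False choice inside the callback is an assumption:

executor = SlurmClusterExecutor(cache_directory="cache_sce")
future = executor.submit(foo, 10)
# One-shot: tear the executor down once its single future has completed.
future.add_done_callback(lambda f: executor.shutdown(wait=False))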

The ability to query and kill running jobs will be useful for me going forward, but once this and #732 are merged down and released, I have everything I need for a beta release of executorlib-pyiron_workflow integration 🚀

@jan-janssen
Member Author

Great, then I'll have another careful look tomorrow and get a new release ready.

@jan-janssen jan-janssen marked this pull request as ready for review July 16, 2025 14:09
@jan-janssen jan-janssen merged commit ba2c702 into main Jul 16, 2025
31 checks passed
@jan-janssen jan-janssen deleted the shutdown_case branch July 16, 2025 14:09