Wrap executorlib executors #678

Merged
merged 36 commits into main on Jul 18, 2025

Conversation

liamhuber
Member

This PR exploits the new caching interface provided in executorlib-1.5.0 so that nodes can rely on their running state and lexical path to access previously executed results.

Locally with the SingleNodeExecutor everything is looking good, but that executor doesn't natively support terminating the process that submits the job. I'd like to play around with this using the SlurmClusterExecutor on the cluster before making further changes.

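The caching idea described above can be pictured with a minimal stand-alone sketch. This is a hypothetical helper, not the executorlib or pyiron_workflow API: a node's lexical path names a cache file on disk, so a restarted process finds the previously executed result instead of recomputing it.

```python
import pickle
from pathlib import Path


def run_with_cache(lexical_path, fn, *args, cache_root=Path("cache")):
    """Hypothetical sketch of path-keyed result caching.

    The node's lexical path names its cache file, so a restarted
    process can pick up a previously executed result.
    """
    cache_file = cache_root / (lexical_path.replace("/", "_") + ".pkl")
    if cache_file.exists():
        # Cache hit: the path resolves to an already-saved result.
        return pickle.loads(cache_file.read_bytes())
    result = fn(*args)
    cache_root.mkdir(parents=True, exist_ok=True)
    cache_file.write_bytes(pickle.dumps(result))
    return result
```

On a second invocation with the same lexical path, the function body is never executed; only the pickled result is read back.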
Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>


codecov bot commented Jun 19, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 9 lines in your changes missing coverage. Please review.

Project coverage is 92.05%. Comparing base (4841e34) to head (f579159).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pyiron_workflow/node.py | 70.00% | 6 Missing ⚠️ |
| pyiron_workflow/executors/wrapped_executorlib.py | 96.29% | 1 Missing ⚠️ |
| pyiron_workflow/mixin/run.py | 96.42% | 1 Missing ⚠️ |
| pyiron_workflow/nodes/composite.py | 66.66% | 1 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (88.88%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #678      +/-   ##
==========================================
- Coverage   92.11%   92.05%   -0.07%     
==========================================
  Files          33       34       +1     
  Lines        3665     3725      +60     
==========================================
+ Hits         3376     3429      +53     
- Misses        289      296       +7     


Since we now depend explicitly on a new feature

It causes a weird hang that blocks observability.

And make the expected file independently accessible

And hide it behind its own boolean flag for testing

And make the file name fixed and accessible at the class level

@liamhuber liamhuber dismissed a stale review July 10, 2025 18:16

bot spam

The slurm executor populates this with a submission script, etc.

From @jan-janssen in [this comment](pyiron/executorlib#708 (comment))

Co-authored-by: Jan Janssen
@liamhuber
Member Author

This currently delivers all of the attempted behaviour when run together with pyiron/executorlib#712

The local file executor got directly included in executorlib as a testing tool.

And always with-execute tuples, since there is only ever one instance of this executor. If we have already been assigned an executor _instance_, we trust the user to be managing its state and submit directly rather than wrapping in a with-clause

Recent changes threw off the balance of times in the first vs second run, so instead compare against what you actually care about: that the second run bypasses the sleep call.

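That testing style can be sketched in isolation. Here `slow_double` and its module-level `_cache` are stand-ins for node caching, not pyiron_workflow code; the point is asserting the cache-bypass property itself rather than comparing the two run times to each other.

```python
import time

SLEEP = 0.2
_cache = {}


def slow_double(x):
    """Sleep-backed computation with a result cache (illustrative only)."""
    if x not in _cache:
        time.sleep(SLEEP)  # the expensive part
        _cache[x] = x * 2
    return _cache[x]


slow_double(21)  # first run pays the sleep

start = time.perf_counter()
second = slow_double(21)  # second run should hit the cache
elapsed = time.perf_counter() - start

# Assert what you actually care about: the sleep was bypassed.
assert second == 42 and elapsed < SLEEP
```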
Instead of a with-clause. This way the executor is still permitted to release the thread before the job is done, but we still guarantee that executors created by bespoke instructions get shut down at the end of their one-future lifetime.

There was necessarily only the one future, so don't wait at shutdown. This removes the need for accepting the runtime error and prevents the wrapped executorlib executors from hanging indefinitely.

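Taken together, the last few commits amount to a dispatch like the following sketch (hypothetical helper name, shown with stdlib `concurrent.futures` rather than executorlib): a pre-built executor instance is trusted and submitted to directly, while a `(class, args, kwargs)` tuple builds a one-shot executor whose shutdown is guaranteed by a done-callback with `wait=False`.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


def submit_via_spec(executor_spec, fn, *args, **kwargs) -> Future:
    """Hypothetical dispatch helper for the two executor spec forms."""
    if isinstance(executor_spec, Executor):
        # A pre-built instance: the user is managing its state and
        # lifetime, so submit directly with no wrapping.
        return executor_spec.submit(fn, *args, **kwargs)
    # A (class, args, kwargs) tuple: build a fresh one-shot executor.
    cls, exe_args, exe_kwargs = executor_spec
    exe = cls(*exe_args, **exe_kwargs)
    future = exe.submit(fn, *args, **kwargs)
    # Guarantee shutdown at the end of the one-future lifetime via a
    # callback instead of a with-clause, so the submitting thread is
    # released before the job is done.  Only this one future was ever
    # submitted, so don't wait at shutdown; this avoids hanging.
    future.add_done_callback(lambda _: exe.shutdown(wait=False))
    return future


# Usage with stdlib executors:
fut_tuple = submit_via_spec((ThreadPoolExecutor, (), {"max_workers": 1}), pow, 2, 5)
pool = ThreadPoolExecutor(max_workers=1)
fut_inst = submit_via_spec(pool, pow, 2, 10)
```

Calling `shutdown(wait=False)` from the done-callback is safe even though the callback may run on a worker thread, because it does not join the pool's threads.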
@liamhuber
Member Author

ImportError: cannot import name 'TestClusterExecutor' from 'executorlib.api' (/home/runner/work/pyiron_workflow/pyiron_workflow/cached-miniforge/my-env/lib/python3.13/site-packages/executorlib/api.py). Did you mean: 'FluxClusterExecutor'?

Waiting on executorlib-1.5.3

@liamhuber liamhuber changed the title [WIP] Wrap executorlib executors Wrap executorlib executors Jul 15, 2025
@liamhuber
Member Author

liamhuber commented Jul 15, 2025

Together with pyiron/executorlib#732 this is working very nicely on the cluster now. I can

  • Let the workflow run
  • Interrupt the workflow after the slurm job has started
    • Restart the workflow after the slurm job is complete
    • Restart the workflow while the slurm job is still going

and in all cases everything runs perfectly smoothly. I.e. I can start with this:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor


wf = pwf.Workflow("executor_test")
wf.n1 = pwf.std.UserInput(20)
wf.n2 = pwf.std.Sleep(wf.n1)
wf.n3 = pwf.std.UserInput(wf.n2)

wf.n2.executor = (CacheSlurmClusterExecutor, (), {"resource_dict": {"partition": "s.cmfe"}})

wf.run()

And then either let it run, or restart the kernel and follow up with this after the appropriate delay for the case I'm interested in:

import pyiron_workflow as pwf
from pyiron_workflow.executors.wrapped_executorlib import CacheSlurmClusterExecutor

wf = pwf.Workflow("executor_test")
wf.load(filename=wf.label + "/recovery.pckl")

wf.failed = False
wf.use_cache = False
wf.run()

Outside the scope of this PR but on the TODO list is:

Including the lower bound

And debug the error message

Since that's the way users will typically interact with this field. I also had to change the inheritance order to make sure we were dealing with the user-facing executor and not the task scheduler, but this doesn't impact the submit loop.

So we pass through the Runnable._shutdown_executor_callback process

@liamhuber
Member Author

Codecov complains that Node._clean_wrapped_executorlib_executor_cache is not being tested, but it's just flat out wrong. It is absolutely being invoked in tests.integration.test_wrapped_executorlib.TestWrappedExecutorlib.test_automatic_cleaning.

@liamhuber
Member Author

  • CI tests on Slurm

Actually, I'd be more comfortable with this PR if it included these. I'll still leave exposure in the API for later, but let's take a crack at robust testing right here.

* Test slurm submission
* Don't apply callbacks to cached returns
* Only validate submission-time resources: otherwise we run into trouble where it loads saved executor instructions (that already have what it would use anyhow)
* Mark module
* Test cached result branch
@liamhuber liamhuber marked this pull request as ready for review July 18, 2025 17:54
@liamhuber liamhuber merged commit 8e1acd8 into main Jul 18, 2025
37 of 39 checks passed
@liamhuber liamhuber deleted the executor branch July 18, 2025 18:56