Open
Description
Context
- Affected library: cluster-tools
- If an error occurs during Slurm job submission, for example if sbatch rejects the job submission script as invalid, the resulting error is not propagated to the caller and the program hangs.
Exception in thread Thread-323:
Traceback (most recent call last):
  File ".local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 577, in run
    job_id = SlurmExecutor.submit_text(script, self.cfut_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 248, in submit_text
    job_id, stderr = chcall("sbatch --parsable {}".format(filename))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/_utils/call.py", line 47, in chcall
    raise CommandError(command, code, stderr)
cluster_tools._utils.call.CommandError: 'sbatch --parsable <redacted>.sh' exited with status 1: 'sbatch: error: memory limit must be provided for shared jobs\nsbatch: error: Batch job submission failed: Invalid feature specification\n'
^C
- This bug was introduced with the use of job submission threads. The submission threads are never joined and have no error handling or communication channel back to the caller, so an exception raised inside a thread is never propagated.
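The failure mode described here can be illustrated with a small standalone sketch. It does not use cluster-tools at all: `SubmissionError`, `failing_submit`, `submit_in_thread`, and `demo_lost_error` are hypothetical names made up for this example, and the real submission thread in `slurm.py` may be structured differently. The point is only that an exception raised inside a thread ends that thread, while a caller waiting on a future that nobody completes keeps waiting.

```python
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeoutError


class SubmissionError(Exception):
    """Hypothetical stand-in for the CommandError raised when sbatch fails."""


def failing_submit(script: str) -> str:
    # Hypothetical stand-in for an sbatch call that exits with status 1.
    raise SubmissionError("sbatch: error: Batch job submission failed")


def submit_in_thread(future: Future) -> None:
    # The exception raised by failing_submit ends this thread. Nothing sets
    # a result or an exception on the future, so the caller is never told.
    job_id = failing_submit("job.sh")
    future.set_result(job_id)


def demo_lost_error(timeout: float = 0.5) -> str:
    future: Future = Future()
    thread = threading.Thread(target=submit_in_thread, args=(future,), daemon=True)
    thread.start()
    try:
        # Without the timeout this call would block forever.
        future.result(timeout=timeout)
    except FutureTimeoutError:
        return "timed out: the submission error never reached the caller"
    return "completed"


if __name__ == "__main__":
    print(demo_lost_error())
```

Running it prints the thread's traceback to stderr (Python's default behavior for uncaught thread exceptions), much like the log above, but the caller only notices because of the artificial timeout.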
Expected Behavior
- The caller of the Slurm executor should be notified of the submission error through a raised exception.
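One possible way to get this behavior, sketched under assumptions and not taken from the cluster-tools code base, is to catch the exception inside the submission thread and attach it to the future the caller is waiting on. All names below (`SubmissionError`, `failing_submit`, `submit_in_thread`, `submit_and_wait`) are hypothetical; an actual fix in `slurm.py` might instead join the threads or use a different communication channel.

```python
import threading
from concurrent.futures import Future
from typing import Callable


class SubmissionError(Exception):
    """Hypothetical stand-in for the CommandError raised when sbatch fails."""


def failing_submit(script: str) -> str:
    # Hypothetical stand-in for an sbatch call that exits with status 1.
    raise SubmissionError("sbatch: error: Batch job submission failed")


def submit_in_thread(future: Future, submit: Callable[[str], str], script: str) -> None:
    try:
        future.set_result(submit(script))
    except Exception as exc:
        # Hand the error to the waiting caller instead of letting it die
        # with the thread.
        future.set_exception(exc)


def submit_and_wait(submit: Callable[[str], str], script: str, timeout: float = 5.0) -> str:
    future: Future = Future()
    thread = threading.Thread(target=submit_in_thread, args=(future, submit, script))
    thread.start()
    thread.join(timeout)
    # Re-raises the exception stored by the thread, if there is one.
    return future.result(timeout=timeout)


if __name__ == "__main__":
    try:
        submit_and_wait(failing_submit, "job.sh")
    except SubmissionError as exc:
        print(f"caller was notified: {exc}")
```

With this pattern the caller sees the submission error as a normal exception from `future.result()` and can shut down instead of hanging.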
Current Behavior
- No error is raised on the caller side and no further jobs are submitted, so the program hangs indefinitely.
Steps to Reproduce the bug
- Cannot reproduce the bug anymore / needs deeper investigation.
- Provoke an sbatch submission error, for example by selecting the slurm strategy and requesting a time or mem resource that is too large or invalid.
- The caller does not shut down and hangs indefinitely.
Your Environment for bug
- Operating System and version: Linux 5.14.21
- Version of webKnossos-libs (Release or Commit): 0.16.2