Errors during slurm job submission are not propagated #1237

@daniel-wer

Context

  • Affected library: cluster-tools

  • If an error occurs during slurm job submission (for example, sbatch rejects the job submission script as invalid), the error is not propagated to the caller, and the program hangs.

Exception in thread Thread-323:
Traceback (most recent call last):
  File ".local/share/uv/python/cpython-3.11.10-linux-x86_64-gnu/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 577, in run
    job_id = SlurmExecutor.submit_text(script, self.cfut_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/schedulers/slurm.py", line 248, in submit_text
    job_id, stderr = chcall("sbatch --parsable {}".format(filename))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/cluster_tools/_utils/call.py", line 47, in chcall
    raise CommandError(command, code, stderr)
cluster_tools._utils.call.CommandError: 'sbatch --parsable <redacted>.sh' exited with status 1: 'sbatch: error: memory limit must be provided for shared jobs\nsbatch: error: Batch job submission failed: Invalid feature specification\n'
^C
  • This bug was introduced with the use of job submission threads. Because the submission threads are never joined and there is no dedicated error handling or communication back to the caller, exceptions raised inside them are not propagated.
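
The mechanism described above can be illustrated with the standard library alone. This is only a minimal sketch, not the cluster-tools code: SubmissionError, failing_submit and future_stays_pending are hypothetical names standing in for CommandError, the sbatch call and the caller. It shows that an exception raised inside a thread is only printed by the threading machinery, that join() does not re-raise it, and that a future waiting on that thread stays pending.

```python
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeout


class SubmissionError(Exception):
    """Hypothetical stand-in for the CommandError seen in the traceback."""


def failing_submit(script_path: str) -> str:
    # Hypothetical stand-in for the sbatch call; it always fails.
    raise SubmissionError(f"submission of {script_path} failed")


def future_stays_pending() -> bool:
    """Return True if the caller never learns about the failed submission."""
    future: Future = Future()

    def run() -> None:
        job_id = failing_submit("job.sh")  # raises inside the thread
        future.set_result(job_id)  # never reached

    thread = threading.Thread(target=run)
    thread.start()
    thread.join()  # returns normally; the exception is not re-raised here

    try:
        # Without the timeout this call would block forever.
        future.result(timeout=0.1)
    except FutureTimeout:
        return True
    return False
```

Calling future_stays_pending() prints a traceback from the thread to stderr, much like the one above, and the future is still unresolved afterwards.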

Expected Behavior

  • The caller of the slurm executor should be notified of the submission error through a raised exception.
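
One possible way to achieve this, sketched under assumptions and not necessarily how cluster-tools should or does implement it, is to catch the exception inside the submission thread and attach it to the futures the caller is waiting on. SubmitThread, SubmissionError and failing_submit are hypothetical names introduced for illustration; only threading.Thread and concurrent.futures.Future are real standard-library APIs.

```python
import threading
from concurrent.futures import Future
from typing import Callable, List


class SubmissionError(Exception):
    """Hypothetical stand-in for the scheduler's submission error."""


def failing_submit(script_path: str) -> str:
    # Hypothetical stand-in for a failing sbatch call.
    raise SubmissionError(f"submission of {script_path} failed")


class SubmitThread(threading.Thread):
    """Hypothetical submission thread that forwards errors to its futures."""

    def __init__(
        self,
        submit: Callable[[str], str],
        script_path: str,
        futures: List[Future],
    ) -> None:
        super().__init__(daemon=True)
        self._submit = submit
        self._script_path = script_path
        self._futures = futures

    def run(self) -> None:
        try:
            job_id = self._submit(self._script_path)
        except Exception as exc:
            # Forward the failure so that future.result() raises in the caller.
            for future in self._futures:
                future.set_exception(exc)
            return
        # In this sketch the job id simply becomes the result of each future.
        for future in self._futures:
            future.set_result(job_id)
```

With this pattern, a caller that starts SubmitThread(failing_submit, "job.sh", [future]) and then calls future.result() gets the SubmissionError raised on its own side instead of waiting indefinitely.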

Current Behavior

  • No error is raised on the caller side and no further jobs are submitted, so the program hangs indefinitely.

Steps to Reproduce the bug

  • Cannot reproduce the bug anymore / needs deeper investigation.
  1. Provoke an sbatch submission error, for example by selecting the slurm strategy and specifying a time or mem resource that is too large or invalid.
  2. The caller does not shut down and hangs indefinitely.

Your Environment for bug

  • Operating System and version: Linux 5.14.21
  • Version of webKnossos-libs (Release or Commit): 0.16.2
