
Conversation

ozhanozen
Contributor

Description

This PR introduces support for early stopping in Ray integration through the Stopper class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration’s maximum iterations were reached, even if a trial was clearly underperforming. This wasn’t always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a CartpoleEarlyStopper example in vision_cartpole_cfg.py. This serves as a reference implementation that halts a trial if the out_of_bounds metric does not decrease after a set number of iterations. It is meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.
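For readers unfamiliar with Ray Tune stoppers, here is a minimal sketch of what a patience-based stopper like this could look like. It is written as a plain class implementing the Stopper interface (`__call__` per reported result, plus `stop_all`) so it runs standalone; in practice it would subclass `ray.tune.Stopper`. The metric name `out_of_bounds` and the patience value are illustrative, not the exact values used by CartpoleEarlyStopper.

```python
class PatienceStopper:
    """Stop a trial when a metric fails to improve for `patience` consecutive results.

    Sketch of the Ray Tune Stopper interface; in real code, subclass ray.tune.Stopper.
    """

    def __init__(self, metric: str = "out_of_bounds", patience: int = 10):
        self._metric = metric
        self._patience = patience
        self._best = {}      # trial_id -> best metric value seen so far
        self._stagnant = {}  # trial_id -> consecutive results without improvement

    def __call__(self, trial_id: str, result: dict) -> bool:
        """Return True to stop this trial; called once per reported result."""
        value = result.get(self._metric)
        if value is None:
            return False  # metric not reported yet; keep the trial alive
        best = self._best.get(trial_id)
        if best is None or value < best:  # lower out_of_bounds is better
            self._best[trial_id] = value
            self._stagnant[trial_id] = 0
        else:
            self._stagnant[trial_id] = self._stagnant.get(trial_id, 0) + 1
        return self._stagnant[trial_id] >= self._patience

    def stop_all(self) -> bool:
        """Never stop the whole experiment; only individual trials."""
        return False
```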

Fixes #3270.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

Hi @garylvov, here is the PR as agreed.

I noticed while re-testing that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond to the termination signal in time. When this happens, the next training does not start within process_response_timeout, which halts the whole Ray main process.

One solution idea could be to check whether a subprocess exits within a threshold and kill it if it doesn't, but I do not know how to do this with stoppers, as Ray normally handles this. Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?
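The "wait for a threshold, then kill" idea could look roughly like the sketch below. It uses plain `subprocess` calls (SIGTERM first, SIGKILL on timeout); the grace period is an illustrative assumption, and the helper name is hypothetical, not part of the existing Ray integration code.

```python
import subprocess


def terminate_or_kill(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """Ask `proc` to exit; escalate to SIGKILL if it ignores the signal.

    Returns the process's exit code. `grace_s` is how long we wait
    for a graceful shutdown before force-killing (illustrative value).
    """
    proc.terminate()  # polite SIGTERM first, so the sim can clean up
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the process gets no chance to clean up
        return proc.wait()
```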

@ozhanozen
Contributor Author

@garylvov, a small update:

I was wrong to say it sometimes works. By coincidence, the processes were ending shortly after the early-stop signal anyway, which is why I thought it sometimes worked.

I have debugged it further and can confirm that even after a trial is marked as completed, the subprocess/training continues to the end. The following trial might then fail, e.g., due to a lack of GPU memory.

@garylvov
Collaborator

garylvov commented Aug 28, 2025

Hi, thanks for your further investigation.

> Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?

I think this could work, but I would be a little worried about it being too "kill happy" and erroneously shutting down processes that were only experiencing an ephemeral stall. Perhaps we can just wait a few moments and, if it's still halted, then kill it.

However, I think it may be a suboptimal design to have a new Ray process try to clean up other processes before starting, as opposed to each Ray process doing cleanup on its own process after it finishes.

> I have debugged it further and can confirm that even after a trial is marked as completed, the subprocess/training continues to the end. The following trial might then fail, e.g., due to a lack of GPU memory.

I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;)

I will do some testing of this too. I think you may be onto something with a mechanism tied to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by execute_job.
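The SIGKILL-by-PID part of that cleanup could be as simple as the sketch below. The helper name is hypothetical; the only real APIs used are `os.kill` and `signal.SIGKILL`. It also handles the case where the process is already gone, which matters if Ray sometimes does manage to reap the trial on its own.

```python
import os
import signal


def force_kill_pid(pid: int) -> bool:
    """Send SIGKILL to `pid`; return True if a signal was delivered.

    Returns False when the process no longer exists, so callers can
    tell "we killed it" apart from "it was already gone".
    """
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except ProcessLookupError:
        return False  # process already exited and was reaped
```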

Successfully merging this pull request may close these issues.

[Proposal] Early stopping while doing hyperparameter tuning with Ray integration