
Conversation

ozhanozen
Contributor

Description

This PR introduces support for early stopping in Ray integration through the Stopper class. It enables trials to end sooner when they are unlikely to yield useful results, reducing wasted compute time and speeding up experimentation.

Previously, when running hyperparameter tuning with Ray integration, all trials would continue until the training configuration’s maximum iterations were reached, even if a trial was clearly underperforming. This wasn’t always efficient, since poor-performing trials could often be identified early on. With this PR, an optional early stopping mechanism is introduced, allowing Ray to terminate unpromising trials sooner and improve the overall efficiency of hyperparameter tuning.

The PR also includes a CartpoleEarlyStopper example in vision_cartpole_cfg.py. This serves as a reference implementation that halts a trial if the out_of_bounds metric does not decrease after a set number of iterations. It is meant as a usage example: users are encouraged to create their own custom stoppers tailored to their specific use cases.
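For readers unfamiliar with Ray Tune stoppers, here is a minimal sketch of what a patience-based stopper like this could look like. It is written as a plain class implementing the Stopper interface (`__call__` per reported result, plus `stop_all`) so it runs standalone; in practice it would subclass `ray.tune.Stopper`. The metric name `out_of_bounds` and the patience value are illustrative, not the exact values used by CartpoleEarlyStopper.

```python
class PatienceStopper:
    """Stop a trial when a metric fails to improve for `patience` consecutive results.

    Sketch of the Ray Tune Stopper interface; in real code, subclass ray.tune.Stopper.
    """

    def __init__(self, metric: str = "out_of_bounds", patience: int = 10):
        self._metric = metric
        self._patience = patience
        self._best = {}      # trial_id -> best metric value seen so far
        self._stagnant = {}  # trial_id -> consecutive results without improvement

    def __call__(self, trial_id: str, result: dict) -> bool:
        """Return True to stop this trial; called once per reported result."""
        value = result.get(self._metric)
        if value is None:
            return False  # metric not reported yet; keep the trial alive
        best = self._best.get(trial_id)
        if best is None or value < best:  # lower out_of_bounds is better
            self._best[trial_id] = value
            self._stagnant[trial_id] = 0
        else:
            self._stagnant[trial_id] = self._stagnant.get(trial_id, 0) + 1
        return self._stagnant[trial_id] >= self._patience

    def stop_all(self) -> bool:
        """Never stop the whole experiment; only individual trials."""
        return False
```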

Fixes #3270.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@ozhanozen
Contributor Author

Hi @garylvov, here is the PR as agreed.

I noticed while re-testing that it sometimes works and sometimes doesn't. I think the issue is not the mechanism inside the PR, but rather that Isaac Sim does not always respond to the termination signal in time. When this happens, the next training does not start within process_response_timeout, which halts the whole Ray main process.

One solution idea could be to check whether a subprocess exits within a threshold and kill it if it doesn't, but I do not know how to do this with stoppers, as Ray normally handles this. Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?
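The "wait for a threshold, then kill" idea could look roughly like the sketch below. It uses plain `subprocess` calls (SIGTERM first, SIGKILL on timeout); the grace period is an illustrative assumption, and the helper name is hypothetical, not part of the existing Ray integration code.

```python
import subprocess


def terminate_or_kill(proc: subprocess.Popen, grace_s: float = 10.0) -> int:
    """Ask `proc` to exit; escalate to SIGKILL if it ignores the signal.

    Returns the process's exit code. `grace_s` is how long we wait
    for a graceful shutdown before force-killing (illustrative value).
    """
    proc.terminate()  # polite SIGTERM first, so the sim can clean up
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the process gets no chance to clean up
        return proc.wait()
```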

@ozhanozen
Contributor Author

@garylvov, a small update:

I was wrong to say it sometimes works. By coincidence, the processes were ending shortly after the early-stop signal anyway, which is why I thought it sometimes worked.

I have debugged it further and can confirm that even after a trial is marked as completed, the subprocess/training continues to the end. The following trial might then fail, e.g., due to a lack of GPU memory.

@garylvov
Collaborator

garylvov commented Aug 28, 2025

Hi, thanks for your further investigation.

> Alternatively, would it be better to add a mechanism to execute_job such that, if there is a halted subprocess for Isaac Sim, we kill it before starting a new subprocess no matter what?

I think this could work, but I would be a little worried about it being too "kill happy" and erroneously shutting down processes that were only experiencing an ephemeral stall. Perhaps we can just wait a few moments and, if it's still halted, then kill it.

However, I think it may be a suboptimal design to have a new Ray process try to clean up other processes before starting, as opposed to each Ray process doing cleanup on its own process after it finishes.

> I have debugged it further and can confirm that even after a trial is marked as completed, the subprocess/training continues to the end. The following trial might then fail, e.g., due to a lack of GPU memory.

I would assume that Ray could do this well enough out of the box to stop the rogue processes, but I guess that's wishful thinking ;)

I will do some testing of this too. I think you may be onto something with a mechanism tied to the Ray stopper. Maybe we can override some sort of cleanup method to aggressively SIGKILL the PID recovered by execute_job.
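The SIGKILL-by-PID part of that cleanup could be as simple as the sketch below. The helper name is hypothetical; the only real APIs used are `os.kill` and `signal.SIGKILL`. It also handles the case where the process is already gone, which matters if Ray sometimes does manage to reap the trial on its own.

```python
import os
import signal


def force_kill_pid(pid: int) -> bool:
    """Send SIGKILL to `pid`; return True if a signal was delivered.

    Returns False when the process no longer exists, so callers can
    tell "we killed it" apart from "it was already gone".
    """
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except ProcessLookupError:
        return False  # process already exited and was reaped
```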

Successfully merging this pull request may close these issues.

[Proposal] Early stopping while doing hyperparameter tuning with Ray integration