
Conversation

@allenwang28 (Contributor) commented Oct 10, 2025

What this PR does:

  • Updates the Monarch wheel used in CI (the previous wheel had neither Extent nor the HostMesh APIs)
  • Introduces the MONARCH_HOSTMESH_V1 environment variable, which defaults to False.
  • Changes launchers to set the default transport if needed (for HostMesh v1)
  • Adds codepaths for both HostMesh v0 and v1
  • Modifies APIs throughout the codebase to use the correct path. We will not support both v0 and v1 long term, but until all of our use cases are settled, v1 stays hidden behind this environment variable.
  • For all configs, adds the service name. This is nice for allocations (on SLURM everything used to just be called alloc-0), and it is also needed because TorchStore now requires host-colocated procs. For now we get the host by name from the provisioner, which keeps track of the allocated hosts.
  • use_dcp is now controlled by whether or not RDMA is being used (see the sketch after this list).
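
A minimal sketch of how the two flags and the use_dcp rule fit together; the flag names are from this PR, but the helper functions are illustrative rather than the actual forge code:

import os

def hostmesh_v1_enabled() -> bool:
    # MONARCH_HOSTMESH_V1 defaults to off, i.e. the HostMesh v0 path.
    return os.environ.get("MONARCH_HOSTMESH_V1", "0") == "1"

def rdma_enabled() -> bool:
    # TORCHSTORE_RDMA_ENABLED also defaults to off.
    return os.environ.get("TORCHSTORE_RDMA_ENABLED", "0") == "1"

# use_dcp is derived from the RDMA setting: with RDMA available,
# TorchStore can move weights directly and DCP is unnecessary.
use_dcp = not rdma_enabled()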

To run:

TORCHSTORE_RDMA_ENABLED=1 MONARCH_HOSTMESH_V1=1 python -m apps.grpo.main --config=apps/grpo/qwen3_32b_2.yaml 2>&1 | tee logs.log

This won't work until the corresponding changes in TorchStore land. Until then, the normal path works (i.e., the defaults TORCHSTORE_RDMA_ENABLED=0 and MONARCH_HOSTMESH_V1=0).

Proof of 32B running with this setup:

=== [0_p:10_r0] - METRICS STEP 1 ===
  buffer/add/count_episodes_added: 84.0
  buffer/evict/avg_policy_staleness: 0.0
  buffer/evict/max_policy_staleness: 0.0
  buffer/sample/avg_buffer_utilization: 64.0
  buffer/sample/avg_buffer_utilization_pct: 100.0
  buffer/sample/count_sample_requests: 836.0
  buffer_perf/sample/total_duration_avg_s: 5.4186574043037645e-05
  buffer_perf/sample/total_duration_max_s: 0.006779166171327233
  dataset/sample/avg_sample_len: 514.1304347826087
  dataset/sample/count_samples_generated: 46.0
  main/continuous_rollouts/count_rollout_iterations: 42.0
  main_perf/continuous_rollouts/compute_logprobs/duration_avg_s: 4.204719083472377e-05
  main_perf/continuous_rollouts/compute_logprobs/duration_max_s: 0.0002812119200825691
  main_perf/continuous_rollouts/data_loading/duration_avg_s: 0.035738113413875304
  main_perf/continuous_rollouts/data_loading/duration_max_s: 0.3613387381192297
  main_perf/continuous_rollouts/policy_generation/duration_avg_s: 8.360790666724954
  main_perf/continuous_rollouts/policy_generation/duration_max_s: 11.069030554965138
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 1.640121555077799
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 5.62058763904497
  main_perf/continuous_rollouts/reward_evaluation/duration_avg_s: 0.015466002969159967
  main_perf/continuous_rollouts/reward_evaluation/duration_max_s: 0.46290697203949094
  main_perf/continuous_rollouts/total_duration_avg_s: 10.073797453704866
  main_perf/continuous_rollouts/total_duration_max_s: 17.083926575025544
  main_perf/continuous_training/push_weights/duration_avg_s: 9.764719574945047
  main_perf/continuous_training/push_weights/duration_max_s: 9.764719574945047
  main_perf/continuous_training/total_duration_avg_s: 236.4534942531027
  main_perf/continuous_training/total_duration_max_s: 236.4534942531027
  main_perf/continuous_training/train_step/duration_avg_s: 7.173950162017718
  main_perf/continuous_training/train_step/duration_max_s: 7.173950162017718
  main_perf/continuous_training/update_weights/duration_avg_s: 134.6413610370364
  main_perf/continuous_training/update_weights/duration_max_s: 134.6413610370364
  main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 84.87346064811572
  main_perf/continuous_training/waiting_for_buffer/duration_max_s: 84.87346064811572
  policy/generate/avg_tokens_generated: 501.9761904761905
  policy/generate/count_requests: 46.0
  policy/generate/count_sequences_completed: 84.0
  policy/generate/sum_tokens_generated: 42166.0
  policy/update_weights/count_weight_updates: 1.0
  policy_perf/generate/generate/duration_avg_s: 8.1663033156622
  policy_perf/generate/generate/duration_max_s: 8.969009765625
  policy_perf/generate/process_inputs/duration_avg_s: 0.000976897522097542
  policy_perf/generate/process_inputs/duration_max_s: 0.004609471797943116
  policy_perf/generate/prompt_truncation/duration_avg_s: 1.2765714466305717e-05
  policy_perf/generate/prompt_truncation/duration_max_s: 0.00029414400458335874
  policy_perf/generate/total_duration_avg_s: 8.167422056994099
  policy_perf/generate/total_duration_max_s: 8.973469371676446
  policy_perf/update_weights/avg_pending_requests: 3.0
  policy_perf/update_weights/max_pending_requests: 3.0
  policy_worker_perf/update/total_duration_avg_s: 98.41008439671714
  policy_worker_perf/update/total_duration_max_s: 128.51622161176056
  reference_perf/forward/avg_sequence_length: 1024.0
  reference_perf/forward/compute_logprobs/duration_avg_s: 0.0009754613551887728
  reference_perf/forward/compute_logprobs/duration_max_s: 0.03818338899873197
  reference_perf/forward/count_forward_passes: 168.0
  reference_perf/forward/forward/duration_avg_s: 1.2508633632997295
  reference_perf/forward/forward/duration_max_s: 1.6371194100938737
  reference_perf/forward/garbage_collection/duration_avg_s: 0.01460896339821851
  reference_perf/forward/garbage_collection/duration_max_s: 0.021972813876345754
  reference_perf/forward/memory_delta_end_start_avg_gb: 0.5803492409842355
  reference_perf/forward/memory_peak_max_gb: 17.318827152252197
  reference_perf/forward/to_device/duration_avg_s: 9.279798472388869e-05
  reference_perf/forward/to_device/duration_max_s: 0.0002518759574741125
  reference_perf/forward/total_duration_avg_s: 1.266543488083352
  reference_perf/forward/total_duration_max_s: 1.66933400509879
  reward/evaluate_response/avg_MathReward_reward: 0.182142857142857
  reward/evaluate_response/avg_ThinkingReward_reward: 0.38095238095238065
  reward/evaluate_response/avg_total_reward: 0.28154761904761944
  reward/evaluate_response/count_MathReward_calls: 84.0
  reward/evaluate_response/count_ThinkingReward_calls: 84.0
  reward/evaluate_response/std_MathReward_reward: 0.28542395963670225
  reward/evaluate_response/std_ThinkingReward_reward: 0.33469111220582043
  reward/evaluate_response/sum_MathReward_reward: 15.299999999999986
  reward/evaluate_response/sum_ThinkingReward_reward: 31.999999999999975
  rl_trainer/avg_grpo_loss: 0.0013991296291351318
  rl_trainer/count_training_steps: 8.0
  rl_trainer/learning_rate: 0.001
  rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0010600697132758796
  rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.0011422648094594479
  rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
  rl_trainer_perf/push_weights/memory_peak_max_gb: 22.949132919311523
  rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.0012380220578052104
  rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.0012862770818173885
  rl_trainer_perf/push_weights/total_duration_avg_s: 9.38885561923962
  rl_trainer_perf/push_weights/total_duration_max_s: 9.761705880053341
  rl_trainer_perf/push_weights/ts_save/duration_avg_s: 9.38655402616132
  rl_trainer_perf/push_weights/ts_save/duration_max_s: 9.759349544998258
  rl_trainer_perf/step/forward_backward/duration_avg_s: 6.6754657930578105
  rl_trainer_perf/step/forward_backward/duration_max_s: 6.678493158891797
  rl_trainer_perf/step/memory_delta_end_start_avg_gb: 15.31917667388916
  rl_trainer_perf/step/memory_peak_max_gb: 30.577297687530518
  rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.4459393950528465
  rl_trainer_perf/step/optimizer_step/duration_max_s: 0.44871496595442295
  rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.0274293027468957
  rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.027538867201656103
  rl_trainer_perf/step/total_duration_avg_s: 7.148838073073421
  rl_trainer_perf/step/total_duration_max_s: 7.153811537194997
==============================

@meta-cla bot added the CLA Signed label Oct 10, 2025
@codecov-commenter commented

Codecov Report

❌ Patch coverage is 56.75676% with 16 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3303af5). Learn more about missing BASE report.

Files with missing lines                     Patch %   Lines
src/forge/controller/provisioner.py          55.55%    8 Missing ⚠️
src/forge/controller/launcher.py             42.85%    4 Missing ⚠️
src/forge/actors/policy.py                    0.00%    2 Missing ⚠️
src/forge/observability/metric_actors.py     66.66%    2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #378   +/-   ##
=======================================
  Coverage        ?   64.44%           
=======================================
  Files           ?       79           
  Lines           ?     7735           
  Branches        ?        0           
=======================================
  Hits            ?     4985           
  Misses          ?     2750           
  Partials        ?        0           

@allenwang28 marked this pull request as ready for review October 13, 2025 15:07
@allenwang28 changed the title from "[wip] Integrate Hostmesh v1 and RDMA" to "Integrates Hostmesh v1 and RDMA" Oct 13, 2025
@allenwang28 requested a review from LucasLLC October 13, 2025 15:08
@joecummings (Member) left a comment:

LGTM but should probably get a Monarch person's eyes on this.

dataset:
  procs: 1
  with_gpus: false
  mesh_name: dataset
Member:

These have to be manually specified?

@allenwang28 (Contributor, Author):

There's probably a less silly way to do this, but alas.
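
One possible less-silly version, sketched here with hypothetical names (this is not the actual forge config loader): default mesh_name to the service's config key when it isn't given.

def with_default_mesh_names(services: dict) -> dict:
    # Fill in mesh_name from each service's config key when absent.
    for name, cfg in services.items():
        cfg.setdefault("mesh_name", name)
    return services

services = with_default_mesh_names({"dataset": {"procs": 1, "with_gpus": False}})
assert services["dataset"]["mesh_name"] == "dataset"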

@allenwang28 merged commit b77d6ad into meta-pytorch:main Oct 13, 2025
8 checks passed
@allenwang28 deleted the hostmesh_rdma branch October 13, 2025 18:10
Comment on lines +23 to +24
from monarch._src.actor.v1.host_mesh import this_proc
from monarch._src.actor.v1.proc_mesh import get_or_spawn_controller
Contributor:

We should remove this in favor of the public APIs, which have moved to v1 by default.
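
For reference, a sketch of what the suggested replacement might look like, assuming this_proc and get_or_spawn_controller are re-exported from the public monarch.actor namespace once v1 is the default (worth verifying against the installed Monarch version):

from monarch.actor import get_or_spawn_controller, this_proc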
