
Conversation

@allenwang28 (Contributor) commented Oct 10, 2025

What this PR does:

  • Updates the Monarch wheel used in CI (the previous wheel had neither Extent nor the HostMesh APIs)
  • Introduces the MONARCH_HOSTMESH_V1 environment variable, which defaults to False.
  • Changes launchers to set the default transport if needed (for HostMesh v1)
  • Adds codepaths for both HostMesh v0 and v1
  • Modifies APIs throughout the codebase to use the correct path. We will not support both v0 and v1 long term, but until all of our use cases are settled, v1 stays hidden behind this environment variable.
  • For all configs, adds the service name. This is nice for allocations (on SLURM everything used to just be called alloc-0), and it is also needed because TorchStore now requires host-colocated procs. For now we get the host by name from the provisioner, which keeps track of the allocated hosts.
  • use_dcp is now controlled by whether or not RDMA is being used (see the sketch after this list).
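
A minimal sketch of how the two flags and the use_dcp rule fit together; the flag names are from this PR, but the helper functions are illustrative rather than the actual forge code:

import os

def hostmesh_v1_enabled() -> bool:
    # MONARCH_HOSTMESH_V1 defaults to off, i.e. the HostMesh v0 path.
    return os.environ.get("MONARCH_HOSTMESH_V1", "0") == "1"

def rdma_enabled() -> bool:
    # TORCHSTORE_RDMA_ENABLED also defaults to off.
    return os.environ.get("TORCHSTORE_RDMA_ENABLED", "0") == "1"

# use_dcp is derived from the RDMA setting: with RDMA available,
# TorchStore can move weights directly and DCP is unnecessary.
use_dcp = not rdma_enabled()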

To run:

TORCHSTORE_RDMA_ENABLED=1 MONARCH_HOSTMESH_V1=1 python -m apps.grpo.main --config=apps/grpo/qwen3_32b_2.yaml 2>&1 | tee logs.log

This won't work until the corresponding changes in TorchStore land. Until then, the normal path works (i.e., the defaults TORCHSTORE_RDMA_ENABLED=0 and MONARCH_HOSTMESH_V1=0).

Proof of 32B running with this setup:

=== [0_p:10_r0] - METRICS STEP 1 ===
  buffer/add/count_episodes_added: 84.0
  buffer/evict/avg_policy_staleness: 0.0
  buffer/evict/max_policy_staleness: 0.0
  buffer/sample/avg_buffer_utilization: 64.0
  buffer/sample/avg_buffer_utilization_pct: 100.0
  buffer/sample/count_sample_requests: 836.0
  buffer_perf/sample/total_duration_avg_s: 5.4186574043037645e-05
  buffer_perf/sample/total_duration_max_s: 0.006779166171327233
  dataset/sample/avg_sample_len: 514.1304347826087
  dataset/sample/count_samples_generated: 46.0
  main/continuous_rollouts/count_rollout_iterations: 42.0
  main_perf/continuous_rollouts/compute_logprobs/duration_avg_s: 4.204719083472377e-05
  main_perf/continuous_rollouts/compute_logprobs/duration_max_s: 0.0002812119200825691
  main_perf/continuous_rollouts/data_loading/duration_avg_s: 0.035738113413875304
  main_perf/continuous_rollouts/data_loading/duration_max_s: 0.3613387381192297
  main_perf/continuous_rollouts/policy_generation/duration_avg_s: 8.360790666724954
  main_perf/continuous_rollouts/policy_generation/duration_max_s: 11.069030554965138
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_avg_s: 1.640121555077799
  main_perf/continuous_rollouts/reference_model_calculate_logprobs/duration_max_s: 5.62058763904497
  main_perf/continuous_rollouts/reward_evaluation/duration_avg_s: 0.015466002969159967
  main_perf/continuous_rollouts/reward_evaluation/duration_max_s: 0.46290697203949094
  main_perf/continuous_rollouts/total_duration_avg_s: 10.073797453704866
  main_perf/continuous_rollouts/total_duration_max_s: 17.083926575025544
  main_perf/continuous_training/push_weights/duration_avg_s: 9.764719574945047
  main_perf/continuous_training/push_weights/duration_max_s: 9.764719574945047
  main_perf/continuous_training/total_duration_avg_s: 236.4534942531027
  main_perf/continuous_training/total_duration_max_s: 236.4534942531027
  main_perf/continuous_training/train_step/duration_avg_s: 7.173950162017718
  main_perf/continuous_training/train_step/duration_max_s: 7.173950162017718
  main_perf/continuous_training/update_weights/duration_avg_s: 134.6413610370364
  main_perf/continuous_training/update_weights/duration_max_s: 134.6413610370364
  main_perf/continuous_training/waiting_for_buffer/duration_avg_s: 84.87346064811572
  main_perf/continuous_training/waiting_for_buffer/duration_max_s: 84.87346064811572
  policy/generate/avg_tokens_generated: 501.9761904761905
  policy/generate/count_requests: 46.0
  policy/generate/count_sequences_completed: 84.0
  policy/generate/sum_tokens_generated: 42166.0
  policy/update_weights/count_weight_updates: 1.0
  policy_perf/generate/generate/duration_avg_s: 8.1663033156622
  policy_perf/generate/generate/duration_max_s: 8.969009765625
  policy_perf/generate/process_inputs/duration_avg_s: 0.000976897522097542
  policy_perf/generate/process_inputs/duration_max_s: 0.004609471797943116
  policy_perf/generate/prompt_truncation/duration_avg_s: 1.2765714466305717e-05
  policy_perf/generate/prompt_truncation/duration_max_s: 0.00029414400458335874
  policy_perf/generate/total_duration_avg_s: 8.167422056994099
  policy_perf/generate/total_duration_max_s: 8.973469371676446
  policy_perf/update_weights/avg_pending_requests: 3.0
  policy_perf/update_weights/max_pending_requests: 3.0
  policy_worker_perf/update/total_duration_avg_s: 98.41008439671714
  policy_worker_perf/update/total_duration_max_s: 128.51622161176056
  reference_perf/forward/avg_sequence_length: 1024.0
  reference_perf/forward/compute_logprobs/duration_avg_s: 0.0009754613551887728
  reference_perf/forward/compute_logprobs/duration_max_s: 0.03818338899873197
  reference_perf/forward/count_forward_passes: 168.0
  reference_perf/forward/forward/duration_avg_s: 1.2508633632997295
  reference_perf/forward/forward/duration_max_s: 1.6371194100938737
  reference_perf/forward/garbage_collection/duration_avg_s: 0.01460896339821851
  reference_perf/forward/garbage_collection/duration_max_s: 0.021972813876345754
  reference_perf/forward/memory_delta_end_start_avg_gb: 0.5803492409842355
  reference_perf/forward/memory_peak_max_gb: 17.318827152252197
  reference_perf/forward/to_device/duration_avg_s: 9.279798472388869e-05
  reference_perf/forward/to_device/duration_max_s: 0.0002518759574741125
  reference_perf/forward/total_duration_avg_s: 1.266543488083352
  reference_perf/forward/total_duration_max_s: 1.66933400509879
  reward/evaluate_response/avg_MathReward_reward: 0.182142857142857
  reward/evaluate_response/avg_ThinkingReward_reward: 0.38095238095238065
  reward/evaluate_response/avg_total_reward: 0.28154761904761944
  reward/evaluate_response/count_MathReward_calls: 84.0
  reward/evaluate_response/count_ThinkingReward_calls: 84.0
  reward/evaluate_response/std_MathReward_reward: 0.28542395963670225
  reward/evaluate_response/std_ThinkingReward_reward: 0.33469111220582043
  reward/evaluate_response/sum_MathReward_reward: 15.299999999999986
  reward/evaluate_response/sum_ThinkingReward_reward: 31.999999999999975
  rl_trainer/avg_grpo_loss: 0.0013991296291351318
  rl_trainer/count_training_steps: 8.0
  rl_trainer/learning_rate: 0.001
  rl_trainer_perf/push_weights/flatten_state_dict/duration_avg_s: 0.0010600697132758796
  rl_trainer_perf/push_weights/flatten_state_dict/duration_max_s: 0.0011422648094594479
  rl_trainer_perf/push_weights/memory_delta_end_start_avg_gb: 0.0
  rl_trainer_perf/push_weights/memory_peak_max_gb: 22.949132919311523
  rl_trainer_perf/push_weights/to_hf/duration_avg_s: 0.0012380220578052104
  rl_trainer_perf/push_weights/to_hf/duration_max_s: 0.0012862770818173885
  rl_trainer_perf/push_weights/total_duration_avg_s: 9.38885561923962
  rl_trainer_perf/push_weights/total_duration_max_s: 9.761705880053341
  rl_trainer_perf/push_weights/ts_save/duration_avg_s: 9.38655402616132
  rl_trainer_perf/push_weights/ts_save/duration_max_s: 9.759349544998258
  rl_trainer_perf/step/forward_backward/duration_avg_s: 6.6754657930578105
  rl_trainer_perf/step/forward_backward/duration_max_s: 6.678493158891797
  rl_trainer_perf/step/memory_delta_end_start_avg_gb: 15.31917667388916
  rl_trainer_perf/step/memory_peak_max_gb: 30.577297687530518
  rl_trainer_perf/step/optimizer_step/duration_avg_s: 0.4459393950528465
  rl_trainer_perf/step/optimizer_step/duration_max_s: 0.44871496595442295
  rl_trainer_perf/step/save_checkpoint/duration_avg_s: 0.0274293027468957
  rl_trainer_perf/step/save_checkpoint/duration_max_s: 0.027538867201656103
  rl_trainer_perf/step/total_duration_avg_s: 7.148838073073421
  rl_trainer_perf/step/total_duration_max_s: 7.153811537194997
==============================

@meta-cla bot added the CLA Signed label Oct 10, 2025
@codecov-commenter commented

Codecov Report

❌ Patch coverage is 56.75676% with 16 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@3303af5). Learn more about missing BASE report.

Files with missing lines                     Patch %   Lines
src/forge/controller/provisioner.py          55.55%    8 Missing ⚠️
src/forge/controller/launcher.py             42.85%    4 Missing ⚠️
src/forge/actors/policy.py                    0.00%    2 Missing ⚠️
src/forge/observability/metric_actors.py     66.66%    2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #378   +/-   ##
=======================================
  Coverage        ?   64.44%           
=======================================
  Files           ?       79           
  Lines           ?     7735           
  Branches        ?        0           
=======================================
  Hits            ?     4985           
  Misses          ?     2750           
  Partials        ?        0           

@allenwang28 marked this pull request as ready for review October 13, 2025 15:07
@allenwang28 changed the title from "[wip] Integrate Hostmesh v1 and RDMA" to "Integrates Hostmesh v1 and RDMA" Oct 13, 2025
@allenwang28 requested a review from LucasLLC October 13, 2025 15:08
@joecummings (Member) left a comment:

LGTM but should probably get a Monarch person's eyes on this.

dataset:
  procs: 1
  with_gpus: false
  mesh_name: dataset
Member:

These have to be manually specified?

@allenwang28 (Contributor, Author):

There's probably a less silly way to do this, but alas.
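
One possible less-silly version, sketched here with hypothetical names (this is not the actual forge config loader): default mesh_name to the service's config key when it isn't given.

def with_default_mesh_names(services: dict) -> dict:
    # Fill in mesh_name from each service's config key when absent.
    for name, cfg in services.items():
        cfg.setdefault("mesh_name", name)
    return services

services = with_default_mesh_names({"dataset": {"procs": 1, "with_gpus": False}})
assert services["dataset"]["mesh_name"] == "dataset"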

@allenwang28 merged commit b77d6ad into meta-pytorch:main Oct 13, 2025
8 checks passed
@allenwang28 deleted the hostmesh_rdma branch October 13, 2025 18:10
Comment on lines +23 to +24
from monarch._src.actor.v1.host_mesh import this_proc
from monarch._src.actor.v1.proc_mesh import get_or_spawn_controller
Contributor:

We should remove this in favor of the public APIs, which have moved to v1 by default.
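
For reference, a sketch of what the suggested replacement might look like, assuming this_proc and get_or_spawn_controller are re-exported from the public monarch.actor namespace once v1 is the default (worth verifying against the installed Monarch version):

from monarch.actor import get_or_spawn_controller, this_proc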
