-
Notifications
You must be signed in to change notification settings - Fork 16
Integrates Hostmesh v1 and RDMA #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #378 +/- ##
=======================================
Coverage ? 64.44%
=======================================
Files ? 79
Lines ? 7735
Branches ? 0
=======================================
Hits ? 4985
Misses ? 2750
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but should probably get a Monarch person's eyes on this.
dataset: | ||
procs: 1 | ||
with_gpus: false | ||
mesh_name: dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have to be manually specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's probably a way to do this in a not silly way but alas
from monarch._src.actor.v1.host_mesh import this_proc | ||
from monarch._src.actor.v1.proc_mesh import get_or_spawn_controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this in lieu of the public APIs which have moved to v1 by default.
What this PR does:
Extent
nor HostMesh stuff)MONARCH_HOSTMESH_V1
environment variable, which defaults to False.use_dcp
is now controlled by whether or not RDMA is being used.To run:
This won't work until changes in TorchStore lands. Until then, the normal path works (i.e. defaults are
TORCHSTORE_RDMA_ENABLED=0
andMONARCH_HOSTMESH_V1=0
)Proof of 32B running with this setup: