Add Deepspeed Zero 3 MiCS support #20461
base: master
Conversation
Codecov Report — ❌ Patch coverage. Additional details and impacted files:
@@ Coverage Diff @@
## master #20461 +/- ##
=======================================
- Coverage 87% 87% -0%
=======================================
Files 268 268
Lines 23311 23318 +7
=======================================
- Hits 20307 20304 -3
- Misses 3004 3014 +10
This is great, thanks for the contribution @hehepig4
Let's see how CI goes and then we can proceed.
In the meantime @hehepig4, would you be willing to add a mention of this in the relevant docs section? Ideally we could also make the exact same change in Fabric:
Thanks! I will first work on the docs, then Fabric.
Looks great, thanks for the updates. I see CI is running out of memory; I need to investigate further.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
assert self.config is not None
# If 'mics_shard_size' > 0 is set in config['zero_optimization'], use
# deepspeed.zero.MiCS_Init(...) instead of deepspeed.zero.Init(...)
# https://deepspeed.readthedocs.io/en/latest/zero3.html#mics-configurations
#! the default deepspeed 0.9.0 is not compatible
What is the min version to support this?
Thanks for your review. MiCS was introduced after DeepSpeed 0.9.2, and my test environment uses DeepSpeed 0.16.0.
Does this still need help? I can take a crack at this problem.
What does this PR do?
DeepSpeed 0.9.2 introduced MiCS support, which lets users specify how parameters are partitioned across devices in ZeRO stage 3 (reference).
To activate MiCS, users should add 'mics_shard_size' to the DeepSpeed config and use deepspeed.zero.MiCS_Init instead of deepspeed.zero.Init when initializing models. Issues occur when:
The core change replaces zero.Init in DeepSpeedStrategy.model_sharded_context (line 519 in lightning.pytorch.strategies.deepspeed.py) when 'mics_shard_size' is detected in the DeepSpeed zero_optimization config.
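For illustration, a DeepSpeed config that would trigger the MiCS path might look like the sketch below. The specific values (batch size, shard size of 4) are placeholders, not taken from the PR; the only relevant part is that 'mics_shard_size' > 0 sits inside 'zero_optimization'.

```python
import json

# Illustrative DeepSpeed config enabling MiCS under ZeRO stage 3.
# 'mics_shard_size' controls the size of the parameter shard group;
# a value > 0 is what the strategy detects to switch to MiCS_Init.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {
        "stage": 3,
        "mics_shard_size": 4,  # placeholder shard-group size
    },
}
print(json.dumps(ds_config, indent=2))
```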
Fixes & adds support for #20378
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20461.org.readthedocs.build/en/20461/