@tushar00jain
Contributor

@tushar00jain tushar00jain commented Nov 4, 2025

Summary:

  • we need to pass the global rank information to pytorch so that the pg name can include the pg information
  • this is necessary to differentiate the default pg's on different replicas
  • these need to be different because flight recorder matches collectives based on pg name as well
  • add ft training to the experiments folder; we'll move the remaining pieces of ft there gradually, but new features will only be available through this folder
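
To illustrate the naming problem described above, here is a toy sketch, not the PyTorch or flight recorder implementation. `pg_name`, `match_collectives`, the `_replica{N}` suffix, and the contiguous rank layout are all hypothetical names and assumptions made for the example. It only shows why records keyed on a shared pg name get merged across replicas, and how a rank-derived name keeps them apart.

```python
from collections import defaultdict


def pg_name(base: str, global_rank: int, replica_size: int) -> str:
    # Hypothetical scheme: derive a replica index from the global rank,
    # assuming ranks are laid out contiguously per replica.
    replica_id = global_rank // replica_size
    return f"{base}_replica{replica_id}"


def match_collectives(records):
    # Simplified stand-in for trace matching: bucket records by
    # (pg name, sequence id) and collect the ranks in each bucket.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["pg_name"], rec["seq_id"])].append(rec["global_rank"])
    return dict(groups)


# Two replicas of two ranks each, every rank issuing its first collective.
shared = [{"pg_name": "default", "seq_id": 0, "global_rank": r} for r in range(4)]
named = [
    {"pg_name": pg_name("default", r, replica_size=2), "seq_id": 0, "global_rank": r}
    for r in range(4)
]

shared_groups = match_collectives(shared)  # one bucket mixing both replicas
named_groups = match_collectives(named)    # one bucket per replica
```

With the shared name, all four ranks fall into a single bucket even though each replica's default pg only spans two ranks; with the rank-derived name, each replica gets its own bucket.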

Stack created with Sapling. Best reviewed with ReviewStack.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 4, 2025
@tushar00jain tushar00jain force-pushed the pr1986 branch 3 times, most recently from 2d5ff16 to 135d24c Compare November 6, 2025 18:49
@tushar00jain tushar00jain requested review from d4l3k and fduwjj November 6, 2025 18:50
Member

@d4l3k d4l3k left a comment

LGTM

@tushar00jain tushar00jain marked this pull request as ready for review November 6, 2025 19:25
Contributor

@fduwjj fduwjj left a comment

LGTM but I guess you will need an approval from @tianyu-l @fegin or @wwwjn for this.

@tushar00jain tushar00jain force-pushed the pr1986 branch 2 times, most recently from e0eca3d to 6b5517c Compare November 10, 2025 15:53
Contributor

@fegin fegin left a comment

Please fix the linter before landing. Also, can you confirm that the ROCm test failure is not related? I'll re-stamp after the linter fix.

@tushar00jain
Contributor Author

@fegin fixed the linter. The ROCm test looks like it's having an authentication issue:

It looks like you might be trying to authenticate with OIDC. Did you mean to set the `id-token` permission? If you are not trying to authenticate with OIDC and the action is working successfully, you can ignore this message.
Error: Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers

@tushar00jain tushar00jain requested a review from fegin November 10, 2025 19:56
@tushar00jain tushar00jain merged commit 20fcfd7 into pytorch:main Nov 11, 2025
8 of 15 checks passed
ahoffman-aws pushed a commit to drcanchi-aws/torchtitan that referenced this pull request Nov 11, 2025
Summary:
- we need to pass the global rank information to pytorch so that the pg
name can include the pg information
- this is necessary to differentiate the default pg's on different
replicas
- these need to be different because flight recorder matches collectives
based on pg name as well
- add ft training to experiments folder, we'll move remaining pieces of
ft to this gradually but make new features only available through this
folder

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1986).
* pytorch#1988
* pytorch#1987
* __->__ pytorch#1986

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>

4 participants