After extensive doc-searching and even more extensive trial-and-error, I believe I have a good understanding of this issue. Unfortunately, unless there is a flag within PyTorch, DDP, Lightning, or CUDA governing GPU scheduling determinism that I don't know about, my main goal of forcing exact reproducibility appears to be impossible, or at least largely impractical, for the reasons I'll summarize below. I am moving on from this problem by simply avoiding the model stop/restart process I mentioned above through other means. But I'm posting what I discovered in case it helps anyone facing a similar problem.
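For reference, these are the standard determinism knobs I'm aware of in PyTorch and Lightning. They pin down seeding and per-operation determinism, but as far as I can tell none of them controls the GPU scheduling behavior described below, so treat this as a sketch of the usual settings rather than a fix for the restart problem:

```python
import os

# Required for deterministic cuBLAS GEMMs (CUDA >= 10.2); must be set
# before the first cuBLAS call, hence before importing/using torch on GPU.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
from pytorch_lightning import Trainer, seed_everything

# Seed Python, NumPy, and torch on every rank; workers=True also seeds
# DataLoader worker processes.
seed_everything(42, workers=True)

# Raise an error on any op that lacks a deterministic implementation.
torch.use_deterministic_algorithms(True)

# Force deterministic cuDNN kernels and disable autotuning.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Lightning's own switch, roughly equivalent to the torch settings above.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", deterministic=True)
```

Even with all of these set, nothing here dictates how work is scheduled across ranks or across a stop/restart boundary, which is exactly the gap I ran into.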

For starters, my second comment is more or less correct. Each rank (i.e., each p…
