Fix NCCL broadcast error on CPU tensors in distributed inference #257
This PR fixes a runtime error in distributed inference with the NCCL backend:
`RuntimeError: No backend type associated with device type cpu`
Root Cause:
When using the NCCL backend, collective operations require CUDA tensors. The code called `dist.broadcast(length_tensor, src=0)` while `length_tensor` was still on the CPU, which raised the runtime error above on non-zero ranks.
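For context, a minimal sketch of the failing pattern (the process group is assumed to be initialized with `backend="nccl"`; names are illustrative and not taken from the repository):

```python
import torch
import torch.distributed as dist

# Sketch of the failing pattern: the metadata tensor lives on the CPU,
# but NCCL collectives can only operate on CUDA tensors.
length_tensor = torch.zeros(1, dtype=torch.long)  # allocated on the CPU
# Raises "RuntimeError: No backend type associated with device type cpu"
# on non-zero ranks when the default process group uses NCCL.
dist.broadcast(length_tensor, src=0)
```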
Fix:
Before broadcasting, the small metadata tensor is moved to the local CUDA device when `dist.get_backend() == "nccl"`. After the broadcast, it is moved back to the CPU to extract the Python integer.
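A minimal sketch of the approach (the helper name and signature are illustrative, not the exact code in this PR):

```python
import torch
import torch.distributed as dist

def broadcast_length(length_tensor: torch.Tensor, src: int = 0) -> int:
    """Broadcast a small CPU metadata tensor, routing it through the GPU for NCCL."""
    if dist.get_backend() == "nccl":
        # NCCL collectives require CUDA tensors, so stage the metadata
        # tensor on this rank's local GPU before the broadcast.
        length_tensor = length_tensor.to(torch.device("cuda", torch.cuda.current_device()))
    dist.broadcast(length_tensor, src=src)
    # Move back to the CPU to extract the Python integer.
    return int(length_tensor.cpu().item())
```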
Testing:
I do not have access to a Linux multi-GPU setup, so I could not reproduce the original crash. Since issue #252 provides reproduction steps, I'd appreciate it if maintainers or contributors could verify the fix in that environment.
Notes:
The device move applies only when the backend is NCCL and only to the small metadata tensor, so NCCL performance is preserved and other backends keep the existing CPU path.
Fixes #252.