
Pratham-Nayak1

This PR fixes a runtime error in distributed inference with the NCCL backend:
RuntimeError: No backend type associated with device type cpu

Root Cause:
When using NCCL, collective operations require CUDA tensors. The code attempted to run:
dist.broadcast(length_tensor, src=0)
while length_tensor was on CPU. This caused the runtime error on non-zero ranks.
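For context, here is a minimal sketch of the failing pattern; the process-group setup and tensor contents are illustrative, and the actual reproduction steps are in #252:

```python
import torch
import torch.distributed as dist

# Illustrative setup; assumes one process per GPU with the NCCL backend.
dist.init_process_group(backend="nccl")

# Metadata tensor is created on CPU by default.
length_tensor = torch.tensor([0], dtype=torch.long)

# NCCL collectives only accept CUDA tensors, so on non-zero ranks this
# raises: RuntimeError: No backend type associated with device type cpu
dist.broadcast(length_tensor, src=0)
```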

Fix:
Before broadcasting, the small metadata tensor is moved to the local CUDA device if dist.get_backend() == "nccl". After the broadcast, it is converted back to CPU to extract the Python integer.
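A minimal sketch of the approach (the helper name is illustrative, not the repository's exact code):

```python
import torch
import torch.distributed as dist

def broadcast_metadata_length(length_tensor: torch.Tensor, src: int = 0) -> int:
    """Broadcast a small metadata tensor and return its value as a Python int."""
    if dist.get_backend() == "nccl":
        # NCCL only supports collectives on CUDA tensors, so stage the
        # metadata on the local GPU for the broadcast.
        device = torch.device("cuda", torch.cuda.current_device())
        length_tensor = length_tensor.to(device)
    dist.broadcast(length_tensor, src=src)
    # Convert back to CPU to extract the Python integer.
    return int(length_tensor.cpu().item())
```

Keeping the device move behind the backend check means Gloo and other CPU-capable backends continue to broadcast the CPU tensor directly.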

Testing:
I do not have access to a Linux multi-GPU setup, so I could not reproduce the original crash locally.
Since the issue (#252) provides reproduction steps, I'd appreciate it if maintainers or contributors could verify this fix in that environment.

Notes:
Only the small metadata tensor is moved, and only when the backend is NCCL, so the change adds a negligible transfer and leaves NCCL collective performance otherwise unchanged.
Fixes #252.

Pratham-Nayak1 (Author)

@kmk142789 Thanks for the review and approval! Sorry for the late reply — really appreciate your time and feedback.
