
Conversation

@cthi (Contributor) commented Nov 11, 2025

If the input tensors are on a device other than the current device, the wrong device is used for operations such as workspace allocation (when using cutlass::device_memory::allocation), and the kernel runs on the wrong stream. Either breaks the kernel. As a fix, we add a CUDAGuard to ensure the correct device is used.

  • cutlass::device_memory::allocation is a wrapper around cudaMalloc, which bypasses PyTorch's CUDA caching allocator (CCA). We replace all usages with torch tensor allocation instead, which is less error-prone and allows proper memory reuse.

Differential Revision: D86768064


netlify bot commented Nov 11, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 722f8b6
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/69138f65a8a80c00083158a9
😎 Deploy Preview: https://deploy-preview-5113--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Nov 11, 2025

meta-codesync bot commented Nov 11, 2025

@cthi has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86768064.

@cthi cthi changed the title Use torch allocation instead of cutlass::device_memory::allocation Add CUDAGuard to ensure correct device Nov 11, 2025
Summary:
X-link: facebookresearch/FBGEMM#2119


If the input tensors are on a device other than the current device, the wrong device is used for operations such as workspace allocation (when using `cutlass::device_memory::allocation`), and the kernel runs on the wrong stream. Either breaks the kernel. As a fix, we add a `CUDAGuard` to ensure the correct device is used.
- `cutlass::device_memory::allocation` is a wrapper around [`cudaMalloc`](https://github.com/NVIDIA/cutlass/blob/2252254ce2c3f11ef5cfff9721ebbe7bd62cf8cb/tools/util/include/cutlass/util/device_memory.h#L56), which bypasses PyTorch's CUDA caching allocator (CCA). We replace all usages with torch tensor allocation instead, which is less error-prone and allows proper memory reuse.

Differential Revision: D86768064

meta-codesync bot commented Nov 12, 2025

This pull request has been merged in 62bdc5f.
