Skip to content

Add Semaphore Support for cp.async loads (Non-TMA Load Patterns) #97

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

SohamGovande
Copy link

@SohamGovande SohamGovande commented Mar 5, 2025

This PR introduces semaphore support for non-TMA load_async operations by leveraging the PTX instruction cp.async.mbarrier.arrive.noinc.shared::cta.b64. The change aims to simplify producer-consumer kernels with non-standard load patterns that cannot be completed by the TMA.

Background and Motivation

Working with @DanFu09, I developed sparse matmul kernels that required using cp.async instead of TMA because of our unique memory layout. Currently, producer-consumer kernels force the producer to call cp.async.wait_all and manually signal the semaphore (e.g. FFTConv kernel). Our tests show that manually waiting on a semaphore with cp.async.wait_all plus an explicit arrive(bar) is over 200 TFLOPS slower than allowing cp.async to automatically signal the semaphore.

Note on Semaphores:
The PTX instruction cp.async.mbarrier.arrive.noinc.shared::cta.b64 ensures that once all non-committed cp.async operations from the current thread finish, that thread automatically arrives at the semaphore. Until then, it can work on other tasks. For example, when warpgroup::load_async is called with a semaphore, the expected arrival count is 128 (32 threads per warp * 4 warps). Detailed explanations are provided in the updated library comments.

What's New

  • Non-TMA load_async operations can now automatically work with semaphores by accepting an optional semaphore parameter.
  • Updated load strategies in 4 areas:
    • Tile - warp level
    • Tile - group level
    • Vector - warp level
    • Vector - group level
  • Added tests to ensure correctness of the new operations.

@SohamGovande SohamGovande changed the title Add Semaphore Support for cp.async global-to-shared loads (Non-TMA Load Patterns) Add Semaphore Support for cp.async loads (Non-TMA Load Patterns) Mar 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant