Add Semaphore Support for `cp.async` loads (Non-TMA Load Patterns) #97

SohamGovande · 2025-03-05T05:18:08Z

This PR adds semaphore support to non-TMA `load_async` operations by using the PTX instruction `cp.async.mbarrier.arrive.noinc.shared::cta.b64`. The goal is to simplify producer-consumer kernels whose load patterns are non-standard and cannot be expressed as TMA loads.

Background and Motivation

Working with @DanFu09, I developed sparse matmul kernels that had to use `cp.async` instead of TMA because of our unique memory layout. Currently, producer-consumer kernels force the producer to call `cp.async.wait_all` and then signal the semaphore manually (e.g. the FFTConv kernel). In our tests, this manual pattern (`cp.async.wait_all` followed by an explicit `arrive(bar)`) was over 200 TFLOPS slower than letting `cp.async` signal the semaphore automatically.

Note on Semaphores:
The PTX instruction `cp.async.mbarrier.arrive.noinc.shared::cta.b64` makes the issuing thread arrive at the semaphore automatically once all of its non-committed `cp.async` operations have finished. Until then, the thread is free to work on other tasks. Because of the `.noinc` qualifier, the arrive does not increment the semaphore's pending count, so every thread's arrival must already be included in the expected arrival count when the semaphore is initialized. For example, when `warpgroup::load_async` is called with a semaphore, the expected arrival count is 128 (32 threads per warp * 4 warps). Detailed explanations are provided in the updated library comments.
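
To make the two signalling patterns concrete, here is a minimal standalone sketch in raw CUDA with inline PTX. It is not the code from this PR and it does not use the library's `load_async` API: the names `rotate_copy`, `run_rotate_copy`, `smem_u32`, and `kThreads` are made up for illustration, and the sketch assumes an sm_80 or newer GPU and a toolkit recent enough to accept the `.shared::cta` qualifiers. Each of the 128 threads issues one 4-byte `cp.async`, then either waits and arrives explicitly or lets the copy arrive on the mbarrier by itself.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // one warpgroup: 32 threads per warp * 4 warps

// Convert a generic pointer into a 32-bit shared-memory address for PTX operands.
__device__ __forceinline__ uint32_t smem_u32(const void* p) {
    return static_cast<uint32_t>(__cvta_generic_to_shared(p));
}

// kAutoArrive == false: cp.async.wait_all followed by an explicit mbarrier arrive.
// kAutoArrive == true:  the thread's pending cp.async ops arrive on the mbarrier when done.
template <bool kAutoArrive>
__global__ void rotate_copy(const float* src, float* dst) {
    __shared__ float tile[kThreads];
    __shared__ __align__(8) uint64_t bar;
    const uint32_t bar_addr = smem_u32(&bar);

    if (threadIdx.x == 0) {
        // Expected arrival count: one arrival per participating thread.
        asm volatile("mbarrier.init.shared::cta.b64 [%0], %1;"
                     :: "r"(bar_addr), "r"(kThreads) : "memory");
    }
    __syncthreads();

    // Each thread issues one 4-byte asynchronous global-to-shared copy.
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;"
                 :: "r"(smem_u32(&tile[threadIdx.x])),
                    "l"(__cvta_generic_to_global(src + threadIdx.x)) : "memory");

    if (kAutoArrive) {
        // Non-blocking: the arrival is triggered when this thread's prior copies complete.
        asm volatile("cp.async.mbarrier.arrive.noinc.shared::cta.b64 [%0];"
                     :: "r"(bar_addr) : "memory");
    } else {
        // Blocking: stall until the copies land, then signal by hand.
        asm volatile("cp.async.wait_all;" ::: "memory");
        asm volatile("{ .reg .b64 state; mbarrier.arrive.shared::cta.b64 state, [%0]; }"
                     :: "r"(bar_addr) : "memory");
    }

    // Consumer side: spin until phase 0 of the mbarrier has completed.
    uint32_t done = 0;
    while (!done) {
        asm volatile("{ .reg .pred p;\n"
                     "  mbarrier.test_wait.parity.shared::cta.b64 p, [%1], %2;\n"
                     "  selp.u32 %0, 1, 0, p; }"
                     : "=r"(done) : "r"(bar_addr), "r"(0u) : "memory");
    }

    // Read the element loaded by a neighbouring thread to exercise cross-thread visibility.
    dst[threadIdx.x] = tile[(threadIdx.x + 1) % kThreads];
}

// Host-side check: launches one block and verifies dst[i] == src[(i + 1) % kThreads].
template <bool kAutoArrive>
bool run_rotate_copy() {
    std::vector<float> h_src(kThreads), h_dst(kThreads, -1.0f);
    for (int i = 0; i < kThreads; ++i) h_src[i] = static_cast<float>(i);

    const size_t bytes = kThreads * sizeof(float);
    float *d_src = nullptr, *d_dst = nullptr;
    if (cudaMalloc(&d_src, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) { cudaFree(d_src); return false; }
    cudaMemcpy(d_src, h_src.data(), bytes, cudaMemcpyHostToDevice);

    rotate_copy<kAutoArrive><<<1, kThreads>>>(d_src, d_dst);
    bool ok = cudaMemcpy(h_dst.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_src);
    cudaFree(d_dst);

    for (int i = 0; ok && i < kThreads; ++i) ok = (h_dst[i] == h_src[(i + 1) % kThreads]);
    return ok;
}
```

Calling `run_rotate_copy<false>()` and `run_rotate_copy<true>()` from a `main` exercises both paths (built with something like `nvcc -arch=sm_80`). In this toy kernel every thread waits immediately, so it only illustrates the mechanics. In a producer-consumer kernel, the auto-arrive path would let the producer move on to issuing the next loads instead of stalling on `cp.async.wait_all`.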

What's New

- Non-TMA `load_async` operations can now automatically work with semaphores by accepting an optional semaphore parameter.
- Updated load strategies in 4 areas:
  - Tile - warp level
  - Tile - group level
  - Vector - warp level
  - Vector - group level
- Added tests to ensure correctness of the new operations.

Soham Govande added 2 commits March 4, 2025 17:08

- Semaphore support for cp.async (99bd50b)
- Semaphore tests (2c1f16b)

SohamGovande changed the title ~~Add Semaphore Support for cp.async global-to-shared loads (Non-TMA Load Patterns)~~ Add Semaphore Support for cp.async loads (Non-TMA Load Patterns) Mar 5, 2025
