diff --git a/crates/cuda_std/src/thread.rs b/crates/cuda_std/src/thread.rs
index 3a447c95..449cc972 100644
--- a/crates/cuda_std/src/thread.rs
+++ b/crates/cuda_std/src/thread.rs
@@ -1,23 +1,62 @@
 //! Functions for dealing with the parallel thread execution model employed by CUDA.
 //!
-//! # CUDA Thread model
+//! # CUDA thread model
 //!
-//! The CUDA thread model is based on 3 main structures:
+//! CUDA organizes execution into three hierarchical levels:
 //! - Threads
-//! - Thread Blocks
+//! - Thread blocks
 //! - Grids
 //!
 //! ## Threads
 //!
-//! Threads are the fundamental element of GPU computing. Threads execute the same kernel
-//! at the same time, controlling their task by retrieving their corresponding global thread ID.
+//! Threads are the fundamental unit of execution. Every thread runs the same kernel
+//! code, typically operating on different data. Threads identify their work via
+//! their indices and the dimensions of their block and grid.
 //!
-//! # Thread Blocks
+//! ## Thread blocks
+//!
+//! Threads are arranged into one-, two-, or three-dimensional blocks. The dimensionality
+//! of a block usually mirrors the data layout (e.g., 2D blocks for images). The number of
+//! threads per block is configurable and device-dependent (commonly up to 1024 total threads).
+//!
+//! Thread blocks are the primary unit of scheduling. Any block can be scheduled on any of the
+//! GPU’s streaming multiprocessors (SMs). If no SM is available, the block waits in a queue.
+//! Because blocks may execute in any order and at different times, they must be designed to run
+//! independently of one another.
+//!
+//! Threads within the same block can cooperate via shared memory and block-wide barriers.
+//! The kernel can retrieve a thread’s index within its block via `thread_idx_x`, `thread_idx_y`,
+//! and `thread_idx_z`, and the block’s dimensions via `block_dim_x`, `block_dim_y`, and
+//! `block_dim_z`.
+//!
+//! ## Grids
+//!
+//! A grid is an array (1D/2D/3D) of thread blocks. Grids define how many blocks are launched
+//! and how they are arranged.
+//!
+//! The kernel can retrieve the block’s index within the grid via `block_idx_x`, `block_idx_y`,
+//! and `block_idx_z`, and the grid’s dimensions via `grid_dim_x`, `grid_dim_y`, and `grid_dim_z`.
+//! Combined with the `thread_*` and `block_dim_*` values, these indices are used to compute
+//! which portion of the input data a thread should process.
+//!
+//! ## Computing global indices (examples)
+//!
+//! 1D global thread index:
+//! ```no_run
+//! use cuda_std::thread;
+//! let gx = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! ```
+//!
+//! 2D global coordinates (x, y):
+//! ```no_run
+//! use cuda_std::thread;
+//! let x = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! let y = thread::block_idx_y() * thread::block_dim_y() + thread::thread_idx_y();
+//! ```
+//!
+//! Note: hardware limits for block dimensions, grid dimensions, and total threads per block
+//! vary by device. Query device properties when you need exact limits.
 //!
-//! The most important structure after threads, thread blocks arrange
-
-// TODO: write some docs about the terms used in this module.
-
 use cuda_std_macros::gpu_only;
 use glam::{UVec2, UVec3};
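
The global-index arithmetic the new docs describe can be sketched host-side with plain integers, since the `thread::*` intrinsics only exist on-device. The launch configuration and element count below (4 blocks of 256 threads over 1000 elements) are hypothetical, chosen only to show why a bounds check is needed when the grid overshoots the data:

```rust
// Host-side sketch of the 1D indexing scheme from the docs, with plain
// integers standing in for the GPU intrinsics. Values are hypothetical.
fn global_index(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    // Mirrors: block_idx_x() * block_dim_x() + thread_idx_x()
    block_idx * block_dim + thread_idx
}

fn main() {
    let (grid_dim, block_dim) = (4u32, 256u32); // hypothetical launch config
    let n = 1000u32; // element count; grid covers 1024 indices, so 24 overshoot

    let mut processed = 0u32;
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let gx = global_index(block_idx, block_dim, thread_idx);
            // A real kernel bounds-checks, because grid_dim * block_dim is
            // usually rounded up past the number of elements.
            if gx < n {
                processed += 1; // "process element gx"
            }
        }
    }
    assert_eq!(processed, n);
    assert_eq!(global_index(3, 256, 255), 1023); // last thread in the grid
    println!("processed {processed} of {n} elements");
}
```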