
Commit 33664c0

james-mchugh authored and LegNeato committed
docs: revisions for better wording and added examples for computing global indices
Signed-off-by: James Riley McHugh <mchugh.james1@gmail.com>
1 parent ce79eef commit 33664c0

File tree

1 file changed: +42 −31 lines changed

crates/cuda_std/src/thread.rs

Lines changed: 42 additions & 31 deletions
@@ -1,50 +1,61 @@
 //! Functions for dealing with the parallel thread execution model employed by CUDA.
 //!
-//! # CUDA Thread model
+//! # CUDA thread model
 //!
-//! The CUDA thread model is based on 3 main structures:
+//! CUDA organizes execution into three hierarchical levels:
 //! - Threads
-//! - Thread Blocks
+//! - Thread blocks
 //! - Grids
 //!
 //! ## Threads
 //!
-//! Threads are the fundamental element of GPU computing. Threads execute the same kernel
-//! at the same time, controlling their task by retrieving their corresponding global thread ID.
+//! Threads are the fundamental unit of execution. Every thread runs the same kernel
+//! code, typically operating on different data. Threads identify their work via
+//! their indices and the dimensions of their block and grid.
 //!
-//! ## Thread Blocks
+//! ## Thread blocks
 //!
-//! The most important structure after threads. Thread blocks arrange threads into one-dimensional,
-//! two-dimensional, or three-dimensional blocks. The dimensionality of the thread block
-//! typically corresponds to the dimensionality of the data being worked with. The number of
-//! threads in the block is configurable. The maximum number of threads in a black is
-//! device-specific, but 1024 is a typical maximum on current GPUs.
+//! Threads are arranged into one-, two-, or three-dimensional blocks. The dimensionality
+//! of a block usually mirrors the data layout (e.g., 2D blocks for images). The number of
+//! threads per block is configurable and device-dependent (commonly up to 1024 total threads).
 //!
-//! Thread blocks the primary elements for GPU scheduling. A thread block may be scheduled for
-//! execution on any of the GPUs available streaming multiprocessors. If a GPU does not have
-//! a streaming multiprocessor available to run the block, it will be queued for scheduling. Because
-//! thread blocks are the fundamental scheduling element, they are required to execute
-//! independently and in any order.
+//! Thread blocks are the primary unit of scheduling. Any block can be scheduled on any of the
+//! GPU’s streaming multiprocessors (SMs). If no SM is available, the block waits in a queue.
+//! Because blocks may execute in any order and at different times, they must be designed to run
+//! independently of one another.
 //!
-//! Threads within a block can share data between each other via shared memory and barrier
-//! synchronization.
-//!
-//! The kernel can retrieve the index of a given thread within a block via the
-//! `thread_idx_x`, `thread_idx_y`, and `thread_idx_z` functions (depending on the dimensionality
-//! of the thread block).
+//! Threads within the same block can cooperate via shared memory and block-wide barriers.
+//! The kernel can retrieve a thread’s index within its block via `thread_idx_x`, `thread_idx_y`,
+//! and `thread_idx_z`, and the block’s dimensions via `block_dim_x`, `block_dim_y`, and
+//! `block_dim_z`.
 //!
 //! ## Grids
 //!
-//! Multiple thread blocks make up the grid, the highest level of the CUDA thread model. Like thread
-//! blocks, grids can arrange thread blocks into one-dimensional, two-dimensional, or
-//! three-dimensional grids.
+//! A grid is an array (1D/2D/3D) of thread blocks. Grids define how many blocks are launched
+//! and how they are arranged.
+//!
+//! The kernel can retrieve the block’s index within the grid via `block_idx_x`, `block_idx_y`,
+//! and `block_idx_z`, and the grid’s dimensions via `grid_dim_x`, `grid_dim_y`, and `grid_dim_z`.
+//! Combined with the `thread_*` and `block_dim_*` values, these indices are used to compute
+//! which portion of the input data a thread should process.
+//!
+//! ## Computing global indices (examples)
+//!
+//! 1D global thread index:
+//! ```rust
+//! use cuda_std::thread;
+//! let gx = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! ```
+//!
+//! 2D global coordinates (x, y):
+//! ```rust
+//! use cuda_std::thread;
+//! let x = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! let y = thread::block_idx_y() * thread::block_dim_y() + thread::thread_idx_y();
+//! ```
 //!
-//! The kernel can retrieve the index of a given block within a grid via the
-//! `block_idx_x`, `block_idx_y`, and `block_idx_z` functions (depending on the dimensionality
-//! of the grid). Additionally, the dimensionality of the block can be retrieved via the
-//! `block_dim_x`, `block_dim_y`, and `block_dim_z` functions. These functions, along with the
-//! `thread_*` functions mentioned previously, can be used to identify portions of the data the
-//! kernel should operate on.
+//! Note: Hardware limits for block dimensions, grid dimensions, and total threads per block
+//! vary by device. Query device properties when you need exact limits.
 //!
 use cuda_std_macros::gpu_only;
 use glam::{UVec2, UVec3};
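As a sanity check on the indexing arithmetic the new doc examples describe, here is a plain host-side Rust sketch (not a GPU kernel, and not part of the commit). The `block_idx`, `block_dim`, and `thread_idx` values below are hypothetical stand-ins for what `thread::block_idx_x()` and friends would return on the device.

```rust
// CPU-side sketch of the global-index arithmetic documented in the diff above.
// On the GPU, the inputs would come from cuda_std's thread::block_idx_x(),
// thread::block_dim_x(), and thread::thread_idx_x(); here they are plain values.
fn global_index_1d(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

fn global_index_2d(
    block_idx: (u32, u32),
    block_dim: (u32, u32),
    thread_idx: (u32, u32),
) -> (u32, u32) {
    (
        block_idx.0 * block_dim.0 + thread_idx.0,
        block_idx.1 * block_dim.1 + thread_idx.1,
    )
}

fn main() {
    // Block 2 of size 256, thread 5 within the block -> global thread 517.
    assert_eq!(global_index_1d(2, 256, 5), 517);

    // 2D: block (1, 3) with 16x16 threads, thread (4, 7) within the block.
    assert_eq!(global_index_2d((1, 3), (16, 16), (4, 7)), (20, 55));
}
```

In a real kernel the global index is typically compared against the data length before use, since the launched grid often contains more threads than there are elements.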
