//! Functions for dealing with the parallel thread execution model employed by CUDA.
//!
//! # CUDA thread model
//!
//! CUDA organizes execution into three hierarchical levels:
//! - Threads
//! - Thread blocks
//! - Grids
//!
//! ## Threads
//!
//! Threads are the fundamental unit of execution. Every thread runs the same kernel
//! code, typically operating on different data. Threads identify their work via
//! their indices and the dimensions of their block and grid.
//!
//! ## Thread blocks
//!
//! Threads are arranged into one-, two-, or three-dimensional blocks. The dimensionality
//! of a block usually mirrors the data layout (e.g., 2D blocks for images). The number of
//! threads per block is configurable and device-dependent (commonly up to 1024 total threads).
//!
//! Thread blocks are the primary unit of scheduling. Any block can be scheduled on any of the
//! GPU’s streaming multiprocessors (SMs). If no SM is available, the block waits in a queue.
//! Because blocks may execute in any order and at different times, they must be designed to run
//! independently of one another.
//!
//! Threads within the same block can cooperate via shared memory and block-wide barriers.
//! The kernel can retrieve a thread’s index within its block via `thread_idx_x`, `thread_idx_y`,
//! and `thread_idx_z`, and the block’s dimensions via `block_dim_x`, `block_dim_y`, and
//! `block_dim_z`.
//!
//! ## Grids
//!
//! A grid is an array (1D/2D/3D) of thread blocks. Grids define how many blocks are launched
//! and how they are arranged.
//!
//! The kernel can retrieve the block’s index within the grid via `block_idx_x`, `block_idx_y`,
//! and `block_idx_z`, and the grid’s dimensions via `grid_dim_x`, `grid_dim_y`, and `grid_dim_z`.
//! Combined with the `thread_*` and `block_dim_*` values, these indices are used to compute
//! which portion of the input data a thread should process.
//!
//! ## Computing global indices (examples)
//!
//! 1D global thread index:
//! ```rust
//! use cuda_std::thread;
//! let gx = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
//! ```
//!
//! 2D global coordinates (x, y):
//! ```rust
//! use cuda_std::thread;
//! let x = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
//! let y = thread::block_idx_y() * thread::block_dim_y() + thread::thread_idx_y();
//! ```
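//!
//! The same index arithmetic can be sanity-checked on the host with plain integer
//! math (the values below are made-up examples, not queried from a device):
//! ```rust
//! // hypothetical launch position: block index 2, 256 threads per block, thread index 5
//! let (block_idx, block_dim, thread_idx) = (2u32, 256u32, 5u32);
//! let global = block_idx * block_dim + thread_idx;
//! assert_eq!(global, 517);
//! ```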
//!
//! Note: Hardware limits for block dimensions, grid dimensions, and total threads per block
//! vary by device. Query device properties when you need exact limits.
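//!
//! A common host-side pattern when choosing a launch configuration is ceiling
//! division, so the grid has enough blocks to cover every element. A minimal
//! sketch with made-up sizes (the names are illustrative, not part of this
//! crate’s API):
//! ```rust
//! let n: u32 = 1000;          // number of elements to process
//! let block_size: u32 = 256;  // threads per block
//! // round up so all n elements are covered: ceil(1000 / 256) = 4 blocks
//! let grid_size = (n + block_size - 1) / block_size;
//! assert_eq!(grid_size, 4);
//! ```
//! With this configuration the last block has spare threads, so kernels
//! typically begin with a bounds check and return early when the global index
//! is `>= n`.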
//!
use cuda_std_macros::gpu_only;
use glam::{UVec2, UVec3};