From 17ede2677b33474b716cbd26e1f45e6b6b5c28e3 Mon Sep 17 00:00:00 2001
From: James Riley McHugh <mchugh.james1@gmail.com>
Date: Fri, 15 Aug 2025 22:00:52 -0400
Subject: [PATCH 1/2] docs: added sections to thread.rs on thread blocks and
 grids

Signed-off-by: James Riley McHugh <mchugh.james1@gmail.com>
---
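Note for reviewers (not part of the commit): the docs added here say that the
`thread_idx_*`, `block_idx_*`, and `block_dim_*` values identify the data a
kernel should operate on. A minimal host-side sketch of that arithmetic in
plain Rust follows. It does not call cuda_std. `Coords1d`, `global_index`, and
`simulate_launch` are hypothetical names used only for illustration, and the
nested loops stand in for blocks and threads that a real launch would run in
parallel and in no particular order.

```rust
/// Host-side stand-in (hypothetical) for the values a kernel would read from
/// `thread_idx_x`, `block_idx_x`, and `block_dim_x`.
#[derive(Clone, Copy, Debug)]
struct Coords1d {
    thread_idx: u32,
    block_idx: u32,
    block_dim: u32,
}

/// Global index of a thread along one axis:
/// the threads in all preceding blocks, plus the offset inside this block.
fn global_index(c: Coords1d) -> u32 {
    c.block_idx * c.block_dim + c.thread_idx
}

/// Emulate a 1D launch of `grid_dim` blocks with `block_dim` threads each
/// over `len` elements. Returns how many times each element was visited.
fn simulate_launch(grid_dim: u32, block_dim: u32, len: usize) -> Vec<u32> {
    let mut visits = vec![0u32; len];
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let i = global_index(Coords1d { thread_idx, block_idx, block_dim }) as usize;
            // Threads past the end of the data must do nothing.
            if i < len {
                visits[i] += 1;
            }
        }
    }
    visits
}
```

For example, `simulate_launch(3, 4, 10)` covers ten elements with three blocks
of four threads. The two surplus threads in the last block fail the bounds
check and do nothing.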
 crates/cuda_std/src/thread.rs | 38 ++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/crates/cuda_std/src/thread.rs b/crates/cuda_std/src/thread.rs
index 3a447c95..88831e03 100644
--- a/crates/cuda_std/src/thread.rs
+++ b/crates/cuda_std/src/thread.rs
@@ -12,12 +12,40 @@
 //! Threads are the fundamental element of GPU computing. Threads execute the same kernel
 //! at the same time, controlling their task by retrieving their corresponding global thread ID.
 //!
-//! # Thread Blocks
+//! ## Thread Blocks
+//!
+//! The most important structure after threads. Thread blocks arrange threads into one-dimensional,
+//! two-dimensional, or three-dimensional blocks. The dimensionality of the thread block
+//! typically corresponds to the dimensionality of the data being worked with. The number of
+//! threads in the block is configurable. The maximum number of threads in a block is
+//! device-specific, but 1024 is a typical maximum on current GPUs.
+//!
+//! Thread blocks are the primary elements for GPU scheduling. A thread block may be scheduled for
+//! execution on any of the GPU's available streaming multiprocessors. If a GPU does not have
+//! a streaming multiprocessor available to run the block, it will be queued for scheduling. Because
+//! thread blocks are the fundamental scheduling element, they are required to execute
+//! independently and in any order.
+//!
+//! Threads within a block can share data between each other via shared memory and barrier
+//! synchronization.
+//!
+//! The kernel can retrieve the index of a given thread within a block via the
+//! `thread_idx_x`, `thread_idx_y`, and `thread_idx_z` functions (depending on the dimensionality
+//! of the thread block).
+//!
+//! ## Grids
+//!
+//! Multiple thread blocks make up the grid, the highest level of the CUDA thread model. Like thread
+//! blocks, grids can arrange thread blocks into one-dimensional, two-dimensional, or
+//! three-dimensional grids.
+//!
+//! The kernel can retrieve the index of a given block within a grid via the
+//! `block_idx_x`, `block_idx_y`, and `block_idx_z` functions (depending on the dimensionality
+//! of the grid). Additionally, the dimensionality of the block can be retrieved via the
+//! `block_dim_x`, `block_dim_y`, and `block_dim_z` functions. These functions, along with the
+//! `thread_*` functions mentioned previously, can be used to identify portions of the data the
+//! kernel should operate on.
 //!
-//! The most important structure after threads, thread blocks arrange
-
-// TODO: write some docs about the terms used in this module.
-
 use cuda_std_macros::gpu_only;
 use glam::{UVec2, UVec3};
 

From a6c24c705cf30ba9fe00042d5b007579a691b952 Mon Sep 17 00:00:00 2001
From: James Riley McHugh <mchugh.james1@gmail.com>
Date: Fri, 15 Aug 2025 22:06:07 -0400
Subject: [PATCH 2/2] docs: revisions for better wording and added examples for
 computing global indices

Signed-off-by: James Riley McHugh <mchugh.james1@gmail.com>
---
 crates/cuda_std/src/thread.rs | 73 ++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 31 deletions(-)

diff --git a/crates/cuda_std/src/thread.rs b/crates/cuda_std/src/thread.rs
index 88831e03..449cc972 100644
--- a/crates/cuda_std/src/thread.rs
+++ b/crates/cuda_std/src/thread.rs
@@ -1,50 +1,61 @@
 //! Functions for dealing with the parallel thread execution model employed by CUDA.
 //!
-//! # CUDA Thread model
+//! # CUDA thread model
 //!
-//! The CUDA thread model is based on 3 main structures:
+//! CUDA organizes execution into three hierarchical levels:
 //! - Threads
-//! - Thread Blocks
+//! - Thread blocks
 //! - Grids
 //!
 //! ## Threads
 //!
-//! Threads are the fundamental element of GPU computing. Threads execute the same kernel
-//! at the same time, controlling their task by retrieving their corresponding global thread ID.
+//! Threads are the fundamental unit of execution. Every thread runs the same kernel
+//! code, typically operating on different data. Threads identify their work via
+//! their indices and the dimensions of their block and grid.
 //!
-//! ## Thread Blocks
+//! ## Thread blocks
 //!
-//! The most important structure after threads. Thread blocks arrange threads into one-dimensional,
-//! two-dimensional, or three-dimensional blocks. The dimensionality of the thread block
-//! typically corresponds to the dimensionality of the data being worked with. The number of
-//! threads in the block is configurable. The maximum number of threads in a block is
-//! device-specific, but 1024 is a typical maximum on current GPUs.
+//! Threads are arranged into one-, two-, or three-dimensional blocks. The dimensionality
+//! of a block usually mirrors the data layout (e.g., 2D blocks for images). The number of
+//! threads per block is configurable up to a device-dependent limit (commonly 1024 in total).
 //!
-//! Thread blocks are the primary elements for GPU scheduling. A thread block may be scheduled for
-//! execution on any of the GPU's available streaming multiprocessors. If a GPU does not have
-//! a streaming multiprocessor available to run the block, it will be queued for scheduling. Because
-//! thread blocks are the fundamental scheduling element, they are required to execute
-//! independently and in any order.
+//! Thread blocks are the primary unit of scheduling. Any block can be scheduled on any of the
+//! GPU's streaming multiprocessors (SMs). If no SM is available, the block waits in a queue.
+//! Because blocks may execute in any order and at different times, they must be designed to run
+//! independently of one another.
 //!
-//! Threads within a block can share data between each other via shared memory and barrier
-//! synchronization.
-//!
-//! The kernel can retrieve the index of a given thread within a block via the
-//! `thread_idx_x`, `thread_idx_y`, and `thread_idx_z` functions (depending on the dimensionality
-//! of the thread block).
+//! Threads within the same block can cooperate via shared memory and block-wide barriers.
+//! The kernel can retrieve a thread's index within its block via `thread_idx_x`, `thread_idx_y`,
+//! and `thread_idx_z`, and the block's dimensions via `block_dim_x`, `block_dim_y`, and
+//! `block_dim_z`.
 //!
 //! ## Grids
 //!
-//! Multiple thread blocks make up the grid, the highest level of the CUDA thread model. Like thread
-//! blocks, grids can arrange thread blocks into one-dimensional, two-dimensional, or
-//! three-dimensional grids.
+//! A grid is an array (1D/2D/3D) of thread blocks. Grids define how many blocks are launched
+//! and how they are arranged.
+//!
+//! The kernel can retrieve the block's index within the grid via `block_idx_x`, `block_idx_y`,
+//! and `block_idx_z`, and the grid's dimensions via `grid_dim_x`, `grid_dim_y`, and `grid_dim_z`.
+//! Combined with the `thread_idx_*` and `block_dim_*` values, these indices are used to compute
+//! which portion of the input data a thread should process.
+//!
+//! ## Computing global indices (examples)
+//!
+//! 1D global thread index:
+//! ```rust,no_run
+//! use cuda_std::thread;
+//! let gx = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! ```
+//!
+//! 2D global coordinates (x, y):
+//! ```rust,no_run
+//! use cuda_std::thread;
+//! let x = thread::block_idx_x() * thread::block_dim_x() + thread::thread_idx_x();
+//! let y = thread::block_idx_y() * thread::block_dim_y() + thread::thread_idx_y();
+//! ```
 //!
-//! The kernel can retrieve the index of a given block within a grid via the
-//! `block_idx_x`, `block_idx_y`, and `block_idx_z` functions (depending on the dimensionality
-//! of the grid). Additionally, the dimensionality of the block can be retrieved via the
-//! `block_dim_x`, `block_dim_y`, and `block_dim_z` functions. These functions, along with the
-//! `thread_*` functions mentioned previously, can be used to identify portions of the data the
-//! kernel should operate on.
+//! Note: Hardware limits for block dimensions, grid dimensions, and total threads per block
+//! vary by device. Query device properties when you need exact limits.
 //!
 use cuda_std_macros::gpu_only;
 use glam::{UVec2, UVec3};
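
Note (not part of the commit): the 2D example above yields per-axis
coordinates. In practice these are often flattened into a row-major offset and
checked against the data bounds, because the grid is usually rounded up to a
whole number of blocks. The sketch below is plain host-side Rust under those
assumptions. `global_coord`, `blocks_needed`, and `linear_index` are
hypothetical helpers for illustration, not cuda_std functions.

```rust
/// Global coordinate along one axis, mirroring
/// `block_idx * block_dim + thread_idx` from the doc examples.
fn global_coord(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

/// Number of blocks needed to cover `len` elements with `block_dim` threads
/// per block (ceiling division). Assumes `block_dim > 0` and that
/// `len + block_dim - 1` does not overflow `u32`.
fn blocks_needed(len: u32, block_dim: u32) -> u32 {
    (len + block_dim - 1) / block_dim
}

/// Row-major offset of `(x, y)` in a `width` x `height` buffer, or `None`
/// when the coordinate lies outside the data (a surplus thread).
fn linear_index(x: u32, y: u32, width: u32, height: u32) -> Option<usize> {
    if x < width && y < height {
        Some(y as usize * width as usize + x as usize)
    } else {
        None
    }
}
```

Returning `Option` keeps the bounds check and the offset computation in one
place, so a surplus thread cannot index past the end of the buffer by accident.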