# Programming and execution models

At a high level, the execution of a CUDA program proceeds as follows:

1. **Data Transfer**: The main program runs on the CPU, loads the data and transmits it to the device.
2. **Kernel Execution**: The GPU program (kernel) is loaded and executed.
3. **Result Retrieval**: Computed results are transferred back from GPU to CPU memory.

Kernels execute across a set of parallel threads, which are organized into thread blocks. These blocks are organized into a grid that can be up to three-dimensional.

$\text{kernel} \rightarrow \text{grid of blocks (up to 3D)} \rightarrow \text{each block contains a grid of threads (up to 3D)}$

More specifically, the execution model is:

1) Block
   - Blocks are assigned to a Streaming Multiprocessor (SM)
   - Multiple blocks can be assigned to the same SM
   - Blocks are executed in any order, independently of each other
2) Warp
   - The SM divides each block into warps (the number of warps depends on the number of threads per block)
   - Each warp is generally composed of 32 threads
   - The SM interleaves the execution of several warps
   - While the programmer perceives the threads of a grid as concurrent at the application level, the actual execution order of blocks and warps is determined by the CUDA runtime and the GPU hardware at the architecture level
3) SM Occupancy
   - High resource utilization is obtained by interleaving many warps in the SM
   - SM occupancy = `(Active Warps) / (Maximum Warps)`
   - Higher occupancy leads to better resource utilization: when a warp stalls, the SM executes another warp (latency hiding)

The SM divides the block into active warps (32 threads each):
- A warp is executed with a SIMT approach
- The execution of several warps is interleaved

Each active warp may be classified as:
- Selected warp: currently being executed
- Stalled warp: not ready to be executed
- Eligible warp: ready to be executed

More specifically, CUDA latency hiding consists in having a large number of active threads (or warps) per streaming multiprocessor (SM) and interleaving their execution. While some threads are waiting for memory operations to complete (when a thread issues a memory request it may have to wait hundreds of clock cycles for the data to be fetched), other threads can be scheduled to perform **arithmetic operations**, effectively hiding the memory latency.

The arithmetic/memory instruction ratio indicates how many arithmetic (compute) instructions are executed for every memory instruction. A higher ratio implies that the program does more computation relative to the amount of data it loads or stores, whereas a lower ratio suggests that the program is more memory-bound.

In summary, the larger the arithmetic/memory instruction ratio, the lower the number of warps needed to hide memory stalls. This is because a higher ratio implies more arithmetic work per memory operation, so fewer concurrent warps are enough to keep the SM busy while memory operations complete.

## CUDA Thread Hierarchy

CUDA organizes threads hierarchically into grids, with each grid divided into blocks.

- Blocks are assigned to a streaming multiprocessor (SM) for execution
- Blocks can be executed in any order and are managed by the SM
- The SM schedules the execution of **warps**, which are groups of 32 threads

At any moment a warp inside an SM can be:
- Selected warp: currently being executed
- Stalled warp: not ready to be executed
- Eligible warp: ready to be executed

In each warp, each thread is identified by a block number and a thread number.
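For example, the host code later in this section launches a kernel named `vsumKernel` without showing its body. The following is a minimal sketch of what such a kernel could look like, assuming an element-wise sum of two `int` vectors whose length is an exact multiple of the block size:

```c++
// Hedged sketch: a plausible definition of vsumKernel (the source only shows
// the launch). Assumes N is an exact multiple of the block size, so no bounds
// check is needed.
__global__ void vsumKernel(int *va, int *vb, int *vc) {
    // Flattened global index: block offset plus position inside the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    vc[i] = va[i] + vb[i];
}
```

The expression `blockIdx.x * blockDim.x + threadIdx.x` combines the block number and the thread number into a single flattened index, as described next.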
In the code, to correctly map each thread to a single data element and its corresponding input elements, we have to create a flattened index for accessing the input and output arrays. We can use these variables:

- `blockIdx`: id of the block in the grid
- `blockDim`: size of the block (#threads)
- `threadIdx`: id of the thread in the block
- `gridDim`: size of the grid (#blocks)

The thread grid in CUDA can have up to three dimensions, chosen according to the nature of the problem:

- 1D for vector operations
- 2D for matrices or images
- 3D for volumetric data

Even though the grid can be defined with up to three dimensions, it is internally linearized with a row-major layout.

Warp divergence is one factor of performance degradation: it happens when threads in the same warp take different directions during the execution of a branch or loop statement. Strategies to minimize divergence include organizing data so that threads in the same warp perform the same operations.

The flow of CUDA calls involves enqueuing commands in a stream, with synchronization points used to manage execution order.

| Function Qualifier | Executed on | Only Callable From |
|--------------------|-------------|--------------------|
| `__device__`       | Device      | Device             |
| `__global__`       | Device      | Host               |
| `__host__`         | Host        | Host               |

In CUDA programming, to transfer data from the host to the device, the host code calls the `cudaMemcpy()` function with the transfer direction set to "from host to device". The function is **blocking**, meaning control is returned to the main program only once the transfer has completed. Typically, prefixes like `dev_` or `d_` are used to denote device (GPU) variables, but the important thing is to be consistent.

```c++
// device memory allocation
cudaMalloc(&d_va, N * sizeof(int));
cudaMalloc(&d_vb, N * sizeof(int));
cudaMalloc(&d_vc, N * sizeof(int));

// CPU -> GPU data transmission (h_vb assumed to be the second host input vector)
cudaMemcpy(d_va, h_va, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_vb, h_vb, N * sizeof(int), cudaMemcpyHostToDevice);

// kernel launch
dim3 blocksPerGrid(N / 256, 1, 1);
dim3 threadsPerBlock(256, 1, 1);
vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(d_va, d_vb, d_vc);

// GPU -> CPU data transmission
cudaMemcpy(h_vc, d_vc, N * sizeof(int), cudaMemcpyDeviceToHost);

// device memory freeing
cudaFree(d_va);
cudaFree(d_vb);
cudaFree(d_vc);
```

CUDA calls may be:

- Blocking: control returns to the host only when the call has completed, e.g. `cudaMemcpy`
- Non-blocking: the host continues its execution asynchronously, e.g. a kernel launch

A barrier (e.g. `cudaDeviceSynchronize()`) can be used to force the host to wait for the termination of the kernel execution. Inside blocks, synchronization in CUDA is explicitly invoked by the programmer, as the GPU is designed for high performance and provides no automatic synchronization mechanism. Logically, threads are concurrent, but the execution order is unpredictable. Synchronization is performed using barriers that synchronize threads within the same block; there is no barrier call applicable among threads of different blocks at grid level. `__syncthreads()` is the barrier call.
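As a hypothetical illustration of why the barrier is needed (the `reverseTile` kernel below is not from the text), each thread stages one element in shared memory, and `__syncthreads()` guarantees that the whole tile has been written before any thread reads an element produced by a different thread:

```c++
// Hypothetical example: each block reverses its own tile of the array.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void reverseTile(int *data) {
    __shared__ int tile[256];                      // per-block shared memory
    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + threadIdx.x;

    tile[local] = data[global];                    // stage one element per thread
    __syncthreads();                               // wait until the whole tile is loaded

    // Safe only after the barrier: read an element written by another thread
    data[global] = tile[blockDim.x - 1 - local];
}
```

Without the barrier, a thread could read `tile[blockDim.x - 1 - local]` before the thread responsible for that slot has written it, since the execution order of warps within the block is unpredictable.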