# Reduction and scan

## Reduction

Reduction collects the values of an array (or another collection) and combines them into a single value. Typical examples are summing an array or finding its maximum value.

```cpp
// CPU version of the reduction kernel
float reduce_cpu(const float* data, const int length) {
    float sum = 0;
    for (int i = 0; i < length; i++) {
        sum += data[i];
    }
    return sum;
}
```

The naive GPU implementation divides the input into blocks of threads and has each thread process two elements. The threads then combine their partial results over successive iterations until only one thread remains, which performs the final reduction step.

![](images/Pasted%20image%2020240501180528.png)

The key change in the improved version is using the stride index in reverse order, halving it at each iteration while keeping the same overall approach; this keeps the active threads contiguous (the same reverse-stride pattern appears in the shared-memory loop of the coarsened kernel below).

![](images/Pasted%20image%2020240501182020.png)

```cpp
// GPU version of the reduction kernel (naive, increasing stride)
__global__ void reduce_gpu(double* __restrict__ input, double* __restrict__ output) {
    const unsigned int i = STRIDE_FACTOR * threadIdx.x;

    // Apply the offset: each block reduces its own segment in place
    input += blockDim.x * blockIdx.x * STRIDE_FACTOR;
    output += blockIdx.x;

    for (unsigned int stride = 1; stride <= blockDim.x; stride *= STRIDE_FACTOR) {
        if (threadIdx.x % stride == 0) {
            input[i] += input[i + stride];
        }
        __syncthreads();
    }

    // Write result for this block to global memory
    if (threadIdx.x == 0) {
        // You could have used only a single memory location and performed an atomicAdd
        *output = input[0];
    }
}
```
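Each block writes one partial sum, so the reduction still has to be finished across blocks. Below is a minimal host-side sketch (hypothetical code, not from the original notes: it assumes `BLOCK_DIM` and `STRIDE_FACTOR` are compile-time constants, that `length` is an exact multiple of `BLOCK_DIM * STRIDE_FACTOR`, and it omits error checking):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical driver: launches reduce_gpu and finishes the reduction on the CPU
double reduce_host(const double* h_input, const int length) {
    const int num_blocks = length / (BLOCK_DIM * STRIDE_FACTOR);

    double *d_input, *d_output;
    cudaMalloc(&d_input, length * sizeof(double));
    cudaMalloc(&d_output, num_blocks * sizeof(double));
    cudaMemcpy(d_input, h_input, length * sizeof(double), cudaMemcpyHostToDevice);

    // One partial sum per block; the kernel reduces its segment in place
    reduce_gpu<<<num_blocks, BLOCK_DIM>>>(d_input, d_output);

    std::vector<double> partial(num_blocks);
    cudaMemcpy(partial.data(), d_output, num_blocks * sizeof(double),
               cudaMemcpyDeviceToHost);

    // Combine the per-block results on the CPU
    double sum = 0.0;
    for (int b = 0; b < num_blocks; b++) sum += partial[b];

    cudaFree(d_input);
    cudaFree(d_output);
    return sum;
}
```

As the comment in the kernel suggests, the final CPU loop could instead be replaced by having thread 0 of every block `atomicAdd` its partial sum into a single zero-initialized location (for `double`, this requires compute capability 6.0 or newer).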
To reduce overhead, thread coarsening can be employed, making more efficient use of hardware resources. There are actually two solutions, depending on the `COARSE_FACTOR`:

![Image illustrating trade-offs and performance considerations](images/Pasted%20image%2020240501184452.png)

Regarding spawning fewer threads:

![Image showing detailed thread processing](images/Pasted%20image%2020240501184314.png)

The idea is to serialize the thread blocks ourselves, more efficiently than the hardware would. Each block's segment is identified by multiplying by the `COARSE_FACTOR`, and threads are tasked with more than just adding two elements:

- Threads begin by computing their own starting point in the input array, offsetting by the block and thread indices multiplied by the `COARSE_FACTOR`.
- Each thread sums multiple elements from the input array, as determined by the `COARSE_FACTOR`.
- This sum is first accumulated in ordinary registers, since they are faster to access than shared memory; the trade-off is higher register pressure, which can reduce parallelism.
- The per-thread sums are then reduced in shared memory through synchronized strides, halving the number of active threads at each step and keeping memory access patterns friendly to the GPU's architecture.

```cpp
// GPU version of the reduction kernel, with thread coarsening
__global__ void reduce_coarsening_gpu(const double* __restrict__ input,
                                      double* __restrict__ output) {
    __shared__ double input_s[BLOCK_DIM];
    const unsigned int t = threadIdx.x;

    // Apply the offset
    // NOTE: const applies to the pointed-to data; the pointer itself can change
    input += blockDim.x * blockIdx.x * COARSE_FACTOR;
    output += blockIdx.x;

    // Each thread accumulates COARSE_FACTOR elements; here the hardware is fully utilized
    double sum = input[t];
    for (unsigned int tile = 1; tile < COARSE_FACTOR; ++tile)
        sum += input[t + tile * BLOCK_DIM];

    // You could have accumulated directly in shared memory;
    // registers are faster, but high register pressure lowers parallelism
    input_s[t] = sum;

    // Perform reduction in shared memory (reverse stride order)
    for (unsigned int stride = blockDim.x / STRIDE_FACTOR; stride >= 1; stride /= STRIDE_FACTOR) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }

    // Write result for this block to global memory
    if (threadIdx.x == 0) {
        // You could have used only a single memory location and performed an atomicAdd
        *output = input_s[0];
    }
}
```

Here, each thread processes `COARSE_FACTOR` elements instead of just one, reducing the total number of threads and improving efficiency. Key takeaways: tuning the coarse factor and block size has a significant impact on performance; staging input elements in shared memory makes the parallel reduction more efficient; and the per-block results can be combined in global memory either with atomic operations or with additional synchronization. Balancing register usage against memory efficiency is crucial for optimizing GPU kernel performance.

## Scan

A scan operation, also known as a prefix sum, is similar to a reduction but produces an array of results instead of a single value.

![](images/Pasted%20image%2020240501191013.png)

**Two types of scan:**

- **Inclusive scan**: includes the current element in the partial reduction.
- **Exclusive scan**: excludes the current element; the partial reduction covers all elements before the current one.

The choice between inclusive and exclusive scan affects how the parallelization strategy is implemented, because of the initial value setting and how the partial sums propagate through the dataset.

```cpp
// CPU version of the scan kernel, EXCLUSIVE
void scan_cpu(const float* input, float* output, const int length) {
    output[0] = 0;
    for (int i = 1; i < length; ++i) {
        output[i] = output[i - 1] + input[i - 1];
    }
}

// Compare with the reduction operation:
float reduce_cpu(const float* data, const int length) {
    float sum = 0;
    for (int i = 0; i < length; i++) {
        sum += data[i];
    }
    return sum;
}
```

The risk is parallelizing the sequential version directly: the longest path becomes the bottleneck, since each thread has to look at all previous elements, leading to $O(n^2)$ total work, redundant computation, and poor utilization of the GPU:

```cpp
__global__ void naive_scan_gpu(const float* __restrict__ input,
                               float* __restrict__ output, const int length) {
    const unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= length) return;  // guard against out-of-bounds threads

    // Each thread re-reads every element before its own
    float sum{0};
    for (unsigned int i = 0; i < index; i++) {
        sum += input[i];
    }
    output[index] = sum;
}
```
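To make the bottleneck concrete: thread $i$ performs $i$ additions, so over $n$ elements the kernel executes

$$\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2} = O(n^2)$$

additions in total, compared to the $n - 1$ additions of the sequential CPU version.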
The Kogge-Stone algorithm sums elements in parallel efficiently: each thread sums a range of elements and then combines these sums with the results of the other threads. *Basically, it increases parallelization with a tradeoff of more synchronization.*

![](images/Pasted%20image%2020240508212724.png)

```cpp
// GPU version of the scan kernel, EXCLUSIVE, Kogge-Stone
__global__ void kogge_stone_scan_gpu(const float* __restrict__ input,
                                     float* __restrict__ output, const int length) {
    __shared__ float input_s[BLOCK_DIM];
    const unsigned int tid = threadIdx.x;

    // Shift the input back by one element; thread 0 loads the identity value
    input_s[tid] = (tid > 0) ? input[tid - 1] : 0.0f;

    for (unsigned int stride = 1; stride < length; stride *= STRIDE_FACTOR) {
        __syncthreads();
        float temp;
        if (tid >= stride)
            temp = input_s[tid] + input_s[tid - stride];
        __syncthreads();
        if (tid >= stride)
            input_s[tid] = temp;
    }
    output[tid] = input_s[tid];
}
```

- **Initial state**: each `input_s[i]` starts with the input element at position `i - 1` (0 for `i = 0`), which is what makes the scan exclusive.
- **Iterations**: after `k` iterations, position `i` holds the sum of up to `2^k` input elements.
- **Thread ID adjustment**: `input_s[tid] = (tid > 0) ? input[tid - 1] : 0.0f;` initializes each thread with the input shifted back by one index; the ternary is critical for the edge case `tid = 0`, which must load the identity value instead of reading out of bounds.
- **Loop for summation**: the loop grows the `stride` geometrically, so fewer threads are active at each step (only those with `tid >= stride`), as is typical in parallel reduction algorithms when the effective problem size shrinks.
- **Synchronization**: twice per loop iteration to avoid hazards:
    - before computing the temporary sum, to ensure all threads see the values written in the previous iteration (`__syncthreads()`);
    - after computing the temporary sum, to ensure no writes occur until all reads for the current stride are complete (`__syncthreads()`).

Note that in this version all threads must wait at the synchronization points to guarantee data integrity, which introduces stalls where threads sit idle waiting for the others to arrive.

Another version is the improved Kogge-Stone algorithm using a double buffering technique: the shared memory for each block is doubled so that reads and writes alternate between two buffers across iterations. This technique, known as "ping-pong", swaps the roles of the input and output buffers, reducing the synchronization barriers required by the single-buffer implementation. (Note that, as written, this kernel loads `input[tid]` without the shift, so it computes an *inclusive* scan.)

```cpp
// Double-buffering (ping-pong) approach, INCLUSIVE scan
__global__ void kogge_stone_scan_double_buffer_gpu(const float* __restrict__ input,
                                                   float* __restrict__ output,
                                                   const int length) {
    __shared__ float input_s[BLOCK_DIM * 2];
    const unsigned int tid = threadIdx.x;
    unsigned int pout = 0, pin = 1;

    input_s[tid] = input[tid];
    input_s[length + tid] = input_s[tid];
    __syncthreads();

    for (unsigned int stride = 1; stride < length; stride *= STRIDE_FACTOR) {
        // Swap the double-buffer indices
        pout = 1 - pout;
        pin = 1 - pin;
        if (tid >= stride)  // perform the reduction step
            input_s[pout * length + tid] =
                input_s[pin * length + tid] + input_s[pin * length + tid - stride];
        else  // otherwise simply copy the previous value
            input_s[pout * length + tid] = input_s[pin * length + tid];
        __syncthreads();
    }
    output[tid] = input_s[pout * length + tid];
}
```

- Double buffering is a general way to overlap computation on one buffer with updates to the other.
- In one iteration, a thread reads from one buffer (input) and writes the result to the other buffer (output).
- In the next iteration, the roles of the buffers are switched: the previous output buffer becomes the new input buffer, and vice versa.
- This swapping allows each iteration to consume the previous iteration's data without an extra barrier between the reads and the writes, thereby minimizing synchronization points.
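As a concrete (hypothetical) trace with `length = 4`, `STRIDE_FACTOR = 2`, and input `[3, 1, 7, 0]`, the buffers evolve as follows; each step reads only the buffer that the previous step wrote:

| `stride` | read buffer (`pin`) | write buffer (`pout`) after the step |
| --- | --- | --- |
| 1 | `[3, 1, 7, 0]` | `[3, 4, 8, 7]` |
| 2 | `[3, 4, 8, 7]` | `[3, 4, 11, 11]` |

The final write buffer holds the inclusive scan of the input.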
The double-buffering approach minimizes the need for `__syncthreads()` calls: one barrier per iteration instead of two, because writes go to a different buffer than the one being read, so threads can no longer overwrite values that other threads still need. This results in fewer stalls and more continuous computation.
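As a minimal usage sketch (hypothetical host code, not from the original notes: it assumes `BLOCK_DIM` is a compile-time constant such as `#define BLOCK_DIM 256`, that the kernels above live in the same translation unit, and that the scan fits in a single block), the exclusive Kogge-Stone kernel can be checked against `scan_cpu`:

```cpp
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int length = BLOCK_DIM;  // single-block scan, as in the kernels above
    std::vector<float> h_input(length, 1.0f);  // exclusive scan of ones -> 0, 1, 2, ...
    std::vector<float> h_output(length), reference(length);

    // CPU reference result
    scan_cpu(h_input.data(), reference.data(), length);

    float *d_input, *d_output;
    cudaMalloc(&d_input, length * sizeof(float));
    cudaMalloc(&d_output, length * sizeof(float));
    cudaMemcpy(d_input, h_input.data(), length * sizeof(float), cudaMemcpyHostToDevice);

    kogge_stone_scan_gpu<<<1, BLOCK_DIM>>>(d_input, d_output, length);
    cudaMemcpy(h_output.data(), d_output, length * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare the GPU scan with the CPU reference
    for (int i = 0; i < length; i++)
        if (std::fabs(h_output[i] - reference[i]) > 1e-5f)
            std::printf("mismatch at %d: %f vs %f\n", i, h_output[i], reference[i]);

    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}
```

The double-buffered kernel can be dropped into the same harness, provided the reference is switched to an inclusive CPU scan.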