# Graph Traversal

Before diving into graph traversal, let's talk about ways to efficiently store sparse matrices. Sparse matrix computation is crucial in scientific, engineering, and financial modelling problems, where the majority of matrix elements are zeros.

### Sparse Matrix Formats

Storing and processing these zeros wastes memory and bandwidth. To address this, various sparse matrix formats have been developed to compactly represent the non-zero elements, influencing GPU performance. These formats include:

- CSR (Compressed Sparse Row)
- COO (Coordinate Format)
- ELL (ELLPACK Format)

With these parameters:

- **M**: Number of rows in the matrix
- **N**: Number of columns in the matrix
- **K**: Number of non-zero entries in the densest row
- **S**: Sparsity level `[0-1]`, `1` being fully dense

| Format                      | Storage Requirements |
| --------------------------- | -------------------- |
| Dense                       | $MN$                 |
| Compressed Sparse Row (CSR) | $2MNS + M + 1$       |
| ELL                         | $2MK$                |
| Coordinate (COO)            | $3MNS$               |
| Hybrid ELL / COO (HYB)      | $2MK \ll 3MNS$       |

### CSR

The CSR format stores only the non-zero values in a single vector; additional storage is required to locate the start of each row and the column of each element in the values vector.

![](images/Pasted%20image%2020240518183331.png)

The space requirements are $2MNS + M + 1$, and CSR is space-saving if $S < \left(1 - \frac{1}{N}\right) / 2$.

Assigning a thread to each row for parallel SpMV (Sparse Matrix-Vector multiplication) ensures each thread handles a distinct output value, but consecutive threads access distant memory locations, resulting in inefficient memory bandwidth utilization and potential control flow divergence.
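A minimal sketch of this one-thread-per-row approach, assuming the CSR arrays `rowPointers`, `columnIndices`, and `values` are already on the device (the names are illustrative, not from the original notes):

```cpp
__global__ void SpMV_CSR_gpu(const int* __restrict__ rowPointers,
                             const int* __restrict__ columnIndices,
                             const float* __restrict__ values,
                             const int num_rows,
                             const float* x, float* y) {
  const int row = threadIdx.x + blockIdx.x * blockDim.x;
  if (row < num_rows) {
    float sum = 0.0f;
    // Each thread walks its own row: neighbouring threads read values and
    // columnIndices at distant offsets (uncoalesced), and rows of different
    // lengths cause control flow divergence.
    for (int i = rowPointers[row]; i < rowPointers[row + 1]; ++i) {
      sum += values[i] * x[columnIndices[i]];
    }
    y[row] = sum;
  }
}
```

Because each thread owns a distinct `y[row]`, no atomic operations are needed, which is the main contrast with the COO kernel shown later.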
### ELL format

Originating from ELLPACK, the ELL format relies on the maximum number of non-zero elements per row and **padding** elements. It is more flexible than CSR, since non-zero elements can be added by replacing padding elements.

![](images/Pasted%20image%2020240518184106.png)

ELL uses memory bandwidth efficiently thanks to coalesced memory access, but can lead to load imbalance. The space requirements for ELL are $2MK$, and it saves space if $K < \frac{N}{2}$.

### COO

The COO format stores both column and row indices for every non-zero element, leading to higher space requirements but offering flexibility in adding non-zero elements. Parallel SpMV in COO assigns each thread to compute a non-zero element, balancing the workload but requiring atomic operations for synchronization. The space requirements for COO are $3MNS$, and it saves space if $S < \frac{1}{3}$.

```cpp
struct SparseMatrixCOO {
  int* row_indices;
  int* column_indices;
  float* values;
  int num_elements;
};
```

![](images/Pasted%20image%2020240518184710.png)

```cpp
void SpMV_COO(const SparseMatrixCOO* A, const float* x, float* y) {
  for (int element = 0; element < A->num_elements; ++element) {
    const int column = A->column_indices[element];
    const int row = A->row_indices[element];
    y[row] += A->values[element] * x[column];
  }
}
```

And in CUDA:

```cpp
__global__ void SpMV_COO_gpu(const int* __restrict__ row_indices,
                             const int* __restrict__ column_indices,
                             const float* __restrict__ values,
                             const int num_elements,
                             const float* x, float* y) {
  for (int element = threadIdx.x + blockIdx.x * blockDim.x;
       element < num_elements;
       element += blockDim.x * gridDim.x) {
    const int column = column_indices[element];
    const int row = row_indices[element];
    atomicAdd(&y[row], values[element] * x[column]);
  }
}
```

### Hybrid ELL/COO

A hybrid ELL/COO format addresses ELL's inefficiency in handling rows with many non-zeros by placing these non-zeros in a COO format, leading to a more efficient ELL representation for the remaining elements.

![](images/Pasted%20image%2020240518185804.png)

This approach enhances space efficiency and flexibility, making it beneficial for iterative solvers despite the overhead of using two different formats. The code essentially combines the two previous kernels, as sketched below.
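A minimal sketch of such a hybrid kernel, assuming the ELL part is stored column-major in `ell_columns`/`ell_values` (padded entries marked with a column index of -1) and the overflow elements live in COO arrays; all names and the storage conventions are assumptions for illustration:

```cpp
// ELL part: M x K, stored column-major so consecutive threads make coalesced
// accesses; padded entries carry a column index of -1.
// COO part: overflow non-zeros of the rows denser than K.
__global__ void SpMV_HYB_gpu(const int* __restrict__ ell_columns,
                             const float* __restrict__ ell_values,
                             const int num_rows, const int K,
                             const int* __restrict__ coo_rows,
                             const int* __restrict__ coo_columns,
                             const float* __restrict__ coo_values,
                             const int coo_num_elements,
                             const float* x, float* y) {
  const int t = threadIdx.x + blockIdx.x * blockDim.x;

  // ELL part: one thread per row.
  if (t < num_rows) {
    float sum = 0.0f;
    for (int i = 0; i < K; ++i) {
      const int column = ell_columns[i * num_rows + t];
      if (column >= 0) {  // skip padding
        sum += ell_values[i * num_rows + t] * x[column];
      }
    }
    // Atomic because COO threads below may update the same rows concurrently.
    atomicAdd(&y[t], sum);
  }

  // COO part: one thread per overflow element, grid-stride loop as before.
  for (int element = t; element < coo_num_elements;
       element += blockDim.x * gridDim.x) {
    atomicAdd(&y[coo_rows[element]],
              coo_values[element] * x[coo_columns[element]]);
  }
}
```

In practice the two halves are often launched as separate kernels, so the ELL half can write `y[t]` without atomics.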
### Extra

Other formats include:

- **Jagged Diagonal Storage (JDS):**
    - Groups similarly dense rows into partitions.
    - Represents partitions independently using either CSR or ELL.
    - Sorts rows by density, avoiding padding and improving space efficiency.
    - Allows coalesced memory access but lacks flexibility in adding new elements.
- **Diagonal (DIA):**
    - Stores only a sparse set of dense diagonal vectors.
    - For each diagonal, the offset from the main diagonal is stored.
- **Packet (PKT):**
    - Reorders rows and columns to concentrate non-zeros into roughly diagonal submatrices.
    - Improves cache performance, as nearby rows access nearby `x` elements.
- **Dictionary of Keys (DOK):**
    - Stores the matrix as a map from (row, column) index pairs to values.
    - Useful for building or querying a sparse matrix, but iteration is slow.
- **Compressed Sparse Column (CSC):**
    - Similar to CSR but stores a dense set of sparse column vectors.
    - Useful when column sparsity is much more regular than row sparsity.
- **Blocked CSR:**
    - Divides the matrix into blocks stored using CSR with the indices of the upper-left corner.
    - Useful for block-sparse matrices.

In general, the FLOPS ratings achieved by both CPUs and GPUs are much lower for sparse matrix computation than for dense matrix computation. For example, in SpMV (Sparse Matrix-Vector multiplication) there is no data reuse in the sparse matrix, which results in a low operational intensity. The operational intensity (OP/B) is essentially 0.25, significantly limiting the achievable FLOPS rate to a small fraction of peak performance compared to dense matrix computations.

## Graph traversal

Graphs are intrinsically related to sparse matrices, with an intuitive representation being an adjacency matrix. Consequently, graph computation can be formulated in terms of sparse matrix operations. In the BFS kernels below, the adjacency structure is stored in CSR form: `rowPointers` indexes into `destinations`, which lists the neighbors of each vertex.

GPU solutions for graph traversal are:

- **Iterative Vertex Assignment**
    - Method: Iterate over all vertices, assigning each thread to a vertex.
    - For each iteration, check all incoming edges to see if the source vertex was visited in the last iteration.
    - If visited, mark the vertex as visited in the current iteration.
    - Efficiency: Not very work efficient.
    - Complexity: $O(VL)$, where $V$ is the number of vertices and $L$ is the length of the longest path.
    - Challenge: Difficult to detect a stopping criterion.
- **Frontier-Based Thread Assignment**
    - Method: Assign threads to frontier vertices from the previous iteration.
    - Add all non-visited neighbors to the next frontier.
    - The source will be the first element in the frontier.
    - Parallel execution: Threads execute in parallel.
    - Issue: A variable number of unnecessary threads are launched.
    - Explanation: The frontier size can vary greatly between iterations, leading to **imbalanced workloads** among threads. Efficient load balancing in such dynamic scenarios is challenging, resulting in the launch of more threads than necessary to ensure all possible work is covered.

Regarding the frontier output interface:

![](images/Pasted%20image%2020240527183133.png)

- **Without synchronization**:
    - Both threads may attempt to visit the same vertex (C), leading to redundancy.
    - This results in two entries for vertex C in the current queue.
- **With `atomicExch`**:
    - `atomicExch` ensures that only one thread successfully marks vertex C as visited.
    - This prevents redundant entries, leading to a single entry for vertex C in the current queue.

With synchronization, both the **visitation check** (check whether a vertex has been visited and mark it as visited atomically) and the **queue indexing** (each thread gets a unique index in the `currentFrontier` array) are managed:

```cpp
if (t < *previousFrontierSize) {
  const int vertex = previousFrontier[t];
  for (int i = rowPointers[vertex]; i < rowPointers[vertex + 1]; ++i) {
    // Check visitation atomically, avoiding redundant expansion
    const int alreadyVisited = atomicExch(&visited[destinations[i]], 1);
    if (!alreadyVisited) {
      // We're visiting a new vertex: get a spot in line atomically
      const int queueIndex = atomicAdd(currentFrontierSize, 1);
      distances[destinations[i]] = distances[vertex] + 1;
      // Place the vertex in line
      currentFrontier[queueIndex] = destinations[i];
    }
  }
}
```
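For context, this kernel body is typically driven by a host loop that launches one kernel per BFS level and stops when the new frontier comes back empty. A minimal sketch, assuming the snippet above has been wrapped in a kernel `BFS_frontier_kernel(previousFrontier, previousFrontierSize, currentFrontier, currentFrontierSize, rowPointers, destinations, distances, visited)`; the wrapper name and the driver itself are illustrative:

```cpp
// Host-side BFS driver (sketch): one kernel launch per level, stopping when
// the new frontier is empty. d_frontier[0/1] and d_frontierSize[0/1] are
// pre-allocated device buffers; d_frontier[0] already holds {source},
// *d_frontierSize[0] == 1, and d_visited/d_distances are initialized.
void BFS_host(int *d_frontier[2], int *d_frontierSize[2],
              const int *d_rowPointers, const int *d_destinations,
              int *d_distances, int *d_visited) {
  int level = 0;
  int hostFrontierSize = 1;
  while (hostFrontierSize > 0) {
    const int in = level % 2, out = 1 - in;
    cudaMemset(d_frontierSize[out], 0, sizeof(int));  // empty the output queue
    const int blocks = (hostFrontierSize + 255) / 256;
    BFS_frontier_kernel<<<blocks, 256>>>(d_frontier[in], d_frontierSize[in],
                                         d_frontier[out], d_frontierSize[out],
                                         d_rowPointers, d_destinations,
                                         d_distances, d_visited);
    // Read the new frontier size back: this is also the stopping criterion.
    cudaMemcpy(&hostFrontierSize, d_frontierSize[out], sizeof(int),
               cudaMemcpyDeviceToHost);
    ++level;
  }
}
```

The two frontier buffers are simply swapped each level (ping-pong buffering), so no device-to-device copy of the queue itself is needed.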
### Privatization

With privatization, each block maintains its own local queue, significantly reducing contention and allowing more efficient use of atomic operations.

![](images/Pasted%20image%2020240527184742.png)

```cpp
__global__ void BFS_Bqueue_kernel(const int *previousFrontier,
                                  const int *previousFrontierSize,
                                  int *currentFrontier, int *currentFrontierSize,
                                  const int *rowPointers, const int *destinations,
                                  int *distances, int *visited) {
  __shared__ int sharedCurrentFrontier[BLOCK_QUEUE_SIZE];
  __shared__ int sharedCurrentFrontierSize, blockGlobalQueueIndex;

  if (threadIdx.x == 0) sharedCurrentFrontierSize = 0;
  __syncthreads();

  const int t = threadIdx.x + blockDim.x * blockIdx.x;
  if (t < *previousFrontierSize) {
    const int vertex = previousFrontier[t];
    for (int i = rowPointers[vertex]; i < rowPointers[vertex + 1]; ++i) {
      const int alreadyVisited = atomicExch(&(visited[destinations[i]]), 1);
      if (!alreadyVisited) {
        distances[destinations[i]] = distances[vertex] + 1;
        const int sharedQueueIndex = atomicAdd(&sharedCurrentFrontierSize, 1);
        if (sharedQueueIndex < BLOCK_QUEUE_SIZE) {
          // there is space in the local queue
          sharedCurrentFrontier[sharedQueueIndex] = destinations[i];
        } else {
          // go directly to the global queue
          sharedCurrentFrontierSize = BLOCK_QUEUE_SIZE;
          const int globalQueueIndex = atomicAdd(currentFrontierSize, 1);
          currentFrontier[globalQueueIndex] = destinations[i];
        }
      }
    }
  }
  __syncthreads();

  // Thread 0 reserves a contiguous slice of the global queue for this block
  if (threadIdx.x == 0)
    blockGlobalQueueIndex = atomicAdd(currentFrontierSize, sharedCurrentFrontierSize);
  __syncthreads();

  // Cooperatively copy the block-local queue into the reserved slice
  for (int i = threadIdx.x; i < sharedCurrentFrontierSize; i += blockDim.x) {
    currentFrontier[blockGlobalQueueIndex + i] = sharedCurrentFrontier[i];
  }
}
```

### Texture Memory

Texture memory, which is useful for real-time texture interpolation in graphical applications, is also a good fit for this use case because it allows for **efficient unpredictable access** to the graph data.

![](images/Pasted%20image%2020240808163245.png)

The memory access pattern in this scenario is **irregular**.

![](images/Pasted%20image%2020240527184905.png)

```cpp
// allocate a CUDA array and copy the row pointers into it
cudaTextureObject_t rowPointersTexture;
cudaArray *texArray = 0;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<int>();
CHECK(cudaMallocArray(&texArray, &channelDesc, num_vals));
CHECK(cudaMemcpy2DToArray(texArray, 0, 0, rowPointers,
                          sizeof(int) * num_vals, sizeof(int) * num_vals, 1,
                          cudaMemcpyHostToDevice));
```
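The snippet above allocates and fills the `cudaArray` but does not show binding it to `rowPointersTexture`. A minimal sketch of that step, using the standard CUDA runtime calls (the exact descriptor settings are an assumption based on how the kernel below reads the texture):

```cpp
// describe the backing cudaArray as the texture's data source
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = texArray;

// read elements back as plain ints, no filtering or coordinate normalization
cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;
texDesc.filterMode = cudaFilterModePoint;
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.normalizedCoords = 0;

CHECK(cudaCreateTextureObject(&rowPointersTexture, &resDesc, &texDesc, nullptr));
```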
And then:

```cpp
const int t = threadIdx.x + blockDim.x * blockIdx.x;
if (t < *previousFrontierSize) {
  const int vertex = previousFrontier[t];
  for (int i = tex1D<int>(rowPointersTexture, vertex);
       i < tex1D<int>(rowPointersTexture, vertex + 1); ++i) {
    // check visitation atomically, avoiding redundant expansion
    const int alreadyVisited = atomicExch(&(visited[destinations[i]]), 1);
    if (!alreadyVisited) {
      // we're visiting a new vertex: get a spot in line atomically
      const int queueIndex = atomicAdd(currentFrontierSize, 1);
      distances[destinations[i]] = distances[vertex] + 1;
      // place the vertex in line
      currentFrontier[queueIndex] = destinations[i];
    }
  }
}
```

- `tex1D<int>(rowPointersTexture, vertex)`: accesses the row pointer for the given vertex through texture memory, allowing efficient, unpredictable access to the graph data stored in `rowPointersTexture`.
- `tex1D<int>(rowPointersTexture, vertex + 1)`: accesses the next row pointer to determine the range of outgoing edges for the vertex.

### Hybrid GPU-CPU Computation and Memory Management

At the beginning of BFS, the frontier (the set of vertices to be explored) is often quite small. In such cases, the **overhead of launching GPU kernels** can outweigh the benefits of parallel computation. As the frontier grows, the parallelism offered by the GPU becomes more advantageous. Therefore, switching between CPU and GPU based on frontier size can optimize performance.

The process involves transferring data between the host (CPU) and the device (GPU). This "ping-pong" effect requires careful management to minimize overhead. Overall:

1. Start on the CPU.
2. When the frontier size **starts to grow** and remains **stable** (variance in the size must not cause ping-pong), transfer the data to the GPU.
3. When the frontier **starts to shrink**, return to the CPU.
4. Finish on the CPU.
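A minimal sketch of this switching logic, with an illustrative threshold and hypothetical helpers `bfs_level_cpu`, `bfs_level_gpu`, `copy_state_to_device`, and `copy_state_to_host` (none of these names come from the original code):

```cpp
#include <vector>

// Illustrative threshold: below it, a kernel launch is not worth the overhead.
constexpr size_t GPU_THRESHOLD = 1024;

// Hypothetical helpers: each expands one BFS level and returns the next
// frontier, on the CPU or on the GPU respectively.
std::vector<int> bfs_level_cpu(const std::vector<int> &frontier);
std::vector<int> bfs_level_gpu(const std::vector<int> &frontier);
void copy_state_to_device();  // distances/visited/frontier: host -> device
void copy_state_to_host();    // distances/visited/frontier: device -> host

void BFS_hybrid(int source) {
  std::vector<int> frontier = {source};
  bool onGpu = false;

  while (!frontier.empty()) {
    // Hysteresis (two different thresholds) keeps small fluctuations in the
    // frontier size from triggering the ping-pong effect described above.
    if (!onGpu && frontier.size() >= GPU_THRESHOLD) {
      copy_state_to_device();
      onGpu = true;
    } else if (onGpu && frontier.size() < GPU_THRESHOLD / 2) {
      copy_state_to_host();
      onGpu = false;
    }
    frontier = onGpu ? bfs_level_gpu(frontier) : bfs_level_cpu(frontier);
  }
}
```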