# Tools: CUDA Compiler, Profiler, and Debugger
Nvidia's documentation is comprehensive and well-maintained:
- Use the [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) extensively for detailed information.
- For specific CUDA API references, consult the [Runtime API Documentation](https://docs.nvidia.com/cuda/cuda-runtime-api/index.html).
**CUDA API**: CUDA exposes both a Runtime API and a Driver API.
- We will focus on the **CUDA Runtime API**:
- The Runtime API simplifies device code management by providing implicit initialization, context management, and module management, leading to simpler code but less control than the Driver API.
- The **Driver API** allows for more fine-grained control, particularly over contexts and module loading, at the cost of more verbose kernel launches. It is closer to the bare metal.
The important difference is that with the Driver API you must build all the abstractions you need yourself.
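As a rough sketch of the difference (the kernel, PTX file name, and launch configuration below are assumptions, not from the original notes), launching the same kernel looks like this with each API:

```cuda
#include <cuda.h>            // Driver API (cu* functions)
#include <cuda_runtime.h>    // Runtime API (cuda* functions)

// extern "C" avoids C++ name mangling so the driver-API lookup by name works.
extern "C" __global__ void vector_sum(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Runtime API: initialization, context and module handling are implicit.
void launch_with_runtime_api(const float *a, const float *b, float *c, int n) {
    vector_sum<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
}

// Driver API: every step is explicit ("bare metal"); the kernel is loaded
// from a separately compiled PTX module (file name assumed here).
void launch_with_driver_api(CUdeviceptr a, CUdeviceptr b, CUdeviceptr c, int n) {
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);                                   // explicit initialization
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);                   // explicit context management
    cuModuleLoad(&mod, "vector_sum.ptx");        // explicit module management
    cuModuleGetFunction(&fn, mod, "vector_sum");

    void *args[] = { &a, &b, &c, &n };
    cuLaunchKernel(fn,
                   (n + 255) / 256, 1, 1,        // grid dimensions
                   256, 1, 1,                    // block dimensions
                   0, nullptr, args, nullptr);   // shared memory, stream, arguments
    cuCtxSynchronize();
}
```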
## Best Practices when Programming with CUDA
- Use the `__restrict__` qualifier to promise the compiler that pointers do not alias (each points to distinct memory).
- If something is constant, use `const`.
- Prefer compile-time computations: utilize `macros` and `constexpr` for efficiency.
- Use a `CHECK()`-style macro to verify the execution status of CUDA API calls, but enable the checks only in debug mode for better performance (see the sketch after this list).
- Use `inline` cautiously. It is a C and C++ keyword that suggests the compiler replace a function call with the function's body at the call site, potentially improving performance by removing call overhead:
  - It is typically used for small, frequently called functions, where the call overhead is significant compared to the function's execution time. In CUDA programming, inlining device functions that are called frequently within kernels can help reduce register usage and improve performance.
  - Compilers treat `inline` as a hint rather than a guarantee.
  - To force inlining, use `__forceinline__` on device functions (or `__attribute__((always_inline))` with GCC/Clang).
- Be mindful of `#pragma unroll`:
- It can boost performance but also increase register usage.
- Pay attention to synchronization functions:
- Use `__syncthreads` or `__syncwarp` where necessary.
- Fences (`__threadfence` and its variants) might also be options.
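To make several of these points concrete, here is a small illustrative sketch (not from the original notes) combining a debug-only `CHECK()` macro, `const __restrict__` pointers, a `constexpr` tile size, a force-inlined device helper, and `#pragma unroll`. All names are placeholders.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Debug-only error checking: compile with -DDEBUG to enable it.
#ifdef DEBUG
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)
#else
#define CHECK(call) (call)
#endif

constexpr int TILE = 4;  // compile-time constant instead of a runtime parameter

// Small, frequently called helper: force inlining into the calling kernel.
__device__ __forceinline__ float axpy(float a, float x, float y) {
    return a * x + y;
}

// __restrict__ promises the compiler that in and out never alias.
__global__ void scale_add(const float* __restrict__ in,
                          float* __restrict__ out,
                          float alpha, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
#pragma unroll  // may boost performance, but watch the register usage
    for (int k = 0; k < TILE; ++k) {
        int i = base + k;
        if (i < n) out[i] = axpy(alpha, in[i], out[i]);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    CHECK(cudaMalloc(&in,  n * sizeof(float)));
    CHECK(cudaMalloc(&out, n * sizeof(float)));
    scale_add<<<(n / TILE + 255) / 256, 256>>>(in, out, 2.0f, n);
    CHECK(cudaGetLastError());       // did the launch itself fail?
    CHECK(cudaDeviceSynchronize());  // did execution fail?
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```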
## Compiling
CUDA programs are compiled using the `nvcc` compiler.
To compile: `nvcc vector_sum.cu -o vector_sum`
To execute: `./vector_sum`
Compile for virtual architectures (`-arch=compute_50`, which produces PTX) or specific real architectures (`-arch=sm_50`, which produces cubin) using `nvcc`.
Compilation Types:
- **Just-in-Time (JIT):** PTX is embedded in the executable and compiled to cubin by the driver at run time; this adds first-launch overhead but allows the code to run on newer architectures.
- **Ahead-of-Time (AOT):** the cubin is embedded directly in the executable, avoiding runtime compilation but targeting specific architectures.
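For example, a single `nvcc` invocation can embed both: `nvcc -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 vector_sum.cu -o vector_sum` produces an AOT cubin for `sm_50` and also keeps the `compute_50` PTX so the driver can JIT-compile it for newer GPUs.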
## Debugging
- **Error Handling**: Every CUDA API call returns an error code (`cudaError_t`) that should be checked to ensure correct execution.
- **To measure kernel performance**: CPU timers or GPU timers (CUDA events) can be used to record a kernel's execution time, including launch overhead and the time spent executing on the GPU (a timing sketch follows this list).
- **Profiling and Tuning**: CUDA programs often require profiling and tuning to identify and address bottlenecks. This process involves iterative adjustments to grid and block sizes, memory access patterns, and other critical parameters, guided by profiling tools and best practices.
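As a minimal illustration (not part of the original notes), the following sketch times a placeholder kernel with CUDA events; the kernel name and problem size are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *x, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                     // enqueue "start" marker
    my_kernel<<<(n + 255) / 256, 256>>>(x, n);  // includes launch overhead
    cudaEventRecord(stop);                      // enqueue "stop" marker
    cudaEventSynchronize(stop);                 // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```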
**CUDA-GDB:** Interactive debugger for CUDA code.
- Requires `-g` and `-G` flags during compilation for debugging kernel code.
- Supports breakpoints, switching execution context, and examining variables within kernels.
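For example, a debug build can be produced with `nvcc -g -G vector_sum.cu -o vector_sum` and then inspected with `cuda-gdb ./vector_sum`, setting breakpoints inside kernels much as with host code.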
## Profiling and Metrics
The old way is:
- NVPROF command-line tool for profiling data.
- NVIDIA Visual Profiler (NVVP) for visualization and optimization.
Support for these tools ended with the Volta architecture. The current profiling tools are the **Nsight** suite:
- **Nsight Systems** for system-wide profiling.
- **Nsight Compute** for interactive kernel profiling.
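For example, `nsys profile -o report ./vector_sum` records a system-wide timeline, while `ncu ./vector_sum` (optionally with `--set full`) collects detailed per-kernel metrics that can be opened in the Nsight Compute GUI.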

## Roofline
The Roofline Performance Model is a visual tool for analysing performance bottlenecks and potential optimizations:
- **Computational Rooflines:** Represent limits based on operation type (floating-point, integer, fused multiply-add).
- **Memory Rooflines:** Indicate bandwidth limitations for different memory hierarchies (DRAM, L1/L2 caches, shared memory).

The roofline model shows performance boundaries by illustrating the trade-off between computational throughput and memory bandwidth.
It is based on the concept of Arithmetic (or operational) intensity:
$\text{Arithmetic Intensity} = \frac{\text{Floating-Point Operations}}{\text{Bytes Transferred}} \quad [\text{FLOP}/\text{byte}]$
The roofline model bounds attainable performance by both peak computational throughput and peak memory bandwidth:
$\text{Attainable Performance} = \min\left(\text{Peak FLOP/s},\ \text{Arithmetic Intensity} \times \text{Peak Bandwidth}\right)$
This graphical approach allows developers to pinpoint whether a program is compute-bound or memory-bound.
In practice, achieving optimal performance involves maximizing the operational intensity, thereby pushing the application towards the upper performance limits of the available hardware.
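As an illustrative back-of-the-envelope example (the hardware figures below are hypothetical, not from the original notes): single-precision vector addition performs one floating-point operation per element while moving three 4-byte values (two loads and one store), so

$\text{AI} = \frac{1 \ \text{FLOP}}{12 \ \text{bytes}} \approx 0.083 \ \text{FLOP/byte}$

On a GPU with a peak of 10 TFLOP/s (FP32) and 900 GB/s of DRAM bandwidth, the ridge point lies at about $10{,}000 / 900 \approx 11$ FLOP/byte; vector addition sits far to the left of it, so it is memory-bound and its attainable performance is roughly $0.083 \times 900 \ \text{GB/s} \approx 75$ GFLOP/s.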
With Nsight Compute it is possible to generate a roofline analysis for any given kernel. The aim is to move data as close to the compute units as possible, making better use of the faster cache levels and reducing the latency and bandwidth constraints imposed by slower global memory accesses.
Visually, the model shows **multiple rooflines** corresponding to different levels of memory hierarchy. This overlay provides a clear depiction of the achievable performance based on the arithmetic intensity of the application and the specific memory hierarchy being utilized.