# Brief overview of OpenACC and OpenCL

## OpenACC

OpenACC is a directive-based language used to accelerate code on various architectures, not just GPUs. It simplifies the process compared to CUDA, especially in terms of restructuring code and managing data, through concepts like unified memory. The goal is to make code acceleration more efficient and **less time-consuming**:

- **compiler-based** acceleration: annotations guide the compiler in restructuring the code for acceleration.
- **high portability**: the compiler can target different architectures, like GPUs or multicore CPUs, just by recompiling. In some cases the device and the host may be the same physical device.

This allows quick acceleration even for non-skilled programmers, reducing time to market. While OpenACC provides high programmability through simple annotations of existing code, it may offer lower performance than CUDA because of its support for multiple devices and its higher level of abstraction.

In OpenACC, the parallelism model involves:

- **Gangs**: corresponding to CUDA blocks, fully independent execution units.
- **Workers**: each elaborates a vector of work and can synchronize with the other workers of its gang.
- **Vectors**: execute the same instruction on multiple data, similar to SIMD execution; each vector element can be seen as a CUDA thread.

Note that each worker can be interpreted either as:

- a CUDA thread, executing the same instructions on multiple data, or
- a warp, with each vector element corresponding to a CUDA thread.

Compiling with an OpenACC-capable compiler (for example NVIDIA's nvc compiler) looks like this:

```bash
# Basic compilation for NVIDIA GPU
nvc -fast -ta=tesla -Minfo=accel -o program_name source_file.c
# -fast: Use all possible optimizations
# -ta=tesla: Target NVIDIA Tesla GPU
# -Minfo=accel: Provide information on how the code has been parallelized

# To target a multicore CPU instead of a GPU
nvc -fast -ta=multicore -Minfo=accel -o program_name source_file.c

# To use CUDA managed memory (if supported)
nvc -fast -ta=tesla:managed -Minfo=accel -o program_name source_file.c
```

### OpenACC Directives

OpenACC directives in C/C++ are specified as pragmas, which are ignored by compilers that don't support them.

```c
#pragma acc kernels
for(int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
```

The `kernels` directive lets the compiler decide on parallelization, while the `parallel` directive requires the programmer to specify it for each loop.

```c
#pragma acc parallel
{
    #pragma acc loop
    for(int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```

The OpenACC compiler can improve performance by collapsing nested loops, in combination with the `kernels` and `parallel` directives.

```c
#pragma acc parallel loop collapse(2)
for(int i = 0; i < N; i++)
    for(int j = 0; j < N; j++)
        C[i*N+j] = A[i*N+j] + B[i*N+j];
```

Without `collapse(2)`, parallelization might be limited to the outer loop; with it, the two loops are collapsed into a single one and parallelization can be applied across $N\cdot N$ iterations.

**`seq` directive**: executes a loop sequentially within a parallel region, applying a sort of **coarsening** strategy.

```c
#pragma acc parallel loop
for (int i = 0; i < N; i++) {
    #pragma acc loop seq
    for (int j = 0; j < N; j++) {
        C[i*N + j] = A[i*N + j] + B[i*N + j];
    }
}
```
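The three levels of parallelism described above can also be requested explicitly. Below is a minimal, untuned sketch (the `vector_length(128)` value is purely illustrative) that maps the outer loop onto gangs and the inner loop onto vector lanes:

```c
// Outer loop split across gangs; vector length chosen only for illustration
#pragma acc parallel loop gang vector_length(128)
for (int i = 0; i < N; i++) {
    // Inner loop executed by the vector lanes of each gang
    #pragma acc loop vector
    for (int j = 0; j < N; j++) {
        C[i*N + j] = A[i*N + j] + B[i*N + j];
    }
}
```

In practice, as noted further below, this mapping can usually be left to the compiler.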
**`reduction` directive**: performs parallel reductions with the supported operators `+`, `*`, `max`, `min`, `&`, `|`, `&&`, `||`.

```c
int sum = 0;
#pragma acc parallel loop reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += A[i];
}
```

**`gang(x)` clause**: divides a loop into `x` independent gangs for parallel execution. Each gang can run independently on different processing units. `worker` can then optionally be used to specify how the inner loops are parallelized.

```c
#pragma acc parallel loop gang(16)
for (int i = 0; i < N; i++) {
    #pragma acc loop worker
    for (int j = 0; j < N; j++) {
        C[i*N + j] = A[i*N + j] + B[i*N + j];
    }
}
```

Except for advanced cases, the organization of gangs, workers, and vectors can be left entirely to the compiler.

Proper handling of private variables with clauses like `private` is essential to avoid races on variables that would otherwise be shared by all iterations.

```c
int swap;
#pragma acc parallel loop private(swap)
for(int i = 0; i < N; i++){
    swap = A[i];
    A[i] = B[i];
    B[i] = swap;
}
```

Different levels of parallelism can be specified for each loop using clauses like `gang`. Data clauses like `copyin`, `copyout`, and `create` handle data transfers:

- **`copyin`**: data are copied from the host to the device at the beginning of the parallel region, and device memory is freed at the end of the region.
- **`copyout`**: device memory is allocated at the beginning of the parallel region and copied back to host memory at the end of the region.
- **`copy`**: combines `copyin` and `copyout` behaviors.
- **`create`**: allocates device memory only.

The compiler then decides how and when to move data between the host and the device.

```c
#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N])
{
    #pragma acc kernels
    for(int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```

### Example of Jacobi iterative method

The Jacobi iterative method solves the Laplace equation to compute the heat transfer on a 2D surface.

```c
#pragma acc data create(Anew[0:n][0:m]) copy(A[0:n][0:m])
while (err > tol && iter < iter_max) {
    err = 0.0;

    #pragma acc parallel loop reduction(max:err) collapse(2)
    for(int j = 1; j < n-1; j++) {
        for(int i = 1; i < m-1; i++) {
            // The temperature of each point is computed as a
            // linear combination of its 4 neighbours
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc parallel loop collapse(2)
    for(int j = 1; j < n-1; j++) {
        for(int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}
```

### Parallelization Considerations for Loops

- **While loop**: has data dependencies among iterations and cannot be parallelized.
- **Nested for loops**: iterations are independent and can be parallelized, hence `collapse(2)`.
- **Error computation**: `err` is computed using a reduction clause.
- **Data transfer**:
    - `A`: transferred to the device at the beginning and back to the host at the end: `copy(A[0:n][0:m])`.
    - `Anew`: used only on the device, so `create(Anew[0:n][0:m])`.
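Putting the pieces together, here is a minimal self-contained sketch of an OpenACC vector addition (the file name, `N`, and the initialization values are illustrative) that can be built with the `nvc` command lines shown earlier:

```c
// vec_add.c - minimal OpenACC example (illustrative sketch)
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    float *A = (float *)malloc(N * sizeof(float));
    float *B = (float *)malloc(N * sizeof(float));
    float *C = (float *)malloc(N * sizeof(float));

    for (int i = 0; i < N; i++) {
        A[i] = 1.0f;
        B[i] = 2.0f;
    }

    // A and B are only read on the device, C is only written back
    #pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    printf("C[0] = %f\n", C[0]);

    free(A); free(B); free(C);
    return 0;
}
```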
## OpenCL

OpenCL acts as a runtime that enables the programming of diverse computing resources using a single language. CUDA is designed by NVIDIA and tailored to its **specific** accelerators, while OpenCL is a comprehensive solution which:

- supports different types of accelerators beyond just GPUs, including those found in desktops, HPC systems, server machines, mobile devices, and **even FPGAs**;
- optimizes performance while considering energy consumption, especially in mobile markets where efficiency and trade-offs between compute units like CPUs and GPUs are crucial.

The development of OpenCL was spearheaded by Apple and is now managed by the Khronos Group, a consortium of hardware and software companies.

OpenCL supports the following parallelism models:

- **Single-Instruction-Multiple-Data (SIMD)**
    - The kernel is composed of sequential instructions.
    - The instruction stream is executed in lock-step on all involved processing elements.
    - Generally used on GPUs.
- **Single-Program-Multiple-Data (SPMD)**
    - The kernel contains loops and conditional instructions.
    - Each processing element has its own program counter.
    - Each instance of the kernel has a different execution flow.
    - Generally used on CPUs.
- **Data-Parallel Programming Model**
    - The same sequential instruction stream is executed in lock-step on different data.
    - Generally used on GPUs.
- **Task-Parallel Programming Model**
    - The program issues many kernels in an out-of-order fashion.

The platform model involves multiple compute devices, each comprising compute units and processing elements for parallel processing. OpenCL's program structure mirrors CUDA's, with host code written in C/C++ for sequential tasks and device code in OpenCL C for managing kernels and data-parallel execution.

![](images/Pasted%20image%2020240521213927.png)

OpenCL's memory model is comparable to CUDA's, including private, local, global, and constant memory types.

| OpenCL Term | CUDA Equivalent | Description |
| ----------- | --------------- | --------------------------------------------------------------------------------- |
| NDRange     | CUDA grid       | N-dimensional grid with local and global indexing, spanned by work-groups          |
| Work-group  | CUDA block      | Group of contiguous work-items                                                      |
| Work-item   | CUDA thread     | Single execution of the kernel on a data instance (i.e., the basic unit of work)    |

![](images/Pasted%20image%2020240521214245.png)

The program object encapsulates both source and compiled code, which is compiled at runtime based on the selected device, using just-in-time compilation. Memory objects like buffers and images are employed for data transmission.

![](images/Pasted%20image%2020240521214629.png)

The context object encapsulates various features like command queues, unlike CUDA, where much is managed by the runtime due to its focus on a single accelerator. **Just-in-time** compilation in OpenCL provides flexibility across different devices, compiling device-side code at runtime using LLVM.

### Code and directives

OpenCL C is very similar to CUDA C, which makes it easy for engineers to switch between the two languages. OpenCL kernels are implemented in OpenCL C, an extension of C99 with some restrictions:

- No function pointers
- No recursion
- No variable-length arrays

New data types include:

- Scalar types: `half`, ...
- Image types: `image2d_t`, `image3d_t`, `sampler_t`
- Vector types: `char2`, `ushort4`, `int8`, `float16`, `double2`, ...

Address space qualifiers:

- `__global`, `__local`, `__constant`, `__private`

Built-in functions:

- Work-item functions: `get_global_id`, `get_local_id`, ...
- Math functions: `sin`, `cos`, `exp`, ...
- Common functions: `min`, `max`, `clamp`, ...
- Geometric functions: `cross`, `dot`, `normalize`, ...
- Vector load/store functions: `vload4`, `vstore4`, ...
- Synchronization functions: `barrier`
- Atomic functions: `atomic_add`, `atomic_sub`, ...
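To illustrate the vector types and built-in functions listed above, here is a small hypothetical kernel (the name `dot4` and its logic are illustrative, not taken from the course material) that loads `float4` vectors with `vload4` and combines them with the geometric built-in `dot`:

```c
// Each work-item processes 4 consecutive floats of a and b
// and writes their dot product to out[i].
__kernel void dot4(__global const float* a,
                   __global const float* b,
                   __global float* out)
{
    int i = get_global_id(0);
    float4 va = vload4(i, a);   // loads a[4*i .. 4*i+3]
    float4 vb = vload4(i, b);   // loads b[4*i .. 4*i+3]
    out[i] = dot(va, vb);       // geometric built-in on vector types
}
```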
### Example OpenCL kernel

```c
__kernel void dp_mult(__global const float* a,
                      __global const float* b,
                      __global float* c)
{
    int i = get_global_id(0);
    c[i] = a[i] * b[i];
}
```

### Example with problem size parameter

```c
__kernel void dp_mult(__global float* a,
                      __global float* b,
                      __global float* c,
                      int N)
{
    int i = get_global_id(0);
    if(i < N)
        c[i] = a[i] * b[i];
}
```

### Matrix multiplication example

```c
__kernel void parMultiply1(__global float *A,
                           __global float *B,
                           __global float *C)
{
    // Global indices of the element of C computed by this work-item
    const int m = get_global_id(0);
    const int n = get_global_id(1);
    const int M = get_global_size(0);

    C[m + M * n] = 0;
    for (int k = 0; k < M; k++)
        C[m + M * n] += A[m + M * k] * B[k + M * n];
}
```

### Matrix multiplication with local memory

```c
// TS is the tile size; it must be a compile-time constant,
// e.g. passed as a build option ("-D TS=16") when building the program.
__kernel void parMultiply2(__global float *A,
                           __global float *B,
                           __global float *C,
                           int K)
{
    const int row = get_local_id(0);                // Local row ID (max: TS)
    const int col = get_local_id(1);                // Local col ID (max: TS)
    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)
    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)
    const int M = get_global_size(0);
    const int N = get_global_size(1);

    // Local memory to fit a tile of TS*TS elements of A and B
    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    float acc = 0.0f;
    const int numTiles = K/TS;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and B into local memory
        const int tiledRow = TS*t + row;
        const int tiledCol = TS*t + col;
        Asub[col][row] = A[tiledCol*M + globalRow];
        Bsub[col][row] = B[globalCol*K + tiledRow];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Accumulate the partial product for this tile
        for (int k = 0; k < TS; k++) {
            acc += Asub[k][row] * Bsub[col][k];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[globalCol*M + globalRow] = acc;
}
```

### Workflow of an OpenCL Application

The host code in OpenCL is more elaborate than in CUDA:

1. **Discovering the Platforms and Devices**
2. **Creating a Context**
3. **Creating a Command Queue per Device**
4. **Creating Memory Objects to Hold Data**
5. **Copying the Input Data onto the Device**
6. **Creating and Compiling a Program from the OpenCL C Source Code**
7. **Generating a Kernel of the Program and Specifying Parameters**
8. **Executing the Kernel**
9. **Copying Output Data Back to the Host**
10. **Releasing the OpenCL Resources**

Here's a more detailed breakdown of the workflow:

1. **Discovering the Platforms and Devices**
    - Platforms correspond to vendor-specific libraries.
    - Use functions like `clGetPlatformIDs` and `clGetPlatformInfo` to identify available platforms.
    - Each platform can have multiple devices.
    - Use `clGetDeviceIDs` and `clGetDeviceInfo` to select device IDs for acceleration tasks.

    ```c
    cl_uint numPlatforms;
    cl_platform_id *platformIds;
    clGetPlatformIDs(0, NULL, &numPlatforms);
    platformIds = (cl_platform_id *)malloc(sizeof(cl_platform_id) * numPlatforms);
    clGetPlatformIDs(numPlatforms, platformIds, NULL);

    cl_uint numDevices;
    cl_device_id *deviceIds;
    clGetDeviceIDs(platformIds[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
    deviceIds = (cl_device_id *)malloc(sizeof(cl_device_id) * numDevices);
    clGetDeviceIDs(platformIds[0], CL_DEVICE_TYPE_ALL, numDevices, deviceIds, NULL);
    ```

2. **Creating a Context**
    - A context is a container for associated devices, program objects, kernels, memory objects, and command queues.

    ```c
    cl_context context;
    cl_context_properties properties[] =
        {CL_CONTEXT_PLATFORM, (cl_context_properties)platformIds[0], 0};
    context = clCreateContext(properties, 1, &deviceIds[0], NULL, NULL, NULL);
    ```
3. **Creating a Command Queue per Device**
    - Command queues are used for issuing commands to a device.

    ```c
    cl_command_queue queue;
    queue = clCreateCommandQueue(context, deviceIds[0], 0, NULL);
    ```

4. **Creating Memory Objects to Hold Data**
    - Memory objects are used to transmit data to/from a device.

    ```c
    cl_mem bufA, bufB, bufC;
    bufA = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float)*NUM_DATA, NULL, NULL);
    bufB = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float)*NUM_DATA, NULL, NULL);
    bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float)*NUM_DATA, NULL, NULL);
    ```

5. **Copying the Input Data onto the Device**
    - Transfer data from host memory to device memory.

    ```c
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, sizeof(float)*NUM_DATA, a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, sizeof(float)*NUM_DATA, b, 0, NULL, NULL);
    ```

6. **Creating and Compiling a Program from the OpenCL C Source Code**
    - Load and compile the kernel source code.

    ```c
    const char *programSource = "__kernel void example(...) { ... }";
    cl_program program = clCreateProgramWithSource(context, 1, &programSource, NULL, NULL);
    clBuildProgram(program, 1, &deviceIds[0], NULL, NULL, NULL);
    ```

7. **Generating a Kernel of the Program and Specifying Parameters**
    - Create a kernel from the compiled program and set its arguments.

    ```c
    cl_kernel kernel = clCreateKernel(program, "example", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    ```

8. **Executing the Kernel**
    - Launch the kernel execution on the device (a two-dimensional variant is sketched after this list).

    ```c
    size_t globalWorkSize[1] = { NUM_DATA };
    size_t localWorkSize[1] = { 64 };
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    ```

9. **Copying Output Data Back to the Host**
    - Transfer results from device memory back to host memory.

    ```c
    float results[NUM_DATA];
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(float)*NUM_DATA, results, 0, NULL, NULL);
    ```

10. **Releasing the OpenCL Resources**
    - Free all allocated OpenCL resources.

    ```c
    clReleaseMemObject(bufA);
    clReleaseMemObject(bufB);
    clReleaseMemObject(bufC);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    ```
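As a closing usage note, the same enqueue call also covers multi-dimensional kernels such as `parMultiply2` above: the NDRange simply becomes two-dimensional, with the work-group size matching the tile size. A minimal sketch, assuming `mmKernel`, `M`, and `TS` have been set up as in the previous steps and kernel listing (and that `TS` divides `M`):

```c
// Hypothetical 2D launch of the tiled matrix multiplication kernel:
// one work-item per element of the M x M matrix C,
// one TS x TS work-group per tile.
size_t globalWorkSize2D[2] = { M, M };
size_t localWorkSize2D[2]  = { TS, TS };
clEnqueueNDRangeKernel(queue, mmKernel, 2, NULL,
                       globalWorkSize2D, localWorkSize2D, 0, NULL, NULL);
```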