# Task-level parallelism: streams, events, dynamic parallelism
In the first classes, we commented that CUDA supports two different types of parallelism: data parallelism and task parallelism.
During the first part of this course, we focused on data parallelism because the GPU is designed to exploit it through its parallel architecture. We dissected a single function to parallelize the execution on each data chunk or item.
Now let's talk about **task-level** parallelism: different tasks are performed simultaneously for more complex orchestrations of computations and data management.
## Streams and Concurrency
CUDA introduces the concept of **streams** for task-level parallelism: a stream is a sequence of commands (like kernel executions or memory operations) that execute in order.
When multiple streams are used, the CUDA driver manages the execution order across different hardware queues. Older GPUs were limited to a single hardware queue, which created potential false dependencies among tasks issued to different streams.
With the evolution of GPU architectures from Fermi to Kepler and beyond, NVIDIA introduced the ability to handle multiple hardware queues, thus enabling more sophisticated management of parallel tasks through the use of multiple streams.
Modern GPUs typically expose 32 hardware work queues. If the number of declared streams exceeds this limit, multiple software streams are mapped to the same hardware queue, potentially leading to contention and reduced performance.
In each stream, commands are pushed and popped using a first-in-first-out policy and forwarded to the GPU. On the GPU side, everything is managed directly in hardware, with commands executed sequentially per stream:
- Operations in the same stream: FIFO order, no overlap.
- Operations in different streams: no ordering guarantees, and they may overlap.
- PCIe performs a single transfer per direction: multiple concurrent memory operations in the same direction are serialized.
- To overlap data transfers with kernel execution, split the data and the work into chunks and distribute them across different streams (a full sketch is given at the end of the Pinned Memory section below).
- Events can be used for synchronization.
By using **streams**, programmers can effectively overlap data transfers and kernel executions, optimizing the utilization of both CPU and GPU resources.
**Responsibility**: It's up to the programmer to design the application to take full advantage of task-level parallelism by carefully scheduling operations and managing dependencies.
CUDA provides two types of streams:
- the default stream
- non-default streams.
The default stream, or NULL stream, is automatically handled and requires no explicit intervention from the programmer.
To utilize non-default streams, you first declare and initialize them using CUDA API calls:
```cpp
cudaStream_t stream1;
cudaStreamCreate(&stream1);
```
These streams allow for the parallel execution of tasks such as kernel execution and data transfers, enhancing the GPU's ability to perform multiple operations concurrently.
```cpp
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Queue asynchronous host-to-device copies, one per stream
cudaMemcpyAsync(devPtr1, hostPtr1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(devPtr2, hostPtr2, size, cudaMemcpyHostToDevice, stream2);
// Launch one kernel per stream; work in stream1 and stream2 may overlap
kernel1<<<gridSize, blockSize, 0, stream1>>>(arguments1);
kernel2<<<gridSize, blockSize, 0, stream2>>>(arguments2);
```
`cudaMemcpyAsync` is the asynchronous counterpart of `cudaMemcpy`, pushing the data transfer command into a CUDA stream. This function allows the host thread to continue execution without waiting for the data transfer to complete.
```cpp
cudaMemcpyAsync(destination, source, size, cudaMemcpyHostToDevice, stream);
```

### Pinned Memory
Host memory is managed through the OS virtual memory system (the GPU side gained virtual memory mapping with the Fermi architecture), so the OS may relocate or swap out pages at any time; an asynchronous DMA transfer from such pageable memory could therefore lead to undefined behaviour. For this reason, the CUDA runtime relies on pinned (page-locked) pages.
Pinned memory plays a crucial role in optimizing data transfers between the host and GPU, offering a significant performance boost by avoiding unnecessary copies and enabling asynchronous operations. When memory is allocated as "pinned" or "page-locked," it cannot be swapped out by the operating system, ensuring the memory address remains constant and accessible for DMA (Direct Memory Access) operations.
We can allocate pinned memory directly from code using `cudaMallocHost()` or `cudaHostAlloc()`, and free it with `cudaFreeHost()`. These functions are used instead of a regular `malloc()` to obtain host memory that is page-locked and accessible to the device.
Pinned Memory Benefits:
- **Increased Data Transfer Performance**: Direct access by the DMA engine eliminates the need for copying data to a temporary pageable buffer before transferring to the GPU.
- **Enabling Asynchronous Data Transfer**: Asynchronous transfers allow the GPU to perform data transfer concurrently with other tasks, thereby maximizing the utilization of hardware resources.
```cpp
cudaMallocHost(&ptr, size); // Allocates pinned memory on the host
```
This replaces typical `malloc` or automatic/static array declarations with `cudaMallocHost`, which directly communicates with the OS to allocate memory in a pinned manner.
However, using pinned memory has its drawbacks:
- **Resource Constraints**: Pinned memory can limit the flexibility of the system’s virtual memory management by restricting the OS from dynamically managing the memory space.
- **Potential Impact on Overall System Performance**: Overuse of pinned memory might degrade the performance of other applications or the system as a whole due to reduced memory flexibility.
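To make the pieces above concrete, here is a minimal sketch that combines pinned host memory with chunked asynchronous transfers and kernel launches spread across multiple streams, so that the copy of one chunk can overlap with the computation on another. The kernel `scale`, the buffer sizes, and the number of streams are illustrative placeholders, not part of any CUDA API.
```cpp
#include <cuda_runtime.h>

// Illustrative kernel: doubles each element of a chunk.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22;            // total elements (placeholder size)
    const int NUM_STREAMS = 4;
    const int CHUNK = N / NUM_STREAMS;

    float *h_data, *d_data;
    cudaMallocHost(&h_data, N * sizeof(float));  // pinned host memory: required for true async copies
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    // Each stream handles one chunk: copy in, compute, copy out.
    // Transfers and kernels belonging to different streams may overlap.
    for (int s = 0; s < NUM_STREAMS; ++s) {
        int offset = s * CHUNK;
        cudaMemcpyAsync(d_data + offset, h_data + offset, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, CHUNK);
        cudaMemcpyAsync(h_data + offset, d_data + offset, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);  // wait for this stream's work to finish
        cudaStreamDestroy(streams[s]);
    }

    cudaFree(d_data);
    cudaFreeHost(h_data);  // pinned memory must be released with cudaFreeHost
    return 0;
}
```
Splitting the work into chunks is what makes the overlap possible: while one chunk is being transferred, a previously transferred chunk can already be processed on the GPU.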
## Synchronization Techniques
Synchronization is necessary to ensure that data dependencies are respected among various operations. CUDA provides several mechanisms to synchronize operations both within a single stream and across multiple streams.
### Implicit Synchronization in CUDA Function Calls
Regarding synchronization between the host (CPU) and the device (GPU): certain CUDA function calls inherently synchronize the host with the device, meaning that the host execution will block until the specified CUDA operations are completed.
The most common functions that implicitly **synchronize** host and device:
- Memory allocation:
  - `cudaMallocHost()`
  - `cudaHostAlloc()`
  - `cudaMalloc()`
- Memory copy:
  - `cudaMemcpy()` (non-async versions)
- Memory setting:
  - `cudaMemset()` (non-async versions)
- Device configuration:
  - `cudaDeviceSetCacheConfig()`
These functions block the host until the corresponding GPU operation is complete. This implicit synchronization can be helpful, but blocking the host prevents it from performing other tasks that could otherwise overlap with GPU operations, reducing the overall efficiency of the application.
To minimize the performance impact of these synchronization points, consider the following strategies:
- **Asynchronous Operations**: Where possible, use asynchronous versions of memory operations (`cudaMemcpyAsync()`, `cudaMemsetAsync()`) along with CUDA streams to allow overlap of memory transfers with computation, both on the host and device.
- **Stream Synchronization**: Use `cudaStreamSynchronize()` to synchronize only specific streams rather than blocking the entire host, allowing other operations to continue concurrently.
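As a small illustration of both strategies, the sketch below performs the (synchronizing) device allocation once up front and then issues only asynchronous operations into a dedicated stream, blocking only on that stream. `hostBuf` and `size` are assumed to be defined elsewhere, and `hostBuf` should be pinned for the copy to be truly asynchronous.
```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

float *devBuf;
cudaMalloc(&devBuf, size);                 // synchronizing call: do it once, up front

cudaMemsetAsync(devBuf, 0, size, stream);  // asynchronous: does not block the host
cudaMemcpyAsync(devBuf, hostBuf, size, cudaMemcpyHostToDevice, stream);
// ... launch kernels in `stream` and do independent CPU work here ...

cudaStreamSynchronize(stream);             // block only on this stream, not on the whole device
```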
### Events for synchronization
CUDA events are markers that can be placed in streams to record execution status. They are particularly useful for profiling and for coordinating between the host and device:
- `cudaEvent_t event` is a marker that can be pushed into a stream.
- Events are created with `cudaEventCreate(&event);` and destroyed with `cudaEventDestroy(event);`.
- The device records a timestamp when it reaches the event: `cudaEventRecord(event, stream1);`
- Another stream can be made to wait until the event has been recorded: `cudaStreamWaitEvent(stream2, event, 0);`
```cpp
cudaEvent_t event;
cudaEventCreate(&event);
cudaMemcpyAsync(devPtr, hostPtr, size, cudaMemcpyHostToDevice, stream1);
cudaEventRecord(event, stream1);
cudaStreamWaitEvent(stream2, event, 0); // Make stream2 wait for event
cudaEventDestroy(event);
```
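Because events carry timestamps, they are also the standard tool for profiling GPU work. The sketch below times an asynchronous copy plus a kernel launch; it reuses the placeholder names (`stream1`, `devPtr`, `hostPtr`, `size`, `kernel1`, and its launch configuration) from the snippets above.
```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream1);                       // timestamp before the work
cudaMemcpyAsync(devPtr, hostPtr, size, cudaMemcpyHostToDevice, stream1);
kernel1<<<gridSize, blockSize, 0, stream1>>>(arguments1);
cudaEventRecord(stop, stream1);                        // timestamp after the work

cudaEventSynchronize(stop);                            // wait until `stop` has been recorded
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```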
### Global and fine-grained synchronization
As said before, CUDA provides two types of streams:
- the default stream
- non-default streams.
Operations in the default stream are executed only after all operations in other streams are complete, and vice versa. This ensures a level of synchronization without explicit barriers but can limit concurrency.
To avoid blocking behavior typical of the default stream, use the `cudaStreamCreateWithFlags` function to create non-blocking streams:
```cpp
cudaStream_t stream3;
cudaStreamCreateWithFlags(&stream3, cudaStreamNonBlocking);
```
Non-blocking streams allow operations within the stream to proceed without having to wait for operations in the default stream to complete, further enhancing concurrency.
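The following sketch contrasts the two behaviours; `kernelA`, `kernelB`, `work`, `grid`, and `block` are illustrative names, not part of the CUDA API.
```cpp
cudaStream_t blocking, nonBlocking;
cudaStreamCreate(&blocking);                                    // synchronizes with the default (NULL) stream
cudaStreamCreateWithFlags(&nonBlocking, cudaStreamNonBlocking); // does not synchronize with the NULL stream

kernelA<<<grid, block, 0, blocking>>>();     // issued first, in a blocking stream
work<<<grid, block>>>();                     // NULL stream: waits for kernelA to finish
kernelB<<<grid, block, 0, nonBlocking>>>();  // independent of the NULL stream: may overlap with work
```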
To ensure that all commands in a stream complete before the CPU proceeds, we can use:
```cpp
cudaStreamSynchronize(stream1);
```
This places a barrier, ensuring the CPU only continues once all commands in `stream1` are complete.
`cudaDeviceSynchronize()`: Waits for the completion of all device activities and is considered a global barrier.
## Dynamic parallelism
Dynamic parallelism (introduced in 2012 with the Kepler architecture) allows threads running on the device to launch new kernels, including further instances of the same kernel, to tackle more complex, data-dependent instances of the problem:
- **Adaptive Thread Management**: Useful for adjusting the number of threads for data sets with unpredictable sizes, like graphs or sparse matrices, which may vary during processing.
- **Reduced Host-Device Communication**: Eliminates the need for constant signaling between the host and device when additional tasks are found, allowing threads to directly launch new grids and bypassing the usual host intervention.

Specifically CUDA kernels can:
- Launch other kernels
- Allocate global memory
- Synchronize with child grids
- Create streams
However, the range of functions callable from within a device is more limited compared to those callable from the host.
- Max recursive depth is 24 levels theoretically, but practical limits due to GPU resource constraints are much lower.
- The number of pending child kernels is capped at 2048 by default; beyond that limit, launches are virtualized by moving launch data to global memory, which can degrade performance.
- Each thread can independently launch a child kernel, which is asynchronous.
- To prevent launching one child grid per thread, typically only a specific thread (e.g., the first thread of a block or grid) is programmed to launch the child kernel, avoiding the creation of excessive child grids (see the sketch at the end of this section).
- There's no predefined execution order between parent and child threads; the order is determined dynamically.
- `cudaDeviceSynchronize()`, called from device code, can be used to wait until all the child grids launched by the current block have completed.
- Implicit barrier in the parent: a child grid always terminates before its parent thread / block / grid is considered complete.
- Child grids can access the parent grid's global, texture, and constant memory. However, simultaneous operations on the same data structure by both grids are not recommended, since the execution order is not guaranteed (weak consistency).
- By default, launching a kernel without specifying a stream uses the NULL stream, which serializes the child grids launched by multiple threads of the same block. To enable their parallel execution, child kernels should be launched in separate streams created with the `cudaStreamNonBlocking` flag. Proper synchronization is necessary before stream destruction.
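Putting these points together, here is a minimal dynamic-parallelism sketch in which only thread 0 of the parent block launches a child grid; the kernels `parent` and `child` are illustrative. Note that code using dynamic parallelism must be compiled with relocatable device code enabled (e.g., `nvcc -rdc=true`) and linked against the device runtime.
```cpp
#include <cuda_runtime.h>

// Child kernel: illustrative work on the data.
__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Parent kernel: only thread 0 of the block launches a child grid,
// avoiding one child grid per thread.
__global__ void parent(int *data, int n) {
    if (threadIdx.x == 0) {
        // The child launch is asynchronous with respect to the parent thread,
        // but the parent grid does not complete until all child grids do.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}

int main() {
    const int N = 1024;
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));

    parent<<<1, 256>>>(d_data, N);   // host-side launch of the parent grid
    cudaDeviceSynchronize();         // host-side wait: parent and all children have finished here

    cudaFree(d_data);
    return 0;
}
```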