# Parallel Patterns

A parallel pattern is a recurring combination of task distribution and data access that solves a specific problem in parallel algorithm design. Patterns are universal: they can be used in any parallel programming system.

Parallel patterns are classified under these macro-classes:

- nesting pattern
- parallel control patterns
- parallel data management patterns
- other patterns

## Nesting Pattern

![](images/6e85937742c3f7a7f16b97218d26de4e.png){width=50%}

## Parallel control patterns

Recap of **serial** control patterns:

- sequence pattern
- iteration pattern
- selection pattern
- recursion pattern

### Fork-Join pattern

![](images/d108328a4f478877087adea69e52904e.png){width=50%}

The fork-join pattern is a common parallelization technique used to decompose a sequence pattern into smaller subproblems that can be solved in parallel. The basic idea is to fork the original problem into multiple tasks and then combine (join) the results of these tasks later to obtain the final solution.

### Map

The map is a foreach loop where each iteration is independent, so it is embarrassingly parallel. The key concept is that a map applies an operation to each element without any knowledge of neighboring elements. A map function should also be pure and should not modify any shared state.

![](images/4b36e51f055c1455c4b15420c6dc95c5.png){width=50%}

Purity means perfect independence and determinism: no data races and no segfaults.

![](images/4e9bd736a96f70997575efc15dc236f0.png)

### Stencil

A generalization of the map: a stencil is a function which also accesses a set of "neighbors", cells located at fixed offsets/distances relative to the output position.

Stencils can operate on one-dimensional and multidimensional data, so the neighborhoods can range from compact to sparse, square to cube, and anything else.

![](images/8d640d2a1c2719331b535d3578774418.png){width=50%}

![](images/55dc48f4ee697522802e6e49e37721a9.png){width=50%}

#### Stencil and ghost cells

![](images/2c87f918c04d02dfeb72da448f39f797.png){width=50%}

Since computing the value of each point requires the values of other points, these computations are **not** embarrassingly parallel. Specifically, the points at the borders of a chunk require the values of points from the neighboring chunks.

In order to apply the stencil operation consistently to all points in a computational domain, additional space is allocated for a series of **ghost cells** around the edges of each chunk. These ghost cells form a **halo** around each chunk, containing replicas of the borders of all immediate neighbors. **The ghost cells are not updated locally, but provide the necessary stencil values when updating the borders of the chunk.**

![](images/120ddf1bb186c9b47060f99d70c6dfa0.png){width=50%}

![](images/6041f6d9c81e9e1a809bd30d23315c80.png){width=50%}

The use of ghost cells in parallel stencil applications can improve the accuracy and efficiency of computations, but it also involves a **trade-off** between **computation** and **communication**.

![](images/adc14e78fc0756aedee9d797b3734622.png){width=50%}

If we increase the number of threads, the ratio of ghost cells to total cells grows rapidly, causing a greater demand on memory.
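As an illustration of the idea, here is a minimal sketch (my addition, not taken from the course material) of one iteration of a 1D three-point stencil over a single chunk. The chunk owns `n` points and keeps one ghost cell on each side; `exchange_ghost_cells` is a hypothetical placeholder for whatever border exchange the runtime (threads, MPI, ...) would perform, and the OpenMP pragma is just one possible way to parallelize the interior update.

```cpp
#include <vector>

// data[0] and data[n+1] are the ghost cells (only read here),
// data[1..n] are the points owned by this chunk.
void stencil_step(std::vector<double>& data, std::vector<double>& out) {
    const long n = static_cast<long>(data.size()) - 2;
    // exchange_ghost_cells(data);  // hypothetical: refresh the halo before computing
    #pragma omp parallel for        // the interior update itself is a map
    for (long i = 1; i <= n; ++i)
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0;
}
```

Note how the ghost cells are only read: updating them is the neighbor's job, which is exactly the communication side of the trade-off described above.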
#### Possible optimizations in the Stencil pattern

- We could double or triple our **ghost cell boundary**: we do **extra computation** but we reduce the number of border exchanges, decreasing the associated communication overhead. This allows us to perform several iterations without stopping for a ghost cell update.
- A possible **latency hiding** technique is to compute the interior of the stencil while waiting for the ghost cell updates.
- In some specific cases (iterative codes in computer simulations of physical systems) **SOR**, or **Successive Over-Relaxation**, can be useful. **Red/Black SOR** is a variant of the SOR method that uses a special ordering of the grid points to speed up the process. It is fully parallelizable, since in the first step we only read from black cells and write to red cells, and vice versa in the second step. Within a step we effectively "predict" the values of the next iteration, but each step is completely parallelizable. It consists of an **interpolation** and an **approximation**.
  ![](images/624f4614e7deca536f019ec1b97d63dd.png){width=20%}
- Cache optimizations are based on assumptions about cache lines: for example, we can assume that rows of a 2D array are contiguous in memory. This means that horizontally adjacent data will tend to belong to the same cache line, while accesses at vertical offsets will most likely result in cache misses. **Strip-mining** is an optimization based on these considerations: the overall grid of data to be processed is divided into smaller **strips**, grouping elements in a way that avoids redundant memory accesses and aligns them with cache lines.

### Reduction

Reduction combines every element in a collection using an associative function. If the function isn't associative, it is not possible to parallelize the reduction. The associative property allows us to "split" the reduction and change the order of its operations. Note that addition, multiplication, maximum, minimum and boolean AND, OR, XOR are all associative.

![](images/ccfdb39d17c5182fcc7a7e04a83d2ce9.png){width=20%}

Even a single processor can exploit "vectorization": without any parallelization across cores we still get a speedup, because one instruction can operate on 2 elements per cycle (so we have a speedup of 2):

![](images/1192f17c158d56e0e2589896b86f8cee.png){width=20%}

The main concept of reduction parallelization is **tiling**: breaking the work into chunks that are reduced serially, and then combining the partial results.

![](images/66cd13f0fa0b23178d0e9e4e42e676a2.png)

A classic reduction example is the dot product, an essential operation in physics, graphics and video games. The dot product of $\vec a$ and $\vec b$ is $|\vec a||\vec b|\cos(\alpha)$, but also $\sum_i a_i b_i$ (over the vector components), so it is easy to parallelize it as a map of the element-wise products followed by a sum reduction.

### Scan

Scan computes all partial reductions of a collection. Like the reduction, if the function is **associative**, the scan can be parallelized. Scan is not obvious at first, because of the dependencies on previous iterations.

![](images/7d7c8da95f4e377690c674d847dd09cd.png){width=30%}

Note that the parallel scan has to perform more work than the serial version (see the previous image).

Two types of scan (see the sketch below):

- Inclusive scan: includes the current element in the partial reduction.
- Exclusive scan: excludes the current element; the partial reduction covers all elements prior to the current one.

We can also identify two phases of the algorithm:

- Up sweep: compute the reduction
- Down sweep: compute the intermediate results
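As a concrete reference, here is a minimal sketch (my addition, assuming a toolchain that supports the C++17 parallel algorithms, e.g. libstdc++ with TBB) showing a reduction, an inclusive scan and an exclusive scan over a small vector. Because `+` is associative, the library is free to parallelize all three.

```cpp
#include <cstdio>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};
    std::vector<int> inc(v.size()), exc(v.size());

    // Reduction: combine every element with an associative operator -> 10
    int sum = std::reduce(std::execution::par, v.begin(), v.end(), 0, std::plus<>{});

    // Inclusive scan: element i reduces v[0..i]   -> 1 3 6 10
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), inc.begin());

    // Exclusive scan: element i reduces v[0..i-1] -> 0 1 3 6
    std::exclusive_scan(std::execution::par, v.begin(), v.end(), exc.begin(), 0);

    std::printf("sum = %d\n", sum);
}
```

The up-sweep/down-sweep decomposition mentioned above is what a parallel implementation of these scans typically does internally.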
### Recurrence

The recurrence parallel pattern is a more complex version of the map pattern, in which the loop iterations can depend on one another. This means that, unlike in the map pattern, the elements of the input data are not necessarily independent of each other, and the outputs of some elements may be used as inputs to other elements. As a result, the order in which the elements are computed may be important, and the pattern typically involves the use of a serial ordering to ensure that the elements can be computed correctly. This pattern is used to structure the parallel execution of code that involves recursion.

This can still be parallelized! The trick is to find a plane that cuts through the grid of intermediate results: the points lying on the plane are independent of each other and can be computed in parallel, and the plane is then swept across the grid.

![](images/266511d2481c38d9ba3ad16f8f4e5f6d.png){width=50%}

## Parallel Data Management Patterns

Serial data management patterns are the classical ones:

- stack allocation
- heap allocation
- objects
- random read and write

Often the **bottleneck** is not the computation but the **data movement**. Considering [the Principle of Locality](../../Advanced%20Computer%20Architectures/src/13.Memory%20Hierarchy.md#Principle%20of%20Locality) is very important: from a **hardware** perspective it is better to keep data "local" and closer to the CPU to exploit the **cache**. **Transferring data across memory layers is costly, since it can take many cycles.**

### Geometric decomposition

Geometric decomposition is a common pattern that arranges data into sub-collections (chunks), which may or may not overlap. This pattern doesn't necessarily move data: it just gives us another view of it.

![](images/d69a30a8ba162a1cd55dde15e7ee104d.png)

Special cases are:

- partitioning pattern: sub-collections are same-sized and non-overlapping
- segmentation pattern: partitions are non-uniformly sized

### Gather

Gather reads a collection of data given a collection of indices. Think of it as a combination of a map and random serial reads. The output collection has the same element type as the input collection, but the same shape as the indices collection.

![](images/a210fbc50ac0d58e357f6bef352e234c.png)

**Read locations are provided as input.** Remember that the output will have the same size as the index array.

#### Zip

A special case of gather: we start from two arrays and combine them element by element.

#### Unzip

Reverses a zip: it extracts sub-arrays at certain offsets from a given input.

### Scatter

Scatter is the inverse of gather. A set of input values and indices is required, but each element of the input is written to the output at the given index instead of being read from the input at the given index. This is different from gather: race conditions can occur when two writes target the same location!

![](images/19c330fc349f96fb923c7bf7a2cff7cf.png)

**Write locations are provided as input.** In case of collisions we can apply rules such as:

- if the operator is associative and commutative, colliding values can be merged;
- we can associate a priority with each value (an example of this case is 3D graphics rendering);
- if there are no collisions, the output is just a permutation, so there is no problem.

### Pack

Pack is used to eliminate unused space in a collection. Elements marked false are discarded; the remaining elements are placed in a contiguous sequence, in the same order.

![](images/526a61ae1f5b97aa9230000bfe65cb76.png)

### Split

![](images/527accb7d3c8a960b6c4133bee17b05c.png)

A generalization of the **pack** pattern in which no information is lost: elements are moved to one of two sides instead of being discarded. There is also the "inverse operation" pattern: unsplit.
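To pin down the semantics of these data-movement patterns, here is a minimal serial sketch (my addition, not from the course material); in a parallel implementation gather is a map, scatter needs one of the collision rules above, and pack is usually built on an exclusive scan of the flags.

```cpp
#include <cstddef>
#include <vector>

// Gather: out[i] = in[idx[i]] -- read locations given as input,
// output has the same shape as the index collection.
std::vector<int> gather(const std::vector<int>& in, const std::vector<std::size_t>& idx) {
    std::vector<int> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) out[i] = in[idx[i]];
    return out;
}

// Scatter: out[idx[i]] = in[i] -- write locations given as input.
// On a collision this naive version lets the last write win; a real
// implementation must pick a rule (merge, priority, ...).
void scatter(const std::vector<int>& in, const std::vector<std::size_t>& idx, std::vector<int>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) out[idx[i]] = in[i];
}

// Pack: keep only the elements whose flag is true, preserving their order.
std::vector<int> pack(const std::vector<int>& in, const std::vector<bool>& keep) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (keep[i]) out.push_back(in[i]);
    return out;
}
```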
### Bin

There is also the **Bin** parallel pattern, which generalizes split: it is simply a split that supports more than two categories.

![](images/26d431ec07632d0e83ef29def142c91f.png)

### Pipeline

Pipeline connects tasks in a producer-consumer manner, which is very common. A linear pipeline is the basic form of the pattern, but a pipeline shaped as a DAG is also possible.

## Other Patterns

- Expand: a combination of pack and map, where each element can output multiple elements.
  ![](images/7e183f6c98e56b7dd35d6ca71a4a5503.png){width=20%}
- Superscalar Sequences: write a sequence of tasks, ordered only by dependencies.
- Futures: similar to fork-join, but tasks do not need to be nested hierarchically.
- Speculative Selection: general version of serial selection where the condition and both outcomes can all run in parallel.
- Workpile: general map pattern where each instance of the elemental function can generate more instances, adding to the "pile" of work.
- Search: finds some data in a collection that meets some criteria.
- Category Reduction: given a collection of elements, each with a label, find all elements with the same label and reduce them.

# Different ways to store things

## Array of Structures (AoS)

An array containing the different instances of a data structure. Memory accesses for reads (gathers) and writes (scatters) are hard to vectorize, but this layout can be useful if data is accessed randomly.

## Structure of Arrays (SoA)

A single data structure in which each property/attribute stores the values of all the instances in its own array. Typically better for vectorization and for avoiding false sharing: keeping a separate array for each structure field makes memory accesses contiguous when vectorization is performed over structure instances.

![](images/5f2d709303c357aa7e5a57784c37a744.png)

The padding at the end indicates the actual size of the data structure.
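A minimal sketch of the two layouts (my addition; the `Particle` record and its fields are just hypothetical examples):

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: one array, each element is a complete record.
struct ParticleAoS { float x, y, z, mass; };
using ParticlesAoS = std::vector<ParticleAoS>;

// Structure of Arrays: one array per field; index i of every array
// refers to the same logical particle.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// A field-wise sweep over the SoA layout touches contiguous memory and is
// easy to vectorize; the same sweep over AoS strides across whole records.
void scale_masses(ParticlesSoA& p, float k) {
    for (std::size_t i = 0; i < p.mass.size(); ++i) p.mass[i] *= k;
}
```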