# Introduction to heterogeneous systems
A **heterogeneous system** comprises multiple computing units, each with distinct characteristics. Heterogeneous systems arose from the need to run applications in many fields with differing performance requirements (email, browsers, video games, CAD, etc.).
The main motivations are:
1. The need for high performance and energy efficiency in modern devices, which are **power-constrained**.
2. Many compute-intensive applications can be partitioned into parts with different characteristics, each of which can be efficiently accelerated by a different type of computing unit.
**Challenges**:
1. **Integration** of many different units and **communication** between them.
2. **Programmability**, as each type of unit has different programming languages and paradigms.
3. **Resource management**, including distributing a multi-programmed dynamic workload, configuring software/hardware knobs, and achieving required performance at a power-efficient level.
Real-world examples are everywhere: mobile phones, laptops, desktop computers, embedded and edge devices, and supercomputers.
## Evolution of Computing System Architectures
The evolution of computing system architectures can be summarized as:
- **Single-Core Era**: This era was characterized by increasing voltage/frequency scaling until the power wall was reached in 2004.
- **Multi-Core Era**: This era saw the integration of multiple cores in the same chip. However, performance limits were encountered due to power consumption and scalability issues.
- **Heterogeneous Systems Era**: This era is marked by the integration of heterogeneous units in the same chip, allowing for the parallelization of various applications. Different parts of an application may benefit from specialized computing units.
A useful classification to remember:
**Task Parallelism**: This involves the execution of many independent tasks/functions in parallel.
**Data Parallelism**: This involves operations on data composed of many items that can be processed simultaneously.
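As a rough, hedged sketch of the difference (the task names `compressLog` and `renderFrame` are invented placeholders, not from the notes): task parallelism runs independent functions side by side, while data parallelism applies the same operation to many independent items.

```cpp
#include <cstdio>

// Task parallelism: two independent functions (hypothetical tasks) with no
// dependence on each other, so they could be scheduled on different cores at once.
void compressLog() { printf("compressing log...\n"); }
void renderFrame() { printf("rendering frame...\n"); }

// Data parallelism: the same operation applied to many independent items;
// each iteration could be mapped to a SIMD lane, a CPU thread, or a GPU thread.
void scaleArray(float* a, int n, float s) {
    for (int i = 0; i < n; ++i)
        a[i] *= s;                   // no dependence between iterations
}

int main() {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    compressLog();                   // task 1
    renderFrame();                   // task 2 (independent of task 1)
    scaleArray(a, 4, 2.0f);          // data-parallel work (run sequentially here)
    printf("a[0] = %f\n", a[0]);     // expected 2.0
    return 0;
}
```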
Flynn's taxonomy, introduced by Michael J. Flynn in 1966, has been instrumental in understanding and developing parallel processing architectures:
- **SISD (Single Instruction, Single Data)**: Represents the classical von Neumann architecture where a single processing unit executes a single instruction stream on a single data stream. This is typical of traditional uniprocessor machines.
- **SIMD (Single Instruction, Multiple Data)**: Multiple processing units perform the same operation simultaneously on different data items. This architecture is often used in vector machines, where each processing unit executes the same instruction but on different pieces of data. Example: vector processors.
- **MISD (Multiple Instruction, Single Data)**: An uncommon architecture where multiple instructions operate on a single data stream. This type of architecture is rarely used in practice.
- **MIMD (Multiple Instruction, Multiple Data)**: Multiple autonomous processors simultaneously execute different instructions on different data items. This is common in multi-core machines where each core can run different instructions independently. Example: multi-core processors.
- **SPMD (Single Program, Multiple Data)**: Multiple autonomous processing units simultaneously execute a single program on different data items. While the program is the same across all units, the execution path can vary for each data point. This is the approach of CUDA:
- The kernel function includes the code executed by each thread
- For example, a `for` loop is replaced by a grid of threads, each working on a single data element (see the sketch after this list)
- **MPMD (Multiple Program, Multiple Data)**: Multiple autonomous processing units simultaneously execute at least two independent programs on different data items. This architecture allows for more complex and varied computation as different units can run entirely different programs.
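As a concrete illustration of the SPMD idea above, here is a minimal CUDA sketch (not part of the original notes) of vector addition: the sequential `for` loop disappears, and a grid of threads is launched in which each thread computes its own global index and handles a single element.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sequential version: one thread walks the whole array with a for loop.
void addSeq(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// SPMD/CUDA version: every thread runs the same kernel (same program),
// but each one works on a different data element.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover all elements
    addKernel<<<blocks, threads>>>(a, b, c, n);     // the grid of threads replaces the loop
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                    // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Every thread follows the same program, but the execution path can differ per thread (here, threads with `i >= n` simply do nothing), which is exactly the SPMD model.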
### Refresh of architectural solutions for parallel computing
In a computer architecture, parallelism can be extracted at different levels:
- Instruction-Level Parallelism (ILP)
- The architecture is organized in different stages
- Each instruction “occupies” a single stage at a time, allowing it to overlap with the other instructions in flight
- In a superscalar architecture (composed of several ALUs), instructions are scheduled out of order
- In a Very Long Instruction Word (VLIW) architecture, dependency analysis is performed by the compiler to schedule long instructions
- Data-Level Parallelism (DLP) (SIMD)
- The architecture contains groups of several ALUs of the same type
- Vectorized instructions are used to execute the same operation on different data items
- Thread-Level Parallelism (TLP):
- Independent threads (with no dependencies between each other) execute multiple instruction streams concurrently
- The architecture stores the execution data (called context) of all the threads
- The architecture contains several cores (multicore architecture), each of which executes at least one thread.
- A cache hierarchy is needed to prevent memory accesses from becoming the bottleneck.
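To connect DLP and TLP to code, here is a small sketch (the `saxpy` function is an illustrative example, not from the notes): the loop has no loop-carried dependence, so a compiler can map it to SIMD/vector instructions (DLP), while the two `std::thread`s run independent instruction streams that a multicore CPU can place on separate cores (TLP).

```cpp
#include <thread>
#include <vector>
#include <cstdio>

// DLP: independent iterations that a compiler can turn into SIMD (vector) instructions.
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i)      // no loop-carried dependence -> vectorizable
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 16;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // TLP: two independent threads, each with its own context (registers, stack),
    // work on disjoint halves of the data and can run on different cores.
    std::thread t0(saxpy, y.data(),         x.data(),         3.0f, n / 2);
    std::thread t1(saxpy, y.data() + n / 2, x.data() + n / 2, 3.0f, n - n / 2);
    t0.join();
    t1.join();

    printf("y[0] = %f\n", y[0]);     // expected 3*1 + 2 = 5.0
    return 0;
}
```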
All these techniques are used in GPUs, but the most important is Data-Level Parallelism (DLP): groups of ALUs execute the same operation on different data items simultaneously.