11.Multithreading Architectures

# Multithreading Architectures ## Superscalar architectures A scalar architecture is a single issue architecture while superscalar architectures allow for multiple instructions to be executed per clock cycle. Remember that ILP $\ne$ parallel processing: - ILP improves throughput by pipelining operations - Parallel processing is non-user-transparent way of executing programs ![](images/803583c8b1e3b6844826abb1bfbe66b3.png){width=50%} Superscalar architectures have limitations. ILP has an upper bound: - Dynamic scheduling is expensive and difficult to design and verify - Register renaming has its limits - Jump and branch predictions are not always accurate - Memory latency is a huge issue. - Hazards prevent too many instructions from being issued simultaneously. ## Temporal Multithreading Using Temporal Multithreading is possible to further increase parallelism by alternately switching between tasks. Temporal Multi-threading has a high cost due to each thread having its own context and there are different techniques to achieve it: - **Fine-grained** multi-threading involves switching between different threads at a very fine level of granularity, such as after every instruction. This allows for maximum utilization of resources but can also lead to increased overhead due to **frequent context switching**. - **Coarse-grained** multi-threading involves switching between different threads at a coarser level of granularity, such as after completing a group of instructions from one thread before moving on to another thread. This reduces the overhead associated with context switching but may not fully utilize available resources. - **Simultaneous multi-threading** (SMT) is similar to fine-grained multi-threading but goes further by allowing multiple instructions from different threads to be executed simultaneously within each clock cycle. ![](images/218bf5ba28c0ebc96af3e08b3d4472ce.png){width=50%} There are drawbacks in each approach: - **Fine-grain** multithreading has empty issue windows, causing idle time. - **Coarse-grained** multithreading architecture is not commonly used, since it requires flushing the pipeline and can result in starvation. - **SMT** requires more complex hardware than either fine- or coarse-grained multithreading but can provide higher levels of parallelism and better resource utilization. ![](images/0200d6edcb5abd3a937bb3fc3ef24606.png){width=50%} ## ARM Cortex-a53 pipeline example What actually happens in real world? Basically different systems are combined to achieve the most performance: for example cpu alone is not enough for tasks such as playing games in 4K or streaming on Twitch: GPUs are a common addition to systems. This is called **heterogeneous architecture** since it combines different types of processors. An example of heterogeneous architecture is the big.LITTLE architecture. Cortex-a53 (also the Raspberry Pi 3 CPU) can be used alone as an energy-efficient alternative to the Cortex-A57, or in a big.LITTLE configuration alongside a more powerful microarchitecture. The big.LITTLE configuration architecture is multicore and heterogeneous with a shared ISA, combining battery-efficient (LITTLE) and power-hungry (big) cores. Typically, only one side will be active, but all cores can access the same memory. ![](images/ac4aebabc3d048275d8e7ba4ac5dd6b2.png)