# GPU Architectures

Prof. Onur Mutlu

ETH Zürich Spring 2023 11 May 2023

STUDENTS-HUB.com

Uploaded By: Jibreel Bornat

# GPUs are SIMD Engines Underneath

- The instruction pipeline operates like an SIMD pipeline (e.g., an array processor)
- However, the programming is done using threads, NOT SIMD instructions
- Let us distinguish between
  - Programming Model (Software)

VS.

Execution Model (Hardware)

Programming Model vs. Hardware Execution Model

- Programming Model refers to how the programmer expresses the code
  - E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), ...
- Execution Model refers to how the hardware executes the code underneath
  - E.g., Out-of-order execution, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, ...
- Execution Model can be very different from the Programming Model
  - E.g., von Neumann model implemented by an OoO processor
  - E.g., SPMD model implemented by a SIMD processor (a GPU)

STUDENTS-HUB.com

Uploaded By: Jibreel Borrat

## How Can You Exploit Parallelism Here?



Scalar Sequential Code C[i] = A[i] + B[i];



STUDENTS-HUB.com

Let's examine three programming options to exploit instruction-level parallelism present in this sequential code:

- 1. Sequential (SISD)
- 2. Data-Parallel (SIMD)

3. Multithreaded (MIMD/SPMD)

Uploaded By: Jibreel Borfat

### Prog. Model 1: Sequential (SISD)



Can be executed on a:

- Pipelined processor
- Out-of-order execution processor
  - Independent instructions executed when ready
  - Different iterations are present in the instruction window and can execute in parallel in multiple functional units
  - In other words, the loop is dynamically unrolled by the hardware
- Superscalar or VLIW processor
  - Can fetch and execute multiple instructions per cycle

STUDENTS-HUB.com

Uploaded By: Jibreel Bornat

# Prog. Model 2: Data Parallel $(SIMD)^{for (i=0; i < N; i++)}_{C[i] = A[i] + B[i];}$



STUDENTS-HUB.com

Uploaded By: Jibreel Borhat

#### Prog. Model 3: Multithreaded for (i=0; i < N; i++) c[i] = A[i] + B[i];



STUDENTS-HUB.com

Uploaded By: Jibreel Bornat

#### for (i=0; i < N; i++)Prog. Model 3: Multithreaded C[i] = A[i] + B[i];



STUDENTS-HUB.com

reel Bornat

# A GPU is a SIMD (SIMT) Machine

- Except it is not programmed using SIMD instructions
- It is programmed using threads (SPMD programming model)
  - Each thread executes the same code but operates a different piece of data
  - Each thread has its own context (i.e., can be treated/restarted/executed independently)
- A set of threads executing the same instruction are dynamically grouped into a warp by the hardware
   A warp is essentially a SIMD operation formed by hardware!

# Warp Terminology



#### Source: Wikipedia

STUDENTS-HUB.com

# SPMD on SIMT Machine



# SIMD vs. SIMT Execution Model

- SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
   [VLD, VLD, VADD, VST], VLEN
- SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
   [LD, LD, ADD, ST], NumThreads

#### Two Major SIMT Advantages:

- □ Can treat each thread separately → i.e., can execute each thread independently (on any type of scalar pipeline) → MIMD processing
- □ Can group threads into warps flexibly → i.e., can group threads that are supposed to *truly* execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

STUDENTS-HUB.com

### Fine-Grained Multithreading of Warps

- Assume a warp consists of 32 threads
- If you have 32K iterations, and 1 iteration/thread  $\rightarrow$  1K warps
- Warps can be interleaved on the same pipeline → Fine grained multithreading of warps



STUDENTS-HUB.com

### Fine-Grained Multithreading (FGMT) of Warps

for (i=0; i < N; i++)
C[i] = A[i] + B[i];</pre>

- Assume a warp consists of 32 threads
- If you have 32K iterations, and 1 iteration/thread  $\rightarrow$  1K warps
- Warps can be interleaved on the same pipeline → Fine grained multithreading of warps



All threads in a warp are independent of each other

 $\rightarrow$  They be executed seamlessly in a fine-grained multithreaded pipeline

STUDENTS-HUB.com

Uploaded By: Jibreel Bolfat

### Fine-Grained Multithreading: Basic Idea



#### Each pipeline stage has an instruction from a different, completely-independent thread

STUDENTS-Webneed a PC and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes and a register file for each thread + muxes

# Warps and Warp-Level FGMT

- Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
- All threads run the same code
- Warp: The threads that run lengthwise in a woven fabric ...



SINGHON TO Set alB. CONVIDIA Tesla: A Unified Graphics and Computing Architeros 2008.

# High-Level View of a GPU



SINGHONMSetHalB. CON/IDIA Tesla: A Unified Graphics and Computing Architectored IEE EliMieros 2008.

# Latency Hiding via Warp-Level FGMT

- Warp: A set of threads that execute the same instruction (on different data elements)
- Fine-grained multithreading
  - One instruction per thread in pipeline at a time (No interlocking)
  - Interleave warp execution to hide latencies
- Register values of all threads stay in register file
- FGMT enables simple pipeline & long latency tolerance
  - Millions of threads operating on the same large image/video



## Recall: Vector Instruction Execution



STide Dedik KrStelAsaBytcom

Uploaded By: Jibreel Bolhat

# Warp Execution (Recall the Previous Slide)



STide Dealth KrStel Asal Bytcom

Uploaded By: Jibreel Borhat

## Recall: Vector Unit Structure



STIde Deci KrSelAsa Bxcom

Uploaded By: Jibreel Borhat

# GPU SIMD Execution Unit Structure



STIde Deal KrSelAsabytcom

Uploaded By: Jibreel Bozhat

# SIMT Memory Access (Loads and Stores)

 Same instruction in different threads uses thread id to index and access different data elements

Let's assume N=16, 4 threads per warp  $\rightarrow$  4 warps Threads 9 13 14 ÷ 9 11 12 13 8 10 14 Data elements Warp 3 Warp 1 Warp 0 Warp 2

For maximum performance, memory should provide enough bandwidth (i.e., elements per cycle throughput to match computation unit throughput)

# Warps not Exposed to GPU Programmers

- CPU threads and GPU kernels
  - Sequential or modestly parallel sections on CPU
  - Massively parallel sections on GPU: Blocks of threads

Serial Code (host)

Parallel Kernel (device)
KernelA<<<nBlk, nThr>>>(args);

Serial Code (host)

Parallel Kernel (device)
KernelB<<<nBlk, nThr>>>(args);



Slide credit: Hwu & Kirk

# Sample GPU SIMT Code (Simplified)



# Sample GPU Program (Less Simplified)

#### **CPU Program**

#### **GPU Program**

```
void add matrix
(float *a, float* b, float *c, int N) {
  int index;
  for (int i = 0; i < N; ++i)
     for (int j = 0; j < N; ++j) {
       index = i + j*N;
       c[index] = a[index] + b[index];
                                           }
int main () {
  add matrix (a, b, c, N);
                                           }
```

```
__global__ add_matrix
(float *a, float *b, float *c, int N) {
int i = blockldx.x * blockDim.x + threadldx.x;
Int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i + j*N;
if (i < N && j < N)
 c[index] = a[index]+b[index];
Int main() {
 dim3 dimBlock( blocksize, blocksize);
 dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
 add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N);
```

# From Blocks to Warps

- GPU core: A SIMD pipeline
  - Streaming Processor (SP)
  - Many such SIMD Processors
    - Streaming Multiprocessor (SM)
- Blocks are divided into warps
   SIMD/SIMT unit (32 threads)





**NVIDIA Fermi architecture** 

# Warp-based SIMD vs. Traditional SIMD

- Traditional SIMD contains a single thread
  - Sequential instruction execution; lock-step operations in a SIMD instruction
  - □ Programming model is SIMD (no extra threads) → SW needs to know vector length
  - ISA contains vector/SIMD instructions
- Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., same instruction executed by all threads)
  - Does not have to be lock step
  - Each thread can be treated individually (i.e., placed in a different warp)
    - → programming model not SIMD
    - SW does not need to know vector length
    - Enables multithreading and flexible dynamic grouping of threads
  - □ ISA is scalar  $\rightarrow$  SIMD operations can be formed dynamically
  - Essentially, it is SPMD programming model implemented on SIMD hardware

STUDENTS-HUB.com

# SPMD

- Single procedure/program, multiple data
  - This is a programming model rather than computer organization
- Each processing element executes the same procedure, except on different data elements
  - □ Procedures can synchronize at certain points in program, e.g. barriers
- Essentially, multiple instruction streams execute the same program
  - Each program/procedure 1) works on different data, 2) can execute a different control-flow path, at run-time
  - Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors)
  - Modern GPUs programmed in a similar way on a SIMD hardware

# SIMD vs. SIMT Execution Model

- SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
   [VLD, VLD, VADD, VST], VLEN
- SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
   [LD, LD, ADD, ST], NumThreads
- Two Major SIMT Advantages:
  - □ Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing
  - Can group threads into warps flexibly → i.e., can group threads that are supposed to *truly* execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

STUDENTS-HUB.com

#### Threads Can Take Different Paths in Warp-based SIMD

- Each thread can have conditional control flow instructions
- Threads can execute different control flow paths



# Control Flow Problem in GPUs/SIMT

- A GPU uses a SIMD pipeline to save area on control logic
  - Groups scalar threads into warps
- Branch divergence

occurs when threads inside warps branch to different execution paths



This is the same as conditional/predicated/masked execution. Recall the Vector Mask and Masked Vector Operations

# Remember: Each Thread Is Independent

- Two Major SIMT Advantages:
  - □ Can treat each thread separately  $\rightarrow$  i.e., can execute each thread independently on any type of scalar pipeline  $\rightarrow$  MIMD processing
  - Can group threads into warps flexibly → i.e., can group threads that are supposed to *truly* execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

- If we have many threads
- We can find individual threads that are at the same PC
- And, group them together into a single warp dynamically
- This reduces "divergence"  $\rightarrow$  improves SIMD utilization
  - SIMD utilization: fraction of SIMD lanes executing a useful operation (i.e., executing an active thread)

# Dynamic Warp Formation/Merging

- Idea: Dynamically merge threads executing the same instruction, i.e., at the same PC (after branch divergence)
- Form new warps from warps that are waiting
  - Enough threads branching to each path enables the creation of full new warps



# Dynamic Warp Formation/Merging

 Idea: Dynamically merge threads executing the same instruction, i.e., at the same PC (after branch divergence)



Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007.

STUDENTS-HUB.com

Uploaded By: Jibreel Boanat

# Dynamic Warp Formation Example



### Large Warps and Two-Level Warp Scheduling

- Two main reasons for GPU resources be underutilized
  - Branch divergence
  - Long latency operations



Round Robin Scheduling, 16 total warps

Narasiman et al., "Improving GPU Performance via Large Warps and Two-Level Warp stopeduling Uploaded By: Jibreel Boffdat

# Two-Level Scheduling of Warps



Narasiman et al., "Improving GPU Performance via Large Warps and Two-Level Warp STODERUINGRO 2011. Uploaded By: Jibreel Bornat