# Multicycle Implementation

STUDENTS-HUB.com

# Drawbacks of Single Cycle Processor

#### Long cycle time

★ All instructions take as much time as the slowest

Arithmetic & Logical

| Instruction Fetch        | Register Read | ALU | Reg Write    |           |
|--------------------------|---------------|-----|--------------|-----------|
| Load — longest delay — > |               |     |              |           |
| Instruction Fetch        | Register Read | ALU | Memory Read  | Reg Write |
| Store                    |               |     |              |           |
| Instruction Fetch        | Register Read | ALU | Memory Write |           |
| Branch                   |               |     |              |           |
| Instruction Fetch        | Register Read | ALU |              |           |

- Functional units are duplicated raising cost
  - ★ Each functional unit can be used once per clock cycle

Multicycle Implementation

STUDENTS-HUB.com

© Muhamed Mudawar, COE 308 – KFUPM

Slide 2 Uploaded By: Jibreel Bornat

# Solution = Multicycle Implementation

- Break instruction execution into five steps
  - ★ Instruction fetch
  - ★ Instruction decode and register read
  - ★ Execution, memory address calculation, or branch completion
  - ★ Memory access or ALU instruction completion
  - ★ Load instruction completion
- One step = One clock cycle (clock cycle is reduced)
  - ★ First 2 steps are the same for all instructions

| Instruction | # cycles | Instruction | # cycles |
|-------------|----------|-------------|----------|
| ALU         | 4        | Branch      | 3        |
| Load        | 5        | Store       | 4        |

Multicycle Implementation

STUDENTS-HUB.com

© Muhamed Mudawar, COE 308 - KFUPM

Slide 3 Uploaded By: Jibreel Bornat

# MIPS Multicycle Datapath



Multicycle Implementation STUDENTS-HUB.com © Muhamed Mudawar, COE 308 – KFUPM

Slide 4

# Multicycle Datapath Changes

- Eliminating some of the components
  - \* Single memory unit for both instructions and data
  - **\*** Single ALU eliminating branch address adder and PC adder

Note: modern CPUs maintain separate instruction and data memories as well as separate address adders, but we reduce them here because the same component can be used for different purposes in different cycles

- Adding temporary registers
  - ★ Instruction Register: IR

Multicycle Implementation

STUDENTS-HUB.com

- ★ Memory Data Register: MDR
- ★ Register file output data registers: A and B
- ★ ALU output register: ALUout
- \* Required to store major unit output values for use in next cycle

# Multicycle Datapath Changes - cont'd

- This multicycle design can accommodate
  - ✤ One memory access per cycle
    - $\diamond$  IR register saves fetched instruction
    - $\diamond\,$  MDR register saves the read memory data
  - ✤ One register file access per cycle
    - $\diamond\,$  Two registers can be read concurrently into A and B registers
  - ✤ One ALU operation per cycle
    - ♦ ALUout register saves the ALU output
- Additional multiplexers are also needed
  - ★ Mux before the memory address to select PC or ALUout address
  - ★ Mux before 1<sup>st</sup> ALU input to select PC to increment or A register
  - \* Extended mux before PC to increment PC, branch, or jump

Multicycle Implementation STUDENTS-HUB.com © Muhamed Mudawar, COE 308 – KFUPM

Slide 6

# Multicycle Datapath + Control Signals



### **Control Signals**

| Signal   | Effect when '0'                                  | Effect when '1'                               |
|----------|--------------------------------------------------|-----------------------------------------------|
| RegDst   | Destination register = Rt                        | Destination register = Rd                     |
| RegWrite | None                                             | Register(RW) ← BusW                           |
| ExtOp    | 16-bit immediate is zero-extended                | 16-bit immediate is sign-extended             |
| ALUSrcA  | 1 <sup>st</sup> ALU operand is PC (upper 30-bit) | 1 <sup>st</sup> ALU operand is the A register |
| ALUSrcB  | 2 <sup>nd</sup> ALU operand is the B register    | 2 <sup>nd</sup> ALU input is extended-imm16   |
| MemRead  | None                                             | MemData ← Memory[address]                     |
| MemWrite | None                                             | Memory[address] ← Data_in                     |
| MemtoReg | BusW = ALUout                                    | BusW = MDR                                    |
| lorD     | Memory Address = PC                              | Memory Address = ALUout                       |
| IRWrite  | None                                             | IR ← MemData                                  |
| PCWrite  | None                                             | PC ← NextPC                                   |

| Signal   | Value | Effect                                                           |
|----------|-------|------------------------------------------------------------------|
|          | 00    | NextPC = PC[31:2] + 1 (increment upper 30 bits of PC)            |
| PCSource | 01    | NextPC = ALUout = PC[31:2] + 1 + sign-extend(imm16) (for branch) |
|          | 10    | NextPC = PC[31:28], imm26 (for jump)                             |

Multicycle Implementation

STUDENTS-HUB.com

© Muhamed Mudawar, COE 308 – KFUPM

Slide 8

### 1. Instruction Fetch Cycle



STUDENTS-HUB.com

### 2. Decode and Register Fetch



### 3a. Execute Cycle for R-type



# 3b. Compute Address for Load/Store



# 3c. Branch Completion



#### 4a. ALU Instruction Completion



# 4b. Memory Access for Load & Store



#### 5. Load Instruction Completion



STUDENTS-HUB.com

### Instruction Execution Summary

| Cycle                                                                                            | Action                                                                                                                                          | Register Transfers                                                                                                                                                                                                              |
|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1                                                                                                | Fetch instruction                                                                                                                               | $IR \leftarrow Memory[PC]$ , $PC \leftarrow PC + 4$                                                                                                                                                                             |
| 2                                                                                                | Decode instruction<br>Fetch registers<br>Compute branch address in advance<br>Jump completion (case of a jump)                                  | Generate control signals<br>$A \leftarrow \text{Reg}[\text{Rs}], B \leftarrow \text{Reg}[\text{Rt}]$<br>$ALUout \leftarrow \text{PC}[31:2] + \text{sign-extend}(\text{Imm16})$<br>$PC \leftarrow \text{PC}[31:28], \text{Im26}$ |
| 3                                                                                                | Case 1: Execute R-type ALU<br>Case 2: Execute I-type ALU<br>Case 3: Compute load/store address<br>Case 4: Branch completion                     | ALUout $\leftarrow$ A funct B<br>ALUout $\leftarrow$ A op extend(Imm16)<br>ALUout $\leftarrow$ A + sign-extend(Imm16)<br>if (Branch) PC $\leftarrow$ ALUout                                                                     |
| 4                                                                                                | Case 1: Write ALU result for R-type<br>Case 2: Write ALU result for I-type<br>Case 3: Access memory for load<br>Case 4: Access memory for store | Reg[Rd] ← ALUout<br>Reg[Rt] ← ALUout<br>MDR ← Memory[ALUout]<br>Memory[ALUout] ← B                                                                                                                                              |
| 5                                                                                                | Load instruction completion                                                                                                                     | Reg[Rt] ← MDR                                                                                                                                                                                                                   |
| Multicycle Implementation © Muhamed Mudawar, COE 308 – KFUPM Slide 17   FNTS Uplesded Dyy libres |                                                                                                                                                 |                                                                                                                                                                                                                                 |

STUDENTS-HUB.com

# Defining the Control

- Control for multicycle datapath is more complex
  - ★ Because instruction is executed as a sequence of steps
- Values of control signals depend upon:
  - ★ What instruction is being executed
  - ★ Which cycle is being performed
- Multicycle control is a Finite State Machine (FSM)
  - ★ While single-cycle control is a combinational logic
- Two implementation techniques for multicycle control
  - \* Set of states and transitions implemented directly in logic
  - \* Microprogramming: a programming representation for control

Multicycle Implementation STUDENTS-HUB.com © Muhamed Mudawar, COE 308 – KFUPM

#### Multicycle Datapath + Control



#### High Level View of FSM Control



Multicycle Implementation STUDENTS-HUB.com © Muhamed Mudawar, COE 308 – KFUPM

Slide 20 Uploaded By: Jibreel Bornat

# State Diagram for Multicycle Control



#### Finite State Machine Controller



STUDENTS-HUB.com

#### Finite State Machine for Control

#### □ Implementation:



STUDENTS-HUB.com

#### ALU Control Block



STUDENTS-HUB.com

#### **ROM** Implementation

- □ ROM = "Read Only Memory"
  - \* values of memory locations are fixed ahead of time
- □ A ROM can be used to implement a truth table
  - \* if the address is m-bits, we can address  $2^{m}$  entries in the ROM.
  - \* our outputs are the bits of data that the address points to.



m is the "heigth", and n is the "width"

STUDENTS-HUB.com

# **ROM** Implementation

How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines

(i.e.,  $2^{10} = 1024$  different addresses)

- How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs
- ✤ ROM is 2<sup>10</sup> x 20 = 20K bits
- Rather wasteful, since for lots of the entries, the outputs are the same

- i.e., opcode is often ignored

#### **PLA Implementation**

#### Programmable Logic Array



STUDENTS-HUB.com

### Performance multi cycle

Assume the following latencies for the data path components.

- Memory : 20ns. Register File: 12ns, ALU: 16 ns, Temp. Reg: 2ns.
- MUX time and all other components is Ons.
- Assume a program has a  $2x10^8$  and frequencies of instruction types as follows: ALU 50%, Load 20%, Store 10% and Branch 20%.
- Compare processor performance assume single cycle and multicycle data paths. CPI single cycle is 1 and Cycle Time is (LW)  $2 \times 20 + 2 \times 12 + 16 = 80$  ns.
- Average CPI multi cycle is = .5x 4 + .2x 5 + .1x 4 + .2x 3 = 4
- Cycle time multicycle is Max (20, 12, 16, 20, 12) + 2 = 22 ns.
- ET single cycle =  $2 \times 10^8 \times 1 \times 80 \times 10^{-9}$  = 16 seconds.
- ET multicycle =  $2 \times 10^8 \times 4 \times 22 \times 10^{-9} = 17.6$  seconds.

Which is faster why? Multi cycle is slower due the large variations in stage times. The stage times must be chosen to have equal or very close latencies to each others 80/5 = 16 ns per stage and 2 ns for temp registers. Then ideal cycle time is 18ns. ET multicycle = 2 x 10<sup>8</sup> x 4 x 18 x 10<sup>-9</sup> = 14.4 seconds.

Speedup = 16/14.4 = 1.1111% faster multicycle than single cycle.STUDENTS-HUB.comUploaded By: Jibreel Bornat

# Multicycle Implementation Summary

- Reduces hardware
  - **\*** One unified memory for instruction and data, and one ALU
- Breaks instruction execution into steps (step = 1 cycle)
- ✤ Internal registers in data path
  - \* Save intermediate data for later cycles
- ✤ Finite State Machine (FSM) specification of control
- Implementation of control
  - **\*** Hardwired control  $\Rightarrow$  a sequential machine
  - \* Microprogramming (See textbook)
- ✤ Reduces clock cycle and time

**Cycle time** = Maximum delay due to any stage + Delay for writing state registers When compared to single-cycle implementation (Maximum instruction delay)

*Multicycle Implementation* STUDENTS-HUB.com © Muhamed Mudawar, COE 308 – KFUPM