# Single Cycle Processor Design

STUDENTS-HUB.com

## **Presentation Outline**

Designing a Processor: Step-by-Step

Datapath Components and Clocking

Assembling an Adequate Datapath

Controlling the Execution of Instructions

Main, ALU, and PC Control

STUDENTS-HUB.com

## **CPU Organization (Design)**

#### Datapath Design:

- Capabilities & performance characteristics of principal Functional Units (FUs) needed by ISA instructions
- ♦ (e.g., Registers, ALU, Shifters, Logic Units, ...)
- ♦ Ways in which these components are interconnected (buses connections, multiplexors, etc.).
- $\diamond$  How information flows between components.

#### Control Unit Design:

- $\diamond$  Logic and means by which such information flow is controlled.
- Control and coordination of FUs operation to realize the targeted Instruction Set Architecture to be implemented (can either be implemented using a finite state machine or a microprogram).

#### Hardware description with a suitable language, possibly using Register Transfer Notation (RTN).

STUDENTS-HUB.com

# Designing a Processor: Step-by-Step

- 1. Analyze instruction set => datapath requirements
  - ♦ The meaning of each instruction is given by the register transfers
  - ♦ Datapath must include storage elements for ISA registers
  - ♦ Datapath must support each register transfer
- 2. Select datapath components and clocking methodology
- 3. Assemble datapath meeting the requirements
- 4. Analyze implementation of each instruction
  - ♦ Determine the setting of control signals for register transfer
- 5. Assemble the control logic

# **Review of MIPS Instruction Formats**

- ✤ All instructions are 32-bit wide
- Three instruction formats: R-type, I-type, and J-type

| Op <sup>6</sup> | Rs⁵                   | Rt⁵ | Rd5sa5funct6            |  | funct <sup>6</sup> |
|-----------------|-----------------------|-----|-------------------------|--|--------------------|
| Op <sup>6</sup> | Rs⁵                   | Rt⁵ | immediate <sup>16</sup> |  |                    |
| Op <sup>6</sup> | address <sup>26</sup> |     |                         |  |                    |

- ♦ Op<sup>6</sup>: 6-bit opcode of the instruction
- ♦ Rs<sup>5</sup>, Rt<sup>5</sup>, Rd<sup>5</sup>: 5-bit source and destination register numbers
- $\diamond$  sa<sup>5</sup>: 5-bit shift amount used by shift instructions
- ♦ funct<sup>6</sup>: 6-bit function field for R-type instructions
- ♦ immediate<sup>16</sup>: 16-bit immediate constant or PC-relative offset
- $\diamond$  address<sup>26</sup>: 26-bit target address of the jump instruction

STUDENTS-HUB.com

#### MIPS Five Addressing Modes

1 <u>Register Addressing:</u>

Where the operand is a register (R-Type)

2 Immediate Addressing:

Where the operand is a constant in the instruction (I-Type, ALU)

3 Base or Displacement Addressing:

Where the operand is at the memory location whose address is the sum of a register and a constant in the instruction (I-Type, load/store)

4 <u>PC-Relative Addressing:</u>

Where the address is the sum of the PC and the 16-address field in the instruction shifted left 2 bits. (I-Type, branches)

5 <u>Pseudodirect Addressing</u>:

Where the jump address is the 26-bit jump target from the instruction shifted left 2 bits concatenated with the 4 upper bits of the PC (J-Type)

#### MIPS Addressing Modes/Instructio n Formats

STUDENTS-HUB.com

## MIPS Subset of Instructions

- Only a subset of the MIPS instructions is considered
  - ALU instructions (R-type): add, sub, and, or, xor, slt
  - Immediate instructions (I-type): addi, slti, andi, ori, xori
  - ♦ Load and Store (I-type): Iw, sw
  - ♦ Branch (I-type): beq, bne
  - ♦ Jump (J-type): j
- This subset does not include all the integer instructions
- But sufficient to illustrate design of datapath and control
- Concepts used to implement the MIPS subset are used to construct a broad spectrum of computers

STUDENTS-HUB.com

# Details of the MIPS Subset

|      | nstruction                   | Meaning          |                     |                 | Fo                    | rmat                 |                  |      |
|------|------------------------------|------------------|---------------------|-----------------|-----------------------|----------------------|------------------|------|
| add  | rd, rs, rt                   | addition         | $0p^{6} = 0$        | rs⁵             | rt <sup>5</sup>       | rd⁵                  | 0                | 0x20 |
| sub  | rd, rs, rt                   | subtraction      | $0p^{6} = 0$        | rs <sup>5</sup> | rt⁵                   | rd⁵                  | 0                | 0x22 |
| and  | rd, rs, rt                   | bitwise and      | $0p^{6} = 0$        | rs⁵             | rt⁵                   | rd⁵                  | 0                | 0x24 |
| or   | rd, rs, rt                   | bitwise or       | $0p^{6} = 0$        | rs⁵             | rt⁵                   | rd⁵                  | 0                | 0x25 |
| xor  | rd, rs, rt                   | exclusive or     | op <sup>6</sup> = 0 | rs <sup>5</sup> | rt⁵                   | rd⁵                  | 0                | 0x26 |
| slt  | rd, rs, rt                   | set on less than | $0p^{6} = 0$        | rs⁵             | rt⁵                   | rd⁵                  | 0                | 0x2a |
| addi | rt, rs, imm <sup>16</sup>    | add immediate    | 0x08                | rs⁵             | rt⁵                   | imm <sup>16</sup>    |                  |      |
| slti | rt, rs, imm <sup>16</sup>    | slt immediate    | 0x0a                | rs <sup>5</sup> | rt⁵                   | imm <sup>16</sup>    |                  |      |
| andi | rt, rs, imm <sup>16</sup>    | and immediate    | 0x0c                | rs <sup>5</sup> | rt⁵                   | imm <sup>16</sup>    |                  |      |
| ori  | rt, rs, imm <sup>16</sup>    | or immediate     | 0x0d                | rs⁵             | rt <sup>5</sup>       | imm <sup>16</sup>    |                  |      |
| xori | rt, imm <sup>16</sup>        | xor immediate    | 0x0e                | rs⁵             | rt <sup>5</sup>       |                      | imm <sup>1</sup> | 6    |
| lw   | rt, imm <sup>16</sup> (rs)   | load word        | 0x23                | rs⁵             | rt⁵                   | imm <sup>16</sup>    |                  |      |
| SW   | rt, imm <sup>16</sup> (rs)   | store word       | 0x2b                | rs⁵             | rt⁵                   | imm <sup>16</sup>    |                  |      |
| beq  | rs, rt, offset <sup>16</sup> | branch if equal  | 0x04                | rs <sup>5</sup> | rt⁵                   | offset <sup>16</sup> |                  |      |
| bne  | rs, rt, offset <sup>16</sup> | branch not equal | 0x05                | rs <sup>5</sup> | rt⁵                   | offset <sup>16</sup> |                  |      |
| j    | address <sup>26</sup>        | jump             | 0x02                |                 | address <sup>26</sup> |                      |                  |      |

STUDENTS-HUB.com

# Register Transfer Level (RTL)

- RTL is a description of data flow between registers
- RTL gives a meaning to the instructions
- All instructions are fetched from memory at address PC

#### Instruction RTL Description

| ADD | $Reg(rd) \leftarrow Reg(rs) + Reg(rt);$                                                                                                                              | $PC \gets PC + 4$      |
|-----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|
| SUB | $Reg(rd) \leftarrow Reg(rs) - Reg(rt);$                                                                                                                              | $PC \gets PC + 4$      |
| ORI | $Reg(rt) \leftarrow Reg(rs)   zero_ext(imm^{16});$                                                                                                                   | $PC \gets PC + 4$      |
| LW  | $Reg(rt) \leftarrow MEM[Reg(rs) + sign_ext(imm^{16})];$                                                                                                              | $PC \leftarrow PC + 4$ |
| SW  | $MEM[Reg(rs) + sign\_ext(imm^{16})] \leftarrow Reg(rt);$                                                                                                             | $PC \leftarrow PC + 4$ |
| BEQ | if $(\text{Reg}(\text{rs}) == \text{Reg}(\text{rt}))$<br>$PC \leftarrow PC + 4 + 4 \times \text{sign}_\text{ext}(\text{offset}^{16})$<br>else $PC \leftarrow PC + 4$ |                        |

STUDENTS-HUB.com

## Instruction Fetch/Execute

R-type Fetch instruction: Instruction  $\leftarrow$  MEM[PC] Fetch operands: data1  $\leftarrow$  Reg(rs), data2  $\leftarrow$  Reg(rt) Execute operation: ALU\_result  $\leftarrow$  func(data1, data2) Write ALU result:  $Reg(rd) \leftarrow ALU_result$ Next PC address:  $PC \leftarrow PC + 4$ Fetch instruction: ✤ I-type Instruction  $\leftarrow$  MEM[PC] Fetch operands: data1  $\leftarrow$  Reg(rs), data2  $\leftarrow$  Extend(imm<sup>16</sup>) Execute operation: ALU\_result  $\leftarrow$  op(data1, data2) Write ALU result:  $Reg(rt) \leftarrow ALU_result$ Next PC address:  $PC \leftarrow PC + 4$  $\Leftrightarrow$  BEQ Fetch instruction: Instruction  $\leftarrow$  MEM[PC] Fetch operands: data1  $\leftarrow$  Reg(rs), data2  $\leftarrow$  Reg(rt) Equality:  $zero \leftarrow subtract(data1, data2)$ Branch: if (zero)  $PC \leftarrow PC + 4 + 4 \times sign_ext(offset^{16})$  $PC \leftarrow PC + 4$ else

STUDENTS-HUB.com

## Instruction Fetch/Execute - cont'd

- ♦ LW Fetch instruction: Instruction ← MEM[PC] Fetch base register: base ← Reg(rs) Calculate address: address ← base + sign\_extend(imm<sup>16</sup>) Read memory: data ← MEM[address] Write register Rt: Reg(rt) ← data Next PC address: PC ← PC + 4
- SW Fetch instruction:
   Fetch registers:
   Calculate address:
   Write memory:
   Next PC address:
- Instruction  $\leftarrow$  MEM[PC] base  $\leftarrow$  Reg(rs), data  $\leftarrow$  Reg(rt) address  $\leftarrow$  base + sign\_extend(imm<sup>16</sup>) MEM[address]  $\leftarrow$  data PC  $\leftarrow$  PC + 4
- ◆ Jump Fetch instruction: Instruction ← MEM[PC]
   Target PC address: target ← PC[31:28] || address<sup>26</sup> || '00'
   Jump: PC ← target

STUDENTS-HUB.com

## **Basic MIPS Instruction Processing Steps**



STUDENTS-HUB.com

#### **Overview of MIPS Instruction Micro-operations**

- ✤ All instructions go through these common steps:
  - ♦ Send program counter to instruction memory and <u>fetch the instruction</u>.
     (*fetch*) Instruction ← Mem[PC]
  - ♦ Update the program counter to point to next instruction  $PC \leftarrow PC + 4$
  - ♦ Read one or two registers, using instruction fields. (decode)
    - Load reads one register only.
- Additional instruction execution actions (*execution*) depend on the instruction in question, but similarities exist:
  - ♦ All instruction classes <u>use the ALU</u> after reading the registers:
    - Memory reference instructions use it for address calculation.
    - Arithmetic and logic instructions (R-Type), use it for the specified operation.
    - Branches use it for comparison.
- ✤ Additional execution steps where instruction classes differ:
  - ♦ <u>Memory reference instructions:</u> Access memory for a load or store.
  - ♦ Arithmetic and logic instructions: Write ALU result back in register.
  - ♦ Branch instructions: Change next instruction address based on comparison.

## Requirements of the Instruction Set

#### ✤ Memory

- ♦ Instruction memory where instructions are stored
- Data memory where data is stored
- Registers

  - ♦ Read source register Rs
  - ♦ Read source register Rt
  - ♦ Write destination register Rt or Rd
- Program counter PC register and Adder to increment PC
- Sign and Zero extender for immediate constant
- ALU for executing instructions

#### Next . . .

Designing a Processor: Step-by-Step

#### Datapath Components and Clocking

Assembling an Adequate Datapath

Controlling the Execution of Instructions

Main, ALU, and PC Control

STUDENTS-HUB.com

# Components of the Datapath



STUDENTS-HUB.com

# Register Element

- Register
  - ♦ Similar to the D-type Flip-Flop
- n-bit input and output
- Write Enable (WE):
  - ♦ Enable / disable writing of register
  - ♦ Negated (0): Data\_Out will not change
  - Asserted (1): Data\_Out will become Data\_In after clock edge
- Edge triggered Clocking
  - Register output is modified at clock edge





# MIPS Register File

- Register File consists of 31 × 32-bit registers
  - ♦ BusA and BusB: 32-bit output busses for reading 2 registers
  - BusW: 32-bit input bus for writing a register when RegWrite is 1
  - $\diamond$  Two registers read and one written in a cycle
- ✤ Registers are selected by:
  - RA selects register to be read on BusA
  - RB selects register to be read on BusB
  - ♦ RW selects the register to be written
- Clock input
  - ♦ The clock input is used ONLY during write operation
  - ♦ During read, register file behaves as a combinational logic block
    - RA or RB valid => BusA or BusB valid after access time





## Details of the Register File



STUDENTS-HUB.com

## Tri-State Buffers

- Allow multiple sources to drive a single bus
- **Two Inputs:** 
  - Data\_in  $\diamond$
  - Enable (to enable output)
- One Output: Data\_out
  - ♦ If (Enable) Data\_out = Data\_in

else Data\_out = High Impedance state (output is disconnected)

Tri-state buffers can be used to build multiplexors





#### Another Implementation



# Building a Multifunction ALU



# Details of the Shifter

- Implemented with multiplexers and wiring
- Shift Operation can be: SLL, SRL, SRA, or ROR
- Input Data is extended to 63 bits according to Shift Op
- The 63 bits are shifted right according to  $S_4S_3S_2S_1S_0$



## Details of the Shifter - cont'd

- Input data is extended from 32 to 63 bits as follows:
  - $\Rightarrow$  If shift op = SRL then ext\_data[62:0] = 0<sup>31</sup> || data[31:0]
  - $\Rightarrow$  If shift op = SRA then ext\_data[62:0] = data[31]<sup>31</sup> || data[31:0]
  - If shift op = ROR then ext\_data[62:0] = data[30:0] || data[31:0]
  - ♦ If shift op = SLL then ext\_data[62:0] = data[31:0] ||  $0^{31}$
- For SRL, the 32-bit input data is zero-extended to 63 bits
- For SRA, the 32-bit input data is sign-extended to 63 bits
- For ROR, 31-bit extension = lower 31 bits of data
- Then, shift right according to the shift amount
- As the extended data is shifted right, the upper bits will be: 0 (SRL), sign-bit (SRA), or lower bits of data (ROR)

STUDENTS-HUB.com

# Implementing Shift Left Logical

- The wiring of the above shifter dictates a right shift
- However, we can convert a left shift into a right shift
- ✤ For SLL, 31 zeros are appended to the right of data
  - $\diamond$  To shift left by 0 is equivalent to shifting right by 31
  - $\diamond$  To shift left by 1 is equivalent to shifting right by 30
  - $\diamond$  To shift left by 31 is equivalent to shifting right by 0
  - ♦ Therefore, for SLL use the 1's complement of the shift amount
- ✤ ROL is equivalent to ROR if we use (32 rotate amount)
- ✤ ROL by 10 bits is equivalent to ROR by (32–10) = 22 bits
- Therefore, software can convert ROL to ROR

## Instruction and Data Memories

- Instruction memory needs only provide read access
  - ♦ Because datapath does not write instructions
  - ♦ Behaves as combinational logic for read
  - Address selects Instruction after access time
- Data Memory is used for load and store
  - MemRead: enables output on Data\_out
    - Address selects the word to put on Data\_out
  - MemWrite: enables writing of Data\_in
    - Address selects the memory word to be written
    - The Clock synchronizes the write operation
- Separate instruction and data memories
  - $\diamond$  Later, we will replace them with caches

Address Instruction



# Clocking Methodology

- Clocks are needed in a sequential logic to decide when a state element (register) should be updated
- To ensure correctness, a clocking methodology defines when data can be written and read



- We assume edgetriggered clocking
- All state changes occur on the same clock edge
- Data must be valid and stable before arrival of clock edge
- Edge-triggered clocking allows a register to be read and written during same clock cycle

# Determining the Clock Cycle

With edge-triggered clocking, the clock cycle must be long enough to accommodate the path from one register through the combinational logic to another register



- T<sub>clk-q</sub>: clock to output delay through register
- T<sub>max\_comb</sub>: longest delay through combinational logic
- T<sub>s</sub>: setup time that input to a register must be stable before arrival of clock edge
- T<sub>h</sub>: hold time that input to a register must hold after arrival of clock edge
- Hold time (T<sub>h</sub>) is normally satisfied since T<sub>clk-q</sub> > T<sub>h</sub>

## Clock Skew

- Clock skew arises because the clock signal uses different paths with slightly different delays to reach state elements
- Clock skew is the difference in absolute time between when two storage elements see a clock edge
- ✤ With a clock skew, the clock cycle time is increased

Clock skew is reduced by balancing the clock delays

## **Clocking Methodologies**

- State element design choices
  - $\diamond$  Latches:
    - Output responds to input changes only when the clock is asserted
    - Assert means "logically true"(could mean electrically low)
  - ♦ Flip-flop:
    - State changes only on a clock edge
    - (edge-triggered methodology)

#### State Elements

Level sensitive D latch

#### ✤ D flip flop

| STUDEN <sup>®</sup> | TS-HUB.com |
|---------------------|------------|
|---------------------|------------|

#### **Overview:** Processor Implementation Styles

#### Single Cycle

- ♦ perform each instruction in 1 clock cycle
- ♦ clock cycle must be long enough for slowest instruction; therefore,
- $\diamond$  disadvantage: only as fast as slowest instruction

#### Multi-Cycle

- ♦ break fetch/execute cycle into multiple steps
- ♦ perform 1 step in each clock cycle
- ♦ advantage: each instruction uses only as many cycles as it needs

#### Pipelined

- $\diamond$  execute each instruction in multiple steps
- ♦ perform 1 step / instruction in each clock cycle

# Single Cycle, Multiple Cycle, vs. Pipeline

STUDENTS-HUB.com

#### Next . . .

Designing a Processor: Step-by-Step

Datapath Components and Clocking

Assembling an Adequate Datapath

Controlling the Execution of Instructions

Main, ALU, and PC Control

STUDENTS-HUB.com

# Instruction Fetching Datapath

- We can now assemble the datapath from its components
- ✤ For instruction fetching, we need …
  - ♦ Program Counter (PC) register
  - ♦ Instruction Memory
  - ♦ Adder for incrementing PC





## Datapath for R-type Instructions



#### Control signals

♦ ALUOp is the ALU operation as defined in the funct field for R-type

RegWr is used to enable the writing of the ALU result

STUDENTS-HUB.com

## Datapath for I-type ALU Instructions



- ALUOp is derived from the Op field for I-type instructions
- RegWr is used to enable the writing of the ALU result
- ExtOp is used to control the extension of the 16-bit immediate

STUDENTS-HUB.com

\*\*

# Combining R-type & I-type Datapaths



as either Rt or Rd Another mux selects 2<sup>nd</sup> ALU input as either data on BusB or the extended immediate

A mux selects RW

#### Control signals

- ALUOp is derived from either the Op or the funct field
- RegWr enables the writing of the ALU result
- ExtOp controls the extension of the 16-bit immediate
- RegDst selects the register destination as either Rt or Rd
- ♦ ALUSrc selects the 2<sup>nd</sup> ALU source as BusB or extended immediate

STUDENTS-HUB.com

#### Controlling ALU Instructions



For R-type ALU instructions, RegDst is '1' to select Rd on RW and ALUSrc is '0' to select BusB as second ALU input. The active part of datapath is shown in green

For I-type ALU instructions, RegDst is '0' to select Rt on RW and ALUSrc is '1' to select Extended immediate as second ALU input. The active part of datapath is shown in green

STUDENTS-HUB.com

#### Details of the Extender

- Two types of extensions
  - ♦ Zero-extension for unsigned constants
  - ♦ Sign-extension for signed constants
- Control signal ExtOp indicates type of extension
- Extender Implementation: wiring and one AND gate



## Adding Data Memory to Datapath

#### ✤ A data memory is added for load and store instructions



ALU calculates data memory address

A 3<sup>rd</sup> mux selects data on BusW as either ALU result or memory data\_out

Additional Control signals

- MemRd for load instructions
- MemWr for store instructions

BusB is connected to Data\_in of Data Memory for store instructions

♦ WBdata selects data on BusW as ALU result or Memory Data\_out

STUDENTS-HUB.com

## Controlling the Execution of Load



| RegDst = '0' selects RtRegWr = '1' to enableExtOp = 1 to sign-extendas destination registerwriting of register fileImmmediate16 to 32 bits |
|--------------------------------------------------------------------------------------------------------------------------------------------|
|--------------------------------------------------------------------------------------------------------------------------------------------|

ALUSrc = '1' selects extended immediate as second ALU input

ALUOp = 'ADD' to calculate data memory address as Reg(Rs) + sign-extend(Imm16)

| MemRd = '1' to read | WBdata = '1' places the data read | Clock edge updates PC |
|---------------------|-----------------------------------|-----------------------|
| data memory         | from memory on BusW               | and Register Rt       |

#### STUDENTS-HUB.com

## Controlling the Execution of Store



| RegDst = 'X' because no                          | RegWr = '0' to disable         |                                                                                   | ExtOp = 1 to sign-extend |  |
|--------------------------------------------------|--------------------------------|-----------------------------------------------------------------------------------|--------------------------|--|
| register is written                              | writing of register file       |                                                                                   | Immmediate16 to 32 bits  |  |
| ALUSrc = '1' selects ex<br>immediate as second A |                                | ALUOp = 'ADD' to calculate data memory<br>address as Reg(Rs) + sign-extend(Imm16) |                          |  |
| MemWr = '1' to write                             | WBdata = 'X' because don't car |                                                                                   | e Clock edge updates PC  |  |
| data memory                                      | what data is put on BusW       |                                                                                   | and Data Memory          |  |

# Adding Jump and Branch to Datapath



Additional Control Signals

PCSrc for PC control: 1 for a jump and 2 for a taken branch

Zero flag for branch control: whether branch is taken or not STUDENTS-HUB.com

## Controlling the Execution of a Jump



MemRd = MemWr = RegWr = 0, Don't care about other control signals

Clock edge updates PC register only

STUDENTS-HUB.com

## Controlling the Execution of a Branch



ALUSrc = 0, ALUOp = SUB, ExtOp = 1, MemRd = MemWr = RegWr = 0

Clock edge updates PC register only

STUDENTS-HUB.com

#### Next . . .

Designing a Processor: Step-by-Step

Datapath Components and Clocking

Assembling an Adequate Datapath

Controlling the Execution of Instructions

#### Main, ALU, and PC Control

#### Main, ALU, and PC Control



STUDENTS-HUB.com

#### Single-Cycle Datapath + Control



STUDENTS-HUB.com

## Main Control Signals

| Signal | Effect when '0'                                                     | Effect when '1'                                                  |
|--------|---------------------------------------------------------------------|------------------------------------------------------------------|
| RegDst | Destination register = Rt                                           | Destination register = Rd                                        |
| RegWr  | No register is written                                              | Destination register (Rt or Rd) is written with the data on BusW |
| ExtOp  | 16-bit immediate is zero-extended                                   | 16-bit immediate is sign-extended                                |
| ALUSrc | Second ALU operand is the value of register Rt that appears on BusB | Second ALU operand is the value of the extended 16-bit immediate |
| MemRd  | Data memory is NOT read                                             | Data memory is read<br>Data_out ← Memory[address]                |
| MemWr  | Data Memory is NOT written                                          | Data memory is written<br>Memory[address] ← Data_in              |
| WBdata | BusW = ALU result                                                   | BusW = Data_out from Memory                                      |

### Main Control Truth Table

| Ор     | RegDst | RegWr | ExtOp    | ALUSrc   | MemRd | MemWr | WBdata  |
|--------|--------|-------|----------|----------|-------|-------|---------|
| R-type | 1 = Rd | 1     | Х        | 0 = BusB | 0     | 0     | 0 = ALU |
| ADDI   | 0 = Rt | 1     | 1 = sign | 1 = lmm  | 0     | 0     | 0 = ALU |
| SLTI   | 0 = Rt | 1     | 1 = sign | 1 = lmm  | 0     | 0     | 0 = ALU |
| ANDI   | 0 = Rt | 1     | 0 = zero | 1 = lmm  | 0     | 0     | 0 = ALU |
| ORI    | 0 = Rt | 1     | 0 = zero | 1 = lmm  | 0     | 0     | 0 = ALU |
| XORI   | 0 = Rt | 1     | 0 = zero | 1 = lmm  | 0     | 0     | 0 = ALU |
| LW     | 0 = Rt | 1     | 1 = sign | 1 = lmm  | 1     | 0     | 1 = Mem |
| SW     | Х      | 0     | 1 = sign | 1 = lmm  | 0     | 1     | Х       |
| BEQ    | Х      | 0     | 1 = sign | 0 = BusB | 0     | 0     | Х       |
| BNE    | Х      | 0     | 1 = sign | 0 = BusB | 0     | 0     | Х       |
| J      | Х      | 0     | Х        | Х        | 0     | 0     | Х       |

X is a don't care (can be 0 or 1), used to minimize logic

STUDENTS-HUB.com

## Logic Equations for Main Control Signals





## ALU Control Truth Table

| Ор     | funct | ALUOp | 4-bit Coding |
|--------|-------|-------|--------------|
| R-type | AND   | AND   | 0001         |
| R-type | OR    | OR    | 0010         |
| R-type | XOR   | XOR   | 0011         |
| R-type | ADD   | ADD   | 0100         |
| R-type | SUB   | SUB   | 0101         |
| R-type | SLT   | SLT   | 0110         |
| ADDI   | Х     | ADD   | 0100         |
| SLTI   | Х     | SLT   | 0110         |
| ANDI   | Х     | AND   | 0001         |
| ORI    | Х     | OR    | 0010         |
| XORI   | X     | XOR   | 0011         |
| LW     | Х     | ADD   | 0100         |
| SW     | Х     | ADD   | 0100         |
| BEQ    | Х     | SUB   | 0101         |
| BNE    | Х     | SUB   | 0101         |
| J      | Х     | Х     | Х            |

The 4-bit Coding defines the binary ALU operations.

Logic equations are derived for the 4-bit coding.

Other bit-coding can be used. The goal is to simplify the ALU control.

STUDENTS-HUB.com

#### PC Control Truth Table

| Ор                        | Zero flag | PCSrc                     |  |
|---------------------------|-----------|---------------------------|--|
| R-type                    | X         | 0 = Increment PC          |  |
| J                         | X         | 1 = Jump Target Address   |  |
| BEQ                       | 0         | 0 = Increment PC          |  |
| BEQ                       | 1         | 2 = Branch Target Address |  |
| BNE                       | 0         | 2 = Branch Target Address |  |
| BNE                       | 1         | 0 = Increment PC          |  |
| Other than Jump or Branch | X         | 0 = Increment PC          |  |

#### The ALU Zero flag is used by BEQ and BNE instructions

STUDENTS-HUB.com

## PC Control Logic

The PC control logic can be described as follows:

```
if (Op == J) PCSrc = 1;
else if ((Op == BEQ && Zero == 1) ||
             (Op == BNE \&\& Zero == 0)) PCSrc = 2;
else PCSrc = 0;
                                                         Op
                                                       Decoder
Branch = (BEQ.Zero) + (BNE.Zero)
                                                BEQ
                                                       BNE
                                          Zero -
Branch = 1, Jump = 0 \rightarrow PCSrc = 2
Branch = 0, Jump = 1 \rightarrow PCSrc = 1
Branch = 0, Jump = 0 \rightarrow PCSrc = 0
```

STUDENTS-HUB.com



Jump

Branch

#### Drawbacks of Single Cycle Processor

#### Long cycle time

♦ All instructions take as much time as the slowest



Alternative Solution: Multicycle implementation

# Simplified Single Cycle Datapath Timing

- Assuming the following datapath/control hardware components delays:
  - ♦ Memory Units: 2 ns
  - ♦ ALU and adders: 2 ns
  - ♦ Register File: 1 ns
  - ♦ Control Unit < 1 ns</p>



Obtained from low-level target VLSI implementation technology of components

Uploaded By 5 ibreet Bornat

Ignoring Mux and clk-to-Q delays, critical path analysis:



STUDENTS-HisB.manosecond = 10<sup>-9</sup> second

# Drawbacks of Single Cycle Processor

- 1. Long cycle time:
  - ♦ All instructions must take as much time as the slowest
    - Here, cycle time for load is longer than needed for all other instructions.
      - Cycle time must be long enough for the load instruction:
         PC's Clock -to-Q + Instruction Memory Access Time +
         Register File Access Time + ALU Delay (address calculation) +
         Data Memory Access Time + Register File Setup Time + Clock Skew
  - ♦ Real memory is not as well-behaved as idealized memory
    - Cannot always complete data access in one (short) cycle.
- 2. Impossible to implement complex, variable-length instructions and complex addressing modes in a single cycle.
  - $\diamond$  e.g indirect memory addressing.
- 3. High and duplicate hardware resource requirements
  - ♦ Any hardware functional unit cannot be used more than once in a single cycle (e.g. ALUs).
- 4. Does not allow overlap of instruction processing

STUDENTS-HUB.com

#### Alternative: Multicycle Implementation

- Break instruction execution into five steps
  - ♦ Instruction fetch
  - ♦ Instruction decode, register read, target address for jump/branch
  - ♦ Execution, memory address calculation, or branch outcome
  - ♦ Memory access or ALU instruction completion
  - ♦ Load instruction completion
- One clock cycle per step (clock cycle is reduced)
  - ♦ First 2 steps are the same for all instructions

| Instruction | # cycles | Instruction | # cycles |
|-------------|----------|-------------|----------|
| ALU & Store | 4        | Branch      | 3        |
| Load        | 5        | Jump        | 2        |

STUDENTS-HUB.com

### Reducing Cycle Time: Multi-Cycle Design

- Cut combinational dependency graph by <u>inserting registers / latches.</u>
- ✤ The same work is done in two or more shorter cycles, rather than one long cycle.
  - ♦ Different CPI
  - ♦ Share functional units



#### Partitioning the Single-Cycle Design



STUDENTS-HUB.com

#### Where to add registers



#### Control Specification For Multi-cycle CPU Finite State Machine (FSM) - State Transition Diagram



STUDENTS-HUB.com

#### Multi-cycle Datapath Instruction CPI

- ✤ R-Type/Immediate: Require four cycles, <u>CPI = 4</u>
  ♦ IF, ID, EX, WB
- ✤ Loads: Require five cycles, <u>CPI = 5</u>
  ♦ IF, ID, EX, MEM, WB
- Stores: Require four cycles,  $\underline{CPI} = 4$  $\diamond$  IF, ID, EX, MEM
- ✤ Branches/Jumps: Require three cycles, <u>CPI = 3</u>
  ♦ IF, ID, EX
- ✤ Average or effective program CPI:  $3 \leq CPI \leq 5$ depending on program profile (instruction mix).

### Performance Example

- Assume the following operation times for components:
  - ♦ Access time for Instruction and data memories: 200 ps
  - $\diamond$  Delay in ALU and adders: 180 ps
  - ♦ Delay in Decode and Register file access (read or write): 150 ps
  - $\diamond$  Ignore the other delays in PC, mux, extender, and wires
- Which of the following would be faster and by how much?
  - ♦ Single-cycle implementation for all instructions
  - ♦ Multicycle implementation optimized for every class of instructions
- Assume the following instruction mix:
  - ♦ 40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps

STUDENTS-HUB.com

## Solution

| Instruction<br>Class | Instruction<br>Memory | Register<br>Read | ALU<br>Operation | Data<br>Memory | Register<br>Write | Total  |
|----------------------|-----------------------|------------------|------------------|----------------|-------------------|--------|
| ALU                  | 200                   | 150              | 180              |                | 150               | 680 ps |
| Load                 | 200                   | 150              | 180              | 200            | 150               | 880 ps |
| Store                | 200                   | 150              | 180              | 200            |                   | 730 ps |
| Branch               | 200                   | 150              | 180 ←            | Compare and ι  | update PC         | 530 ps |
| Jump                 | 200                   | 150 🕂            | -Decode and ι    | update PC      |                   | 350 ps |

For fixed single-cycle implementation:

- ♦ Clock cycle = 880 ps determined by longest delay (load instruction)
- For multi-cycle implementation:
  - $\diamond$  Clock cycle = max (200, 150, 180) = 200 ps (maximum delay at any step)

 $\Rightarrow$  Average CPI = 0.4×4 + 0.2×5 + 0.1×4+ 0.2×3 + 0.1×2 = 3.8

✤ Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16

STUDENTS-HUB.com

# Summary

- ✤ 5 steps to design a processor
  - Analyze instruction set => datapath requirements
  - Select datapath components & establish clocking methodology
  - Assemble datapath meeting the requirements
  - ♦ Analyze implementation of each instruction to determine control signals
  - ♦ Assemble the control logic
- MIPS makes Control easier
  - $\diamond$  Instructions are of the same size
  - $\diamond$  Source registers always in the same place
  - ♦ Immediate constants are of same size and same location
  - ♦ Operations are always on registers/immediates

STUDENTS-HUB.com