



# CS305: Computer Architecture

Instruction Pipelining-II

https://www.cse.iitb.ac.in/~biswa/courses/CS305/main.html

## Multi-cycle vs Pipelined



## Vanilla 5-stage pipeline



#### Resource Utilization




## Visualizing Pipeline



## Visualizing Pipeline: Execution time



For a K-stage pipeline executing N instructions:

- First instruction: K cycles
- Next N-1 instructions: N-1 cycles (one instruction completes every cycle once the pipeline is full)
- Total = K + (N-1) cycles
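The cycle count above can be sketched as a small helper. This is a minimal illustration that assumes an ideal pipeline with no stalls; the function name `pipeline_cycles` is made up for this example.

```python
def pipeline_cycles(k: int, n: int) -> int:
    """Cycles for an ideal k-stage pipeline (no stalls) to finish n instructions."""
    if k < 1 or n < 0:
        raise ValueError("k must be >= 1 and n must be >= 0")
    if n == 0:
        return 0
    # The first instruction takes k cycles to drain through all stages;
    # each later instruction finishes one cycle after its predecessor.
    return k + (n - 1)


print(pipeline_cycles(5, 4))  # 5 + 3 = 8 cycles
```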


## Latency and Bandwidth revisited

- Latency
  - time it takes to complete one instance

- Throughput
  - number of computations done per unit time

Let's see how much throughput can be achieved.

## Pipelined versus Single cycle CPU design

| Instruction | Ifetch (ns) | Decode (ns) | Execute (ns) | Memory (ns) | Writeback (ns) | Total time (ns) |
|-------------|-------------|-------------|--------------|-------------|----------------|-----------------|
| LOAD        | 200         | 100         | 200          | 200         | 100            | 800             |
| STORE       | 200         | 100         | 200          | 200         |                | 700             |
| ADD         | 200         | 100         | 200          |             | 100            | 600             |
| BRANCH      | 200         | 100         | 200          |             |                | 500             |

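Total latency in single-cycle CPU: 4 × 800 ns = 3200 ns (the clock must accommodate the slowest instruction, LOAD, so every instruction takes 800 ns)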

Total latency in pipelined CPU (200 ns clock cycle):

1000 ns (1st instruction: 5 stages × 200 ns) + 3 × 200 ns (for the next three) = 1600 ns
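The two totals can be reproduced from the per-stage latencies in the table. The sketch below is illustrative only: the names `STAGE_NS`, `single_cycle_total_ns` and `pipelined_total_ns` are hypothetical, and it assumes an ideal pipeline whose clock equals the slowest stage.

```python
# Per-stage latencies in ns, copied from the table (unused stages omitted).
STAGE_NS = {
    "LOAD":   [200, 100, 200, 200, 100],
    "STORE":  [200, 100, 200, 200],
    "ADD":    [200, 100, 200, 100],
    "BRANCH": [200, 100, 200],
}
NUM_STAGES = 5  # Ifetch, Decode, Execute, Memory, Writeback


def single_cycle_total_ns(program):
    # Single-cycle design: the clock must fit the slowest instruction (LOAD).
    clock = max(sum(stages) for stages in STAGE_NS.values())
    return clock * len(program)


def pipelined_total_ns(program):
    # Pipelined design: the clock must fit the slowest stage.
    if not program:
        return 0
    clock = max(max(stages) for stages in STAGE_NS.values())
    return clock * (NUM_STAGES + len(program) - 1)


PROGRAM = ["LOAD", "STORE", "ADD", "BRANCH"]
print(single_cycle_total_ns(PROGRAM))  # 3200
print(pipelined_total_ns(PROGRAM))     # 1600
```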

## What's the big deal?

Speedup = 3200ns/1600ns = 2X

What if we have a billion instructions?

Single cycle = 1 billion × 800 ns = 800 seconds

Pipelined = 1000 ns + (1 billion - 1) × 200 ns ≈ 200 seconds

Speedup = 4X ☺
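Why the speedup grows from 2X to 4X can be seen by writing it as a function of the instruction count. The symbols $T_{sc}$ (single-cycle clock, 800 ns here), $T_{clk}$ (pipelined clock, 200 ns here), $K$ (number of stages, 5) and $N$ (number of instructions) are notation introduced for this sketch, assuming an ideal pipeline without stalls:

$$
\text{Speedup}(N) = \frac{N \cdot T_{sc}}{(K + N - 1) \cdot T_{clk}} \;\longrightarrow\; \frac{T_{sc}}{T_{clk}} = \frac{800\,\text{ns}}{200\,\text{ns}} = 4 \quad \text{as } N \to \infty
$$

For $N = 4$ this gives $3200 / 1600 = 2$. The $K - 1$ fill cycles become negligible as $N$ grows.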

## Let's include latch latency too

Inter-stage latch = 10ns

New clock cycle time in the pipelined design = 210ns

The first instruction completes at 1040 ns (five stages × 200 ns + four inter-stage latches × 10 ns)

New speedup (for a billion instructions) = 800 s / 210 s ≈ 3.8X
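A short sketch of the same arithmetic with the latch overhead included. The function name and its default parameters are assumptions for illustration; the numbers are the ones used on this slide.

```python
def speedup_with_latches(n, single_cycle_ns=800, stage_ns=200,
                         latch_ns=10, stages=5):
    """Speedup of the pipelined design over single-cycle for n instructions."""
    clock_ns = stage_ns + latch_ns  # 210 ns per pipelined cycle
    # First instruction: five stages plus the four latches between them.
    first_ns = stages * stage_ns + (stages - 1) * latch_ns  # 1040 ns
    pipelined_ns = first_ns + (n - 1) * clock_ns
    return (n * single_cycle_ns) / pipelined_ns


print(round(speedup_with_latches(10**9), 2))  # 3.81
```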

## How to Divide the Datapath?

Suppose memory is significantly slower than the other stages. For example, suppose:

```
t_IM  = 10 units
t_DM  = 10 units
t_ALU = 5 units
t_RF  = 1 unit
t_RW  = 1 unit
```

Since the slowest stage determines the clock period, it may be possible to combine some of the faster stages without any loss of performance.
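To make this concrete, the sketch below computes the clock period for a given grouping of units into stages. The particular 4-stage grouping (RF merged with ALU) is a hypothetical choice for illustration, not one stated in the slides, and `clock_period` is a made-up helper name.

```python
# Unit latencies from the example above (arbitrary time units).
LATENCY = {"IM": 10, "RF": 1, "ALU": 5, "DM": 10, "RW": 1}


def clock_period(grouping):
    """Clock period = latency of the slowest stage, where a stage is a list of units."""
    return max(sum(LATENCY[unit] for unit in stage) for stage in grouping)


five_stage = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]
# Hypothetical grouping: merge RF and ALU into one stage (1 + 5 = 6 <= 10).
four_stage = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]

print(clock_period(five_stage))  # 10
print(clock_period(four_stage))  # 10, merging costs nothing here
print(sum(LATENCY.values()))     # 27, the unpipelined cycle time
```

Both groupings are limited by the 10-unit memory stages, so dropping from five stages to four leaves the clock unchanged.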

## Number of Stages and Speedup

| Assumptions | Unpipelined $t_C$ | Pipelined $t_C$ | Speedup |
|-------------|-------------------|-----------------|---------|
| 1. $t_{IM} = t_{DM} = 10$, $t_{ALU} = 5$, $t_{RF} = t_{RW} = 1$; 4-stage pipeline | 27 | 10 | |
| 2. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$; 4-stage pipeline | 25 | | |
| 3. $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RW} = 5$; 5-stage pipeline | 25 | | |

#### Tashakkor Mikonam (Thank you)