



#### Biswa biswa@cse.iitb.ac.in

#### Shhh... for the next 3600000000000

#### Btw, feel free to ask/interrupt

#### Two checkpoints: Slide #35 and 67

#### My journey at IITM (CS10S003 – CS10D019)

Madhu Mutyam <madhumutyam@gmail.com> to me, Mon, Nov 30, 2009, 10:17 PM

Hi Panda,

We made an offer to you. You will receive the offer letter in a week time.

Regards, Madhu Mutyam

#### Biswabandan Panda

Last position held : PhD Scholar (Roll No: CS10D019)

Duration with CSE Dept : Sep 2012 to Jul 2015

Advisor(s) : Shankar Balachandran


Brownian motion between RISE lab and PACE lab, through DCF

#### Un Dino Ki Baat Hai (Those Were the Days) (Same T-shirt 🕑)



Joined MS, December 2009 🙂

December 2011: Fixing simulator, research, scooped too 😕

January 2012: Progress meeting 😕

February 5, 2012: Applied for MS to PhD? Committee? Nah. Committee? Nah. Committee? Ohhhkkkk.

#### The ups and downs@Research



#### After 2015: It is a daily affair







### Shhh... Time for the Talk now

#### Microarchitecture:101



#### But, Microarchitecture research is dead?




#### The Microarchitecture Research Wall and its Renaissance

by Biswabandan Panda on May 31, 2022 | Tags: Academia, Microarchitecture, Research





#### Hardware is new software

People who are really serious about software should make their own hardware - Alan Kay, father of PCs

#### and... Domain specific processors

#### AWS Graviton Processor

Enabling the best price performance in Amazon EC2


#### Microsoft's Innovative 4-Processor PC

By Rob Enderle, May 30, 2022 4:00 AM PT

#### Facebook is just crazy enough to make its own processors

Job listings for a chip design team have surfaced online.

#### Google Replaces Millions of Intel's CPUs With Its Own Homegrown Chips

By Anton Shilov published June 04, 2021

YouTube now uses homegrown Argos VCUs

Dead? Stop listening ...

#### NVIDIA Grace CPU

Purpose-built to solve the world's largest computing problems.



#### Programs (including OS) running on CPUs







#### Let's run an application

#### Web search, next time ChatGPT maybe 😳

#### CPU: Intel Haswell

#### And the bottleneck [Google's websearch]




#### Back-end memory bottleneck (100s of cycles)



#### Retiring bottleneck in an out-of-order Core



In-order Instruction Fetch

(Multiple fetch in one cycle)

Even an L1 hit has to wait for a DRAM access 😕

#### Microarchitect's dream (impossible though)



#### Retiring bottleneck in an out-of-order Core



ROB (re-order buffer)

Even an L1 hit has to wait? Nah, no more 😳

#### Microarchitects ?

### Microscopic view on microarchitecture problems

#### Microarchitecture solutions



#### Microarchitecture: A Hardware Prefetcher

A hardware prefetcher can make the impossible possible 🙂



#### The reality from last 30 years

#### Academia and industry: L2 prefetchers

### Challenges: many for a practical L1 prefetcher 😕

### A lightweight/high-performing L1 prefetcher: Impossible

#### Guru Gyan



Start solving a research problem when most of the research community says: "we are done"

Andre Seznec, my mentor, post-PhD

What about impact?

• Write the first flagship conference paper on a research topic (US centric, large groups)

or

• Write the final flagship conference paper on a research topic (*My approach: ekla chalo re, "walk alone"*)

#### Is it Really Impossible? Around 2018

Is it *possible* to design a competitive/practical L1 data prefetcher?



#### Why? Challenges



#### Many more issues: bandwidth, port contention, and others

#### Opportunity: L1-D Prefetching



#### Oh yes, it is possible



Bouquet of Instruction Pointer Classifier based prefetching [ISCA 2020]

#### Samuel, Remote Mentee, BITS Pilani [2018-2020]

M.S. @ Texas A&M

# Message I: Dare to be different

### What is the FUSS? Only 2% improvement

Bouquet approach for prefetching



Storage budget, performance improvement: 6KB, 35% | 34KB, 39% | 40KB, 42% | 119KB, 43% | 800B, 45%

#### Well, It is possible

#### 1% improvement matters

- "Microarchitects can kill their grandmothers to get 0.5% improvement"
- "1% improvement in industry is a cause for celebration"
- "1% improvements on multiple microarchitecture ideas make a big difference in the final revenue"

## PAUSE for a minute

What next?

#### Community started looking at L1 prefetchers 🙂

Hang on. What next for me and my mentees? Why no FUSS about energy?



#### Do not forget energy 😳



Energy-Efficient Hardware Data Prefetching [IEEE CAL 2021]

Neelu, M.S. by Research, IIT Kanpur [2019-2021]

Ph.D.@EPFL

#### Energy Consumption: 101



#### Power-gating for mitigating static energy



#### More requests in the memory hierarchy can lead to higher energy consumption


#### Let's Quantify it

No publicly available tool can provide faithful numbers 😕

We showed our results to major research labs 😳 😳

They said No, Yes, and No 😳

It took 10 months, and finally Intel said YES

# Message II: Hang in there, persist++

#### Where is the problem?



**Coverage:** Fraction of cache misses that become hits because of prefetching

**Accuracy:** Fraction of prefetch requests that provide hits

Of course, accuracy is not 100%, not even 90% 😕
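The two metrics above can be written down as simple ratios. The following is a minimal sketch, assuming hypothetical counter names (`baseline_misses`, `misses_with_prefetch`, `prefetches_issued`, `useful_prefetches`) of the kind a simulator might report; the numbers in the usage lines are made up for illustration only.

```python
def coverage(baseline_misses: int, misses_with_prefetch: int) -> float:
    """Fraction of baseline cache misses that become hits with prefetching on."""
    if baseline_misses == 0:
        return 0.0
    return (baseline_misses - misses_with_prefetch) / baseline_misses


def accuracy(prefetches_issued: int, useful_prefetches: int) -> float:
    """Fraction of issued prefetch requests that later serve a demand hit."""
    if prefetches_issued == 0:
        return 0.0
    return useful_prefetches / prefetches_issued


# Illustrative (made-up) counters: 1000 misses without prefetching,
# 400 misses left with prefetching, 800 prefetches issued, 600 of them useful.
cov = coverage(baseline_misses=1000, misses_with_prefetch=400)
acc = accuracy(prefetches_issued=800, useful_prefetches=600)
print(f"coverage={cov:.2f} accuracy={acc:.2f}")
```

Every inaccurate prefetch (the remaining 200 in this made-up case) is an extra request in the memory hierarchy, which is where the energy cost comes from.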

#### Trivial Solution: Instruction Criticality

| In-order instruction sequence | Interaction with memory | Execution cycles | Note     |
|-------------------------------|-------------------------|------------------|----------|
| Load R1, [R2]                 | LLC Miss                | 150 cycles       | ROB Head |
| Add R3, R1, R4                |                         | 1 cycle          |          |
| Load R5, [R6]                 | L2 Hit                  | 15 cycles        |          |
| Mult R7, R3, R8               |                         | 3 cycles         |          |

Prefetch data only for critical loads that delay retiring instructions
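The four-instruction example above can be replayed with a toy retirement model. This is only a sketch under simplifying assumptions that are not from the slides: unlimited issue width, an unbounded ROB, every instruction starting as soon as its producers finish, and in-order retirement. The `Instr` class, the `head_stall_cycles` helper, and the "stall > 0" criticality test are hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    latency: int                 # execution cycles, taken from the table
    is_load: bool = False
    deps: list = field(default_factory=list)  # indices of producer instructions


def head_stall_cycles(program):
    """Cycles each instruction blocks the ROB head before it can retire."""
    done = []
    for ins in program:
        # An instruction starts when all of its producers have completed.
        start = max((done[d] for d in ins.deps), default=0)
        done.append(start + ins.latency)
    stalls, retire_time = [], 0
    for finish in done:
        # In-order retirement: waiting time after older instructions retired.
        stalls.append(max(0, finish - retire_time))
        retire_time = max(retire_time, finish)
    return stalls


program = [
    Instr("Load R1, [R2]", 150, is_load=True),   # LLC miss
    Instr("Add R3, R1, R4", 1, deps=[0]),
    Instr("Load R5, [R6]", 15, is_load=True),    # L2 hit, independent
    Instr("Mult R7, R3, R8", 3, deps=[1]),
]
stalls = head_stall_cycles(program)
critical_loads = [i.name for i, s in zip(program, stalls) if i.is_load and s > 0]
print(stalls, critical_loads)
```

In this toy model the L2-hit load finishes in the shadow of the LLC miss, so only `Load R1, [R2]` delays retirement and would be worth a prefetch.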

#### Why is this a big deal?



#### Is this New?

#### Focusing Processor Policies via Critical-Path Prediction

Brian Fields Shai Rubin Rastislav Bodík

Computer Sciences Department University of Wisconsin-Madison {fields, shai, bodik}@cs.wisc.edu


#### Performance Oriented Prefetching Enhancements Using Commit Stalls

R Manikantan R Govindarajan Indian Institute of Science, Bangalore, India

RMANI@CSA.IISC.ERNET.IN GOVIND@CSA.IISC.ERNET.IN

#### **Criticality-Based Optimizations for Efficient Load Processing**

Samantika Subramaniam, Anne Bracy†, Hong Wang†, Gabriel H. Loh

Georgia Institute of Technology, College of Computing, Atlanta, GA, USA

†Intel Corporation, Microarchitecture Research Laboratory, Santa Clara, CA, USA

#### Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-level Cache Hierarchies

Anant Vithal Nori\*, Jayesh Gaur\*, Siddharth Rai†, Sreenivas Subramoney\* and Hong Wang\*

\*Microarchitecture Research Lab, Intel

## **Key insight**: Only the instructions on the critical execution path matter to performance

### This is a 20-year-old problem/solution

- What is the big deal here?
- Simple idea ☺
- Prior works build a data dependency graph in hardware ☺



#### Simplest metric: ROB Occupancy + Stall frequency



#### Connecting the dots

| Idea    | Storage (Lower) | Performance (Higher) | Energy (Lower) |
|---------|-----------------|----------------------|----------------|
| ISCA'20 | 1KB             | 45%                  | 50%            |
| CAL'21  | +2.5KB          | 43%                  | 45%            |

#### Is this a good deal?



2% loss in performance, no big deal 🙂??

Hardware prefetching research: 2 to 3% performance improvement in 2 years

#### The Pertinent Question



Can we have a prefetcher that is energy-efficient and yet high performing?



#### Pushing the limits of an L1 Prefetcher



Berti, State-of-the-art L1 Prefetcher [MICRO 2022]

Agustin, Ph.D. Universidad de Zaragoza

Defending next week 🙂

#### The Problem and the approach



**Observation-I** 

The deltas for each load (IP) are different

#### Where do existing local prefetchers fail?

In-order LOADs to L1



Out-of-order LOADs to L1

#### Presenting Berti

## Berti: per-IP **be**st request **ti**me aggregate of deltas

#### Why Time? Devil is in the details

The time to fetch data into L1 is not constant

22 to 2098 cycles, with an average of 278 cycles, on a 4-core system

Each IP has a different time to fetch (locality, reuse, queueing delay, etc.)

In summary: an out-of-order memory hierarchy



#### The notion of timely local deltas



#### The updated question of interest

### "for an L1D access to address X, what is the timely and accurate delta (d) that should be used for prefetching?"



#### Idea in one slide



#### Our Insights

#### "Timely deltas that provide the **best local coverage** also contribute to high global accuracy."
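The notion of a timely local delta can be illustrated with a small software model. This is a simplified sketch, not the Berti hardware design: the `TimelyDeltaTable` class, its method names, the history length, and the `min_coverage` threshold are all assumptions made for illustration. For each IP it keeps recent (cache line, time) pairs; a delta counts as timely only if a prefetch launched at the earlier access would have returned before the current access needed the data.

```python
from collections import Counter, defaultdict, deque


class TimelyDeltaTable:
    """Toy per-IP tracker of deltas that would have been timely."""

    def __init__(self, history_len=16):
        self.history = defaultdict(lambda: deque(maxlen=history_len))
        self.delta_hits = defaultdict(Counter)   # ip -> delta -> timely count
        self.searches = Counter()                # ip -> number of learn() calls

    def record_access(self, ip, line_addr, now):
        self.history[ip].append((line_addr, now))

    def learn(self, ip, line_addr, access_time, fetch_latency):
        # A prefetch must have been issued at or before this deadline
        # for the line to arrive by access_time.
        deadline = access_time - fetch_latency
        self.searches[ip] += 1
        seen = set()
        for prev_addr, prev_time in self.history[ip]:
            delta = line_addr - prev_addr
            if prev_time <= deadline and delta != 0 and delta not in seen:
                seen.add(delta)
                self.delta_hits[ip][delta] += 1

    def best_delta(self, ip, min_coverage=0.5):
        """Timely delta with the best local coverage for this IP, or None."""
        if not self.searches[ip] or not self.delta_hits[ip]:
            return None
        delta, hits = self.delta_hits[ip].most_common(1)[0]
        return delta if hits / self.searches[ip] >= min_coverage else None


# One IP streaming with stride 2 (in cache lines), one access every 100 cycles,
# and an assumed fetch latency of 150 cycles.
table = TimelyDeltaTable()
for step, line in enumerate([10, 12, 14, 16, 18]):
    now = step * 100
    table.learn(0x400, line, now, fetch_latency=150)
    table.record_access(0x400, line, now)
best = table.best_delta(0x400)
print(best)
```

In this made-up stream the stride is +2, but a prefetch issued with delta +2 would arrive 50 cycles late; the delta selected here is +4, the smallest one that would have arrived in time.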

#### Berti as a package



#### What about performance?

Berti provides ~90% accuracy 🙂 🙂

Berti consumes additional ~11% energy

## **3.5%** performance improvement over IPCP [ISCA 2020]

#### Connecting the dots

| Idea     | Storage (Lower) | Performance (Higher) | Energy (Lower) |
|----------|-----------------|----------------------|----------------|
| ISCA'20  | 1KB             | 45%                  | 50%            |
| CAL'21   | +2.5KB          | 43%                  | 45%            |
| MICRO'22 | +1.5KB          | 48.5%                | 11%            |



#### Message-III: Keep Pushing

### PAUSE for a minute

#### Finally, do not forget address translations 🙂



Address Translation Conscious Cache Hierarchy [ISPASS 2022]

*Vasudha, M.S. by Research, IIT Kanpur* [2019-2021]

*Qualcomm Microarchitecture Team* 

#### Virtual Memory



#### Page Table



#### Page Table Walk



#### The Memory Hierarchy



#### Misses and latencies



5 DRAM accesses in the worst case

What happens if we have both?... 6 DRAM accesses

#### New terms



#### **Processor Stalls because of translations**



Average ROB stall cycles: 33 for STLB misses, 191 for replay loads, and 47 for the remaining loads. If an OS page is cold, then its data will be even colder

## How cache management policies react to Replays



#### Will data prefetchers work for replay?



State-of-the-art data prefetchers also fail to reduce the replay misses

## Enhancement-I: Treat translations differently



Yippee. 99% hit rate at the cache hierarchy for translations

## Enhancement-II: Translation hit triggered Prefetcher




#### Takeaway message: Common Sense

# Common sense is not so common.

Voltaire

#### **Reduction in Processor Stalls**



ROB stall cycles get reduced by 28.76% for translations and 18.5% for replay loads, leading to 4.8% performance improvement

### Bouquet of microarchitecture ideas

| Idea      | Storage (Lower) | Performance (Higher)        | Energy (Lower) |
|-----------|-----------------|-----------------------------|----------------|
| ISCA'20   | 1KB             | 45%                         | 50%            |
| CAL'21    | +2.5KB          | 43%                         | 45%            |
| MICRO'22  | 2.5KB           | 48.5%                       | 11%            |
| ISPASS'22 | +0.0KB 🙂       | +4.5%, data-intensive apps. |                |

Shh.. Microarchitects at work !

#### Bouquet of microarchitecture ideas



Microarchitecture research is fun and rewarding 🙂

Stop listening to ...

## What Next? Security, performance, both

#### Miles to go before I (we) sleep



Microarchitects are on it 🙂

#### CASPER@IITB






Welcome to CASPER: Computer Architecture for Security and Performance Research group!

## If you like vada pav along with dosa, do consider joining 🙂

#### In a nutshell

What do we do?

Dream the impossible and ask the right questions to make it possible

My take from last five years:

*If you know the problem well then you know the solution well too* 

## Microarch. Research – Test Match Cricket

"Play Life like a Cricket match.

Don't try to hit hard in every situation, Just keep rotating, moving, and then look for that one delivery and hit it as hard as you can."

#### Dhanyavaad (Thank you)

## Google Research

#### Work done remotely (COVID-19)













## Dhanyavaad (Thank you), CSE-IITM