

#### CS 773

# Intel® Data Direct I/O Technology

#### Hari Sharan harisharan@cse.iitb.ac.in

## TRENDS IN NETWORK VIRTUALIZATION



#### Virtualization of network capabilities

- Software Load Balancers, Routers, Firewalls
- Software based network functions on shared general purpose servers
- Demand for high bandwidth
  - Video streaming
  - IoT devices
  - 5G technologies

#### Expectations from network virtualization

- High throughput
- Low latency
- Service time guarantees



#### ETHERNET SPEEDS & PROCESSING POWER

- Network bandwidth is increasing at a faster pace than processor speeds
- A server receiving 64 B packets at 100 Gbps must process each packet in 5.12 ns
- Software network functions are not able to match the performance of traditional hardware middleboxes
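The 5.12 ns figure above follows from the packet size and the link rate alone. A minimal sketch of that arithmetic is below; the function name and the optional `overhead_bytes` parameter are illustrative choices, not something from the slides. Passing 20 bytes of overhead accounts for the Ethernet preamble, start-of-frame delimiter, and inter-frame gap, which the slide's figure leaves out.

```python
def packet_time_ns(packet_bytes: int, link_gbps: float, overhead_bytes: int = 0) -> float:
    """Time one packet occupies the link, in nanoseconds."""
    bits = (packet_bytes + overhead_bytes) * 8
    # 1 Gbps is one bit per nanosecond, so bits / Gbps yields nanoseconds.
    return bits / link_gbps


if __name__ == "__main__":
    # Bare 64 B frame at 100 Gbps, as on the slide.
    print(packet_time_ns(64, 100))
    # Same frame including 20 B of on-wire Ethernet overhead.
    print(packet_time_ns(64, 100, overhead_bytes=20))
```

Either way, the per-packet budget is only a few nanoseconds, which is less than a single DRAM access typically takes.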





#### PERFORMANCE ISSUES IN NFV WORKLOADS



- Keeping up with demand for high bandwidth applications and rise in link speeds
  - Less time to process each packet
  - Focus on optimizing packet processing for higher throughput
- Providing service guarantees using software Network Functions
  - Predictable performance as provided by hardware
- Performance isolation in shared multi-tenant environment
  - Performance degradation due to contention for shared CPU resources

#### EFFECTS OF CO-RUNNING WORKLOADS ON PERFORMANCE





Maximum performance degradation for minimum and MTU-sized packets without isolation.

Source: ResQ: Enabling SLOs in Network Function Virtualization, NSDI 2018

#### Latency Degradation

#### HOW NETWORK I/O HAPPENS (Tx)



#### HOW NETWORK I/O HAPPENS (Rx)



#### NETWORK I/O OVERHEADS & S/W SOLUTIONS





#### ROLE OF MEMORY ACCESS SPEEDS

- Socket buffers, network driver Tx / Rx rings frequently accessed during packet processing
- Faster packet processing if memory requests for these data structures serviced from cache






#### Multi-core Processor Components

- Separate cores & L1/L2 caches for different workloads
- Processor uncore resources are shared among multiple cores: Last Level Cache (LLC), integrated Memory Controller (iMC), Integrated I/O Controller (IIO)
- Sharing of uncore resources leads to
  - Resource contention
  - Performance degradation



#### INTEL PROCESSOR UNCORE RESOURCES





## INTEL DATA DIRECT I/O TECHNOLOGY

- Microarchitecture-level optimization technique
- Introduced with the Intel® Xeon® processor E5 product family
- A platform technology that enables I/O data transfers that require far fewer trips to memory
- Makes the processor cache, rather than main memory, the primary destination and source of I/O data



- Traditionally, incoming data from NIC DMA'ed to DRAM.
- Data fetched to cache for processing
- Requires a memory write and a memory read before processing begins
- Inefficient due to
  - Number of accesses to Main Memory
  - Access latency to I/O data
  - Memory bandwidth usage





# INTEL DATA DIRECT I/O TECHNOLOGY

- With Intel DDIO
  - Avoid multiple reads from and writes to memory
  - Reduces latency, memory bandwidth requirements



### INTEL DATA DIRECT I/O TECHNOLOGY



- DDIO limited to 10% of LLC (2 LLC ways) and cannot be partitioned using CAT
- DDIO ways can be written by cores, whereas I/O can't write into core ways
- If this limit is exceeded, new inbound I/O data still goes directly to the cache, but the least-recently used I/O data is written back to memory to make room for it.
- No changes to drivers, applications, operating systems or firmware

# I/O READ





Without Intel DDIO

With Intel DDIO

# I/O READ



- I/O reads are achieved with fewer trips to memory
- The data in the cache is not disturbed by an I/O data consumption operation
- I/O reads to memory are non-allocating
  - If line is not present in LLC it is not allocated
  - If line is in LLC, it stays there
  - If it is in one of higher caches, it moves or gets copied to LLC
- LLC Hit: Data serviced from LLC Cache line
- LLC Miss: Data serviced from Memory
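The non-allocating read behaviour listed above can be sketched as a small function. This is a toy model under stated assumptions, not Intel's implementation: the LLC is represented as a plain set of addresses, `io_read` is a hypothetical name, and the case where the line sits in a higher-level cache is left out.

```python
def io_read(llc_lines: set, addr: int) -> str:
    """Serve a device read of `addr` and report where the data came from."""
    if addr in llc_lines:
        # LLC hit: data is served from the cache line, which stays in place.
        return "LLC"
    # LLC miss: data is served from memory and no LLC line is allocated.
    return "memory"


if __name__ == "__main__":
    llc = {0x1000, 0x1040}
    snapshot = set(llc)
    print(io_read(llc, 0x1000))   # address present in the LLC
    print(io_read(llc, 0x2000))   # address absent from the LLC
    print(llc == snapshot)        # the read path never modifies the LLC
```

The point of the sketch is that neither branch mutates `llc_lines`, which is what "the data in the cache is not disturbed" means in practice.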

### I/O WRITE





Without Intel DDIO

With Intel DDIO

#### I/O WRITE



Two modes of operation:

- Write Update (LLC Hit): Cache line is overwritten with new data
  - Used when the memory address of the data being delivered already exists in the LLC

- Write Allocate (LLC Miss): Requires allocation of a cache line in the LLC
  - Used when the memory address of the data being delivered does not exist in the LLC; the data is written to the newly allocated line, so no trip to memory is needed
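The two write modes, together with the eviction rule for the limited DDIO ways, can be illustrated with a toy model. Everything here is an assumption made for illustration: the class name `DdioWriteModel`, the capacity measured in cache lines, and strict LRU replacement among I/O-allocated lines. Real hardware replacement policy may differ.

```python
from collections import OrderedDict


class DdioWriteModel:
    """Toy model of inbound I/O writes landing in the LLC."""

    def __init__(self, ddio_lines: int):
        self.ddio_lines = ddio_lines    # capacity of the DDIO ways, in cache lines
        self.core_lines = set()         # lines already cached on behalf of cores
        self.io_lines = OrderedDict()   # I/O-allocated lines, least recent first

    def io_write(self, addr: int):
        """Return (mode, evicted_addr) for one inbound write to `addr`."""
        if addr in self.core_lines:
            return "write update", None           # overwrite the line in place
        if addr in self.io_lines:
            self.io_lines.move_to_end(addr)       # refresh its LRU position
            return "write update", None
        evicted = None
        if len(self.io_lines) >= self.ddio_lines:
            # DDIO ways are full: the least-recently used I/O line goes to memory.
            evicted, _ = self.io_lines.popitem(last=False)
        self.io_lines[addr] = True
        return "write allocate", evicted


if __name__ == "__main__":
    model = DdioWriteModel(ddio_lines=2)
    for address in (0xA, 0xB, 0xA, 0xC):
        print(hex(address), model.io_write(address))
```

In the usage loop, the final write to `0xC` finds both DDIO lines occupied, so the least-recently used one is evicted to memory while the new data still lands in the cache.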

## TUNING INTEL DDIO



- Model Specific Register (MSR) "IIO LLC WAYS"
- For Skylake, default value is 0x600 (two bits set)
- Max value on the testbed CPU is 0x7FF (11 bits set, one per LLC way)
- Disabling DDIO
  - Globally: set the Disable\_All\_Allocating\_Flows bit in the "iiomiscctrl" register
  - Per PCIe root port: set the NoSnoopOpWrEn bit and clear the Use\_Allocating\_Flow\_Wr bit in the "perfctrlsts\_0" register
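The register values quoted above are bitmasks with one bit per LLC way. The sketch below shows how such a mask relates to a way count. It assumes that DDIO ways are taken contiguously from the high end of an 11-way mask, which matches the 0x600 default and the 0x7FF maximum but is otherwise an assumption; the helper names are hypothetical. Writing the value to the MSR needs privileged access and is not shown.

```python
def ddio_way_mask(ddio_ways: int, total_ways: int = 11) -> int:
    """Build a mask with `ddio_ways` contiguous bits set at the high end."""
    if not 0 <= ddio_ways <= total_ways:
        raise ValueError("ddio_ways must be between 0 and total_ways")
    # Assumption: ways are allocated from the most significant bit downward.
    return ((1 << ddio_ways) - 1) << (total_ways - ddio_ways)


def ways_in_mask(mask: int) -> int:
    """Count how many LLC ways a mask enables."""
    return bin(mask).count("1")


if __name__ == "__main__":
    for ways in (2, 4, 8, 11):
        mask = ddio_way_mask(ways)
        print(ways, hex(mask), ways_in_mask(mask))
```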

### TUNING INTEL DDIO





Using more DDIO ways ("W") enables cores to forward large packets at 100 Gbps with a large number of descriptors while achieving better or similar latency



### Packet Size & Rx Descriptors



Increasing the number of packet descriptors and/or the packet size adversely affects the performance of 2-way DDIO

Source: Reexamining Direct Cache Access to Optimize I/O Intensive Applications, Usenix ATC 2020
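One way to see why descriptor count and packet size matter is to compare the Rx buffer footprint with the LLC capacity available to DDIO. The sketch below is a rough estimate under stated assumptions: the 22 MiB LLC size is hypothetical (chosen so that each of the 11 ways is 2 MiB), each descriptor is assumed to hold exactly one packet's worth of data, and the function names are illustrative.

```python
def rx_buffer_footprint(descriptors: int, packet_bytes: int) -> int:
    """Bytes of packet data that can be in flight across the Rx ring."""
    return descriptors * packet_bytes


def ddio_capacity(llc_bytes: int, total_ways: int, ddio_ways: int) -> int:
    """Bytes of LLC available to DDIO, assuming equally sized ways."""
    return llc_bytes * ddio_ways // total_ways


if __name__ == "__main__":
    LLC_BYTES = 22 * 1024 * 1024   # hypothetical LLC size, not from the slides
    capacity = ddio_capacity(LLC_BYTES, total_ways=11, ddio_ways=2)
    for descriptors, size in ((256, 64), (4096, 64), (4096, 1500)):
        footprint = rx_buffer_footprint(descriptors, size)
        print(descriptors, size, footprint, footprint > capacity)
```

When the footprint exceeds the DDIO capacity, newly arriving packets can evict earlier ones before a core has processed them, so those packets must be fetched from memory again.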



## No. of Cores & DDIO Capacity



Increasing the number of DDIO ways has a similar effect to increasing the number of cores (forwarding 1500-B packets with 4096 Rx descriptors)

Source: Reexamining Direct Cache Access to Optimize I/O Intensive Applications, Usenix ATC 2020



### PERFORMANCE ISSUES



Source: ResQ: Enabling SLOs in Network Function Virtualization, NSDI 2018



# PERFORMANCE ISSUES

#### Latent Contender Problem

- DDIO's LLC ways may be allocated to certain cores running LLC-sensitive workloads.
- Although LLC ways are isolated from the core's point of view, DDIO still contends with the cores for the LLC

Figure: the Last Level Cache (LLC) divided into regions labeled App 1 LLC, App 2 LLC, and DDIO LLC.
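The latent contender situation can be detected by comparing way masks. The sketch below assumes each application has a cache-allocation way mask and that DDIO uses the 0x600 default mentioned earlier in the deck; the application names, their masks, and the function name are hypothetical.

```python
def contended_ways(app_mask: int, ddio_mask: int) -> int:
    """Number of LLC ways an application shares with DDIO."""
    return bin(app_mask & ddio_mask).count("1")


if __name__ == "__main__":
    DDIO_MASK = 0x600                            # two highest ways of an 11-way LLC
    app_masks = {"app1": 0x00F, "app2": 0x700}   # hypothetical allocations
    for name, mask in app_masks.items():
        shared = contended_ways(mask, DDIO_MASK)
        status = "latent contender" if shared else "isolated from DDIO"
        print(name, hex(mask), shared, status)
```

An application whose mask overlaps the DDIO ways looks isolated from other cores yet still competes with inbound I/O for the same cache lines.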

# CONCLUSION



- Intel DDIO reduces latency and increases system I/O bandwidth and throughput by writing incoming I/O packets directly to the LLC
- Tuning DDIO can improve performance over the default DDIO configuration
- Increasing the number of DDIO ways has a similar effect to increasing the number of cores
- Performance issues such as Leaky DMA and Latent Contender require proper tuning of DDIO with attention to workload characteristics



# THANK YOU