Network I/O optimizations: overview.

* Overview of the Linux network stack: the NIC and the kernel share one or more pairs of RX/TX rings. A ring is a circular buffer describing packets to be received or transmitted. More specifically, each slot in the ring contains the length and physical address of a packet buffer. CPU-accessible registers in the NIC indicate the portion of the ring available for transmission or reception.

* When a packet arrives, the NIC DMAs the packet into the buffer pointed to by the RX ring, updates the ring to indicate that a packet has arrived, and raises an interrupt to the kernel. Further interrupts from the NIC are disabled until the kernel (device driver) gets around to servicing this interrupt, and are re-enabled once it has done so. The kernel does only minimal processing in the interrupt service routine / hardware interrupt handler, and schedules a kernel thread (softirq) to perform the rest of the TCP/IP processing (the bottom half of the interrupt service routine). The bottom half / softirq removes the buffers from the ring for network stack processing and reinitializes the ring with new buffers. After TCP/IP processing, the packet buffers are queued up at the receive socket buffer. When the user program calls read, the payload is copied from the kernel buffers to userspace memory, and the kernel buffer is freed. New connections (after completion of the three-way TCP handshake) are queued up in the backlog queue and can be accessed via accept (see the socket-level sketch after the list of inefficiencies below).

* If the hardware supports multiple queues and features like receive side scaling (RSS), then the interface between the kernel and the NIC can consist of multiple TX/RX rings, with separate cores handling the packets coming from separate rings.

* On the transmit path, the write system call initially copies the payload into the socket send buffer. If the TCP congestion window has space to send more packets, TCP/IP processing is performed and the resulting packet buffer is queued up in the buffers pointed to by the NIC TX ring; otherwise, this step happens when space is created in the TCP window by an incoming ACK. Once the packet is queued up at the NIC ring, the device driver initiates the transmission by writing commands into the NIC registers.

* What are the inefficiencies in the kernel stack? (The first three apply to both single-core and multicore systems, while the last three are specific to multicore systems.)
  - Packet copies: from the NIC to kernel buffers, and from kernel buffers to user space.
  - Interrupt processing and system calls, which incur user-to-kernel mode switching overheads.
  - Dynamic allocation and deallocation of packet buffers (the sk_buff struct in Linux).
  - A single backlog queue per listening socket can become a bottleneck in multicore systems.
  - Delivering a packet's interrupt on one core and processing it in the application on another core can lead to cache misses.
  - Filesystem-related overheads. Sockets are managed much like files. When a socket is opened, the lowest unused file descriptor number has to be found, an entry has to be created in the global open file table, and so on. All of these actions incur a system-wide synchronization cost.
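To ground the receive path above, here is a minimal sketch of a blocking TCP receiver using the standard socket API: accept dequeues a completed connection from the listen backlog, and each read copies payload from the kernel socket buffer into user memory. The port number and buffer size are arbitrary choices for illustration, and error handling is omitted.

```c
/* Minimal blocking TCP receiver over the standard socket API. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);       /* allocate an fd and kernel socket state */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);                     /* arbitrary port for illustration */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);                                /* 128 = max pending connections in the backlog */

    int cfd = accept(lfd, NULL, NULL);               /* blocks until a 3-way handshake completes */
    char buf[4096];
    ssize_t n;
    while ((n = read(cfd, buf, sizeof(buf))) > 0) {  /* each read copies kernel data to user space */
        printf("received %zd bytes\n", n);
    }
    close(cfd);
    close(lfd);
    return 0;
}
```

Every accept and read here is a system call, and every read implies a kernel-to-user copy, which is exactly where the overheads listed above come from.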
* Solutions to improve network I/O performance:
  - Optimized network stacks for Linux that address the inefficiencies identified above. The new designs can choose to stick to the socket API or expose a new networking API altogether. For example, [Megapipe] describes a new network stack design for Linux that proposes an alternative to the socket API. We have already seen fixes to the Linux kernel, like per-core accept queues, in the [Mosbench] paper.
  - Kernel bypass techniques that deliver packets directly from the NIC to the application, bypassing the overheads of packet copies and interrupt processing in Linux. Examples of such techniques are netmap and DPDK. Such techniques often rely on a userspace network stack (for example, the one described in the [mTCP] reference) for TCP/IP processing.
  - Kernel bypass with a hardware network stack (RDMA/Infiniband).
  - OS designs optimized for I/O.

* netmap exposes a new API to applications for accessing network data. When an application wants to use a netmap-enabled NIC, the NIC is disconnected from the regular network stack. The NIC transfers packets into a netmap ring, and this ring is memory-mapped into the userspace program. The netmap ring uses its own packet buffers (not the kernel packet buffer data structures). Every netmap-enabled port has a TX/RX ring with pointers to the netmap packet buffers. The kernel and userspace each own a part of the ring; e.g., the user process can access buffers from the 'cur' pointer up to 'cur + avail - 1', while the kernel owns the rest. When a program wishes to transmit packets, it writes into the packet buffers and makes a system call (ioctl), at which point the kernel takes over the TX buffers and allocates new buffers for the next set of packets. When the process wants to read from the NIC, it makes a system call which tells it which part of the RX ring is available for reading (see the receive sketch below). Note that except for the duration of the system call, the kernel and the user program can access the netmap rings in parallel.

* DPDK (Data Plane Development Kit) is another way to bypass the kernel, developed by Intel for Intel (and now other) NICs. A DPDK-enabled NIC is also disconnected from the host stack. A process communicates directly with the NIC via the DPDK poll mode driver (and not the kernel device driver). The DPDK library allocates a ring of packet buffers in user space. The driver periodically polls the NIC and fetches packets into this ring buffer, from which they are accessed by the user process (see the polling sketch below). The DPDK library manages this whole packet I/O process very efficiently, e.g., by using huge pages to store packet buffers (avoiding TLB misses), by using lock-free data structures for the rings, and so on.

* Which overheads do netmap/DPDK avoid? One packet copy is saved. Overheads due to interrupts, system calls, and dynamic buffer allocation are avoided or reduced, largely thanks to batching.

* What packets does a user process obtain from a netmap- or DPDK-enabled NIC? Raw packets that have not gone through the kernel network stack. If the application only needs access to raw packets (e.g., a firewall), this is fine. Otherwise, we must have a mechanism to run TCP/IP on top of netmap/DPDK. This is achieved via userspace network stacks (e.g., mTCP), which are libraries (much like glibc) linked with user programs to provide TCP/IP functionality.
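To make the netmap receive path concrete, here is a minimal sketch built around the ring layout described above. It follows the original netmap API with the cur/avail ring fields (newer netmap releases expose head/cur/tail instead); the interface name eth0 is just an example, and error handling is omitted.

```c
/* Sketch of a netmap receive loop (original cur/avail API). */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

int main(void) {
    int fd = open("/dev/netmap", O_RDWR);
    struct nmreq req;
    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, "eth0", sizeof(req.nr_name) - 1); /* example NIC: detached from the host stack */
    ioctl(fd, NIOCREGIF, &req);                            /* register the interface in netmap mode */

    /* Map the shared region that holds the rings and netmap packet buffers. */
    void *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
    struct netmap_ring *rx = NETMAP_RXRING(nifp, 0);       /* RX ring 0 */

    for (;;) {
        ioctl(fd, NIOCRXSYNC, NULL);                       /* ask the kernel to hand over received slots */
        while (rx->avail > 0) {                            /* slots [cur, cur+avail-1] belong to userspace */
            struct netmap_slot *slot = &rx->slot[rx->cur];
            char *pkt = NETMAP_BUF(rx, slot->buf_idx);     /* raw packet of slot->len bytes */
            (void)pkt;                                     /* process the packet here */
            printf("got packet of %u bytes\n", (unsigned)slot->len);
            rx->cur = NETMAP_RING_NEXT(rx, rx->cur);       /* hand the slot back to the kernel's half */
            rx->avail--;
        }
    }
}
```

Apart from the NIOCRXSYNC ioctl, no kernel work is needed: the process reads packet buffers directly out of the mapped ring.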
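A DPDK receive loop is similar in spirit but polls the NIC instead of making system calls. The sketch below loosely follows the structure of DPDK's basic forwarding examples; the port number, queue depths, and pool sizes are arbitrary, and EAL options (cores, hugepages) are assumed to be supplied on the command line.

```c
/* Sketch of a DPDK polling receive loop on port 0. */
#include <string.h>
#include <stdint.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

int main(int argc, char **argv) {
    rte_eal_init(argc, argv);                    /* init EAL: cores, hugepages, PCI devices */

    /* Pool of packet buffers placed in hugepage memory, shared with the NIC. */
    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL", 8191, 250, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    uint16_t port = 0;                           /* first DPDK-bound port (assumed) */
    struct rte_eth_conf conf;
    memset(&conf, 0, sizeof(conf));
    rte_eth_dev_configure(port, 1, 1, &conf);    /* 1 RX queue, 1 TX queue */
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    struct rte_mbuf *bufs[32];
    for (;;) {
        /* Poll the NIC: no interrupts, no system calls, packets arrive in batches. */
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, 32);
        for (uint16_t i = 0; i < n; i++) {
            char *frame = rte_pktmbuf_mtod(bufs[i], char *); /* raw Ethernet frame */
            (void)frame;                                     /* process/forward here */
            rte_pktmbuf_free(bufs[i]);
        }
    }
    return 0;
}
```

The busy-polling loop trades one dedicated core for the elimination of interrupt and system-call overheads, which is the usual DPDK deployment model.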
* Another way to avoid the kernel overhead: push the kernel's network stack into hardware. High-performance computing (HPC) applications use special network cards which implement all layers of the network stack in hardware. These cards use a technology called Infiniband (IB). IB vendors manufacture network cards and switches that all work together to provide high-bandwidth, low-latency communication between hosts.

* IB network cards expose a different API, called IB verbs (send/receive/read/write), to applications. Send and receive are two-sided operations (that is, one host should call send and the other should call receive), while read/write are one-sided operations (one host can invoke the operation without the other host's CPU being involved). Note that applications written for sockets must be rewritten to work over IB networks, or use translation libraries (IP over IB). This technology, especially when using one-sided verbs, is also called RDMA (remote direct memory access).

* How do IB verbs work? Every application registers a region of memory with the network card, which is used for DMA to and from the IB NIC. A remote host that wishes to communicate with this host must have a key to access this memory region. When one host does a send and the other host does a receive, the IB NICs and switches transfer data from one host's region to the other's and deliver a notification. Two-sided verbs involve both the sender's and the receiver's CPUs, while one-sided verbs can read/write remote memory without involving the remote CPU. Applications issue requests to the IB network card via send and receive queues. IB provides a reliable as well as an unreliable transport layer, much like TCP and UDP in IP networking. For the reliable transport layer, an explicit connection must be set up and keys exchanged with each end host, which is somewhat cumbersome (a minimal local setup sketch appears at the end of this section).

* Note that IB is not compatible with regular TCP/IP/Ethernet. For example, the transport layer in IB network cards assumes a lossless network, which is provided by IB switches but not by Ethernet. Until recently, IB and Ethernet worked in separate domains. Now there is a push to make IB network cards work in Ethernet networks (RoCE, or RDMA over Converged Ethernet).

* Finally, there has also been work on completely redesigning operating systems to perform well for network I/O. Such designs argue that the OS should only be in the control plane (e.g., setting up an application's access to a certain port), but should not be involved in the data plane (i.e., the application should be able to transfer packets directly to and from the NIC). For example, Arrakis is a recent research project that proposes this idea.
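Returning to the IB verbs discussion above, the sketch below shows only the local half of the setup: opening a device, registering a memory region (which yields the lkey/rkey that the NIC and the peer use to access it), creating a queue pair with send and receive queues, and posting a receive buffer. Exchanging QP numbers and keys with the peer and moving the QP through its connection states (ibv_modify_qp) are omitted, so this illustrates the shape of the API rather than a complete program. The buffer and queue sizes are arbitrary.

```c
/* Sketch of local ibverbs setup: device, memory registration, QP, posted receive. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* assume the first IB device */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    /* Register a memory region: pins the pages and gives the NIC DMA access.
     * lkey is used locally; rkey is what a peer needs for one-sided RDMA. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);

    /* Completion queue plus a reliable-connected queue pair (send + receive queues). */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa;
    memset(&qpa, 0, sizeof(qpa));
    qpa.send_cq = cq;
    qpa.recv_cq = cq;
    qpa.qp_type = IBV_QPT_RC;
    qpa.cap.max_send_wr = 16;
    qpa.cap.max_recv_wr = 16;
    qpa.cap.max_send_sge = 1;
    qpa.cap.max_recv_sge = 1;
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

    /* Post a receive buffer; an incoming send from the peer would land here.
     * (A real program must first connect the QP via ibv_modify_qp before this works.) */
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
    struct ibv_recv_wr wr, *bad;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 1;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_post_recv(qp, &wr, &bad);

    printf("registered %zu bytes, rkey to share with the peer = 0x%x\n", len, mr->rkey);
    return 0;
}
```

Once the QP is connected, data transfer itself involves only posting work requests and polling the completion queue; the kernel is not on the data path, which is the whole point of RDMA.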