Operating System Modifications for Multicore Scalability

* What limits the performance of applications and the OS as the number of cores increases? Locking-related overheads, other shared memory that bounces across caches, system bus bandwidth, DRAM bandwidth, and so on.

* How has the design of Linux changed to accommodate the challenges of multicore systems? Changes have been made to individual subsystems like the scheduler (e.g., local per-CPU queues of ready processes), and mechanisms for lightweight synchronization (RCU) have been provided. This is an ongoing process. However, the fundamental structure of the kernel has not changed drastically.

* Linux also has support for NUMA machines. The OS exposes an abstract topology of the system as a set of nodes, with CPUs, memory banks, and I/O buses attached to each node. The user can find out which node a CPU belongs to, which memory is located on the same node, and so on. Future versions of the OS may also expose other information, such as latency ratios and interconnect topology.

* The Linux kernel is already NUMA aware, and a configuration option can be set to enable NUMA-aware memory allocation to processes. When memory is allocated to a new process (during fork or exec), the memory is allocated on the node with the least loaded CPU and the process is assigned to execute on that CPU. Tasks are also periodically load balanced across CPUs and migrated across nodes if required.

* The numactl command-line tool can be used to bind a process to NUMA nodes/CPUs at a coarse granularity. The libnuma API can be used for fine-grained NUMA-aware memory allocation inside a program, as in the sketch below.
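A minimal sketch of the fine-grained libnuma usage mentioned above (an illustration only, not from the referenced papers): it pins the calling thread to a NUMA node and allocates memory from that same node, assuming node 0 is online and the libnuma v2 API (numa_available, numa_run_on_node, numa_alloc_onnode) is available. Compile with -lnuma.

```c
#include <numa.h>          /* libnuma API; link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    int node = 0;                      /* assumption: node 0 is online */
    size_t sz = 64UL * 1024 * 1024;    /* 64 MB buffer */

    /* Restrict this thread to the CPUs of 'node', then allocate memory
     * whose pages will come from that node's memory bank. */
    numa_run_on_node(node);
    void *buf = numa_alloc_onnode(sz, node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, sz);                /* touching the memory places the pages on 'node' */
    printf("allocated %zu bytes on node %d (max node id = %d)\n",
           sz, node, numa_max_node());

    numa_free(buf, sz);
    return 0;
}
```

The coarse-grained equivalent with numactl would be something like `numactl --cpunodebind=0 --membind=0 ./app`, which binds the whole process and all of its allocations to node 0.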
* While Linux already includes some improvements for multicore scalability, the [Mosbench] reference proposes a few more improvements to Linux to enable it to scale to a large number of cores (the paper tests up to 48 cores), as described below.

* Multicore network processing. When a network packet arrives, the NIC copies the packet into a DMA region of memory and raises an interrupt on a CPU. The kernel on that CPU processes the packet in kernel mode, performs TCP/IP processing, and delivers it to a socket buffer. The application to which the packet is destined may be scheduled on another core at a later time to consume the packet. Thus, a packet is bounced around multiple CPU cores and caches.

* What happens when there are a lot of packets? The core handling interrupts may get very busy. So, modern network cards split packets into multiple queues, and each queue delivers packets to a separate core. This feature of NICs is called RSS (receive side scaling). With RSS, the interrupt load of incoming packets can be distributed across multiple cores. RSS typically uses a hash of the TCP 4-tuple to split packets into queues, in order to keep all packets of one TCP connection in the same queue.

* Now, a packet can be delivered to core X by RSS, while the application thread handling that TCP connection runs on a different core Y. How can we ensure that a packet is processed on the same core on which it is received (to optimize cache performance)? One way is to check which core the transmitted packets on that connection are coming from, and update the hardware to deliver interrupts to the core from which the application sent packets. However, this monitoring of transmitted packets can only be done infrequently, so short connections will not benefit from this technique.

* For applications that accept connections in multiple processes/threads, another solution is possible. The kernel maintains per-core accept queues, and a thread that calls accept gets connections that were delivered on the core it is running on. This way, all processing of a connection stays on the same core. Another advantage of this solution is that it avoids contention on the lock protecting a single shared accept queue.

* Taking this idea further, several kernel data structures can be split into per-core components without affecting correctness. For example, the list of open files can be maintained as a per-core list. A process that opens a file can add it to the per-core list of open files, and as long as the process runs on the same core, nothing goes wrong. Of course, if the process migrates, the entry needs to be migrated too; that is, per-core data structures must be carefully kept in sync for correctness. Another kernel data structure that can easily be partitioned is the list of free pages / packet buffers.

* Another new idea in the [Mosbench] paper is sloppy counters. Instead of keeping a global counter in sync all the time, each core maintains a local counter of spare references. When a core releases (decrements) a reference, it keeps the freed reference as a spare in its local counter instead of updating the global counter. Future increments on that core can consume a spare from the local counter instead of touching the global counter. Once the spare references cross a threshold, they are returned to the global counter. Such counters work for a lot of kernel data structures, e.g., counters tracking the amount of memory allocated to network protocols, reference counts on routing table entries, directory entry objects, and so on. (A sketch follows these bullets.)

* A few other fixes to Linux from the paper:
  - Eliminate false sharing by moving per-thread data to separate cache lines.
  - Eliminate unnecessary locks.
  - Try to perform lock-free comparisons: maintain a generation counter / version number on the data structure. To simply read and compare entries, avoid taking the lock; if the generation number is found to have changed during the comparison, resort to locking, else the entries can be read and compared without locks. (A sketch follows these bullets.)

* Now let us ask the question: what would a clean-slate OS designed for multicore systems look like? The reference [Barrelfish] proposes a radical design, which it calls a multikernel. The OS no longer runs by sharing memory and data structures across multiple cores. Instead, each core runs its own copy of the kernel, and these copies communicate with each other via message passing. That is, a multicore system is viewed not as a single shared-memory system but as a distributed system.

* The Barrelfish paper argues that multicore hardware is very diverse and evolving fast. The performance of a particular software design depends on the architecture of the cores, the interconnect, which caches are shared, and so on, so optimizing the OS for every hardware configuration is impossible. Further, some recent multicore hardware does not guarantee cache coherence, so relying on the shared-memory abstraction is not always possible.

* The paper also argues that passing messages between cores is cheap; e.g., it develops an efficient mechanism that transfers cacheline-sized messages between cores (see the sketch below).

* Design principles of Barrelfish: the OS runs separately on each core, and any inter-core communication is done explicitly by passing messages (as opposed to implicitly via cache coherence traffic). All OS state is replicated across cores, and consensus protocols from distributed systems are run to maintain consistency for those data structures that cannot be partitioned into per-core components.
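A minimal user-space sketch of the sloppy counters described above (the kernel version in the paper uses per-CPU data; here a hypothetical core id is passed in by the caller, and each "core" is assumed to be a single thread). The invariant is that the central counter equals the true reference count plus all spare references held in the per-core counters.

```c
#include <stdatomic.h>

#define NCORES    48
#define THRESHOLD 16

static atomic_long central;             /* shared counter = true count + all spares */

struct percore {                        /* padded to a cache line to avoid false sharing */
    long spare;
    char pad[64 - sizeof(long)];
} __attribute__((aligned(64)));

static struct percore local[NCORES];    /* local[c] is touched only by core c */

/* Acquire a reference on core c: consume a local spare if one is available,
 * otherwise fall back to the shared central counter. */
void sloppy_inc(int c)
{
    if (local[c].spare > 0)
        local[c].spare--;
    else
        atomic_fetch_add(&central, 1);
}

/* Release a reference on core c: keep it as a local spare; only when the
 * spares cross THRESHOLD are they returned to the central counter. */
void sloppy_dec(int c)
{
    local[c].spare++;
    if (local[c].spare > THRESHOLD) {
        atomic_fetch_sub(&central, local[c].spare);
        local[c].spare = 0;
    }
}
```

In the common case both increments and decrements touch only the core-local cache line; the shared counter is modified only when spares run out or overflow, which is what makes the counter scale.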
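The lock-free comparison idea above resembles the Linux kernel's seqlocks. Below is a simplified user-space sketch with hypothetical structure and field names; memory-ordering details and the formal data race on the plain reads are glossed over (the kernel's read_seqbegin/read_seqretry and write_seqlock/write_sequnlock handle these properly).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

struct entry {
    pthread_mutex_t lock;         /* still taken by writers and slow readers   */
    atomic_uint     gen;          /* even = stable, odd = update in progress   */
    char            name[64];     /* the field being compared                  */
};

/* Writer: update under the lock, bumping the generation to odd before the
 * update and back to even after it. */
void entry_set_name(struct entry *e, const char *name)
{
    pthread_mutex_lock(&e->lock);
    atomic_fetch_add(&e->gen, 1);                 /* gen becomes odd            */
    strncpy(e->name, name, sizeof(e->name) - 1);
    atomic_fetch_add(&e->gen, 1);                 /* gen becomes even again     */
    pthread_mutex_unlock(&e->lock);
}

/* Reader: optimistic lock-free compare; fall back to locking if an update is
 * in progress or the generation changed during the comparison. */
bool entry_name_equals(struct entry *e, const char *name)
{
    unsigned g1 = atomic_load(&e->gen);
    if ((g1 & 1) == 0) {                          /* no update in progress      */
        bool eq = (strncmp(e->name, name, sizeof(e->name)) == 0);
        atomic_thread_fence(memory_order_acquire);
        if (atomic_load(&e->gen) == g1)
            return eq;                            /* snapshot was consistent    */
    }
    pthread_mutex_lock(&e->lock);                 /* writer active: take the lock */
    bool eq = (strncmp(e->name, name, sizeof(e->name)) == 0);
    pthread_mutex_unlock(&e->lock);
    return eq;
}
```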
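To make the message-passing point concrete, here is a sketch of what a cacheline-sized channel between two cores can look like in shared memory: a single-sender/single-receiver slot with made-up names, illustrating only the general idea rather than Barrelfish's actual inter-core channel implementation. The sender fills a 64-byte slot and publishes it with a release store; the receiver polls with acquire loads, so the message and its flag move between the cores as a single cache line.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

struct msg_slot {
    uint64_t     payload[7];          /* 56 bytes of message data           */
    atomic_ulong full;                /* 0 = empty, 1 = message ready       */
} __attribute__((aligned(CACHELINE)));

/* Sender core: spin until the slot is free, write the payload, publish it. */
void channel_send(struct msg_slot *s, const uint64_t payload[7])
{
    while (atomic_load_explicit(&s->full, memory_order_acquire) != 0)
        ;                                          /* wait for the receiver  */
    memcpy(s->payload, payload, sizeof(s->payload));
    atomic_store_explicit(&s->full, 1, memory_order_release);
}

/* Receiver core: spin until a message arrives, copy it out, free the slot. */
void channel_recv(struct msg_slot *s, uint64_t payload[7])
{
    while (atomic_load_explicit(&s->full, memory_order_acquire) == 0)
        ;                                          /* poll for a message     */
    memcpy(payload, s->payload, sizeof(s->payload));
    atomic_store_explicit(&s->full, 0, memory_order_release);
}
```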
* What about the API to the applications? Applications still have a shared-memory view of the system, and the OS realizes this by sharing the hardware page tables consistently across cores. Every process has per-core dispatchers that execute the process on a core by interacting with that core's scheduler.

* The paper also presents a performance evaluation using microbenchmarks. For example, it shows that TLB shootdowns (invalidating TLB entries on other cores when one core changes a page table) are more efficient in Barrelfish. However, real-life applications do not show much difference and do not become more scalable with Barrelfish. So, is this idea really useful?