High performance key-value stores: a case study.

* Case studies of two key-value stores built over a custom network stack on top of kernel-bypass frameworks: MICA over DPDK, and Pilaf over RDMA.

* MICA goals: build a high-throughput key-value store on a single node (it can also work in a cluster architecture with other such nodes). Of course, this imposes certain restrictions (e.g., small key-value pairs, all stored in memory). Supports basic get/put/delete operations.

* Design choice 1: how to partition the key-value pairs?
  - One big hash table: all cores have to lock it, incurring synchronization overhead.
  - Partitioned hash table, with each core handling a subset of requests. Avoids synchronization overhead, but can lead to unfairness if some partitions are more popular than others.
  - What does MICA do? It partitions key-value pairs into per-core hash tables using the hash of the key. Usually, one core manages one partition (EREW mode: exclusive read, exclusive write). In case of imbalance across cores, multiple cores can read from a partition, even though only one core writes to it (CREW mode: concurrent read, exclusive write).
  - What about reader-writer synchronization in CREW mode? No locks are used. Every item has a version number, and a reading core checks the version number before and after a read to see if a concurrent write has happened (optimistic concurrency control). A writer increments the version number to an odd value at the start of a write and to an even value at the end, so if a reader sees an odd version number, it waits for the write to complete. (A sketch of this version check appears below, after design choice 2.)

* Design choice 2: which network stack to use?
  - Since key-value pairs are small, a simple UDP-based stack with application-level loss recovery suffices. So MICA skips the kernel TCP/IP stack and socket API, and runs a custom network stack over the DPDK kernel-bypass framework.
  - Multiqueue support in the NIC is used to bind a NIC queue to each core. Packets are processed in bursts and handled in a zero-copy fashion with DPDK.
  - How are packets split across the queues? The RSS feature in NICs by default hashes the flow 4-tuple (IP addresses and ports). This is not ideal: a key can land on any core's queue depending on the 4-tuple hash, and because the key-value pairs are partitioned, the core that receives the request must then redirect it to the core handling the partition that the key belongs to. Ideally, we want the hardware to steer a packet to a queue based on the key hash (not the 4-tuple hash), but NICs cannot look inside the payload to find the key. So MICA works around the problem: clients pick the UDP port number based on the hash of the key, and RSS is configured to use the UDP port alone (not the 4-tuple) to assign incoming packets to hardware queues. This way, a given key always reaches the same core and is stored in its partition (sketched below). This application-level key partitioning contributes significantly to MICA's performance gains, as seen in the results.
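To make the port-steering trick from design choice 2 concrete, here is a minimal client-side sketch of mapping a key to a UDP destination port. The hash function, base port, and partition count are illustrative assumptions, not MICA's actual choices.

```c
/* Minimal sketch of the client-side "key hash -> UDP port" trick from
 * design choice 2. The hash function, base port, and partition count are
 * illustrative assumptions, not MICA's actual constants. */
#include <stdint.h>
#include <stddef.h>

#define NUM_PARTITIONS 16        /* one partition per server core (assumed) */
#define BASE_UDP_PORT  10000     /* server listens on BASE..BASE+NUM-1 (assumed) */

/* FNV-1a, standing in for whatever key hash the clients and server agree on. */
static uint64_t key_hash(const void *key, size_t len)
{
    const uint8_t *p = key;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* The client addresses its request to this UDP port; with RSS configured to
 * steer on the UDP port, the packet lands directly in the queue of the core
 * that owns the key's partition. */
static uint16_t port_for_key(const void *key, size_t len)
{
    uint16_t partition = (uint16_t)(key_hash(key, len) % NUM_PARTITIONS);
    return (uint16_t)(BASE_UDP_PORT + partition);
}
```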
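The CREW version check from design choice 1 is essentially a seqlock. A minimal sketch, assuming C11 atomics and a fixed-size value; the struct and function names are illustrative, not MICA's, and memory ordering is simplified for clarity.

```c
/* Sketch of MICA-style CREW reader/writer coordination: optimistic
 * concurrency via a per-item version counter (odd = write in progress).
 * A production implementation needs more careful memory fencing around
 * the non-atomic copies; this only shows the protocol. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define VALUE_LEN 8

struct item {
    atomic_uint_fast64_t version;   /* odd = write in progress, even = stable */
    uint8_t value[VALUE_LEN];
};

/* Writer: only the core that owns the partition ever calls this. */
static void write_item(struct item *it, const uint8_t *new_value)
{
    atomic_fetch_add(&it->version, 1);      /* version becomes odd */
    memcpy(it->value, new_value, VALUE_LEN);
    atomic_fetch_add(&it->version, 1);      /* version becomes even again */
}

/* Reader: any core in CREW mode; retry until a consistent snapshot is seen. */
static void read_item(struct item *it, uint8_t *out)
{
    uint64_t v_before, v_after;
    do {
        v_before = atomic_load(&it->version);
        memcpy(out, it->value, VALUE_LEN);
        v_after = atomic_load(&it->version);
        /* Odd version: a write was in progress. Changed version: a write
         * completed underneath us. Either way, retry the read. */
    } while ((v_before & 1) != 0 || v_before != v_after);
}
```

In EREW mode this check is unnecessary, since only the owning core ever touches the partition.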
* Design choice 3: what data structures to use to store key-value pairs?
  - MICA uses a memory region to store key-value pairs and maintains an index/hash table that points from a key to the memory address where the key-value pair is stored.
  - Memory allocation for key-value pairs: the standard way is to use a slab allocator, one for each size class of key-value pairs. However, this can lead to fragmentation. Another option is to append key-value pairs to a log and store a pointer from the hash table to the log entry; the problem here is that garbage collection in a log is difficult. MICA uses the first option when key-value pairs must be stored reliably, and the second option when it is only acting as a cache in front of another key-value store. When acting as a cache, MICA uses a circular log and overwrites old values when the log wraps around, which sidesteps the garbage collection issue. Of course, this can lose key-value pairs and is only acceptable in cache mode.
  - The index/hash table has also been optimized to be efficient, especially in cache mode, where some amount of loss is acceptable.

* Overall takeaway from MICA: it is a holistic design, optimizing all aspects of the system, from memory allocation to the network stack, to achieve high performance for a specific application.

* Pilaf is also a key-value store, with goals similar to MICA's. It uses one-sided InfiniBand verbs (also referred to as RDMA) at the client to access key-value pairs at the server.

* Why RDMA and InfiniBand? Lower latency (on the order of a few microseconds, not a few hundred microseconds as with the regular TCP/IP stack over Ethernet). Because the remote server's CPU is not involved, one can get higher throughput for the same number of cores. (Note that two-sided verbs do not have this advantage of saving remote server CPU; only one-sided verbs do.)

* What are the challenges? If the remote server's CPU is not involved, a client's remote read can happen concurrently with the server's writes, leading to race conditions. To avoid this problem, Pilaf uses one-sided RDMA only for get requests; put requests use two-sided verbs. Further, it uses self-verifying data structures to detect read-write races.

* High-level architecture: the server exposes two memory regions to each client: a fixed-size hash table array, and a region containing the actual keys and values that the hash table entries point to. Recall that with RDMA, the server must hand out access keys to clients for every memory region they can access; these two regions are exposed to all clients.

* How does a get request work? The client first hashes the key, reads the corresponding entry of the fixed-size hash table array at the server via a one-sided RDMA read, and obtains a pointer into the key-value region. It then fetches key-value entries one by one until it finds the matching key. (The paper also describes an optimized hash table instead of this linear probing.) Put requests go via two-sided send/recv operations to the remote server. (A sketch of the get path appears at the end of these notes.)

* How does Pilaf deal with read-write races? Every hash table entry is protected by a checksum over all its contents. If a concurrent write happens, the checksum won't match, the client detects the race, and it simply retries the read.

* Overall takeaway from Pilaf: using a new network stack (RDMA) to build an optimized application.
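To tie the Pilaf get path together, here is a minimal client-side sketch. rdma_read(), checksum64(), hash_key(), the ht_entry layout, HT_SLOTS, and the *_base_offset variables are all hypothetical stand-ins (a real client would post one-sided RDMA READs through the ibverbs API against the server's registered regions); the sketch only shows the two-read structure and the checksum-based race detection.

```c
/* Sketch of a Pilaf-style client-side get. rdma_read() stands in for posting
 * a one-sided RDMA READ; the struct layout and checksum choice are assumed,
 * not Pilaf's actual wire format. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct ht_entry {
    uint64_t kv_offset;     /* remote offset of the key-value extent */
    uint32_t key_len;
    uint32_t val_len;
    uint64_t checksum;      /* covers the fields above (self-verifying) */
};

/* Hypothetical helpers assumed to exist in the client library. */
extern void rdma_read(uint64_t remote_offset, void *local_buf, size_t len);
extern uint64_t checksum64(const void *buf, size_t len);
extern uint64_t hash_key(const void *key, size_t key_len);

#define HT_SLOTS 4096            /* size of the fixed hash table array (assumed) */
extern uint64_t ht_base_offset;  /* remote offset of the hash table region */
extern uint64_t kv_base_offset;  /* remote offset of the key-value region */

/* Fetch the value for key; returns false if the slot doesn't hold this key.
 * (Empty-slot handling and further probing are omitted for brevity.) */
static bool pilaf_get(const void *key, size_t key_len, void *val_buf)
{
    struct ht_entry e;
    uint64_t slot = hash_key(key, key_len) % HT_SLOTS;

    /* 1st RDMA read: fetch the hash-table entry; retry if a concurrent
     * server-side put left a torn entry (checksum mismatch). */
    do {
        rdma_read(ht_base_offset + slot * sizeof(e), &e, sizeof(e));
    } while (checksum64(&e, offsetof(struct ht_entry, checksum)) != e.checksum);

    /* 2nd RDMA read: fetch the key-value extent the entry points to. */
    uint8_t kv[1024];
    if ((size_t)e.key_len + e.val_len > sizeof(kv))
        return false;            /* oversized for this sketch's buffer */
    rdma_read(kv_base_offset + e.kv_offset, kv, e.key_len + e.val_len);

    if (e.key_len != key_len || memcmp(kv, key, key_len) != 0)
        return false;            /* not our key; Pilaf would probe further */
    memcpy(val_buf, kv + e.key_len, e.val_len);
    return true;
}
```

The retry loop on a checksum mismatch is what lets the client read safely without the server taking any locks.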