Dynamic memory allocation in multicore systems.

* How does dynamic memory allocation work? malloc and free allocate and deallocate memory from the heap.

* Fixed-size allocation: divide the heap into fixed-size chunks. Free chunks are organized as a free list, with every free chunk storing a pointer to the next free chunk; the top of the free list is used to access the list. (A slab allocator is a slightly more complex allocator used when memory is allocated in one of a few known fixed sizes.) The pseudocode, written out in C (with no empty-list check, as in the original):

      struct chunk { struct chunk *next; };
      struct chunk *free_list;          /* top of the free list */

      void *alloc_chunk(void) {
          struct chunk *x = free_list;  /* x = free */
          free_list = x->next;          /* free = free->next */
          return x;
      }
      void free_chunk(struct chunk *x) {
          x->next = free_list;          /* x->next = free */
          free_list = x;                /* free = x */
      }

* Variable-sized allocation: a chunk header can store the size of the allocated chunk (see the header sketch below). When reallocating a chunk from the free list, fragmentation may occur if an exact size match is not found. One optimization is the "buddy" allocator: it organizes chunks into groups, where each group holds chunks of one power-of-2 size, and satisfies an allocation request from one of these sizes. The benefit of this scheme: a power-of-2 chunk can be split into 2 smaller chunks, or coalesced with its neighboring buddy to form a bigger chunk (see the buddy sketch below).

* The kernel uses slab and buddy allocators for fixed- and variable-size allocations. Userspace memory allocators (e.g., malloc in the C library) use a wide variety of heuristics; the application developer can choose a memory allocator based on the specific pattern of memory allocation expected.

* External fragmentation: unused space in the heap that can't be used by applications. Internal fragmentation: unused space within an allocated chunk. Allocators must try to minimize both. One metric of fragmentation: the ratio of memory allocated from the OS to memory actually usable by the application.

* What if multiple cores access the same heap? The threads must lock the heap to maintain consistency, serializing all allocation and freeing, which is bad for performance. So several multicore-scalable memory allocators have been developed.

* Desirable features of a memory allocator:
  - Speed: allocation and freeing must be O(1).
  - Low fragmentation.
  - Scalability: performance must scale with the number of cores.
  - Avoid false sharing of cache lines. An allocator can actively introduce false sharing by handing chunks from the same cache line to threads on different cores. An allocator can also passively introduce false sharing: for example, the main thread allocates chunks from the same cache line and passes them to threads running on different cores; if the allocator then lets each core reuse the chunks it freed, the false sharing persists (see the illustration below).
  - No blowup of memory in a multicore setting. Blowup is the ratio of memory consumed in a multicore setting to memory consumed in a single-core setting. Some allocators have unbounded or O(P) blowup, where P is the number of processors; ideally we would like O(1) blowup.
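To make the chunk header concrete, here is a minimal sketch (the struct and helper names are assumptions for illustration, not from any particular allocator): the header sits just before the memory handed to the caller, so free can step back one header to learn the chunk's size.

      #include <stddef.h>

      /* Hypothetical header for a variable-size allocator. */
      struct header { size_t size; };   /* size of the chunk that follows */

      /* The caller receives the address just past the header ... */
      void *user_ptr(struct header *h) { return h + 1; }
      /* ... and free() steps back one header to recover the size. */
      size_t chunk_size(void *p) { return ((struct header *)p - 1)->size; }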
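One common way to realize buddy split/coalesce (a sketch, under the usual assumption that chunks sit at power-of-2-aligned offsets from the heap base): the buddy of a chunk of size 2^k is found by flipping bit k of the chunk's offset, so a freed chunk can locate its buddy in O(1) and coalesce if the buddy is also free.

      #include <stddef.h>

      /* Offset (from the heap base) of the buddy of the chunk at 'off'
         with size 2^k bytes. Splitting the 2^(k+1) chunk at 'off' yields
         chunks at 'off' and buddy_of(off, k); coalescing reverses this. */
      size_t buddy_of(size_t off, unsigned k) {
          return off ^ ((size_t)1 << k);
      }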
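To illustrate active false sharing (a contrived example, not real allocator code): suppose cache lines are 64 bytes, and an allocator carves two 8-byte chunks from one line and hands them to threads on different cores. The chunks are logically independent, yet every write by one thread invalidates the line in the other core's cache.

      /* Two "allocations" sharing one 64-byte cache line. */
      _Alignas(64) static char line[64];
      static long *counter_a = (long *)&line[0];  /* used by a thread on core 0 */
      static long *counter_b = (long *)&line[8];  /* used by a thread on core 1:
                                                     same line, so writes ping-pong
                                                     the line between the caches */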
* What are some ways to do multicore-scalable memory allocation?
  - The simplest solutions allocate serially from a single heap, or allow concurrent access to a single heap partitioned by object size (e.g., allocate large and small chunks from different parts of the heap in parallel). All such allocators induce false sharing, do not scale to many cores, and suffer from synchronization overhead.
  - Per-processor private heaps. Threads on a core access only that core's private heap for all allocation and freeing. This design is scalable and fast, but the blowup can be unbounded: in a producer-consumer situation, all allocations happen on one heap while the memory is freed to another, so the producer's heap grows without bound. This design can also lead to false sharing, since chunks allocated from the same cache line on one core can be freed to different heaps on different cores.
  - Private heaps with ownership return free blocks to the heap from which they were allocated. This scheme is used by ptmalloc, a commonly used allocator on Linux. The blowup for these allocators is O(P). For example, consider a round-robin producer-consumer, where K blocks are produced on core i and freed on core i+1, then produced on i+1 and freed on i+2, and so on. Each of the P cores' heaps ends up holding K blocks (P*K in total), while a serial allocator would have allocated only K blocks; the blowup is therefore O(P). What about false sharing? ptmalloc can still actively induce false sharing if different threads allocate from the same cache line and then migrate to different cores.
  - Private heaps with thresholds. Such allocators proactively track empty portions of the private heaps and recycle them, avoiding blowup.

* The reference [Hoard] describes a memory allocator that claims to achieve O(1) blowup, minimal false sharing, multicore scalability, and fast allocation. It is based on the idea of private heaps with thresholds.

* Hoard has multiple heaps per processor (2 in the paper) and one global heap. Concurrent allocations from different threads are satisfied from different heaps. Every heap consists of superblocks (each spanning one or more pages). Each superblock holds blocks/chunks that can be allocated; all blocks in a superblock are the same size, and different superblocks serve different size classes (much like the buddy allocator). A rough sketch of these structures appears at the end of these notes.

* Each heap owns some superblocks. Once enough of a heap's superblocks become mostly empty, i.e., the heap's free space crosses an emptiness threshold, Hoard returns some superblocks to the global heap. This threshold-based monitoring of empty space is what bounds blowup (see the threshold sketch at the end). Hoard keeps superblocks in bins based on how full they are, and always tries to allocate from the fullest superblocks first to limit fragmentation. When a chunk is freed, its superblock is moved to the front of its bin, so that it is preferred in future allocations.

* When multiple threads make requests, they are satisfied from different heaps as far as possible, avoiding active false sharing. When a block is freed, it is returned to its owner superblock, avoiding passive false sharing. False sharing is still possible when superblocks are returned to the global heap and later reused by another processor, but such events are infrequent.

* Evaluation results show that Hoard is only slightly slower than a simple uniprocessor allocator (Table 2), scales well with the number of cores across different applications (Fig 3), scales well on benchmarks that actively/passively induce false sharing (Fig 4), and exhibits very little fragmentation: the ratio of maximum memory allocated to memory in use is around 1 (Table 4).
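A rough sketch of Hoard-like structures in C (field and constant names are my assumptions; the real implementation differs in detail):

      #include <pthread.h>
      #include <stddef.h>

      #define FULLNESS_BINS 4            /* assumed number of fullness groups */

      struct superblock {
          size_t block_size;             /* all blocks in a superblock are one size */
          unsigned used, capacity;       /* fullness decides which bin it sits in */
          struct superblock *next;       /* link within a fullness bin */
          /* ... free list or bitmap of the blocks themselves ... */
      };

      struct heap {                      /* per-processor heap (plus one global) */
          pthread_mutex_t lock;
          size_t in_use, held;           /* bytes live vs. bytes held */
          struct superblock *bins[FULLNESS_BINS];  /* searched fullest-first */
      };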
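The emptiness threshold can be phrased as a check after each free, following the invariant in the Hoard paper: with u the bytes in use on a heap, a the bytes it holds, S the superblock size, f the empty fraction, and K a small slack constant, a heap releases a mostly-empty superblock to the global heap when u < a - K*S and u < (1 - f)*a. A sketch, with assumed constant values:

      #include <stddef.h>

      #define S 8192          /* assumed superblock size, in bytes */
      #define F 0.25          /* assumed empty fraction f */
      #define K 4             /* assumed slack: keep up to K spare superblocks */

      /* Returns nonzero when a heap with 'in_use' live bytes out of 'held'
         total bytes is empty enough that a superblock should be handed
         back to the global heap (both threshold conditions hold). */
      int should_release_superblock(size_t in_use, size_t held) {
          return in_use + K * S < held
              && in_use < (size_t)((1.0 - F) * held);
      }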