Memory allocation and memory access in applications

* So far, we have seen how virtual memory is allocated from the OS point of view. From an application's point of view, how does an application allocate memory for its data structures?
  - Static allocation: static/global variables are assigned virtual addresses by the compiler, and are assigned physical memory when physical frames are allocated to the program executable (say, during the exec system call or during a page fault).
  - Dynamic allocation: using malloc on the heap. Malloc manages the heap by dividing it into chunks of memory, and finding a chunk that satisfies the request. Malloc is implemented by the C library. Other language libraries have similar functions. Some languages garbage collect heap memory automatically, while malloc'ed memory must be explicitly freed in C.
  - Automatic allocation: local variables are allocated on the stack automatically. Further, the "alloca" function (not a system call; it is typically a compiler built-in) can be used to allocate memory on the stack as part of a function call. This memory is automatically deallocated when the function returns and the stack frame is popped. This is not very popular. (The first sketch below illustrates these three classes of allocation.)

* How does malloc get its memory? The brk/sbrk system calls grow the contiguous code+data+heap portion of the memory image, and can be used to grow the heap. (User programs should not invoke these system calls directly, since malloc, which manages the heap, may concurrently be allocating this memory.) Otherwise, the mmap system call can be used to map a new page into the address space of a process anywhere, not just in the heap. Malloc uses brk for small allocations, and mmap for larger allocations (see the second sketch below). Modern versions of malloc use mmap to create multiple heaps in multithreaded systems.

* Malloc divides the heap into chunks and allocates them. The free chunks are maintained in linked-list-like data structures, and a free chunk is found to satisfy each memory request. Each chunk has a header that stores the size of the allocation among other things, so that the number of bytes to be freed is known later on (see the third sketch below). Also, specific systems can override malloc and build their own memory allocators based on the needs of the application, e.g., if it is known that the application allocates only fixed-size objects.

* More on mmap, a very powerful system call, used both to allocate pages to a process and to perform file I/O. mmap allocates a physical frame and maps it into any part of the virtual address space of a process. The page can be backed by a file (when a file is memory mapped for I/O) or can be anonymous (purely residing in RAM). The page can also be private (changes visible only in the address space of one process) or shared (changes visible across processes). Most memory mmap'ed by malloc, as well as the heap, stack, and other such regions, is private anonymous memory.

* mmap is a more general system call for increasing process memory than brk/sbrk, because brk can only allocate memory at the "program break" virtual address (the end of the code+data+heap section), whereas mmap can be used to allocate memory at any virtual address.

* When a file is memory mapped and read, two copies of the data can exist: one in the memory-mapped pages and one in the disk buffer cache. So modern versions of Linux use a unified page cache. If a file is read via file descriptors, disk blocks are cached in pages. If a file is read via memory mapping, page-sized chunks of the file are read from disk (the fourth sketch below reads a file this way).
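First, a minimal C sketch of the three classes of allocation above, assuming a Linux/glibc environment (alloca is nonstandard); the variable names are invented for illustration:

    #include <alloca.h>
    #include <stdio.h>
    #include <stdlib.h>

    int counter = 0;               /* static allocation: virtual address fixed by the compiler/linker */

    void demo(void) {
        int local = 42;                      /* automatic allocation: lives in this stack frame */
        char *scratch = alloca(64);          /* also on the stack; reclaimed when demo() returns */
        int *arr = malloc(10 * sizeof *arr); /* dynamic allocation: a chunk of the heap */
        if (arr == NULL) return;
        arr[0] = local + counter;
        snprintf(scratch, 64, "arr[0] = %d", arr[0]);
        puts(scratch);
        free(arr);                 /* heap memory must be freed explicitly in C */
    }                              /* local and scratch are deallocated here, with the stack frame */

    int main(void) { demo(); return 0; }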
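Second, a sketch of the two ways malloc obtains memory from the OS: growing the program break with sbrk, and mapping private anonymous pages with mmap (assuming Linux; the 4096-byte size is just an illustrative page size):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Grow the heap: sbrk extends the program break and returns its old value. */
        void *chunk = sbrk(4096);
        if (chunk == (void *)-1) { perror("sbrk"); return 1; }
        memset(chunk, 0, 4096);    /* the newly added heap memory is now usable */

        /* Map a private anonymous page anywhere in the address space: not backed
           by a file and visible only to this process; this is the kind of memory
           malloc requests for large allocations and for extra heaps. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        strcpy(p, "hello from an anonymous mapping");
        puts(p);
        munmap(p, 4096);
        return 0;
    }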
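Third, a toy first-fit allocator illustrating chunk headers and a free list; toy_malloc and toy_free are invented names, and real allocators additionally split and coalesce chunks:

    #include <stddef.h>
    #include <unistd.h>

    /* Each chunk carries a header, so toy_free() can tell how large the chunk is. */
    struct chunk {
        size_t size;          /* usable bytes in this chunk */
        int free;             /* 1 if the chunk is available for reuse */
        struct chunk *next;   /* next chunk in the list */
    };

    static struct chunk *head = NULL;

    void *toy_malloc(size_t n) {
        /* First fit: reuse the first free chunk that is large enough. */
        for (struct chunk *c = head; c != NULL; c = c->next)
            if (c->free && c->size >= n) { c->free = 0; return c + 1; }
        /* No fit: grow the heap with sbrk and carve out a new chunk. */
        struct chunk *c = sbrk(sizeof(struct chunk) + n);
        if (c == (void *)-1) return NULL;
        c->size = n; c->free = 0; c->next = head; head = c;
        return c + 1;          /* user memory starts just past the header */
    }

    void toy_free(void *p) {
        if (p != NULL)
            ((struct chunk *)p - 1)->free = 1;  /* step back to the header */
    }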
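Fourth, a sketch that reads a file through a memory mapping: the file's pages are mapped into the address space and accessed like an array (assuming Linux; error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

        /* Map the whole file read-only: the pages come straight from the kernel's
           page cache, with no copy into a user buffer as read() would make. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);   /* the mapping remains valid after the descriptor is closed */

        /* Access the file like an ordinary array; the first touch of each page
           may fault it in from disk. */
        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (data[i] == '\n') lines++;
        printf("%ld lines\n", lines);

        munmap(data, st.st_size);
        return 0;
    }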
* Note that reading a file via memory mapping can be faster, both because the kernel page is mapped directly into the user program (instead of being copied from the disk buffer cache into user memory) and because larger chunks are read from disk. Similarly, writing to a memory-mapped file may be faster because only the in-memory pages are changed, and multiple changes are written to disk at once when the file is flushed or unmapped.

* Now that we have seen memory allocation, let us understand memory access. End-to-end view of memory access: what happens when the CPU issues a request to a memory location?
  - Check one or more levels of caches. If the requested address is found in a cache (cache hit), the data is available immediately. Else, on a cache miss, main memory must be accessed.
  - To access main memory, the TLB is checked by the MMU. On a TLB hit, the physical address is found, the memory location is read, populated into the caches, and finally made available to the CPU.
  - On a TLB miss, the MMU walks the page tables to locate the PTE. If the PTE is valid, the physical frame number is found, the physical address is calculated, and the memory location is accessed.
  - If the PTE is found to be invalid, a page fault occurs. Servicing the page fault may cause one or more disk accesses (if the page is file backed, or if some other page needs to be evicted to disk). Eventually, the page is mapped to a physical frame, and the OS restarts the faulting instruction.

* Approximate values of access latency: caches and TLB (a few ns), main memory (a few tens of ns), disk access (a few ms). A CPU cannot complete an instruction until its memory access completes, so several techniques are used to hide memory access latency, e.g., prefetching: instructions and data can be prefetched into caches, and pages can be preloaded before a page fault occurs. Hardware designers worry about these concerns a lot, to make sure the CPU executes an instruction every cycle instead of stalling on memory loads and stores. (The sketch below gives a rough feel for the cost of page faults relative to ordinary memory writes.)
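A rough experiment sketch for the latency discussion above (assuming Linux; the 64 MB size is arbitrary and the exact numbers depend on the machine): the first pass over freshly mmap'ed memory takes a page fault on every page, while the second pass finds the pages already mapped:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #define SZ (64 * 1024 * 1024)   /* 64 MB, an arbitrary illustrative size */

    static double seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        double t0 = seconds();
        memset(buf, 1, SZ);   /* first touch: a page fault on every page */
        double t1 = seconds();
        memset(buf, 2, SZ);   /* second touch: pages already mapped, no faults */
        double t2 = seconds();

        printf("first pass %.3f s, second pass %.3f s\n", t1 - t0, t2 - t1);
        munmap(buf, SZ);
        return 0;
    }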