Memory Subsystem

* Logical address space of a process: addresses assigned to the instructions and data of an application; the CPU uses these addresses to refer to code and data while executing the process. The logical address space has several parts: the compiler assigns addresses to the code and data in the executable. When the executable is loaded into memory (say, as part of the exec system call), the kernel allocates virtual addresses to the other portions of the application, such as the heap and stack. Kernel code and shared libraries are also mapped into the virtual address space of the process.
* The virtual address space of a process is divided into logical pages, and a page table entry (PTE) is created for each page that has been mapped. When we say an entity X (kernel, shared library, etc.) is "mapped" into the virtual address space of a process, we simply mean that a set of addresses is allocated to the code and data of X, and page table entries are created for those addresses.
* Not all logical addresses are assigned physical memory. Modern operating systems use virtual memory and demand paging: logical pages are mapped to physical frames only on a need basis, i.e., when first accessed. The first time a logical address is accessed, its page table entry does not contain a physical address, so the MMU raises a page fault and the process moves to kernel mode. The OS then allocates a free physical frame and updates the PTE with the physical frame number. When the process is restarted, the MMU can successfully serve the memory request. Note that some logical pages (e.g., the kernel code that handles page faults) must always be backed by physical frames and must never page fault.
* Bits in the page table can also be set to cause page faults for other purposes, even when physical frames have been allocated, e.g., to implement copy-on-write fork.
In copy-on-write fork, the parent's and child's page tables point to the same physical pages, but with the PTEs marked read-only. When either process writes to such a page, the MMU raises a page fault, and the OS can then allocate a new page, copy the contents, and update the page tables.
* What if the OS has run out of physical frames to allocate on a page fault? Then a page replacement algorithm (e.g., FIFO, LRU) is used to evict a page from some frame, write its contents out to swap space on disk, and reallocate that physical frame to the faulting logical page. The OS usually overcommits memory and manages with fewer physical frames than logical pages.
* The size of a page table = number of pages × size of each PTE. If this is large and the page table does not fit in a single page, the page table itself can be paged, i.e., split over many pages. With a two-level hierarchical page table, an outer page table stores the physical frame numbers of the frames holding the inner page tables, which in turn contain the actual PTEs.
* The page table is created and updated by the OS (e.g., when pages are allocated by demand paging). A pointer to the outermost page table is stored in a CPU register and updated on every context switch. The OS is, however, not involved in address translation on every memory access; that task is offloaded to the MMU, which walks the page table to translate an address. One memory request from the CPU can thus result in multiple memory accesses, because the page tables themselves must be read first. Therefore, modern systems use a hardware cache of virtual-to-physical address mappings called the TLB. If a virtual-to-physical mapping is found in the TLB (a TLB hit), the memory location can be accessed directly. Otherwise, the page table must be walked to obtain the physical address before the CPU can access the memory location.
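The demand-paging flow described above can be sketched as a toy simulation. This is a minimal sketch with hypothetical names (`DemandPagedAddressSpace`, `access`); it ignores permissions, eviction, and the TLB, and simply allocates a frame the first time a page is touched:

```python
# Toy demand-paging sketch (hypothetical names, not a real OS interface).
# A page gets a physical frame only on first access, mimicking the
# page-fault path: fault -> OS allocates a free frame -> PTE updated.

PAGE_SIZE = 4096

class DemandPagedAddressSpace:
    def __init__(self, num_frames):
        self.page_table = {}                 # page number -> frame number
        self.free_frames = list(range(num_frames))
        self.page_faults = 0

    def access(self, vaddr):
        """Translate a virtual address, allocating a frame on first touch."""
        page, offset = divmod(vaddr, PAGE_SIZE)
        if page not in self.page_table:      # PTE has no frame: page fault
            self.page_faults += 1
            frame = self.free_frames.pop(0)  # OS picks a free frame
            self.page_table[page] = frame    # OS updates the PTE
        return self.page_table[page] * PAGE_SIZE + offset  # physical address

aspace = DemandPagedAddressSpace(num_frames=8)
aspace.access(0)       # first touch of page 0: page fault
aspace.access(100)     # same page, already mapped: no fault
aspace.access(8192)    # first touch of page 2: second page fault
print(aspace.page_faults)   # -> 2
```

Accessing the same page again after the first fault goes straight through the page table, which is exactly the distinction between a first touch and a repeat access.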
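Copy-on-write fork can be sketched in miniature the same way. Again, all names here are hypothetical, and a real kernel additionally tracks per-frame reference counts; the sketch only shows the share-read-only-then-copy-on-write mechanic:

```python
# Toy copy-on-write sketch (hypothetical names). Parent and child share
# frames with PTEs marked read-only; the first write "faults" and copies.

class Frame:
    def __init__(self, data):
        self.data = data

def cow_fork(parent_table):
    """Share all frames between parent and child; mark PTEs read-only."""
    for pte in parent_table.values():
        pte["writable"] = False
    # Child gets its own page table, but entries point to the same frames.
    return {page: dict(pte) for page, pte in parent_table.items()}

def write(page_table, page, data):
    pte = page_table[page]
    if not pte["writable"]:                    # MMU would page fault here
        pte["frame"] = Frame(pte["frame"].data)  # OS copies the frame
        pte["writable"] = True                   # and remaps it writable
    pte["frame"].data = data

parent = {0: {"frame": Frame("hello"), "writable": True}}
child = cow_fork(parent)
write(child, 0, "world")            # child's write triggers the copy
print(parent[0]["frame"].data)      # -> hello (parent unaffected)
print(child[0]["frame"].data)       # -> world
```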
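A quick calculation shows why the page table itself must be paged. The numbers below assume a 32-bit virtual address space, 4 KB pages, and 4-byte PTEs; these are common textbook values for illustration, not tied to any particular CPU:

```python
# Page table size arithmetic for a 32-bit address space with 4 KB pages
# and 4-byte PTEs (illustrative numbers).

PAGE_SIZE = 4096     # 2**12, so the low 12 bits of an address are the offset
PTE_SIZE = 4
VA_BITS = 32

num_pages = 2**VA_BITS // PAGE_SIZE       # 2**20 = 1,048,576 pages
table_size = num_pages * PTE_SIZE         # 4 MB: far too big for one page
print(table_size)                         # -> 4194304

# With a two-level table, the 20 page-number bits are split in half:
# 10-bit outer index, 10-bit inner index, 12-bit offset (10+10+12 = 32).
def split(vaddr):
    offset = vaddr & 0xFFF          # low 12 bits: offset within the page
    inner = (vaddr >> 12) & 0x3FF   # next 10 bits index an inner table
    outer = vaddr >> 22             # top 10 bits index the outer table
    return outer, inner, offset

print(split(0x00403007))            # -> (1, 3, 7)
```

Each inner table then holds 1024 PTEs × 4 bytes = exactly one 4 KB page, which is why the 10/10/12 split is the natural one here.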
* Mind the difference between a page fault (no physical frame has yet been allocated for the logical page, and one must be allocated) and a TLB miss (a physical frame has been allocated, but the translation is not cached in the TLB). The former results in a page fault trap that is handled by the OS and carries a significant performance penalty; the latter is much cheaper, costing only the extra memory accesses the MMU makes to walk the page table.
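The cost of TLB misses can be made concrete with the usual effective-access-time estimate. The latencies below are made-up illustrative values, and a one-level page table is assumed (so a miss costs one extra memory access; a multi-level walk would cost more):

```python
# Effective memory access time with a TLB (standard textbook estimate;
# the nanosecond figures are illustrative, not measured values).

def effective_access_time(hit_ratio, mem_ns=100, tlb_ns=10):
    hit_cost = tlb_ns + mem_ns        # TLB hit: lookup + one memory access
    miss_cost = tlb_ns + 2 * mem_ns   # miss: also read the page table entry
    return hit_ratio * hit_cost + (1 - hit_ratio) * miss_cost

print(effective_access_time(0.99))    # -> ~111 ns, near a single access
print(effective_access_time(0.50))    # -> 160.0 ns
```

With a realistic hit ratio near 99%, translation adds little overhead, which is what makes offloading the page-table walk to the MMU plus a TLB practical.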