CPU caches
* CPU caches: another piece that comes into play during memory access. The CPU executes multiple instructions in a pipeline, and the CPU is much faster than main memory. If one pipeline stage stalls on a memory access, it disrupts the entire pipeline, so low memory access latency is crucial.
* DRAM vs SRAM: SRAM is much faster than DRAM but also much more expensive, so a mix of both is used. There are two ways of using SRAM and DRAM together: either expose both to the OS/applications and let the OS/application control which part of the memory image goes where, or make it transparent to the OS/application and use the faster, more expensive SRAM as a cache of recently used instructions and data. The latter option is what computer systems use.
* Every CPU core has an instruction cache and L1/L2 data caches. Multiple hardware threads on a core share all caches (only the registers are separate). Multiple cores on a processor share a last-level L3 cache. Multiple processors on a machine do not share any caches; they share only the main memory. The number of caches and their sizes vary across systems. Approximate cache size: a few MB, with access times of a few ns, as compared to a few GB of main memory with access times of a few tens/hundreds of ns.
* Understand the terms spatial locality and temporal locality of memory accesses. A high locality of reference is crucial to getting good performance from caches (a traversal sketch appears a few points below).
* Caches store memory contents in cache lines, typically of 64 bytes. When a memory location is accessed, an entire cache line worth of memory is copied into the cache. For a write too, an entire cache line must be loaded first and the relevant bytes modified. A modified cache line is marked dirty, and must be flushed to main memory eventually. Caches can be write through (write to memory immediately) or write back (mark dirty and write later).
* Since the cache is smaller than memory, cache lines/slots need to be evicted to make room for newer memory accesses. A policy like LRU is used to find the victim cache line. Inclusive caches: when fetching a cache line from memory, it is populated in L3, L2, and L1. So eviction is easy, as a lower-level cache already has the content (if the line is not dirty). Exclusive caches store a line only in L1, so a victim line must be flushed to L2 or main memory to make space in L1.
* When does a CPU actually write to a cache line? The CPU uses a store buffer, and values are not immediately written to the cache or memory. That is, memory as seen by the CPU and as seen by the cache/memory can be temporarily inconsistent. A fence or memory barrier instruction can be used to drain the store buffer for consistency (see the litmus-test sketch below).
* Types of caches: direct mapped, fully associative, set associative. In a direct mapped cache, every memory location maps to exactly one cache line (say, by taking the memory address modulo the number of cache lines). In a fully associative cache, a memory location can occupy any cache slot. A set associative cache is in between these two extremes: a memory address maps to a set of, say, 4 or 8 cache slots, and can occupy any of them.
* Cache size = number of cache sets * associativity of a set * size of a cache line.
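The following is a minimal C sketch of the locality point above; the 2048x2048 array size and the clock()-based timing are illustrative assumptions, not from the notes. Both loops sum the same array, but the row-major loop uses every byte of each 64B cache line it fetches, while the column-major loop strides across rows and wastes most of each line:

```c
/* Spatial locality sketch: same work, different access order. */
#include <stdio.h>
#include <time.h>

#define N 2048

static int a[N][N];   /* 16MB, much larger than typical caches */

int main(void) {
    long sum = 0;
    clock_t t;

    t = clock();
    for (int i = 0; i < N; i++)        /* row-major: consecutive addresses, */
        for (int j = 0; j < N; j++)    /* each fetched cache line fully used */
            sum += a[i][j];
    printf("row-major:    %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < N; j++)        /* column-major: jumps N*4 bytes per access, */
        for (int i = 0; i < N; i++)    /* pulls in a new line for every element */
            sum += a[i][j];
    printf("column-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sum & 1);             /* keep sum live so loops aren't optimized away */
}
```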
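And a sketch of the store buffer point, using the classic store-buffering litmus test with C11 atomics and threads (the relaxed stores and seq_cst fences here are an assumed illustration, not from the notes). Without the fences, each thread's store can still sit in its store buffer while its load executes, so r1 == r2 == 0 is a possible outcome on real hardware; draining the store buffers with a fence before the loads rules that outcome out:

```c
/* Store-buffering litmus test: one flag written, the other read, per thread.
 * A single run rarely exposes the reordering; this shows the structure only. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x = 0, y = 0;
int r1, r2;

int t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* drain store buffer before loading */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return 0;
}

int t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}

int main(void) {
    thrd_t a, b;
    thrd_create(&a, t1, NULL);
    thrd_create(&b, t2, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);   /* with fences, r1==0 && r2==0 cannot happen */
    return 0;
}
```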
* If the cache line size is 2^K bytes, then the last K bits of an address give the offset within the cache line. If the cache has 2^S sets, the next S bits of the address identify the set. The remaining bits of the memory address are stored as a "tag" so that we can identify which of the multiple aliased memory addresses that map to the same set is actually present. For a fully associative cache, S = 0, and all the bits beyond the offset form the tag. In a set associative cache, S depends on the number of cache sets available.
* For example, consider a 4MB cache with 8-way associativity, a 64B cache line size, and 32-bit memory addresses. Size of each cache set = 64*8 = 2^9 bytes. Number of cache sets = 2^22/2^9 = 2^13 = 8K. So the last 6 bits are the offset within a cache line, and the next 13 bits identify the cache set. The remaining 32-6-13 = 13 bits form the tag. We must compare this 13-bit tag against each of the 8 entries in the set to see if a memory address is present (a small sketch of this decomposition appears at the end of these notes).
* What if the previous example were a direct mapped cache of the same size? Number of cache lines = 2^22/2^6 = 2^16, so we have a 6-bit offset, a 16-bit index, and a 10-bit tag to compare to determine which of the multiple aliased addresses has been stored.
* What if the above example were a fully associative cache? All 32-6 = 26 bits are stored as the tag, and the tag has to be compared for all 2^16 entries.
* Clearly, one can see the advantage of set associative caches. In fact, 4-way or 8-way associativity seems enough to reduce the cache miss rate significantly.
* The TLB is also a type of cache, usually fully associative and small in size. It also has multiple levels, like CPU caches.
* Which memory addresses are used in the cache tag: virtual or physical? For L1 caches, virtual-to-physical translation does not happen fast enough, so virtual addresses are used. For the other caches, the physical address is used in the tag. Why? If the virtual address is used as the tag, the cache must be flushed when the page table changes (what you tagged as a certain address may point to a different memory location after a page table change). This may be acceptable for the smaller L1 cache but is not acceptable for bigger caches.
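A small C sketch of the tag/index/offset decomposition, plugging in the parameters of the worked example above (4MB, 8-way, 64B lines, 32-bit addresses); the helper name log2u and the example address are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE    (4u << 20)   /* 4 MB = 2^22 bytes  */
#define LINE_SIZE     64u          /* 2^6 bytes per line */
#define ASSOCIATIVITY 8u           /* 8 lines per set    */

/* number of bits needed to index n items (n a power of two) */
static unsigned log2u(unsigned n) {
    unsigned b = 0;
    while (n >>= 1) b++;
    return b;
}

int main(void) {
    unsigned sets        = CACHE_SIZE / (LINE_SIZE * ASSOCIATIVITY); /* 2^13 = 8192 */
    unsigned offset_bits = log2u(LINE_SIZE);                         /* 6  */
    unsigned index_bits  = log2u(sets);                              /* 13 */
    unsigned tag_bits    = 32 - offset_bits - index_bits;            /* 13 */

    uint32_t addr   = 0xDEADBEEF;                    /* arbitrary example address */
    uint32_t offset = addr & (LINE_SIZE - 1);        /* low K bits */
    uint32_t index  = (addr >> offset_bits) & (sets - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("sets=%u offset_bits=%u index_bits=%u tag_bits=%u\n",
           sets, offset_bits, index_bits, tag_bits);
    printf("addr=0x%08x -> tag=0x%x set=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}
```

On a lookup, the index selects one set, and the tag is compared against all 8 ways of that set in parallel; only a tag match means the address is present in the cache.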