Cache coherence in multicore systems
* Cache coherence: multiple cores of a CPU share the same memory, and hence may cache the same memory locations. Protocols are needed to ensure coherence between the multiple L1/L2 caches, especially with writeback caches. If one core has cached a memory location, modified it, but not flushed it to memory, and another core wants to access the same location, what happens? We need a cache coherence protocol between cores to ensure that the second core gets the updated value from the first core's cache, and not the stale value from memory.
* MESI cache coherence protocol: every cache line is in one of the following states: modified (dirty cache line in the local cache), exclusive (clean cache line, and no other cache holds this memory location), shared (clean, but copies may exist in other caches), and invalid (not to be used).
* Initially, all cache lines are empty/invalid. When a CPU core reads a memory location, the line is fetched into its cache in the exclusive state (assuming no other core has it cached). The core can read it locally many times in the exclusive state. If another core also reads the same location, then both cores must mark the line as shared, and both continue reading without any issues.
* Now what happens when a cached location is written to? An exclusive cache line goes into the modified state. A shared line also goes into the modified state, but in this case other cores may have a copy, so those cores must mark their copies as invalid. This is done via messages exchanged between cores to invalidate the other copies of this cached data item.
* A modified line can be written to and read from locally. Now, what if another core wants to read the data? Then the modified copy of the data is sent over the system bus, and the cache that requested it updates its copy. The memory controller can also snoop on this exchange and update memory. The cache line goes to the shared state in both caches.
* What if another core wishes to write to a locally modified cache location? In this case, the cache content must be sent over the system bus as before, and the locally modified copy must be marked invalid (because another core will modify it).
* Summary of MESI protocol (a small code sketch of these transitions appears after this list):
  - From state M: local read/write -> M; remote read -> S (send data over bus); remote write -> I (send data over bus)
  - From state E: local read -> E; local write -> M (no need to tell others); remote read -> S; remote write -> I (will receive a cache invalidation message)
  - From state S: local read -> S; local write -> M (tell others to invalidate their copies); remote read -> S; remote write -> I (will receive a cache invalidation message)
  - From state I: local read -> E (or S if another cache already holds the line); local write -> M; remote read/write -> N/A (the line is not present in this cache)
* So when the same data is accessed across two CPU cores (e.g., a thread migrates from one core to another, or two threads on two different cores are updating the same shared data), cache coherence overhead arises. This coherence traffic can saturate the system bus, in addition to causing cache misses and high memory access latency.
* In fact, atomic instructions like compare-and-swap (CAS, cmpxchg on x86) or atomic exchange (xchg) are implemented using the cache coherence mechanism. For example, to atomically update a variable, the variable is loaded into the cache in the exclusive state, and the core that is updating it will not respond to any cache coherence messages or relinquish the line until the atomic update is done. (In some sense, cache coherence protocols synchronize access to memory at cache line granularity.) Note that atomic instructions are often used with fence instructions to guarantee a consistent view of memory; see the CAS sketch below.
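The transition summary above can be viewed as a per-line state machine. The sketch below is a minimal illustration of exactly those transitions, not a real cache simulator: the enum names and the mesi_next() helper are made up for this example, and it follows the simplified rule from these notes that a local read from Invalid goes to Exclusive.

```c
/* Minimal sketch of the MESI transitions from the summary above,
 * seen from one core's point of view for a single cache line. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

static mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (s) {
    case MODIFIED:
        if (e == REMOTE_READ)  return SHARED;    /* send dirty data over the bus, then share */
        if (e == REMOTE_WRITE) return INVALID;   /* send dirty data, then give up the line */
        return MODIFIED;                         /* local reads/writes stay in M */
    case EXCLUSIVE:
        if (e == LOCAL_WRITE)  return MODIFIED;  /* no need to tell other cores */
        if (e == REMOTE_READ)  return SHARED;
        if (e == REMOTE_WRITE) return INVALID;
        return EXCLUSIVE;
    case SHARED:
        if (e == LOCAL_WRITE)  return MODIFIED;  /* broadcast invalidations to other copies */
        if (e == REMOTE_WRITE) return INVALID;
        return SHARED;                           /* local/remote reads stay in S */
    case INVALID:
        if (e == LOCAL_READ)   return EXCLUSIVE; /* simplification: assumes no other cache holds it */
        if (e == LOCAL_WRITE)  return MODIFIED;
        return INVALID;                          /* remote traffic does not concern this cache */
    }
    return INVALID;
}

int main(void) {
    mesi_state s = INVALID;
    s = mesi_next(s, LOCAL_READ);   /* I -> E */
    s = mesi_next(s, REMOTE_READ);  /* E -> S */
    s = mesi_next(s, LOCAL_WRITE);  /* S -> M, other copies get invalidated */
    printf("final state: %d\n", s); /* prints 0 (MODIFIED) */
    return 0;
}
```

For the note on atomic instructions, here is a sketch of how a compare-and-swap is typically used from C, via the C11 `<stdatomic.h>` API. The hardware detail discussed above (the line being held exclusively for the duration of the read-modify-write) is invisible at this level; the memory-order arguments play the role of the fences mentioned in the notes.

```c
/* Illustrative CAS retry loop using C11 atomics. On x86 the compiler lowers
 * the compare-exchange to a lock cmpxchg instruction, and the hardware keeps
 * the cache line in the exclusive/modified state while the update completes. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long counter = 0;

void atomic_add(long delta) {
    long old = atomic_load_explicit(&counter, memory_order_relaxed);
    /* Retry until no other core changed the value between our load and the CAS. */
    while (!atomic_compare_exchange_weak_explicit(
               &counter, &old, old + delta,
               memory_order_acq_rel,     /* success: orders memory around the update */
               memory_order_relaxed)) {  /* failure: 'old' is refreshed automatically */
        /* retry */
    }
}

int main(void) {
    atomic_add(5);
    printf("%ld\n", counter);  /* prints 5 */
    return 0;
}
```

In practice a plain atomic_fetch_add would do the same job in one instruction; the explicit CAS loop is shown only because the notes discuss CAS.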
* What about hardware-level hyperthreading? The CPU has multiple hardware threads of execution, which share the CPU hardware but have separate register sets. The hyperthreads look like two separate cores to the OS scheduler, so two processes are scheduled onto them. At the hardware level, when one of the processes stalls on a memory access, the other process executes on the same CPU hardware in a time-multiplexed fashion. When is hyperthreading useful? Only if the cache miss rate is high enough that one thread stalls long enough for the other thread to make useful progress.
* Classification of cache misses:
  - Cold miss: first access to a memory location. Cannot be avoided, except by prefetching.
  - Capacity miss: the data was in the cache but was evicted because the cache was full, and must be fetched again later. Occurs when the working set size is too big.
  - Conflict miss: multiple data items map to the same cache location, so evictions happen even when the cache is not full. Less of a problem with set-associative caches. Can be avoided by padding data structures suitably.
  - True sharing miss: a thread on another core wanted the same data, so the current copy was invalidated. Can be avoided by minimizing sharing of data between threads.
  - False sharing miss: threads are accessing different memory locations that happen to fall in the same cache line, leading to invalidations. Can be avoided by padding each thread's data to fall on cache line boundaries, as shown in the sketch below.
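A minimal sketch of the padding fix for false sharing, assuming a 64-byte cache line (the CACHE_LINE constant, struct name, and thread count are illustrative). Each worker increments only its own counter; because each counter is aligned to its own cache line, the increments stay in the modified state locally and generate no coherence traffic between cores.

```c
/* False-sharing avoidance by aligning per-thread counters to cache lines. */
#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define NTHREADS   4

/* Aligned so each counter occupies its own cache line. Without the alignas,
 * adjacent counters would pack into one line, and every increment by one
 * thread would invalidate that line in the other cores' caches. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t count;
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    int tid = (int)(intptr_t)arg;
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        counters[tid].count++;        /* touches only this thread's cache line */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d: %llu\n", i, (unsigned long long)counters[i].count);
    return 0;
}
```

Removing the alignas (so all counters share one or two cache lines) typically makes this loop several times slower on a multicore machine, which is exactly the false-sharing effect described above.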