Multicore systems: overview

* What are multicore systems? Systems with multiple CPU execution cores that can run applications in parallel. Many types of multicore systems exist.
* How are the multiple CPU cores arranged? Multicore (multiple cores on a single socket or chip) vs. multiprocessor (multiple cores spread across multiple sockets or chips) vs. multiple hardware threads (hyperthreading) on a single CPU core. Each socket or processor can have its own memory controller, and the multiple sockets can be connected to each other in some topology.
* The multiple cores can be identical with equal access to all memory (symmetric multiprocessors, or SMP), or some CPU cores can be closer to certain ranges of memory addresses (non-uniform memory access, or NUMA).
* What about caches? Usually the cache level closest to a core (e.g., L1) is private to that core. The cores in a chip can share a level of cache, and cores across sockets can share other levels of cache. The caches can be uniformly distributed, or some cache lines can be closer to some cores, much like NUMA in main memory. The number and types of caches (exclusive vs. inclusive) can vary.
* What about cache coherence? Different systems use different methods: snooping/broadcast protocols, where cache invalidation messages or modified cache lines are broadcast to all cores over a common bus, and directory-based coherence protocols, where the location of cache lines is explicitly tracked. Directory-based protocols are more scalable in larger multicore systems.
* What is the memory model? Shared memory systems, where all cores use the same memory address space, vs. distributed memory systems, where each CPU has its own memory address space. In shared memory systems, an underlying cache coherence protocol (almost always) ensures a unified view of memory across all cores, and processes/threads on different cores can communicate using shared memory variables.
* Communication and synchronization between threads in a shared memory system is implicit:
  - Sharing common data structures correctly, without race conditions. This sharing should satisfy the properties of mutual exclusion on critical sections, progress (freedom from deadlock), and bounded wait (freedom from starvation). Sharing with these properties can be achieved via locks, via lock-free/non-blocking data structures, or via transactional memory.
  - Signaling via semaphores/condition variables. (A pthreads sketch of both locking and signaling appears after this list.)
* In contrast, processes in distributed systems must communicate explicitly via message passing, or must build a shared-memory-like abstraction over distributed memory (using complex, cache-coherence-like protocols). Beyond a certain number of cores, the CPU-memory bus becomes a bottleneck, cache-coherent shared memory in a single system no longer performs well, and the multiple cores are better off being distributed across systems. (A message-passing sketch appears below.)
* How to program multicore systems? We need multiple threads of execution to utilize multiple cores. Two ways to do this:
  - Manual parallel programming: explicitly create threads and assign tasks to them. Who schedules these threads? The OS (kernel threads) or library code (user threads). Example: pthreads (sketch below).
  - Automatic parallel programming: the user simply indicates in the code which parts of the program can execute in parallel, and the compiler takes care of generating parallel code. Example: the OpenMP framework, where compiler directives are used to parallelize certain parts of the code, e.g., for loops (sketch below).
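A minimal sketch of lock-based sharing and condition-variable signaling with pthreads. The shared counter, the thread count of 4, and the `done` flag are illustrative choices, not part of the notes above; compile with `-pthread`.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared data structure (illustrative) */
static int done = 0;                     /* how many workers have finished */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* mutual exclusion on the critical section */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done++;
    pthread_cond_signal(&cond);          /* signal the waiting thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    while (done < 4)                     /* re-check the predicate: wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* always 400000 because of the lock */
    return 0;
}
```

Without the mutex the increments would race and the final count would vary from run to run.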
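For contrast, a minimal message-passing sketch. The notes above do not name a library; MPI is used here as a standard example, with an assumed payload of a single int. Run with something like `mpirun -np 2 ./a.out`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* each process has its own address space */

    if (rank == 0) {
        value = 42;                        /* illustrative payload */
        /* No shared variables: data moves only via explicit messages. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```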
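A sketch of the manual approach: threads are created explicitly and each is assigned its own slice of the work. The array-sum task, the array size, and the thread count are assumptions for illustration.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double a[N];
static double partial[NTHREADS];

/* Each thread is manually assigned one contiguous chunk of the array. */
static void *sum_chunk(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;                      /* no data race: one slot per thread */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    for (long id = 0; id < NTHREADS; id++)          /* explicit thread creation */
        pthread_create(&t[id], NULL, sum_chunk, (void *)id);
    for (long id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);

    double total = 0.0;
    for (int id = 0; id < NTHREADS; id++)
        total += partial[id];
    printf("sum = %f\n", total);
    return 0;
}
```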
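The same array sum in the automatic style with OpenMP: a single compiler directive parallelizes the for loop, and the compiler/runtime handle thread creation and scheduling. Compile with `-fopenmp` on gcc/clang; the `reduction` clause is how OpenMP makes the shared accumulation safe.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* The directive splits the loop iterations across threads;
       reduction(+:total) gives each thread a private partial sum
       and combines them at the end. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < N; i++)
        total += a[i];

    printf("sum = %f\n", total);
    return 0;
}
```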
* Which one to use: manual or automatic? The manual way gives more control and is needed when different threads must do different tasks; automatic parallelization is quick and easy but limited to certain patterns (e.g., loops over independent iterations).
* Speedup and scalability. Let T1 be the time to execute a program on a single core and Tp the time on p cores; the speedup due to parallel execution is T1/Tp. If alpha is the fraction of work in the program that can be parallelized, then speedup = 1/(alpha/p + (1 - alpha)). For a large number of processors (p tends to infinity), the maximum speedup approaches 1/(1 - alpha). This is called Amdahl's law; a worked example follows.
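A worked example with assumed numbers: if alpha = 0.95 and p = 8, then speedup = 1/(0.95/8 + 0.05) = 1/0.16875 ≈ 5.9, noticeably short of the ideal 8. Even with infinitely many cores, the speedup never exceeds 1/(1 - 0.95) = 20: the 5% serial fraction caps scalability.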