Performance Overhead of Synchronization

* The reference [Everything about Synchronization] measures the performance overhead of various synchronization mechanisms across several platforms.
* Platforms considered: Opteron (multi-socket, directory-based cache coherence), Xeon (multi-socket, broadcast-based cache coherence), Niagara (uniform single-socket), Tilera (non-uniform single-socket).
* Table 2 shows the time taken by loads, stores, atomic operations, etc., depending on the state of the cache line and on where the previous copy of the cache line resides (same socket, different socket, etc.). Stores across socket boundaries take much longer, especially on the broadcast-based Xeon. Cross-socket atomic operations cost about the same as cross-socket loads/stores, but within a single socket, atomic operations are noticeably more expensive than the much faster loads/stores.
* The paper notes a peculiarity of the Opteron: if the directory for a cache line resides on a remote node, operations on it are expensive even when they happen within a socket. Xeon therefore has much more locality than Opteron, and the Opteron numbers in Table 2 are best-case scenarios for same-socket access.
* Figure 4 shows the throughput of atomic operations. On multi-socket systems, throughput falls as the number of cores increases, due to cross-socket contention and increased cache-coherence traffic. On single-socket systems, throughput starts lower than on the multi-socket machines, increases with the number of threads, and eventually flattens out. So atomic operations are less of a performance issue on single-socket multicores. (A minimal benchmark sketch of this kind of measurement appears after this list.)
* Figures 5 and 7 show similar throughput graphs for locking operations. Figure 5 uses one lock shared by many threads (high contention), while Figure 7 uses many locks spread over many threads (low contention per lock). We see that:
  - For single-socket systems, performance scales up to a point as more threads are added, while for multi-socket systems, performance degrades sharply (as in Figure 4).
  - No single locking algorithm is best across all platforms. Even on a single platform, the best lock changes with the contention level.
  - Under low contention, simple locks are better; under high contention, queue-based locks such as MCS and CLH are better. (A sketch of an MCS lock follows the atomic-operation example below.)
  - Xeon has high locality (within-socket access is much faster than cross-socket access), so hierarchical locks do much better on it.
* Figure 12: performance of the memcached key-value store with different types of locks; performance is sensitive to the type of lock used.
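
The kind of measurement behind Figure 4 can be approximated with a small microbenchmark. The sketch below is not the paper's harness: it simply has a fixed number of threads perform atomic fetch-and-add on one shared counter and reports aggregate throughput; the thread count and duration are illustrative parameters, and a real harness would also pin threads to cores and pad the per-thread counters.

```c
/* Minimal sketch (not the paper's benchmark): NTHREADS threads hammer one
 * shared counter with atomic fetch-and-add; we report aggregate ops/sec. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8        /* illustrative; vary to trace the scaling curve */
#define DURATION_SEC 2

static atomic_ulong counter = 0;     /* shared cache line all threads contend on */
static atomic_int stop = 0;
static unsigned long ops[NTHREADS];  /* per-thread counts (pad these in a real harness) */

static void *worker(void *arg) {
    int id = (int)(long)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        ops[id]++;
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    sleep(DURATION_SEC);             /* let the threads run */
    atomic_store(&stop, 1);

    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += ops[i];
    }
    printf("%d threads: %.1f M atomic ops/sec\n",
           NTHREADS, total / (double)DURATION_SEC / 1e6);
    return 0;
}
```

Sweeping NTHREADS (and whether the threads land on one socket or several) reproduces the qualitative shape described above: throughput collapses once the counter's cache line starts bouncing across sockets.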
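
The MCS lock mentioned above keeps contention local by having each waiter spin on its own queue node rather than on the shared lock word. The following is a textbook sketch of the algorithm using C11 atomics, not the code evaluated in the paper.

```c
/* Textbook MCS queue-lock sketch (C11 atomics). Each waiter spins on its own
 * node, so contended spinning stays in the waiter's cache until handoff. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* true while this waiter must spin */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;      /* last node in the queue, NULL if free */
} mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Atomically append ourselves to the tail of the queue. */
    mcs_node_t *prev = atomic_exchange_explicit(&lock->tail, me,
                                                memory_order_acq_rel);
    if (prev != NULL) {
        /* Queue was non-empty: link in behind prev and spin on our own flag. */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                        /* local spin: no remote traffic until handoff */
    }
}

static void mcs_unlock(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to NULL. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;                  /* lock released, queue is empty */
        /* A successor is mid-enqueue; wait for it to link itself in. */
        do {
            succ = atomic_load_explicit(&me->next, memory_order_acquire);
        } while (succ == NULL);
    }
    /* Hand the lock to the successor by clearing its flag. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Note that even an uncontended acquire/release pair costs two atomic operations on the shared tail pointer, which is consistent with the observation above that simple locks win under low contention while queue-based locks like MCS win under high contention.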