Understanding cache performance: experimental results

* Understanding some experimental results from the [Drepper] reference. The paper runs a series of tests that access a linked list sequentially or randomly. The experiments are very simple and the results shed light on how caches work. Key takeaway points are below; a sketch of the pointer-chasing setup follows the list.

(single-threaded results)

- Fig 3.10 shows how memory access time (cycles per linked-list element) increases as the size of the linked list (the working set size) grows beyond the cache sizes. Also note that the CPU does prefetching, so even when the working set is larger than the cache and many accesses miss, memory access time stays low, because the next cache line can be prefetched while the previous one is being accessed.

- Fig 3.11 shows the impact of increasing the size of a linked-list element, i.e., the stride of the memory accesses. As the stride increases, successive accesses fall on different cache lines, so the prefetcher cannot keep up. What is more, as the stride grows, the list crosses page boundaries, and hardware prefetching cannot reach beyond the current page. Even if all pages are in memory, crossing a page boundary has two effects: the hardware won't try to prefetch beyond a page, and a TLB miss may occur.

- Fig 3.12 isolates the effect of the TLB. In both cases considered, every element is on a new cache line. However, when these lines are on different pages, a TLB miss also needs to be factored in once the TLB capacity is exceeded. Note that TLB entries cannot be prefetched like cache lines, because a valid page-table mapping may not even exist for the next page.

- Fig 3.13 shows the impact of computation: simply following the list vs. incrementing an element vs. adding the next element's value into the current one ("addnext"). Addnext forces a prefetch, so it performs as well as follow when everything fits in the caches. Once the cache size is exceeded, every cache eviction also needs a write back to memory, so system-bus usage is doubled, roughly doubling the access time compared to only following/reading the list. (See the follow/inc/addnext sketch after this list.)

- Fig 3.14: a better processor with more cache gives better performance.

- Fig 3.15: random-access performance is much worse than sequential performance. In addition to cache misses, TLB misses also come into play.

(multithreaded results)

- Fig 3.19: sequential read access; no thread modifies anything, so there are no cache invalidations, yet memory access time still increases slightly with more threads, likely due to interference from the other threads on the system bus when accessing memory.

- Fig 3.20: multiple threads increment elements sequentially, with no synchronization. Now the system bus carries traffic both to fetch lines into the caches and to write them back. (See the pthreads sketch after this list.)

- Fig 3.21: a random element is accessed and added to the previous element. Cycles per list element increase drastically, due to cache misses, invalidations, and so on. In fact, adding more threads provides little benefit at all in such cases.

- Fig 3.22 shows the speedup from adding more threads. A speedup of 4x with 4 threads is only possible when the workload fits in the cache. Even this is not realistic, since in real life threads would incur even more overhead for synchronization etc.
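
To make the setup concrete, here is a minimal sketch of the kind of pointer-chasing benchmark behind Figs 3.10, 3.11, and 3.15: a cyclic linked list whose element size (and therefore stride) is set by NPAD, walked either in array order or in a random permutation. The struct l / NPAD element layout is from the paper; the harness, names (build_list, ws, iters), and sizes are illustrative assumptions, not the paper's exact code.

```c
/* Sketch of a pointer-chasing benchmark in the style of [Drepper].
 * Build with something like: gcc -O2 -DNPAD=7 walk.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef NPAD
#define NPAD 7                    /* 7 * 8-byte pads + pointer = 64-byte element */
#endif

struct l {
    struct l *n;                  /* next element in the list */
    long pad[NPAD];               /* padding; sets the stride between accesses */
};

/* Link the elements of array 'a' into a cycle, either in array order
 * (sequential access) or in a random permutation (random access). */
static struct l *build_list(struct l *a, size_t nelem, int sequential)
{
    size_t *order = malloc(nelem * sizeof *order);
    for (size_t i = 0; i < nelem; ++i)
        order[i] = i;
    if (!sequential)                                   /* Fisher-Yates shuffle */
        for (size_t i = nelem - 1; i > 0; --i) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
    for (size_t i = 0; i + 1 < nelem; ++i)
        a[order[i]].n = &a[order[i + 1]];
    a[order[nelem - 1]].n = &a[order[0]];              /* close the cycle */
    struct l *head = &a[order[0]];
    free(order);
    return head;
}

int main(void)
{
    size_t ws = 1u << 24;                    /* working set size: 16 MiB; vary this */
    size_t nelem = ws / sizeof(struct l);
    struct l *a = calloc(nelem, sizeof *a);
    struct l *p = build_list(a, nelem, /*sequential=*/1);

    size_t iters = 100u * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; ++i)
        p = p->n;                            /* the "follow" pattern: read-only */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* print p so the traversal loop is not optimized away */
    printf("%.2f ns per element (p=%p)\n", ns / iters, (void *)p);
    free(a);
    return 0;
}
```

Varying ws, NPAD, and the sequential/random flag is what traces out the single-threaded curves: the knees in access time appear where the working set crosses the L1d, L2, and TLB capacities.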
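
The three Fig 3.13 access patterns can be written as inner loops over the same cyclic list. This is a sketch reusing the element layout above; the function names are illustrative, not the paper's.

```c
#include <stddef.h>

struct l { struct l *n; long pad[7]; };      /* same element layout as above */

/* "Follow": read-only pointer chasing. */
static struct l *walk_follow(struct l *p, size_t iters)
{
    while (iters--)
        p = p->n;
    return p;
}

/* "Inc": increment a payload word of each element. Every evicted cache line
 * is now dirty and must be written back, so bus traffic roughly doubles once
 * the working set no longer fits in the cache. */
static struct l *walk_inc(struct l *p, size_t iters)
{
    while (iters--) {
        p->pad[0] += 1;
        p = p->n;
    }
    return p;
}

/* "Addnext": add the next element's payload to the current one. The read of
 * p->n->pad[0] pulls the next cache line in early (a forced prefetch), which
 * is why it keeps up with "follow" while everything fits in the cache. */
static struct l *walk_addnext(struct l *p, size_t iters)
{
    while (iters--) {
        p->pad[0] += p->n->pad[0];
        p = p->n;
    }
    return p;
}
```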
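
For the multithreaded figures, a rough pthreads sketch of the setup: each thread works on its own private list, so there is no sharing, yet the threads still compete for the shared bus and memory controller. Thread count, sizes, and names are assumptions, not the paper's harness.

```c
/* Build with: gcc -O2 -pthread mt_walk.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 4
#define NELEM    (1u << 20)              /* per-thread working set, in elements */
#define ITERS    (50u * 1000 * 1000)

struct l { struct l *n; long pad[7]; };  /* same element layout as above */

static void *worker(void *arg)
{
    struct l *a = arg;
    for (size_t i = 0; i + 1 < NELEM; ++i)   /* link a private sequential cycle */
        a[i].n = &a[i + 1];
    a[NELEM - 1].n = &a[0];

    struct l *p = a;
    for (size_t i = 0; i < ITERS; ++i) {
        p->pad[0] += 1;                      /* "inc" pattern: reads + write-backs */
        p = p->n;
    }
    return p;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < NTHREADS; ++t)
        pthread_create(&tid[t], NULL, worker, calloc(NELEM, sizeof(struct l)));
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%d threads: %.2f ns per element per thread\n", NTHREADS, ns / ITERS);
    return 0;
}
```

Even though the lists are private (no invalidations), per-element cost grows with NTHREADS once the combined working set exceeds the caches, which is the effect the Fig 3.19/3.20 curves illustrate.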