Improving Cache Performance

* What does the memory access time depend on?
  - Some factors are beyond the control of the application developer, such as cache and memory access bandwidth (which depend on the size and speed of the memory bus), system bus bandwidth, and so on.
  - However, some factors are under the control of the programmer. We can influence data placement (place data in such a way that spatial locality of reference is high) or the way code is written (group code that works on the same data together).
* We will now discuss several techniques that an application developer can employ to improve memory access time by better utilizing the cache.
* Loop interchange. Example of multiplying matrices: when multiplying matrices A and B, A is accessed row-wise (greater probability of finding the next few elements in cache) while B is accessed column-wise (greater stride, so a lower cache hit rate). Instead, we could transpose B and access both matrices row-wise. Of course, the transpose operation itself leads to extra memory accesses, but this is compensated by the savings due to the higher cache hit rate (see the first sketch below).
* Loop fusion: if two separate loops iterate through the same array, merge them so that the array is read into cache once and operated upon many times.
* Loop tiling: Going back to matrix multiplication with transpose, we read one full row of A, and then iterate over all rows of B transpose. By the time we are done with a pass over all rows of B, we need to refetch the rows of B for the iteration with the second row of A. To avoid this, we can divide A and B into smaller blocks or tiles that fit in the cache. Once a sub-matrix is brought into cache, we do all the computations needed on this data before fetching the other sub-matrices (see the tiled sketch below).
* Note that many such optimizations are already performed by modern compilers when compiling code. Still, an application developer may be able to identify many more opportunities than a compiler can.
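* A minimal sketch of the transpose idea above, assuming square N x N matrices of doubles (the size N, the function name, and the static scratch matrix are illustrative assumptions, not from the notes):

      #define N 512

      /* C = A * B, with B transposed first so that both operands are walked
       * row-wise in the innermost loop (small stride, good spatial locality). */
      void matmul_transposed(double A[N][N], double B[N][N], double C[N][N])
      {
          static double Bt[N][N];              /* scratch copy holding B transposed */

          for (int i = 0; i < N; i++)          /* extra pass over B; its cost is */
              for (int j = 0; j < N; j++)      /* amortized by the row-wise reads below */
                  Bt[j][i] = B[i][j];

          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  double sum = 0.0;
                  for (int k = 0; k < N; k++)
                      sum += A[i][k] * Bt[j][k];   /* both A and Bt accessed row-wise */
                  C[i][j] = sum;
              }
      }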
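* And a sketch of loop tiling for the same computation, assuming C is zero-initialized by the caller and that N is a multiple of an illustrative tile size, chosen so that three TILE x TILE tiles of doubles fit comfortably in the cache:

      #define N    512
      #define TILE 32

      /* C += A * B, computed tile by tile so that the three TILE x TILE
       * sub-matrices currently in use stay resident in the cache. */
      void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
      {
          for (int ii = 0; ii < N; ii += TILE)
              for (int jj = 0; jj < N; jj += TILE)
                  for (int kk = 0; kk < N; kk += TILE)
                      /* all work below touches only the three current tiles */
                      for (int i = ii; i < ii + TILE; i++)
                          for (int j = jj; j < jj + TILE; j++) {
                              double sum = C[i][j];
                              for (int k = kk; k < kk + TILE; k++)
                                  sum += A[i][k] * B[k][j];
                              C[i][j] = sum;
                          }
      }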
* Next, how should data be placed in memory to improve cache performance?
  - Try to size data structures to fit in a cache line as far as possible - a slight overflow past a cache line is a very bad thing.
  - Group variables that are accessed together in a structure. Access elements in the order in which they are defined in the structure.
  - Move the most critical element of the structure to the head of the structure.
  - Split a structure into smaller pieces if required, to group the elements that are accessed together.
  - Align structures at 64-byte boundaries, using C library functions (e.g., posix_memalign) for dynamic allocations or compiler hints for static allocations.
  - If lists, arrays, or other data structures get large enough that multiple entries map to the same cache slot, then conflict misses may occur.
* Techniques to improve the performance of the instruction cache:
  - Reduce code size so that it fits in the cache. Some compiler optimizations, e.g., loop tiling or inlining of function code, however increase code size, so the two concerns must be balanced.
  - Code execution should be as linear as possible so that cache lines can be prefetched.
  - When branches happen, the hardware has branch prediction units built into it. In addition, the programmer can give the compiler hints on which branch is more likely, so that the compiler can reorder blocks of code suitably. E.g., if (likely(x > 0)).
  - The start of functions, loops, and blocks of code reached via jumps can be aligned to the start of a cache block, so that prefetching will be very effective.
  - Plenty of other compiler-level optimizations can be turned on via options to the compiler.
* Optimizations for multithreaded applications:
  - Separate read-only variables from read-write variables.
  - To the extent possible, threads should access separate data. Try to separate data as far as possible into per-thread structures or thread-local data to avoid true sharing. Further, place per-thread data on separate cache lines to avoid false sharing (see the sketch below).
  - Use locks, atomic instructions, and other such expensive operations that generate cache coherency traffic carefully.
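* A minimal sketch of the per-thread-data advice above, assuming a 64-byte cache line, POSIX threads, and GCC-style __attribute__((aligned)); the thread count and the counter workload are illustrative:

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS   4
      #define CACHE_LINE 64

      /* Each counter is padded and aligned so it occupies its own cache line;
       * threads incrementing their own counter then cause no false sharing. */
      struct padded_counter {
          long count;
          char pad[CACHE_LINE - sizeof(long)];
      } __attribute__((aligned(CACHE_LINE)));

      static struct padded_counter counters[NTHREADS];

      static void *worker(void *arg)
      {
          struct padded_counter *c = arg;      /* thread-private data: no true sharing */
          for (long i = 0; i < 10000000; i++)
              c->count++;                      /* stays in this core's cache line */
          return NULL;
      }

      int main(void)
      {
          pthread_t tid[NTHREADS];
          for (int i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, worker, &counters[i]);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(tid[i], NULL);
          for (int i = 0; i < NTHREADS; i++)
              printf("thread %d: %ld\n", i, counters[i].count);
          return 0;
      }

  The same 64-byte alignment can be obtained for dynamically allocated per-thread data with posix_memalign, as mentioned in the data-placement list above.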