Multicore scalability of Applications on Linux

* The reference [Mosbench] studies the scalability of several real-world applications on Linux (stock as well as a modified kernel) to see how application performance scales with an increasing number of cores.
* The applications considered include a mail server, a web server, a key-value store, a database, a kernel build, map-reduce, and so on. All of them either maintain a pool of threads/processes or spawn one process/thread per connection, so that they can make use of multiple cores.
* For each application, the per-core throughput is measured as the number of cores increases. The high-level conclusion is that per-core performance drops as cores are added, more so on stock Linux than on their modified kernel; even the modified kernel fails to scale perfectly.
* One interesting point about the test setup: the applications were modified so that they do not bottleneck on other resources and remain CPU-bound. For example, disk I/O is avoided and a temporary RAM-based filesystem is used instead. Why? If the bottleneck were not the CPU, increasing the number of cores could not be expected to improve performance, and per-core throughput would fall no matter what. Therefore, to test multicore scalability, the workload is kept CPU-bound as far as possible.
* Overview of results: initially, none of the applications scaled well. Some kernel bottlenecks were identified and fixed, which in turn exposed other bottlenecks. In the end, the remaining scalability limit was traced either to something fundamental in the application's design or to another hardware component, and nothing more could be done to improve scalability.
* Exim mail server: some bottlenecks, such as contention on the mount table, were eliminated by partitioning the shared data into per-core components (a minimal sketch of this partitioning idea appears after this list). Other issues remain: when multiple processes/threads need to write mail into the same spool directory, each must acquire a per-directory lock, and contention on this lock limits performance.
* Memcached and Apache: after the kernel bottlenecks were fixed, the final limit to scaling was the network card. The card could not efficiently split incoming traffic across a large number of queues, so the application did not receive enough traffic to keep a large number of cores busy.
* PostgreSQL database: some optimizations were made to the application to eliminate locking. The final bottleneck turned out to be an application spinlock protecting the root of the database index, which all threads must contend for and which cannot be eliminated.
* The psearchy indexing application reads many files and indexes them. Reading was done via mmap, so concurrent mmap calls from threads sharing one address space contended in the kernel. This was eliminated by running a separate application process on each core, so that each worker has its own address space (see the second sketch below). After this, the bottleneck was cache performance: the working set was too large for the caches, and cache misses dominated.
* Metis map-reduce: the application was optimized, for example by using huge pages to reduce the number of page faults and TLB misses (see the third sketch below). The final bottleneck was DRAM bandwidth.
* Key lesson from the paper: understand how an application can be optimized to work well in a multicore setting, which scalability issues are fixable, and which are not.
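Below is a minimal pthreads sketch of the per-core partitioning idea: a shared counter protected by one mutex versus an array of per-thread, cache-line-padded slots that are summed at the end. The names (worker, percpu, NTHREADS) are made up for illustration; this is not the kernel's or Exim's actual code, only the same design pattern of giving each core its own copy of a hot data structure.

```c
/* Sketch: contended shared counter vs. partitioned per-thread counters.
 * Build with: cc -O2 -pthread partition.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    1000000

/* Shared version: every increment serializes on one mutex. */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count;

/* Partitioned version: one slot per thread, padded to a cache line
 * so that updates by different threads do not falsely share a line. */
struct padded { long count; char pad[64 - sizeof(long)]; };
static struct padded percpu[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;

    /* Contended path: all threads fight over shared_lock. */
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&shared_lock);
        shared_count++;
        pthread_mutex_unlock(&shared_lock);
    }

    /* Partitioned path: each thread touches only its own slot. */
    for (int i = 0; i < ITERS; i++)
        percpu[id].count++;

    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += percpu[i].count;   /* combine per-thread results once */
    printf("shared=%ld partitioned=%ld\n", shared_count, total);
    return 0;
}
```

Timing the two loops separately as the thread count grows typically shows the mutex-protected path slowing down while the partitioned path keeps scaling; this is the pattern behind fixes such as the per-core mount table.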
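The next sketch shows the structural change made for psearchy: forking one worker process per core, so that each worker mmaps files into its own address space instead of all threads contending inside a single shared one. The core count, file handling, and the trivial "indexing" loop are placeholders for illustration, not the actual psearchy code.

```c
/* Sketch: one worker process per core, each mmap-ing and scanning its own
 * share of the input files given on the command line. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static long index_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    fstat(fd, &st);
    /* mmap in a child process: no other threads share this address space,
     * so there is no cross-core contention on its mmap bookkeeping. */
    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    long tokens = 0;
    if (buf != MAP_FAILED) {
        for (off_t i = 0; i < st.st_size; i++)    /* stand-in for real indexing */
            if (buf[i] == ' ' || buf[i] == '\n')
                tokens++;
        munmap(buf, st.st_size);
    }
    close(fd);
    return tokens;
}

int main(int argc, char **argv) {
    int ncores = 4;   /* assumed core count; sysconf(_SC_NPROCESSORS_ONLN) in real code */
    for (int c = 0; c < ncores; c++) {
        if (fork() == 0) {                        /* child = one worker per core */
            long total = 0;
            for (int i = 1 + c; i < argc; i += ncores)   /* every ncores-th file */
                total += index_file(argv[i]);
            printf("worker %d indexed %ld tokens\n", c, total);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)                        /* reap all workers */
        ;
    return 0;
}
```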
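Finally, a sketch of the huge-page idea mentioned for Metis, assuming anonymous memory mapped with Linux's MAP_HUGETLB flag. This requires huge pages to be reserved in advance (e.g. via /proc/sys/vm/nr_hugepages), so the sketch falls back to normal 4 KiB pages if the huge-page mapping fails; the buffer size and fallback are illustrative, not taken from the paper.

```c
/* Sketch: back a large working buffer with 2 MiB huge pages so that touching
 * it causes far fewer page faults and it needs far fewer TLB entries. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t len = 64UL << 20;    /* 64 MiB, a multiple of the 2 MiB huge-page size */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* likely: no huge pages reserved */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* fall back to 4 KiB pages */
    }
    if (buf == MAP_FAILED) return 1;

    memset(buf, 0, len);                          /* touch the whole region once */
    printf("mapped %zu MiB at %p\n", len >> 20, buf);
    munmap(buf, len);
    return 0;
}
```

With 2 MiB pages, first-touching the buffer takes one fault per 2 MiB instead of one per 4 KiB, and the mapping fits in far fewer TLB entries afterwards, which is the kind of per-page overhead the Metis optimization targets.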