Performance evaluation: Case studies

* "Open Versus Closed: A Cautionary Tale"
This paper explains the differences between open-loop and closed-loop testing using several measurements in real systems.

* The paper defines a new type of system, in addition to open and closed, called partly open. In a partly open system, a user that has finished service returns to the system with probability p (thinks for a while and then makes another request) or leaves with probability 1-p. In addition to returning users, there are also new arrivals, as in an open system. What type of workload does this model? For example, users sending multiple web requests once they visit a website.

* The paper discusses different ways of generating workloads:
  - Trace based: replay an existing web trace. For example, replay the arrivals of web requests as-is for an open-loop generator, or take the file names from the trace but replay each one only after the previous request has finished plus a think time, for a closed-loop generator.
  - Model based: draw random variables from standard distributions for interarrival time, service time, etc., for the different types of requests.
  Designing a workload generator requires significant domain knowledge of the system, e.g., the types of requests, the parameters of the workload, what to replay from the trace, and so on.

* Scheduling: the paper also considers the problem of scheduling requests, where applications can decide which of the many pending requests to process and in what order (e.g., FCFS, shortest job first, and so on), and examines how open vs. closed loop testing affects the benefits of scheduling.

* Figure 2 compares open vs. closed systems. The load (measured as utilization) is varied and the response time is measured. How is load varied? In open systems, varying lambda translates to varying rho, since throughput = lambda = rho * mu, i.e., rho = lambda / mu. So the graphs show rho (x-axis) vs. response time (y-axis). In closed systems, the think time is varied to vary the utilization rho. (A minimal simulation sketch contrasting the two kinds of load generators appears after this paper's takeaway below.)

* Lessons learnt from the results:
  - Closed systems have much lower response times than open systems, as seen in Fig 2. From Fig 5, we see that as the MPL (the number of users in the closed system) grows, the closed system converges to the open system, but only very slowly. Open systems are also more sensitive to variability in the service demand.
  - Scheduling matters much more for open systems than for closed systems. In open systems, a small request can get stuck behind large ones and inflate the average response time. In closed systems, the set of jobs is limited to begin with: by Little's law, N = X E[T], and since N is fixed and the throughput X is the same once saturation hits, E[T] also stays the same irrespective of the scheduling policy. In open systems, on the other hand, E[T] depends on E[N], and E[N] can vary with the scheduling policy.
  - Partly open systems behave like open systems when the number of requests per session is small, and like closed systems when the number of requests per session is large.

* Takeaway: choosing closed-loop or open-loop testing significantly impacts the conclusions one draws from a performance study. One must carefully study the incoming traffic of a system and decide between open and closed loop testing. If a large number of independent users and requests are expected to arrive at the real system, an open-loop test is preferred. If a smaller set of users makes many requests one after another in a session, closed-loop testing is preferred.
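To make the open vs. closed distinction concrete, below is a minimal simulation sketch (not from the paper) of the two kinds of load generators driving a single FCFS server with exponential service times. The function names, the MEAN_SERVICE constant, and the parameters are invented for illustration; a partly-open generator would additionally inject new sessions and continue each one with probability p.

    import random

    MEAN_SERVICE = 1.0      # assumed mean service time E[S] at the server

    def simulate_open(lam, n_requests, seed=1):
        """Open loop: arrivals are a Poisson process with rate lam,
        completely independent of completions (single FCFS server)."""
        rng = random.Random(seed)
        arrive = free_at = busy = resp = 0.0
        for _ in range(n_requests):
            arrive += rng.expovariate(lam)                # next arrival instant
            service = rng.expovariate(1.0 / MEAN_SERVICE)
            start = max(arrive, free_at)                  # queue if the server is busy
            free_at = start + service
            busy += service
            resp += free_at - arrive
        return resp / n_requests, busy / free_at          # mean response time, utilization

    def simulate_closed(n_users, think, n_requests, seed=1):
        """Closed loop: n_users users; each submits its next request only
        after the previous one completes plus an exponential think time."""
        rng = random.Random(seed)
        submit = [rng.expovariate(1.0 / think) for _ in range(n_users)]
        free_at = busy = resp = 0.0
        for _ in range(n_requests):
            u = min(range(n_users), key=submit.__getitem__)   # earliest pending request
            service = rng.expovariate(1.0 / MEAN_SERVICE)
            start = max(submit[u], free_at)
            free_at = start + service
            busy += service
            resp += free_at - submit[u]
            submit[u] = free_at + rng.expovariate(1.0 / think)  # think, then resubmit
        return resp / n_requests, busy / free_at

    if __name__ == "__main__":
        # At roughly the same utilization, the closed system sees much lower
        # response times because at most n_users requests ever queue up.
        print("open   (lam=0.8):        E[T]=%.2f  rho=%.2f" % simulate_open(0.8, 200_000))
        print("closed (10 users, Z=10): E[T]=%.2f  rho=%.2f" % simulate_closed(10, 10.0, 200_000))

The open generator keeps injecting arrivals no matter how backed up the server is, which is exactly why its response times blow up near saturation, while the closed generator's backlog is bounded by the number of users.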
---------------------------

* "Comparing the Performance of Web Server Architectures"
This paper compares the performance of several web servers with different architectures. It can serve as an example of how one goes about methodically measuring and comparing the performance of systems.

* The various server architectures discussed in the paper:
  - MT (multithreaded) or MP (multiprocess) servers: one process or thread per connection handles each request. The Knot server in the paper uses lightweight user-level threads, with a one-thread-per-connection model.
  - Single Process Event Driven (SPED) servers: one process only, with event-driven I/O used to handle multiple sockets. However, such servers can block on disk I/O. Two alternatives: Symmetric Multi-Process Event Driven (SYMPED), proposed in this paper, where multiple SPED servers are run and the OS context switches when one blocks; or Asymmetric Multi-Process Event Driven (AMPED), where one main process handles non-blocking event-driven I/O and blocking file I/O is handed over to worker threads/processes. The "userver" used by the authors is a SYMPED server.
  - Staged Event Driven Architecture (SEDA), where the processing of a request is divided into stages, and each stage is handled by a pool of threads that can block on I/O. The advantage of this architecture is that the number of threads assigned to each stage can vary depending on how fast or slow the stage is. The Haboob server used by the authors is in this category.

* What load generator is used? A partly-open, trace-based one.

* For each server, the authors improve the implementation to make all servers comparable. They also vary a number of parameters and run performance tests to "tune" each server, i.e., to identify the combination of parameters that gives the best performance. The parameters they tune include the maximum number of connections, the number of worker threads, the number of server processes, and so on.

* The Knot server is based on a user-level threading library; read this part to understand how user-level threading is implemented. All network sockets are set to non-blocking mode. If a system call could potentially block, the user-level scheduler runs another thread and adds the socket to the list of sockets being monitored by an event-driven system call like poll. Once poll indicates that the socket is ready, the waiting thread is scheduled again by the library. Blocking disk I/O operations are handled by separate kernel-level worker threads. Knot also has a provision for an application-level cache of recently accessed files.

* Knot was also modified to support the sendfile system call within the user-level threading library. The sendfile system call copies a file from the disk buffer cache to the socket buffers without copying the data to user space and back. The other servers already use sendfile to read and send files, so Knot was upgraded as well (a minimal sketch of the non-blocking poll + sendfile pattern appears after the Knot tuning results below). The authors had multiple versions of the server depending on whether sendfile was used in blocking or non-blocking mode; we skip these details here.

* Multiple versions of Knot are then tested by varying the following parameters: the number of user-level threads, the number of worker processes for disk I/O, and the cache size (for the caching version). Takeaways from the results: too few user-level threads is not good; a minimum number of user-level threads is needed to support the incoming requests, since each outstanding request in the system needs a dedicated thread. It is also important to match the number of worker threads to the number of user threads. Worker threads are needed to handle the disk I/O requests coming in from user threads, and having too few worker threads increases the number of requests waiting in queues, thereby increasing the overhead of event-driven system calls like poll that must iterate over all outstanding requests.
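Since the event-driven servers (SPED/SYMPED, and Knot after the sendfile change) are built around the same primitives, here is a minimal sketch of the non-blocking poll + sendfile pattern referred to above. This is not the userver or Knot code: the file name, port, and bare-bones HTTP handling are made up, and request parsing, error handling, and partial reads/writes (a non-blocking sendfile may send fewer bytes than asked) are ignored.

    import os
    import select
    import socket

    DOC = "index.html"                            # file to serve (made-up name)
    PORT = 8080

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", PORT))
    listener.listen(128)
    listener.setblocking(False)                   # never block in accept()

    poller = select.poll()
    poller.register(listener, select.POLLIN)
    conns = {}                                    # fd -> connected socket

    while True:
        for fd, event in poller.poll():           # wait until some socket is ready
            if fd == listener.fileno():
                conn, _ = listener.accept()       # poll said a connection is waiting
                conn.setblocking(False)
                conns[conn.fileno()] = conn
                poller.register(conn, select.POLLIN)
            else:
                conn = conns[fd]
                data = conn.recv(4096)            # read (and ignore) the HTTP request
                if data:
                    conn.send(b"HTTP/1.0 200 OK\r\n\r\n")
                    with open(DOC, "rb") as f:
                        # sendfile: the kernel copies from the page cache to the
                        # socket buffer, with no round trip through user space
                        os.sendfile(conn.fileno(), f.fileno(), 0,
                                    os.path.getsize(DOC))
                poller.unregister(fd)
                conn.close()
                del conns[fd]

A SYMPED server runs several such loops as separate processes sharing one listening socket, so the OS can switch to another process when one blocks on disk I/O; Knot instead hides this loop inside its user-level thread scheduler.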
* The userver is a SYMPED server. The authors modify the server design to add some sharing between the SPED processes. For example, the multiple processes doing event-driven I/O are forked after the listening socket is set up, so that they can all get requests from the same listening socket. Otherwise, if each process had its own listening socket, we would have multiple port numbers for the same server. All processes also share the list of open files and recent HTTP headers via a shared memory region (obtained via mmap).

* The userver is also tuned by varying the maximum number of connections and the number of processes. The results show that more than one process is needed to ensure that not all of them are blocked on I/O at the same time. When using non-blocking sockets for network I/O, the userver needs only about 3-4 processes for optimal performance. When using blocking network sockets, many more processes (a few hundred) are needed to reach maximum performance. The larger the maximum number of connections, the more processes are required for optimal performance.

* WatPipe is a SEDA-style server with multiple stages. The last stage, which writes the reply back, is the most intensive, so performance is measured while tuning the maximum number of connections and the number of threads in this last writing stage. The conclusions are similar.

* Finally, all the servers are compared against each other, and the best configuration of each performs similarly. However, among the best configurations, the Knot server's performance is the lowest. What is the reason? The authors perform a detailed profiling of the system (Table 1) using OProfile, to see what fraction of time is spent in which function.

* How do profilers (gprof, OProfile, etc.) work? A profiler samples the execution of your code many times and checks which function is currently running. Based on where the code is found most often, it estimates which functions take the most time. Profiling is a good exercise to understand why your code performs the way it does. (A toy sketch of this sampling idea appears at the end of these notes.)

* What did the profiling exercise tell us about the various servers? The Knot server had a higher user-level thread management overhead due to context switching between user threads. The other servers had a higher event-handling I/O overhead, but it was lower than Knot's thread management overhead, so they performed slightly better than Knot. All other functions were comparable across servers.

* Takeaway: tune your system, and profile it to understand the performance bottlenecks.

* Note that any system will eventually hit a bottleneck at sufficiently high load. The job of profiling is not always to eliminate bottlenecks. If some bottlenecks are due to trivial reasons and can be fixed by modifying the code, that is good. Other bottlenecks are fundamental to the system design and cannot be eliminated. However, profiling still helps us convince ourselves that the bottleneck is indeed justified.
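To illustrate the sampling idea behind profilers, here is a toy, Unix-only sketch: a CPU-time interval timer periodically interrupts the program, the handler records which function was executing, and the final tally shows where samples concentrate. Real profilers such as gprof and OProfile sample the program counter with compiler, kernel, or hardware-counter support rather than a signal handler inside the program; the workload functions busy_cpu and busy_other are made up for illustration.

    import collections
    import math
    import signal

    samples = collections.Counter()

    def on_sample(signum, frame):
        # The interrupted frame tells us which function was running when the
        # timer fired; charge one sample of CPU time to that function.
        samples[frame.f_code.co_name] += 1

    def busy_cpu(n):
        total = 0.0
        for i in range(n):
            total += math.sqrt(i)
        return total

    def busy_other(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    if __name__ == "__main__":
        signal.signal(signal.SIGPROF, on_sample)
        signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)   # sample every ~1 ms of CPU time
        for _ in range(50):
            busy_cpu(200_000)
            busy_other(100_000)
        signal.setitimer(signal.ITIMER_PROF, 0, 0)           # stop sampling
        total = sum(samples.values())
        for name, count in samples.most_common():
            print("%-12s %5.1f%% of samples" % (name, 100.0 * count / total))

The output is only a statistical estimate: functions that hold the CPU longer accumulate more samples, which is exactly the kind of per-function time breakdown the paper reports in Table 1.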