Performance evaluation: basics, open systems.

* A computer system (e.g., a web server) provides service to a set of users. In general, we are interested in understanding the performance of the system: what is its maximum performance, is it performing as well as it should, and so on. The performance of such systems can be modeled using queueing theory. We will now study how to measure and model the performance of simple systems.

* Parameters of the system (inputs known to us):
  - The incoming traffic / requests into the system. For example, we can model this using the average arrival rate of requests lambda, or the mean inter-arrival time (IAT) 1/lambda.
  - The rate at which the system is able to handle the input traffic. For example, we may know the average service rate mu req/s, or the service demand of each request D = 1/mu.

* Simple queueing theory models assume a single server, which is only an approximation for real systems. A real system has many components (CPU, disk, memory), and a request places demands on several of them. As a first approximation, we will model the complex system as a single server with a single service demand (e.g., the time taken to process a request at the slowest component).

* The performance metrics we wish to measure (outputs from analysis/measurements):
  - Average response time or turnaround time T of a request entering the system.
  - Average throughput X, i.e., the number of requests successfully completed per second.
  - Utilization rho, i.e., the fraction of time the system/server is busy. If a system has many components, the utilization of the main resource consumed by a request (e.g., the CPU for a CPU-intensive workload) is the one that matters.
  - Number of requests in the system, N.
  - Time spent waiting in the queue, TQ, and number of requests waiting in the queue, NQ.

* How are these quantities related? One basic relationship is the utilization law:
  Throughput X = (rate of request completion when the server is busy) * (probability that the server is busy)
  X = rho * mu

* Queueing theory connects these inputs to these outputs under various models of systems.

* What if lambda > mu? The queue would build up indefinitely, and the system would not be stable. So we always assume lambda < mu.

* What if arrivals were uniformly spaced and service demands constant? Then each arrival would be served immediately, response time would be constant, and there would be no waiting. The problem is interesting precisely because arrivals are non-deterministic, which leads to queueing. The most common model assumes Poisson arrivals and exponentially distributed service times. We won't go into these details in this course.

* Open systems vs. closed systems. In an open system, requests keep arriving at a certain rate, irrespective of whether previous requests have been served. In a closed system, there is a fixed number of users N, also called the multiprogramming level (MPL). A user issues a new request only when the previous one completes, so the rate at which traffic comes in is determined by the rate at which requests finish.

* Let us consider an open system of arrivals.

* Little's Law for open systems: E[N] = lambda E[T]. It holds without any assumptions about the arrival or service processes, and it also applies to parts of a system.
  - Considering only the queue as the system: E[NQ] = lambda E[TQ].
  - Considering only the server as the system: a request spends 1/mu time at the server on average, so E[N_server] = lambda * (1/mu) = lambda/mu = rho.

* Throughput X = (rate of request completion when the server is busy) * (probability that the server is busy):
  X = rho * mu = (lambda/mu) * mu = lambda.
  That is, the throughput of an open system is simply the rate at which requests arrive, because in steady state every request that comes in must eventually leave. (A small simulation below illustrates these relationships.)
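To make these relationships concrete, here is a minimal sketch (not part of the course material) of a single-server FCFS queue simulation in Python, assuming Poisson arrivals at rate lambda and exponentially distributed service times at rate mu, as in the common model mentioned above. The function name and the chosen rates are illustrative. With lambda = 8 and mu = 10, the measured utilization should come out close to lambda/mu = 0.8, the measured throughput close to lambda, and the time-average number in the system close to lambda * E[T], as Little's Law predicts.

```python
import random

def simulate_single_server(lam, mu, num_requests=200_000, seed=1):
    """Simulate a single-server FCFS queue with Poisson arrivals (rate lam, req/s)
    and exponentially distributed service times (rate mu, req/s)."""
    rng = random.Random(seed)
    arrival = 0.0        # arrival time of the current request
    depart = 0.0         # departure time of the previous request
    busy_time = 0.0      # total time the server spends serving
    total_resp = 0.0     # sum of per-request response times
    for _ in range(num_requests):
        arrival += rng.expovariate(lam)   # next inter-arrival time ~ Exp(lam)
        service = rng.expovariate(mu)     # service demand ~ Exp(mu)
        start = max(arrival, depart)      # wait if the server is still busy
        depart = start + service
        busy_time += service
        total_resp += depart - arrival    # response time = waiting + service
    total_time = depart
    rho = busy_time / total_time          # measured utilization
    X = num_requests / total_time         # measured throughput
    T = total_resp / num_requests         # measured mean response time
    N = total_resp / total_time           # time-average number in the system
    return rho, X, T, N

if __name__ == "__main__":
    lam, mu = 8.0, 10.0                   # req/s; lam < mu, so the system is stable
    rho, X, T, N = simulate_single_server(lam, mu)
    print(f"utilization    rho = {rho:.3f}  (lambda/mu = {lam / mu:.3f})")
    print(f"throughput     X   = {X:.3f}  (lambda = {lam:.3f})")
    print(f"mean in system N   = {N:.3f}  (Little's Law lambda*T = {lam * T:.3f})")
```

Re-running this with lambda pushed closer to mu should show the mean response time growing sharply, which is the subject of the next bullet.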
* What about response time T? This is a little trickier. Under certain assumptions about the arrival and service processes (the M/M/1 model), T = 1/(mu - lambda), so the response time grows without bound as lambda approaches mu. (See the small calculation after this list.)

* Throughput and response time in an open system are thus not tied to each other: throughput depends only on the arrival rate. If you make the system faster (increase mu), response time decreases but throughput does not change.

* How is mu measured in practice? Increase the incoming rate; the throughput of the system eventually flattens out at mu. This value of throughput is also called the saturation throughput or capacity. At capacity, the system utilization is 1 (100%). In real life, this means that some hardware resource (e.g., the CPU) is fully utilized.

* Open-loop load testing of any system: run multiple experiments with gradually increasing incoming load until the saturation throughput is reached. Measure throughput, response time, and the utilization of the bottleneck resource. You should see the throughput flatten out as the incoming load approaches the maximum capacity mu, and the latency increase sharply as you hit capacity. At saturation (i.e., when the incoming load is close to capacity), the bottleneck resource should be fully utilized.

* Ideally, utilization is proportional to throughput, and the maximum throughput occurs when utilization is 1. However, throughput may also saturate for the wrong reasons, even when the server is not fully utilized (software bottlenecks, bugs in the code), and such issues have to be fixed. When a hardware resource is fully utilized at saturation, you know that your system cannot do any better.

* How to write an open-loop load generator? Set a timer to 1/lambda, and fire a request every time the timer goes off, without waiting for earlier requests to complete. (A minimal sketch follows.)
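As a quick illustration of how latency behaves during an open-loop load test, here is a small sketch that tabulates the M/M/1 predictions for a hypothetical server with capacity mu = 100 req/s as the offered load lambda is swept upward (the capacity, load levels, and function name are made up for illustration):

```python
def mm1_predictions(lam, mu):
    """Predicted steady-state metrics for an open M/M/1 system."""
    rho = lam / mu           # utilization (server-as-system Little's Law)
    X = lam                  # open system: throughput equals the arrival rate
    T = 1.0 / (mu - lam)     # M/M/1 mean response time
    return rho, X, T

mu = 100.0                   # hypothetical capacity of 100 req/s
print(f"{'lambda':>8} {'rho':>6} {'X':>8} {'T (ms)':>8}")
for lam in (10.0, 50.0, 80.0, 90.0, 95.0, 99.0):
    rho, X, T = mm1_predictions(lam, mu)
    print(f"{lam:>8.0f} {rho:>6.2f} {X:>8.0f} {T * 1000:>8.1f}")
```

Throughput tracks lambda and utilization climbs toward 1, while the response time roughly doubles each time the remaining headroom (mu - lambda) is halved, which is why latency shoots up as a load test approaches capacity.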
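Below is a minimal sketch of such an open-loop load generator in Python. It is only illustrative: the target URL, load level, and grace period are made-up values, and a real load generator would use a more efficient I/O mechanism than one thread per request. The key property is that a request is fired every 1/lambda seconds regardless of whether earlier requests have completed.

```python
import threading
import time
import urllib.request

def open_loop_load(url, lam, duration_s):
    """Fire requests at an average rate of lam req/s for duration_s seconds,
    without waiting for earlier requests to finish (open-loop behaviour)."""
    latencies = []
    lock = threading.Lock()

    def fire():
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=30).read()
        except Exception:
            return                            # count only successful completions
        with lock:
            latencies.append(time.monotonic() - start)

    end = time.monotonic() + duration_s
    next_fire = time.monotonic()
    sent = 0
    while next_fire < end:
        threading.Thread(target=fire, daemon=True).start()  # do not block on the reply
        sent += 1
        next_fire += 1.0 / lam                # the timer fires every 1/lambda seconds
        time.sleep(max(0.0, next_fire - time.monotonic()))

    time.sleep(5.0)                           # grace period for in-flight requests
    done = len(latencies)
    print(f"offered load : {sent / duration_s:.1f} req/s")
    print(f"throughput   : {done / duration_s:.1f} req/s")
    if done:
        print(f"mean response time: {1000 * sum(latencies) / done:.1f} ms")

if __name__ == "__main__":
    # Hypothetical target; sweep lam upward across runs until throughput flattens out.
    open_loop_load("http://localhost:8080/", lam=50.0, duration_s=10.0)
```

Sweeping lam upward across runs while also recording the utilization of the bottleneck resource on the server gives exactly the open-loop load test described above.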