Efficient Resource Allocation and Scheduling in Multiprocessors
(Research Summary)

Soumen Chakrabarti
Computer Science Division, U. C. Berkeley
Email: soumen@cs.berkeley.edu

Overview

With the growing application of parallel and distributed computing, effective resource allocation and scheduling become critical to response time and system utilization. These problems arise in many forms: message scheduling in a parallelizing compiler; dynamic processor partitioning for applications with data and task parallelism; disk block allocation in a serverless distributed file system; runtime processor allocation for on-line, irregular tasks; and scheduling of multi-user, resource- and precedence-constrained jobs in multiprocessor databases and operating systems.

Despite the attention this area has received, many problems lack satisfactory solutions. First, existing results can rarely be applied directly to new domains, because the setting often has crucial details not addressed in the classical scheduling literature. Second, the practical situation is usually complex, and it is hard to abstract a model that is both simple and faithful to reality. Third, most of the problems formulated are computationally intractable, so knowledge of complexity theory and approximation is needed to design and analyze algorithms. Finally, it is important to design practical solutions that reflect both the constraints and the simplifications offered by real scenarios.

My dissertation research applies the following skills to efficient system design: modeling system performance and formulating an optimization problem over the model; designing and analyzing approximation algorithms for on-line resource allocation and job scheduling in a variety of realistic settings; and contributing software to compilers for High Performance Fortran (HPF), parallel scientific libraries (ScaLAPACK), runtime systems (Multipol), and parallel applications. My specific accomplishments are summarized in the following paragraphs. I conclude with a review of the choice of problems, plans for future work, and the general direction I wish to pursue.

Message scheduling via compiler analysis

Minimizing communication overhead is essential in parallelizing compilers for languages like HPF. There is a large body of literature on compiler analysis and optimization of the communication generated by loops over distributed arrays. Initial progress was mostly restricted to local optimizations at the level of a single loop nest; these could not detect and eliminate globally redundant communication. As a simple example, a single definition of a distributed array may reach several uses in separate loops; if array index dependences permit, communication for those uses should be performed once after the definition rather than before each use, as sketched below.
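
To make the redundancy-elimination idea concrete, here is a toy sketch in Python over a hypothetical straight-line representation of the program; it is an illustration of the def-use hoisting idea only, not the pHPF algorithm. It contrasts placing a message before every remote use of a distributed array with placing a single message after the reaching definition.

    def naive_placement(stmts, array):
        # One communication before each use of the distributed array.
        return [i for i, (op, a) in enumerate(stmts) if op == "use" and a == array]

    def hoisted_placement(stmts, array):
        # One communication right after each definition that reaches a use;
        # later uses of the same data need no further messages.
        placements, pending_def = [], None
        for i, (op, a) in enumerate(stmts):
            if a != array:
                continue
            if op == "def":
                pending_def = i
            elif op == "use" and pending_def is not None:
                placements.append(pending_def + 1)
                pending_def = None
        return placements

    # One definition of A reaching uses in two separate loops.
    program = [("def", "A"), ("use", "B"), ("use", "A"), ("use", "A")]
    print(naive_placement(program, "A"))    # [2, 3] -- two messages
    print(hoisted_placement(program, "A"))  # [1]    -- a single message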

In collaboration with the HPF group at IBM Research, I developed a new compiler algorithm for communication analysis and optimization that eliminates redundant communication while improving network and cache performance globally, across loops and statements. This significantly improves on recently proposed array dataflow analyses, which use an ``earliest placement'' policy: our algorithm maximally exploits opportunities for combining messages to amortize fixed overhead, and in fact avoids early communication placement when that would lead to large message startup overhead. I implemented the algorithm in IBM's production compiler, pHPF. Although pHPF already generates highly optimized code, the new technique cut communication time by another factor of 2--3, which translated to 20--40% overall savings on many well-known scientific benchmarks.

Runtime systems for processor allocation

Regular data-parallel programs permit extensive and accurate compile-time analysis and optimization. There is, however, a growing body of dynamic and irregular scientific applications where runtime support is needed. For example, the next version of ScaLAPACK will include various parallel divide and conquer algorithms for computing eigenvalues. These applications are characterized by a precedence graph of tasks, each of which can run on many processors in a data-parallel fashion. Array-parallel compiler researchers have already observed this opportunity, but the amount of task parallelism in static task graphs is so small as to be virtually useless; compilers also need expensive mathematical programming tools to schedule these task graphs statically. In contrast, we consider dynamic task graphs, where task parallelism often scales with the problem size. The challenge is to devise dynamic schedulers that are simple and yet effective.

In collaboration with Demmel and Yelick, I designed a simple and effective scheduling library for these divide and conquer applications. Large problems near the root are allocated the whole partition, while small problems close to the leaves are packed in a task-parallel fashion. There is some internal frontier at which the execution model switches from data to task parallelism. Some previous implementations switch at an ad hoc point tuned to a given problem and machine. In contrast, our switching algorithms are parameterized by machine and problem characteristics and are provably within small constant factors of optimal for independent batches of jobs and for trees of jobs.
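
The following is a minimal sketch, in Python, of the switched execution model under an assumed cost model (perfect data-parallel speedup plus a startup overhead alpha*p); it is an illustration only, not the scheduling library itself. Nodes whose subtree work exceeds a crossover threshold run data-parallel, one at a time, on the whole partition; each maximal subtree below the threshold runs on a single processor, and those subtrees are packed greedily across processors.

    def data_parallel_time(work, p, alpha=1e-3):
        # Perfect speedup plus a startup/communication cost growing with p (assumed model).
        return work / p + alpha * p

    def subtree_work(node):
        return node["work"] + sum(subtree_work(c) for c in node.get("children", []))

    def switched_makespan(root, p, crossover):
        dp_time, small = 0.0, []
        stack = [root]
        while stack:                              # region above the frontier
            node = stack.pop()
            if subtree_work(node) <= crossover:
                small.append(subtree_work(node))  # below the frontier: one processor
            else:
                dp_time += data_parallel_time(node["work"], p)
                stack.extend(node.get("children", []))
        loads = [0.0] * p                         # pack small subtrees, longest first
        for w in sorted(small, reverse=True):
            loads[loads.index(min(loads))] += w
        return dp_time + max(loads)

    def make_tree(work, depth):                   # binary tree, task sizes halve per level
        kids = [make_tree(work / 2, depth - 1) for _ in range(2)] if depth else []
        return {"work": work, "children": kids}

    tree = make_tree(1024.0, 8)
    for cx in (4.0, 64.0, 512.0):
        print(cx, round(switched_makespan(tree, 16, cx), 2))

Sweeping the crossover in this toy model shows the trade-off a good switch point must balance: switching too late pays per-task parallel overhead on many tiny tasks, while switching too early serializes large subtrees on single processors.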

Our analysis also indicates that the performance gap between switched-mode execution (which is simpler to program) and the general setting of allocating arbitrary processor subsets to jobs (for which many algorithms exist in theory) is often small. From a language design perspective, these results indicate that this limited form of mixed parallelism yields most of the benefits with a much simpler implementation. We are using the switching algorithms in the non-symmetric eigensolver in the next release of ScaLAPACK. The performance (MFLOPS) of the switch-scheduled code is up to three times that of pure data-parallel code.

Global resource scheduling for multiple jobs

So far we have focused on minimizing the finish time of a single application. As users share multiprocessor installations under more challenging workloads, the resource allocation and scheduling problem is complicated by multiple jobs arriving over time with diverse running times and resource requirements (e.g., processors, memory, network and disk bandwidth), by precedence constraints between the jobs, and by the need to optimize a measure of performance with some notion of priority and fairness (such as weighted average completion time, WACT) rather than just the maximum finish time. Moreover, some resources, like processors, are malleable, i.e., they can be traded for time, while others, like memory, can be non-malleable, admitting no such flexibility.
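
As a concrete illustration of the WACT objective (made-up data, and only the classical single-machine case, not the multiprocessor algorithms described below), the sketch compares first-come-first-served order with Smith's rule, which orders jobs by processing time divided by weight and is optimal on one machine.

    def wact(jobs):
        # jobs: list of (processing_time, weight), run back to back in the given order.
        t, total_w, weighted = 0.0, 0.0, 0.0
        for p, w in jobs:
            t += p                 # completion time of this job
            weighted += w * t
            total_w += w
        return weighted / total_w

    jobs = [(10.0, 1.0), (1.0, 5.0), (4.0, 2.0)]        # (time, weight); made-up data
    smith = sorted(jobs, key=lambda j: j[0] / j[1])     # smallest time/weight first
    print(round(wact(jobs), 2), round(wact(smith), 2))  # 11.88 vs 3.75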

In collaboration with Phillips, Schulz, Shmoys, Stein and Wein, I designed simple and near-optimal algorithms for these scenarios and several variations. We give algorithms for jobs with malleable resources and tree-shaped precedence, and for independent jobs with non-malleable resources; these are the first known constant-factor approximations. For jobs with both precedence and non-malleable resource constraints the problem appears to be harder. In joint work with Muthukrishnan, I obtained an O(log T) approximation for both WACT and makespan, where T is the ratio of the longest to the shortest job running time. Our algorithm uses a technique that deliberately introduces delays to improve on a greedy schedule! We also show that the log T blowup is unavoidable for certain instances. Since our models were carefully abstracted from real-life scheduling problems in parallel databases and operating systems, we expect these results to prove useful in practice.

Parallel randomized load balancing

Randomization has often been effective in dealing with dynamic and unpredictable problems. We motivate this using the problem of allocating file blocks in distributed ``serverless'' file systems like xFS (Dahlin, Serverless Network File Systems, PhD thesis, U.C. Berkeley, 1995). In this setting there are n clients whose local disks comprise the file system; the clients can thus be abstracted as n servers. As clients write files, new disk blocks must be allocated in a decentralized way without overloading any particular server. There is a trade-off between the cost of the communication needed to obtain more global information and the cost of load imbalance owing to imperfect knowledge. At one extreme one can send each new block to a random server; at the other one can get optimal load balance by maintaining exact global load information.

In collaboration with Adler, Mitzenmacher and Rasmussen, I gave a precise characterization of this trade-off by deriving tight upper and lower bounds on the best load balance achievable with a given quota of communication rounds. The algorithms poll the load of a constant number of random servers before placing each block (as in Azar et al., Balanced Allocations, STOC 1994), but in parallel rounds. Random allocation can also be useful in task-parallel applications like branch and bound, where the running time of a job is unknown before completion. We can then replace blocks by jobs and servers by processors. In the above analysis jobs are unit-time and independent. I have also analyzed jobs with diverse running times and precedence constraints. These randomized bounds on makespan have been supported by practical experience with irregular applications.
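
The polling idea is easy to see in a small sequential simulation (illustrative only; it ignores the parallel rounds and is not the analysis in the paper): placing each block on a uniformly random server is contrasted with polling d random servers and choosing the least loaded, as in Azar et al.

    import random

    def max_load(n_servers, n_blocks, d, rng):
        # Place each block on the least loaded of d randomly polled servers.
        loads = [0] * n_servers
        for _ in range(n_blocks):
            polled = [rng.randrange(n_servers) for _ in range(d)]
            loads[min(polled, key=lambda s: loads[s])] += 1
        return max(loads)

    rng = random.Random(0)
    n = 1000
    for d in (1, 2, 4):                 # d = 1 is purely random placement
        print(d, max(max_load(n, n, d, rng) for _ in range(10)))

With n blocks on n servers, purely random placement typically gives a maximum load of about log n / log log n, while even two choices bring it down to about log log n (Azar et al.); the extra polling buys a dramatically better balance.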

Future directions

The problems described above in part reflect my reactions to the great flux in parallel computing over the last five years. The realities of system design have swung the trend from fine-grain, dedicated, custom architectures to largely commodity architectures such as the IBM SP-1 and SP-2 and networks of workstations. The operating systems trend is to permit multiprogramming with both time and space sharing, unlike earlier machines that provided dedicated time or space slices with gang scheduling. Communication optimization and resource allocation are major issues for such machines. This motivates my more recent work on switching, message scheduling, and resource allocation.

My work suggests a few directions for making parallel computation more ubiquitous. First, some of the compilation technology described above, combined with runtime communication scheduling, may be used for more fine-grained, dynamic communication optimization. Second, resource scheduling takes on a new scale when multiple communicating threads are scheduled onto functional units with private register files, with fast communication between register files explicitly managed by the compiler. These two areas will extend parallel computing to finer grains. A third area of focus, at a coarser grain, is better multi-user job scheduling on shared multiprocessor installations; this is gaining importance as many national supercomputing centers are being set up and their workloads mature. I expect that the prevalent machine model for all this work will be clusters of symmetric multiprocessors.

In the long term I wish to work in an environment that encourages an interaction between modeling, practical algorithmics, and system building. Although I anticipate many surprises in hardware trends and performance models, the problems of scheduling and resource allocation will remain of central interest, and general principles, rather than tuning by trial and error, will be required for optimal performance and utilization.

Addendum (April 1998)

The high-performance computing scenario is changing faster than ever. Confederations of computers over local- to wide-area networks, assembled to manage and serve distributed databases, are now a dominant application of high-performance networks and computers. Queries arrive at such distributed servers continuously and far from predictably. For such a continuous arrival process, ``completion time'' is meaningless; fairness to jobs is a key issue.

In collaboration with Bender and Muthukrishnan, I have formalized a notion of fairness that has been used informally in the database server community. We call it stretch: the ratio of the time a job spends in the system to its inherent processing time. One does not expect the same average delay at a teller machine as for a loan application; one hopes for small stretch in both cases. We give positive and negative results for off-line and on-line scheduling with the stretch objective.
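
A minimal illustration of the definition (made-up numbers, one server, all jobs arriving at time zero; not the algorithms in the paper): the stretch of a job is its completion time divided by its processing time, and running short jobs first keeps the worst stretch small.

    def stretches(jobs):
        # jobs: processing times, run back to back from time 0 in the given order.
        t, out = 0.0, []
        for p in jobs:
            t += p
            out.append(t / p)      # time in system / processing time
        return out

    jobs = [10.0, 1.0, 1.0, 1.0]
    print(max(stretches(jobs)))          # FIFO: the short jobs wait -> 13.0
    print(max(stretches(sorted(jobs))))  # shortest first -> 3.0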

It would be interesting to pursue further research on scheduling under uncertainty (e.g., fluctuating network bandwidth available to a web server) and on adaptive scheduling techniques.