Despite the attention this area has received, many problems lack satisfactory solutions. First, simple application of existing results to new domains is rarely possible, because the setting often has crucial details not addressed by classical scheduling literature. Second, the practical situation is usually complex, and it is hard to abstract a model that is both simple and faithful to reality. Third, most of the problems formulated are computationally intractable, and knowledge of complexity and approximations is needed to design algorithms and analyze them. Finally, it is important to design practical solutions reflecting both the constraints and simplifications offered by practical scenarios.
My dissertation research involves the application of the following skills for efficient system design: modeling system performance and formulating an optimization problem for the model; design and analysis of approximation algorithms related to on-line resource allocation and job scheduling under a variety of realistic situations; and contributing software to compilers like High Performance Fortran (HPF), parallel scientific libraries (ScaLAPACK), runtime systems (Multipol), and parallel applications. My specific accomplishments are summarized in the following paragraphs. I conclude with a review of the choice of problems, plans for future work, and the general direction I wish to pursue.
In collaboration with the HPF group at IBM Research, I developed a new compiler algorithm for communication analysis and optimization that eliminates redundancy while maximizing network and cache performance globally across loops and statements. This significantly improves on the recently proposed array dataflow analyses, which use an ``earliest placement'' policy: our algorithm maximally exploits opportunities for combining messages to amortize fixed overhead, and in fact avoids early communication placement when that would lead to large message startup overhead. I implemented the algorithm in pHPF, IBM's production HPF compiler. Although pHPF already generates highly optimized code, the new technique cut communication time by another factor of 2--3, yielding a 20--40% overall saving for many well-known scientific benchmarks.
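The benefit of message combining can be seen in the standard latency--bandwidth cost model. The sketch below is purely illustrative (it is not the pHPF algorithm); the constants ALPHA and BETA are assumed values for per-message startup overhead and per-byte transfer cost.

```python
# Illustrative cost model, not the pHPF algorithm: sending k separate
# messages of n bytes costs k*(ALPHA + BETA*n), while one combined
# message of the same total payload costs ALPHA + BETA*k*n.
ALPHA = 50e-6   # assumed per-message startup overhead (seconds)
BETA = 1e-9     # assumed per-byte transfer cost (seconds)

def separate_cost(num_messages, bytes_each):
    """Total cost when each message pays the fixed startup overhead."""
    return num_messages * (ALPHA + BETA * bytes_each)

def combined_cost(num_messages, bytes_each):
    """Cost when the messages are coalesced into a single transfer."""
    return ALPHA + BETA * num_messages * bytes_each
```

For 100 messages of 1 KB each under these assumed constants, the combined transfer pays the startup cost once instead of 100 times, which is the amortization the compiler optimization exploits.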
In collaboration with Demmel and Yelick, I designed a simple and effective scheduling library for these divide-and-conquer applications. Large problems near the root are allocated the whole partition, while small problems close to the leaves are packed in a task-parallel fashion; there is some internal frontier at which the execution model switches from data to task parallelism. Previous implementations have switched at an ad hoc point tuned to a given problem and machine. In contrast, our switching algorithms are parameterized on machine and problem parameters and are provably within small constant factors of optimal for independent batches of jobs and for trees of jobs.
Our analysis also indicates that the performance gap between switched-mode execution (which is simpler to program) and the general setting of allocating arbitrary processor subsets to jobs (for which many algorithms exist in theory) is often small. From a language design perspective, these results indicate that this limited form of mixed parallelism produces most of the benefits with a much simpler implementation. We are using the switching algorithms in the non-symmetric eigensolver in the next release of ScaLAPACK. The performance (MFLOPS) of the switch-scheduled code is up to three times that of pure data-parallel code.
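The switched-mode idea can be sketched as follows. This is a minimal illustration under a hypothetical crossover rule, not the ScaLAPACK implementation: the `crossover` function and its quadratic form are assumptions standing in for the machine- and problem-parameterized switch point described above.

```python
# Sketch of switched-mode scheduling for a divide-and-conquer tree.
# Above the crossover size a node runs data-parallel on all p
# processors; below it, each subtree is packed task-parallel.

def crossover(p, overhead_per_proc=1.0):
    """Assumed switch point: stay data-parallel only while the work
    is large enough to amortize per-processor overhead (illustrative
    choice: size ~ p^2 * overhead)."""
    return p * p * overhead_per_proc

def schedule(size, p, branching=2):
    """Return a nested tuple of ('data', ...) / ('task', ...)
    decisions down the recursion tree."""
    if p == 1 or size < crossover(p):
        return ('task', size, p)      # pack this subtree task-parallel
    child = size // branching
    return ('data', size, p,
            [schedule(child, p, branching) for _ in range(branching)])
```

The point of the analysis above is that the switch point should be derived from machine and problem parameters rather than hand-tuned, so the same rule transfers across platforms.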
In collaboration with Phillips, Schulz, Shmoys, Stein and Wein, I designed simple and near-optimal algorithms for these scenarios and variations. We give algorithms for jobs with malleable resources and tree-shaped precedence, and for independent jobs with non-malleable resources; these are the first known constant-factor approximations. For jobs with both precedence and non-malleable resource constraints the problem appears to be harder. In joint work with Muthukrishnan, I obtained an O(log T) approximation for both weighted average completion time (WACT) and makespan, where T is the ratio of the longest to the shortest job time. Our algorithm uses a technique that deliberately introduces delays to improve on a greedy schedule. We also show that the log T blowup is unavoidable for certain instances. Since our models were carefully abstracted from real-life scheduling problems in parallel databases and operating systems, we expect that these results will prove useful in practice.
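One standard way to see where a log T factor can arise (a hedged illustration, not our algorithm) is to bin jobs into geometric length classes: with T the longest-to-shortest ratio, there are O(log T) classes, and handling each class of near-uniform lengths separately loses at most that factor.

```python
import math

# Illustrative only: partition job processing times into classes
# [2^k, 2^(k+1)) relative to the shortest job. The number of
# non-empty classes is at most about log2(T) + 1.

def length_classes(times):
    shortest = min(times)
    classes = {}
    for t in times:
        k = int(math.log2(t / shortest))  # geometric class index
        classes.setdefault(k, []).append(t)
    return classes

jobs = [1, 2, 3, 5, 9, 17, 30]
T = max(jobs) / min(jobs)
assert len(length_classes(jobs)) <= math.ceil(math.log2(T)) + 1
```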
In collaboration with Adler, Mitzenmacher and Rasmussen, I gave a precise characterization of this trade-off by deriving tight upper and lower bounds on the best load balance possible given a quota of communication rounds. The algorithms poll the load of a constant number of random servers for placing each block (as in Azar et al., Balanced Allocations, STOC 1994), but in parallel rounds. Random allocation can also be useful in task-parallel applications like branch and bound, where the running time of a job is unknown before completion; we can then replace blocks by jobs and servers by processors. In the analysis above, jobs are unit-time and independent. I have also analyzed jobs with diverse running times and precedence constraints. These randomized bounds on makespan have been supported by practical experience with irregular applications.
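The sequential d-choice rule of Azar et al. can be sketched as below; the round structure of our parallel variant is omitted, and the parameters are illustrative.

```python
import random

# Sketch of the d-choice allocation rule (Azar et al., Balanced
# Allocations): each block polls d random servers and is placed on
# the least loaded of them; d=1 is plain random placement.

def allocate(num_blocks, num_servers, d, seed=0):
    """Place blocks by the d-choice rule; return the maximum load."""
    rng = random.Random(seed)
    load = [0] * num_servers
    for _ in range(num_blocks):
        choices = [rng.randrange(num_servers) for _ in range(d)]
        best = min(choices, key=lambda s: load[s])
        load[best] += 1
    return max(load)
```

Even d = 2 typically yields a far smaller maximum load than d = 1; the question our work answers is how much of this benefit survives when the polls must be batched into a fixed quota of communication rounds.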
My work suggests a few directions to improve the ubiquity of parallel computation. First, some of the compilation technology described above, combined with runtime communication scheduling, may be used for more fine-grained, dynamic communication optimization. Second, resource scheduling can assume a new scale where multiple communicating threads are scheduled onto functional units with private register files, with fast communication between files explicitly managed by the compiler. These two areas will extend parallel computing to finer grains. A third area of focus, at a coarser grain, is better multi-user job scheduling on shared multiprocessor installations; this is gaining importance as many national supercomputing centers are being set up and the workload matures. I expect that the prevalent machine model for all this work will be clusters of symmetric multiprocessors.
In the long term I wish to work in an environment that encourages an interaction between modeling, practical algorithmics, and system building. Although I anticipate many surprises in hardware trends and performance models, the problems of scheduling and resource allocation will remain of central interest, and general principles, rather than tuning by trial and error, will be required for optimal performance and utilization.
In collaboration with Bender and Muthukrishnan, I have formalized one notion of fairness that has been used informally in the database server community. We call it stretch: the ratio of the time a job spends in the system to its inherent processing time. One does not expect the same delay at a teller machine as for a loan application; one hopes for a small stretch in both cases. We give positive and negative results for off-line and on-line scheduling with the stretch objective.
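The stretch metric follows directly from its definition above; a minimal sketch (the FIFO example and job data are assumed for illustration):

```python
# Stretch of a job = (completion time - arrival time) / processing
# time, i.e. time in system relative to inherent work.

def stretches(jobs, completions):
    """jobs: list of (arrival, processing); completions: completion
    times in the same order. Returns per-job stretch values."""
    return [(c - a) / p for (a, p), c in zip(jobs, completions)]

# Example: a 10-unit and a 1-unit job both arrive at time 0 and run
# FIFO on one machine, long job first.
jobs = [(0, 10), (0, 1)]
fifo = [10, 11]
# The 1-unit job finishes at time 11, so its stretch is 11: waiting
# behind long jobs hurts short jobs most, which is exactly what the
# stretch objective penalizes.
```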
It would be interesting to conduct further research on scheduling under uncertainty (e.g., fluctuating network bandwidth available to a web server) and on adaptive scheduling strategies.