Workshop on Compiler Construction with Introduction to GCC

Uday Khedker and Amitabha Sanyal
Department of Computer Science & Engg., IIT Bombay

About the Workshop

This is a one-week workshop primarily aimed at teachers in various colleges. This workshop is organized under the aegis of the GCC Resource Center of the Department of Computer Science and Engineering of IIT Bombay.

We can conduct this course at any local venue provided there is a local organizer who is willing to make the arrangements. Basically, we need a class room and associated lab with machines running a contemporary version of Linux supporting the following packages. Other than these, the arrangements include the usual lodging and boarding required for us, our teaching assistants, and the participants. If you are interested in organizing this workshop, please contact us.

Packages required for conducting the workshop are:

The usual packages/commands available on a contemporary distribution of Linux

gcc, lex/flex, yacc/bison, macroprocessor m4, and postscipt viewer gv (or ggv/kghostview)

Extra packages

spim andxspim
iburg
dot (this is a part of the graphviz package.)

Proposed Pedagogy

Modern day compilers are, to a large extent, automatically generated rather than hand-crafted. This workshop will provide hands-on experience in building compilers through the use of compiler generation tools. While the tools for generating compile front-ends, lex and yacc, are now standard textbook material, the workshop will also include the use of a tools called iburg for generating back-ends.

The main objective of this workshop is to teach how to build a compiler incrementally such that:

Each steps supports new source language features rather than a new phase in the compiler. In other words, each steps results in a complete working compiler which generates code which can be executed. We begin with a very small language consisting merely of expression.
Tools like lex, yacc (or flex, bison), and iburg are used before the theory behind scanning, parsing, and instruction selection is covered.

Please see the philosophy behind this pedagogy for a justification. The basic motivation behind this approach is to be able to strike a good balance between theory and practice and make the theory follow the practice instead of the other way round which is what happens in traditional courses.

Our choice of source language is loosely based on C while the target machine is MIPS as defined by the spim simulator (some sample spim programs collected from various resources available on the net). I have used spim7.3 version.

The emphasis of the workshop will be on laboratory exercises in which participants will be required to build a compiler in a series of steps. Starting with a very simple language, each step will end in a working compiler for a language, and the subsequent step will start by augmenting the previous language by adding a few more features. Lectures in the workshop will be used to provide the conceptual background regarding different aspects of compiler construction.

In general, the delivery plan for the workshop will consists of several rounds of:

Lectures to introduce (certain aspects) of the tools.
Laboratory exercises to use the tools to implement a language with a given set of features.
Lectures that will utilize the experience gained in the laboratory exercises to provide the theoretical basis of compiler construction.

Contents of the Workshop

Theory Practice

Theory	Practice
Scanning. Expressing tokens through regular expressions, converting regular expressions to DFA, generating scanners from specifications, structure of generated scanners, efficient representation of scanning tables. Parsing. Expressing syntactic structure through grammars. LR parsing theory, generating parsers from specifications. Static Semantics. Using attributes for (i) type checking (ii) calculating variable offsets (iii) Generating abstract syntax trees. Runtime Environments. Activation records, function prologues and epilogues for allocation and deallocation of activation records. Code Generation. Optimal code generation for expression trees. Generating code generators from specifications. Simple register allocation. Introduction to GCC. The compiler generation framework. Retargetability model. Inspecting intermediate represenations of program.	Using the compiler generation tools `lex`, `yacc` , and `iburg` for generating MIPS code accepted by the `spim` simulator for the following language features: Constant expressions Expressions involving variables Assignment statements Array references Declarations Function calls These experiments will be preceded by lab exercises on writing simple interpreters.

Scanning. Expressing tokens through regular expressions, converting regular expressions to DFA, generating scanners from specifications, structure of generated scanners, efficient representation of scanning tables.
Parsing. Expressing syntactic structure through grammars. LR parsing theory, generating parsers from specifications.
Static Semantics. Using attributes for (i) type checking (ii) calculating variable offsets (iii) Generating abstract syntax trees.
Runtime Environments. Activation records, function prologues and epilogues for allocation and deallocation of activation records.
Code Generation. Optimal code generation for expression trees. Generating code generators from specifications. Simple register allocation.
Introduction to GCC. The compiler generation framework. Retargetability model. Inspecting intermediate represenations of program.

Using the compiler generation tools lex, yacc , and iburg for generating MIPS code accepted by the spim simulator for the following language features:

Constant expressions
Expressions involving variables
Assignment statements
Array references
Declarations
Function calls

These experiments will be preceded by lab exercises on writing simple interpreters.

Tentative Schedule

Here is a tentative list of topics. The schedule will be based on the proposed coverage but may change dynamically to suit the needs of the workshop and depends on the way the workshop progresses. The lectures and lab work will be interleaved and this interleaving may be decided dynamically based on the specific needs of the audience.

Day	Compilation Concepts Covered	Lab exercises
1	Structure of compilers and interpreters Introduction to scanning and parsing though lex and yacc	Assembly language programming using spim. Evaluation (i.e. interpretation) of constant expressions involving arithmetic operations.
2	Scanning theory Introduction to attribute evaluation and symbol tables Introduction to name and scope analysis, intermediate code generation, abstract syntax trees, three address codes, instruction selection through iburg, local register allocation, activation records and stack allocation for local variables, emitting assembly code for spim.	Evaluation (i.e. interpretation) of Expressions with variables and assignments to variables. Conditional expressions involving relational and boolean operations.
3	Parsing theory. Introduction to declaration processing and type analysis.	Compilation of Constant expressions involving arithmetic operations. Expressions with variables and assignments to variables. Boolean expressions involving relational operations on integer expressions.
4	Run-time environment support provided by a compiler.	Declarations of integer and boolean data types.
5	Theory of instruction selection	Compilation of Control flow constructs if-then, if-then-else, while-do etc. Conditional expressions using control flow constructs.
6	Theory of instruction selection	Compilation of arrays
7	Introduction to gcc	Compilation of function declarations and calls, parameter passing.

Pre-requisites

Good knowledge of C and familiarity with Unix/Linux.

The philosophy behind this approach

This approach is very different from the traditional approach of teaching compiler construction phase-wise. The traditional phase-wise approach first covers scanning in full details, then parsing, then semantic analysis etc. The problem with such an approach is that a student does not get a feel of the complete compiler or the overall process of compilation until very late in the course. While there are many small compilers (eg. in the "dragon back" by Aho, Sethi, and Ullman) can be used to expose the students to overall compilation, they are not scalable. So students end up seeing a lot of theory piece-wise but do not develop a feel of the overall compilation process. (In principle, the phases have very well defined simple interfaces, but to get its feel through an implementation which allows experimentation is a different matter altogether.) By time the students are ready to put the tools based on this theory to non-trivial use, it is already too late in the semester. Further, instruction selection phase remains more of a theory and very little practice gets across to the students. The proposed approach eliminates these drawbacks and bring the tools very early in the course. This is possible because the tools like `lex`, `yacc`, and `iburg` can be used without knowing the theory tool. An added advantage of this approach is that it serves as a motivation for the students to learn the theory by raising their curiosity.

Note that the basic motivation behind this approach is to be able to strike a good balance between theory and practice. The hope is that by bringing the practice upfront as a driving force a better balance can be achieved compared to the current offerings which are perceived as being biased towards theory by students.