Sclp: A Language Processor for a Small C-like Language
(version 1.1)
Uday Khedker
Department of Computer Science & Engg., IIT Bombay


About Sclp

The language processor sclp has been implemented for UG courses cs302+cs316: Implementation of Programming Languages (theory and lab) at IIT Bombay. It is designed and implemented by Uday Khedker with the help of many students. Its immediate predecessor cfglp, which processed cfg output of gcc was implemented by Tanu Kanvar in 2012. It was updated by N Venkatesh in 2019 to change the front end from flexc++ and bisonc++ to flex and bison as also the change the input language away from cfg dump of gcc. The current version is a significant enhancement done by Uday Khedker with the help of Nisha Biju and Saari Rajan. The main change is to introduce three address code between the low-level RTL IR and the abstract syntax tree, enhancements of reading and printing facilties, and support from string constants.

If you are new to the area of compilers, you may want to see this for an overview before proceeding further.

The main motivation behind this implementation has been to create a small well-crafted code base for a UG compiler construction class such that it can be enhanced systematically by the students. The focus in on incremental construction with some increments coming from language features and some coming from the phases in compilation. Thus the complexity of understanding the complete compilation is controlled by restricting the input language also apart from omitting phases or modules of the language processor. We believe that this provides a much better feel of compiler construction than the traditional approach of enhancing the compiler phase-wise for the full language. A secondary goal of sclp is to support interpretation so that students can understand the relationship between compilation and interpretation first-hand.

We welcome feedback and suggestions on improving sclp, as also bug reports.

All copyrights are reserved by Uday Khedker. This implementation has been made available purely for academic purposes without any warranty of any kind.



Functionality of Sclp

This language processor takes as input programs written in a subset of C (described below). Based on command line parameters, the read program is either interpreted or compiled to generate equivalent spim assembly language program. You may find this course on MIPS assembly programming to be very useful.

Compilation of a program through sclp involves translating a representation of the program into another through a series of steps. This sequence of translation is illustrated below:

SCLP IRs

All these representations (other than the concrete syntax tree, which is constructed implicitly during parsing) are available as text files produced by appropriate command line switches. The complete usage of the sclp command with all options is as shown below.

   Usage: sclp [options] [file]
            Options:
            -help     Show this help
            -sa-scan  Stop after scanning
            -sa-parse Stop after parsing
            -sa-ast   Stop after constructing Abstract Syntac Tree (AST)
            -sa-tac   Stop after constructing Three Address Code (TAC)
            -sa-rtl   Stop after constructing Register Transfer Language (RTL) code
            -tokens   Show the tokens in file.toks (or out.toks)
            -ast      Show abstract syntax trees in file.ast (or out.ast)
            -tac      Show the Three Address Code in file.tac (or out.tac)
            -rtl      Show the Register Transfer Language code in file.rtl (or out.rtl)
            -symtab   Show the symbol table after RTL construction (when offsets are
                      allocated) in file.sym, (or out.sym)
            -d        Demo version. Use stdout for the outputs instead of files


The compilation can also be stopped after a given phase.

For simplicity, we have divided sclp into two parts: Interpreter and Compiler. Currently only the compilation mode is enabled (and hence the -eval option has been withdrawn). Currently, the compiler does not do register allocation (or any other optimization). The current


Building Sclp



sclp
          usage



Features of the language supported by Sclp

The language supported by sclp is based on C with several simplifications and some enrichments. We describe it in terms of data types, values, operations on them, and data and control constructs. This description is based on the current design. The language and sclp is a work in progress and will be updated later.

Data type, values, and operations on values

sclp supports integers, floating point numbers, booleans values, and strings:

Language Constructs

At the moment, the language supports only scalar values; there are no aggregates. Since there are no type constructors, there are no user defined types.

The computations in a program are organized using the following constructs:

Several example of valid programs have been provided below. The finer details of the language can be discovered by creating examples and running them through the reference implementation.

Various Phases of Sclp Compiler


Sclp compiles a program by constructing a series of intermediate representations (IRs) such that each IR is computed from the earlier IR.
Sclp embodies object oriented design and implementation. The implementation language is C++. Lex and Yacc have been used for parsing and scanning. The overall compilation flow is the classical sequence of various phases as illustrated below.


Schematic of sclp
            compiler


The tasks performed by various modules in the above block diagram are briefly described below.
The main procedure of the sclp compiler is reproduced below verbatim. Observe the linear sequence of phases and the fact that each phase is executed conditionally. If a given phase is not executed, it blocks all subsequent phases.


int main(int argc, char * argv[])
{
    bool parse_result = false;

    command_options.process_user_command_options(argc, argv);

    if (command_options.construct_parse())
        parse_result = yyparse();    // AST is constructed along with parsing.
    else
        while (yylex())
            ;                // yylex is invoked repeatedly until it return 0.

    CHECK_INPUT((!parse_result), "Cannot parse the input program", yylineno);
   
    if (command_options.construct_ast())
    {
        if ((error_status() == false) && (command_options.is_show_ast_selected()))
            program_object.print_ast();
    }

    if (command_options.construct_tac() && (error_status() == false))
        program_object.gen_tac();
       
    if ((command_options.is_show_tac_selected()) && (error_status() == false))
        program_object.print_tac();

    if (command_options.construct_rtl() && (error_status() == false))
             program_object.gen_rtl();
       
    if ((command_options.is_show_symtab_selected()) && (command_options.construct_rtl())
                            && (error_status() == false))
        program_object.print_sym();

    if ((command_options.is_show_rtl_selected()) && (error_status() == false))
             program_object.print_rtl();

    if (command_options.construct_asm() && (error_status() == false))
        program_object.print_assembly();
        /* construct_asm implies show_asm */

    program_object.delete_all();

    return 0;
}



Sclp Levels

We define the following levels of the input language accepted by the sclp compiler. Each level includes all the features of the previous level as illustrated below.

Language Levels

The language features included in different levels are as follows. Although some keywords do not appear in some levels, they continue to be reserved and cannot be used as variable or function names.

There is a plan to extend the language further, by supporting data structures such as arrays, structures, and classes. Besides, dynamic memory allocation and pointers to handle it may also be included in future.


The Proposed Series of Assignment based on Incremental Construction


The full sclp construction is recommended to be done in a series of six assignments whose deliverables are based on monotonic increments as illustrated below. Different phases have been named with the IR that they produce. The accepted input language is shown in terms of nestings of levels in terms of containment of features. The black boxes show the language features and the compiler phases supported by the deliverable of a particular assignment. They are named A1 to A6 and the names are shown in the top right corner in light blue circles.

Assignment
          series

Note that these increments are monotonic and the implementation of Assignment i should continue to support all features of Assignment i-1.

Design of Sclp Level 1


The main classes that constitute sclp are:

In order to understand the working of a compiler, one needs to understand the grammar and construction of AST. There are other non-trivial questions that need to be answered:

  1. Why multiple TAC statements may be generated for a single AST and why multiple RTL statements may be generated for a single TAC statement.
  2. Why it is useful to also record the place (temporary variable for TAC and register for RTL) that contains the final result of the computation described by a list of TAC or RTL statements.
  3. How the register information has been encoded in the data structures.
  4. How the statement categories (eg. register-to-memory, memory-to-register etc.) have been used for register allocation.
  5. How offsets are assigned to variables.
  6. The activation record structure.
  7. How the machine instructions have been encoded in the data structures.
Points 1, 2, 5, and 6 are generic concepts that can be understood from the first principles. Points 3, 4, and 7 are specific sclp design. Points 1 and 2 implement the spirit of the syntax directed definition provided in Figure 6.19 (page 379) of the text book (Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman).

While understanding points 3, 4, 6, and 7, please keep the spim instruction set architecture in mind.
You may find this course on MIPS assembly programming to be very useful.


Ine the rest of this section, the grammar and the data structures for Level 1 have been described.
A Subset of Level 1 Grammar

A program consists of a declaration_list which is a list of (optional) global variable_declaration_list, and a procedure_declaration, followed by a procedure_definition (in level 1 we have a single procedure).


program : declaration_list
procedure_definition
;

declaration_list :
procedure_declaration
|
variable_declaration_list
procedure_declaration

| procedure_declaration
variable_declaration_list
;


The associated action saves the list of global variables and checks that the names of the global variables and the procedure name are distinct. If no global variables are declared in the user program then the global variable list is NULL. Once the body of the procedure is processed, the procedure is added with its procedure name to a map of procedures (in level 1 we have only one procedure).

Note that there is a difference between a procedure declaration (
procedure_declaration in the productions above) and a procedure definition  (procedure_definition the productions below).

Our declaration_list indicates that the optional global variable list and the procedure prototype can exist in any order. In the procedure_definition the procedure does not have a return type associated with it (because it does not return a value) and does not have any parameters. The procedure body is enclosed in a pair of braces { and } and contains an optional_variable_declaration_list (local variables) followed by a statement_list (in case of level 1).

   
procedure_declaration : VOID NAME '(' ')' ';'
;

procedure_definition : NAME '(' ')'

'{'
optional_variable_declaration_list
statement_list
'}'
;


The action associated with this rule checks if the list of local variables have been declared in the user program, and saves the list of local variables and the abstract syntax trees.

The variable_declaration_list is defined recursively as follows:

   
optional_variable_declaration_list : /* empty */
| variable_declaration_list
;

variable_declaration_list : variable_declaration
| variable_declaration_list
variable_declaration
;

variable_declaration : declaration ';'
;

declaration : INTEGER variable_list
;

variable_list : NAME
| variable_list ',' NAME
;


Note that the same non-terminal is used for lists of global as well as local variables and the position where it occurs (inside of a procedure or outside it) tells us whether it is global or local. A declaration of a variable consists of the type name (only INTEGER in level 1) and the NAME of the variable followed by a semicolon (;). The action checks variables for possible redeclaration within the same scope. Variables thus recognized, are pushed in a symbol table.

In level 1, we have only assignment statements. An assignment statement can involve only a variable or an integer number on the right hand side and only a variable on the left hand side.


statement_list : /* empty */
| statement_list
assignment_statement
;

assignment_statement : variable ASSIGN variable ;
| variable ASSIGN constant ;
;

variable : NAME
;

constant : INTEGER_NUMBER
;


The associated actions check that a variable has been declared and the types of the left hand side and the right hand side are same.

Level 1 Data Structures

The data structure of the interpreter for Level 1 is as shown below. For simplicity, we first show the data structure restricted to a single parameterless procedure (as mandated by Level 1). Its extension for a compiler for higher levels is provided further below.




The program object is an instance of class Program. It contains:


The Ast classes described in gray colour are to be included in the subsequent levels.


Level 1 Interpretation

During interpretation, the local and global variables are copied into a separate data structure called the eval_env which is an instance of class Local_Environment which stores the variable names and their information for each procedure call (including recursive function calls). Note that this data structure is not necessary in level 1 because the values could be stored in the symbol table but this design has been chosen keeping in mind the enhancements in the subsequent levels.

Interpretation is performed by traversing each AST. Each AST is evaluated and the values of the variables in eval_env are updated. This requires recording the targets of goto statements also. Further, subsequent levels would need recording the values of floating point variables also. In order to implement a generic evaluate procedure across all ASTs, the values are stored using objects of a class called Eval_Result. We derive two classes Eval_Result_Value (used to store values of variables) and Eval_Result_Target (used to store the targets of gotos in level 3). Further, class Eval_Result_Value is used to derive two more classes called Eval_Result_Value_Int and Eval_Result_Value_Float (this distinction will be required in level 3).

Level 1 Compilation

Compilation requires successful construction of ASTs for the entire program submitted to sclp and is performed by traversing the ASTs. It involves the following two steps.
The revised data structure (generalized to multiple procedures and procedures with parameters) now looks as shown below. The light blue boxes indicate additions required for compilation.





An TAC code statement can be one of the following. TAC_Stmt is the base class and other classes are derived from it. In level 1, we have only move statements. The gray ovals indicate the statements that  are introduced in subsequent levels.

TAC Statements


An RTL code statement can be one of the following. RTL_Stmt is the base class and other classes are derived from it. In level 1, we have only move statements. The gray ovals indicate the statements that  are introduced in subsequent levels.





The picture below shows the classes for each IR and what they are translated into for the next IR.







The classes for the operands for each IR and what they are translated into is shown by the picture below.