Sclp: A Language Processor for a Small C-like Language
(version 2.0)
Uday Khedker
Department of Computer Science & Engg., IIT Bombay


About History Future Functionality Building SCLP Input Language Phases of SCLP SCLP Levels Assignments Design of SCLP


About Sclp

The language processor sclp has been implemented for UG courses cs302+cs316: Implementation of Programming Languages (theory and lab) at IIT Bombay. It is designed and implemented by Uday Khedker with the help of many students. You can read details about its history and possible future.

If you are new to the area of compilers, you may want to see this for an overview before proceeding further.

The main motivation behind this implementation has been to create a small well-crafted code base for a UG compiler construction class such that it can be enhanced systematically by the students. The focus in on incremental construction with some increments coming from language features and some coming from the phases in compilation. Thus the complexity of understanding the complete compilation is controlled by restricting the input language also apart from omitting phases or modules of the language processor. We believe that this provides a much better feel of compiler construction than the traditional approach of enhancing the compiler phase-wise for the full language. A secondary goal of sclp is to support interpretation so that students can understand the relationship between compilation and interpretation first-hand.

We welcome feedback and suggestions on improving sclp, as also bug reports.


Functionality of Sclp

This language processor takes as input programs written in a subset of C with some extensions (described below). Based on command line parameters, the read program is either interpreted or compiled to generate equivalent spim assembly language program. You may find this course on MIPS assembly programming to be very useful.

Compilation of a program through sclp involves translating a representation of the program into another through a series of steps. This sequence of translation is illustrated below:

SCLP IRs

All these representations (other than the concrete syntax tree, which is constructed implicitly during parsing) are available as text files produced by appropriate command line switches. The complete usage of the sclp command with all options is as shown below.

$ ./sclp --help
Usage: sclp [OPTION...] [FILE]
Sclp - A language processor for C-like language

      --sa-scan              Stop after scanning
      --sa-parse             Stop after parsing
      --sa-ast               Stop after constructing Abstract Syntax Tree
                             (AST)
      --sa-tac               Stop after constructing Three Address Code (TAC)
      --sa-rtl               Stop after constructing Register Transfer Language
                             (RTL) code
      --show-tokens          Show the tokens in FILE.toks (or out.toks)
      --show-ast             Show abstract syntax trees in FILE.ast (or
                             out.ast)
      --show-tac             Show the Three Address Code in FILE.tac (or
                             out.tac)
      --show-rtl             Show the Register Transfer Language code in
                             FILE.rtl (or out.rtl)
      --show-symtab          Show the symbol table after RTL construction (when
                             offsets are allocated) in FILE.sym, (or out.sym)
      --show-asm             Generate the assembly program in FILE.spim (or
                             out.spim). This is the default action and is
                             suppressed only if a valid `sa-...' option is
                             given to stop the compilation after some earlier
                             phase.
      --show-json-ast        Show the Abstract Syntax Tree in JSON format
      --show-json-tac        Show the Three Address Code in JSON format
      --show-json-rtl        Show the Register Transfer Language code in JSON
                             format
      --read-json-ast        Use the input file (in JSON format) to generate
                             the AST and skip all the phases from scanning to
                             parsing
      --read-json-tac        Use the input file (in JSON format) to generate
                             the TAC and skip all the phases from scanning to
                             TAC generation
      --read-json-rtl        Use the input file (in JSON format) to generate
                             the RTL code and skip all the phases from scanning
                             to RTL generation
  -d, --demo                 Demo version. Use stdout for the output instead of
                             files
      --gen-temp-symb-table  Populate Symbol Table For Temporaries
  -e, --single-stmt-bb       Flag to construct single statement basic blocks
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version


The usage with optional passes enabled produces a much longer output.

The compilation can also be stopped after a given phase. Besides, AST, TAC, and RTL IRs can be exported in JSON format or imported in JSON format. When the --read-jason-<IR>is given, the compilation skips all the earlier phases, takes the given IR as input, and executes the subsequent phases depending upon the other flags.

For simplicity, we have divided sclp into two parts: Interpreter and Compiler. Currently only the compilation mode is enabled (and hence the -eval option has been withdrawn). Currently, the compiler does not do register allocation (or any other optimization).


Building Sclp



sclp
          usage



Language supported by Sclp

The language supported by sclp is based on C with several simplifications and some enrichments. We describe it in terms of data types, values, operations on them, and data and control constructs. This description is based on the current design. The language and sclp is a work in progress and will be updated later.

Data type, values, and operations on values

sclp supports integers, floating point numbers, booleans values, and strings:

Language Constructs

At the moment, the language supports scalar values (int, float, bool, string), pointers and multi-dimensional arrays. Pointers can point only to scalars and arrays can contain only scalars. There are no user defined types. Pointers and arrays are written in the usual C syntax.

The computations in a program are organized using the following constructs:

Operators have the following precedences and associativities. The higher numbers indicate higher precedence. In the type signature below BB represents the bool type whereas α\alpha is instantiated to either int or float type (but not both). Note that string is not an admissible type for any operator.


Precedence Level
Operator Group
Operators
Type Signature
Associativity
1 (lowest)
Ternary Expression
?, : B×α×ααB \times \alpha \times \alpha \to \alpha
Right
2
Boolean
||
B×BBB \times B \to B
Left
3
&&
B×BBB \times B \to B
Left
4
!
BBB \to B
Right
5
Relational
!=, ==, <, <=, >, >= α×αB\alpha \times \alpha \to B
Non-associative
6
Arithmetic
+, -
α×αα\alpha \times \alpha \to \alpha
Left
7
*, /
α×αα\alpha \times \alpha \to \alpha
Left
8 (highest)
- (unary)
αα\alpha \to \alpha
Right
       

Several example of valid programs have been provided below. The finer details of the language can be discovered by creating examples and running them through the reference implementation.


Phases of Sclp Compiler


Sclp compiles a program by constructing a series of intermediate representations (IRs) such that each IR is computed from the earlier IR.
Sclp embodies object oriented design and implementation. The implementation language is C++. Parser and scanner are implemented using lex and yacc (lex/flex manual, yacc/bison manual, lex & yacc book). The overall compilation flow is the classical sequence of various phases as illustrated below.


Schematic of sclp
            compiler


The tasks performed by various modules in the above block diagram are briefly described below.
The main procedure of the sclp compiler is reproduced below verbatim. All passes are registered with the pass manager by storing a pointer to the entry function of the pass.


int main(int argc, char * argv[])
{   
  // Initializing the pass list
    pass_manager.init_pass_list();

  // Process the command line options
    command_options.process_user_command_options(argc, argv);

  // Executing the passes
      pass_manager.execute_passes();
 
    program_object.delete_all();
 
    return 0;
}



Sclp Levels

We define the following levels of the input language accepted by the sclp compiler. Each level includes all the features of the previous level as illustrated below.

Language Levels

The language features included in different levels are as follows. Although some keywords do not appear in some levels, they continue to be reserved and cannot be used as variable or function names.

There is a plan to extend the language further by supporting classes. Besides, dynamic memory allocation  may also be included in future.


The Proposed series of assignment based on incremental construction


The full sclp construction is recommended to be done in a series of six assignments whose deliverables are based on monotonic increments as illustrated below. Different phases have been named with the IR that they produce. The accepted input language is shown in terms of nestings of levels in terms of containment of features. The black boxes show the language features and the compiler phases supported by the deliverable of a particular assignment. They are named A1 to A6 and the names are shown in the top right corner in light blue circles.

Assignment
          series

  • Assignment A5 requires enhancing the implementation of A4 by additionally producing the RTL IR of the input program.
  • Assignment A6 requires enhancing the implementation of A5 by including the features of Level 5 and by additionally producing the assembly code for the input program.
Note that these increments are monotonic and the implementation of Assignment i should continue to support all features of Assignment i-1. These increments differ from the conventional increments in which the language remains constant and all phases are implemented for the full language. The blue arrows in the picture below show the sequence of conventional increments whereas the red arrows show the proposed increments.


Increments
        in assignments


Design of Sclp


The main classes that constitute sclp are:
  • program, procedure, abstract syntax trees (AST), three address code (TAC), and register transfer language (RTL) statements.
  • symbol table (persistent as well as volatile symbol table for scope analysis)
  • register and instruction descriptor

In order to understand the working of a compiler, one needs to understand the grammar and construction of AST. There are other non-trivial questions that need to be answered:

  1. Why multiple TAC statements may be generated for a single AST and why multiple RTL statements may be generated for a single TAC statement.
  2. Why it is useful to also record the place (temporary variable for TAC and register for RTL) that contains the final result of the computation described by a list of TAC or RTL statements.
  3. How the register information has been encoded in the data structures.
  4. How the statement categories (eg. register-to-memory, memory-to-register etc.) have been used for register allocation.
  5. How offsets are assigned to variables.
  6. The activation record structure.
  7. How the machine instructions have been encoded in the data structures.
Points 1, 2, 5, and 6 are generic concepts that can be understood from the first principles. Points 3, 4, and 7 are specific to sclp design. Points 1 and 2 implement the spirit of the syntax directed definition provided in Figure 6.19 (page 379) of the text book (Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman).

While understanding points 3, 4, 6, and 7, please keep the spim instruction set architecture in mind.
You may find this course on MIPS assembly programming to be very useful.


In the rest of this section, the grammar and the top level data structures have been described.
SCLP Grammar

Level 6 of sclp is defined using the following grammar consisting of 100 rules. Each rule has been described using yacc syntax (except for the inclusion of rule numbers). Lower levels restrict the grammar by omitting some rules or by using simpler rules. For example, Level 1 programs contain a single procedure and hence, func_def_list in rule 2 is replaced by func_def in Level 1.



program
: global_decl_statement_list func_def_list (rule 1)
| func_def_list (rule 2)
;

global_decl_statement_list
: global_decl_statement_list func_decl (rule 3)
| global_decl_statement_list var_decl_stmt (rule 4)
| var_decl_stmt (rule 5)
| func_decl (rule 6)
;

func_decl
: func_header '(' formal_param_list ')' ';' (rule 7)
| func_header '('')' ';' (rule 8)
;

func_def_list
: func_def_list func_def (rule 9)
| func_def (rule 10)
;

func_header
: named_type NAME (rule 11)
;

func_def
: func_header '('formal_param_list')' '{'optional_local_var_decl_stmt_list statement_list'}'(rule 12)
| func_header '(' ')' '{' optional_local_var_decl_stmt_list statement_list '}' (rule 13)
;

formal_param_list
: formal_param_list ',' formal_param (rule 14)
| formal_param (rule 15)
;

formal_param
: param_type NAME (rule 16)
;

param_type
: INTEGER (rule 17)
| FLOAT (rule 18)
| BOOL (rule 19)
| STRING (rule 20)
;

statement_list
: statement_list statement (rule 21)
| %empty (rule 22)
;

statement
: assignment_statement (rule 23)
| if_statement (rule 24)
| do_while_statement (rule 25)
| while_statement (rule 26)
| compound_statement (rule 27)
| print_statement (rule 28)
| read_statement (rule 29)
| call_statement (rule 30)
| return_statement (rule 31)
;

call_statement
: func_call ';' (rule 32)
;

func_call
: NAME '(' actual_arg_list ')' (rule 33)
;

actual_arg_list
: non_empty_arg_list (rule 34)
| %empty (rule 35)
;

non_empty_arg_list
: non_empty_arg_list ',' actual_arg (rule 36)
| actual_arg (rule 37)
;

actual_arg
: expression (rule 38)
;

return_statement
: RETURN expression ';' (rule 39)
;

optional_local_var_decl_stmt_list
: %empty (rule 40)
| var_decl_stmt_list (rule 41)
;

var_decl_stmt_list
: var_decl_stmt (rule 42)
| var_decl_stmt_list var_decl_stmt (rule 43)
;

var_decl_stmt
: named_type var_decl_item_list ';' (rule 44)
;

var_decl_item_list
: var_decl_item_list ',' var_decl_item (rule 45)
| var_decl_item (rule 46)
;

var_decl_item
: NAME (rule 47)
| NAME array_decl (rule 48)
| pointer_decl NAME (rule 49)
;
pointer_decl
: '*' (rule 50)
| '*' pointer_decl (rule 51)
;

array_decl
: '[' INTEGER_NUMBER ']' (rule 52)
| '[' INTEGER_NUMBER ']' array_decl (rule 53)
;

named_type
: INTEGER (rule 54)
| FLOAT (rule 55)
| VOID (rule 56)
| STRING (rule 57)
| BOOL (rule 58)
;

assignment_statement
: variable_as_operand ASSIGN expression ';' (rule 59)
| variable_as_operand ASSIGN func_call ';' (rule 60)
| variable_as_operand ASSIGN ADDRESSOF variable_name ';' (rule 61)
;

if_condition
: '(' expression ')' (rule 62)
;

if_statement
: IF if_condition statement ELSE statement (rule 63)
| IF if_condition statement (rule 64)
;

do_while_statement
: DO statement WHILE '(' expression ')' ';' (rule 65)
;

while_statement
: WHILE '(' expression ')' statement (rule 66)
;

compound_statement
: '{' statement_list '}' (rule 67)
;

print_statement
: WRITE expression ';' (rule 68)

;

read_statement
: READ variable_name ';' (rule 69)

;

expression
: expression '+' expression (rule 70)
| expression '-' expression (rule 71)
| expression '*' expression (rule 72)
| expression '/' expression (rule 73)
| '-' expression (rule 74)
| '(' expression ')' (rule 75)
| expression '?' expression ':' expression (rule 76)
| expression AND expression (rule 77)
| expression OR expression (rule 78)
| NOT expression (rule 79)
| rel_expression (rule 80)
| variable_as_operand (rule 81)
| constant_as_operand (rule 82)
;

rel_expression
: expression LT expression (rule 83)
| expression LE expression (rule 84)
| expression GT expression (rule 85)
| expression GE expression (rule 86)
| expression NE expression (rule 87)
| expression EQ expression (rule 88)
;

variable_as_operand
: variable_name (rule 89)
| array_access (rule 90)
| pointer_access (rule 91)
;

variable_name
: NAME (rule 92)
;

array_access
: variable_name array_dimensions (rule 93)
;

pointer_access
: '*' variable_name (rule 94)
| '*' pointer_access (rule 95)
;

array_dimensions
: '[' expression ']' (rule 96)
| array_dimensions '[' expression ']' (rule 97)
;

constant_as_operand
: INTEGER_NUMBER (rule 98)
| DOUBLE_NUMBER (rule 99)
| STRING_CONSTANT (rule 100)
;



SCLP Data Structures

The main data structures of sclp and the relationship between them is shown below. Note that the arrow in the picture below does not who inheritance but containment, e.g., program contains a global symbol table, multiple procedures, and each procedure contains local symbol table, a formal parameter list, an AST, a list of TAC, RTL, and ASM statements.






The main class hierarchies used for the intermediate representations in sclp are as shown below. The most interesting aspect of this translation is that
  • a single  AST may be translated into multiple TAC statements (e.g., and expression AST),
  • a single TAC statement may give be translated into multiple RTL statements (e.g., floating point comparison statements), and
  • a single RTL statement may be translated into multiple ASM statements (e.g., push, pop statements).

Note that in the pictures below, an arrow indicates inheritance. These pictures have been obtained from a documentation of the source code created using doxygen.

  • Abstract Syntax Tree. This class hierarchy has the following classes. There are two broad categories of classes Expression_Ast and Statement_Ast. Expression_Ast further derive classes for Unary_Expr_Ast, Binary_Expr_Ast, Ternary_Expr_Ast, and Base_Expr_Ast (with are expressions with arity 0).
AST class hierarchy

  • TAC. This IR has two abstract classes for statements (TAC_Stmt) and their operands (TAC_Opd) from which all other classes derive.
TAC Statement Class
        HierarchyTAC Opd Classes
  • RTL. Similar to TAC, here also we have classes for statements (RTL_Stmt) and their operands (RTL_Opd). Note that there are four different kinds of control flow statements.
RTL ClassesRTL Opd classes
  • ASM. Similar to TAC and RTL, here also we have classes for statements (ASM_Stmt) and their operands (ASM_Opd). Note that there are four different kinds of control flow statements.

ASM ClassesASM Opd Classes