LANGUAGE PROCESSORS LABORATORY

Last updated on 3rd March 2003.

Objective: The students will undertake a sequence of experiments aimed at the design and implementation of the various phases of a compiler for the Pascal like programming language given below. This subset includes a sufficiently rich collection of data types and control structures.

Here is the grammar of the source language.

Experiment No. 1

Date of evaluation: 30th January 2003

Title: Design and implementation of a scanner

Study the grammar of the source language, and identify the lexical components of the language specifications.
Study the input format for Flex and clearly understand how tokens are described.
Identify the tokens to be returned by the scanner. Use the following as guidelines.

First define a class similar to Yytoken. Let us call this Yytoken. Let this class have the following fields

-value_or_key

-tokentype

When an identifier is found, the lexical analyzer uses the lexeme corresponding to the identifier as a key to hash into a symbol table. It enters the value of the identifier (which is the same as the key) in the hash table and returns the following instance of Yytoken

-"xyz" /* Will be used to hash into the symbol table */

- IDENTIFIER /* a value in an enumerated type*

For an integer constant 123, the Yytoken instance will be

- "123" -INTCONST

The numeric value 123 will be entered into the constant table.

For a keyword, say while, the Yytoken instance will be

-not relevant

-WHILE

For an operator such as >=, the Yytoken instance will be

-">="

-RELOP

For the operator :=, the Yytoken instance will be

-not relevant

-ASSIGNOP

For the delimiter {, the Yytoken instance will be

-not relevant

-LBRACE

Steps in red will eventually be done by the parser.

Develop Flex script for the tokens identified above. Be careful with the definition of constants, the string literal and comments.
Design the symbol table and the constants table. Use hash organization for both.
Design routines to produce the scanner output in the desired format as described in `Guidelines on how to present the results'.
Generate the scanner.

Guidelines on how to present the results

When the token is identified, print its details. Thus your scanner should produce the list of tokens for the source program input to it.
At the end of the program, neatly print the symbol and constants tables.
Examples should include some programs containing lexical errors. The error should be properly identified and indicated.

Answers to queries

(Romil)

1. In the grammer specified two rules are there:

type definition ::= identifier = type
relational operator:= = | <> | < | > | <= | >=

On encountering '=' what do we return its tokentype as? RELOP or something else? How do we decide?

Return the keyword as relop. In the parser, verify that the relop has the specific value '=' when used in the context of type definition.

2. factor ::= variable | constant | ( expression ) | function call
constant ::= integer constant | real constant

Since string constant is not there in definition, I assume that presence of string in the code will be an error. Is my assumption correct, or the string constant was not mentioned by mistake.

I have added string constants to the grammar. Later on I shall add a couple of predefined functions with which you can do meaningful things.

(Nikhil Sethi)

1. To implement boolean we should have two keywords true and false.

I have added these to the grammar.

2. Should we treat operators "and" and "or" as boolean operator (in the grammar they are named as addition and multiplication ops respectively).

They are additive and multiplicative operators. This means that they are analogous to '+' and '*'. However they are boolean operators.

3. Enumerated types are not defined in the grammar.

You seem unhappy! Add them if you want. However they are not a must.

4. There is no unary minus in the grammar. In the lab we discussed this with you and decided it was not feasible to detect negative numbers. So we return -6+7 as four tokens "-","6","+","7" and not as "-6", "+", "7". To implement this there should be a unary minus in the grammar.

This is an interesting point. You have seen a Fortran DO whose detection requires a trailing context. A token is a DO if it is follwed some time later by a comma. Negative numbers seem to reqire a leading context. - 43 is a negative number if it is not preceded by an identifier or a number. However LEX does not have provisions of defining a token with a leading context.

So dont detect negative numbers. Make a number negative with a unary minus which I have added to the grammar.

5. Are we supposed to implement strings/characters in the grammar.

Strings, yes. Characters, no.

Experiment No. 2

Date of evaluation: 13th February 2003

Title: Design and implementation of a skeletal parser

The objective of this assignment is familiarization with parser generators and development of a LALR parser for the source language. For this assignement you will use the parser generator YACC (alternatively Bison) and the grammar specification given above.

Study the format of a typical YACC script.
Enter the grammar in the YACC format, integrate with the scanner generated in the previous stage, and generate the parser.
Test the generated parser on sample programs.
Augment the grammar by adding production for the following features. Add enough production to ensure that the feature can be meaningfully used. These are being added just for this stage to test your grammar writing ability. They need not be implemented in the final compiler.

Pascal like for and case statements.
Pascal like enumerated types viz. color = (red, green, blue, yellow)

Generate the parser once again for the augmented grammar.

Some changes are required in the structure of your compiler when you interface your scanner with the parser. The symbol table routines should now become a part of the parser. The parser makes a call to the scanner whenever it needs to look at the next token. After returning from the scanner, the parser must update the appropriate table. For example an identifier should be entered in the symbol table, a constant in the constants' table, etc.

Guidelines on how to present the results

Write a pretty printer using the following suggestion. Roughly, what you have to do is;

0. maintain two global variables insertspaces::boolean and depth::integer.

1. depth counts the number of spaces (in multiples of 4) to be inserted at the beginning of every new line.

2. insertspaces is set as soon as you print a newline (\n).

3. every time you detect a terminal, check insertspaces. If it is true, change depth if required, print depth spaces, print the terminal, and reset insertspaces. 'every time', above, can possibly be optimised to 'at certain times'.

Check whether this works with very few or no conflicts. And if it does, figure why.
In case of an error, abort after reporting the line number where the error has occured. You need not recover from errors.

Experiment No. 3

Date of evaluation: 20th March 2003

Title: Declaration processing, scope and type analysis.

The objective of this assignment is to perform semantic analysis such as type and scope analysis and declaration processing, and integrate such analyses with the parser.

Notes on how to conduct the experiments:

Study the way attributes are assigned to a nonterminal in YACC
Go back to the original grammar, i.e. the one obtained by following the link shown above. For declaration processing, do the following:

Decide on the attribute information (i.e. type, dimension, length etc) for an identifier. Accordingly design the symbol table entry formats/conventions for storing information regarding variable identifiers, type identifiers and function identifiers.
Write semantic rules for processing of all declarations of data in a single block of the source program, and recording relevant information in the symbol table.
Perform storage allocation for data symbols (this will be an offset value in the activation record).

For type analysis, write semantic rules to do the following:

For a use of a symbol, check for the declaration that the symbol is bound to using Pascal like scope rules. If the symbol is not bound to any declaration, declare an error. The symbol table should be rich enough to implement the scope rules.
The symbol table entry for the declaration will provide all the attributes of the symbol.
Use this information to check that the use of a symbol is type compatible with its context.
The notion of type compatibility to be used is structural equivalence. The notion of structural equivalence is described below. Following complete substitution of type names by their definitions, we have

Any basic type is equivalent to itself.
Two array types array [l1..u1, l2..u2,...,ln..un] of T1 and array [l1'..u1', l2'..u2',...,ln'..un'] of T2, if and only if, for all i, ui - li = ui' - li', and T1 and T2 are equivalent.
The type array [l1..u1, l2..u2,...,ln..un] of T1 is equivalent to array [l1..u1] of array [l2..u2] of ,..., array [ln..un] of T1. Assuming a to be the name array name, in both cases array elements are accessed as a[e1,e2,...,en].

Perform type conversion if necessary. Follow Pascal convention.

Whole array assignments are allowed. a1:= a2 requires a1 and a2 to be type equivalent. The semantics is obvious.
For a type definition type t = type_expression all type names in type_expression should have been defined before the definition of t.
Detect and report semantic errors. You should detect as many semantic errors as possible in a single compilation of the program.

Guidelines on how to present the results

We shall give you a suit of programs to test your compiler against.

For erroneous programs, your compiler should report semantic errors as accurately as possible.
For correct programs, the compiler should print the symbol table.

Experiment No. 4

Date of evaluation: 10th April 2003

Title: Code Generation.

For the original grammar, Generate code for a restricted subset of C the original grammar. You could base your code generation strategy on the naive code generation strategy given in ASU, or the simple code generator outlined in the slides. There is no dearth of material on x86 assembly language. The specific assembly language that you have to generate code for is the GNU assembly language (GAS). Here is a short reference on GAS. You can get further information by doing 'info as' on Linux systems. Here is another reference on the SunOS assembly language.

Guidelines on how to present the results:

You have been given a test suite of programs. You will test your compiler by generating assembly code which will be assembled, linked and executed to produce output.
The evaluation will be done on the basis of another set of source programs, which will be given to you on the day of evaluation.