LANGUAGE PROCESSORS LABORATORY
Last updated on 3rd March 2003.
Objective: The students will undertake a sequence of
experiments aimed at the design and implementation of the various phases
of a compiler for the Pascal like programming language given below. This
subset includes a sufficiently rich collection of data types and control
structures.
Here is the grammar
of the source language.
Experiment No. 1
Date of evaluation: 30th January
2003
Title: Design and implementation
of a scanner
-
Study the grammar of the source language, and identify
the lexical components of the language specifications.
-
Study the input format for Flex and clearly understand
how tokens are described.
-
Identify the tokens to be returned by the scanner. Use
the following as guidelines.
-
First define a class similar to Yytoken. Let us call this
Yytoken. Let this class have the following fields
-
When an identifier is found, the
lexical analyzer uses the lexeme corresponding to the identifier as a key
to hash into a symbol table. It enters the value of the identifier (which
is the same as the key) in the hash table and returns the following
instance of Yytoken
-"xyz" /*
Will be used to hash into the symbol table */
- IDENTIFIER
/* a value in an enumerated type*
-
For an integer constant 123, the Yytoken instance will
be
The numeric value 123 will be entered
into the constant table.
-
For a keyword, say while, the Yytoken instance will be
-
For an operator such as >=, the Yytoken instance will
be
-
For the operator :=, the Yytoken instance will be
-
For the delimiter {, the Yytoken instance will be
Steps in red will eventually be done by the parser.
-
Develop Flex script for the tokens identified above. Be
careful with the definition of constants, the string literal and comments.
-
Design the symbol table and the constants table. Use hash
organization for both.
-
Design routines to produce the scanner output in the desired
format as described in `Guidelines on how to present the results'.
-
Generate the scanner.
Guidelines on how to present the
results
-
When the token is identified, print its details. Thus
your scanner should produce the list of tokens for the source program input
to it.
-
At the end of the program, neatly print the symbol and
constants tables.
-
Examples should include some programs containing lexical
errors. The error should be properly identified and indicated.
Answers to queries
(Romil)
1. In the grammer specified two rules are there:
type definition ::= identifier = type
relational operator:= = | <> | < | > | <=
| >=
On encountering '=' what do we return its tokentype
as? RELOP or something else? How do we decide?
Return the keyword as relop. In the parser, verify
that the relop has the specific value '=' when used in the context of type
definition.
2. factor ::= variable | constant | ( expression
) | function call
constant ::= integer constant
| real constant
Since string constant is not there in definition,
I assume that presence of string in the code will be an error. Is my assumption
correct, or the string constant was not mentioned by mistake.
I have added string constants to the grammar. Later
on I shall add a couple of predefined functions with which you can do meaningful
things.
(Nikhil Sethi)
1. To implement boolean we should have two keywords
true and false.
I have added these to the grammar.
2. Should we treat operators "and" and "or" as boolean
operator (in the grammar they are named as addition and multiplication
ops respectively).
They are additive and multiplicative
operators. This means that they are analogous to '+' and '*'. However they
are
boolean operators.
3. Enumerated types are not defined in the grammar.
You seem unhappy! Add them if you want. However
they are not a must.
4. There is no unary minus in the grammar. In the
lab we discussed this with you and decided it was not feasible to detect
negative numbers. So we return -6+7 as four tokens "-","6","+","7"
and not as "-6", "+", "7". To implement this there should be a unary
minus in the grammar.
This is an interesting point. You have seen a Fortran
DO whose detection requires a trailing context. A token is a DO if it is
follwed some time later by a comma. Negative numbers seem to reqire a leading
context. - 43 is a negative number if it is not preceded by an identifier
or a number. However LEX does not have provisions of defining a token with
a leading context.
So dont detect negative numbers. Make a number negative
with a unary minus which I have added to the grammar.
5. Are we supposed to implement strings/characters
in the grammar.
Strings, yes. Characters, no.
Experiment No. 2
Date of evaluation: 13th February
2003
Title: Design and implementation
of a skeletal parser
The objective of this assignment is familiarization
with parser generators and development of a LALR parser for the source
language. For this assignement you will use the parser generator
YACC (alternatively Bison) and the grammar specification given above.
-
Study the format of a typical YACC script.
-
Enter the grammar in the YACC format, integrate with the
scanner generated in the previous stage, and generate the parser.
-
Test the generated parser on sample programs.
-
Augment the grammar by adding production for the following
features. Add enough production to ensure that the feature can be meaningfully
used. These are being added just for this stage to test your grammar writing
ability. They need not be implemented in the final compiler.
-
Pascal like for
and case
statements.
-
Pascal like enumerated
types viz. color
= (red, green, blue, yellow)
-
Generate the parser once again for the augmented grammar.
Some changes are required in the structure of your compiler
when you interface your scanner with the parser. The symbol table routines
should now become a part of the parser. The parser makes a call to the
scanner whenever it needs to look at the next token. After returning from
the scanner, the parser must update the appropriate table. For example
an identifier should be entered in the symbol table, a constant in the
constants' table, etc.
Guidelines on how to present
the results
-
Write a pretty printer using the following suggestion.
Roughly, what you have to do is;
0. maintain two global variables insertspaces::boolean
and depth::integer.
1.
depth counts the number of spaces (in multiples of 4) to be inserted
at the beginning of every new line.
2.
insertspaces is set as soon as you print a newline (\n).
3. every time you detect a terminal, check
insertspaces.
If it is true, change depth if required, print
depth spaces, print the terminal, and reset
insertspaces. 'every time', above,
can possibly be optimised to 'at certain times'.
-
Check whether this works with very few or no conflicts.
And if it does, figure why.
-
In case of an error, abort after reporting the line number
where the error has occured. You need not recover from errors.
Experiment No. 3
Date of evaluation: 20th March
2003
Title: Declaration processing,
scope and type analysis.
The objective of this assignment is to perform semantic
analysis such as type and scope analysis and declaration processing, and
integrate such analyses with the parser.
Notes on how to conduct the experiments:
-
Study the way attributes are assigned to a nonterminal
in YACC
-
Go back to the original grammar,
i.e. the one obtained by following the link shown above. For declaration
processing, do the following:
-
Decide on the attribute information (i.e. type, dimension,
length etc) for an identifier. Accordingly design the symbol table entry
formats/conventions for storing information regarding variable identifiers,
type identifiers and function identifiers.
-
Write semantic rules for processing of all declarations
of data in a single block of the source program, and recording relevant
information in the symbol table.
-
Perform storage allocation for data symbols (this will
be an offset value in the activation record).
-
For type analysis, write semantic rules to do the following:
-
For a use of a symbol, check for the declaration that
the symbol is bound to using Pascal like scope rules. If the symbol is
not bound to any declaration, declare an error. The symbol table should
be rich enough to implement the scope rules.
-
The symbol table entry for the declaration will provide
all the attributes of the symbol.
-
Use this information to check that the use of a symbol
is type compatible with its context.
-
The notion of type compatibility to be used is structural
equivalence. The notion of structural equivalence is described below.
Following complete substitution of type names by their definitions, we
have
-
Any basic type is equivalent to itself.
-
Two array types array [l1..u1, l2..u2,...,ln..un]
of T1 and array [l1'..u1', l2'..u2',...,ln'..un']
of T2, if and only if, for
all i,
ui - li = ui' - li', and T1 and T2 are equivalent.
-
The type array [l1..u1, l2..u2,...,ln..un]
of T1 is equivalent to array [l1..u1]
of array [l2..u2] of ,..., array [ln..un] of T1. Assuming a
to be the name array name, in both cases array elements are accessed
as a[e1,e2,...,en].
-
Perform type conversion if necessary. Follow Pascal convention.
-
Whole array assignments are allowed. a1:=
a2 requires a1 and a2
to be type equivalent. The semantics is obvious.
-
For a type definition type t = type_expression
all type names in type_expression
should have been defined before the definition of t.
-
Detect and report semantic errors. You should detect as
many semantic errors as possible in a single compilation of the program.
Guidelines on how to present the
results
-
We shall give you a suit of programs to test your compiler
against.
-
For erroneous programs, your compiler should report semantic
errors as accurately as possible.
-
For correct programs, the compiler should print the symbol
table.
Experiment No. 4
Date of evaluation: 10th April
2003
Title: Code Generation.
For the original grammar, Generate
code for a restricted subset of C the original grammar. You could base
your code generation strategy on the naive code generation strategy given
in ASU, or the simple code generator outlined in the slides. There is no
dearth of material on x86 assembly language. The specific assembly language
that you have to generate code for is the GNU assembly language (GAS).
Here
is a short reference on GAS. You can get further information by doing 'info
as' on Linux systems. Here
is another reference on the SunOS assembly language.
Guidelines on how to present
the results:
-
You have been given a test suite
of programs. You will test your compiler by generating assembly code which
will be assembled, linked and executed to produce output.
-
The evaluation will be done
on the basis of another set of source programs, which
will be given to you on the day of evaluation.