jb -- Parser and Lexer Generation for Java
Overview and User Manual
Last Updated: 9 October 1997
Latest Version: jb-5.4
[Description
| Installation
| Source Tree
| Parser Overview
]
[Lexer Overview
| Examples
| JDK 1.0.2 Compatibility
]
For a description of the functionality of jb, as well as
information about obtaining jb,
see jb.html.
The jb installation process essentially consists of untarring
the jb tar file (after any needed decompression) into a directory
of your choice.
The jb code itself, as opposed to Bison and flex, is entirely
in Java, so there should be no reason to recompile to produce
new
.class
files to replace the ones that come with
standard installation.
If, for whatever reason, you decide to rebuild the
.class
files, you may do so as follows.
- Go into the regexp directory.
- Re-compile the java files by executing the following command:
make clean all
- Go into the jbf directory.
- Re-compile the java files by executing the following command:
make clean all
Note that compiling subst.java may produce warnings about
deprecated features. You can ignore them.
Once the files are installed, and possibly re-compiled,
you may go into the examples directory
and carry out the processes as described
in the
Parser Construction
and in the
Lexer Construction
sections.
This source tree contains a number of files and subdirectories.
- COPYRIGHT -- This files describes the policies governing the
use and distribution of this software.
- jb.html -- This file gives description and retrieval
information about this system. A text version is in jb.txt.
- README -- This file (html format) gives an overview of the source
tree and describes the build and installation process.
- INVENTORY -- Lists the files comprising this source tree.
- jbf -- The java classes needed to support the parsers and lexers
generated by jb.
- examples -- a set of example grammars (and lexers)
for several examples.
- calc -- a simple calculator with two lexers:
one using flex and one derived from YYlex.java.
- idl -- a CORBA IDL parser with a flex lexer that uses
the non-terminal package to build a parse tree.
- java -- a parser and lexer for java.
- regexp -- The starwave regular expression system; compliments
of Jonathan Payne (jpayne@starwave.com).
- examples/jexec, jbf/jexec -- a wrapper script for java and javac
so that the CLASSPATH environment variable can be set before execution.
The following discussion assumes substantial familiarity with
parsing using bison or yacc and with lexing using lex or flex.
Parser Structure
A jb parser consists of the following parts.
- YYparse.java
- The file YYparse.java is generated from
jbf/yyparse.template.
It is the primary runtime parse engine, and
it contains one class: YYparse.
The YYparse class is the primary parse class, and
it must be instantiated to create a parser instance.
Note that the class name (``YYparse'') is the default.
The substitution flags note below indicates how to provide an alternate
class name.
The YYparse class methods are as follows.
- YYparse --
the constructor takes an instance of a lexical scanner
(of class YYlex) as an argument.
- setdebug(boolean) -- turns verbose debug output on or off.
- setdebug(level) -- turns verbose debug to a specified level.
- yyparse -- executes the parse engine.
- yyerror -- invoked when a parse error occurs.
- yyerror_verbose -- can be invoked from within yyerror
to generate a more verbose syntax error message
(see the calculator example).
- YYtokentypes.java
- The file YYtokentypes.java is generated from
jbf/yytokentypes.template.
It contains one abstract class: YYtokentypes and several
constructed definitions.
- The token-type integer constants shared by the parser
and the lexical scanner.
- A table of names corresponding to the integer token types;
this table is normally accessed via the procedure
YYtokentypes.tokenname().
- The non-terminal type integer constants.
- A table of names corresponding to the integer non-terminal types;
this table is normally accessed via the procedure
YYtokentypes.nontermname().
- YYtoken.java
- The file jbf/YYtoken.java
(was Token.java)
defines an object to contain the lexeme value extracted by YYlex.
The YYtoken class contains an integer value
typically taken from the token values in class YYtokentype.
Additionally, it contains a StringBuffer of the text of the token
and a YYlocation instance. This latter contains
the line number and character number of the start of that
token string.
- YYlex.java
- The file YYlex.java is the lexical scanner used by the parser,
and defines the class YYlex.
The parser is passed an instance of this class at construction time.
This lexer can be constructed in two ways as described below in section
Lexer Overview.
As a rule, your lexer will import
the YYtokentypes class to get access
to the tokentype values. But it could also be passed
through jb to insert the Tokentype constants directly
(see the idl example below).
The YYlex class methods are as follows.
- YYlex -- the constructor takes an InputStream as an argument.
- setdebug(boolean) -- turns verbose debug output on or off.
- setdebug(level) -- turns verbose debug to a specified level.
- yylex -- repeatedly called to return an instance of YYtoken.
- yylexerror -- invoked when a parse error occurs.
- yyreport -- dump the contents of a YYtoken instance.
- jbf/{ParseException.java, YYnode_Stack.java, etc.}
- These files contain various support classes used by the parser.
Because they are not templates, these files, as well as the Token
(sub)classes can be shared by all parsers and lexers via
the package named jbf.
We will use the compilation of the calculator example (see
examples/calc.y) to show how a parser is ultimately compiled to a
.class file.
The sequence for building the calculator can be seen by looking
in the file examples/Makefile.
It essentially consists of the following sequence of actions.
- Run bison on the parser specification
``calc.y'' using the following command, for example,
to produce file calc.tab.c.
bison calc.y
- Run the jb java program.
jexec java jbf.jb -yyerror yyverror calc.tab.c \
../jbf/yyparse.template YYparse.java \
../jbf/yytokentypes.template YYtokentypes.java \
calc.lex.java YYlex.java
Note that the lexer file (calc.lex.java) will need to have
been constructed using the one of the lexer generation methods
described in the Lexer Overview.
Also, the substitution flag
causes the calls to yyerror() to be replaced
with a call to yyverror(), which is defined in the suffix
section of calc.y.
- Compile the parser java files with the following command.
javac -g Main.java YYparse.java YYlex.java YYtokentypes.java
- execute the parser with the following command.
java Main < calc.cmds
or
java Main -debug < calc.cmds
or, for really voluminous output,
java Main -debug -l 100 < calc.cmds
The jb Command:
The jb command is a java program for extracting information
from a bison produced parser file and inserting it into
other files at appropriate places and in appropriate form.
Any argument beginning with "--" is assumed to be an option to jb.
The current list of options is as follows.
- --flags -- dump a list of the flag substitutions
used in the processing of the files.
Additionally, jb will accept flag arguments of the following form.
-flagname flagvalue
As it scans its input files, it looks for instances of
@flagname@. If it encounters an instance of that flag, then
it will substitute the corresponding flagvalue string.
If @flagname@ is encountered in the template, and no flag argument
is specified, then the flag defaults to an internal value, which
is usually the same as the flagname
(e.g., @YYparse@ defaults to ``YYparse.'').
The list of flags can be seen by using the --flags argument.
A partial list would include the following flags, where the default
is in parenthesis if it is different from the flag name.
- YYparse -- this is the class name for the parser class.
By specifying this flag, one can change the parser class
name, and so one can have multiple parsers in the same
java program.
- yyparse -- this is the name for parser start procedure.
- yyerror -- this is the name for the error catching procedure.
- YYlex -- this is the name for the lexer class.
- yylex -- this is the procedure in YYlex to call to obtain
the next token.
- YYnode -- this is the type of the contents
of the parse operands stack.
- tokentype -- this is the field in YYnode that is used to access
the node type value (token or nonterminal).
- YYtoken -- this is the type of the result returned by YYlex.
- ntprefix ("nt_") -- this is the string that is prepended to
nonterminal names in the YYtokentypes class; it is used
to prevent name and keyword conflicts.
- ntsuffix ("") -- this is postpended to nonterminal names.
- package ("") -- this is the name of the package
in which to place the parser class. If not specified,
then no package declaration is inserted. Note that
this option will insert a complete package declaration
into the text. That is, the option -package x.y
will insert the string
``package x.y ;'' into the generated parser.
- constructor ("") -- this is a command to insert into the
YYparse constructor procedure to invoke user construction actions.
After options and flags have been extracted, jb interprets its first
argument as the input file. Presumably this file was
produced as the output from bison. The rest of the arguments to
jb are paired together. The first file of each pair is the
template file and the second is the name of the file in which to
place the result. Each template is read a line at a time and each line is
written to the output file. Simultaneously, it is scanned for
selected markers (and flags), and when one is encountered, the corresponding
information from the input file (suitably modified for java). is
inserted into the output file.
The current list of markers is as follows.
- @PREFIX@ -- this is the text from the beginning of the
parser specification and enclosed in %{ ... %}.
Warning: one must take care when specifying prefix code
(wrapped with %{ and %}). This code must not
contain any lines that have a sharp (#) in the first column.
- @SUFFIX@ -- this is the text from the end of the
parser specification after the last %%.
- @ACTIONS@ -- this is actions for the parse rules encoded
as a switch statement.
- @TOKENTYPES@ and @TOKENNAMES@ --
this is the list of token types and names, respectively,
taken from %token specifications.
- @NONTERMTYPES@ and @NONTERMNAMES@ --
this is the list of non-terminal types and names, respectively.
- @TABLES@ -- this is the parse tables and associated constants
used by the parse engine.
- @CONSTANTS@ -- this is the parse constants
used by the parse engine.
- @PROCS@ -- this is some special parse procs
used by the parse engine.
Parser Details
- The parser semantic actions are written in Java, not in C or C++.
- As you should know, the element of the parse rule are references using
special strings of the form $1, $2, etc. The resulting parse
value is returned from the semantic by assigning to $$.
These are actually translated into references into the parse value
stack yyvs; reference $i is translated to ``dollar(i)'', and $$
is translated to the ``yyval'' internal variable.
- The lexer procedure YYlex.yylex() is assumed to return a java
object of type Integer (i.e., a wrapped int)
representing the token type of a lexeme.
This is stored into the variable yylval.
Additionally, it is assumed that the
location information
for the just matched lexeme can be obtained from the variable
yytext, which is a StringBuffer.
Additionally, it is assumed that the
location information for the just-matched lexeme can be obtained
from the variable
yyloc,
of type YYlocation.
The location specifies the position of the first character
of the lexeme using three values: charno0 is the absolute character
location, lineno is the line containing the lexeme, and charno
is the character offset within the line.
- WARNING: yylval, yytext, and yyloc
are volatile, which means their contents may be modified
by every call to yylex().
So if the parse semantic action need any of the values from those
variables, the action must explicitly copy them out of these variables.
- Bison provides a way to specify the types of tokens and nonterminals
using the construct %union{...}.
Jb exploits this construct to provide Java-based typeing for
references to $$ and $1, etc.
The normal assumption is that the dollar references are of type
Object because the parse value stack is a stack of Objects.
But other things can be put onto the stack by assigning to $$
within the semantic action. The problem is that then
references to, for example, $1 must do an explicit cast to, for example
((YYterminal)$1), to get the correct kind of object off the stack.
One can get jb to do some automatic casting by
specifying the list of expected parse stack types within the %union.
For example (taken from idl.y), one might specify the following.
%union {YYnonterminal ; YYterminal ; YYtoken ; default Object}
This indicates that the expected types are YYnonterminal, YYterminal,
and YYtoken. Additionally, it says that the default type is Object.
This last clause is redundant because that type is the normal default.
The normal bison type specification mechanisms can be used to indicate
various token and nonterminal types; for example, again from idl.y.
%token <Integer>;INTEGER_LITERAL LONG_LITERAL
%type <YYterminal> param_attribute sharp_declaratives
If you examine the final generated code,
you will see that appropriate casts have
been inserted.
Beware that the mechanism is a little fragile. The bison
output for $1, for example, typically looks something like
yyvsp[0].YYnonterminal. That is the type is inserted
as a C union field reference. Jb converts this to
((YYnonterminal)dollar(0)). Thus, there is potential confusion
between YYnonterminal as type and YYnonterminal as field selector.
Jb makes the distinction by looking at the list from the %union specification.
One final warning. Bison, of course, does not know
about subtype relationships, so it may generate warnings
about type mismatches. If these come from rules
with no attached semantic actions, then they can be suppressed
by putting in an explicit action of the form {$$ = $1;}
since Bison assumes that semantic actions will transform
all type conflicts.
- In order to stop parsing, a procedure names ``yyreturn'' is provided.
If called with the argument YYACCEPT or YYABORT, then the parser returns
with that value.
The following discussion assumes substantial familiarity with
parsing using bison or yacc and with lexing using lex or flex.
Two distinct methods for lexer construction are provided.
First, there is the Gnu flex system, and second there is a generic,
hand-coded lexer.
Flex and jf
There is a program, jf, that corresponds to jb, but for
processing the output from flex.
As with jb, the ``--flags'' argument
can be used to see the list of substitutable flags.
A partial list would include the following flags, where the default
is in parenthesis if it is different from the flag name.
- YYlex -- this is the class name for the lexer class.
By specifying this flag, one can change the lexer class
name, and so one can have multiple lexers in the same
java program.
- yylex -- this is the procedure in YYlex to call to obtain
the next token.
- yylexerror -- this is the name for the error catching procedure.
- package ("") -- this is the name of the package
in which to place the lexer class. If not specified,
then no package declaration is inserted. Note that
this option will insert a complete package declaration
into the text. That is, the option -package x.y
will insert the string
``package x.y ;'' into the generated parser.
- constructor ("") -- this is a command to insert into the
YYlex constructor procedure to invoke user construction actions.
Additionally, jf supports three flags that indicate optimization levels
for lexers. These flags obviously correspond to the same flags (with
one less dash) for the flex command.
- --Cem -- This produces small and slow lexers.
- --Cfe -- This produces medium size and medium speed lexers.
- --Cfa -- This produces large and fast lexers. Note that
experience to date indicates that these lexers are too large
to be processed by Java and will crash the compiler and/or the virtual
machine.
Note that no other combinations of flex flags is supported.
Each of these flags requires a corresponding lexer template.
There are three: jbf/yylex.Cem.template, jbf/yylex.Cfe.template,
and jbf/yylex.Cfa.template.
If you get the wrong combination of flex flags, jf flags, and
template, then jf will complain.
Using the lexer specification
examples/calc.lex, flex and jf would be used as follows.
- Execute the following command
to produce calc.yy.c.
flex -ocalc.yy.c calc.lex
- Execute the following command
to produce calc.lex.java
jexec java jbf.jf --Cem calc.yy.c $JBFPATH/jbf/yylex.Cem.template calc.lex.java
If the output file, calc.lex.java does contain parser flags or markers
(such at @TOKENTYPES@), then it need to be further passed through jb.
Flex-Generated Lexer Details
In order to use flex to generate lexers, you will have to obey
some conventions.
- The lexical action bodies must be written in Java, and not C or C++.
- The lexical actions must not use "return" to return a token
value. This is because the code will then cause
"statement not reached" errors. Instead, as the last
thing in the semantic action, you should pass the token-type
value to a procedure called "yyreturn."
- As a rule, you should return the value using the token constants
(e.g., BOOLEAN_LITERAL) and not using the corresponding integer
using, for example, new Integer(5).
The reason is that the constants are shared and will generate no garbage.
- Do not attempt to have flex suppress line numbers (i.e., "#line"
directives) because they are used by jf to figure out the location
of things in the flex output.
- The flex parser has been hacked to work with java.
As a result, you should not expect any of the runtime flex
capabilities (yymore, etc.) to work with this parser.
However, an attempt has been made to support the yyless procedure.
- The lexeme string is kept in the local variable yytext
of type YYlexbuffer.
The lexical action code associated with the lex rule can modify this text,
if it desires, by calling the contents()
method.
An example might be to strip quotes or to
convert escape sequences for strings.
As an aside, the class YYtoken has some routines for these exact purposes.
-
The parser shares the text part of the YYtoken value, yylval
with the lexer via yyl.yylval.text().
This allows parser action routines to directly
access the string of the token via the variable yytext.
As with the jb,
and after options and flags have been extracted, jf interprets its first
argument as the input file. Presumably this file was
produced as the output from bison. The rest of the arguments to
jf are paired together. The first file of each pair is the
template file and the second is the name of the file in which to
place the result. Each template is read a line at a time and each line is
written to the output file. Simultaneously, it is scanned for
selected markers (and flags), and when one is encountered, the corresponding
information from the input file (suitably modified for java) is
inserted into the output file.
The current list of markers is as follows.
- @FLEXFLAGS --
- @PREFIX@ -- this is the text from the beginning of the
lexer specification (the so-called definitions section)
that is enclosed in %{ ... %}.
Warning: one must take care when specifying prefix code
(wrapped with %{ and %}). This code must not
contain any lines that have a sharp (#) in the first column.
- @SUFFIX@ -- this is the text from the end of the
lexer specification after the last %%.
- @ACTIONS@ -- this is actions for the lexer rules encoded
as a switch statement.
- @TABLES@ -- this is the lexeme tables and associated constants
used by the lexical engine.
- @CONSTANTS@ -- this is the lexical constants
used by the engine.
- @PROCS@ -- this is some special lex procs
used by the engine.
- @STATES@ -- this contains a set of constants defining the
lexical engine states (indicated by <...> in the rules).
- @YYLEX@ -- this is the text from the beginning of the
lexer rules section (after the first occurrence of %%)
that is enclosed in %{ ... %}.
The Generic Lexer
The second method for producing a lexer is to take the file
jbf/yylex.generic, which is written in java,
and modify it to recognize the tokens
as defined for the language you are parsing. Basically,
yylex.generic implements a classical, hand-coded lexer
recognizing comments, strings, numbers, identifiers, keywords,
and miscellaneous delimiters.
The reason one might want to use this rather than the flex
lexer is that the generic is smaller and faster than any of the flex lexers.
But on the other hand, the generic lexer is harder to maintain.
In order to use the generic lexer, you should copy it to a local file
and modify as needed. If you look at the end of it, you will see
a routine called "definekeywords", and which is set up to be
processed in the examples Makefile to generate either a lexer for
the calculator or for java. I do not recommend that you follow
this practice; it is only to simplify the examples construction.
A variety of examples have been provided to show how the system is used.
[Calculator
| CORBA IDL
| Java
]
The simplest example is the usual calculator example, where
the calculator files are as follows.
- Main.java --
The file Main.java instantiates and invokes the parser.
The main program is class Main, static method main. It
instantiates the parser and lexer and calls the parser yyparse
method to parse the input stream. This file is actually
shared by all the examples.
- YYparse.java -- the instantiation of yyparse.template.
- YYtokentypes.java -- the instantiation of yytokentypes.template.
- calc.y -- standard bison parser definition.
- calc.lex -- lexical specification for input to flex.
- calc.cmds -- a set of example commands to test out the calculator.
If you examine the Makefile, you will see that there are two
entries:
calc.lex.java
and
calcx.lex.java
If you invoke make with E=calc or E=calcx,
then you can produce a lexer using
either flex or jbf/yylex.generic, respectively.
A parser and lexer for the CORBA IDL (Interface Definition Language)
is provided as as a second example.
It attempts to produce a parse tree using the non-terminal mechanism.
Note that this parse tree construction has not been significantly tested yet.
The IDL parser files are as follows.
- Main.java -- this is the generic main program.
- YYparse.java -- the instantiation of yyparse.template.
- YYtokentypes.java -- the instantiation of yytokentypes.template.
- idl.y -- standard bison parser definition for the idl grammar.
- idl.lex -- lexical specification for input to flex.
- idl.cmds -- an example to test out the calculator.
One interesting thing about this example is the method for
accessing token type values in the lexer.
Normally, one might do
doing something like the following.
return YYtokentypes.INTEGER_LITERAL
Instead, the token definitions are directly inserted into YYlex
by inserting @TOKENTYPES@ in idl.lex and processing the lexical output
file (idl.lex.java) through jb.
A parser and lexer for the Java programming language
is provided as as a third example.
The grammar and lexicon were taken from the Java specifications
at JavaSoft. They were modified to pass thru bison and flex
and probably contain many errors.
The set of files (java.y, java.lex, etc.) parallel the other examples.
As of JB Version 5.0, the default is to conform to the JDK 1.1 / JDK 1.1.1
API. But markers have been left in several files to allow
for recompilation using JDK 1.0.2.
The files are as follows.
- jbf/subst.java
- jbf/jb.java
- jbf/jf.java
- regexp/Regexp.java
At several places within these files, you will see something like
the following.
/*JDK1.1.1*/
target.flush(); target.close(); src.close();
/**/
/*JDK1.0.2
ftarget.flush(); ftarget.close(); fsrc.close();
*/
For each occurrence of the above pattern, you should uncomment
the 1.0.2 code and commment out the 1.1.1 code to produce something
like the following.
/*JDK1.1.1
target.flush(); target.close(); src.close();
*/
/*JDK1.0.2*/
ftarget.flush(); ftarget.close(); fsrc.close();
/**/