jb -- Parser and Lexer Generation for Java

Overview and User Manual

Last Updated: 9 October 1997
Latest Version: jb-5.4

Description

For a description of the functionality of jb, as well as information about obtaining jb, see jb.html.

The jb installation process essentially consists of untarring the jb tar file (after any needed decompression) into a directory of your choice. The jb code itself, as opposed to Bison and flex, is entirely in Java, so there should be no reason to recompile to produce new .class files to replace the ones that come with standard installation.

If, for whatever reason, you decide to rebuild the .class files, you may do so as follows.

Go into the regexp directory.
Re-compile the java files by executing the following command: make clean all
Go into the jbf directory.
Re-compile the java files by executing the following command: make clean all Note that compiling subst.java may produce warnings about deprecated features. You can ignore them.

Once the files are installed, and possibly re-compiled, you may go into the examples directory and carry out the processes as described in the Parser Construction and in the Lexer Construction sections.

Source Tree Structure

This source tree contains a number of files and subdirectories.

COPYRIGHT -- This files describes the policies governing the use and distribution of this software.
jb.html -- This file gives description and retrieval information about this system. A text version is in jb.txt.
README -- This file (html format) gives an overview of the source tree and describes the build and installation process.
INVENTORY -- Lists the files comprising this source tree.
jbf -- The java classes needed to support the parsers and lexers generated by jb.
examples -- a set of example grammars (and lexers) for several examples.
- calc -- a simple calculator with two lexers: one using flex and one derived from YYlex.java.
- idl -- a CORBA IDL parser with a flex lexer that uses the non-terminal package to build a parse tree.
- java -- a parser and lexer for java.
regexp -- The starwave regular expression system; compliments of Jonathan Payne (jpayne@starwave.com).
examples/jexec, jbf/jexec -- a wrapper script for java and javac so that the CLASSPATH environment variable can be set before execution.

Parser Overview

The following discussion assumes substantial familiarity with parsing using bison or yacc and with lexing using lex or flex.

Parser Structure

A jb parser consists of the following parts.

YYparse.java

The file YYparse.java is generated from jbf/yyparse.template. It is the primary runtime parse engine, and it contains one class: YYparse. The YYparse class is the primary parse class, and it must be instantiated to create a parser instance. Note that the class name (``YYparse'') is the default. The substitution flags note below indicates how to provide an alternate class name.

The YYparse class methods are as follows.

YYparse -- the constructor takes an instance of a lexical scanner (of class YYlex) as an argument.
setdebug(boolean) -- turns verbose debug output on or off.
setdebug(level) -- turns verbose debug to a specified level.
yyparse -- executes the parse engine.
yyerror -- invoked when a parse error occurs.
yyerror_verbose -- can be invoked from within yyerror to generate a more verbose syntax error message (see the calculator example).

YYtokentypes.java

The file YYtokentypes.java is generated from jbf/yytokentypes.template. It contains one abstract class: YYtokentypes and several constructed definitions.

The token-type integer constants shared by the parser and the lexical scanner.
A table of names corresponding to the integer token types; this table is normally accessed via the procedure YYtokentypes.tokenname().
The non-terminal type integer constants.
A table of names corresponding to the integer non-terminal types; this table is normally accessed via the procedure YYtokentypes.nontermname().

YYtoken.java

The file jbf/YYtoken.java (was Token.java) defines an object to contain the lexeme value extracted by YYlex. The YYtoken class contains an integer value typically taken from the token values in class YYtokentype. Additionally, it contains a StringBuffer of the text of the token and a YYlocation instance. This latter contains the line number and character number of the start of that token string.

YYlex.java

The file YYlex.java is the lexical scanner used by the parser, and defines the class YYlex. The parser is passed an instance of this class at construction time. This lexer can be constructed in two ways as described below in section Lexer Overview. As a rule, your lexer will import the YYtokentypes class to get access to the tokentype values. But it could also be passed through jb to insert the Tokentype constants directly (see the idl example below).

The YYlex class methods are as follows.

YYlex -- the constructor takes an InputStream as an argument.
setdebug(boolean) -- turns verbose debug output on or off.
setdebug(level) -- turns verbose debug to a specified level.
yylex -- repeatedly called to return an instance of YYtoken.
yylexerror -- invoked when a parse error occurs.
yyreport -- dump the contents of a YYtoken instance.

jbf/{ParseException.java, YYnode_Stack.java, etc.}

These files contain various support classes used by the parser. Because they are not templates, these files, as well as the Token (sub)classes can be shared by all parsers and lexers via the package named jbf.

Parser Construction Process

We will use the compilation of the calculator example (see examples/calc.y) to show how a parser is ultimately compiled to a .class file. The sequence for building the calculator can be seen by looking in the file examples/Makefile. It essentially consists of the following sequence of actions.

Run bison on the parser specification ``calc.y'' using the following command, for example, to produce file calc.tab.c.
```
bison calc.y
```
Run the jb java program.
```
jexec java jbf.jb -yyerror yyverror calc.tab.c \
        ../jbf/yyparse.template YYparse.java \
        ../jbf/yytokentypes.template YYtokentypes.java \
        calc.lex.java YYlex.java
```
Note that the lexer file (calc.lex.java) will need to have been constructed using the one of the lexer generation methods described in the Lexer Overview. Also, the substitution flag causes the calls to yyerror() to be replaced with a call to yyverror(), which is defined in the suffix section of calc.y.

Compile the parser java files with the following command.

javac -g Main.java YYparse.java YYlex.java YYtokentypes.java

execute the parser with the following command.

java Main < calc.cmds

java Main -debug < calc.cmds

or, for really voluminous output,

java Main -debug -l 100 < calc.cmds

The jb Command:

The jb command is a java program for extracting information from a bison produced parser file and inserting it into other files at appropriate places and in appropriate form. Any argument beginning with "--" is assumed to be an option to jb.

The current list of options is as follows.

--flags -- dump a list of the flag substitutions used in the processing of the files.

Additionally, jb will accept flag arguments of the following form.

 -flagname flagvalue

As it scans its input files, it looks for instances of @flagname@. If it encounters an instance of that flag, then it will substitute the corresponding flagvalue string. If @flagname@ is encountered in the template, and no flag argument is specified, then the flag defaults to an internal value, which is usually the same as the flagname (e.g., @YYparse@ defaults to ``YYparse.'').

The list of flags can be seen by using the --flags argument. A partial list would include the following flags, where the default is in parenthesis if it is different from the flag name.

YYparse -- this is the class name for the parser class. By specifying this flag, one can change the parser class name, and so one can have multiple parsers in the same java program.
yyparse -- this is the name for parser start procedure.
yyerror -- this is the name for the error catching procedure.
YYlex -- this is the name for the lexer class.
yylex -- this is the procedure in YYlex to call to obtain the next token.
YYnode -- this is the type of the contents of the parse operands stack.
tokentype -- this is the field in YYnode that is used to access the node type value (token or nonterminal).
YYtoken -- this is the type of the result returned by YYlex.
ntprefix ("nt_") -- this is the string that is prepended to nonterminal names in the YYtokentypes class; it is used to prevent name and keyword conflicts.
ntsuffix ("") -- this is postpended to nonterminal names.
package ("") -- this is the name of the package in which to place the parser class. If not specified, then no package declaration is inserted. Note that this option will insert a complete package declaration into the text. That is, the option -package x.y will insert the string ``package x.y ;'' into the generated parser.
constructor ("") -- this is a command to insert into the YYparse constructor procedure to invoke user construction actions.

After options and flags have been extracted, jb interprets its first argument as the input file. Presumably this file was produced as the output from bison. The rest of the arguments to jb are paired together. The first file of each pair is the template file and the second is the name of the file in which to place the result. Each template is read a line at a time and each line is written to the output file. Simultaneously, it is scanned for selected markers (and flags), and when one is encountered, the corresponding information from the input file (suitably modified for java). is inserted into the output file.

The current list of markers is as follows.

@PREFIX@ -- this is the text from the beginning of the parser specification and enclosed in %{ ... %}. Warning: one must take care when specifying prefix code (wrapped with %{ and %}). This code must not contain any lines that have a sharp (#) in the first column.
@SUFFIX@ -- this is the text from the end of the parser specification after the last %%.
@ACTIONS@ -- this is actions for the parse rules encoded as a switch statement.
@TOKENTYPES@ and @TOKENNAMES@ -- this is the list of token types and names, respectively, taken from %token specifications.
@NONTERMTYPES@ and @NONTERMNAMES@ -- this is the list of non-terminal types and names, respectively.
@TABLES@ -- this is the parse tables and associated constants used by the parse engine.
@CONSTANTS@ -- this is the parse constants used by the parse engine.
@PROCS@ -- this is some special parse procs used by the parse engine.

Parser Details

The parser semantic actions are written in Java, not in C or C++.
As you should know, the element of the parse rule are references using special strings of the form $1, $2, etc. The resulting parse value is returned from the semantic by assigning to $$. These are actually translated into references into the parse value stack yyvs; reference $i is translated to ``dollar(i)'', and $$ is translated to the ``yyval'' internal variable.
The lexer procedure YYlex.yylex() is assumed to return a java object of type Integer (i.e., a wrapped int) representing the token type of a lexeme. This is stored into the variable yylval.
Additionally, it is assumed that the location information for the just matched lexeme can be obtained from the variable yytext, which is a StringBuffer.
Additionally, it is assumed that the location information for the just-matched lexeme can be obtained from the variable yyloc, of type YYlocation. The location specifies the position of the first character of the lexeme using three values: charno0 is the absolute character location, lineno is the line containing the lexeme, and charno is the character offset within the line.
WARNING: yylval, yytext, and yyloc are volatile, which means their contents may be modified by every call to yylex(). So if the parse semantic action need any of the values from those variables, the action must explicitly copy them out of these variables.
Bison provides a way to specify the types of tokens and nonterminals using the construct %union{...}. Jb exploits this construct to provide Java-based typeing for references to $$ and $1, etc.
The normal assumption is that the dollar references are of type Object because the parse value stack is a stack of Objects. But other things can be put onto the stack by assigning to $$ within the semantic action. The problem is that then references to, for example, $1 must do an explicit cast to, for example ((YYterminal)$1), to get the correct kind of object off the stack.
One can get jb to do some automatic casting by specifying the list of expected parse stack types within the %union. For example (taken from idl.y), one might specify the following.
```
%union {YYnonterminal ; YYterminal ; YYtoken ; default Object}
```
This indicates that the expected types are YYnonterminal, YYterminal, and YYtoken. Additionally, it says that the default type is Object. This last clause is redundant because that type is the normal default.
The normal bison type specification mechanisms can be used to indicate various token and nonterminal types; for example, again from idl.y.
```
%token <Integer>;INTEGER_LITERAL LONG_LITERAL
%type <YYterminal> param_attribute sharp_declaratives
```
If you examine the final generated code, you will see that appropriate casts have been inserted.
Beware that the mechanism is a little fragile. The bison output for $1, for example, typically looks something like yyvsp[0].YYnonterminal. That is the type is inserted as a C union field reference. Jb converts this to ((YYnonterminal)dollar(0)). Thus, there is potential confusion between YYnonterminal as type and YYnonterminal as field selector. Jb makes the distinction by looking at the list from the %union specification.
One final warning. Bison, of course, does not know about subtype relationships, so it may generate warnings about type mismatches. If these come from rules with no attached semantic actions, then they can be suppressed by putting in an explicit action of the form {$$ = $1;} since Bison assumes that semantic actions will transform all type conflicts.
In order to stop parsing, a procedure names ``yyreturn'' is provided. If called with the argument YYACCEPT or YYABORT, then the parser returns with that value.

Lexer Overview

The following discussion assumes substantial familiarity with parsing using bison or yacc and with lexing using lex or flex.

Two distinct methods for lexer construction are provided. First, there is the Gnu flex system, and second there is a generic, hand-coded lexer.

Flex and jf

There is a program, jf, that corresponds to jb, but for processing the output from flex. As with jb, the ``--flags'' argument can be used to see the list of substitutable flags. A partial list would include the following flags, where the default is in parenthesis if it is different from the flag name.

YYlex -- this is the class name for the lexer class. By specifying this flag, one can change the lexer class name, and so one can have multiple lexers in the same java program.
yylex -- this is the procedure in YYlex to call to obtain the next token.
yylexerror -- this is the name for the error catching procedure.
package ("") -- this is the name of the package in which to place the lexer class. If not specified, then no package declaration is inserted. Note that this option will insert a complete package declaration into the text. That is, the option -package x.y will insert the string ``package x.y ;'' into the generated parser.
constructor ("") -- this is a command to insert into the YYlex constructor procedure to invoke user construction actions.

Additionally, jf supports three flags that indicate optimization levels for lexers. These flags obviously correspond to the same flags (with one less dash) for the flex command.

--Cem -- This produces small and slow lexers.
--Cfe -- This produces medium size and medium speed lexers.
--Cfa -- This produces large and fast lexers. Note that experience to date indicates that these lexers are too large to be processed by Java and will crash the compiler and/or the virtual machine.

Note that no other combinations of flex flags is supported. Each of these flags requires a corresponding lexer template. There are three: jbf/yylex.Cem.template, jbf/yylex.Cfe.template, and jbf/yylex.Cfa.template. If you get the wrong combination of flex flags, jf flags, and template, then jf will complain.

Lexer Construction Process

Using the lexer specification examples/calc.lex, flex and jf would be used as follows.

Execute the following command to produce calc.yy.c.
```
flex -ocalc.yy.c calc.lex
```
Execute the following command to produce calc.lex.java
```
jexec java jbf.jf --Cem calc.yy.c $JBFPATH/jbf/yylex.Cem.template calc.lex.java
```
If the output file, calc.lex.java does contain parser flags or markers (such at @TOKENTYPES@), then it need to be further passed through jb.

Flex-Generated Lexer Details

In order to use flex to generate lexers, you will have to obey some conventions.

The lexical action bodies must be written in Java, and not C or C++.
The lexical actions must not use "return" to return a token value. This is because the code will then cause "statement not reached" errors. Instead, as the last thing in the semantic action, you should pass the token-type value to a procedure called "yyreturn."
As a rule, you should return the value using the token constants (e.g., BOOLEAN_LITERAL) and not using the corresponding integer using, for example, new Integer(5). The reason is that the constants are shared and will generate no garbage.
Do not attempt to have flex suppress line numbers (i.e., "#line" directives) because they are used by jf to figure out the location of things in the flex output.
The flex parser has been hacked to work with java. As a result, you should not expect any of the runtime flex capabilities (yymore, etc.) to work with this parser. However, an attempt has been made to support the yyless procedure.
The lexeme string is kept in the local variable yytext of type YYlexbuffer. The lexical action code associated with the lex rule can modify this text, if it desires, by calling the contents() method. An example might be to strip quotes or to convert escape sequences for strings. As an aside, the class YYtoken has some routines for these exact purposes.
The parser shares the text part of the YYtoken value, yylval with the lexer via yyl.yylval.text(). This allows parser action routines to directly access the string of the token via the variable yytext.

As with the jb, and after options and flags have been extracted, jf interprets its first argument as the input file. Presumably this file was produced as the output from bison. The rest of the arguments to jf are paired together. The first file of each pair is the template file and the second is the name of the file in which to place the result. Each template is read a line at a time and each line is written to the output file. Simultaneously, it is scanned for selected markers (and flags), and when one is encountered, the corresponding information from the input file (suitably modified for java) is inserted into the output file.

The current list of markers is as follows.

@FLEXFLAGS --
@PREFIX@ -- this is the text from the beginning of the lexer specification (the so-called definitions section) that is enclosed in %{ ... %}. Warning: one must take care when specifying prefix code (wrapped with %{ and %}). This code must not contain any lines that have a sharp (#) in the first column.
@SUFFIX@ -- this is the text from the end of the lexer specification after the last %%.
@ACTIONS@ -- this is actions for the lexer rules encoded as a switch statement.
@TABLES@ -- this is the lexeme tables and associated constants used by the lexical engine.
@CONSTANTS@ -- this is the lexical constants used by the engine.
@PROCS@ -- this is some special lex procs used by the engine.
@STATES@ -- this contains a set of constants defining the lexical engine states (indicated by <...> in the rules).
@YYLEX@ -- this is the text from the beginning of the lexer rules section (after the first occurrence of %%) that is enclosed in %{ ... %}.

The Generic Lexer

The second method for producing a lexer is to take the file jbf/yylex.generic, which is written in java, and modify it to recognize the tokens as defined for the language you are parsing. Basically, yylex.generic implements a classical, hand-coded lexer recognizing comments, strings, numbers, identifiers, keywords, and miscellaneous delimiters. The reason one might want to use this rather than the flex lexer is that the generic is smaller and faster than any of the flex lexers. But on the other hand, the generic lexer is harder to maintain.

In order to use the generic lexer, you should copy it to a local file and modify as needed. If you look at the end of it, you will see a routine called "definekeywords", and which is set up to be processed in the examples Makefile to generate either a lexer for the calculator or for java. I do not recommend that you follow this practice; it is only to simplify the examples construction.

Example Grammars

A variety of examples have been provided to show how the system is used.

[Calculator | CORBA IDL | Java ]

Calculator Example

The simplest example is the usual calculator example, where the calculator files are as follows.

Main.java -- The file Main.java instantiates and invokes the parser. The main program is class Main, static method main. It instantiates the parser and lexer and calls the parser yyparse method to parse the input stream. This file is actually shared by all the examples.
YYparse.java -- the instantiation of yyparse.template.
YYtokentypes.java -- the instantiation of yytokentypes.template.
calc.y -- standard bison parser definition.
calc.lex -- lexical specification for input to flex.
calc.cmds -- a set of example commands to test out the calculator.

If you examine the Makefile, you will see that there are two entries: calc.lex.java and calcx.lex.java If you invoke make with E=calc or E=calcx, then you can produce a lexer using either flex or jbf/yylex.generic, respectively.

CORBA IDL Example

A parser and lexer for the CORBA IDL (Interface Definition Language) is provided as as a second example. It attempts to produce a parse tree using the non-terminal mechanism. Note that this parse tree construction has not been significantly tested yet.

The IDL parser files are as follows.

Main.java -- this is the generic main program.
YYparse.java -- the instantiation of yyparse.template.
YYtokentypes.java -- the instantiation of yytokentypes.template.
idl.y -- standard bison parser definition for the idl grammar.
idl.lex -- lexical specification for input to flex.
idl.cmds -- an example to test out the calculator.

One interesting thing about this example is the method for accessing token type values in the lexer. Normally, one might do doing something like the following.

return YYtokentypes.INTEGER_LITERAL

Instead, the token definitions are directly inserted into YYlex by inserting @TOKENTYPES@ in idl.lex and processing the lexical output file (idl.lex.java) through jb.

Java Example

A parser and lexer for the Java programming language is provided as as a third example. The grammar and lexicon were taken from the Java specifications at JavaSoft. They were modified to pass thru bison and flex and probably contain many errors. The set of files (java.y, java.lex, etc.) parallel the other examples.

JDK 1.0.2 Compatibility

As of JB Version 5.0, the default is to conform to the JDK 1.1 / JDK 1.1.1 API. But markers have been left in several files to allow for recompilation using JDK 1.0.2. The files are as follows.

jbf/subst.java
jbf/jb.java
jbf/jf.java
regexp/Regexp.java

At several places within these files, you will see something like the following.

/*JDK1.1.1*/
    target.flush(); target.close(); src.close();
/**/

/*JDK1.0.2
    ftarget.flush(); ftarget.close(); fsrc.close();
*/

For each occurrence of the above pattern, you should uncomment the 1.0.2 code and commment out the 1.1.1 code to produce something like the following.

/*JDK1.1.1
    target.flush(); target.close(); src.close();
*/

/*JDK1.0.2*/
    ftarget.flush(); ftarget.close(); fsrc.close();
/**/