Regular Expression Package for Java

Here's a simple RE compiler/matcher for Java. See the documentation comments
in starwave/util/regexp/Regexp.java for details. Here's an excerpt from the
class documentation:

/**
 * Here's an example of how to use this class:
 *      import starwave.util.regexp.*;
 *
 *      Regexp reg = Regexp.compile("^([a-z]*) ([0-9]*)");
 *      String buffer = readFileIntoString("somefile.text");
 *      Result result;
 *      int pos = 0;
 *
 *      while ((result = reg.searchForward(buffer, pos)) != null) {
 *          System.out.println("Matched " + result.getMatch(1)
 *              + " and " + result.getMatch(2));
 *          pos = result.matchEnd() + 1;
 *      }
 */

Anyway, it's pretty simple, and fairly complete. It's modeled after Perl's
regular expressions, but I am sure it's not as highly optimized as Perl's
are. It supports:

     char Match any character literally that doesn't have special
          significance. Below are characters with special significance. You
          can backslash special characters to have them treated as literals.
     .    Match any character.
     [...]
          Match any in a class of characters, e.g., [a-z] matches all the
          characters between 'a' and 'z' (inclusive).
     [^...]
          Match any NOT in a class of characters, e.g., [^a-z] matches all
          characters NOT between 'a' and 'z' (inclusive).
     \s   Matches a white space character (Space, Linefeed, Return, Tab).
     \S   Matches a non-white space character.
     \w   Matches word characters, i.e., same as [a-zA-Z0-9_]
     \W   Matches non-word characters (opposite of \w)
     \d   Matches digits 0-9.
     \D   Matches everything except 0-9.
     \n, \r, \f, \t
          Matches NL, CR, FF, TAB respectively.
     $    Matches at the end of a line.
     ^    Matches at the beginning of a line.
     \b   Matches if at a word boundary (consumes no characters).
     \B   Matches if NOT at a word boundary.
     (    Marks beginning of a group of matched characters.
     )    Marks ending of a group of matched characters.
     |    NEW!
          Separates two alternative matches, e.g., abc|def matches abc or
          def. Parentheses are used to specify precedence, and "|" is by
          default the lowest precedence.
     \0, ..., \9
          Matches a previous (parenthesized) group of matched characters.
     +    Match one or more occurrences of previous RE
     *    Match zero or more occurrences of previous RE
     ?    Match zero or one occurrences of previous RE
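
To make a few of the constructs above concrete, here is a small sketch. It
uses the standard java.util.regex package as a stand-in, NOT
starwave.util.regexp, on the assumption that classes, alternation, \w and
numbered back-references behave the same way for these simple cases. The
class name SyntaxDemo and the helper firstMatch are made up for this
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in demo: uses java.util.regex, not starwave.util.regexp.
class SyntaxDemo {
    // Returns the first substring of text matched by re, or null if none.
    static String firstMatch(String re, String text) {
        Matcher m = Pattern.compile(re).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Character classes with + (one or more).
        System.out.println(firstMatch("[a-z]+ [0-9]+", "see item 42 here")); // item 42
        // Alternation: either abc or def.
        System.out.println(firstMatch("abc|def", "xxdefxx"));                // def
        // Back-reference: a word character followed by the same character.
        System.out.println(firstMatch("(\\w)\\1", "hello"));                 // ll
        // No match at all.
        System.out.println(firstMatch("\\d+", "no digits"));                 // null
    }
}
```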

NOTE: you have to put double backslashes in your REs when you supply them
in Java Strings, which is a real pain in the ass. Wouldn't it be nice if
Java had a mechanism for specifying regular expressions? Something like
Perl's:

        /^this is a* regular expression$/
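
Here is a short sketch of what the doubling looks like in practice. The
class name EscapeDemo is made up, and the matching step uses the standard
java.util.regex package as a stand-in for this package, assuming \d and \.
mean the same thing in both.

```java
import java.util.regex.Pattern;

// Shows how the RE \d+\.\d+ has to be written inside a Java string literal.
class EscapeDemo {
    // Each backslash in the RE is written twice in the source code.
    static final String DECIMAL = "\\d+\\.\\d+";

    public static void main(String[] args) {
        // The string itself holds single backslashes: 8 characters in all.
        System.out.println(DECIMAL);          // \d+\.\d+
        System.out.println(DECIMAL.length()); // 8
        // Stand-in matcher (java.util.regex, not starwave.util.regexp).
        System.out.println(Pattern.matches(DECIMAL, "3.14")); // true
        System.out.println(Pattern.matches(DECIMAL, "3x14")); // false
    }
}
```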

I have been using this package for 5 months now, and I haven't had any
problems with it. The main drawback, though, is that under some
circumstances it is really slow. For instance, if you are searching for
something using the ".*" notation, it can be very slow because it spends an
awful lot of time backtracking on searches that ultimately fail. It's made
even slower by the way I implement it, which involves method calls (which
are horribly slow in Java) as opposed to a really tight byte-code
interpreter for REs, which is how it's often done. I might rewrite that
aspect of it as well.

What I need to do is learn about massive optimization techniques. For
instance, is it possible for me to search first for the literal part of the
regular expression, and then work my way backward for the part to the left
of the literal, and then forward for the part AFTER the literal? What do I
know?
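
One way the literal-first idea could look, as a rough sketch only: find the
literal with a plain String.indexOf, then check the part to its left and the
part to its right around each hit. The sketch assumes the RE has already
been split by hand into left, literal and right pieces, and it uses the
standard java.util.regex package as a stand-in to match the two pieces. The
class LiteralFirstSearch and its find method are made up, and the match it
reports is not guaranteed to be the same one a normal left-to-right search
would pick.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of "search for the literal first"; uses java.util.regex as a stand-in.
class LiteralFirstSearch {
    // Looks for left + literal + right in text. Returns the index where the
    // left part starts, or -1 if no occurrence of the literal works out.
    static int find(String left, String literal, String right, String text) {
        if (literal.isEmpty()) throw new IllegalArgumentException("literal must be non-empty");
        // \z forces the left part to end exactly where the literal begins.
        Pattern leftRe = Pattern.compile("(?:" + left + ")\\z");
        Pattern rightRe = Pattern.compile(right);
        int at = text.indexOf(literal);
        while (at >= 0) {
            // Left part: must match a suffix of everything before the literal.
            Matcher l = leftRe.matcher(text.substring(0, at));
            // Right part: must match starting right after the literal.
            Matcher r = rightRe.matcher(text).region(at + literal.length(), text.length());
            if (l.find() && r.lookingAt()) return l.start();
            at = text.indexOf(literal, at + 1); // try the next occurrence
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pieces of the RE "[a-z]+ = [0-9]+", split around the literal "=".
        System.out.println(find("[a-z]+ ", "=", " [0-9]+", "x1 foo = 42;")); // 3
        System.out.println(find("[a-z]+ ", "=", " [0-9]+", "a = b"));        // -1
    }
}
```

The payoff is that text with no occurrence of the literal is rejected by
indexOf alone, without any backtracking. The substring copy is only there
to keep the sketch short.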
----------------------------------------------------------------------------
Grab the tar file (not compressed, because our daemon does the wrong thing
with .Z files). Let me know what you think.

The tar file unpacks into

        starwave/util/regexp/*.java

You can say "javac starwave/util/regexp/*.java" to compile it, and put the
directory that contains the top-level starwave directory in your classpath
to use it.