Regular Expression Package for Java

Here's a simple RE compiler/matcher for Java. See the documentation
comments in starwave/util/regexp/Regexp.java for details. Here's an
excerpt from the class documentation:

    /**
     * Here's an example of how to use this class:
     *   import starwave.util.regexp.*;
     *
     *   Regexp reg = Regexp.compile("^([a-z]*) ([0-9]*)");
     *   String buffer = readFileIntoString("somefile.text");
     *   Result result;
     *   int pos = 0;
     *
     *   while ((result = reg.searchForward(buffer, pos)) != null) {
     *       System.out.println("Matched " + result.getMatch(1)
     *                          + " and " + result.getMatch(2));
     *       pos = result.matchEnd() + 1;
     *   }
     */

Anyway, it's pretty simple, and fairly complete. It's modeled after
Perl's regular expressions, but I am sure it's not highly optimized
like Perl's are. It supports:

    char            Match any character literally that doesn't have
                    special significance. Below are the characters with
                    special significance. You can backslash special
                    characters to have them treated as literals.
    .               Match any character.
    [...]           Match any character in a class of characters, e.g.,
                    [a-z] matches all the characters between 'a' and
                    'z' (inclusive).
    [^...]          Match any character NOT in a class of characters,
                    e.g., [^a-z] matches all characters NOT between 'a'
                    and 'z' (inclusive).
    \s              Matches a white space character (Space, Linefeed,
                    Return, Tab).
    \S              Matches a non-white space character.
    \w              Matches word characters, i.e., the same as
                    [a-zA-Z0-9_].
    \W              Matches non-word characters (opposite of \w).
    \d              Matches digits 0-9.
    \D              Matches everything except 0-9.
    \n, \r, \f, \t  Matches NL, CR, FF, TAB respectively.
    $               Matches at the end of a line.
    ^               Matches at the beginning of a line.
    \b              Matches if at a word boundary (consumes no
                    characters).
    \B              Matches if NOT at a word boundary.
    (               Marks beginning of a group of matched characters.
    )               Marks ending of a group of matched characters.
    |               NEW! Separates two alternative matches, e.g.,
                    abc|def matches abc or def. Parentheses are used to
                    specify precedence, and "|" has the lowest
                    precedence by default.
    \0, ..., \9     Matches a previous (parenthesized) group of matched
                    characters.
    +               Match one or more occurrences of the previous RE.
    *               Match zero or more occurrences of the previous RE.
    ?               Match zero or one occurrence of the previous RE.

NOTE: you have to put double backslashes in your REs when you supply
them in Java Strings, which is a real pain in the ass. Wouldn't it be
nice if Java had a mechanism for specifying regular expressions?
Something like with Perl:

    /^this is a* regular expression$/

I have been using this package for 5 months now, and I haven't had any
problems with it. Its main weakness, though, is that under some
circumstances it is really slow. For instance, if you are searching for
something using the ".*" notation, it can be very slow because it
spends an awful lot of time backtracking on ultimately failing
searches. It's made even slower by the way I implement it, which
involves method calls (which are horribly slow in Java) as opposed to a
really tight byte-code interpreter for REs, which is how it's often
done. I might rewrite that aspect of it as well.

What I need to do is learn about serious optimization techniques. For
instance, is it possible for me to search first for the literal part of
the regular expression, and then work my way backward for the part to
the left of the literal, and then forward for the part AFTER the
literal? What do I know?

----------------------------------------------------------------------------

Grab the tar file (not compressed, because our daemon does the wrong
thing with .Z files). Let me know what you think.

The tar file unpacks into starwave/util/regexp/*.java

You can say "javac starwave/util/regexp/*.java" to compile it, and put
the top-level starwave directory in your classpath to use it.
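
On the optimization question raised above (searching for the literal
part of an expression first), here is a minimal sketch of the simplest
form of that idea. It is not part of this package: it uses the standard
java.util.regex classes as a stand-in matcher, the class name
LiteralPrefilter is made up for illustration, and it assumes the caller
supplies by hand a literal that every match must contain. It only uses
the literal to reject hopeless inputs cheaply; it does not work backward
and forward from the literal as described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of starwave.util.regexp.
final class LiteralPrefilter {
    private final String literal;   // substring every match must contain (supplied by the caller)
    private final Pattern pattern;  // the full expression, compiled once

    LiteralPrefilter(String literal, String regex) {
        this.literal = literal;
        this.pattern = Pattern.compile(regex);
    }

    /** Returns the first match in text, or null if there is none. */
    String findFirst(String text) {
        // Cheap check first: if the literal is absent, no match is possible,
        // so the backtracking matcher is never started.
        if (text.indexOf(literal) < 0) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Note the doubled backslashes required inside a Java string literal.
        LiteralPrefilter f = new LiteralPrefilter("error", "\\w+ error \\d+");
        System.out.println(f.findFirst("disk error 42 on sda")); // prints "disk error 42"
        System.out.println(f.findFirst("all quiet"));            // prints "null"
    }
}
```

The prefilter only pays off when most inputs lack the literal. Deriving
the required literal automatically from the expression, and anchoring
the match around it, would be the harder part and is not attempted here.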