Regular Expressions
A regular expression is a string where some characters have special meanings. A
regular expression is used to search and replace text. This set of notes gives
some examples of using the gnu.regexp Java classes that are part of the SDSU
Java class library. The gnu.regexp library can also be downloaded at
http://www.cacas.org/~wes/java/.
An on-line reference manual for regular expressions is at:
http://www.cs.utah.edu/csinfo/texinfo/regex/regex_toc.html.
Mastering Regular Expressions, Friedl, O'Reilly & Associates, is a text on how
to use regular expressions.
This set of notes is not meant to be a tutorial on regular expressions.
Using the gnu.regexp Package
In this example, we show how to use a regular expression to replace "at" by "ow".
The substitute() method replaces the first instance of the pattern in the text.
The substituteAll() method replaces all instances of the pattern in the text.
import gnu.regexp.REException;
import gnu.regexp.RE;

public class RegularExpression
{
    public static void main( String args[] ) throws REException
    {
        String text = "cat sat that hat";
        String pattern = "at";
        String replacement = "ow";
        RE magic = new RE( pattern );
        // Replace only the first match
        String result = magic.substitute( text, replacement );
        System.out.println( result );
        // Replace every match
        String all = magic.substituteAll( text, replacement );
        System.out.println( all );
    }
}
Output
cow sat that hat
cow sow thow how
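For readers who do not have gnu.regexp installed, the same two operations can be sketched with the java.util.regex package that ships with the JDK. This is a substitution on my part, not the library these notes describe: the class name ReplaceSketch and its helper methods are made up for illustration, and replaceFirst() and replaceAll() are only assumed to play the roles of substitute() and substituteAll() for this simple pattern.

```java
import java.util.regex.Pattern;

// Sketch with the JDK's java.util.regex, NOT gnu.regexp (assumption: it is
// used here only because it is available without extra downloads).
class ReplaceSketch {
    // Replace only the first match of regex in text.
    static String replaceFirst(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceFirst(replacement);
    }

    // Replace every match of regex in text.
    static String replaceAll(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "cat sat that hat";
        System.out.println(replaceFirst("at", text, "ow")); // cow sat that hat
        System.out.println(replaceAll("at", text, "ow"));   // cow sow thow how
    }
}
```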
Getting all Matches
import java.util.Enumeration;
import gnu.regexp.RE;
import gnu.regexp.REException;
import gnu.regexp.REMatch;

public class AccessingMatches {
    public static void main( String args[] ) throws REException {
        String text = "cat sat that hat";
        String pattern = "at";
        String replacement = "ow";
        RE magic = new RE( pattern );
        // Show how to access all the matches
        REMatch[] allMatches = magic.getAllMatches( text );
        REMatch lastMatch = allMatches[ allMatches.length - 1 ];
        // Replace just the last match: keep the text before it unchanged,
        // then substitute starting at the index of the last match
        int patternIndex = lastMatch.getStartIndex();
        String last = text.substring( 0, patternIndex ) +
            magic.substitute( text, replacement, patternIndex );
        System.out.println( last );
        // Visit the matches one at a time
        Enumeration matches = magic.getMatchEnumeration( text );
        while ( matches.hasMoreElements() ) {
            REMatch aMatch = (REMatch) matches.nextElement();
            System.out.println( "A match at location: " +
                aMatch.getStartIndex() );
        }
    }
}
Output
cat sat that how
A match at location: 1
A match at location: 5
A match at location: 10
A match at location: 14
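The idea of walking over every match and then replacing only the last one can also be sketched with the JDK's java.util.regex package. Again this is a stand-in for gnu.regexp, not its API: MatchPositionsSketch, matchStarts() and replaceLast() are hypothetical names, and the replacement here is treated as literal text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class MatchPositionsSketch {
    // Collect the start index of every match, scanning left to right.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    // Replace only the last match with literal replacement text.
    // The text is returned unchanged when nothing matches.
    static String replaceLast(String regex, String text, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int start = -1;
        int end = -1;
        while (matcher.find()) {
            start = matcher.start();
            end = matcher.end();
        }
        if (start < 0) {
            return text;
        }
        return text.substring(0, start) + replacement + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "cat sat that hat";
        System.out.println(replaceLast("at", text, "ow")); // cat sat that how
        System.out.println(matchStarts("at", text));       // [1, 5, 10, 14]
    }
}
```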
Parameters of RE methods
The previous examples of RE methods (substitute, substituteAll,
getMatchEnumeration) use Strings as the argument for the text. The methods of RE
can take Strings, char[], StringBuffers and InputStreams for the text.
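As a rough comparison only, the JDK's java.util.regex package (not gnu.regexp) accepts any CharSequence as the text, which covers String and StringBuffer directly and char[] once it is wrapped. The sketch below assumes that substitution of libraries; InputTypesSketch is a made-up name, and the sketch does not cover InputStream input.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class InputTypesSketch {
    // Any CharSequence can serve as the text to search.
    static String replaceAll(String regex, CharSequence text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String asString = "cat sat";
        StringBuffer asBuffer = new StringBuffer("cat sat");
        char[] asChars = { 'c', 'a', 't', ' ', 's', 'a', 't' };

        System.out.println(replaceAll("at", asString, "ow"));                 // cow sow
        System.out.println(replaceAll("at", asBuffer, "ow"));                 // cow sow
        // A char[] is not a CharSequence, so wrap it in a CharBuffer first.
        System.out.println(replaceAll("at", CharBuffer.wrap(asChars), "ow")); // cow sow
    }
}
```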
Regular Expression Dialects
The gnu.regexp package supports a number of different dialects of regular
expressions. The dialects differ in what special characters they use. The
following code segment shows how to create an RE object that uses a different
dialect. The default dialect is Perl 5. See the class RESyntax for a list of
supported dialects and an explanation of the second argument.
    new RE( pattern, 0, RESyntax.RE_SYNTAX_EMACS );
Special Characters
To illustrate how some of the special characters operate, I added some examples.
The examples were all produced using the following method. Note that the double
quote character is added to the strings in the println method to show where the
strings start and end.
public static void replaceAll(String text,
                              String pattern,
                              String replacement) throws REException
{
    RE magic = new RE( pattern );
    String result = magic.substituteAll( text, replacement );
    System.out.println( "Text\t\"" + text + "\"" );
    System.out.println( "Pattern\t\"" + pattern + "\"");
    System.out.println( "Replacement\t\"" + replacement + "\"");
    System.out.println( "Result\t\"" + result + "\"");
}
Positional Operators
^ | matches at the beginning of a line
$ | matches at the end of a line
\A | matches the start of the entire string
\Z | matches the end of the entire string
Examples
Text | "cat cat cat"
Pattern | "cat$"
Replacement | "dog"
Result | "cat cat dog"

Text | "cat cat cat"
Pattern | "^cat"
Replacement | "dog"
Result | "dog cat cat"
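The difference between the line anchors (^ and $) and the whole-string anchors (\A and \Z) only shows up when the text contains a newline. The sketch below illustrates it with the JDK's java.util.regex package rather than gnu.regexp, so the details are an assumption about that library only: there, ^ and $ match at line boundaries when the Pattern.MULTILINE flag is set. AnchorSketch and its helper are hypothetical names.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class AnchorSketch {
    static String replaceAll(String regex, int flags, String text, String replacement) {
        return Pattern.compile(regex, flags).matcher(text).replaceAll(replacement);
    }

    // Show a newline as the two characters \n so each result prints on one line.
    static String visible(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String text = "cat\ncat cat"; // two lines: "cat" and "cat cat"
        int m = Pattern.MULTILINE;    // lets ^ and $ match at each line boundary

        System.out.println(visible(replaceAll("^cat", m, text, "dog")));   // dog\ndog cat
        System.out.println(visible(replaceAll("cat$", m, text, "dog")));   // dog\ncat dog
        System.out.println(visible(replaceAll("\\Acat", m, text, "dog"))); // dog\ncat cat
        System.out.println(visible(replaceAll("cat\\Z", m, text, "dog"))); // cat\ncat dog
    }
}
```

Note that a backslash in a pattern must be doubled inside a Java string literal, so the pattern \A is written "\\A" in source code.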
One-Character Operators
. | matches any single character
\d | matches any decimal digit
\D | matches any non-digit
\n | matches a newline character
\r | matches a return character
\s | matches any whitespace character
\S | matches any non-whitespace character
\t | matches a horizontal tab character
\w | matches any word (alphanumeric) character
\W | matches any non-word (non-alphanumeric) character
\x | matches the character x, if x is not one of the above listed escape sequences.
Examples
Text | "cat cat cat"
Pattern | "\scat"
Replacement | " dog"
Result | "cat dog dog"

Text | "cat bat sat"
Pattern | "\s.at"
Replacement | " dog"
Result | "cat dog dog"
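A few of the one-character operators can be tried out with the JDK's java.util.regex package, which I use here as a stand-in for gnu.regexp (an assumption, as is the class name OneCharSketch). The patterns above are shown as they print; inside a Java string literal each backslash has to be doubled.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class OneCharSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // \s followed by the literal "cat"
        System.out.println(replaceAll("cat cat cat", "\\scat", " dog")); // cat dog dog
        // \s, then any single character, then "at"
        System.out.println(replaceAll("cat bat sat", "\\s.at", " dog")); // cat dog dog
        // \d replaces each decimal digit separately
        System.out.println(replaceAll("room 101", "\\d", "#"));          // room ###
        // \S replaces every non-whitespace character; the tab is left alone
        System.out.println(replaceAll("a b\tc", "\\S", "x"));
    }
}
```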
Character Class Operator
[abc] | matches any character in the set a, b or c
[^abc] | matches any character not in the set a, b or c
[a-z] | matches any character in the range a to z, inclusive. A leading or trailing dash is interpreted literally.
Examples
Text | "cat bat mat"
Pattern | "[cm]at"
Replacement | "dog"
Result | "dog bat dog"

Text | "cat bat mat"
Pattern | "[^bc]at"
Replacement | "dog"
Result | "cat bat dog"
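The character class examples, plus a range and a leading dash, can be checked with a short sketch. It uses the JDK's java.util.regex package instead of gnu.regexp, which is an assumption on my part; CharClassSketch is a made-up name.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class CharClassSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "cat bat mat";
        System.out.println(replaceAll(text, "[cm]at", "dog"));  // dog bat dog
        System.out.println(replaceAll(text, "[^bc]at", "dog")); // cat bat dog
        // A range: a, b or c followed by "at"
        System.out.println(replaceAll(text, "[a-c]at", "dog")); // dog dog mat
        // A leading dash is an ordinary character inside the brackets
        System.out.println(replaceAll("cat -at bat", "[-c]at", "dog")); // dog dog bat
    }
}
```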
Within a character class expression, the following sequences have special
meaning if the syntax bit RE_CHAR_CLASSES is on:
[:alnum:] | Any alphanumeric character
[:alpha:] | Any alphabetical character
[:blank:] | A space or horizontal tab
[:cntrl:] | A control character
[:digit:] | A decimal digit
[:graph:] | A non-space, non-control character
[:lower:] | A lowercase letter
[:print:] | Same as graph, but also space and tab
[:punct:] | A punctuation character
[:space:] | Any whitespace character, including newline and return
[:upper:] | An uppercase letter
[:xdigit:] | A valid hexadecimal digit
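I have no runnable gnu.regexp example of these named classes, so the sketch below uses the JDK's java.util.regex package instead. That library spells the classes differently, for example \p{Punct}, \p{Upper} and \p{Digit} rather than [:punct:], [:upper:] and [:digit:], so treat it only as an illustration of what the named classes match. PosixClassSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp,
// where the same classes are written [:punct:], [:upper:], [:digit:]).
class PosixClassSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "Hello, World 42!";
        // Drop punctuation characters
        System.out.println(replaceAll(text, "\\p{Punct}", ""));  // Hello World 42
        // Mask uppercase letters
        System.out.println(replaceAll(text, "\\p{Upper}", "_")); // _ello, _orld 42!
        // Collapse each run of decimal digits to a single marker
        System.out.println(replaceAll(text, "\\p{Digit}+", "#")); // Hello, World #!
    }
}
```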
Branching (Alternation) Operator
a|b | matches whatever the expression a would match, or whatever the expression b would match.
Example
Text | "cat chet here het"
Pattern | "c(a|he)t"
Replacement | "dog"
Result | "dog dog here het"
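The parentheses in the example matter: without them the alternation splits the whole pattern in two. A small sketch of the difference follows, using the JDK's java.util.regex package as a stand-in for gnu.regexp (an assumption); AlternationSketch is a made-up name.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class AlternationSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "cat chet here het";
        // Grouped: "c", then either "a" or "he", then "t"
        System.out.println(replaceAll(text, "c(a|he)t", "dog")); // dog dog here het
        // Ungrouped: either "ca" or "het"
        System.out.println(replaceAll(text, "ca|het", "dog"));   // dogt cdog here dog
    }
}
```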
Subexpressions and Backreferences
(abc) | matches whatever the expression abc would match, and saves it as a subexpression. Also used for grouping.
(?:...) | pure grouping operator, does not save contents
(?#...) | embedded comment, ignored by engine
\n | where 0 < n < 10, matches the same thing the nth subexpression matched.
Examples
Text | "cat chet her"
Pattern | "c(a|he)t"
Replacement | "d$1g"
Result | "dag dheg her"

Text | "cat cata cathe chethe cheta"
Pattern | "c(a|he)t\1"
Replacement | "d$1g"
Result | "cat dag cathe dheg cheta"

Text | "catty cote code"
Pattern | "c(.)t(.)"
Replacement | "d$2g$1"
Result | "dtgay dego code"
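The three examples use a backreference in two places: \1 inside the pattern, and $1 or $2 inside the replacement. The sketch below reruns them with the JDK's java.util.regex package, which happens to use the same notation for both; it is a stand-in for gnu.regexp, and BackrefSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class BackrefSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // $1 in the replacement inserts what the first group matched
        System.out.println(replaceAll("cat chet her", "c(a|he)t", "d$1g"));
        // prints: dag dheg her

        // \1 in the pattern must repeat what the first group matched
        System.out.println(replaceAll("cat cata cathe chethe cheta", "c(a|he)t\\1", "d$1g"));
        // prints: cat dag cathe dheg cheta

        // Two groups can be swapped in the replacement
        System.out.println(replaceAll("catty cote code", "c(.)t(.)", "d$2g$1"));
        // prints: dtgay dego code
    }
}
```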
Repeating Operators
These symbols operate on the previous atomic expression.
? | matches the preceding expression or the null string
* | matches the null string or any number of repetitions of the preceding expression
+ | matches one or more repetitions of the preceding expression
{m} | matches exactly m repetitions of the preceding expression
{m,n} | matches between m and n repetitions of the preceding expression, inclusive
{m,} | matches m or more repetitions of the preceding expression
Examples
Text | "ct cat caat caaat caaaat"
Pattern | "ca+t"
Replacement | "dog"
Result | "ct dog dog dog dog"

Text | "ct cat caat caaat caaaat"
Pattern | "ca*t"
Replacement | "dog"
Result | "dog dog dog dog dog"

Text | "ct cat caat caaat caaaat"
Pattern | "ca{2,3}t"
Replacement | "dog"
Result | "ct cat dog dog caaaat"

Text | "ct cat caat caaat caaaat"
Pattern | "ca?t"
Replacement | "dog"
Result | "dog dog caat caaat caaaat"
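The examples above do not exercise {m} or {m,}. The sketch below runs all six repeating operators on the same text, using the JDK's java.util.regex package as a stand-in for gnu.regexp (an assumption); RepeatSketch is a made-up name.

```java
import java.util.regex.Pattern;

// Sketch with java.util.regex (an assumption; the notes use gnu.regexp).
class RepeatSketch {
    static String replaceAll(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "ct cat caat caaat caaaat";
        System.out.println(replaceAll(text, "ca?t", "dog"));     // dog dog caat caaat caaaat
        System.out.println(replaceAll(text, "ca*t", "dog"));     // dog dog dog dog dog
        System.out.println(replaceAll(text, "ca+t", "dog"));     // ct dog dog dog dog
        System.out.println(replaceAll(text, "ca{2}t", "dog"));   // ct cat dog caaat caaaat
        System.out.println(replaceAll(text, "ca{2,3}t", "dog")); // ct cat dog dog caaaat
        System.out.println(replaceAll(text, "ca{3,}t", "dog"));  // ct cat caat dog dog
    }
}
```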
Copyright © 1998 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA.
All rights reserved.