Use the power of regular expressions to ease text parsing and processing

If you've programmed in Perl or any other language with built-in regular-expression capabilities, then you probably know how much easier regular expressions make text processing and pattern matching. If you're unfamiliar with the term, a regular expression is simply a string of characters that defines a pattern used to search for a matching string.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, now support regular expressions for text processing, and some text editors use regular expressions for powerful search-and-replace functionality. What about Java? At the time of this writing, a Java Specification Request that includes a regular expression library for text processing has been approved; you can expect to see it in a future version of the JDK.

But what if you need a regular expression library now? Luckily, you can download the open source Jakarta-ORO library from Apache.org. In this article, I'll first give you a short primer on regular expressions, and then I'll show you how to use regular expressions with the open source Jakarta-ORO API.

Regular expressions 101

Let's start simple. Suppose you want to search for a string with the word "cat" in it; your regular expression would simply be "cat". If your search is case-insensitive, the words "catalog", "Catherine", or "sophisticated" would also match:

Regular expression: cat

The period notation

Imagine you are playing Scrabble and need a three-letter word starting with the letter "t" and ending with the letter "n". Imagine also that you have an English dictionary and will search through its entire contents for a match using a regular expression. To form such a regular expression, you would use a wildcard notation -- the period (.) character. The regular expression would then be "t.n" and would match "tan", "ten", "tin", and "ton"; it would also match "t#n", "tpn", and even "t n", as well as many other nonsensical words. This is because the period character matches any single character, including the space and the tab character (and, when the pattern is compiled in single-line mode, even line breaks):

Regular expression: t.n

The bracket notation

To solve the problem of the period's indiscriminate matches, you can specify the characters you consider meaningful with the bracket ("[]") expression, so that only those characters match the regular expression.
Thus, "t[aeio]n" would just match "tan", "ten", "tin", and "ton". "Toon" would not match because the bracket notation matches only a single character:

Regular expression: t[aeio]n

The OR operator

If you want to match "toon" in addition to all the words matched in the previous section, you can use the "|" notation, which is basically an OR operator. To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use the bracket notation here because it will only match a single character. Instead, use parentheses -- "()". You can also use parentheses for groupings (more on that later):
Regular expression: t(a|e|i|o|oo)n

The quantifier notations

Table 1 lists the quantifier notations, which determine how many times the notation immediately to the left of the quantifier should repeat.
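As a rough sketch of how quantifiers behave, the following assumes the standard Perl 5 set (`*`, `+`, `?`, `{n}`, `{n,m}`); whether Table 1 lists exactly these is an assumption. It uses `java.util.regex` from the standard JDK rather than Jakarta-ORO so that it runs without extra libraries, and the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

// Illustrative sketch: uses java.util.regex (an assumption; the article itself uses Jakarta-ORO).
class QuantifierSketch {

    // Returns true when the WHOLE input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // *  : the preceding item may appear zero or more times
        System.out.println(fullMatch("to*n", "tn"));        // true
        System.out.println(fullMatch("to*n", "tooon"));     // true
        // +  : the preceding item must appear one or more times
        System.out.println(fullMatch("to+n", "tn"));        // false
        // ?  : the preceding item may appear zero or one time
        System.out.println(fullMatch("to?n", "toon"));      // false
        // {n} : the preceding item must appear exactly n times
        System.out.println(fullMatch("to{2}n", "toon"));    // true
        // {n,m} : the preceding item must appear between n and m times
        System.out.println(fullMatch("to{1,2}n", "tooon")); // false
    }
}
```

Each quantifier applies only to the single item immediately before it (here the letter "o"); to repeat a longer sequence, group it with parentheses first.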
Let's say you want to search for a social security number in a text file. The format for US social security numbers is 999-99-9999. The regular expression you would use to match this is shown in Figure 1. In regular expressions, the hyphen ("-") has special meaning inside brackets, where it indicates a range: "[0-9]" matches any digit from 0 to 9. To make it clear that the hyphens in a social security number are literal, escape the "-" character with a backslash ("\"); outside brackets the escape is not strictly required, but it is harmless.

Figure 1. Matches: All social security numbers of the form 123-12-1234

If, in your search, you wish to make the hyphen optional -- if, say, you consider both 999-99-9999 and 999999999 acceptable formats -- you can use the "?" quantifier notation. Figure 2 shows that regular expression:

Figure 2. Matches: All social security numbers of the forms 123-12-1234 and 123121234

Let's take a look at another example. One format for US car plate numbers consists of four numeric characters followed by two letters. The regular expression first comprises the numeric part, "[0-9]{4}", followed by the textual part, "[A-Z]{2}". Figure 3 shows the complete regular expression:

Figure 3. Matches: Typical US car plate numbers, such as 8836KV

The NOT notation

The "^" notation is also called the NOT notation. If used in brackets, "^" indicates the character you don't want to match. For example, the expression in Figure 4 matches all words except those starting with the letter X.
Figure 4. Matches: All words except those that start with the letter X

The parentheses and space notations

Say you're trying to extract the birth month from a person's birthdate. The typical birthdate is in the following format: June 26, 1951. The regular expression to match the string would be like the one in Figure 5:

Figure 5. Matches: All dates with the format of Month DD, YYYY

The new "\s" notation is the space notation and matches any whitespace character, including tabs. If the string matches perfectly, how do you extract the month field? You simply put parentheses around the month field, creating a group, and later retrieve the value using the ORO API (discussed in a following section). The appropriate regular expression is in Figure 6:

Figure 6. Matches: All dates with the format Month DD, YYYY, and extracts Month field as Group 1

Other miscellaneous notations

To make life easier, some shorthand notations for commonly used regular expressions have been created, as shown in Table 2.
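Returning to the birthdate example, here is a minimal sketch of extracting the month as Group 1. The exact expression in Figure 6 is not shown here, so the pattern below is an assumed equivalent (month name, whitespace, one- or two-digit day, comma, four-digit year). It uses `java.util.regex` from the standard JDK instead of Jakarta-ORO, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: java.util.regex stands in for Jakarta-ORO here.
class BirthMonthSketch {

    // Assumed pattern (not taken from Figure 6): the parentheses make the month Group 1.
    private static final Pattern DATE =
        Pattern.compile("([A-Za-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}");

    // Returns the month name, or null if the text contains no date of this form.
    static String extractMonth(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951")); // June
        System.out.println(extractMonth("26/06/1951"));    // null
    }
}
```

Because "\s" matches tabs as well as spaces, a date such as "June", a tab, then "26, 1951" is also accepted.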
To illustrate, we can use "\d" in place of every "[0-9]" in the earlier expressions, such as the social security number expressions. The revised regular expression is in Figure 7:
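As a hedged sketch of the number formats discussed so far, the patterns below are reconstructed from the descriptions in the text rather than copied from the figures, so treat them as assumptions. They use `java.util.regex` from the standard JDK instead of Jakarta-ORO; the class, field, and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative sketch: java.util.regex stands in for Jakarta-ORO here.
class NumberFormatSketch {

    // Social security number written with [0-9] and escaped literal hyphens.
    static final Pattern SSN_BRACKETS = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");
    // The same expression using the \d shorthand.
    static final Pattern SSN_SHORTHAND = Pattern.compile("\\d{3}\\-\\d{2}\\-\\d{4}");
    // Hyphens made optional with the ? quantifier.
    static final Pattern SSN_OPTIONAL_HYPHEN = Pattern.compile("\\d{3}\\-?\\d{2}\\-?\\d{4}");
    // Car plate: four digits followed by two upper-case letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    // Returns true when the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SSN_BRACKETS, "123-12-1234"));        // true
        System.out.println(matches(SSN_SHORTHAND, "123-12-1234"));       // true
        System.out.println(matches(SSN_SHORTHAND, "123121234"));         // false
        System.out.println(matches(SSN_OPTIONAL_HYPHEN, "123121234"));   // true
        System.out.println(matches(PLATE, "8836KV"));                    // true
    }
}
```

One design caveat: because each hyphen is optional independently, the optional-hyphen pattern also accepts mixed forms such as 123-121234.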
Figure 7. Matches: All social security numbers of the form 123-12-1234

Jakarta-ORO library

Many open source regular expression libraries are available for Java programmers, and many support the Perl 5-compatible regular expression syntax. I use the Jakarta-ORO regular expression library because it is one of the most comprehensive APIs available and is fully compatible with Perl 5 regular expressions. It is also one of the most optimized APIs around.

The Jakarta-ORO library was formerly known as OROMatcher and has been kindly donated to the Jakarta Project by Daniel Savarese. You can download the package from a link in the Resources section below.

The Jakarta-ORO objects

I'll start by briefly describing the objects you need to create and access in order to use this library, and then I will show how you use the Jakarta-ORO API.

The PatternCompiler object

First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler variable:

    PatternCompiler compiler=new Perl5Compiler();

The Pattern object

To compile a regular expression into a Pattern object, call the compile() method of the compiler object, passing in the regular expression. For example, you can compile the regular expression "t[aeio]n" like so:

    Pattern pattern=null;
    try {
        pattern=compiler.compile("t[aeio]n");
    } catch (MalformedPatternException e) {
        e.printStackTrace();
    }

By default, the compiler creates a case-sensitive pattern, so that the above setup only matches "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, you would call the compiler with an additional mask:

    pattern=compiler.compile("t[aeio]n",Perl5Compiler.CASE_INSENSITIVE_MASK);

Once you've created the Pattern object, you can use it for pattern matching with a PatternMatcher.

The PatternMatcher object

Create the matcher as a Perl5Matcher assigned to a PatternMatcher variable:

    PatternMatcher matcher=new Perl5Matcher();

You can obtain a match using the matcher's contains() method, passing in the input string and the Pattern object, as the examples in the following sections do.
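You could also pass in a PatternMatcherInput object instead of a String, which lets you continue matching from where the last match ended, as the font tag example later in this article does.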
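The compile-then-match flow described above can be sketched as follows. This is an analogue written with `java.util.regex` from the standard JDK, not the Jakarta-ORO API itself: `Pattern.compile` with `Pattern.CASE_INSENSITIVE` plays the role of the compiler mask, and `Matcher.find()` plays the role of a "contains" test. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative analogue only: this is java.util.regex, not the Jakarta-ORO classes.
class CaseSensitivitySketch {

    // Returns true when the input contains a match for the regular expression.
    static boolean contains(String input, String regex, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        // Pattern.compile throws an unchecked PatternSyntaxException for an invalid expression.
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("A tin can", "t[aeio]n", false)); // true
        System.out.println(contains("A Tin can", "t[aeio]n", false)); // false
        System.out.println(contains("A Tin can", "t[aeio]n", true));  // true
    }
}
```

In real code you would compile the pattern once and reuse it, since compilation is the expensive step.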
Scenarios for using the API

Now let's discuss some example uses of the Jakarta-ORO library.

Log file processing

Your job: analyze a Web server log file and determine how long each user spends on the Website. An entry from a typical BEA WebLogic log file looks like this:

    172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "GET /IsAlive.htm HTTP/1.0" 200 15

After analyzing this entry, you'll realize that you need to extract two things from the log file: the IP address and a page's access time. You can use the grouping notation (parentheses) to extract the IP address field and the timestamp field from the log entry.

Let's first discuss the IP address. It consists of 4 bytes, each with values between 0 and 255; each byte is separated from the others by a period. Thus, in each individual byte in the IP address, you have at least one and at most three digits. You can see the regular expression for this field in Figure 8:

Figure 8. Matches: IP addresses that consist of 4 bytes, each with values between 0 and 255

You need to escape the period character because you literally want it to be there; you do not want it read in terms of its special meaning in regular expression syntax, which I explained earlier.

The log entry's timestamp part is surrounded by square brackets. You can extract whatever is within these brackets by first searching for the opening square bracket character ("[") and then extracting every character that is not a closing square bracket ("]"), continuing until you reach the closing square bracket. Figure 9 shows the regular expression for this:

Figure 9. Matches: At least one character until "]" is found

Now you combine these two regular expressions into a single expression with grouping notation (parentheses) for extraction of your IP address and timestamp. Notice that "\s-\s-\s" is added in the middle so that the whole entry matches, although you won't extract that part. You can see the complete regular expression in Figure 10.
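Before combining them, the two building blocks can be tried separately. The sketch below uses `java.util.regex` from the standard JDK instead of Jakarta-ORO. The extra numeric range check is an addition of this sketch, not part of the article's approach; it is there because a one-to-three-digit pattern on its own also accepts values above 255. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: java.util.regex stands in for Jakarta-ORO here.
class LogFieldSketch {

    // Four groups of one to three digits separated by literal (escaped) periods.
    static final Pattern IP =
        Pattern.compile("[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}");
    // An opening bracket, one or more characters that are not "]", then the closing bracket.
    static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]+)\\]");

    // True when the whole string has the dotted four-number shape.
    static boolean looksLikeIp(String s) {
        return IP.matcher(s).matches();
    }

    // Assumed extra step (not in the article): also require every number to be at most 255.
    static boolean isValidIp(String s) {
        if (!looksLikeIp(s)) {
            return false;
        }
        for (String part : s.split("\\.")) {
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    // Returns the text between the first pair of square brackets, or null if there is none.
    static String bracketContents(String s) {
        Matcher m = BRACKETED.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIp("172.26.155.241")); // true
        System.out.println(looksLikeIp("999.1.1.1"));      // true: the pattern checks shape only
        System.out.println(isValidIp("999.1.1.1"));        // false
        System.out.println(bracketContents("- - [26/Feb/2001:10:56:03 -0500] \"GET\""));
    }
}
```

For log analysis the shape-only check is usually enough, since the server wrote the addresses itself; the range check matters more when validating user input.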
Now that you've formulated this regular expression, you can begin writing Java code using the regular expression library.

Using the Jakarta-ORO library

To begin using the Jakarta-ORO library, first create the regular expression string and the sample string to parse:

    String logEntry="172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] \"GET /IsAlive.htm HTTP/1.0\" 200 15 ";
    String regexp="([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]";

The regular expression used here is nearly identical to the one found in Figure 10, with only one difference: in a Java string literal, you need to escape every backslash ("\"). Figure 10 is not Java code, so when you move the expression into Java you must double each backslash to avoid a compilation error. Unfortunately, this process is prone to error and you must do it carefully. You can type in the regular expression first without escaping the backslashes, and then visually scan the string from left to right and replace every occurrence of the "\" character with "\\". To double check, print out the resulting string to the console.

After initializing the strings, instantiate the PatternCompiler object and use it to compile the regular expression into a Pattern object:

    PatternCompiler compiler=new Perl5Compiler();
    // compile() throws MalformedPatternException if the expression is invalid
    Pattern pattern=compiler.compile(regexp);

Now, create the PatternMatcher object and call its contains() method to test whether the log entry contains the pattern. Next, print out the matched groups using the MatchResult object returned by getMatch():

    PatternMatcher matcher=new Perl5Matcher();
    if (matcher.contains(logEntry,pattern)) {
        MatchResult result=matcher.getMatch();
        System.out.println("IP: "+result.group(1));
        System.out.println("Timestamp: "+result.group(2));
    }

The output is:

    IP: 172.26.155.241
    Timestamp: 26/Feb/2001:10:56:03 -0500

HTML processing

Your next task is to churn through your company's HTML pages and perform an analysis of all of a font tag's attributes. The typical font tag in your HTML looks like this:

    <font face="Arial, Serif" size="+2" color="red">

Your program will print out the attributes for every font tag encountered in the following format:

    face: Arial, Serif
    size: +2
    color: red

In this case, I would suggest that you use two regular expressions.
The first, shown in Figure 11, extracts the attribute portion of the font tag (face="Arial, Serif" size="+2" color="red" in the example tag):

Figure 11. Matches: The all-attribute part of the font tag

The second regular expression, shown in Figure 12, breaks down each individual attribute into a name-value pair:

Figure 12. Matches: Each individual attribute, broken down into a name-value pair

Figure 12 breaks into:

    face   Arial, Serif
    size   +2
    color  red

Let's now discuss the code to achieve this. First, create the two regular expression strings and compile them into Pattern objects using a Perl5Compiler, with the case-insensitive mask so that tag and attribute names match in any case. Next, create a Perl5Matcher to perform the matching:

    String regexpForFontTag="<\\s*font\\s+([^>]*)\\s*>";
    String regexpForFontAttrib="([a-z]+)\\s*=\\s*\"([^\"]+)\"";
    PatternCompiler compiler=new Perl5Compiler();
    Pattern patternForFontTag=compiler.compile(regexpForFontTag,Perl5Compiler.CASE_INSENSITIVE_MASK);
    Pattern patternForFontAttrib=compiler.compile(regexpForFontAttrib,Perl5Compiler.CASE_INSENSITIVE_MASK);
    PatternMatcher matcher=new Perl5Matcher();

Assume you have a variable called html that holds a line from the HTML file. If it contains a font tag, the matcher first extracts the attribute portion as group 1. Next, create a PatternMatcherInput object from that portion so that you can match each attribute in turn, continuing from where the last match ended:

    if (matcher.contains(html,patternForFontTag)) {
        MatchResult result=matcher.getMatch();
        String attribs=result.group(1);
        PatternMatcherInput input=new PatternMatcherInput(attribs);
        while (matcher.contains(input,patternForFontAttrib)) {
            result=matcher.getMatch();
            System.out.println(result.group(1)+": "+result.group(2));
        }
    }

The output of the example is as follows:

    face: Arial, Serif
    size: +2
    color: red

More HTML processing

Let's continue with another HTML example. This time, imagine that your Web server has moved, so that links such as

    <a href="http://widgets./interface.html#How_To_Buy">
    <a href="http://widgets./interface.html#How_To_Sell">
    etc.

must change to

    <a href="http://newserver./interface.html#How_To_Buy">
    <a href="http://newserver./interface.html#How_To_Sell">
    etc.

The regular expression to perform the search is shown in Figure 13.
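As a rough sketch of this kind of link rewrite, the following captures the anchor name as group 1 and reuses it in the replacement via `$1`. It uses `java.util.regex` from the standard JDK rather than Jakarta-ORO, and the host names `widgets.example.com` and `newserver.example.com` are hypothetical placeholders, as are the class and method names.

```java
import java.util.regex.Pattern;

// Illustrative sketch: java.util.regex stands in for Jakarta-ORO; host names are hypothetical.
class LinkRewriteSketch {

    // Matches an anchor tag pointing at the old host and captures the text after "#" as group 1.
    private static final Pattern OLD_LINK = Pattern.compile(
        "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.example\\.com/interface\\.html#([^\"]+)\"\\s*>",
        Pattern.CASE_INSENSITIVE);

    // Replaces every old link with the new host, carrying the captured anchor name over via $1.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(
            "<a href=\"http://newserver.example.com/interface.html#$1\">");
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://widgets.example.com/interface.html#How_To_Buy\">";
        System.out.println(rewrite(link));
    }
}
```

Without the group reference in the replacement, every rewritten link would lose its anchor name, so the capture group is what makes a single substitution work for all links.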
If this regular expression is found, you can make your substitution for the link in Figure 13 with the following expression:

    <a href="http://newserver./interface.html#$1">

Notice that you use $1 after the # character. In Perl regular expression syntax, $1 refers to group 1 of the match, which here is the anchor name captured by the search expression.

Now, back to Java. As usual, you must create your testing strings, the necessary object for compiling the regular expression into a Pattern object, and a PatternMatcher object:

    String link="<a href=\"http://widgets./interface.html#How_To_Trade\">";
    String regexpForLink="<\\s*a\\s+href\\s*=\\s*\"http://widgets./interface.html#([^\"]+)\">";
    PatternCompiler compiler=new Perl5Compiler();
    Pattern patternForLink=compiler.compile(regexpForLink,Perl5Compiler.CASE_INSENSITIVE_MASK);
    PatternMatcher matcher=new Perl5Matcher();

Next, use the static method substitute() of the Util class to perform the substitution, and print out the result:

    String result=Util.substitute(matcher,
        patternForLink,
        new Perl5Substitution("<a href=\"http://newserver./interface.html#$1\">"),
        link,
        Util.SUBSTITUTE_ALL);
    System.out.println(result);

The syntax of the Util.substitute() method is as follows:

    public static String substitute(PatternMatcher matcher, Pattern pattern, Substitution sub, String input, int numSubs)

The first two parameters for this call are the PatternMatcher and Pattern objects created earlier. The third is the Substitution object that describes the replacement, the fourth is the input string, and the last is the number of substitutions to perform; Util.SUBSTITUTE_ALL replaces every occurrence.

Express yourself

In this article, I've shown you the powerful features of regular expressions. When used appropriately, they can help a great deal in string extraction and text changes. I have also shown how you can incorporate regular expressions into your Java application using the open source Jakarta-ORO library. Now, it's up to you to decide whether the old string manipulation approach (using StringTokenizers, charAt, or substring) or a regular expression library, like Jakarta-ORO, works for you.

Author Bio

Benedict Chng is a Sun-certified developer currently consulting in the Boston area. He hails from sunny and tropical Singapore and has been working in the software development field for close to four years. His current interests include writing applications for Palm devices and sightseeing in the New England region.