Google Analytics

Search

To search for specific articles you can use advanced Google features. Go to www.google.com and enter "site:darrellgrainger.blogspot.com" before your search terms, e.g.

site:darrellgrainger.blogspot.com CSS selectors

will search for "CSS selectors" but only on my site.


Wednesday, April 4, 2012

Regular Expression


I've been processing files and data a lot lately which means I've been using Regular Expressions.

Regular Expressions is a very powerful pattern matching tool. If you have used MSDOS or Bourne shell you are familiar with wildcards like "*.txt" will match all files ending with .txt. Regular expressions take this to a whole new level.

First thing to note is there are different implementations of Regular Expression. The basic concepts are the same and most the syntax is the same but there are subtle differences. I'll talk more about this as I give examples of the language.

The second thing to note is, some of the special symbols from MSDOS or Bourne shell are used by Regular Expression but they have a completely different meaning. Most notably is the asterisk (*).

The example above, "*.txt", would be a bad Regular Expression. Why? The asterisk means the previous character zero or more times. There is no character preceding the asterisk so it is a syntax error.

For simple things like "*.txt", Regular Expression can be overly complex. The dot (.) means any character. So if I want to emulate the "*" of MSDOS, I would use ".*" in Regular Expression. If I wanted an actual dot I would use "\." in Regular Expression. So the whole "*.txt" in MSDOS becomes ".*\.txt" in Regular Expression. In most languages, the "\." would get processes as a control character by the String implementation. The slash (\) would never make it to the Regular Expression parser. So if you want "\." to reach the Regular Expression parser, you need to use "\\." because the String implementation will parse this, resulting in "\.", then pass it to the Regular Expression parser.

The language I use most right now is Java. If you look at the Java API documentation for the Pattern class you will see this is the Regular Expression parser.

Some of the basic stuff:

  • Anything not a special character is matched verbatim. So in my example above "txt" only matches "txt".
  • If you want to match a special character you need to escape it. From my example, the dot is a special character. To match a dot and nothing else you use "\\.". I use double slash because Java will parse the "\\" before sending it to Pattern.
  • Special characters from things like println() or printf() work the same in Regular Expression. These are "\t" for tab, "\n" for newline, "\r" for return, "\f" for form-feed, "\a" for a bell. A bell in ASCII is control-g or "\x07" but "\a" is better because you shouldn't assume ASCII.
  • You can have a sets using square brackets. If I have "[abc]" this will match "a", "b" or "c".
  • You can use the square brackets for negation. If the first character in the set is caret (^) it means 'not'. For example, "[^abc]" would match anything not "a", not "b" and not "c".
  • If you want all digits you could use "[0123456789]" but there is a shorthand for this. A range can be specified using a minus (-) symbol. This example would be "[0-9]". You can also do things like "[a-z]" but alphabetic strings can be problematic if you allow different character sets.
  • If you want upper and lower case letters you might think "[a-Z]" would work but this is an error. The letter 'Z' in ASCII has a value of 90 and 'a' has a value of 97. Second attempt might be "[A-z]". This is closer but in ASCII the symbols '[', '\', ']', '^', '_' and '`' are between 'Z' and 'a'. So you have too many characters in this set. The solution is a union (like in Set Theory). You want "[a-z]" union "[A-Z]". In Regular Expression this is written as "[a-zA-Z]".
  • You can also write a union as "[a-z[A-Z]]". This might seem like extra typing and in some cases it is. What if you wanted all consonants? That would be 21 letters uppercase and 21 letters lower case. A string with 42 letters (you cannot really use a single range). You could use "[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]" but even that is a little ugly. How about: "[a-zA-Z[^aeiouAEIOU]]". When I look at that it is pretty obvious what I'm trying to match. It reads as "all letters but not vowels".
  • There is 'syntactic sugar' for some things:
    • Rather than "[0-9]" I can use "\d" (the d is for digit)
    • Rather than "[^0-9]" I can use "\D" (uppercase implies NOT)
    • Rather than "[ \t\n\x0b\\f\r]" I can use "\s" (the s is for space or whiteSpace)
    • Rather than "[^ \t\n\x0b\\f\r]" I can use "\S" (uppercase implies NOT)
  • A 'word' is a String made of letters, digits or underscore. A character of a 'word' therefore would be: "[a-zA-Z\d_]". Syntactic sugar for this is "\w".
  • Alternately, "\W" is for not a 'word' character.

Some slightly more advanced stuff would be boundary qualifiers:

  • The caret (^) not in a set means beginning of line. So if I have the string "^a" it matches if 'a' is the first character in the string. With wildcards or substring matching this can be very helpful. For example, "^def" will not match a substring check with "abcdefghi" but "def" will match.
  • The dollar ($) is for end of line. For example, "def$" will not match "abcdefghi" but "def" will match.
  • Capture groups are used for substitution. For example, if I have a string with my full name, "Darrell Grainger" and I want to change it to "Grainger, Darrell" I would do the following:
String name = "Darrell Grainger";
String flip = name.replaceFirst("(\\w*) (\\w*)", "$2, $1");
  • The "\\w*" means get the first word. It will match "Darrell". By wrapping it with parenthesis it becomes a 'capture group'. So the first "(\\w*)" gets saved into "$1" and the second "(\\w*)" gets saved into "$2".  In other implementations of Regular Expression, capture groups are saved into things like "\1" rather than "$1".
  • Capture groups are great if you are processing a number of strings in an array. This example will flip the first and second word for any set of strings.
More advance stuff would be Greedy quantifiers versus Reluctant quantifiers. Lets look at this with capture groups.
String s = "aaabbbaaa";
String s1 = s.replaceFirst("(a*)(.*)", "$2 $1");
String s2 = s.replaceFirst("(a*?)(.*)", "$2 $1");
The string s1 will contain "bbbaaa aaa".
The string s2 will contain "aaabbbaaa ".

For s1, what happened is "(a*)" matched "aaa" and "(.*)" matched "bbbaaa".
For s2, what happened is "(a*?)" was a Reluctant quantifier. Because "(.*)" is a Greedy quantifier, it captured everything. This left nothing for "(a*?)" to capture.

What happens under the hood is that the Regular Expression parser will find the Greedy quantifiers, read in the entire string and see if it matches. If it does not it pushes one character back out, checks for a match, pushes a character back out, checks for a match. It keeps doing this until it finds a match. Whatever didn't match is used to process Reluctant quantifiers. 

While processing the Reluctant quantifiers the parser will read in one character, check for a match, read another character, check for a match, read another character, check for a match. It keeps doing this so long as things are matching. The moment there isn't a match it stops.

So the s1 string processed "(a*)" first, because it is a Greedy quantifier and captured "aaa" into "$1". Then it processed "(.*)" which matched the rest of the string. This captured "bbbaaa" into "$2".

With the string s2 it processed "(.*)" because it is a Greedy quantifier and "(a*?)" is a Reluctant quantifier. The "(.*)" grabbed the entire string and put it into "$2". This left an empty string "". The empty string is used to process the Reluctant quantifier "(a*?)" and "" gets captured into "$1".

Here is a table of the Greedy versus Reluctant quantifiers:


Greedy Reluctant Meaning
X? X?? X, once or not at all
X* X*? X, zero or more times
X+ X+? X, one or more times
X{n} X{n}? X, exactly n times
X{n,} X{n,}? X, at least n times
X{n,m} X{n,m}? X, at least n but not more than m times


There is more the Regular Expressions but this information is what you need for most situations.
.

No comments: