Monday, May 3, 2010

Regular Expression Patterns.

Regular Expressions have a formal definition as part of the POSIX standard. Different symbols define how a pattern is described. Let’s start with the basic symbols.
Characters and Numbers represent themselves. If you are searching for ‘abc’ then the matching pattern is abc.
Period (.) represents any single character or number. The pattern ‘b.e’ will match bee, bye, b3e but not bei, or b55e. Likewise the pattern ‘..-..=…’ matches any two characters, followed by a dash, followed by any two characters, followed by an equal sign, followed by any three characters.
Star (*) represent zero or more characters. The pattern ‘b.*e will match bee, bye, beee, bzyxe and be. The pattern
‘..-..=.*’ can end with zero or more characters after the equal sign.
Plus (+) represents one or more characters. This pattern is the same a ‘.*’ except that there must be one character. Using the pattern ‘b.+e’ the string “be” will not match.
Question Mark (?) represents zero or one character. The pattern ‘..-..=.?’ can only end with one character or no character after the equal sign.
If I wanted to match a US telephone number, I could use the pattern ‘…-…-….’. This pattern will match any three characters, followed by a dash, followed by 3 more characters, followed by a dash and four final characters. So the string “123-456-7891” will match this pattern. Well so will “abc-def-ghij”. So this simple patter will match a lot of strings that are not phone numbers. We will improve on this pattern in a moment.
Brackets are used to define numbers of characters.
{count} defines an exact number of characters. The pattern a{3) defines exactly three character ‘a’. Used with the period, the {count} defines the number of characters. The phone number example could be written as the pattern
‘.{3}-.{3}-.{4}’.
Note: When used with many applications, the bracket already has a meaning and to use it in a expression is must be escaped, normally with a slash. ‘.\{3\}-\{3\}-\{4\}’ In this example the slash ‘\’ simply escapes the bracket. With Oracle, this is not necessary.
{min,max} defines a minimum and maximum number of characters. The pattern ‘.{2,8}’ will match any 2 or more characters, up to 8 characters.
{min,} defines the minimum or more number of characters. The pattern ‘sto{1,}p’ will match any string that has ‘st’ followed by one or more ‘o’, followed by a ‘p’. This includes stop, stoop, stooooop, but not stp or stoip.
Square Brackets are used to define a subset of the expression. Any one character in the bracket will match the pattern. If I want only a number character then I could use a pattern like ‘[0123456789]’. The phone number pattern could be written as: ‘[0123456789]{3}-[0123456789]{3}-[0123456789]{4}’
With this pattern, I have excluded all the letters from matching strings. A range of characters can also be defined in square brackets. This is easier to type and read. The range is defined using the dash between the min and max. The phone example now becomes: ‘[0-9]{3}-[0-9]{3}-[0-9]{4}’
Ranges of letters can also be defined. The pattern ‘st[aeiou][A-Za-z]’ matches any string with the characters ‘st’ followed by a vowel, followed by any character, upper or lower case. This pattern matches stop, stay, staY, stud. The pattern ‘abc[1-9]’ matches abc1, abc2, abc3,…
The caret [^] in square brackets matches any character except those following the caret. The pattern ‘st[^o]p will match step, strp, but not stop.
So far, all the patterns match is the pattern is found anywhere in the line of text. Use the caret and dollar sign to define patterns that match the start or end of a string.
^ defines that start of a string or column 1 of the string.
$ defines the end of a string or the last column. This does not included carriage returns and line feeds.
The pattern ‘^St[a-z]*’ matches a string that starts with ‘St’ followed by zero or more lower case letters. The pattern ‘stop$’ only matches “stop” if it is the last word on the line. or vertical line defines the Boolean OR. The patter ‘[1-9][a-z]’ matched any number or lower case letter. The pattern ‘stopstep’ matches the strings stop or step. \ or backward slash is the escape character. This is use to tell the parser that the character following it is to be taken literally. In the note earlier, it was pointed out that some characters have special meaning in some applications and must be escaped to tell the application to use that literal character. Another reason to escape a character is when you want to actually use the character in your matching pattern. For example, if you want to match a number that has two decimal places you could use the pattern: ‘[0-9]+.[0-9]{2}’
This example looks right but will not match the pattern that we are looking for. The period we use to represent the decimal place, will actually match any character. We must tell the expression parser that we want the character period and we do that by escaping the period character. ‘[0-9]+\.[0-9]{2}’
Now the pattern will match one or more digits followed by a period and exactly two digits.
Class Operators
Class operators are used as an alternative way to define classes of characters.
They are defined in the format [: class :].
[:digit:] Any digit
[:alpha:] Any upper or lower case letter
[:lower:] Any lower case letter
[:upper:] Any upper case letter
[:alnum:] Any upper or lower case letter or number
[:xdigit:] Any hex digit
[:blank:] Space or Tab
[:space:] Space, tab, return, line feed, form feed
[:cntrl:] Control Character, non printing
[:print:] Printable character including a space
[:graph:] Printable characters, excluding space.
[:punct:] Punctuation character, not a control character or alphanumeric
Again, these class operators represent other characters. The phone number example can be rewritten using class operators. ‘[:digit:]{3}-[:digit:]{3}-[:digit:]{4}’
Being Greedy
Regular expressions are greedy. By this we mean that the expression will match the largest string it can. Think of it as the expression parser takes the entire sting and compares it to the pattern. The parser then gives back characters until it finds that the string has no match or if finds the match.
Lets use a string ‘0123423434’. If my pattern is ‘.*4’ (zero of more characters followed by the digit 4).
The first match will be the entire sting.
Expression Grouping
Expression Grouping allows part of the pattern to be grouped. This is also called tagging or referencing. You group an expression by surrounding it with parens. There can be only 9 groups in a pattern. Below is an example that contains two groups. ‘([a-z]+) ([a-z]+)’
This pattern matches two lower case words. Using a string defined as ‘fast stop’, the first group would contain ‘fast’ and the second group ‘stop’. The groups are referenced by a backward slash and the group number. ‘\1’ references ‘fast’ while ‘\2’ reference ‘stop’. Thus \2 \1 results in ‘stop fast’.

No comments: