Regular Expression Quick Start

windycityboy 2013-09-15

This quick start gets you up to speed with regular expressions. A brief introduction like this cannot explain everything there is to know about them. For detailed information, consult the regular expression tutorial. Each topic in the quick start corresponds to a topic in the tutorial, so you can easily go back and forth between the two.

Text Patterns and Matches

A regular expression, or regex for short, is a pattern describing a certain amount of text. On this website, regular expressions are highlighted in red as regex.

This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text regex. Matches are highlighted in blue on this site.

I will use the term "string" to indicate the text that I am applying the regular expression to. I will highlight strings in green.

Literal Characters

The most basic regular expression consists of a single literal character, e.g.: a. It will match the first occurrence of that character in the string. If the string is Jack is a boy, it will match the a after the J.

This regex can match the second a too. It will only do so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its "Find Next" or "Search Forward" function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.
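As a rough illustration, here is how "find first" and "find next" might look in Python with the standard re module. The choice of Python is an assumption made for these examples; the quick start itself is not tied to any language.

```python
import re

text = "Jack is a boy"
pattern = re.compile(r"a")

# First match: the a right after the J.
first = pattern.search(text)
print(first.start())  # 1

# "Find Next": resume searching where the previous match ended.
second = pattern.search(text, first.end())
print(second.start())  # 8

# finditer walks through every match in one pass.
positions = [m.start() for m in pattern.finditer(text)]
print(positions)  # [1, 8]
```

Other languages offer similar calls under different names, so treat the function names above as specific to Python.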

Eleven characters have special meanings: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ). These special characters are often called "metacharacters".

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
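A minimal Python sketch of the difference between the escaped and unescaped plus sign. The helper re.escape is part of Python's standard library; the variable names are only illustrative.

```python
import re

text = "1+1=2"

# Unescaped, the plus is a quantifier: "one or more 1s, then 1=2".
unescaped = re.search(r"1+1=2", text)
print(unescaped)  # None

# The unescaped pattern matches a different string instead.
other = re.search(r"1+1=2", "111=2").group()
print(other)  # 111=2

# Escaped, the plus is a literal character.
escaped = re.search(r"1\+1=2", text)
print(escaped.group())  # 1+1=2

# re.escape builds an escaped pattern from arbitrary literal text.
auto_pattern = re.escape(text)
print(re.fullmatch(auto_pattern, text) is not None)  # True
```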

Learn more about literal characters

Character Classes or Character Sets

A "character class" matches only one out of several characters. To match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter.

You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.

Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.
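The character class examples above can be tried out with a short Python sketch. The test strings are made up for illustration.

```python
import re

# [ae] matches exactly one character, either a or e.
grays = re.findall(r"gr[ae]y", "gray grey graay graey")
print(grays)  # ['gray', 'grey']

# Several ranges in one class: a single hexadecimal digit.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F, zz")
print(hex_digits)  # ['0', '1', 'F']

# Negated class: q followed by any one character that is not x.
in_question = re.search(r"q[^x]", "question").group()
print(in_question)  # qu

# In Iraq nothing follows the q, so the negated class has nothing to match.
in_iraq = re.search(r"q[^x]", "Iraq")
print(in_iraq)  # None
```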

Learn more about character classes

Shorthand Character Classes

\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks). The actual characters matched by the shorthands depend on the software you're using. Usually, non-English letters and numbers are included.
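A small Python sketch of the three shorthands. It also shows how one engine handles non-English letters: in Python 3, \w on text strings is Unicode-aware by default, and the re.ASCII flag restricts it to ASCII. Other engines may behave differently.

```python
import re

text = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '3']

words = re.findall(r"\w+", text)
print(words)  # ['Room', '42', 'floor_3']

spaces = re.findall(r"\s", text)
print(spaces)  # [' ', '\t', '\n']

# Unicode-aware by default; ASCII-only when re.ASCII is set.
accented = "na\u00efve caf\u00e9"  # naïve café
unicode_words = re.findall(r"\w+", accented)
ascii_words = re.findall(r"\w+", accented, re.ASCII)
print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```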

Learn more about shorthand character classes

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

Use \xFF to match a specific character by its hexadecimal index in the character set. E.g. \xA9 matches the copyright symbol in the Latin-1 character set.

If your regular expression engine supports Unicode, use \uFFFF to insert a Unicode character. E.g. \u20AC matches the euro currency sign.

All non-printable characters can be used directly in the regular expression, or as part of a character class.
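A brief Python sketch of these escapes, assuming an engine that accepts \t, \r, \n, \xFF and \uFFFF, as Python's re module does for text patterns. Note that Python's re does not accept \e, so it is left out here.

```python
import re

line = "name\tvalue\r\n"

# Position of the tab character.
tab_pos = re.search(r"\t", line).start()
print(tab_pos)  # 4

# A Windows-style line ending.
has_crlf = re.search(r"\r\n", line) is not None
print(has_crlf)  # True

# Hexadecimal and Unicode escapes.
copyright_sign = re.search(r"\xA9", "\u00a9 2013").group()
euro_sign = re.search(r"\u20AC", "Price: \u20ac5").group()
print(copyright_sign, euro_sign)  # © €

# Non-printable characters inside a character class.
fields = re.split(r"[\t\n]", "a\tb\nc")
print(fields)  # ['a', 'b', 'c']
```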

Learn more about non-printable characters

The Dot Matches (Almost) Any Character

The dot matches a single character, except line break characters. It is short for [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors). Most regex engines have a "dot matches all" or "single line" mode that makes the dot match any single character, including line breaks.

gr.y matches gray, grey, gr%y, etc. Use the dot sparingly. Often, a character class or negated character class is faster and more precise.
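The dot and the "dot matches all" mode might be sketched in Python as follows. In Python's re module the mode is enabled with the re.DOTALL flag; the flag name differs between engines.

```python
import re

# The dot matches any single character except a line break.
matches = re.findall(r"gr.y", "gray grey gr%y gr\ny")
print(matches)  # ['gray', 'grey', 'gr%y']

# By default the dot refuses to match the line feed.
default_mode = re.search(r"gr.y", "gr\ny")
print(default_mode)  # None

# With "dot matches all" enabled, it matches the line feed too.
dotall_mode = re.search(r"gr.y", "gr\ny", re.DOTALL)
print(dotall_mode is not None)  # True

# A character class is more precise when only a or e is wanted.
precise = re.findall(r"gr[ae]y", "gray grey gr%y")
print(precise)  # ['gray', 'grey']
```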

Learn more about the dot

Anchors

Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.

\b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w. \b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters. \B matches at every position where \b cannot match.
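A Python sketch of the anchors described in this section. Match positions are printed rather than match text, since anchors themselves match no characters. The sample strings are invented, and re.MULTILINE is Python's name for the multi-line mode.

```python
import re

# ^b matches only the b at the start of the string.
only_first = [m.start() for m in re.finditer(r"^b", "bob")]
print(only_first)  # [0]

lines = "bob\nbill"

# Without multi-line mode, ^ matches only at the very start.
default_mode = [m.start() for m in re.finditer(r"^b", lines)]
print(default_mode)  # [0]

# With multi-line mode, ^ also matches after the line break.
multi_mode = [m.start() for m in re.finditer(r"^b", lines, re.MULTILINE)]
print(multi_mode)  # [0, 4]

# In multi-line mode, $ matches before the line break and at the end.
line_ends = [m.start() for m in re.finditer(r"\w$", lines, re.MULTILINE)]
print(line_ends)  # [2, 7]

# \b restricts the match to whole words.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", "cat concat cats cat.")]
print(whole_words)  # [0, 16]

# \B matches only where \b cannot, here inside "concat".
inside_word = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]
print(inside_word)  # [7]
```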

Learn more about anchors

Alternation

Alternation is the regular expression equivalent of "or". cat|dog will match cat in About cats and dogs. If the regex is applied again, it will match dog. You can add as many alternatives as you want, e.g.: cat|dog|mouse|fish.
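In Python, applying the regex a first and a second time could look like this sketch. The second test sentence is made up for illustration.

```python
import re

text = "About cats and dogs"
pattern = re.compile(r"cat|dog")

# First application finds cat.
first = pattern.search(text)
print(first.group())  # cat

# Applying the regex again after the first match finds dog.
second = pattern.search(text, first.end())
print(second.group())  # dog

# More alternatives work the same way.
animals = re.findall(r"cat|dog|mouse|fish", "A mouse, a fish and a dog")
print(animals)  # ['mouse', 'fish', 'dog']
```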

Learn more about alternation

Repetition

The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches colour or color.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
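The quantifier examples from this section, sketched in Python against a few invented test strings:

```python
import re

# ? makes the u optional.
colors = re.findall(r"colou?r", "color colour colouur")
print(colors)  # ['color', 'colour']

tags = "<b> <h1> <1>"

# A letter followed by zero or more letters or digits.
strict = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", tags)
print(strict)  # ['<b>', '<h1>']

# Easier to write, but it also accepts the invalid tag <1>.
loose = re.findall(r"<[A-Za-z0-9]+>", tags)
print(loose)  # ['<b>', '<h1>', '<1>']

# Exactly three more digits: numbers from 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")
print(four_digits)  # ['1000', '9999']

# Two to four more digits: numbers from 100 to 99999.
ranged = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(ranged)  # ['100', '99999']
```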

Learn more about quantifiers

Greedy and Lazy Repetition

The repetition operators or quantifiers are greedy. They will expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. The regex <.+> will match <EM>first</EM> in This is a <EM>first</EM> test.

Place a question mark after the quantifier to make it lazy. <.+?> will match <EM> in the above string.

A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.
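A Python sketch comparing the three approaches. The test string, a sentence containing one pair of EM tags, is an assumption chosen to make the difference visible.

```python
import re

text = "This is a <EM>first</EM> test."

# Greedy: .+ runs to the end, then backs off to the last >.
greedy = re.search(r"<.+>", text).group()
print(greedy)  # <EM>first</EM>

# Lazy: .+? stops at the first > it can reach.
lazy = re.search(r"<.+?>", text).group()
print(lazy)  # <EM>

# Negated character class: cannot run past a > in the first place.
negated = re.findall(r"<[^<>]+>", text)
print(negated)  # ['<EM>', '</EM>']
```

The negated class gets the same result as the lazy quantifier on this input without the engine having to expand the match one character at a time.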

Learn more about greedy and lazy quantifiers

Grouping and Backreferences

Place round brackets around multiple tokens to group them together. You can then apply a quantifier to the group. E.g. Set(Value)? matches Set or SetValue.

Round brackets create a capturing group. The above example has one group. After the match, group number one will contain nothing if Set was matched or Value if SetValue was matched. How to access the group's contents depends on the software or programming language you're using. Group zero always contains the entire regex match.

Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is more efficient if you don't plan to use the group's contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.
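How group contents are accessed depends on the language. As one example, Python exposes them through the match object's group method, and reports a group that did not participate in the match as None:

```python
import re

capturing = re.compile(r"Set(Value)?")

# Group 0 is the whole match, group 1 the optional part.
with_value = capturing.fullmatch("SetValue")
print(with_value.group(0))  # SetValue
print(with_value.group(1))  # Value

# When only Set is matched, group 1 holds nothing (None in Python).
without_value = capturing.fullmatch("Set")
print(without_value.group(0))  # Set
print(without_value.group(1))  # None

# The non-capturing version matches the same text but creates no group.
non_capturing = re.compile(r"Set(?:Value)?")
print(capturing.groups, non_capturing.groups)  # 1 0
```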

Learn more about grouping and backreferences

Unicode Properties

\p{L} matches a single character that has a given Unicode property. L stands for letter. \P{L} matches a single character that does not have the given Unicode property. You can find a complete list of Unicode properties in the tutorial.
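Not every engine supports this syntax. Python's standard re module, for instance, does not accept \p{L}. The sketch below is therefore only an approximation of what \p{L} and \P{L} select, using the Unicode general category from the standard unicodedata module. The helper name is_letter is hypothetical.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Rough stand-in for the L property: the Unicode general
    # category of a letter starts with "L" (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")


text = "na\u00efve 42 \u03a9!"  # naïve 42 Ω!

# Characters that the letter property would match.
letters = [ch for ch in text if is_letter(ch)]
print(letters)  # ['n', 'a', 'ï', 'v', 'e', 'Ω']

# Characters that the negated property would match.
non_letters = [ch for ch in text if not is_letter(ch)]
print(non_letters)  # [' ', '4', '2', ' ', '!']
```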

Learn more about Unicode regular expressions

Lookaround

Lookaround is a special kind of group. The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result. Lookaround matches a position, just like anchors. It does not expand the regex match.

q(?=u) matches the q in question, but not in Iraq. This is positive lookahead. The u is not part of the overall regex match. The lookahead matches at each position in the string before a u.

q(?!u) matches q in Iraq but not in question. This is negative lookahead. The tokens inside the lookahead are attempted, their match is discarded, and the result is inverted.

To look backwards, use lookbehind. (?<=a)b matches the b in abc. This is positive lookbehind. (?<!a)b fails to match abc.

You can use a full-fledged regular expression inside the lookahead. Most regular expression engines only allow literal characters and alternation inside lookbehind, since they cannot apply regular expressions backwards.
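The lookaround examples can be checked with a Python sketch. Python's re module is one of the engines with a lookbehind restriction: it requires the lookbehind to have a fixed width and rejects a pattern such as (?<=a+)b when compiling it.

```python
import re

# Positive lookahead: q only when a u follows; the u is not consumed.
q_before_u = re.search(r"q(?=u)", "question").group()
print(q_before_u)  # q
print(re.search(r"q(?=u)", "Iraq"))  # None

# Negative lookahead: q only when no u follows.
q_without_u = re.search(r"q(?!u)", "Iraq").start()
print(q_without_u)  # 3
print(re.search(r"q(?!u)", "question"))  # None

# Positive lookbehind: b only when preceded by a.
b_after_a = re.search(r"(?<=a)b", "abc").start()
print(b_after_a)  # 1

# Negative lookbehind: the only b in abc follows an a, so no match.
b_not_after_a = re.search(r"(?<!a)b", "abc")
print(b_not_after_a)  # None

# Variable-width lookbehind is rejected by this engine.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```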

Learn more about lookaround