Regular expressions (regexes) are used to find specific strings in text. The syntax described here is defined in the POSIX 1003.2 standard as modern or extended regular expressions.
Here is a really boring example of a regex:
| regex | matches |
|---|---|
| hello | hello |
Regexes have some symbols that have special meanings, called metacharacters. Among these are . ( ) | ? + *. The dot '.' matches any single character:
| regex | matches |
|---|---|
| h.llo | hello, hallo, h8llo, h@llo, ... |
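The table above can be checked with a few lines of Python. This is only a sketch: Python's `re` module is not a strict POSIX implementation, but it shares the syntax used so far, and the list of candidate strings is made up for illustration.

```python
import re

# '.' stands for exactly one character, whatever it is.
pattern = re.compile(r"h.llo")

# Hypothetical test strings; the last two have too few / too many characters.
candidates = ["hello", "hallo", "h8llo", "h@llo", "hllo", "heello"]

# fullmatch() succeeds only if the whole string fits the regex.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)  # ['hello', 'hallo', 'h8llo', 'h@llo']
```

Note that in Python the dot does not match a newline character unless the `re.DOTALL` flag is given.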
The pipe '|' matches one regex or another:
| regex | matches |
|---|---|
| hello|hi | hello, hi |
Regexes can be nested using the '( )' metacharacters:
| regex | matches |
|---|---|
| h(e|a)llo | hello, hallo |
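To see why the parentheses matter, here is a small Python sketch comparing alternation with and without grouping. The helper name `matches` is an assumption made for this example, not part of any library.

```python
import re

def matches(regex, text):
    """True if the regex matches the whole of text (hypothetical helper)."""
    return re.fullmatch(regex, text) is not None

# Alternation on its own: either side may match.
plain = matches(r"hello|hi", "hi")
print(plain)        # True

# Parentheses limit the alternation to 'e' or 'a'.
grouped = matches(r"h(e|a)llo", "hallo")
print(grouped)      # True

# Without parentheses the pipe splits the whole regex into 'he' or 'allo'.
ungrouped = matches(r"he|allo", "hallo")
print(ungrouped)    # False
```

The pipe has the lowest precedence of all the metacharacters, so grouping is usually needed whenever an alternation is only one part of a larger regex.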
If you want to match a string with a character that may or may not occur, you can use the '?' metacharacter:
| regex | matches |
|---|---|
| mpe?g | mpg, mpeg |
If you want to match strings where a character occurs 0 or more times, you can use the '*' metacharacter. Similarly, the '+' metacharacter matches the preceding character 1 or more times:
| regex | matches |
|---|---|
| ba*h | bh, bah, baah, baaah, ... |
| ba+h | bah, baah, baaah, ... |
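The three repetition metacharacters '?', '*' and '+' can be compared side by side. The following is a minimal Python sketch; the helper `accepted` and the word lists are invented for the example.

```python
import re

def accepted(regex, words):
    """Return the words that the regex matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(regex, w)]

# '?' : the 'e' may occur zero times or once.
optional = accepted(r"mpe?g", ["mpg", "mpeg", "mpeeg"])
print(optional)   # ['mpg', 'mpeg']

# '*' : the 'a' may occur any number of times, including zero.
star = accepted(r"ba*h", ["bh", "bah", "baaah"])
print(star)       # ['bh', 'bah', 'baaah']

# '+' : the 'a' must occur at least once.
plus = accepted(r"ba+h", ["bh", "bah", "baaah"])
print(plus)       # ['bah', 'baaah']
```

The same metacharacters can follow a parenthesized group, in which case they repeat the whole group rather than a single character.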
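You can even specify the exact number of times, or the range of times, that a character may occur with '{ }' curly brackets. Inside the brackets you can write a single number, a range, or an open-ended interval, like so (the open-start form '{,2}' is not defined by POSIX itself, though many implementations accept it as shorthand for '{0,2}'):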
| regex | matches |
|---|---|
| ba{3}h | baaah |
| ba{2,3}h | baah, baaah |
| ba{,2}h | bh, bah, baah |
| ba{2,}h | baah, baaah, baaaah, ... |
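The four forms of the curly-bracket bound can be tried out as follows. This is a Python sketch with an invented helper and word list; Python happens to accept '{,2}' as well as '{0,2}', which is a property of Python's `re` module and should not be assumed of every regex implementation.

```python
import re

def accepted(regex, words):
    """Return the words that the regex matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(regex, w)]

words = ["bh", "bah", "baah", "baaah", "baaaah"]

exactly_three = accepted(r"ba{3}h", words)
print(exactly_three)   # ['baaah']

two_or_three = accepted(r"ba{2,3}h", words)
print(two_or_three)    # ['baah', 'baaah']

# '{0,2}' is the explicit spelling of "at most two".
at_most_two = accepted(r"ba{0,2}h", words)
print(at_most_two)     # ['bh', 'bah', 'baah']

at_least_two = accepted(r"ba{2,}h", words)
print(at_least_two)    # ['baah', 'baaah', 'baaaah']
```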
A set of characters can be specified using the '[ ]' operators. Inside the brackets, you list the characters to match. Conversely, '[^ ]' matches any character not listed inside the brackets. Furthermore, a range of characters can be specified with syntax like '[a-z]':
| regex | matches |
|---|---|
| h[ea]llo | hello, hallo |
| b[^abcdef]h | bgh, bhh, b5h, b@h, ... |
| h[a-c]llo | hallo, hbllo, hcllo |
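A bracket expression always matches exactly one character, which the following Python sketch illustrates. The helper `accepted` and the candidate strings are assumptions made for the example.

```python
import re

def accepted(regex, words):
    """Return the words that the regex matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(regex, w)]

# A set: one character, either 'e' or 'a'.
in_set = accepted(r"h[ea]llo", ["hello", "hallo", "hollo"])
print(in_set)     # ['hello', 'hallo']

# A negated set: one character that is none of a-f. 'bh' fails because
# the brackets still require a character to be present.
negated = accepted(r"b[^abcdef]h", ["bgh", "bah", "b5h", "b@h", "bh"])
print(negated)    # ['bgh', 'b5h', 'b@h']

# A range: one character from 'a' to 'c'.
in_range = accepted(r"h[a-c]llo", ["hallo", "hbllo", "hcllo", "hdllo"])
print(in_range)   # ['hallo', 'hbllo', 'hcllo']
```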
These are the most important of the metacharacters:
. ^ $ ( ) | ? + * { } [ ] [^ ]
The caret '^' and the dollar sign '$' match the start and the end of the string, respectively.
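The effect of the two anchors is easiest to see with a function that searches for the regex anywhere in the string. Here is a short Python sketch using `re.search`; the sample text is made up.

```python
import re

text = "hello world"

# search() looks for the regex anywhere in the string...
anywhere = bool(re.search(r"world", text))
print(anywhere)         # True

# ...unless an anchor pins it to the start or the end.
world_at_start = bool(re.search(r"^world", text))
print(world_at_start)   # False
hello_at_start = bool(re.search(r"^hello", text))
print(hello_at_start)   # True
world_at_end = bool(re.search(r"world$", text))
print(world_at_end)     # True

# Both anchors together require the whole string to match.
whole = bool(re.search(r"^hello$", text))
print(whole)            # False
```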
Using the metacharacters as ordinary characters:
Of course, if you want to match one of the characters . ^ $ ( ) | ? + * { } [ ] as an ordinary character, you need a trick: prefix the character with a backslash '\'. This is called escaping the character:
| regex | matches |
|---|---|
| \?+ | ?, ??, ???, ... |
| www\.h[ea]llo\.org | www.hello.org , www.hallo.org |
If you want to match a backslash, you should also escape it: '\\'.
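Forgetting to escape a dot is a common mistake, because the unescaped regex still matches the intended strings and only fails by matching too much. The Python sketch below shows the difference; the sample strings are invented. It also uses `re.escape`, a Python standard library function that escapes a plain string for use as a regex.

```python
import re

# Unescaped, each '.' matches any character, so this is too permissive.
loose = r"www.h[ea]llo.org"
# Escaped, each '\.' matches only a literal dot.
strict = r"www\.h[ea]llo\.org"

loose_hit = bool(re.fullmatch(loose, "wwwxhelloxorg"))
print(loose_hit)     # True
strict_hit = bool(re.fullmatch(strict, "wwwxhelloxorg"))
print(strict_hit)    # False
strict_ok = bool(re.fullmatch(strict, "www.hallo.org"))
print(strict_ok)     # True

# A literal backslash is written as '\\' in the regex.
backslash_ok = bool(re.fullmatch(r"C:\\temp", "C:\\temp"))
print(backslash_ok)  # True

# re.escape() escapes the metacharacters in a plain string for you.
literal = re.escape("what?+")
escaped_ok = bool(re.fullmatch(literal, "what?+"))
print(escaped_ok)    # True
```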
The basic syntax can be combined for more sophisticated matching:
| regex | matches |
|---|---|
| h([ea]llo|i) | hello, hallo, hi |
| (cos|sin)\([xy]\) | cos(x), sin(x), cos(y), sin(y) |
| .+\.(mpe?g|avi|mov|qt|wmv) | movie file names |
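As a sketch of how the movie file name regex behaves in practice, the Python snippet below filters a list of file names. The names are hypothetical, and the choice between `fullmatch` and `search` is an assumption about what "matches" should mean here.

```python
import re

MOVIE = re.compile(r".+\.(mpe?g|avi|mov|qt|wmv)")

# Hypothetical file names for illustration.
names = ["holiday.mpg", "clip.mpeg", "talk.avi", "notes.txt", ".mov", "film.mov.bak"]

# fullmatch(): the whole name must be "something, a dot, a known extension".
movies = [n for n in names if MOVIE.fullmatch(n)]
print(movies)   # ['holiday.mpg', 'clip.mpeg', 'talk.avi']

# search(): the regex may match only part of the name, so a backup file
# such as 'film.mov.bak' slips through.
partial = MOVIE.search("film.mov.bak") is not None
print(partial)  # True
```

'.mov' is rejected because '.+' requires at least one character before the dot. When only a search function is available, appending '$' to the regex has the same effect as `fullmatch` for these names.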
The power of regexes really shows when doing substitution, which is basically the same as 'search and replace'. It is an important feature of editors and programming languages.
In general, the syntax for substitution is
s/search/replace/g
's' means 'substitute' and 'g' means 'global', signifying that the substitution should be done for all matches in the string. A simple example:
| string | substitution command | result |
|---|---|---|
| abracadabra | s/a/u/g | ubrucudubru |
| hi | s/i/ello/g | hello |
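Python has no 's/search/replace/g' command, but its `re.sub` function does the same job. The sketch below reproduces the two rows of the table; mapping a missing 'g' to `count=1` is how the flag is usually emulated in Python.

```python
import re

# s/a/u/g : replace every 'a' with 'u' (re.sub is global by default).
all_replaced = re.sub(r"a", "u", "abracadabra")
print(all_replaced)    # ubrucudubru

# s/a/u/ without 'g': only the first match is replaced; count=1 plays that role.
first_replaced = re.sub(r"a", "u", "abracadabra", count=1)
print(first_replaced)  # ubracadabra

# s/i/ello/g
greeting = re.sub(r"i", "ello", "hi")
print(greeting)        # hello
```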
When the search string is a regex, the substitution replaces all the substrings matching the regex:
| string | substitution command | result |
|---|---|---|
| hi and hello | s/h([ea]llo|i)/good morning/g | good morning and good morning |
| hi | s/.*/hello/g | hello |
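The same substitution can be sketched in Python with `re.sub`. The second call is an added illustration, not from the table: it shows that the regex is replaced wherever it occurs, including inside longer words.

```python
import re

greeting = r"h([ea]llo|i)"

polite = re.sub(greeting, "good morning", "hi and hello")
print(polite)     # good morning and good morning

# The regex is matched wherever it occurs, even in the middle of a word.
mangled = re.sub(greeting, "good morning", "this")
print(mangled)    # tgood mornings

# '.+' replaces the whole (non-empty) string in one go.
replaced = re.sub(r".+", "hello", "hi")
print(replaced)   # hello
```

The last call uses '.+' rather than '.*' on purpose. With the 'g' flag, implementations differ on whether '.*' also replaces the empty match left at the end of the string, so '.+' gives the same result everywhere.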
The replacement is not a regex, which, if you think about it, is quite understandable. However, the replacement string does have some special characters used for more flexible substitution.
The parts of the search regex enclosed in parentheses '( )' are called groups. In the replacement string, the text that a group actually matched can be inserted with '\n', where n is the number of the group. So the first group is inserted by writing '\1', and so on. This calls for a few examples:
| string | substitution command | result |
|---|---|---|
| hello there | s/(hello|hi) there/\1, yourself/g | hello, yourself |
| hi there | s/(hello|hi) there/\1, yourself/g | hi, yourself |
| bra | s/(.*)/a\1cada\1/g | abracadabra |
| 2x | s/([0-9]+)([a-z])/\2 times \1/g | x times 2 |
| 9a + 12c | s/([0-9]+)([a-z])/\2 times \1/g | a times 9 + c times 12 |
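Python's `re.sub` understands the same '\1', '\2' notation in the replacement string, so the table can be reproduced almost literally. This is a sketch, with one deliberate change: the 'bra' example uses '(.+)' instead of '(.*)', because implementations differ on how a global substitution treats the empty match at the end of the string.

```python
import re

greet = r"(hello|hi) there"
# \1 is whatever the first group matched: 'hello' or 'hi'.
reply_hello = re.sub(greet, r"\1, yourself", "hello there")
print(reply_hello)   # hello, yourself
reply_hi = re.sub(greet, r"\1, yourself", "hi there")
print(reply_hi)      # hi, yourself

# A group may be inserted more than once.
magic = re.sub(r"(.+)", r"a\1cada\1", "bra")
print(magic)         # abracadabra

swap = r"([0-9]+)([a-z])"
# Two groups can be swapped by referring to them in reverse order.
single = re.sub(swap, r"\2 times \1", "2x")
print(single)        # x times 2
several = re.sub(swap, r"\2 times \1", "9a + 12c")
print(several)       # a times 9 + c times 12
```

The replacement strings are written as raw strings (r"...") so that Python itself does not interpret the backslashes before `re.sub` sees them.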
Below are some programs that use regular expressions:
| Program | Function | RegEx type | Example of Usage |
|---|---|---|---|
| vim | editor | variant | %s/\(hello\|hi\) there/\1, yourself/g |
| find | file finding | emacs variant | $ find /usr/bin -regex ".*\(.+\)/f\1." |
| grep | text searching | POSIX (egrep) | $ egrep "[abcdr]{9,}" /usr/share/dict/words |
| sed | text filtering | POSIX (with -r option) | $ echo "hi there" | |
| PHP | programming language | POSIX | <?php |
| Python | programming language | POSIX (with extensions) | import re |
| Ruby | programming language | POSIX (with extensions) | a = "hi there" |
| Perl | programming language | POSIX (with extensions) | $s = "hi there"; |
| awk | programming language | POSIX (with extensions) | $ echo "hi there" | |
man 7 regex