Regular Expressions

Perhaps surprisingly, regular expressions are not built into the language itself; they are provided by a standard module called re. A regular expression (regex) is a way of extracting data based on a pattern, and many languages support them. In simple terms, the regex is the pattern you are looking for (matching) in a string or piece of text; once you have found the matching text you can remove it, change it, and so on.

Let's see a simple regular expression example.

Simple regex

import re                                               # import the regular expression module

regexp = re.compile("hello")                            # a simple regex, the word hello
count = 0
text = "hello world and a hello to you all"

for line in text.split():                               # loop through the string, splitting into words
    if regexp.search(line):                             # search the string for the regex, hello in this case
        count = count + 1

print(count)                                            # results in 2 x hello
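
The loop above counts the words that contain a match. As an alternative sketch, findall returns every non-overlapping match in the whole string, so the matches can be counted without splitting the text into words first. The variable names are only illustrative.

```python
import re

text = "hello world and a hello to you all"

# findall returns a list of every non-overlapping match of the pattern
matches = re.findall("hello", text)
count = len(matches)

print(matches)   # ['hello', 'hello']
print(count)     # 2
```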

Special Characters

There are a number of special characters that can be used inside patterns. They let a single pattern match any of a number of different character strings, and they are what make patterns useful.

Regular expressions start to come into their own when you use these special characters, such as or (|) in the example below.

Using or (|) in regex

import re                                               # import the regular expression module

regexp = re.compile("(h|H)ello")                        # using or (|) to match hello or Hello
count = 0
text = "hello world and a Hello to you all"

for line in text.split():                               # loop through the string, splitting into words
    if regexp.search(line):                             # search the string for the regex, hello or Hello in this case
        count = count + 1

print(count)                                            # results in 2 (one hello, one Hello)
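
One point worth checking: the | character only means "or" outside square brackets. Inside [ ] it is an ordinary character, so a class such as [h|H] also accepts a literal |. The short sketch below compares a character class, a grouped alternation and a class that contains |; the sample strings are made up for illustration.

```python
import re

char_class = re.compile("[hH]ello")       # a class: one character, h or H
alternation = re.compile("(h|H)ello")     # a group using | as "or"
pipe_in_class = re.compile("[h|H]ello")   # here | is just another allowed character

for sample in ["hello", "Hello", "|ello"]:
    # fullmatch succeeds only if the whole sample matches the pattern
    print(sample,
          bool(char_class.fullmatch(sample)),
          bool(alternation.fullmatch(sample)),
          bool(pipe_in_class.fullmatch(sample)))
```

Only the last pattern accepts "|ello", which is rarely what is intended.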

The table below lists a number of special characters; see the re module documentation for a full listing.

Special Characters
. character	matches any character except the newline character; the special combination .* tries to match as much as possible
+ character	matches one or more occurrences of the preceding character
[ ] character	enables you to define patterns that match one of a group of alternatives; you can also use ranges such as [0-9] or [a-zA-Z]
* character	match zero or more occurrences of the preceding character
? character	match zero or one occurrence of the preceding character
Pattern anchor	there are a number of pattern anchors: match at the beginning of a string (^ or \A), match at the end of a string ($ or \Z), match on a word boundary (\b) and match inside a word (\B, the opposite of \b)
Escape sequence	if you want to include a character that is normally treated as a special character, you must precede the character with a backslash; Perl can also use \Q to treat everything after it as normal characters until it sees \E, but Python's re module does not support \Q and \E, so use re.escape() there instead
Excluding	you can exclude characters by using ^ as the first character inside square brackets, as in [^abc]
Character-Range escape sequences	there are special character-range escape sequences such as any digit (\d) and anything other than a digit (\D); see the re module documentation for the full list
Specified number of occurrences	you can define how many occurrences you want to match using {<minimum>,<maximum>}
specify choice	the special character | (pipe) enables you to specify two or more alternatives to choose from when matching a pattern; written with a backslash as \| it matches a literal pipe instead
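Portion reuse	sometimes you want to store what has been matched; you can do this by using (). The first set is stored in \1 (used inside the pattern and in replacement text) and is read after a match as $1 in Perl or group(1) of the match object in Python; the second set is \2, $2 or group(2), and so on
Different delimiter	in Perl you can specify a different delimiter in place of /; Python patterns are ordinary strings, so no delimiter is needed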
Special Characters Examples
. character	/d.f/ # could match words like def, dif, duf
	/d.*f/ # could match words like deaf, deef, def, dzzf, etc
+ character	/de+f/ # could match words like def, deef, deeef, deeeef, etc
	/ +/ # matches a run of one or more spaces, such as the gap between words
[ ] character	/d[eE]f/ # matches the words def or dEf
	/a[456]c/ # matches a followed by 4, 5 or 6, then c: a4c, a5c or a6c
	/d[eE]+f/ # matches words like def, dEf, deef, dEef, dEEeeEef, etc
	/d[a-z]f/ # matches words like daf, def, dzf, dsf, etc
	/1[0-9]0/ # matches numbers like 100, 110, 120, 150, 170, 190, etc
* character	/de*f/ # matches words like df, def, deef, deeef, etc
? character	/de?f/ # matches only the words df and def (not deef, because ? allows at most one occurrence)
Pattern anchors	/^hello/ # matches only if the line starts with hello
	/hello$/ # matches only if hello is at the end of the line
	/\bdef/ # only matches when def is at the beginning of a word, such as define or defghi
	/def\b/ # only matches when def is at the end of a word, such as abcdef
	/\Bdef/ # matches abcdef (opposite of \b)
	/def\B/ # matches defghi (opposite of \b)
Escape sequence	/\+salary/ # matches the text +salary; the + (plus) is treated as a normal character because of the \
	/\Q++\E/ # Perl only: matches ++ (in Python use re.escape("++"))
Excluding	/d[^eE]f/ # 1st character is d, 2nd character is anything other than e or E, last character is f
Character-Range escape sequences	/\d/ # matches any single digit
	/\d+/ # matches one or more digits
Specified number of occurrences	/de{3}f/ # matches only deeef; {3} means exactly three of the preceding e
	/de{1,3}f/ # matches only def, deef and deeef (minimum = 1, maximum = 3 occurrences)
specify choice	/def|ghi/ # matches either def or ghi
Portion reuse	/(def)(ghi)/ # the first matched group is stored in \1 or $1, the second in \2 or $2
	$result = $1; # Perl: assign the first matched group to $result (in Python, use group(1) of the match object)
	$result2 = $2; # Perl: assign the second matched group to $result2 (in Python, use group(2) of the match object)
Different delimiter	m!/usr/sbin! # Perl only: matches /usr/sbin, using the ! (bang) character as the delimiter
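
The examples in the table are written in the slash-delimited style. As a rough sketch of how a few of them might be tried with Python's re module, the patterns below drop the slashes and are written as raw strings. The helper name matches and the sample words are assumptions made for this illustration; re.escape and group() stand in for \Q...\E and $1.

```python
import re

def matches(pattern, word):
    """Return True if the whole word matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, word) is not None

# quantifiers
print(matches(r"de*f", "df"))          # True: * allows zero e's
print(matches(r"de+f", "df"))          # False: + needs at least one e
print(matches(r"de?f", "deef"))        # False: ? allows at most one e
print(matches(r"de{1,3}f", "deeef"))   # True: between one and three e's

# character classes and exclusion
print(matches(r"d[^eE]f", "dzf"))      # True
print(matches(r"d[^eE]f", "def"))      # False

# word-boundary anchors look at the surrounding text, so use search
print(bool(re.search(r"\bdef", "abcdef")))   # False: def is not at the start of a word
print(bool(re.search(r"\Bdef", "abcdef")))   # True

# re.escape turns special characters into literals
plus_plus = re.compile(re.escape("++"))
print(bool(plus_plus.search("c++")))         # True

# groups are read back with group(1), group(2), ...
found = re.search(r"(def)(ghi)", "abcdefghi")
print(found.group(1), found.group(2))        # def ghi
```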

Substituting Text

Using the regular expression module you can also substitute text in a string or file; you can substitute characters, words, numbers, etc.

Substitute text in a string

import re
string = "If the the problem is textual, use the the re module"
pattern = r"the the"
regexp = re.compile(pattern)

# substitute "the the" with "the"
string = regexp.sub("the", string)
print(string)
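
The substitution above handles one specific doubled word. As a sketch that combines substitution with the group reuse described in the table, a group and the back-reference \1 can collapse any immediately repeated word. The pattern and the sample sentence here are illustrative assumptions, not part of the original example.

```python
import re

text = "If the the problem is is textual, use the re module"

# (\w+) captures a word and \1 requires the same word again after one space;
# \b keeps the match aligned to whole words
doubled_word = re.compile(r"\b(\w+) \1\b")

# in the replacement text, \1 refers to the captured word
fixed = doubled_word.sub(r"\1", text)
print(fixed)   # If the problem is textual, use the re module
```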