Strangely, regular expressions are not built into the language itself but are provided by a module called re. A regular expression (regex) is, in simple terms, a pattern that you look for (match) in a string or piece of text, and many languages support them. Once you have matched the text you can extract it, remove it, change it, etc.
Let's see a simple regular expression example:
Simple regex |
    import re                       # import the regular expression module

    regexp = re.compile("hello")    # a simple regex: the word hello
    count = 0
    text = "hello world and a hello to you all"
    for word in text.split():       # loop through the string, splitting it into words
        if regexp.search(word):     # search each word for the regex (hello in this case)
            count = count + 1
    print(count)                    # prints 2, one for each hello
There are a number of special characters that can be used inside patterns. They let a single pattern match any of a number of character strings, and they are what make patterns useful.
Regular expressions start to come into their own when you use these special characters, like or (|) in the example below.
Using or (|) in regex |
    import re                           # import the regular expression module

    regexp = re.compile("(h|H)ello")    # using or (|) inside a group to match hello or Hello
    count = 0
    text = "hello world and a Hello to you all"
    for word in text.split():           # loop through the string, splitting it into words
        if regexp.search(word):         # search each word for the regex (hello or Hello)
            count = count + 1
    print(count)                        # prints 2: one hello and one Hello
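One detail is worth checking when square brackets and the pipe are combined: inside [ ] the | is an ordinary character, so alternation needs parentheses (or no brackets at all). The following is a minimal sketch that compares the two forms; the sample strings are made up for illustration.

```python
import re

# Inside a character class, | is a literal character, not "or".
char_class = re.compile(r"[h|H]ello")
# Alternation proper: a (non-capturing) group holding the two alternatives.
alternation = re.compile(r"(?:h|H)ello")

samples = ["hello", "Hello", "|ello"]  # illustrative inputs only

class_hits = [s for s in samples if char_class.fullmatch(s)]
alt_hits = [s for s in samples if alternation.fullmatch(s)]

print(class_hits)  # ['hello', 'Hello', '|ello']
print(alt_hits)    # ['hello', 'Hello']
```

When the alternatives are single characters, a plain character class such as [hH] is usually the simpler choice; the pipe is more useful for whole words, as in def|ghi.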
The table below lists a number of the special characters; see the re module documentation for a full listing.
Special Characters |
|
. character | matches any single character except the newline character; the special combination .* tries to match as much as possible |
+ character | means one or more occurrences of the preceding character |
[ ] character | enables you to define patterns that match one of a group of alternative characters; you can also use ranges such as [0-9] or [a-zA-Z] (ranges are written back to back, because a comma inside the brackets would be matched literally) |
* character | matches zero or more occurrences of the preceding character |
? character | matches zero or one occurrence of the preceding character |
Pattern anchor | there are a number of pattern anchors: match at the beginning of a string (^ or \A), match at the end of a string ($ or \Z), match on a word boundary (\b) and match inside a word (\B, the opposite of \b) |
Escape sequence | if you want to include a character that is normally treated as a special character, you must precede it with a backslash; Perl's \Q...\E, which treats everything in between as normal characters, is not supported by Python's re module, where re.escape() does the same job for a whole string |
Excluding | you can exclude characters by placing ^ first inside square brackets, as in [^...] |
Character-Range escape sequences | there are special character range escape sequences such as any digit (\d) and anything other than a digit (\D); the re module documentation has the full list |
Specified number of occurrences | you can define how many occurrences you want to match using {<minimum>,<maximum>} |
Specify choice | the special character | (pipe) enables you to specify two or more alternatives to choose from when matching a pattern |
Portion reuse | sometimes you want to store what has been matched; you can do this by using (). The first set is stored as group 1, written \1 inside a pattern or replacement string and read with match.group(1) when assigning to variables (Perl's $1 form is not used in Python); the second set is \2 or match.group(2), and so on |
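Different delimiter | Perl lets you specify a different delimiter around a pattern; Python has no pattern delimiters because a pattern is an ordinary string, so a / inside a pattern needs no escaping |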
Special Characters Examples |
|
. character |
    r"d.f"     # could match words like def, dif, duf
    r"d.*f"    # could match words like deaf, deef, def, dzzf, etc.
+ character |
    r"de+f"    # could match words like def, deef, deeef, deeeef, etc.
    r" +"      # matches a run of one or more spaces, such as the gap between widely spaced words
[ ] character |
    r"d[eE]f"     # match the words def or dEf
    r"d[a-z]f"    # match words like daf, def, dzf, dsf, etc.
* character |
    r"de*f"    # match words like df, def, deef, deeef, etc.
? character |
    r"de?f"    # match only the words df and def (not deef, as ? allows at most one occurrence)
Pattern anchors |
    r"^hello"    # match only if the line starts with hello
    r"\Bdef"     # matches the def in abcdef (opposite of \b)
Escape sequence |
    r"\+salary"          # will match the text +salary, the + (plus) is treated as a normal character because of the \
    re.escape("**++")    # builds a pattern that will match the literal text **++ (Python has no \Q...\E)
Excluding |
    r"d[^eE]f"    # 1st character is d, 2nd character is anything other than e or E, last character is f
Character-Range escape sequences |
    r"\d"     # match any single digit
    r"\d+"    # match one or more digits
Specified number of occurrences |
    r"de{3}f"      # match only deeef, the {3} means exactly three of the preceding e
    r"de{1,3}f"    # match only def, deef and deeef (minimum = 1, maximum = 3 occurrences)
Specify choice |
    r"def|ghi"    # match either def or ghi
Portion reuse |
    m = re.search(r"(def)(ghi)", "defghi")    # the first matched group is stored as group 1 (\1), the second as group 2 (\2)
    result = m.group(1)                       # assign the first matched group to result
    result2 = m.group(2)                      # assign the second matched group to result2
Different delimiter |
    r"/usr/sbin"    # matches /usr/sbin; a Python pattern is a plain string with no delimiters, so / needs no escaping
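The patterns in the table can be tried out directly in Python. The sketch below is one way to do that, assuming re.fullmatch is used to test a whole word against a pattern. The helper name matches and the sample words are illustrative only and are not part of the re module. It also shows group reuse (match.group() and \1), literal escaping with re.escape(), and two of the anchors.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["df", "def", "deef", "deeef", "deeeef", "dEf", "dzf"]

print(matches(r"d.f", words))       # ['def', 'dEf', 'dzf']
print(matches(r"de+f", words))      # ['def', 'deef', 'deeef', 'deeeef']
print(matches(r"de*f", words))      # ['df', 'def', 'deef', 'deeef', 'deeeef']
print(matches(r"de?f", words))      # ['df', 'def']
print(matches(r"de{3}f", words))    # ['deeef']
print(matches(r"de{1,3}f", words))  # ['def', 'deef', 'deeef']
print(matches(r"d[^eE]f", words))   # ['dzf']

# Groups: each () is numbered from 1 and read back with group().
m = re.search(r"(def)(ghi)", "abcdefghijkl")
first, second = m.group(1), m.group(2)
print(first, second)                # def ghi

# \1 inside a pattern refers back to whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")
print(doubled.group(1))             # is

# re.escape() turns special characters into literals.
literal = re.escape("**++")
print(re.search(literal, "rating: **++") is not None)  # True

# Anchors: ^ ties the match to the start, \B requires a non-boundary.
print(bool(re.search(r"^hello", "oh hello")))  # False
print(bool(re.search(r"\Bdef", "abcdef")))     # True
print(bool(re.search(r"\Bdef", "def")))        # False
```

re.fullmatch is used here so that each result reflects the whole word; re.search, as used in the earlier examples, would also report a hit when the pattern matches only part of a longer word.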
Using the regular expression module you can also substitute text in a string or file; you can substitute characters, words, numbers, etc.
Substitute text in a string |
    import re

    text = "If the the problem is textual, use the the re module"
    pattern = r"the the"
    regexp = re.compile(pattern)
    # substitute "the the" with "the"
    text = regexp.sub("the", text)
    print(text)    # prints: If the problem is textual, use the re module
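The same substitution can be extended to any repeated word by capturing the word and referring back to it. This is a minimal sketch, assuming a doubled word means two identical words separated by whitespace. The pattern and the helper name dedupe are illustrative and are not part of the re module.

```python
import re

# (\w+) captures a word; \1 requires the same word again after whitespace.
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

def dedupe(text):
    """Collapse each immediately repeated word into one (hypothetical helper)."""
    return doubled_word.sub(r"\1", text)

text = "If the the problem is textual, use the the re module"
print(dedupe(text))  # If the problem is textual, use the re module

# subn() returns the new string together with the number of replacements.
fixed, n = doubled_word.subn(r"\1", text)
print(n)  # 2
```

The \b anchors keep the pattern from firing inside longer words, so a phrase such as "the theory" is left alone.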