Strangely, regular expressions are not built into the language itself but are provided by a standard module called re. A regular expression (regex) is a way of extracting data based on a pattern, and many languages support them. In simple terms, a regex is a pattern you are looking for (matching) in a string or piece of text; once you have found the text you can remove it, change it, etc.
Let's see a simple regular expression example:
| Simple regex | import re # import the regular expression module
regexp = re.compile("hello") # a simple regex, the word hello
count = 0
text = "hello world and a hello to you all"
for line in text.split(): # loop through the string, splitting into words
    if regexp.search(line): # search the word for the regex, hello in this case
        count = count + 1
print(count) # prints 2, one for each hello |
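The loop above counts the words that contain the pattern. As a minimal sketch of an alternative, assuming you want to count every occurrence in the text rather than test it word by word, `re.findall` returns all non-overlapping matches as a list. The helper name `count_matches` is made up for illustration and is not part of the re module.

```python
import re

def count_matches(pattern, text):
    """Return how many non-overlapping matches of pattern occur in text."""
    # count_matches is a hypothetical helper, not part of the re module
    return len(re.findall(pattern, text))

text = "hello world and a hello to you all"
print(count_matches("hello", text))  # prints 2
```

For a one-off count this avoids the explicit loop; compiling the pattern first, as in the original example, is still worthwhile when the same regex is reused many times.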
A number of special characters can be used inside patterns. They let a single pattern match any of a number of character strings, and they are what make patterns useful.
Regular expressions start to come into their own when you use these special characters, such as or (|) in the example below:
| Using or (|) in regex | import re # import the regular expression module
regexp = re.compile("(h|H)ello") # using or (|) to match hello or Hello; [h|H] would be a character class that also matches a literal |
count = 0
text = "hello world and a Hello to you all"
for line in text.split(): # loop through the string, splitting into words
    if regexp.search(line): # search the word for the regex, hello or Hello in this case
        count = count + 1
print(count) # prints 2, one hello and one Hello |
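Square brackets and the pipe are easy to mix up. The following is a minimal sketch of the difference: inside square brackets the pipe is an ordinary character, so `[h|H]` also matches a literal `|`, whereas `(?:h|H)` is a true alternation between h and H. The variable names are chosen for illustration only.

```python
import re

char_class = re.compile(r"[h|H]ello")     # class containing h, | and H
alternation = re.compile(r"(?:h|H)ello")  # either h or H, followed by ello

for word in ["hello", "Hello", "|ello"]:
    # fullmatch succeeds only if the whole word matches the pattern
    in_class = char_class.fullmatch(word) is not None
    in_alt = alternation.fullmatch(word) is not None
    print(word, in_class, in_alt)
```

Only the character class accepts `|ello`. For a single-character choice such as this one, `[hH]ello` is the usual way to write it.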
The table below lists a number of the special characters; see the re module documentation for a full listing.
Special Characters |
|
| . character | matches any character except the newline character, the special combination of .* tries to match as much as possible. |
| + character | means one or more of the preceding character |
| [ ] character | enables you to define patterns that match one of a group of alternatives, you can also use ranges such as [0-9] or [a-zA-Z] |
| * character | match zero or more occurrences of the preceding character |
| ? character | match zero or one occurrence of the preceding character |
| Pattern anchor | there are a number of pattern anchors: match at the beginning of a string (^ or \A), match at the end of a string ($ or \Z), match on a word boundary (\b) and match inside a word (\B - opposite of \b) |
| Escape sequence | if you want to include a character that is normally treated as a special character, you must precede the character with a backslash; Perl's \Q and \E, which treat everything between them as normal characters, are not supported by the re module, use re.escape() on the text instead |
| Excluding | you can exclude words or characters by using the ^ inside square brackets [^] |
| Character-Range escape sequences | there are special character range escape sequences such as any digit (\d), anything other than a digit (\D), to see the full list see |
| Specified number of occurrences | you can define how many occurrences you want to match using {<minimum>,<maximum>} |
| specify choice | the special character | (pipe) enables you to specify two or more alternatives to choose from when matching a pattern |
| Portion reuse | sometimes you want to store what has been matched, you can do this by using (); the first set is stored in \1 (used inside the pattern and in replacement strings), the second set in \2 and so on. In Python the match object returns them through group(1), group(2), etc. (Perl uses $1, $2) |
| Different delimiter | Perl lets you specify a different delimiter; Python patterns are ordinary strings, so there is no delimiter to change |
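Two rows of the table behave differently in Python than in Perl: escaping a whole string and reusing captured portions. The following is a minimal sketch of the Python side, using only `re.escape`, `re.search`, `re.sub` and the match object's `group` method; the sample strings are made up for illustration.

```python
import re

# re.escape backslash-escapes every special character in a string
literal = re.escape("**++")
print(re.search(literal, "rating: **++") is not None)  # prints True

# Captured groups are read from the match object with group(n)
m = re.search(r"(def)(ghi)", "abcdefghi")
print(m.group(1), m.group(2))  # prints def ghi

# Inside a pattern, \1 refers back to whatever group 1 matched
print(re.search(r"(\w)\1", "hello") is not None)  # prints True (the double l)

# In a replacement string, \1 and \2 insert the captured text
swapped = re.sub(r"(def)(ghi)", r"\2\1", "abcdefghi")
print(swapped)  # prints abcghidef
```

Raw strings (the `r` prefix) keep Python from interpreting `\1` or `\b` itself before the re module sees them.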
Special Characters Examples |
|
| . character | /d.f/ # could match words like def, dif, duf /d.*f/ # could match words like deaf, deef, def, dzzf, etc |
| + character | /de+f/ # could match words like def, deef, deeef, deeeef, etc / +/ # match one or more consecutive spaces |
| [ ] character | /d[eE]f/ # match words def or dEf /d[a-z]f/ # match words like daf, def, dzf, dsf, etc |
| * character | /de*f/ # match words like df, def, deef, deeef, etc |
| ? character | /de?f/ # match only the words df and def (not deef, the ? matches at most one occurrence) |
| Pattern anchors | /^hello/ # match only if line starts with hello /\Bdef/ # matches abcdef (opposite of \b) |
| Escape sequence | /\+salary/ # will match the word +salary, the + (plus) is treated as a normal character because of the \ re.escape("**++") # Python equivalent of Perl's /\Q**++\E/, builds a pattern that matches **++ |
| Excluding | /d[^eE]f/ # 1st character is d, 2nd character is anything other than e or E, last character is f |
| Character-Range escape sequences | /\d/ # match any digit /\d+/ # match one or more digits |
| Specified number of occurrences | /de{3}f/ # match only deeef, the {3} means exactly three of the preceding e /de{1,3}f/ # match only def, deef and deeef (minimum = 1, maximum = 3 occurrences) |
| specify choice | /def|ghi/ # match either def or ghi |
| Portion reuse | /(def)(ghi)/ # the first matched pattern will be stored in group 1 (\1), the second in group 2 (\2) result = m.group(1) # with m = re.search(r"(def)(ghi)", text), assign the first matched pattern to result result2 = m.group(2) # assign the second matched pattern to result2 |
| Different delimiter | !/usr/sbin! # Perl only: match /usr/sbin using the ! (bang) character as a delimiter; in Python write re.search("/usr/sbin", text), the / needs no escaping |
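The examples in the table are written in /pattern/ notation. As a rough sketch of how to try them in Python, the text between the slashes goes into a raw string and is passed to the re module. Here `re.fullmatch` is used so that a word only counts when the whole word fits the pattern; the list of test words is an assumption made for illustration.

```python
import re

# (pattern, word) pairs based on the examples table
checks = [
    (r"d.f", "dif"),
    (r"de+f", "deeef"),
    (r"de*f", "df"),
    (r"de?f", "deef"),
    (r"d[^eE]f", "def"),
    (r"de{1,3}f", "deef"),
    (r"def|ghi", "ghi"),
]

# fullmatch succeeds only if the whole word matches the pattern
results = [re.fullmatch(pattern, word) is not None for pattern, word in checks]

for (pattern, word), matched in zip(checks, results):
    print(f"{pattern!r:14} {word!r:8} {matched}")
```

The two pairs that fail are `de?f` against deef (too many e's) and `d[^eE]f` against def (the e is excluded).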
Using the regular expression module you can also substitute text in a string or file; you can substitute characters, words, numbers, etc.
| Substitute text in a string | import re
string = "If the the problem is textual, use the the re module"
pattern = r"the the"
regexp = re.compile(pattern)
# substitute "the the" with "the"
string = regexp.sub("the", string)
print(string) |
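The substitution above handles one fixed phrase. As a minimal sketch of a more general version, assuming you want to collapse any immediately repeated word, a captured group and a backreference can stand in for the literal text. The sample sentence and the names `doubled` and `cleaned` are made up for illustration.

```python
import re

text = "If the the problem is is textual, use the re module"

# \b(\w+) captures a whole word; \s+\1\b requires the same word to follow it
doubled = re.compile(r"\b(\w+)\s+\1\b")

# Replace each doubled word with a single copy (the text captured in group 1)
cleaned = doubled.sub(r"\1", text)
print(cleaned)  # prints: If the problem is textual, use the re module
```

The `sub` method also accepts a `count` argument if only the first few occurrences should be replaced.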