“Suspect each moment, for it is a thief, tiptoeing away with more than it brings.” ― John Updike, A Month Of Sundays
- 1. Introduction
- 2. Performing a Pattern Search
- 3. How It Works
- 4. Special Characters
- 4.1. Any Character (.)
- 4.2. Anchor Search to Start (^)
- 4.3. Anchor Search to End ($)
- 4.4. Match Multiple Characters (*)
- 4.5. Find Zero or One Character (?)
- 4.6. Find One or More Characters (+)
- 4.7. Match Characters in a Set ([ and ])
- 4.8. Exclude From Set ([^ and ])
- 4.9. Range of Characters (- within [ and ])
1. Introduction
Python includes a module called re that provides support for regular expressions. Besides a handful of functions, objects, and methods, regular expressions have a pattern syntax of their own, which we discuss below.
2. Performing a Pattern Search
The steps involved in performing a regular expression pattern search are as follows:
Compile the regular expression to create a regex object from a pattern string.
    import re
    rx = re.compile('ab')
Use the regex object method search() to search the string and find the first position where the pattern matches. If a match is found the method returns a MatchObject instance, None otherwise. Use the group() method to retrieve the matched part of the string.
    mo = rx.search('vegetable')
    print(mo.group())  # prints ab
You can also use the match() method to start matching the string from the beginning. Similar to search(), this method also returns a MatchObject on success and None on failure.
    mo = rx.match('vegetable')
    print(mo)  # prints None
The same pattern fails here because match() looks for ab only at the beginning of the string, and vegetable does not start with ab.
To find the position of the match in the input string, use the methods start() and end(). Here is an example that marks the position of the match in the string using carets (^). (The input string is available as the string attribute of the MatchObject.)
    mo = rx.search('vegetable')  # search again; match() above returned None
    print(mo.string)
    print(' ' * mo.start() + '^' * (mo.end() - mo.start()))
    # prints:
    # vegetable
    #      ^^
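The caret display used throughout this article can be wrapped in a small helper. This is only a sketch: the function name underline_match is made up for illustration and is not part of the re module.

```python
import re

def underline_match(pattern, text):
    """Return text followed by a line of carets under the first match.

    Hypothetical helper for illustration; returns text alone if nothing matches.
    """
    mo = re.search(pattern, text)
    if mo is None:
        return text
    return text + '\n' + ' ' * mo.start() + '^' * (mo.end() - mo.start())

print(underline_match('ab', 'vegetable'))
# vegetable
#      ^^
```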
I hope the difference between search() and match() is clear: search() searches anywhere in the string, while match() matches starting from the beginning. (I always find these two confusing.) You can also get the behavior of match() from search() by anchoring the pattern to the beginning using ^ (just a reminder; see below).
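To make the difference concrete, here is a short comparison on the same input. The variable names are only for illustration, and the example assumes the default flags (with re.MULTILINE, ^ also matches after each newline).

```python
import re

text = 'vegetable'

# search() scans the whole string; match() only tries position 0.
found_anywhere = re.search('ab', text)   # match object, starts at index 5
found_at_start = re.match('ab', text)    # None: text starts with 'v'

# Anchoring the pattern with ^ makes search() behave like match() here.
anchored_miss = re.search('^ab', text)   # None as well
anchored_hit = re.search('^veg', text)   # match object, starts at index 0

print(found_anywhere.start(), found_at_start, anchored_miss, anchored_hit.group())
# 5 None None veg
```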
3. How It Works
A regular expression in Python starts out as a string describing the pattern you want to match. Python compiles this string into an internal form and returns it as a regular expression object. The compiled bytecode inside this object can then be reused to match any number of input strings.
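As a small illustration of that reuse, the sketch below compiles a pattern once and applies it to several strings. The word list is invented for the example.

```python
import re

rx = re.compile('ab')  # compiled once, reused below

words = ['vegetable', 'cabbage', 'carrot']
hits = [w for w in words if rx.search(w)]

print(hits)  # ['vegetable', 'cabbage']
```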
4. Special Characters
A regular expression pattern is composed of characters: regular and special. A regular character stands for itself and has no special meaning to the regex engine other than its character value.
A special character, on the other hand, carries a meaning for the regex engine that differs from its literal value. For example, a caret (^) is a special character to the engine and means “match the following pattern only at the beginning of the input string.”
To tell the engine to use the “ordinary” meaning of a so-called “special” character, precede it with a backslash (\). (In fact, this is the special meaning of the backslash character in regular expressions: it removes the “special” meaning of the next character.)
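Here is a brief sketch of escaping in practice, using the dot as the special character. The patterns are written as raw strings (r'...') so that Python itself leaves the backslash alone; re.escape() is the standard-library function that adds the backslashes for you.

```python
import re

# Unescaped, '.' is a wildcard; escaped, it is a literal dot.
wildcard = re.search('a.c', 'abc')     # matches 'abc'
no_match = re.search(r'a\.c', 'abc')   # None: 'b' is not a dot
literal = re.search(r'a\.c', 'a.c')    # matches 'a.c'

# re.escape() inserts the backslashes for you.
escaped = re.escape('example.com')
print(escaped)  # example\.com
```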
These special characters are what impart much of the power to regular expressions.
We now explain each of these special characters.
4.1. Any Character (.)
Use . to act as a wildcard for any single character. In other words, this is a “match any character”. For example, if you need to skip the next three characters when matching, you can include ... in the pattern.
The following matches any string starting with V and having g as its third character:
    Pattern: ^V.g
    Matches:
    Vegetable
    ^^^
    Vogue
    ^^^
Skip 3 characters:
    Pattern: a...d
    aladdin
    ^^^^^
    constrained
          ^^^^^
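The same patterns can be checked in Python. The input Village is an extra, made-up case showing a string that does not match.

```python
import re

veg = re.search('^V.g', 'Vegetable')      # 'Veg'
vog = re.search('^V.g', 'Vogue')          # 'Vog'
skip = re.search('a...d', 'constrained')  # 'ained', starting at index 6
miss = re.search('^V.g', 'Village')       # None: third character is 'l'

print(veg.group(), vog.group(), skip.group(), miss)
# Veg Vog ained None
```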
4.2. Anchor Search to Start (^)
To match the pattern only at the beginning, start the pattern with a ^. Whatever follows this character only matches at the beginning. The following example matches only strings starting with P:
    Pattern: ^P
    Pentagon
    ^
    Pensacola
    ^
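In Python, the anchored pattern can be used to filter a list. El Paso is an added example that contains a P, but not at the start.

```python
import re

names = ['Pentagon', 'Pensacola', 'El Paso']

# Without the anchor, any P anywhere in the string is enough.
contains_p = [n for n in names if re.search('P', n)]
# With the anchor, only a leading P counts.
starts_with_p = [n for n in names if re.search('^P', n)]

print(contains_p)     # ['Pentagon', 'Pensacola', 'El Paso']
print(starts_with_p)  # ['Pentagon', 'Pensacola']
```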
4.3. Anchor Search to End ($)
Use $ to indicate that the string should end with the pattern just before the $.
Find 4-character words ending with d:
    Pattern: ^...d$
    Bird
    ^^^^
    Birdseye
    Blvd
    ^^^^
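A quick check of the same three words in Python; the list name is arbitrary.

```python
import re

words = ['Bird', 'Birdseye', 'Blvd']

# ^ and $ together force the pattern to cover the whole string:
# exactly three characters followed by a final d.
four_letter_d = [w for w in words if re.search('^...d$', w)]

print(four_letter_d)  # ['Bird', 'Blvd']
```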
4.4. Match Multiple Characters (*)
Following any character with a star (*) matches zero or more occurrences of that character.
Find words with a repeated a (the pattern reads as aa followed by zero or more a):
    Pattern: aaa*
    Isaac
      ^^
    Kierkegaard
           ^^
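The sketch below runs the same pattern in Python and adds two made-up inputs: one with a longer run of a, and one with only a single a.

```python
import re

rx = re.compile('aaa*')  # 'aa' followed by zero or more 'a'

double = rx.search('Isaac')     # 'aa'
longer = rx.search('baaaad')    # 'aaaa': the star takes as many as it can
single = rx.search('Isabel')    # None: only one 'a' in a row

# On its own, a* can match the empty string.
empty = re.search('a*', 'bbb')  # matches '' at index 0

print(double.group(), longer.group(), single, repr(empty.group()))
# aa aaaa None ''
```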
4.5. Find Zero or One Character (?)
To match zero or one instance of a character, use ?.
    Pattern: ball?
    ball
    ^^^^
    garibaldi
        ^^^
The following matches both http and https:
    Pattern: https?
    http
    ^^^^
    https
    ^^^^^
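Both examples can be verified in Python. The URLs below are placeholders chosen for the illustration.

```python
import re

rx = re.compile('https?')  # 'http' followed by an optional 's'

plain = rx.search('http://example.com')    # 'http'
secure = rx.search('https://example.com')  # 'https'

# The second l is optional, so 'bal' inside garibaldi is enough.
partial = re.search('ball?', 'garibaldi')  # 'bal'

print(plain.group(), secure.group(), partial.group())
# http https bal
```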
4.6. Find One or More Characters (+)
Sometimes it is irritating to specify * for multiple matches and end up matching nothing, because * matches zero characters too. In that case, use + in place of *.
    Pattern: ball+
    garibaldi
    ball
    ^^^^
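Comparing + with * on the same input shows the difference. This is only a sketch with arbitrary variable names.

```python
import re

# + needs at least one of the preceding character ...
plus_miss = re.search('ball+', 'garibaldi')  # None: no second 'l'
plus_hit = re.search('ball+', 'ball')        # 'ball'

# ... while * is satisfied with none at all.
star_hit = re.search('ball*', 'garibaldi')   # 'bal'

print(plus_miss, plus_hit.group(), star_hit.group())
# None ball bal
```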
4.7. Match Characters in a Set ([ and ])
Till now, we have looked at specifying single-character matches. To specify a set of characters, any one of which can match, enclose them in [ and ].
Check whether a version variable is set to the right value.
    Pattern: version=[234]
    version=1
    version=2
    ^^^^^^^^^
    version=3
    ^^^^^^^^^
    version=4
    ^^^^^^^^^
    version=5
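The check can be sketched in Python as follows. The set [234] is an assumption inferred from which lines are marked as matching above; the variable names are illustrative.

```python
import re

# Assumed pattern: accept versions 2, 3 and 4 only.
rx = re.compile('version=[234]')

settings = ['version=%d' % n for n in range(1, 6)]
accepted = [s for s in settings if rx.search(s)]

print(accepted)  # ['version=2', 'version=3', 'version=4']
```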
4.8. Exclude From Set ([^ and ])
Similar to the inclusive search above, when the set starts with a caret (^), the characters in the set are excluded and any other character matches.
    Pattern: version=[^15]
    version=1
    version=2
    ^^^^^^^^^
    version=3
    ^^^^^^^^^
    version=4
    ^^^^^^^^^
    version=5
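One thing worth checking in Python: an excluded set still has to match one character, and it accepts anything outside the set, not just digits. The inputs version=x and version= are made up to show this.

```python
import re

rx = re.compile('version=[^15]')

settings = ['version=%d' % n for n in range(1, 6)]
accepted = [s for s in settings if rx.search(s)]
print(accepted)  # ['version=2', 'version=3', 'version=4']

letter = rx.search('version=x')  # matches: 'x' is neither 1 nor 5
missing = rx.search('version=')  # None: nothing left for the set to match
```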
4.9. Range of Characters (- within [ and ])
It gets somewhat cumbersome to specify a large set of characters, doesn’t it? How do you include only alphabetic characters in your string, excluding numbers and other junk? [abcdefgh...z]? Surely there has got to be a better way?
Well, there is. Specify the pattern as [a-z].
    Pattern: ^[a-z]*:
    http://example.com
    ^^^^^
    mailto:firstname.lastname@example.org
    ^^^^^^^
    text-align: left
The last example above does not match because the set does not include -. Include it in the set as [a-z-].
    Pattern: ^[a-z-]*:
    text-align: left
    ^^^^^^^^^^^
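The range examples can be run in Python as shown below. Note that the hyphen is placed last inside the brackets, where it stands for itself rather than marking a range.

```python
import re

letters = re.compile('^[a-z]*:')
letters_or_hyphen = re.compile('^[a-z-]*:')

http = letters.search('http://example.com')                       # 'http:'
mailto = letters.search('mailto:firstname.lastname@example.org')  # 'mailto:'
css_miss = letters.search('text-align: left')                     # None: '-' is not in a-z
css_hit = letters_or_hyphen.search('text-align: left')            # 'text-align:'

print(http.group(), mailto.group(), css_miss, css_hit.group())
# http: mailto: None text-align:
```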
We have covered the basics of regular expressions in general and of Python's re module in particular. The field of regular expressions is huge and we have just scratched the surface, enough to wet a beginner’s feet. We will cover more aspects of this subject, so please look forward to it.
If you need any help at all, please do not hesitate to post in the comment section below.