Python Regular Expressions Tutorial – Part 1

A Gentle Introduction to Python Regular Expressions

“Suspect each moment, for it is a thief, tiptoeing away with more than it brings.” ― John Updike, A Month Of Sundays

1. Introduction

Regular expressions are a powerful mechanism for specifying a pattern to match against a string. You can target the specific segments you are interested in and specify what they should include or exclude. In spite of all the power they provide, regular expressions are rather easy to pick up and master. Almost every modern programming language, including Python, Perl, Java, C/C++, and JavaScript, supports string manipulation using regular expressions. In this article, we show you Python's regular expression syntax with examples.

Python includes a module called re which provides support for regular expressions. In addition to a few functions, objects and methods, regular expressions include a syntax which we discuss below.

2. Performing a Pattern Search

The steps involved in performing a regular expression pattern search are as follows:

Compile the regular expression to create a regex object from a pattern string.

import re
rx = re.compile('ab')

Use the regex object method search() to search the string and find the first position where the pattern matches. If a match is found the method returns a MatchObject instance, None otherwise. Use the group() method to retrieve the matched part of the string.

mo = rx.search('vegetable')
print(mo.group())
# prints
ab

You can also use the match() method to start matching the string from the beginning. Similar to search(), this method also returns a MatchObject on success and None on failure.

mo = rx.match('vegetable')
print(mo)
# prints
None

The same pattern fails here because match() only tries to match at the beginning of the string, and vegetable does not start with ab.

To find the position of the match in the input string, use the methods start() and end(). Here is an example that marks the position of the match using carets (^). It runs search() again, because the match() call above returned None. (The input string is available through the string attribute of the MatchObject.)

mo = rx.search('vegetable')
print(mo.string)
print(' ' * mo.start() + '^' * (mo.end() - mo.start()))
# prints
vegetable
     ^^

I hope the difference between search() and match() is clear: search() looks for a match anywhere in the string, while match() only matches starting from the beginning. (I always find these two confusing.) You can also get the behavior of match() from search() by anchoring the pattern to the beginning using ^, which is described below.
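To make the comparison concrete, here is a small sketch that runs all three variants on the same input. The variable names are only illustrative.

```python
import re

rx = re.compile('ab')
text = 'vegetable'

found_anywhere = rx.search(text)    # scans the whole string
found_at_start = rx.match(text)     # only tries position 0
anchored = re.search('^ab', text)   # search() anchored with ^ acts like match()

print(found_anywhere.group())  # ab
print(found_at_start)          # None
print(anchored)                # None
```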

3. How It Works

A regular expression in Python starts out as a string describing the pattern you want to match. Python compiles this string into an internal form and returns it as a regular expression object. The compiled bytecode inside this object can then be reused to match any number of input strings.
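As a rough illustration of reusing one compiled object, the sketch below compiles the pattern once and applies it to several inputs. The word list is made up for the example.

```python
import re

rx = re.compile('ab')  # compiled once

# Hypothetical inputs, chosen only for illustration.
words = ['vegetable', 'cabbage', 'carrot', 'crab']

# The same regex object is reused for every string.
matching = [w for w in words if rx.search(w)]
print(matching)  # ['vegetable', 'cabbage', 'crab']
```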

4. Special Characters

A regular expression pattern is composed of characters: regular and special. A regular character stands for itself and has no special meaning to the regex engine other than its character value.

A special character, on the other hand, has a meaning to the regex engine beyond its literal value. For example, a caret (^) at the start of a pattern is a special character to the engine and means “match the following pattern only at the beginning of the input string.”

To tell the engine to use the “ordinary” meaning of a so-called “special” character, precede it with a backslash (\). (In fact, this is one of the special meanings of the backslash character in regular expressions: it removes the “special” meaning of the next character.)
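Here is a minimal sketch of escaping, using the caret as the special character. The input string is invented for the example. The patterns are written as raw strings (r'...') so that Python itself does not interpret the backslash, and re.escape() is shown as one way to escape a literal string without doing it by hand.

```python
import re

text = 'x^2'  # hypothetical input containing a literal caret

unescaped = re.search(r'^2', text)   # ^ is an anchor here: "2 at the start"
escaped = re.search(r'\^2', text)    # \^ is a literal caret followed by 2

print(unescaped)          # None
print(escaped.group())    # ^2

# re.escape() adds the backslashes for you.
print(re.escape('^2'))    # \^2
```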

These special characters are what impart much of the power to regular expressions.

We now explain each of these special characters.

4.1. Any Character (.)

Use a . as a wildcard that matches any single character (except a newline, by default). For example, if you need to skip the next three characters when matching, you can include ... in the pattern.

The following matches any string starting with V and having g as the third character:

Pattern: ^V.g

Matches:
Vegetable
^^^
Vogue
^^^

Skip 3 characters:

Pattern: a...d

aladdin
^^^^^
constrained
      ^^^^^
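The caret diagrams above can be reproduced with a few lines of code. The helper below, underline(), is a hypothetical name used only in this sketch. It builds on the start() and end() methods from section 2.

```python
import re

def underline(pattern, text):
    """Return text with a caret line under the first match, or text alone if no match."""
    mo = re.search(pattern, text)
    if mo is None:
        return text
    return text + '\n' + ' ' * mo.start() + '^' * (mo.end() - mo.start())

print(underline('^V.g', 'Vegetable'))
print(underline('a...d', 'aladdin'))
print(underline('a...d', 'constrained'))
```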

4.2. Anchor Search to Start (^)

To match the pattern only at the beginning, start the pattern with a ^. Whatever follows this character only matches at the beginning. The following example matches only strings starting with P.

Pattern: ^P

Pentagon
^
Pensacola
^

4.3. Anchor Search to End ($)

Specify $ to indicate that the string should end with the pattern just before the $.

Find 4-character words ending with d.

Pattern: ^...d$

Bird
^^^^
Birdseye

Blvd
^^^^
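A short sketch of the two anchors working together, filtering the same three words used above:

```python
import re

rx = re.compile('^...d$')  # exactly four characters, the last one being d

words = ['Bird', 'Birdseye', 'Blvd']
four_letter_d_words = [w for w in words if rx.search(w)]
print(four_letter_d_words)  # ['Bird', 'Blvd']
```

Without the anchors, the pattern ...d would also match inside Birdseye.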

4.4. Match Multiple Characters (*)

Follow a character with a star (*) to match zero or more occurrences of that character.

Find words with two or more consecutive a's:

Pattern: aaa*

Isaac
  ^^
Kierkegaard
       ^^
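The following sketch runs the same pattern in code. The words banana and aaaah are extra inputs added here for illustration. They show that a single a is not enough and that * takes as many characters as it can.

```python
import re

rx = re.compile('aaa*')  # two a's, then zero or more additional a's

results = {}
for word in ['Isaac', 'Kierkegaard', 'banana', 'aaaah']:
    mo = rx.search(word)
    results[word] = mo.group() if mo else None

print(results)
# {'Isaac': 'aa', 'Kierkegaard': 'aa', 'banana': None, 'aaaah': 'aaaa'}
```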

4.5. Find Zero or One Character (?)

To match zero or one instance of a character, follow it with ?.

Pattern: ball?

ball
^^^^
garibaldi
    ^^^

The following matches both http and https:

Pattern: https?

http
^^^^
https
^^^^^
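As a quick sketch, the optional s can be used to pick out the scheme of a URL. The URLs are placeholders, and the ^ anchor from section 4.2 is added so that only the start of the string is considered.

```python
import re

rx = re.compile('^https?')  # http, optionally followed by s

urls = ['http://example.com', 'https://example.com', 'ftp://example.com']

schemes = {}
for url in urls:
    mo = rx.search(url)
    schemes[url] = mo.group() if mo else None

print(schemes['http://example.com'])   # http
print(schemes['https://example.com'])  # https
print(schemes['ftp://example.com'])    # None
```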

4.6. Find One or More Characters (+)

Sometimes it is irritating to specify * for multiple matches and end up with nothing, because * matches zero characters too. When that is the case, use + in place of *; it requires at least one occurrence of the preceding character.

Pattern: ball+

garibaldi

ball
^^^^
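The difference between * and + is easiest to see side by side. This sketch uses the same garibaldi input, plus a made-up x pattern to show the "end up with nothing" case.

```python
import re

star = re.search('ball*', 'garibaldi')  # second l is optional
plus = re.search('ball+', 'garibaldi')  # second l is required

print(star.group())  # bal
print(plus)          # None

# * can succeed by matching nothing at all; + cannot.
empty = re.search('x*', 'abc')
print(repr(empty.group()))     # ''
print(re.search('x+', 'abc'))  # None
```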

4.7. Match Characters in a Set ([ and ])

Till now, we have looked at matching one specific character at a time. To specify a set of characters, any one of which can match at a given position, enclose them in [ and ].

Check whether a version variable is set to the right value.

Pattern: version=[234]

version=1

version=2
^^^^^^^^^
version=3
^^^^^^^^^
version=4
^^^^^^^^^
version=5
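Here is the version check as a small sketch. One caveat is worth showing: because search() is happy with a match anywhere, an input such as version=25 (a made-up value) also passes unless the pattern is anchored with ^ and $.

```python
import re

rx = re.compile('version=[234]')

lines = ['version=%d' % n for n in range(1, 6)]  # version=1 .. version=5
accepted = [line for line in lines if rx.search(line)]
print(accepted)  # ['version=2', 'version=3', 'version=4']

# Unanchored, the pattern matches the leading "version=2" of "version=25".
loose = rx.search('version=25')
strict = re.search('^version=[234]$', 'version=25')
print(loose is not None)  # True
print(strict)             # None
```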

4.8. Exclude From Set ([^ and ])

Similar to the inclusive set above, when the set starts with a caret (^) right after the opening [, the characters listed in the set are excluded and any other character matches.

Pattern: version=[^15]

version=1

version=2
^^^^^^^^^
version=3
^^^^^^^^^
version=4
^^^^^^^^^
version=5
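A sketch of the excluded set in code. Note that [^15] still has to match one character, so it accepts anything other than 1 or 5 (including a letter) but rejects an input where nothing follows the = sign. The last two inputs are invented to show this.

```python
import re

rx = re.compile('version=[^15]')

lines = ['version=%d' % n for n in range(1, 6)]  # version=1 .. version=5
accepted = [line for line in lines if rx.search(line)]
print(accepted)  # ['version=2', 'version=3', 'version=4']

print(rx.search('version=x') is not None)  # True: x is neither 1 nor 5
print(rx.search('version='))               # None: no character left to match
```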

4.9. Range of Characters (- within [ and ])

It gets somewhat cumbersome to specify a large set of characters, doesn’t it? How do you include only lowercase letters in your string, excluding numbers and other junk? [abcdefgh...z]? Surely there has got to be a better way?

Well, there is. Specify the pattern as [a-z]. (Use [a-zA-Z] if you also want uppercase letters.)

Pattern: ^[a-z]*:

http://example.com
^^^^^
mailto:joe@example.com
^^^^^^^
text-align: left

The last example above does not match because the set does not include the hyphen (-). To include it, place it first or last in the set, where it cannot be read as a range: [-a-z] or [a-z-].

Pattern: ^[a-z-]*:

text-align: left
^^^^^^^^^^^
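The sketch below compares the two patterns on the same three inputs, which makes the effect of adding the hyphen to the set visible.

```python
import re

letters_only = re.compile('^[a-z]*:')
with_hyphen = re.compile('^[a-z-]*:')  # trailing - is a literal hyphen

def prefix(rx, line):
    """Return the matched prefix, or None if the line does not match."""
    mo = rx.search(line)
    return mo.group() if mo else None

lines = ['http://example.com', 'mailto:joe@example.com', 'text-align: left']

print([prefix(letters_only, s) for s in lines])
# ['http:', 'mailto:', None]
print([prefix(with_hyphen, s) for s in lines])
# ['http:', 'mailto:', 'text-align:']
```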

Conclusion

We have covered the basics of regular expressions in general and of Python's regular expression support in particular. The field of regular expressions is huge and we have just scratched the surface, enough for a beginner to get their feet wet. We will cover more aspects of this subject in later parts, so please look forward to them.

If you need any help at all, please do not hesitate to post in the comment section below.
