Python Regular Expressions Tutorial – Part 2

Learn about character classes in Python regular expressions.

“Anyone who says he can see through women is missing a lot.” ― Groucho Marx

1. Introduction

New to regular expressions? Start here.

In the last article, we covered some basic aspects of regular expressions in Python. We now continue by examining character classes in detail.

2. Character Classes

We have previously covered specifying ranges of characters using something like [a-z] for all the lowercase letters and [0-9] for all the digits. Common ranges like these are more easily specified using the predefined character classes described in this section.

2.1. Digits (\d)

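Use \d to match any digit, instead of writing [0-9]. (For Python 3 str patterns, \d also matches digits from other scripts unless the re.ASCII flag is set.)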

Pattern: age=\d+

age=23
^^^^^^
age=-1

age=n/a
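
The three samples above can be checked from Python with a minimal sketch like the following. The helper name find_age is only illustrative and is not part of the re module.

```python
import re

# Pattern from the text: the literal "age=" followed by one or more digits.
AGE = re.compile(r"age=\d+")

def find_age(text):
    """Return the matched substring, or None when nothing matches."""
    m = AGE.search(text)
    return m.group(0) if m else None

for sample in ["age=23", "age=-1", "age=n/a"]:
    print(sample, "->", find_age(sample))
```

Only the first sample produces a match; the other two return None because no digit follows "age=".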

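Note that this is not number parsing in general: a value such as -1 is not matched. The following pattern matches integers with an explicit sign. The sign is mandatory here, so a plain 1 is rejected; write [-+]? instead of [-+] to make the sign optional.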

Pattern: ^[-+]\d+$

1

-1
^^
+20
^^^
25d
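
As a rough sketch, the pattern above can be compared with a variant whose sign is optional. The OPTIONAL pattern and the parse_int helper are assumptions added for illustration, not something from the original article.

```python
import re

SIGNED = re.compile(r"^[-+]\d+$")      # sign required, as in the pattern above
OPTIONAL = re.compile(r"^[-+]?\d+$")   # assumption: ? makes the sign optional

def parse_int(text):
    """Return int(text) for an optionally signed integer, else None."""
    return int(text) if OPTIONAL.match(text) else None

for sample in ["1", "-1", "+20", "25d"]:
    print(sample, bool(SIGNED.match(sample)), parse_int(sample))
```

With the optional sign, "1" is accepted along with "-1" and "+20", while "25d" is still rejected by both patterns.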

2.2. Non-Digits (\D)

Reverse the sense of \d with \D, which matches only non-digits. This is equivalent to [^0-9].

Pattern: name=\D+

name=23

name=john smith
^^^^^^^^^^^^^^^
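
A small sketch of the same pattern in Python follows; find_name is a hypothetical helper name. Note that \D+ also consumes spaces and punctuation, and stops only at the first digit.

```python
import re

# Pattern from the text: "name=" followed by one or more non-digit characters.
NAME = re.compile(r"name=\D+")

def find_name(text):
    """Return the matched substring, or None when nothing matches."""
    m = NAME.search(text)
    return m.group(0) if m else None

print(find_name("name=23"))             # no match: a digit follows "name="
print(find_name("name=john smith"))     # the whole string matches
print(find_name("name=john smith 42"))  # match stops just before the "4"
```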

2.3. Alpha-Numeric (\w)

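Match any letter, digit, or underscore with \w. With ASCII matching (the re.ASCII flag) this is equivalent to [a-zA-Z0-9_]; for Python 3 str patterns, \w also matches letters and digits from other scripts by default.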

A simple email address pattern (not perfect – do not use this directly).

Pattern: \w+@[\w.]+

john@

john@example
^^^^^^^^^^^^
john@example.com
^^^^^^^^^^^^^^^^
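
The sketch below tries the simple email pattern and also illustrates how the re.ASCII flag narrows \w. The sample strings are made up for this illustration.

```python
import re

# Simple (imperfect) email pattern from the text.
EMAIL_SIMPLE = re.compile(r"\w+@[\w.]+")

print(EMAIL_SIMPLE.search("john@"))                  # None: nothing after the @
print(EMAIL_SIMPLE.search("john@example").group(0))  # john@example
# One imperfection: a sentence-ending period is swallowed by [\w.]+
print(EMAIL_SIMPLE.search("mail john@example.com.").group(0))

# \w is Unicode-aware for str patterns unless re.ASCII is given.
WORD_UNICODE = re.compile(r"\w+")
WORD_ASCII = re.compile(r"\w+", re.ASCII)
text = "na\u00efve caf\u00e9_1"    # accented letters written as escapes
print(WORD_UNICODE.findall(text))  # two words, accents included
print(WORD_ASCII.findall(text))    # words are split at the accented letters
```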

2.4. Non-Alphanumeric (\W)

This pattern reverses the sense of the alphanumeric match above (\w). It matches [^a-zA-Z0-9_].
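
One common use of \W is splitting text on anything that is not a word character. The following is a minimal sketch; split_words is an illustrative helper name.

```python
import re

def split_words(text):
    """Split on runs of non-word characters, dropping empty pieces."""
    return [word for word in re.split(r"\W+", text) if word]

print(split_words("alpha, beta;gamma -- delta!"))
# ['alpha', 'beta', 'gamma', 'delta']
```

The filter on empty pieces is needed because re.split returns an empty string when the text starts or ends with a separator.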

2.5. White Space (\s) and Non White Space (\S)

Match a single white-space character (space, tab, newline, carriage return, form feed, or vertical tab) with \s. Match any non-white-space character with \S.
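
As a short sketch, \S+ can pull out the non-blank tokens of a string, and \s+ can collapse runs of white space. The sample text is invented for the example.

```python
import re

text = "  first\tsecond\nthird  "

# \S+ matches each run of non-white-space characters.
tokens = re.findall(r"\S+", text)
print(tokens)   # ['first', 'second', 'third']

# \s+ matches each run of white space; replace it with a single space.
normalized = re.sub(r"\s+", " ", text).strip()
print(normalized)   # first second third
```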

3. Grouping and Repeating with ( and )

Grouping and repetition use parentheses (()): a quantifier placed after a group applies to the whole group. See examples below.
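
The sketch below illustrates the difference a group makes to a quantifier, and how a group captures the text it matched. The patterns are small made-up examples.

```python
import re

# With a group, + repeats the whole "ab" sequence.
grouped = re.fullmatch(r"(ab)+", "ababab")
print(bool(grouped))      # True
print(grouped.group(1))   # 'ab' (the last repetition of the group)

# Without a group, + applies only to the "b".
print(re.fullmatch(r"ab+", "ababab"))      # None
print(bool(re.fullmatch(r"ab+", "abbb")))  # True

# Groups also capture the text they matched.
pair = re.search(r"(\w+)=(\d+)", "age=23")
print(pair.group(1), pair.group(2))        # age 23
```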

4. Some Examples

Let us apply the patterns covered above to some commonly encountered examples.

4.1. Email Address

A simple email address: a word, followed by @, followed by one or more names, each optionally followed by a period.

Pattern: \w+@(\w+\.?)+

joe@blow
^^^^^^^^
john@example.com
^^^^^^^^^^^^^^^^

As you can see, this pattern allows a domain name without even a single period (.), which is not right.

Here is the best I could come up with. The following pattern has been anchored at the beginning and the end for a complete match.

Pattern: ^\w+@(\w+\.)+(\w+)+$

joe@blow

john@example.com
^^^^^^^^^^^^^^^^
joe@a.b.c
^^^^^^^^^

Let us enhance the above to allow periods and plus signs in the user name part (like Gmail does). Inside [...], the + and . are literal characters.

Pattern: ^[\w+.]+@(\w+\.)+(\w+)+$

joe@blow

john@example.com
^^^^^^^^^^^^^^^^
joe@a.b.c
^^^^^^^^^
j.s@a.b
^^^^^^^

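Some addresses also contain % and - in the user name, so let us enhance it some more. The - is placed last in the character class so that it is taken literally rather than as a range.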

Pattern: ^[\w+.%-]+@(\w+\.)+(\w+)+$

j+nospam@a.b
^^^^^^^^^^^^
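
The final pattern can be exercised from Python roughly as follows. One deliberate difference, stated as an assumption: the trailing (\w+)+ is written here as \w+, which accepts the same strings but avoids a repetition nested inside another repetition, something that can backtrack heavily on long non-matching input. The helper name looks_like_email is illustrative.

```python
import re

# Final pattern from this section, with the trailing (\w+)+ simplified to \w+.
EMAIL = re.compile(r"^[\w+.%-]+@(\w+\.)+\w+$")

def looks_like_email(text):
    """True if the text has the rough shape user@name.name (not a full validator)."""
    return EMAIL.match(text) is not None

samples = ["joe@blow", "john@example.com", "joe@a.b.c", "j.s@a.b",
           "j+nospam@a.b", "joe@exam-ple.com"]
for sample in samples:
    print(sample, looks_like_email(sample))
```

The last sample is rejected because \w does not include the hyphen, which shows that this pattern is still only an approximation of real address rules.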

4.2. IPv4 Address

The following matches an IPv4 address in the form n.n.n.n: exactly 4 components, each restricted to 1 to 3 digits. This is a close match for an IP address, but not a complete one, because each number should also be between 0 and 255 (999.1.1.1 matches, for example). Such range conditions are harder to express in regular expressions.

Pattern: ^([0-9]{1,3}\.){3}[0-9]{1,3}$

1

1.2

1.2.3

1.2.3.4
^^^^^^^
1.2.3.4.5

1.2.3.4.
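
One way around the range limitation, sketched here under the assumption that the check runs in Python, is to let the regular expression verify the shape and then test each component numerically. The function name is_ipv4 is illustrative.

```python
import re

# Shape only: four groups of 1 to 3 digits separated by periods.
IPV4_SHAPE = re.compile(r"^([0-9]{1,3}\.){3}[0-9]{1,3}$")

def is_ipv4(text):
    """Check the shape with the regex, then the 0-255 range in plain Python."""
    if not IPV4_SHAPE.fullmatch(text):
        return False
    return all(int(part) <= 255 for part in text.split("."))

for sample in ["1.2.3", "1.2.3.4", "1.2.3.4.5", "1.2.3.4.", "999.1.1.1"]:
    print(sample, is_ipv4(sample))
```

Here fullmatch is used so that a trailing newline is not accepted, which $ alone would allow.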

4.3. Number in the Range 000-255

For this, we need some special care to exclude numbers greater than 255, but include 199, etc.

Pattern: ^([01][0-9][0-9]|2[0-4][0-9]|25[0-5])$

000
^^^
001
^^^
256

255
^^^
199
^^^
290

To also include 0, 99, etc. (1-digit and 2-digit numbers), use the one below. Note that it accepts zero-padded forms such as 00 and 007 as well.

Pattern: ^([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])$

1
^
0
^
99
^^
100
^^^
300
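
A quick way to gain confidence in a range pattern like this is to compare it against ordinary integer comparison over many values. The sketch below does that for 0 to 999; in_range_0_255 is an illustrative name.

```python
import re

# Pattern from the text: 1, 2, or 3 digits, limited to 0-255.
OCTET = re.compile(r"^([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])$")

def in_range_0_255(text):
    """True if the text is a 1 to 3 digit number no greater than 255."""
    return OCTET.match(text) is not None

# Collect every value where the regex and the numeric test disagree.
mismatches = [n for n in range(1000) if in_range_0_255(str(n)) != (n <= 255)]
print(mismatches)   # an empty list means the two checks agree
```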

4.4. ISO 8601 Date

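An ISO 8601 date looks like 2018-03-01: a 4-digit year, followed by a 2-digit month, followed by a 2-digit day of the month, separated by hyphens (-). The pattern checks that the month is 01-12 and the day is at most 31; it cannot check how many days a particular month has.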

Pattern: ^\d{4}-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$

1993-03-31
^^^^^^^^^^
1994-13-29

1995-12-32

1996-12-31
^^^^^^^^^^
1997-00-13
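
A pattern alone cannot tell that 2018-02-31 is not a real date. The sketch below, an illustration rather than a definitive validator, uses a day alternative of 0[1-9]|[12][0-9]|3[01] so that day 00 is rejected, and then hands the captured numbers to datetime.date for the calendar check. The name parse_iso_date is hypothetical.

```python
import re
from datetime import date

# Groups capture year, month, and day; the day part excludes 00.
ISO_DATE = re.compile(r"^(\d{4})-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$")

def parse_iso_date(text):
    """Return a datetime.date, or None if the text is not a valid calendar date."""
    m = ISO_DATE.match(text)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Shape was fine, but the calendar disagrees (e.g. February 31).
        return None

for sample in ["1993-03-31", "1994-13-29", "1995-12-32", "2018-02-31"]:
    print(sample, parse_iso_date(sample))
```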

Conclusion

We covered details of regular expression character classes in this article. The character classes make it somewhat easier to include or exclude specific classes of characters from strings. These classes include: digits, alpha-numeric, and white space.