“Anyone who says he can see through women is missing a lot.” ― Groucho Marx
1. Introduction
New to regular expressions? Start here.
In the last article, we covered some introductory aspects of regular expressions in Python. We now continue with the details of character classes.
2. Character Classes
We have previously covered specifying ranges of characters using something like [a-z] for all the lowercase letters and [0-9] for all the digits. These are more easily specified using character classes.
2.1. Digits (\d)
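Use \d to specify all the digits, instead of [0-9]. In Python 3, \d in a str pattern also matches decimal digits from other scripts; pass the re.ASCII flag to restrict it to exactly [0-9].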
Pattern: age=\d+
age=23
^^^^^^
age=-1
age=n/a
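Note that this is not general number parsing, since the value -1 is not accepted. To match signed numbers, use:
Pattern: ^[-+]\d+$
1
-1
^^
+20
^^^
25d
This pattern requires a leading sign, which is why the plain 1 is not matched. To accept unsigned numbers as well, make the sign optional: ^[-+]?\d+$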
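As a minimal sketch of how these patterns behave in Python's re module, the snippet below runs them against the sample strings above. The helper name matched_text is only for illustration and is not part of the re module.

```python
import re

def matched_text(pattern, text):
    """Return the part of text matched by pattern, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# \d+ needs at least one digit right after "age="
print(matched_text(r"age=\d+", "age=23"))    # age=23
print(matched_text(r"age=\d+", "age=-1"))    # None
print(matched_text(r"age=\d+", "age=n/a"))   # None

# [-+] requires a sign; [-+]? makes the sign optional
print(matched_text(r"^[-+]\d+$", "1"))       # None
print(matched_text(r"^[-+]\d+$", "-1"))      # -1
print(matched_text(r"^[-+]?\d+$", "1"))      # 1
print(matched_text(r"^[-+]?\d+$", "25d"))    # None
```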
2.2. Non-Digits (\D)
Reverse the sense of \d by specifying \D, which matches only non-digits. This is equivalent to [^0-9].
Pattern: name=\D+
name=23
name=john smith
^^^^^^^^^^^^^^^
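The same check can be sketched in Python as follows. The second part uses a made-up sample string to show that \D accepts anything that is not a digit, including spaces and punctuation.

```python
import re

name_pattern = re.compile(r"name=\D+")

for line in ["name=23", "name=john smith"]:
    m = name_pattern.search(line)
    print(line, "->", m.group(0) if m else None)

# \D is the complement of \d, so spaces and punctuation match too
print(re.findall(r"\D+", "tel: 555-1234"))   # ['tel: ', '-']
```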
2.3. Alpha-Numeric (\w)
Match all alphabetic, numeric and underscore characters with \w. For ASCII text, this is equivalent to [a-zA-Z0-9_].
A simple email address pattern (not perfect – do not use this directly).
Pattern: \w+@[\w.]+
john@
john@example
^^^^^^^^^^^^
john@example.com
^^^^^^^^^^^^^^^^
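A short sketch of this pattern in Python, using the sample strings above. The last line uses an invented string to show that \w covers the underscore but not the hyphen.

```python
import re

simple_email = re.compile(r"\w+@[\w.]+")

for text in ["john@", "john@example", "john@example.com"]:
    m = simple_email.search(text)
    print(repr(text), "->", m.group(0) if m else None)

# \w includes the underscore but not the hyphen
print(re.findall(r"\w+", "first_name, last-name"))   # ['first_name', 'last', 'name']
```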
2.4. Non-Alphanumeric (\W)
This pattern reverses the sense of the alphanumeric match above (\w). It matches [^a-zA-Z0-9_].
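One common use of \W, sketched here with invented sample strings, is to split text on anything that is not a word character, or to strip such characters out.

```python
import re

# Split on runs of non-word characters (commas, semicolons, spaces)
words = re.split(r"\W+", "alpha, beta;gamma  delta")
print(words)      # ['alpha', 'beta', 'gamma', 'delta']

# Remove every non-word character; the underscore survives because \w includes it
cleaned = re.sub(r"\W", "", "user_name-42!")
print(cleaned)    # user_name42
```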
2.5. White Space (\s) and Non White Space (\S)
Match only white space (space, tab, newline, carriage return, form feed and vertical tab) with \s. Match non white space using \S.
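As an illustration, with a made-up input string, \s+ is handy for splitting on or collapsing runs of mixed white space, and \S+ picks out the tokens in between.

```python
import re

text = "key \t= value\r\nnext line"

tokens_by_split = re.split(r"\s+", text)
tokens_by_find = re.findall(r"\S+", text)
collapsed = re.sub(r"\s+", " ", text)

print(tokens_by_split)   # ['key', '=', 'value', 'next', 'line']
print(tokens_by_find)    # ['key', '=', 'value', 'next', 'line']
print(collapsed)         # key = value next line
```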
3. Grouping and Repeating with ( and )
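Grouping uses parentheses (( and )). A quantifier placed after a group, such as + or {3}, repeats the whole group rather than a single character. See the examples below.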
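The difference a group makes to repetition can be sketched as follows. The strings are invented for illustration, and re.fullmatch is used so that the whole string must match.

```python
import re

# Without a group, + applies only to the preceding character "b"
print(re.fullmatch(r"ab+", "abab"))               # None
print(re.fullmatch(r"ab+", "abbb").group(0))      # abbb

# With a group, + repeats the whole "ab" sequence
print(re.fullmatch(r"(ab)+", "abab").group(0))    # abab

# A group also captures; a repeated group keeps only its last repetition
m = re.fullmatch(r"(\d+\.)+(\d+)", "1.2.3")
print(m.group(1), m.group(2))                     # 2. 3
```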
4. Some Examples
Let us use the patterns we learned above in some commonly encountered examples.
4.1. Email Address
A simple email address: a word, followed by @, followed by multiple names.
Pattern: \w+@(\w+\.?)+
joe@blow
^^^^^^^^
john@example.com
^^^^^^^^^^^^^^^^
As you can see, this pattern allows a domain name without even a single ., which is not right.
Here is the best I could come up with. The following pattern has been anchored at the beginning and the end for a complete match.
Pattern: ^\w+@(\w+\.)+(\w+)+$
joe@blow
john@example.com
^^^^^^^^^^^^^^^^
joe@a.b.c
^^^^^^^^^
Let us enhance the above to allow periods in the user name part (like Gmail does).
Pattern: ^[\w+.]+@(\w+\.)+(\w+)+$
joe@blow
john@example.com
^^^^^^^^^^^^^^^^
joe@a.b.c
^^^^^^^^^
j.s@a.b
^^^^^^^
Since the user name part of an email address may also contain % and -, let us enhance it some more.
Pattern: ^[\w+.%-]+@(\w+\.)+(\w+)+$
j+nospam@a.b
^^^^^^^^^^^^
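Here is a minimal sketch of the final pattern wrapped in a Python helper. The function name looks_like_email is hypothetical, and the check is only as strict as the pattern above, so treat it as a shape test rather than real address validation.

```python
import re

# The final pattern from this section, unchanged
EMAIL = re.compile(r"^[\w+.%-]+@(\w+\.)+(\w+)+$")

def looks_like_email(text):
    """Return True if text has the user@name.name shape described above."""
    return EMAIL.match(text) is not None

for sample in ["joe@blow", "john@example.com", "joe@a.b.c", "j.s@a.b", "j+nospam@a.b"]:
    print(sample, looks_like_email(sample))
```

The trailing (\w+)+ matches the same strings as a plain \w+. The simpler form is worth considering, because a quantifier nested inside another quantifier can backtrack heavily on long inputs that almost match.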
4.2. IPv4 Address
The following matches an IPv4 address in the form n.n.n.n. There are 4 components, each restricted to between 1 and 3 digits. This is a close match for an IP address, but not a complete one, because each of the numbers should be between 0 and 255. Such conditions are harder to implement in regular expressions.
Pattern: ^([0-9]{1,3}\.){3}[0-9]{1,3}$
1
1.2
1.2.3
1.2.3.4
^^^^^^^
1.2.3.4.5
1.2.3.4.
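One way to cover the missing range condition, sketched below, is to let the regular expression check the shape and leave the 0 to 255 test to ordinary integer comparison. The helper name is_ipv4 is an assumption made for this example.

```python
import re

IPV4_SHAPE = re.compile(r"^([0-9]{1,3}\.){3}[0-9]{1,3}$")

# The shape check alone accepts out-of-range components
print(bool(IPV4_SHAPE.match("999.999.999.999")))   # True

def is_ipv4(text):
    """Check the shape with the regex, then the range with integer comparison."""
    if not IPV4_SHAPE.match(text):
        return False
    return all(int(part) <= 255 for part in text.split("."))

for sample in ["1.2.3", "1.2.3.4", "1.2.3.4.", "256.1.1.1"]:
    print(sample, is_ipv4(sample))
```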
4.3. Number in the Range 000-255
For this, we need some special care to exclude numbers greater than 255, but include 199, etc.
Pattern: ^([01][0-9][0-9]|2[0-4][0-9]|25[0-5])$
000
^^^
001
^^^
256
255
^^^
199
^^^
290
To also include 0, 99, etc. (1-digit and 2-digit numbers), use the one below.
Pattern: ^([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])$
1
^
0
^
99
^^
100
^^^
300
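A quick way to gain confidence in a range pattern like this is to test it exhaustively. The sketch below does that for 0 to 999, then reuses the same alternation (as a non-capturing group, which is a choice made for this example) to build a stricter IPv4 pattern.

```python
import re

OCTET = re.compile(r"^([0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])$")

# Every integer from 0 to 999, written without padding, that the pattern accepts
accepted = [n for n in range(1000) if OCTET.match(str(n))]
print(accepted == list(range(256)))    # True

# Zero-padded forms up to three digits are accepted as well
padded = [s for s in ["000", "007", "099", "0255"] if OCTET.match(s)]
print(padded)                          # ['000', '007', '099']

# Reuse the alternation for each of the four IPv4 components
octet = r"(?:[0-9]|[0-9][0-9]|[01][0-9][0-9]|2[0-4][0-9]|25[0-5])"
STRICT_IPV4 = re.compile(rf"^(?:{octet}\.){{3}}{octet}$")
print(bool(STRICT_IPV4.match("192.168.1.255")))   # True
print(bool(STRICT_IPV4.match("192.168.1.256")))   # False
```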
4.4. ISO 8601 Date
An ISO 8601 date looks like 2018-03-01: a 4-digit year, followed by a 2-digit month, followed by a 2-digit day of the month, separated by hyphens (-).
Pattern: ^\d{4}\-(0[1-9]|1[012])\-([012][0-9]|3[01])$
1993-03-31
^^^^^^^^^^
1994-13-29
1995-12-32
1996-12-31
^^^^^^^^^^
1997-00-13
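The pattern checks only the shape of the date, so it still lets through a day of 00 or a date such as February 30. A minimal sketch of one way to close that gap is to follow the regex with date.fromisoformat from the standard library (available from Python 3.7). The helper name is_real_date is hypothetical.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"^\d{4}\-(0[1-9]|1[012])\-([012][0-9]|3[01])$")

# Shape-only check: these impossible dates still match
print(bool(ISO_DATE.match("2018-02-30")))   # True
print(bool(ISO_DATE.match("2018-03-00")))   # True

def is_real_date(text):
    """Check the shape with the regex, then let datetime validate the calendar date."""
    if not ISO_DATE.match(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

for sample in ["1993-03-31", "1994-13-29", "2018-02-30", "2018-03-00"]:
    print(sample, is_real_date(sample))
```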
Conclusion
We covered details of regular expression character classes in this article. The character classes make it somewhat easier to include or exclude specific classes of characters from strings. These classes include: digits, alpha-numeric, and white space.