1. Introduction
Regular expressions provide a powerful way to handle text-parsing tasks. Whether the problem is a simple search for a pattern or splitting a string into components, regular expressions are widely used today.
Let us learn how to build regular expressions for matching some common patterns. The purpose of this article is to teach you how to build regex patterns so you can understand and build your own.
2. A Review of Regular Expressions
Before we begin building regular expressions, let us review the basics a bit. There are many excellent regular expression guides on the Internet. Wikipedia provides a general outline of regular expressions, including their history. Mozilla's site provides a regular expression reference which applies to regexes in JavaScript. And the documentation of Java's Pattern class describes regex patterns as used in Java.
3. Floating Point Numbers
We will now attempt to build a regular expression for matching floating point numbers in a variety of formats. In the output below, we see an x in the first column when input is not matched. Let us now gradually build the regex to match the input shown.
A decimal integer can easily be matched with a regular expression of the form \d+. This is suitable for a number without a fractional part or a preceding sign.
Regex: \d+
  | 1 |2389
x | 2 |3.14
x | 3 |23.
x | 4 |45.0
x | 5 |.388
x | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
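Tables like the one above can be reproduced with a few lines of code. The following is a minimal sketch in Python, which is not the language used to produce the original output. The helper name `report` is hypothetical, and `re.fullmatch` is used on the assumption that the whole input must match the pattern.

```python
import re

INPUTS = ["2389", "3.14", "23.", "45.0", ".388", "0.564", "278e-8", "399e4"]

def report(pattern, inputs):
    """Return one table row per input; an x marks inputs that do not match."""
    rows = []
    for number, text in enumerate(inputs, start=1):
        # fullmatch succeeds only if the entire input matches the pattern
        mark = " " if re.fullmatch(pattern, text) else "x"
        rows.append(f"{mark} | {number} |{text}")
    return rows

if __name__ == "__main__":
    print(r"Regex: \d+")
    print("\n".join(report(r"\d+", INPUTS)))
```

Passing a different pattern to `report` should reproduce the later tables in this section, provided the same whole-input matching assumption holds.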
3.1. Fractional Part
Let us update the regex to handle the fractional part. We append \.\d+, which matches a decimal point followed by one or more digits.
Regex: \d+\.\d+
x | 1 |2389
  | 2 |3.14
x | 3 |23.
  | 4 |45.0
x | 5 |.388
  | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
To handle plain integers, let us make the fractional part optional by using the construct (\.\d+)?.
Regex: \d+(\.\d+)?
  | 1 |2389
  | 2 |3.14
x | 3 |23.
  | 4 |45.0
x | 5 |.388
  | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
How about making the fractional digits optional? After all, 32. is still a valid floating point number!
Regex: \d+(\.\d*)?
  | 1 |2389
  | 2 |3.14
  | 3 |23.
  | 4 |45.0
x | 5 |.388
  | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
Looks like we need to make the integer part optional too.
Regex: (\d+)?(\.\d*)?
  | 1 |2389
  | 2 |3.14
  | 3 |23.
  | 4 |45.0
  | 5 |.388
  | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
  | 9 |
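However, if both the integer and fractional parts are optional, our regex also matches an empty string, which is not good at all! So let us make one of them mandatory. Here is one way to achieve that: accept either an integer part with an optional fraction, or an optional integer part followed by a decimal point and at least one digit.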
Regex: (\d+(\.\d*)?|(\d+)?\.\d+)
  | 1 |2389
  | 2 |3.14
  | 3 |23.
  | 4 |45.0
  | 5 |.388
  | 6 |0.564
x | 7 |278e-8
x | 8 |399e4
x | 9 |
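The effect of the alternation can be checked directly. The sketch below uses Python's `re` module as an illustration. The function name `is_number` is hypothetical, and whole-input matching via `fullmatch` is an assumption about how the table was produced.

```python
import re

# Either: digits with an optional fraction (the point may stand alone after digits),
# or: optional digits, a point, and at least one fractional digit.
NUMBER = re.compile(r"(\d+(\.\d*)?|(\d+)?\.\d+)")

def is_number(text):
    """True if the whole text is accepted by the pattern."""
    return NUMBER.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["2389", "23.", ".388", "", "."]:
        print(repr(sample), is_number(sample))
```

Because each branch of the alternation requires at least one digit, neither the empty string nor a lone decimal point is accepted.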
3.2. Exponents
Looks like we have almost got it. Let us now update the regex to cover the exponent part. We append (e[-+]?\d+)? to the above expression.
Regex: (\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?
  | 1 |2389
  | 2 |3.14
  | 3 |23.
  | 4 |45.0
  | 5 |.388
  | 6 |0.564
  | 7 |278e-8
  | 8 |399e4
x | 9 |
3.3. Optional Sign
And that covers most of it. Let us now allow for an optional preceding sign. We prepend [-+]? to the regex.
Regex: [-+]?(\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?
  | 1 |2389
  | 2 |3.14
  | 3 |23.
  | 4 |45.0
  | 5 |.388
  | 6 |0.564
  | 7 |278e-8
  | 8 |399e4
  | 9 |-69
  | 10 |+44.474774e-499
x | 11 |
And here is the complete regex for matching a floating point number: [-+]?(\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?
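To make the three parts of the complete regex easier to see, here is a hedged sketch that writes the same pattern in Python's verbose mode, where whitespace and comments inside the pattern are ignored. The helper `parse_float` is a hypothetical name introduced only for this example. It relies on Python's built-in `float` to convert accepted input.

```python
import re

FLOAT = re.compile(r"""
    [-+]?                # optional sign
    (
        \d+(\.\d*)?      # digits, optionally followed by a point and more digits
      |
        (\d+)?\.\d+      # optional digits, then a point and mandatory digits
    )
    (e[-+]?\d+)?         # optional exponent, lowercase e only as in the article
    """, re.VERBOSE)

def parse_float(text):
    """Return the float value of text, or None if the pattern rejects it."""
    if FLOAT.fullmatch(text) is None:
        return None
    return float(text)

if __name__ == "__main__":
    for sample in ["3.14", "23.", ".388", "-69", "278e-8", "", "1E5"]:
        print(repr(sample), parse_float(sample))
```

Note that the pattern as written rejects an uppercase exponent marker such as 1E5. Accepting it would need [eE] or a case-insensitive flag, which goes beyond what the article's regex does.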
4. Phone Numbers
Next up in the list of commonly used regex patterns is the phone number. Let us start with a US-style phone number matched using \d+.
Regex: \d+
  | 1 |4159892626
x | 2 |(823)383-2245
x | 3 |377-333-3459
x | 4 |945 330-3322
x | 5 |447.332.4455
  | 6 |18008982334
x | 7 |1-800-238-4767
x | 8 |+14669374402
Let us allow for separators between the area code and the parts of the number that follow it.
Regex: (\d{3})[-. ](\d{3})[-. ](\d{4})
x | 1 |4159892626
x | 2 |(823)383-2245
  | 3 |377-333-3459
  | 4 |945 330-3322
  | 5 |447.332.4455
x | 6 |18008982334
x | 7 |1-800-238-4767
x | 8 |+14669374402
Oops! No longer matches a number without separators. Let us make the separators optional.
Regex: (\d{3})[-. ]?(\d{3})[-. ]?(\d{4})
  | 1 |4159892626
x | 2 |(823)383-2245
  | 3 |377-333-3459
  | 4 |945 330-3322
  | 5 |447.332.4455
x | 6 |18008982334
x | 7 |1-800-238-4767
x | 8 |+14669374402
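How about allowing the area code to be enclosed in parentheses? Note that the hyphen is placed first inside the character class so that it is taken literally; between two other characters it would denote a range.
Regex: \(?(\d{3})[-). ]?(\d{3})[-. ]?(\d{4})
  | 1 |4159892626
  | 2 |(823)383-2245
  | 3 |377-333-3459
  | 4 |945 330-3322
  | 5 |447.332.4455
x | 6 |18008982334
x | 7 |1-800-238-4767
x | 8 |+14669374402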
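One detail worth checking when writing a separator class is where the hyphen sits. As a hedged illustration in Python: a class written as [\)-. ] treats \)-. as a range from ) to ., so it also accepts *, + and , as separators, whereas a class with the hyphen first, such as [-). ], accepts only the four intended characters. The names `RANGE_CLASS`, `LITERAL_CLASS` and `accepted` are hypothetical.

```python
import re

# Hyphen in the middle: the range ')' through '.', plus a space
RANGE_CLASS = re.compile(r"[\)-. ]")
# Hyphen first: a literal '-', ')', '.' or space
LITERAL_CLASS = re.compile(r"[-). ]")

def accepted(pattern, candidates="()*+,-. "):
    """Return the candidate characters that the one-character pattern accepts."""
    return "".join(ch for ch in candidates if pattern.fullmatch(ch))

if __name__ == "__main__":
    print(repr(accepted(RANGE_CLASS)))
    print(repr(accepted(LITERAL_CLASS)))
```

Both forms accept the sample phone numbers in this article; the difference only shows up on inputs with unusual separators.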
Let us now take care of the country code, possibly preceded by a +, by prepending (\+?\d+)? to the regex.
  | 1 |4159892626
  | 2 |(823)383-2245
  | 3 |377-333-3459
  | 4 |945 330-3322
  | 5 |447.332.4455
  | 6 |18008982334
x | 7 |1-800-238-4767
  | 8 |+14669374402
Looks like we need to allow a separator between the country code and the area code.
Regex: (\+?\d+)?[-(.]?(\d{3})[-). ]?(\d{3})[-. ]?(\d{4})
  | 1 |4159892626
  | 2 |(823)383-2245
  | 3 |377-333-3459
  | 4 |945 330-3322
  | 5 |447.332.4455
  | 6 |18008982334
  | 7 |1-800-238-4767
  | 8 |+14669374402
  | 9 |+1.800.399.3378
A note about this regex for parsing phone numbers: it accepts numbers of the form 1-800)456 2334, which might be characterized as ugly, if not downright wrong. Correcting for such cases would make the regex more complex, so it may be better to cover 90% of the cases and not worry about these edge cases.
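Since the pattern uses capturing groups, a match also splits the number into its components. The sketch below illustrates this in Python. The function name `split_phone` is hypothetical, and the character classes are written with the hyphen first so that it is read literally, which is an assumption of this example.

```python
import re

# Groups: 1 = country code (optional), 2 = area code, 3 = exchange, 4 = line number
PHONE = re.compile(r"(\+?\d+)?[-(.]?(\d{3})[-). ]?(\d{3})[-. ]?(\d{4})")

def split_phone(text):
    """Return (country, area, exchange, line) or None if text is not accepted."""
    match = PHONE.fullmatch(text)
    return match.groups() if match else None

if __name__ == "__main__":
    for sample in ["+1.800.399.3378", "(823)383-2245", "1-800)456 2334", "12345"]:
        print(sample, split_phone(sample))
```

When the country code is absent, the first group does not participate in the match and is reported as None. The unbalanced form 1-800)456 2334 is still accepted, as noted above.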
5. Email Addresses
An email address has the general form name@company.com. Let us start with this form and progressively enhance it.
Regex: \w+@\w+\.\w+
  | 1 |j@x.org
  | 2 |abc@joe.com
x | 3 |abc@joe-blow.org
The first problem is that characters like + and - are valid within names (both before the @ and after).
Regex: [-\+\w]+@[-+\w]+\.[-+\w]+
  | 1 |j@x.org
  | 2 |abc@joe.com
  | 3 |abc@joe-blow.org
Next, the domain name part can have two or three components separated by periods (.). Here is one way to take care of it.
Regex: [-\+\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?
  | 1 |j@x.org
  | 2 |abc@joe.com
  | 3 |abc@joe-blow.org
  | 4 |abc+def@joe.co.uk
We have one more issue. The name can include a period. Update the part before the @ to reflect this condition.
Regex: [-\+\.\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?
  | 1 |j@x.org
  | 2 |abc@joe.com
  | 3 |abc@joe-blow.org
  | 4 |abc+def@joe.co.uk
  | 5 |abc.d+efg@joe.com
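As a quick check of the final email pattern, here is a minimal sketch in Python. The function name `looks_like_email` is hypothetical, and whole-input matching is assumed as before.

```python
import re

# Name part: word characters plus '-', '+' and '.'
# Domain part: two or three components separated by periods
EMAIL = re.compile(r"[-\+\.\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?")

def looks_like_email(text):
    """True if the whole text is accepted by the pattern."""
    return EMAIL.fullmatch(text) is not None

if __name__ == "__main__":
    for sample in ["abc.d+efg@joe.com", "abc+def@joe.co.uk", "abc@joe", "a@b.c.d.e"]:
        print(sample, looks_like_email(sample))
```

Because the optional group allows only one extra component, a domain with four or more components is rejected. Whether that matters depends on the addresses you expect to see.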
6. Simple HTML
Parsing HTML is complex and not possible entirely with regular expressions. It is better to use a library suitable for the programming language to accomplish this task. Having said that, there are a few situations where using a regex to parse HTML might be useful. The code below provides a solution with the following caveats.
- Does not handle common HTML errors such as missing starting or ending tags.
- XML type start-tag only (ex: <br/>) is not accepted.
- These characters must be escaped properly: <, > and &.
Let us start with a few simple situations and enhance the regex.
Regex: <(\w+)>[^<]*</\1>
  | 1 |<p>Hello world</p>
This does not accept attributes in the start tag. Not exactly HTML, eh? Well let us cover that case.
Regex: <(\w+)[^>]*>[^<]*</\1>
  | 1 |<p>Hello world</p>
  | 2 |<a href="http://www.google.com">Google it!</a>
The regex for the content between the HTML tags is unnecessarily strict. As it stands, the regex does not allow nested HTML such as <p>Hello <b>there</b></p>. So let us fix it by allowing any characters between the start and end tags.
Regex: <(\w+)[^>]*>.*</\1>
  | 1 |<p>Hello world</p>
  | 2 |<a href="http://www.google.com">Google it!</a>
  | 3 |<p>Hello <b>world</b></p>
This is about the extent of the HTML that can be parsed with a simple regex.
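The backreference \1 in the last pattern is what ties the end tag to the start tag. The following sketch, in Python and with the hypothetical helper name `tag_name`, shows the captured tag name for a few inputs and a case where the tags do not agree.

```python
import re

# Group 1 captures the tag name; \1 requires the same name in the end tag
ELEMENT = re.compile(r"<(\w+)[^>]*>.*</\1>")

def tag_name(html):
    """Return the outer tag name if the whole text is one element, else None."""
    match = ELEMENT.fullmatch(html)
    return match.group(1) if match else None

if __name__ == "__main__":
    samples = [
        "<p>Hello world</p>",
        '<a href="http://www.google.com">Google it!</a>',
        "<p>Hello <b>world</b></p>",
        "<p>Hello</b>",
        "<br/>",
    ]
    for sample in samples:
        print(sample, tag_name(sample))
```

Two limitations are worth keeping in mind with this sketch: .* is greedy, so two sibling elements with the same tag name are matched as a single span, and in Python . does not match a newline unless re.DOTALL is given.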
7. Conclusion
We covered some regular expression samples for common use cases, including parsing floating point numbers, phone numbers, email addresses and HTML code. The emphasis was on learning how to write regex patterns rather than on handing out ready-made recipes.