# Regular Expression Examples

Learn how to build regular expression patterns for parsing floating point numbers, phone numbers, email addresses and simple HTML.

## 1. Introduction

Regular expressions are a powerful tool for text-parsing tasks. Whether the problem is a simple search for a pattern or splitting a string into components, regular expressions are widely used.

Let us learn how to build regular expressions for matching some common patterns. The purpose of this article is to show how such patterns are built up step by step, so that you can understand them and build your own.

## 2. A Review of Regular Expressions

Before we begin building regular expressions, let us briefly review the basics. There are many excellent regular expression guides on the Internet. Wikipedia provides a general outline of regular expressions, including their history. Mozilla’s site provides a regular expression reference that applies to regexes in JavaScript. And the documentation of Java’s Pattern class describes regex patterns as used in Java.

## 3. Floating Point Numbers

We will now attempt to build a regular expression for matching floating point numbers in a variety of formats. In the output below, an `x` in the first column marks an input that the regex does not match. Let us now gradually build the regex to match the input shown.

A decimal integer can easily be matched with a regular expression of the form `\d+`. This is suitable for a number without a fractional part or a preceding sign.

```
Regex: \d+
  |  1 |2389
x |  2 |3.14
x |  3 |23.
x |  4 |45.0
x |  5 |.388
x |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
```
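The result tables in this article can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming the tables reflect whole-string matching (an assumption, but the only reading under which `\d+` rejects `3.14`). The helper name `match_table` is hypothetical and not part of the original article.

```python
import re

def match_table(pattern, inputs):
    """Return one table row per input; non-matching rows are flagged with 'x'."""
    compiled = re.compile(pattern)
    rows = []
    for number, text in enumerate(inputs, start=1):
        # fullmatch succeeds only if the pattern consumes the entire string.
        flag = " " if compiled.fullmatch(text) else "x"
        rows.append(f"{flag} |{number:3d} |{text}")
    return rows

samples = ["2389", "3.14", "23.", "45.0", ".388", "0.564", "278e-8", "399e4"]
for row in match_table(r"\d+", samples):
    print(row)
```

Passing any of the later regexes to the same helper should give the corresponding table.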

### 3.1. Fractional Part

Let us update the regex to handle the fractional part. We append `\.\d+`, which matches a decimal point followed by one or more digits.

```
Regex: \d+\.\d+
x |  1 |2389
  |  2 |3.14
x |  3 |23.
  |  4 |45.0
x |  5 |.388
  |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
```

To handle plain integers, let us make the fractional part optional by using the construct `(\.\d+)?`.

```
Regex: \d+(\.\d+)?
  |  1 |2389
  |  2 |3.14
x |  3 |23.
  |  4 |45.0
x |  5 |.388
  |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
```

How about making the fractional digits optional? After all, `32.` is still a valid floating point number!

```
Regex: \d+(\.\d*)?
  |  1 |2389
  |  2 |3.14
  |  3 |23.
  |  4 |45.0
x |  5 |.388
  |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
```

Looks like we need to make the integer part optional too.

```
Regex: (\d+)?(\.\d*)?
  |  1 |2389
  |  2 |3.14
  |  3 |23.
  |  4 |45.0
  |  5 |.388
  |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
  |  9 |
```

However, if both the integer and fractional parts are optional, our regex also matches an empty string, which is not good at all! So let us make one of them mandatory. Here is one way to achieve that.

```
Regex: (\d+(\.\d*)?|(\d+)?\.\d+)
  |  1 |2389
  |  2 |3.14
  |  3 |23.
  |  4 |45.0
  |  5 |.388
  |  6 |0.564
x |  7 |278e-8
x |  8 |399e4
x |  9 |
```
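To see what the alternation changes, the two patterns can be compared side by side. This is a small Python sketch, assuming whole-string matching via `re.fullmatch`; the constant names are illustrative only.

```python
import re

# Both parts optional: nothing forces the pattern to consume any digit.
BOTH_OPTIONAL = re.compile(r"(\d+)?(\.\d*)?")
# Alternation: each branch requires at least one digit.
ONE_REQUIRED = re.compile(r"(\d+(\.\d*)?|(\d+)?\.\d+)")

for text in ["", ".", "23.", ".388"]:
    loose = BOTH_OPTIONAL.fullmatch(text) is not None
    strict = ONE_REQUIRED.fullmatch(text) is not None
    print(repr(text), loose, strict)
```

Besides the empty string, the all-optional version also accepts a lone `.`, which the alternation rejects because each branch demands at least one digit.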

### 3.2. Exponents

Looks like we almost got it. Let us now update the regex for covering the exponent part. We append `(e[-+]?\d+)?` to the above expression.

```
Regex: (\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?
  |  1 |2389
  |  2 |3.14
  |  3 |23.
  |  4 |45.0
  |  5 |.388
  |  6 |0.564
  |  7 |278e-8
  |  8 |399e4
x |  9 |
```

### 3.3. Optional Sign

And that covers most of it. Let us now update for an optional preceding sign. We prepend `[-+]?` to the regex.

```
Regex: [-+]?(\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?
  |  1 |2389
  |  2 |3.14
  |  3 |23.
  |  4 |45.0
  |  5 |.388
  |  6 |0.564
  |  7 |278e-8
  |  8 |399e4
  |  9 |-69
  | 10 |+44.474774e-499
x | 11 |
```

And here is the complete regex for matching a floating point number: `[-+]?(\d+(\.\d*)?|(\d+)?\.\d+)(e[-+]?\d+)?`
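As a readability aid, the complete regex can be written out piece by piece. The sketch below uses Python's `re.VERBOSE` flag for this; the names `FLOAT` and `is_float` are hypothetical, and whole-string matching is assumed.

```python
import re

# Same pattern as the complete regex above, spread out with re.VERBOSE so each
# part can carry a comment. Whitespace and "#" comments are ignored in this mode.
FLOAT = re.compile(r"""
    [-+]?                # optional sign
    (
        \d+(\.\d*)?      # integer part, optionally a point and more digits
      | (\d+)?\.\d+      # or: optional integer part, a point, and digits
    )
    (e[-+]?\d+)?         # optional exponent with its own optional sign
""", re.VERBOSE)

def is_float(text):
    """Return True if the whole string is a floating point number per FLOAT."""
    return FLOAT.fullmatch(text) is not None

accepted = ["2389", "3.14", "23.", "45.0", ".388", "0.564",
            "278e-8", "399e4", "-69", "+44.474774e-499"]
rejected = ["", ".", "e4", "1E5", "--1"]

print(all(is_float(s) for s in accepted))  # sample inputs from the tables
print(any(is_float(s) for s in rejected))  # inputs the pattern should refuse
```

Note that the exponent marker is a lowercase `e` only, so `1E5` is rejected unless the class is widened to `[eE]` or the pattern is compiled case-insensitively.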

## 4. Phone Numbers

Next up in the list of commonly used regex patterns is the phone number. Let us start with a US-style phone number written as a plain run of digits, matched using `\d+`.

```
Regex: \d+
  |  1 |4159892626
x |  2 |(823)383-2245
x |  3 |377-333-3459
x |  4 |945 330-3322
x |  5 |447.332.4455
  |  6 |18008982334
x |  7 |1-800-238-4767
x |  8 |+14669374402
```

Let us allow for separators between the area code and the rest of the number.

```
Regex: (\d{3})[-. ](\d{3})[-. ](\d{4})
x |  1 |4159892626
x |  2 |(823)383-2245
  |  3 |377-333-3459
  |  4 |945 330-3322
  |  5 |447.332.4455
x |  6 |18008982334
x |  7 |1-800-238-4767
x |  8 |+14669374402
```

Oops! This no longer matches a number without separators. Let us make the separators optional.

```
Regex: (\d{3})[-. ]?(\d{3})[-. ]?(\d{4})
  |  1 |4159892626
x |  2 |(823)383-2245
  |  3 |377-333-3459
  |  4 |945 330-3322
  |  5 |447.332.4455
x |  6 |18008982334
x |  7 |1-800-238-4767
x |  8 |+14669374402
```

How about accounting for an area code enclosed in parentheses?

```
Regex: \(?(\d{3})[-\). ]?(\d{3})[-. ]?(\d{4})
  |  1 |4159892626
  |  2 |(823)383-2245
  |  3 |377-333-3459
  |  4 |945 330-3322
  |  5 |447.332.4455
x |  6 |18008982334
x |  7 |1-800-238-4767
x |  8 |+14669374402
```
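One detail deserves care when listing separators inside a character class: a hyphen placed between two characters is read as a range. A class written as `[\)-. ]` therefore covers every character from `)` to `.`, which happens to include `-` but also `*`, `+` and `,`. Writing the hyphen first, as in `[-\). ]`, keeps it literal. The Python sketch below illustrates the difference; the constant names are hypothetical.

```python
import re

RANGE_FORM = re.compile(r"[\)-. ]")    # ")-." is parsed as a range from ")" to "."
HYPHEN_FIRST = re.compile(r"[-\). ]")  # leading hyphen is a literal hyphen

for ch in [")", "-", ".", " ", "*", "+", ","]:
    as_range = RANGE_FORM.fullmatch(ch) is not None
    literal = HYPHEN_FIRST.fullmatch(ch) is not None
    print(repr(ch), as_range, literal)
```

Both forms accept the four intended separators; only the range form also lets `*`, `+` and `,` through.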

Let us now take care of the country code, possibly preceded by a `+`.

```
  |  1 |4159892626
  |  2 |(823)383-2245
  |  3 |377-333-3459
  |  4 |945 330-3322
  |  5 |447.332.4455
  |  6 |18008982334
x |  7 |1-800-238-4767
  |  8 |+14669374402
```

Looks like we need to update the separator between the country code and the area code.

```
Regex: (\+?\d+)?[-\(.]?(\d{3})[-\). ]?(\d{3})[-. ]?(\d{4})
  |  1 |4159892626
  |  2 |(823)383-2245
  |  3 |377-333-3459
  |  4 |945 330-3322
  |  5 |447.332.4455
  |  6 |18008982334
  |  7 |1-800-238-4767
  |  8 |+14669374402
  |  9 |+1.800.399.3378
```

A note about this regex for parsing phone numbers: it accepts phone numbers of the form `1-800)456 2334`, which might be characterized as ugly, if not downright wrong. Correcting for such cases would make the regex more complex, so it may be better to cover 90% of the cases and not worry about these edge cases.
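Since the pattern wraps each part of the number in a capture group, a match can also be split into components. The following Python sketch assumes whole-string matching and writes the hyphen first in each character class so that it is taken literally; the function name `split_phone` and the dictionary keys are hypothetical labels chosen for illustration.

```python
import re

PHONE = re.compile(r"(\+?\d+)?[-\(.]?(\d{3})[-\). ]?(\d{3})[-. ]?(\d{4})")

def split_phone(text):
    """Return the captured parts of a phone number, or None if it does not match."""
    match = PHONE.fullmatch(text)
    if match is None:
        return None
    country, area, exchange, line = match.groups()
    # country is None when the optional first group did not take part in the match.
    return {"country": country, "area": area, "exchange": exchange, "line": line}

for text in ["4159892626", "(823)383-2245", "1-800-238-4767",
             "+1.800.399.3378", "1-800)456 2334", "12345"]:
    print(text, "->", split_phone(text))
```

The odd-looking `1-800)456 2334` is split like any other number, which shows the leniency described above.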

## 5. Email Addresses

An email address has the general form: `name@company.com`. Let us start with this form and progressively enhance it.

```
Regex: \w+@\w+\.\w+
  |  1 |j@x.org
  |  2 |abc@joe.com
x |  3 |abc@joe-blow.org
```

The first problem is that characters like `+` and `-` are valid within names (both before the @ and after).

```
Regex: [-\+\w]+@[-+\w]+\.[-+\w]+
  |  1 |j@x.org
  |  2 |abc@joe.com
  |  3 |abc@joe-blow.org
```

Next, the domain name part can have two or three components separated by periods (`.`). Here is one way to take care of it.

```
Regex: [-\+\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?
  |  1 |j@x.org
  |  2 |abc@joe.com
  |  3 |abc@joe-blow.org
  |  4 |abc+def@joe.co.uk
```

We have one more issue. The name can include a period. Update the part before the @ to reflect this condition.

```
Regex: [-\+\.\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?
  |  1 |j@x.org
  |  2 |abc@joe.com
  |  3 |abc@joe-blow.org
  |  4 |abc+def@joe.co.uk
  |  5 |abc.d+efg@joe.com
```
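The final pattern can be exercised against the sample addresses plus a few that should fail. This is a minimal Python sketch, assuming whole-string matching; `looks_like_email` is a hypothetical helper name.

```python
import re

EMAIL = re.compile(r"[-\+\.\w]+@[-+\w]+\.[-+\w]+(\.[-+\w]+)?")

def looks_like_email(text):
    """Return True if the whole string fits the email pattern built above."""
    return EMAIL.fullmatch(text) is not None

samples = ["j@x.org", "abc@joe.com", "abc@joe-blow.org",
           "abc+def@joe.co.uk", "abc.d+efg@joe.com"]
for address in samples:
    print(address, looks_like_email(address))

# Rejected: no period in the domain, or more than three domain components.
print(looks_like_email("abc@joe"), looks_like_email("abc@a.b.c.d"))
# Accepted although dubious: the name part may consist of periods only.
print(looks_like_email("..@joe.com"))
```

As with the phone number pattern, the regex is deliberately lenient: it checks the overall shape of an address, not whether the address is valid.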

## 6. Simple HTML

Parsing HTML is complex and not possible entirely with regular expressions. It is better to use a library suitable for the programming language to accomplish this task. Having said that, there are a few situations where using a regex to parse HTML might be useful. The code below provides a solution with the following caveats.

• Does not handle common HTML errors such as missing start or end tags.
• XML-style self-closing tags (e.g. `<br/>`) are not accepted.
• These characters must be properly escaped in the content: `<`, `>` and `&`.

Let us start with a few simple situations and enhance the regex.

```
Regex: <(\w+)>[^<]*</\1>
  |  1 |<p>Hello world</p>
```
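The `\1` in this pattern is a backreference: it must repeat whatever the first group `(\w+)` captured, so the end tag has to carry the same name as the start tag. A small Python sketch, assuming whole-string matching and using a hypothetical helper name `element_name`:

```python
import re

ELEMENT = re.compile(r"<(\w+)>[^<]*</\1>")

def element_name(text):
    """Return the tag name if text is a single simple element, else None."""
    match = ELEMENT.fullmatch(text)
    return match.group(1) if match else None

for text in ["<p>Hello world</p>",        # start and end tags agree
             "<p>Hello world</b>",        # end tag differs from start tag
             '<p class="x">Hello</p>']:   # start tag carries an attribute
    print(text, "->", element_name(text))
```

Only the first input yields a tag name; the mismatched end tag and the start tag with an attribute are both rejected by the regex as it stands.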

This does not accept attributes in the start tag. Not exactly HTML, eh? Well let us cover that case.

```
Regex: <(\w+)[^>]*>[^<]*</\1>
  |  1 |<p>Hello world</p>
```

The regex for the content between the HTML tags is unnecessarily strict. As it stands, the regex does not allow nested HTML such as `<p>Hello <b>there</b></p>`. So let us fix it.

```
Regex: <(\w+)[^>]*>.*</\1>