Regular Expressions Revisited

This article is for anyone who wants to stop hacking on regular expressions and spend a few minutes truly understanding them, so that you can write them correctly on the first try and stop guessing.

100 years ago in Internet time (aka 2004) I wrote an article about Regular Expressions for Macromedia (think Flash) DevNet. It seems to me that most of the time when I hear developers talking about RegExes, they don’t really understand the cryptic syntax and are guessing more than understanding. Hopefully this article will help.

I have made some small tweaks to the content to make it strictly JavaScript based and have pared it down to the really interesting parts to condense it to a single page.

An easy way to test these examples and your own work: Node.js in a terminal.

Regular Expression Syntax

This section explains the special shorthand used to specify a text pattern with regular expressions. You will start by handling the simplest type of pattern and incrementally build upon it until you understand the more complicated patterns.

Simple Sequences

The simplest type of regular expression is called a “sequence” or simply a literal string pattern. The following code creates a regular expression called “re” with the literal pattern “hello”. This is used on the next line to see if that pattern exists in the text we are testing — namely, the string “hello world”. The following line shows that the expression will match any part of the string:

var re = new RegExp("hello");
re.test("hello world"); // returns: true
re.test("world hello"); // returns: true

We found the defined pattern, “hello”, in the string “hello world”. Note that the entire regular expression pattern has to be found, even though the pattern need not encompass the entire string. Also, the test() method returns the Boolean true to signify it was found. If you try the same pattern on different strings it may not be found:

var re = new RegExp("hello");
re.test("aabbcc"); // returns: false, "aabbcc" does
// not contain the pattern "hello"
// or
re.test("Hello World"); // returns: false, regular expressions are
// case sensitive by default
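
If you want to see where the pattern was found rather than just whether it was found, exec() is one way to peek. This is only a minimal sketch to run in Node.js; the variable names are mine.

```javascript
// test() only says whether the pattern occurs; exec() also says where.
const helloRe = new RegExp("hello");

const found = helloRe.exec("world hello");
console.log(found[0]);    // hello (the text that matched)
console.log(found.index); // 6 (where the match starts in the string)

const missing = helloRe.exec("Hello World");
console.log(missing);     // null, because the pattern is case sensitive
```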

Metacharacters

Characters which are used to help define a pattern, but are not searched for in their literal sense, are called metacharacters. One of the most common is the period (.), which matches any single character. Look how it is used in the following examples:

var re = new RegExp("h.llo");
re.test("hello"); // returns true, the '.' in the regular expression
// matched the 'e' in the string
re.test("h llo"); // returns true
re.test("hllo"); // returns false

Remember that metacharacters are not searched for literally. In the above example we aren’t actually trying to match a period. If you do want to match a literal period, you must escape it with another metacharacter, the backslash (\). However, because the pattern is passed to RegExp as a JavaScript string, and the backslash is also the escape character for strings, you must write it twice (\\) for a single backslash to reach the regular expression. This double backslash must precede every metacharacter that you want to use in its literal sense. For example:

var re = new RegExp("hello\\.");
re.test("hello."); // returns true
re.test("hellox"); // returns false
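
To see why the backslash has to be doubled, it helps to look at what the regular expression actually receives. The sketch below compares a single and a double backslash by printing the source property of each pattern; the variable names are just for illustration.

```javascript
// In a string, "\." is just ".", so the backslash never reaches the RegExp.
const single = new RegExp("hello\.");
// "\\." is a backslash followed by a period, so the RegExp sees an escaped period.
const doubled = new RegExp("hello\\.");
// A regular expression literal skips the string layer, so one backslash is enough.
const literal = /hello\./;

console.log(single.source);          // hello.
console.log(doubled.source);         // hello\.
console.log(literal.source);         // hello\.
console.log(single.test("hellox"));  // true, the '.' still matches any character
console.log(doubled.test("hellox")); // false, only a literal '.' matches
```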

Here are all the metacharacters you can use:

\ | ( ) [ { ^ $ * + ? .
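
If you need to search for text that may contain any of these, escaping each one by hand gets tedious. Here is a rough sketch of a helper that does it for you. The name escapeForRegExp is made up for this example, and it relies on the "g" flag and the "$&" replacement token, neither of which is covered in this article.

```javascript
// Hypothetical helper: put a backslash in front of every metacharacter in text.
function escapeForRegExp(text) {
  // The set below lists the metacharacters, plus the closing ] and }.
  // "g" replaces every occurrence; "$&" stands for the character that matched.
  const metacharacters = new RegExp("[\\\\|()\\[\\]{}^$*+?.]", "g");
  return text.replace(metacharacters, "\\$&");
}

const escaped = escapeForRegExp("$9.99 (each)");
console.log(escaped); // \$9\.99 \(each\)

const priceRe = new RegExp(escaped);
console.log(priceRe.test("only $9.99 (each) today")); // true
console.log(priceRe.test("only $9x99 (each) today")); // false, '.' is now literal
```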

Ranges

A pattern is not terribly interesting if you have to specify every character that you expect. Instead, with regular expressions you can specify a set of characters that you might expect, such as “any h or H,” “any digit,” “any uppercase letter,” or “any whitespace character.” These sets are referred to as ranges.

Using the previous example, to create a pattern that matches both “hello world” and “Hello World,” let’s define a range of characters we can expect using the special metacharacters “[” and “]”:

var re = new RegExp("[Hh]ello"); 
// now we will look for either 'h' or 'H' followed by "ello"
re.test("hello world"); // returns: true
re.test("Hello world"); // returns: true
re.test("HELLO WORLD"); // returns: false
re.test("Hhello world"); // returns: ?

What do you think is returned in the last line? Does it return true? While the range is a set of more than one character, it still only matches one character at a time. In this example we are saying that we expect exactly one of either H or h. So will it fail to match and return false? Remember that the regular expression tries to match any part of the string, so in this case it finds a match starting with the second character in the string; that is, by skipping over the H and starting with the h. Here are some more examples:

var re = new RegExp("[Hh]ello");
re.test("asdfhelloasdfasdf"); // returns: true
re.test("hhhhHelloooooo"); // returns: true
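
To convince yourself that the match really starts partway into the string, you can ask exec() for the position. This is a small sketch with made-up variable names.

```javascript
const greetRe = new RegExp("[Hh]ello");

// The leading 'H' is skipped; the match starts at the second character.
const first = greetRe.exec("Hhello world");
console.log(first[0], first.index);   // hello 1

// The range matched exactly one character: the 'H' right before "ello".
const second = greetRe.exec("hhhhHelloooooo");
console.log(second[0], second.index); // Hello 4
```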

Now you are equipped to define a range for, say, all lowercase letters, something like this: [abcdefghijklmnopqrstuvwxyz]. But this is pretty painful, so regular expressions let you specify a range using the “-” metacharacter like this: [a-z]. Here are some examples:

var re1 = new RegExp("[abcdefghijklmnopqrstuvwxyz]ello");
var re2 = new RegExp("[a-z]ello"); // equivalent to the above
re1.test("xello world"); // returns: true
re2.test("rello world"); // returns: true

Remember that regular expressions are case-sensitive; you need to define the uppercase range separately. Digit ranges can also be defined with “-”:

var re1 = new RegExp("[A-Z]ello"); // matches any uppercase alpha
// character and then "ello"
var re2 = new RegExp("[0-9]"); // matches one of any digit
// tricky:
var re3 = new RegExp("[A-Za-z]ello"); // matches any uppercase or
// lowercase alpha character and then "ello"

Many ranges are so commonly used that there is an even easier way to represent them. Common sets of characters are represented with special sequences, like the following: \\d \\s \\w. The \\d set matches any one digit, so \\d is equivalent to [0-9]. A backslashed letter is also used to define special single characters like a new line or tab. Table 1 lists these special character sets.

Table 1: Special Character Sets

Now you can use these special characters in your patterns. Here are some examples:

var re1 = new RegExp("hello\\sworld");
var re2 = new RegExp("\\wello");
re1.test("hello world"); // returns true
re1.test("hello\tworld"); // returns true, \t is one tab in
// between the words in this string
re2.test("Hello world"); // returns true
re2.test("+ello world"); // returns false, '+' is not a
// word character
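
You can check the most common sets against their spelled-out ranges yourself. The sketch below assumes the usual JavaScript definitions (\\w is letters, digits and underscore; an uppercase letter such as \\D matches the opposite of its lowercase version); the sample characters are arbitrary.

```javascript
const samples = ["7", "q", "_", "+", " ", "\t", "\n"];

const digit = new RegExp("\\d");
const digitRange = new RegExp("[0-9]");
const word = new RegExp("\\w");
const wordRange = new RegExp("[A-Za-z0-9_]");
const space = new RegExp("\\s");
const nonDigit = new RegExp("\\D"); // uppercase letter: the opposite of \d

// Print, for each sample, whether it is a digit, a word character, or whitespace.
for (const ch of samples) {
  console.log(JSON.stringify(ch), digit.test(ch), word.test(ch), space.test(ch));
}

// The shorthand and the spelled-out range agree on every sample.
const agree = samples.every(function (ch) {
  return digit.test(ch) === digitRange.test(ch) &&
         word.test(ch) === wordRange.test(ch);
});
console.log(agree); // true
```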

Remember that unless you specify otherwise, a regular expression does not have to match the whole string. In order to match, however, some part of the string must match the whole regular expression.

Multipliers

With multipliers you no longer work only with a single character match in your regular expression. Instead, you can add the following type of specification to your patterns: “zero or more occurrences of…,” “either zero or one occurrence of…,” “between two and five occurrences of…,” “at least one or more occurrences of…,” etc.

Two common multipliers are “*” and “+”. The * symbol means “zero or more instances of” and the + means “one or more instances of.” Going back to the familiar example we could do the following:

var re1 = new RegExp("hel+o");
var re2 = new RegExp("hel*o");
re1.test("helllllllo world"); // returns true
re1.test("heo world"); // returns false
re2.test("heo world"); // returns true

You can now combine this with some of the other metacharacters to make more generic and powerful expressions:

var re1 = new RegExp("h\\w+ w\\w+"); // match two words separated by a
// space, in which the first starts with 'h' and the second with 'w';
// each word must have at least one character after the first
re1.test("howdy wally"); // returns true
re1.test("hello world"); // returns true
re1.test("won't match"); // returns false

Other notable multipliers include “?”, which means “zero or one”; “{x}”, which means you expect exactly x repetitions; and “{x,y}”, in which you specify integers x and y, and the pattern must repeat at least x times but not more than y times. In the last form, if y is left out (“{x,}”), the upper bound is open-ended: “x or more”. JavaScript does not let you leave out x, though: “{,y}” is not treated as a multiplier and is matched as literal text, so write “{0,y}” instead. For instance:

var re1 = new RegExp("h?el{2,}o w{1,2}orld"); 
// "hello world", but the 'h' is optional, there can be two or
// more 'l's and either one or two 'w's
re1.test("ello world"); // returns true
re1.test("hellllo wworld"); // returns true
re1.test("helo world"); // returns false, not enough l's
re1.test("hello wwworld"); // returns false, too many w's
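
The counted forms are easy to mix up, so here is a side-by-side sketch. The word being matched is arbitrary. The last two patterns show one thing worth checking in your own engine: as far as I know, JavaScript only treats braces as a multiplier when the first number is present, so a missing lower bound has to be written as 0.

```javascript
const exactlyThree = new RegExp("go{3}gle");  // exactly three o's
const twoToFour = new RegExp("go{2,4}gle");   // two, three or four o's
const twoOrMore = new RegExp("go{2,}gle");    // two or more o's
const upToTwo = new RegExp("go{0,2}gle");     // zero, one or two o's
const noLowerBound = new RegExp("go{,2}gle"); // NOT a multiplier, see below

console.log(exactlyThree.test("gooogle"));   // true
console.log(exactlyThree.test("google"));    // false, only two o's
console.log(twoToFour.test("gooooogle"));    // false, five o's is too many
console.log(twoOrMore.test("gooooogle"));    // true
console.log(upToTwo.test("ggle"));           // true, zero o's is allowed

// Without the first number the braces are matched as literal characters.
console.log(noLowerBound.test("gogle"));     // false
console.log(noLowerBound.test("go{,2}gle")); // true
```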

Grouping

Now that you understand what multipliers are you may be wondering how to apply them to groups of characters. To do this you need the “( )” grouping metacharacters. By putting a portion of a regular expression inside parentheses you can apply multipliers to that entire portion.

The following examples illustrate the use of grouping. What we’re trying to do is match the laughing sounds “hohoho,” “hehehe,” or “hahaha.” Note the difference in the regular expressions and results:

var re1 = new RegExp("h[aeo]{3}");
re1.test("hahaha"); // returns false
re1.test("haaa"); // returns true
// instead we should group the 'h' and the following vowel
var re2 = new RegExp("(h[aeo]){3}");
re2.test("hahaha"); // returns true
re2.test("haaa"); // returns false

This isn’t exactly what we want, however, because the following also matches:

re2.test("hahohe"); // returns true

What we want is some way to say “make sure this part of the pattern looks like this other part.” Thankfully, we can use the “memory” capabilities of regular expressions. Memory is also tied to the “( )” metacharacters; when you use them you create a memory holder in the regular expression that you can reference later using “\\” followed by a number. This number is the position of the grouping’s “(”, counting from left to right. That is, the first grouping is referenced by \\1 and, if there is a second one, it will be referenced by \\2. For example:

var re1 = new RegExp("(h[aeo])\\1{2}");
// here the \1 refers to whatever was found in the first grouping
// and the {2} multiplies the \1 metacharacter.
// Now:
re1.test("hohoho"); // returns true
re1.test("hahohe"); // returns false
var re2 = new RegExp("he(\\w)\\1o world(\\W)\\2");
// note the use of the \w and \W as well as memory
// and the need to double the backslash '\\'
re2.test("hello world!!"); // returns true
re2.test("hemmo world??"); // returns true

In the second example we used grouping to allow memory usage, although the created group only had one character in it. Memory has other powerful implications that I will go into more deeply when I discuss the String.prototype.match() method.
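
In the meantime, one quick way to look at what each grouping remembered is exec(), which returns the whole match followed by the contents of each group. This is just a sketch using the two patterns from above; the test strings and variable names are mine.

```javascript
const laughRe = new RegExp("(h[aeo])\\1{2}");
const laugh = laughRe.exec("santa says hohoho");
console.log(laugh[0]);    // hohoho (the whole match)
console.log(laugh[1]);    // ho (what the first grouping remembered)
console.log(laugh.index); // 11 (where the match starts)

const worldRe = new RegExp("he(\\w)\\1o world(\\W)\\2");
const world = worldRe.exec("hello world!!");
console.log(world[1]);    // l (first grouping)
console.log(world[2]);    // ! (second grouping)
```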

Anchors

A very useful bit of syntax, anchors are characteristics of a string that are not actual characters but, instead, occur between characters. There are two important types of anchors. The first are string boundary anchors: The beginning of the string matches “^” and the end of the string matches “$”:

var re1 = new RegExp("he");
var re2 = new RegExp("^he");
re1.test("the quick brown fox…"); // matches in the word "the"
re2.test("the quick brown fox…"); // does not match because the "he"
// is not at the beginning of the string
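
The “$” anchor works the same way at the other end, and using both together is how you require a pattern to cover the whole string. A small sketch, with patterns borrowed from earlier sections:

```javascript
const endsWithFox = new RegExp("fox$");
console.log(endsWithFox.test("the quick brown fox")); // true
console.log(endsWithFox.test("the fox ran"));         // false, "fox" is not at the end

// With both anchors, nothing is allowed before or after the pattern.
const wholeHello = new RegExp("^[Hh]ello$");
console.log(wholeHello.test("hello"));       // true
console.log(wholeHello.test("hello world")); // false
console.log(wholeHello.test("Hhello"));      // false, no skipping over the first 'H'
```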

The second type of anchors are word boundary anchors, written “\\b”. What I mean by word boundary is the place between a \\w and a \\W character, or between a \\w character and a string boundary. This is not the same as a \\s character, which matches whitespace. Anchors do not match characters; they match what occurs between characters, such as the ending of a word or the beginning of a line (see Table 2). Similarly, \\B matches any place that is not a word boundary. Here are some examples:

var re3 = new RegExp("cart\\b");
// Matches the string "cart" followed by a word boundary. Since 't' is
// a word character, what comes next must be a non-word character or
// the end of the string.
var re4 = new RegExp("cart\\B");
// Searches for "cart" that is not followed by a word boundary,
// so the next character has to be a word character too.
re3.test("cart before the horse"); // returns true
re3.test("cartwheel before the horse"); // returns false
re4.test("cart before the horse"); // returns false
re4.test("cartwheel before the horse"); // returns true

Table 2: Anchors
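
Putting a \\b on both sides is a common way to match a whole word only. The sketch below uses made-up test strings.

```javascript
const cartWord = new RegExp("\\bcart\\b");

console.log(cartWord.test("the cart before the horse")); // true
console.log(cartWord.test("cart"));       // true, string boundaries count too
console.log(cartWord.test("cartwheel"));  // false, 'w' follows the 't'
console.log(cartWord.test("gocart"));     // false, 'o' comes before the 'c'
console.log(cartWord.test("cart-wheel")); // true, '-' is not a word character
```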

Wrap Up

Besides the regular expression syntax I’ve covered so far, you can use the alternation “|” which allows you to use Boolean OR in your patterns. There are additional grouping methods, such as “(?: )”, and flags that you can set. Table 3 recaps what I’ve covered.

Table 3: RegEx Syntax Review
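
Since alternation, “(?: )” and flags only got a passing mention, here is a quick sketch of each. It assumes the standard JavaScript behavior: “(?: )” groups without creating a memory holder, and the "i" flag, passed as the second argument to RegExp, turns off case sensitivity.

```javascript
// Alternation: each repetition may be "ha", "ho" or "he".
const anyLaugh = new RegExp("^(?:ha|ho|he){3}$");
console.log(anyLaugh.test("hahaha")); // true
console.log(anyLaugh.test("hahohe")); // true
console.log(anyLaugh.test("hihihi")); // false

// "( )" remembers what it matched; "(?: )" only groups.
const remembered = new RegExp("(ha|ho)!").exec("ho!");
const grouped = new RegExp("(?:ha|ho)!").exec("ho!");
console.log(remembered.length); // 2, the whole match plus one memory holder
console.log(grouped.length);    // 1, the whole match only

// The "i" flag makes the pattern ignore case.
const anyCase = new RegExp("hello", "i");
console.log(anyCase.test("Hello World")); // true
```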
