Author: Shea Stewart


RegEx Roundup

Jamie Zawinski, then a developer at Netscape, once famously opined:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

A nice, pithy quote to be sure, but there's truth in it, and if anything it understates the problem. Many developers have fallen hard on the sword of regular expressions. Part of the trouble is a fact that is not widely appreciated: several regular expression engines are in widespread use, and they differ in the features they offer and sometimes use conflicting syntax for the same feature. Once you understand this playing field, you have a fighting chance of keeping your sanity long enough to solve your original problem.

A handy site to learn about regular expressions is Regular-Expressions.info. The things I will describe here can be studied in further detail there.

BRE

Regular expressions became an everyday tool through Unix text processing utilities such as sed, awk and grep. The basic syntax used by these tools is codified in the POSIX (Portable Operating System Interface) standard as the BRE (Basic Regular Expression) specification. grep and sed use BRE by default; awk is the exception and uses the extended syntax described in the next section. In BRE, parentheses and braces are ordinary characters unless preceded by a backslash, so \( \) group and \{ \} quantify. BRE also has no \d shorthand for digits, so you must write [0-9] or [[:digit:]] instead.
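A minimal sketch of how BRE escaping behaves in grep's default mode. The sample lines and function names are made up for illustration; only plain grep is assumed.

```shell
# Sample input; the values are made up for illustration.
sample='ab
aab
abab
a{2}b'

# In BRE an unescaped brace is an ordinary character, so this matches only
# the line that contains the literal text "a{2}b".
bre_literal_braces() {
  printf '%s\n' "$sample" | grep 'a{2}b'
}

# Escaping the braces turns them into an interval: exactly two "a"s, then "b".
bre_interval() {
  printf '%s\n' "$sample" | grep '^a\{2\}b$'
}

# \( \) form a group in BRE, and \1 refers back to whatever the group matched.
bre_backreference() {
  printf '%s\n' "$sample" | grep '^\(ab\)\1$'
}

bre_literal_braces   # prints: a{2}b
bre_interval         # prints: aab
bre_backreference    # prints: abab
```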

ERE

POSIX also defines ERE (Extended Regular Expression), which handles quantifiers, alternation and grouping differently and reverses the escaping rule for the characters involved: where BRE needs a backslash to make parentheses and braces special, ERE treats them as special by default and a backslash makes them literal. ERE also adds +, ? and | as operators. ERE is used by awk, by egrep, and by grep when the -E option is supplied.
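The reversal is easiest to see side by side. The following is a small sketch using made-up sample lines and hypothetical function names; it assumes nothing beyond grep and grep -E.

```shell
# Sample input; the values are made up for illustration.
sample='cat
dog
catdog
(cat)'

# ERE: parentheses group and | alternates, with no backslashes needed.
ere_alternation() {
  printf '%s\n' "$sample" | grep -E '^(cat|dog)$'
}

# ERE: a backslash makes the parentheses literal.
ere_literal_parens() {
  printf '%s\n' "$sample" | grep -E '^\(cat\)$'
}

# BRE: the same parentheses are already literal without a backslash.
bre_literal_parens() {
  printf '%s\n' "$sample" | grep '^(cat)$'
}

ere_alternation      # prints: cat, then dog
ere_literal_parens   # prints: (cat)
bre_literal_parens   # prints: (cat)
```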

PCRE

Perl added many features to its regular expression engine. The PCRE (Perl Compatible Regular Expressions) library, first released in 1997, with the current PCRE2 version active since 2015, is the best-known standalone implementation of Perl-style syntax. It offers one of the largest feature sets and is widely used where advanced regular expression capabilities are needed. Some languages embed PCRE directly, and many others implement a similar Perl-style dialect of their own, so the details vary by engine. You can even use PCRE with grep through the -P option, provided you are using GNU grep built with PCRE support (the grep shipped with macOS does not have it). Treat yourself to a trip to the excellent RegExr site to interactively learn the full power of PCRE.
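As a rough illustration of what the Perl-style syntax adds, the sketch below uses the \d digit shorthand and a lookahead, neither of which exists in BRE or ERE. The sample lines and function names are hypothetical, and because grep -P is not available everywhere, the script checks for it first.

```shell
# Succeeds only if this grep accepts -P. BSD grep, and GNU grep built
# without PCRE support, will fail this check.
has_pcre_grep() {
  printf 'x\n' | grep -P 'x' >/dev/null 2>&1
}

# \d+ matches a run of digits; (?= USD) is a lookahead that requires
# " USD" to follow without making it part of the match.
pcre_demo() {
  printf '%s\n' 'price: 42 USD' 'price: 42 EUR' 'price: n/a' |
    grep -P '\d+(?= USD)'
}

if has_pcre_grep; then
  pcre_demo   # prints: price: 42 USD
else
  echo 'grep -P is not available here' >&2
fi
```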

Example

The following commands use regular expressions to find North American phone numbers in a list. The awk and grep -E regexes are identical because both tools use ERE, the BRE version swaps which characters are escaped, and only the PCRE version can use the \d shorthand. POSIX BRE and ERE do not define \d, so those versions spell out digits with the bracket expression [0-9].

> cat list
(555)789-0123
abcde
1234567890

# awk - ERE: no \d, so digits are written as [0-9]
> cat list | awk '/^\([0-9]{3}\)[0-9]{3}-[0-9]{4}$/'
(555)789-0123

# BRE: quantifier braces must be escaped but literal parentheses are not
> cat list | grep '^([0-9]\{3\})[0-9]\{3\}-[0-9]\{4\}$'
(555)789-0123

# ERE: quantifier braces are not escaped but literal parentheses are
> cat list | grep -E '^\([0-9]{3}\)[0-9]{3}-[0-9]{4}$'
(555)789-0123

# PCRE: same escaping as ERE, plus the \d shorthand for digits
> cat list | grep -P '^\(\d{3}\)\d{3}-\d{4}$'
(555)789-0123

Know the landscape

BRE, ERE and PCRE are the main regular expression dialects in wide use, but specific languages and tools use others. The upshot is that, depending on the tool, you may need to know a very particular dialect to get the job done. Remi Rampin provides a great Regex Cheat Sheet that explains the syntax and feature differences between the standards and maps many popular tools to the syntax they require, but you should still check the documentation of the engine you are using.
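When the documentation is not at hand, one low-tech check is to probe the tool with a pattern whose meaning differs between dialects. The helper names below (matches_bre, matches_ere, probe) are hypothetical, and the sketch only covers grep's BRE and ERE modes.

```shell
# Succeeds if PATTERN ($1) matches TEXT ($2) when read as a BRE (grep's default).
matches_bre() { printf '%s\n' "$2" | grep -q -- "$1"; }

# Succeeds if PATTERN ($1) matches TEXT ($2) when read as an ERE (grep -E).
matches_ere() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

# Print one line summarising how each dialect treats PATTERN against TEXT.
probe() {
  bre=no
  ere=no
  if matches_bre "$1" "$2"; then bre=yes; fi
  if matches_ere "$1" "$2"; then ere=yes; fi
  printf '%s on %s: BRE=%s ERE=%s\n' "$1" "$2" "$bre" "$ere"
}

probe 'a+b' 'aab'     # + is a quantifier in ERE but a literal in BRE
probe 'a{2}b' 'aab'   # braces are an interval in ERE but literal in BRE
```

Both probes report BRE=no and ERE=yes, which confirms that the unescaped operators are only special in the extended dialect.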

Choose your battles

If you avoid complex sed and awk expressions, and reach for a higher-level language or grep -P when you need a complex regex, you can train yourself to always be in a position to take advantage of the full PCRE-style feature set rather than detouring into the lesser capabilities of BRE and ERE. Sometimes, however, the most direct tool for the job means dusting off some exotic syntax.

Saddle up!

Forewarned is forearmed, so now that you’ve been warned, you should be prepared to roll up your sleeves and get those forearms into wrangling some RegExes.

Photos from Unsplash

Roundup Photo by Mahir Uysal

Saddle Photo by Roger van de Kimmenade
