Being able to whip up a regular expression in the ordinary course of data wrangling is one of those skills that separates the computing neophyte from the skilled. Years ago, documentation on the subject was scanty at best, relegated to man(1) pages and the occasional ‘NIX book footnote or appendix. With the widespread adoption of regular expressions in programming languages and the rise of the web, there’s been an explosion of poorly written web tutorials that purport to teach you regular expressions but seldom do more than give your shift and number keys a workout.
Ben Forta’s Learning Regular Expressions, published by Addison-Wesley, aims to change this, and for the most part does an excellent job. Situated somewhere between the first page of Google search results for “regular expression” and Jeffrey Friedl’s Mastering Regular Expressions, it provides a step-by-step tutorial on using regular expressions to match text. It is illustrated with common problems including matching phone numbers, postal codes (from three different countries, a nice addition to the tried-and-true example), email addresses, URLs, and snippets of HTML.
The book’s chapters are structured as lessons, each building on what you’ve learned in previous chapters. A careful study of the material and working through the examples will bring you not just a basic understanding of word and pattern matching, but more advanced uses of regular expressions including position matching, backreferences, look-ahead and look-behind, and even conditional embedding. I’ve been using regular expressions (poorly, I admit, for the most part) for thirty years, and I had several “Oh, that’s how you can do that” moments when reading, especially in the last few chapters.
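To give a flavor of those later topics, here is a minimal Python sketch. It is not taken from the book; the sample sentence and the variable names are made up for illustration, and Python's `re` module stands in for whatever regular expression environment you happen to use.

```python
import re

# Made-up sample text, chosen so each pattern has something to find.
text = "The the price is $25, not 30 dollars; see see page 12."

# Backreference: \1 must repeat whatever group 1 matched,
# which catches accidentally doubled words.
doubled = re.findall(r"\b(\w+)\s+\1\b", text, flags=re.IGNORECASE)

# Look-behind: digits preceded by a dollar sign.
# The sign is required but is not part of the match.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Look-ahead: digits followed by " dollars".
# The trailing word is required but is not part of the match.
before_word = re.findall(r"\d+(?= dollars)", text)

print(doubled)
print(after_dollar)
print(before_word)
```

The look-around forms are worth the extra punctuation when you want to anchor a match to its context without pulling that context into the result.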
Each chapter provides a set of increasingly sophisticated examples. I hesitate to call this a cookbook, because a competitor I will not name has largely co-opted that format, and truthfully, it’s more didactic than culinary. The presentation works, however; for each section, you’re presented with a problem, some sample text, a regular expression that may or may not solve the problem, and then a discussion of the regular expression and why it did or did not work. Including regular expressions that do not satisfy your goal permits Forta to link one section to another, moving from your expectation of how things might work to how they actually work. It turns out to be an effective way to present the material, and it’s easy to follow along using a tool such as grep.
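The problem, attempt, and refinement rhythm described above can be sketched in a few lines of Python. This is an illustration of the teaching format only, not an example from the book; the sample sentence and the names `naive` and `refined` are hypothetical.

```python
import re

# Problem: find the standalone word "cat" in some sample text.
sample = "The cat sat near the scattered catalog."

# First attempt: the bare literal also matches inside
# "scattered" and "catalog", so it overcounts.
naive = re.findall(r"cat", sample)

# Refinement: \b word boundaries restrict the match
# to the standalone word.
refined = re.findall(r"\bcat\b", sample)

print(len(naive), len(refined))
```

Seeing the first pattern overreach is what motivates the boundary anchors in the second, which is the same kind of link between sections the review credits.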
The book closes with an appendix on the differences between several popular environments’ implementations of regular expressions. This is helpful, because almost every reader will come to the book with a slightly different expectation of where they will be using what they learn.
I found Forta’s step-by-step presentation refreshing without being condescending. I would have preferred perhaps a few additional examples on some of the more advanced topics, like look-ahead and look-behind, and even subexpressions. However, he does the job and does it quickly; a motivated reader can go from knowing nothing about the subject to being proficient in just a few evenings, and I found it a quick read.