This week I read the “Regular Expression Cookbook” by Jan Goyvaerts and Steven Levithan. Like the rest of the O’Reilly Cookbook series, it contains ‘recipes’ for using regular expressions for a variety of tasks, divided into chapters on broad topics such as ‘Validation and Formatting’ and ‘Numbers’. Because of its structure, the book is more useful as a reference than as something you would read cover to cover. But that doesn’t mean such a read is a waste of time.
Flavors and Languages
The Regular Expressions Cookbook was published in 2009 and so it is relatively up-to-date on modern flavors of regular expressions, specifically those found in:
- Perl, Versions 5.6, 5.8, and 5.10
- Microsoft .NET Framework 3.5
- Java, Versions 4 through 6
- Python, Versions 2.4 and 2.5
- Ruby, Versions 1.8 and 1.9
Web developers will notice the omission of PHP. However, PHP’s regex support is built directly on PCRE, so the book’s PCRE coverage applies; all you need to complement it is to learn PHP’s PCRE interface.
If you know nothing about regular expressions, then you will find a great introduction in the book’s initial chapters. The second chapter in particular breaks down the syntax of regular expressions in terms of their behavior and features, providing useful demonstrations of each. Even if you do have experience with regular expressions, I still strongly recommend reading through that chapter as a refresher (e.g. I had forgotten about conditional expressions).
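As a refresher on that last point: a conditional expression lets a pattern branch on whether an earlier capturing group participated in the match. A minimal sketch in Python’s `re` module (the pattern here is my own illustration, not one of the book’s recipes):

```python
import re

# (?(1)...) tests whether capturing group 1 took part in the match.
# Here it makes the closing parenthesis required exactly when the
# opening one was present, so "(42)" and "42" match but "(42" does not.
number = re.compile(r'(\()?\d+(?(1)\))')

print(bool(number.fullmatch('42')))    # True
print(bool(number.fullmatch('(42)')))  # True
print(bool(number.fullmatch('(42')))   # False
print(bool(number.fullmatch('42)')))   # False
```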
The Cookbook is a great reference in general. However, I have two problems with it.
First, the recipe for validating URLs only matches file://, FTP, and HTTP URLs. Admittedly this is a nitpicking complaint, and the recipe itself claims only to match “almost any URL,” but I have a pedantic pet peeve about anything that spreads the idea that URLs are only a tiny subset of addresses for locations on the Web. The same chapter later goes on to discuss matching URNs and generic URIs, which I appreciated. But I wish an explanation of generic URIs and their relationship to URLs appeared at the beginning of the chapter instead of being little more than a side note to a particular recipe.
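To make the distinction concrete, here is a sketch contrasting a recipe-style pattern restricted to a few schemes with a generic check based on RFC 3986’s scheme grammar. Both patterns are my own simplified illustrations, not the book’s actual recipes:

```python
import re

# A recipe-style pattern limited to a handful of schemes, in the
# spirit of the book's URL recipe (heavily simplified here).
limited = re.compile(r'^(?:https?|ftp|file)://\S+$', re.IGNORECASE)

# A generic-URI check following RFC 3986's scheme grammar:
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
generic = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:\S+$')

print(bool(limited.match('http://example.com/')))      # True
print(bool(limited.match('mailto:user@example.com')))  # False
print(bool(generic.match('mailto:user@example.com')))  # True
```

The second pattern accepts mailto:, urn:, tel:, and any other well-formed scheme, which is exactly the breadth the limited pattern gives up.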
Second, the final chapter, ‘Markup and Data Interchange’, focuses on using regular expressions to deal with HTML and XML (among other formats). This is often a terrible idea. For simple, trivial tasks it is fine to use regular expressions on such markup, e.g. in a Perl script that performs some simple data scraping from a site. But anything remotely non-trivial is better served by an actual HTML or XML parser. Personally, if I have to write more than two regular expressions for dealing with markup data, I seriously begin to consider using a parser instead. The authors briefly mention this at the start of the chapter:
Although it’s not always apparent on the surface, some of these formats can be surprisingly complex to process and manipulate accurately, at least using regular expressions. It’s usually best to use dedicated parsers and APIs instead of regular expressions when performing many of the tasks in this chapter, especially if accuracy is critical (e.g., if your processing might have security implications). Nevertheless, these recipes show useful techniques that can be used with many quick processing tasks.
It is sage advice. Nonetheless, I do not feel the book stresses the point strongly enough. In my experience, using regular expressions to deal with HTML and XML is a vortex that will quickly pull you into a realm of madness.
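For comparison, a parser-based approach to a task people often attempt with regexes, extracting link targets, takes only a few lines with Python’s standard-library `html.parser` and copes with variations (attribute order, quoting style, case) that quickly break a hand-rolled pattern:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags -- a task people often
    reach for a regular expression to do."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

doc = '<p>See <a href="http://example.com">this</a> and <A HREF=\'/other\'>that</A>.</p>'
parser = LinkExtractor()
parser.feed(doc)
print(parser.links)  # ['http://example.com', '/other']
```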
All that said, the Regular Expressions Cookbook is a terrific reference. It covers many common tasks which are well-suited to regular expressions, and it explains them clearly and in a variety of popular programming languages. It is entirely suitable for people who know nothing about regular expressions and for people who have used them for decades.
My next Book of the Week is not going to be a book. Instead I am going to re-read as many of my old issues of 2600 Magazine as I can. It’s a publication that I’ve loved and enjoyed for years, so revisiting all of my old issues will be a lot of fun.