a macwright.org project Regex Party

Why you should care

Knowing how to use regular expressions will let you solve a lot of problems that vex other humans. They're one of the fastest and simplest ways to match patterns in text, and are endlessly useful for searching, editing, and analyzing text in all sorts of places. They're also one of those skills you can take with you: the same or similar syntax will work for a regular expression regardless of what programming language or editor you're using.

What are regular expressions?

Regular expressions are also called regexps and regexes. Regular expressions are tiny programs that look for patterns in text. These programs are defined in their own specific little language that's tailored just to this task.

Programs, you say?

Yep. Regular expressions are kind of like nested stories. When you use a regular expression within a language like Python or JavaScript, it has all the properties of a programming language within a programming language. Properties like:

- Syntax
- Compilation
- Performance
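To make "syntax" and "compilation" a little more concrete, here is a minimal sketch in Python using the standard `re` module. The helper name `is_valid_pattern` is made up for this illustration. It asks the regex engine to compile a pattern and reports whether the pattern's own syntax was accepted.

```python
import re


def is_valid_pattern(source: str) -> bool:
    """Return True if `source` compiles as a regular expression.

    Hypothetical helper, for illustration only.
    """
    try:
        re.compile(source)  # the pattern is parsed and compiled here
    except re.error:        # raised when the pattern itself has a syntax error
        return False
    return True


print(is_valid_pattern(r"\d+"))   # a well-formed pattern
print(is_valid_pattern(r"(\d+"))  # unbalanced parenthesis: a regex syntax error
```

The point is only that a pattern can be "wrong" independently of the Python code around it, the same way a program can.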

What do regular expressions do?

In short, they're elaborate ways to make "fill-in-the-blank" puzzles. They let you define the context for a pattern, as well as the kinds of specific holes in that pattern you're looking for.

Let's use terrible internet memes as an example. Okay, "The Most Interesting Man in The World". The original line:

> I don't always drink beer, but when I do, I prefer Dos Equis

Some internet meme version:

> I don't always use internet explorer, but when I do, it's usually to download a better browser

Humans are great at pattern recognition - regular expressions are a distant second. Reading these two quotes, you probably immediately noticed that the form is:

> I don't always ___, but when I do, ___

That's the pattern, expressed with blanks. And the bits that fit into this pattern are, for the original, "drink beer", and "I prefer Dos Equis". For the meme adaptation, "use internet explorer", and "it's usually to download a better browser".
I don't always .* but when I do, .*
Here's that pattern, but with regular expressions instead of blanks. Pretty much the same thing, except instead of ___, there's .*.
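If you'd like to try the pattern outside the browser, here is a small sketch in Python, assuming the standard `re` module. The names `MEME` and `fits_meme` and the third sample line are invented for this example.

```python
import re

# The pattern from the text: each `.*` plays the role of a blank.
MEME = re.compile(r"I don't always .* but when I do, .*")

quotes = [
    "I don't always drink beer, but when I do, I prefer Dos Equis",
    "I don't always use internet explorer, but when I do, "
    "it's usually to download a better browser",
    "Stay thirsty, my friends",  # does not follow the form at all
]


def fits_meme(text: str) -> bool:
    """True when the whole line fits the fill-in-the-blank pattern."""
    return MEME.fullmatch(text) is not None


for quote in quotes:
    print(fits_meme(quote), "-", quote)
```

One detail worth noticing: the pattern has no comma before "but", so the comma in "drink beer," ends up inside the first blank.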

Getting specific

So in the previous example, we used .* as a 'blank'. In regular expression speak, that's the most broad kind of blank: anything can fit there. It could be a number, letter, word, or even just nothing - a sentence with both blanks left empty still passes the test. It probably shouldn't, right?

Well, that's why regular expressions have a lot of different ways to specify what you're looking for. . - a period character - is the most lenient, serving as a stand-in for everything but a linebreak. You can be more specific by looking for \w - any letter, digit, or underscore, \d - any digit, or many other ways of specifying the match you want.

For instance, let's say that you're looking for someone's age from an input that looks like

> My age is 15

So you want to grab that number - 15 - from the text that surrounds it. If someone dodges the question and instead writes

> My age is none of your business

You probably don't want to save 'none of your business' to your database. So in this case, instead of using ., you'll use \d - the shorthand for 'any digit, 0-9':
My age is \d*
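Here is a rough Python sketch of that check, assuming the standard `re` module. The names `AGE` and `looks_like_age` are hypothetical. It uses `fullmatch`, which requires the entire input to fit the pattern, so that trailing text like "none of your business" is rejected.

```python
import re

# The pattern from the text: the blank may only contain digits.
AGE = re.compile(r"My age is \d*")


def looks_like_age(text: str) -> bool:
    """True when the whole input fits the pattern (illustrative helper)."""
    return AGE.fullmatch(text) is not None


print(looks_like_age("My age is 15"))
print(looks_like_age("My age is none of your business"))
print(looks_like_age("My age is "))  # zero digits still fits, because of the *
```

The last line is a hint that `*` is doing something separate from `\d`, which the "Dial * for however many" section gets into.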

Digits?

We keep saying 'digits' - why? Two pretty important reasons:

1. Regular expressions work with strings, and match strings. The input of a regular expression is always a string, and the output, if any, is also always a string. Programming languages tend to have lots of types: a value can be a number, true or false, an array, or something else. Regular expressions work with strings. You might immediately take the output of a regular expression and turn it into a number or a boolean or another kind of value, but that's a separate step. So a regular expression will always give you the string "15" - digits "1" and "5", not the number 15.
2. Numbers mean different things to people than computers. Like we said before, people are really good at pattern matching and fuzzy problems. Most adults will know that the numbers 1000, 1000.0, and 1,000 mean the same thing, but computers will have lots of trouble, especially with the last one. So when we say we're matching digits, we mean digits - not decimal points or commas. You can match commas and decimal points too, to be complete, and we'll discuss that in a bit. But keep in mind: computers are brutal, simple minded machines.
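Both points can be seen in a few lines of Python. This is only a sketch, assuming the standard `re` module and the built-in `int`; the helper `to_int_or_none` is invented for the example. The regex hands back a string, and turning it into a number is a separate step that can fail on forms a person would read without trouble.

```python
import re
from typing import Optional

# Two digits in a row; the match comes back as a string, not a number.
match = re.search(r"\d\d", "My age is 15")
digits = match.group(0)  # the string "15"
age = int(digits)        # the separate conversion step gives the number 15

print(repr(digits), repr(age))


def to_int_or_none(text: str) -> Optional[int]:
    """Convert to an integer, or return None if int() rejects the text."""
    try:
        return int(text)
    except ValueError:
        return None


for written in ["1000", "1000.0", "1,000"]:
    print(written, "->", to_int_or_none(written))
```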

Dial * for however many

Okay, so we've discussed ., which means "anything can go in this blank", and also \d, which lets any digit go in the blank. What about that * symbol we keep seeing? Well, the . or \d declares what kind of thing can go in the blank, and the * tells the regular expression how many to look for. It's what we call a quantifier: it declares the quantity of characters that we want. As with ., we started with the broadest option. * means anywhere from zero to lots and lots. It's the ¯\_(ツ)_/¯ of quantifiers. Sometimes that's exactly what you want, but a lot of times it isn't. Some of the other popular choices:
+ 1 or more
? 0 or 1
{5} exactly 5 (or whatever number you put there)
{6,} 6 or more (or whatever number you put there)
{2,7} between 2 and 7 (you get the drill)
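To see each quantifier accept or reject an input, here is a small Python sketch using the standard `re` module. The helper `fits` and the sample inputs are made up for illustration; each quantifier is attached to `\d`, and `fullmatch` checks the whole input.

```python
import re


def fits(pattern: str, text: str) -> bool:
    """True when the entire text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


checks = [
    (r"\d*", ""),            # zero digits is fine for *
    (r"\d+", ""),            # but + needs at least one
    (r"\d?", "7"),           # 0 or 1
    (r"\d?", "77"),          # two is too many for ?
    (r"\d{5}", "90210"),     # exactly 5
    (r"\d{5}", "9021"),      # only 4
    (r"\d{6,}", "1234567"),  # 6 or more
    (r"\d{2,7}", "1"),       # fewer than 2
    (r"\d{2,7}", "1234567"), # 7 is still within 2 to 7
]

for pattern, text in checks:
    print(f"{pattern!r:12} {text!r:12} {fits(pattern, text)}")
```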
So, let's put this into action. Phone numbers are a pretty good example - you know that US phone numbers have three digits, three more digits, and then four digits. So, instead of taking any number of digits with *, we'll use the curly-brackets option to specify exactly how many we want.
\d{3}-\d{3}-\d{4}
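A quick way to try this pattern in Python, assuming the standard `re` module. The names `PHONE`, `is_us_phone`, and `find_phones`, along with the sample numbers, are invented for this sketch.

```python
import re
from typing import List

# Three digits, a dash, three digits, a dash, four digits.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def is_us_phone(text: str) -> bool:
    """True when the whole input is one dash-separated phone number."""
    return PHONE.fullmatch(text) is not None


def find_phones(text: str) -> List[str]:
    """Return every stretch of the text that fits the pattern."""
    return PHONE.findall(text)


print(is_us_phone("555-867-5309"))
print(is_us_phone("55-867-5309"))  # first group is one digit short
print(find_phones("Call 555-867-5309 or 555-123-4567 today"))
```

This only covers the dash-separated form from the text; numbers written with parentheses, spaces, or a country code would need a different pattern.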

Matching and testing

Regular expressions can be used to test whether text fits a pattern, but they can do even more: they can grab the things that fit in blanks and give them to you. When we were testing that phrase:

> My age is 15

We don't just want to know that someone is writing their age, we want the value of that age: we want "15". Well, it's time for more punctuation: parentheses.
My age is (\d*)
Putting parentheses around a part of a regular expression tells the program that you'd like to keep those values. That's what we call a capturing group.
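In Python, a capturing group can be read back from the match object. The sketch below assumes the standard `re` module; `AGE` and `extract_age` are hypothetical names for this example.

```python
import re
from typing import Optional

# The parentheses mark the part of the match we want handed back.
AGE = re.compile(r"My age is (\d*)")


def extract_age(text: str) -> Optional[str]:
    """Return the captured digits as a string, or None if nothing matched."""
    match = AGE.search(text)
    if match is None:
        return None
    return match.group(1)  # group 1 is the first pair of parentheses


print(repr(extract_age("My age is 15")))
print(repr(extract_age("My age is none of your business")))
print(repr(extract_age("I'd rather not say")))
```

Note that the dodge "My age is none of your business" still matches and captures an empty string, since `\d*` is happy with zero digits; swapping `*` for `+` would make that case return None instead.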

Special letters and turning them off

We've discussed a lot of magic characters: . means 'anything', \d means 'any digit', and () means 'capture everything between these parentheses'. These are kind of like codewords, and have the same problem as codewords. Now that these characters have special meaning, how do you refer to their non-special meanings? If the codeword is 'banana', how do you just ask for a normal non-codeword banana? That's what we call escaping! Escaping lets you say, no, I don't want a stand-in for any character, I want a normal period, like the kind at the end of this sentence. You can escape any special character by putting a backslash before it, like this:
My age is \(\\d\*\)
So, no, I don't want the special meanings of (, \d, *, or ) - just those actual verbatim characters, please.

That's kind of a dumb example. Here's a better one. Let's say you're looking for a number like 0.50 - a decimal number. You might start with this:
\d+.\d+
But that . has its magical meaning! So instead of matching only decimal numbers, it'll also accept input like 0a2, which you definitely don't want. So, put a \ before that . and now, instead of looking for 'anything', you're looking for a period again.
\d+\.\d+
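The difference between the two patterns is easy to check in Python. This is a sketch assuming the standard `re` module; `fits`, `loose`, `strict`, and `literal` are names made up for the example. It also uses `re.escape`, which adds the backslashes for you when you want to match some text verbatim.

```python
import re


def fits(pattern: str, text: str) -> bool:
    """True when the entire text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


loose = r"\d+.\d+"    # the . still means "any character"
strict = r"\d+\.\d+"  # the \. means a real period

for text in ["0.50", "0a2"]:
    print(text, fits(loose, text), fits(strict, text))

# Escaping by hand, as in the text, versus letting re.escape do it.
literal = r"My age is (\d*)"  # the verbatim characters we want to find
print(fits(r"My age is \(\\d\*\)", literal))
print(fits(re.escape(literal), literal))
print(fits(re.escape(literal), "My age is 15"))  # no longer a digit pattern
```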

Why are regular expressions special?

Usually in programming we try to use as few languages as possible. Whether you're writing JavaScript or Python, most functionality will be in that language, not some weird special sub-language like regular expressions. In JavaScript, there are only a handful of similar language-in-language constructs - JSON parsing being one of them.

You can see why regular expressions are necessary by trying to implement them from scratch. As it turns out, finding patterns in text is a pretty difficult task. To do it quickly and efficiently, you need to keep as little data in memory as possible and make sure that every new character has a low, fixed cost. Regular expressions do just that: finding patterns, character-by-character, efficiently. Conceptually, they can be thought of as deterministic finite automata - tiny theoretical computers that make decisions piece-by-piece, by jumping between different states.

Writing regular expressions in their own little language is kind of restrictive - you can't drop 'down' to JavaScript in the expression - but it's restrictive for a reason, so that the vast majority of expressions run incredibly quickly and efficiently.
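To give a feel for the "jumping between states" idea, here is a hand-written state machine in Python that accepts the same kind of input as the decimal pattern \d+\.\d+ from earlier. It is only a conceptual sketch: the state names, the `TRANSITIONS` table, and the `accepts` function are all invented for this illustration, it only knows the ASCII digits 0-9, and it is not how any particular regex engine is actually built.

```python
# States: "start" -> "int" (digits before the period)
#         "int" -> "dot" (the period) -> "frac" (digits after it)
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"frac"}  # we must end having seen at least one digit after the period


def classify(ch: str) -> str:
    """Sort a single character into the kinds the machine cares about."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


def accepts(text: str) -> bool:
    """Walk the text one character at a time, one table lookup per character."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:  # no arrow for this character: reject
            return False
    return state in ACCEPTING


for sample in ["0.50", "3.14", "0a2", "12.", ".5"]:
    print(sample, accepts(sample))
```

Each character costs one dictionary lookup and the only thing remembered is the current state, which is the "low, fixed cost" and "as little data in memory as possible" described above.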

How do I tinker with regular expressions?

- regex101 is awesome