Regular expressions
Regular expressions (also known as regex) are powerful tools for processing strings (text, numbers and special characters). They allow you to match, capture and even modify the fragments of a string that fit a pattern. They’re among the most useful tools in programming, data processing and data filtering, and they’re a key part of advanced text processing tools like sed or AWK.
The concept of regex was invented in the 1950s by Stephen Cole Kleene. Since then, multiple regular expression engines have been developed, but luckily most of them follow a relatively similar set of rules, which gives a decent level of interoperability. Today the two most popular regex standards are POSIX Basic Regular Expressions (BRE) and PCRE (Perl Compatible Regular Expressions). Of these two, PCRE is the newer one, and it was designed to be more powerful and flexible than BRE. Thankfully PCRE adopts many of the rules and conventions set by BRE, so most of the time (depending on the complexity of your patterns) you’ll be able to just write the patterns without having to worry about the underlying system. In this guide we’ll tell you when some of the rules and concepts are not compatible with BRE and vice versa.
This is also a sequential guide: we’ll start with the simplest concepts and, with each new section, we’ll cover more advanced topics.
Topics
The pattern – Metacharacters – Basic matching – Character classes – Quantifiers – Greediness – Positioning – Lookarounds – Logic – Subpatterns and backreferences
The pattern
As we’ve mentioned earlier regular expressions work by using algorithms that identify different fragments of data inside a string matching one or more patterns. The pattern is the soul of a regular expression, and you’ll need to write matching patterns for the data you’re trying to “catch”.
The pattern delimiter
A quirk that was adopted by a lot of programming languages and tools implementing regex is that you’ll have to enclose the pattern between two delimiters. The delimiters are symbols, like for example /pattern/ or $pattern$. Most tools and programming languages let you choose the symbol you want to use as a delimiter. Other programming languages, like for example Python, don’t require you to enclose the pattern inside two delimiters.
In this guide we’ll omit the delimiters on every pattern just to make some of the examples more legible.
Metacharacters
Metacharacters are reserved symbols used in regular expressions to accomplish different tasks. Since metacharacters are interpreted as such, if you want to use them as their literal character equivalent you’ll have to escape them using the metacharacter \. For example, if you want to match two plus two “2+2” you’ll have to escape the + like this: “2\+2”
Metacharacters are: . ^ $ * + ? { } [ ] \ | ( )
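As a small illustration of escaping, here is a sketch using Python’s built-in re module (our choice for the examples, not something the rules above depend on). It compares the escaped and unescaped versions of the “2+2” pattern; the sample text is made up.

```python
import re

# Unescaped, "+" is a quantifier: "2+2" means one or more '2' followed by '2'.
unescaped = re.search(r"2+2", "2+2=4")

# Escaped, "\+" is a literal plus sign.
escaped = re.search(r"2\+2", "2+2=4")

print(unescaped)        # None
print(escaped.group())  # 2+2

# re.escape() escapes the metacharacters of a literal string for you.
auto_pattern = re.escape("2+2")
auto = re.search(auto_pattern, "2+2=4")
print(auto.group())     # 2+2
```

re.escape is handy when the text to match comes from user input and may contain metacharacters.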
Regex basic matching
Literals
The most basic kind of regex matching are the literals. The literals are regular characters and numbers like a, A, b, B, c, C, d, D… etc. 1, 2, 3, 4… etc. and also some symbols like _, @, etc. (basically any symbol that isn’t a metacharacter). When you write a pattern using literals you’ll just write exactly what you want to catch. For example:
- “Hello” will only match the text ‘Hello’ and it will not match ‘hello’ or ‘olleh’.
Note 1: the order and position matter. “12” will match ’12’ but it will not match ’21’.
Note 2: each literal represents a single character. “Tea” will match only ‘Tea’, it will not match ‘Teea’ or ‘Teeaaa’.
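The behaviour of literals can be checked with a short Python sketch. The sample strings are ours; re.search looks for the pattern anywhere inside the text.

```python
import re

samples = ["Hello", "hello", "olleh", "Say Hello!"]

# True when the literal pattern "Hello" appears somewhere in the text.
hello_results = {text: bool(re.search(r"Hello", text)) for text in samples}
print(hello_results)

# Order matters: "12" is not found in '21'.
order_match = re.search(r"12", "21")

# Each literal stands for exactly one character: "Tea" is not found in 'Teea'.
tea_match = re.search(r"Tea", "Teea")

print(order_match, tea_match)  # None None
```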
The wildcard
Nearly all regular expression engines use the dot . as a wildcard that will match any single character (by default, usually any character except a line break).
- For example: “a.cd” will match ‘abcd’, ‘afcd’, ‘a cd’, ‘a1cd’, etc.
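A quick way to try the wildcard is the sketch below, again in Python. It uses re.fullmatch so the whole word has to fit the pattern. The last two sample words are ours and show what the dot does not match by default.

```python
import re

words = ["abcd", "afcd", "a cd", "a1cd", "acd", "a\ncd"]

# The dot stands for exactly one character, so 'acd' is too short.
# By default the dot does not match a newline in Python.
wildcard_matches = [w for w in words if re.fullmatch(r"a.cd", w)]
print(wildcard_matches)  # ['abcd', 'afcd', 'a cd', 'a1cd']

# With the re.DOTALL flag the dot also matches a newline.
dotall_match = re.fullmatch(r"a.cd", "a\ncd", re.DOTALL)
print(dotall_match is not None)  # True
```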
Special sequences
A set of special sequences can be used to create basic patterns. These special sequences represent a single character in the matched expression, and their position in the pattern matters. You can combine them with the literal characters to create more complex patterns.
- \d will match any digit character.
- For example: “\d\d\d” will match ‘123’, ‘345’, etc.
- \w will match any “word” character
- For example “N\w\w\w” will match ‘News’, ‘Norm’, ‘Nate’, etc.
- \s will match a whitespace character.
Some regex implementations also define:
- \v vertical whitespace (line breaks).
- \h horizontal whitespace (spaces, tabs, etc.).
- \n new lines (this one is used mostly in code or text editors that implement regular expressions engines).
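The three most common sequences can be tried with the following Python sketch. The sample sentences are invented. Python’s re module does not recognise \h, so only \d, \w and \s are shown.

```python
import re

# \d\d\d: three consecutive digits.
digits = re.findall(r"\d\d\d", "Call 123 or 345, not 12")
print(digits)  # ['123', '345']

# N\w\w\w: an 'N' followed by three word characters.
names = re.findall(r"N\w\w\w", "News from Norm and Nate")
print(names)   # ['News', 'Norm', 'Nate']

# \s: split on any single whitespace character (space, tab, newline).
parts = re.split(r"\s", "one two\tthree\nfour")
print(parts)   # ['one', 'two', 'three', 'four']
```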
Negation sequences
Many regex implementations also have negation sequences, which do the opposite to the ones shown above. They’re written using uppercase letters.
- \W will match any non-word character.
- \D will match any non-digit character.
- \S will match any non-whitespace character.
- \H will match any non-horizontal whitespace character.
- \V will match any non-vertical whitespace character.
Note: modern regular expressions engines have a much more powerful negation system, see below in this article.
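A minimal Python sketch of the uppercase sequences, run on a made-up string. The + quantifier used in two of the patterns is covered later in this guide; here it simply groups consecutive matches.

```python
import re

text = "Room 42, floor 3!"

# \D+: runs of non-digit characters.
non_digits = re.findall(r"\D+", text)
print(non_digits)  # ['Room ', ', floor ', '!']

# \W: single non-word characters (spaces and punctuation).
non_word = re.findall(r"\W", text)
print(non_word)    # [' ', ',', ' ', ' ', '!']

# \S+: runs of non-whitespace characters.
non_space = re.findall(r"\S+", text)
print(non_space)   # ['Room', '42,', 'floor', '3!']
```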
Matching unicode characters
- \p{xx} a character with the property xx.
- \P{xx} a character without the property xx.
- \X extended unicode sequence
Character classes
Character classes are where the fun begins. Classes are defined by brackets [ ]. We can either define our own highly specific classes or use more generic ranges. A character class matches only one character.
Defining our own classes
We simply enclose any sequence in-between brackets. For example, [aei] will match either the letter a, or the letter e or the letter i, whichever one appears first.
[HP]ole # will match either 'Hole' or 'Pole'
# will not match 'HPole'.
Remember a character class will match only one character (unless you’re using quantifiers, see below).
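The [HP]ole example can be checked in Python as sketched below. We use re.fullmatch so the whole word must fit the pattern; the extra sample words are ours.

```python
import re

words = ["Hole", "Pole", "HPole", "Mole"]

# The class [HP] consumes exactly one character: either 'H' or 'P'.
class_matches = [w for w in words if re.fullmatch(r"[HP]ole", w)]
print(class_matches)  # ['Hole', 'Pole']

# A search inside 'HPole' still finds the 'Pole' part of it.
inner = re.search(r"[HP]ole", "HPole").group()
print(inner)  # Pole
```

The second call shows why the choice between matching the whole text and searching inside it matters when you test a pattern.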
Ranges
Ranges are what turn character classes into extremely powerful pattern making tools. They’re defined by a - symbol between two characters inside the brackets, for example [a-z], and they allow us to define ranges of characters:
- [a-z] all lowercase letters.
- [A-Z] all uppercase letters.
- [a-zA-Z] all lowercase and uppercase letters.
- [0-9] all numbers.
- [a-zA-Z0-9] all lowercase, uppercase letters and numbers.
For example:
[A-Z][a-z][a-z][0-9] # Will match:
# Run5
# Cab9
# Tri8
# Etc...
Note 1: ranges can be partial. For example, “[a-c]” will match ‘a’, ‘b’ or ‘c’ and “[1-3]” will match ‘1’, ‘2’ or ‘3’.
Note 2: you can combine ranges by just placing them next to each other. For example: “[a-cq-z]” will match all letters from a to c and from q to z.
Note 3: you can combine classes and ranges with the special sequences seen earlier. For example: “[c-e\d]” will match c, d, e or a digit. The previous example is equivalent to: “[c-e0-9]”.
Note 4: if you want to match the literal - character (without specifying a range) you’ll have to add it either at the beginning or at the end of your character class: “[abc-]” or “[-abc]”.
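The range rules above can be tried with this Python sketch. The word list and sample strings are ours, chosen to include a few cases that should fail.

```python
import re

codes = ["Run5", "Cab9", "Tri8", "run5", "RUN5", "Runs"]

# One uppercase letter, two lowercase letters, one digit.
valid_codes = [c for c in codes if re.fullmatch(r"[A-Z][a-z][a-z][0-9]", c)]
print(valid_codes)  # ['Run5', 'Cab9', 'Tri8']

# Two ranges placed side by side in the same class.
combined = re.findall(r"[a-cq-z]", "abcdefqrs")
print(combined)     # ['a', 'b', 'c', 'q', 'r', 's']

# A hyphen at the end of the class is a literal hyphen, not a range.
with_hyphen = re.findall(r"[abc-]", "a-b-d")
print(with_hyphen)  # ['a', '-', 'b', '-']
```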
BRE (POSIX) categories
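The shorthand sequences shown earlier (\d, \w, \s) come from PCRE and most modern engines. BRE (POSIX) has something similar, albeit not as compact, and the exact contents of each category depend on the system locale. You should only use these if the tool or programming language you’re using doesn’t accept the PCRE shorthands. In the BRE context these are usually called categories rather than classes, and they have to be written inside a bracket expression, for example “[[:digit:]]” rather than “[:digit:]”.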
- [:alnum:] alphanumeric (similar to [A-Za-z0-9])
- [:alpha:] alphabetic character (similar to [A-Za-z])
- [:blank:] space or tab.
- [:space:] whitespace character.
- [:digit:] numeric digit (similar to [0-9])
- [:lower:] lowercase letters (similar to [a-z])
- [:punct:] printable characters, excluding spaces and alphanumerics.
- [:upper:] uppercase letter (similar to [A-Z])
Quantifying regular expressions
So far, everything we’ve seen matches only one character. Thankfully, regular expressions have quantifiers. Quantifiers can take one character class, a literal or a subpattern (see below) and repeat it multiple times.
Basic regular expressions quantifiers
These are expressed by adding either *, + or ? immediately after a character class.
- * zero or more times
- + one or more times
- ? zero or one times
For example:
c[a]*t # will match:
# ct
# cat
# caat
# caaat
# ... and so on.
Note 1: regular expressions quantifiers can also be used with literals and subpatterns (see below). For example “a(bc)+” will match an ‘a’ and one or more ‘bc’.
Note 2: quantifiers are closely related to the concept of greediness (see below).
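The three basic quantifiers can be compared side by side with the Python sketch below. We write the pattern as “ca*t”, which behaves the same as “c[a]*t”.

```python
import re

words = ["ct", "cat", "caat", "caaat"]

# * : zero or more 'a'
star = [w for w in words if re.fullmatch(r"ca*t", w)]
# + : one or more 'a'
plus = [w for w in words if re.fullmatch(r"ca+t", w)]
# ? : zero or one 'a'
optional = [w for w in words if re.fullmatch(r"ca?t", w)]

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']

# A quantifier applied to a subpattern repeats the whole group.
group_repeat = re.fullmatch(r"a(bc)+", "abcbc") is not None
print(group_repeat)  # True
```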
Ranged and numbered quantifiers
- {n,m} at least ”n” times, and no more than ”m”
- {n} exactly “n” times
- {n,} at least “n” times
- {,m} at most “m” times (not accepted by every engine; “{0,m}” is the portable way of writing it)
For example:
c[a]{2,3}t # will only match:
# caat
# caaat
a(bc){1,5} # 'a' and at least one or at most five 'bc'
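A short Python sketch of the numbered quantifiers, showing how they behave in Python’s re module. The word list is ours.

```python
import re

words = ["ct", "cat", "caat", "caaat", "caaaat"]

# {2,3}: between two and three 'a'
between = [w for w in words if re.fullmatch(r"ca{2,3}t", w)]
# {2}: exactly two 'a'
exactly = [w for w in words if re.fullmatch(r"ca{2}t", w)]
# {2,}: two or more 'a'
at_least = [w for w in words if re.fullmatch(r"ca{2,}t", w)]
# {0,1}: at most one 'a'
at_most = [w for w in words if re.fullmatch(r"ca{0,1}t", w)]

print(between)   # ['caat', 'caaat']
print(exactly)   # ['caat']
print(at_least)  # ['caat', 'caaat', 'caaaat']
print(at_most)   # ['ct', 'cat']
```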
Greediness
By default, regular expressions are greedy: they’ll keep capturing characters until they find the last occurrence of the character or class signaling the end of your pattern. If you want to avoid that and match only until the first occurrence, you’ll need to make the quantifier lazy (non-greedy) by adding a ? after it.
Example
- Let’s say you want to write a pattern to match the first value enclosed in dashes in the following string:
- -valueA- random text -valueB- more random text -valueC-
- If you write the following pattern: “-(.+)-” instead of capturing valueA you’ll capture everything between the first and the last dash, because technically the whole string is enclosed in two dashes.
- The solution is adding a ? after the quantifier: “-(.+?)-”
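The difference between the greedy and the lazy pattern can be seen with this Python sketch, which runs both on the example string.

```python
import re

text = "-valueA- random text -valueB- more random text -valueC-"

# Greedy: .+ runs as far as it can, so the capture ends at the last dash.
greedy = re.search(r"-(.+)-", text).group(1)
print(greedy)  # valueA- random text -valueB- more random text -valueC

# Lazy: .+? stops at the first dash that lets the pattern succeed.
lazy = re.search(r"-(.+?)-", text).group(1)
print(lazy)    # valueA
```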
Catching all occurrences
Now suppose you want to catch all the occurrences separately. For this operation you’ll have to use a pattern structure like this: .*?([pattern]).*? and run it with the global (find all) option of your tool or language.
Note: some regex implementations, especially those found in code editors like Sublime Text and VS Code, will always catch all the occurrences no matter what.
Example
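Let’s suppose you’re parsing an HTML document and want to catch all the elements. For this operation, you’ll have to write a pattern like this one: <.*?([a-z]+).*?>. This pattern will match every HTML tag and capture its name. Because the first .*? can also consume a ‘/’, closing tags are matched too; a stricter pattern such as <([a-z]+)[^>]*> will match only the opening ones.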
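In Python, catching every occurrence is usually done with re.findall, as sketched below. The HTML snippet is made up, and the tag pattern is a stricter variant we chose for the sketch (it requires a letter right after the <, so closing tags are skipped). It is only meant for simple, well-formed markup.

```python
import re

text = "-valueA- random text -valueB- more random text -valueC-"

# findall returns the captured group of every non-overlapping match.
all_values = re.findall(r"-(.+?)-", text)
print(all_values)  # ['valueA', 'valueB', 'valueC']

html = '<div class="box"><p>Hi</p><br></div>'

# Tag name right after '<', then anything up to the closing '>'.
opening_tags = re.findall(r"<([a-z][a-z0-9]*)[^>]*>", html)
print(opening_tags)  # ['div', 'p', 'br']
```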
Positioning
Beginnings and ends
Positioning symbols are another extremely useful tool to create complex patterns. These symbols are placed immediately before (for ^) or immediately after (for $) a character class, literal or subpattern and will allow you to specify the position within each line (not the text as a whole) where the pattern should appear in order to be considered a match. Some engines only work line by line when a multiline option is enabled.
- ^ beginning of the line
- $ end of the line
For example:
^[hc]at # will match either "hat" or "cat"
# but only if it appears at the
# beginning of the line.
[hc]at$ # will match either "hat" or "cat"
# but only if it appears at the
# end of the line.
Note 1: as we can see in the examples, they affect the whole pattern.
Note 2: you can combine them with quantifiers. For example, if we want to match all lines of text ending with three dots we’ll write the following pattern: “.*\.{3}$” (the first dot is the metacharacter wildcard and the second dot is an escaped dot to represent the literal dot character).
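The sketch below tries the positioning symbols in Python. One assumption to be aware of: in Python’s re module, ^ and $ refer to the whole text unless the re.MULTILINE flag is passed, so the flag is needed to get the per-line behaviour described above. The sample text is ours.

```python
import re

text = "cat on the mat\nthe hat\nhat trick"

# Without the flag, ^ only matches at the very start of the text.
first_only = re.findall(r"^[hc]at", text)
print(first_only)   # ['cat']

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^[hc]at", text, re.MULTILINE)
line_ends = re.findall(r"[hc]at$", text, re.MULTILINE)
print(line_starts)  # ['cat', 'hat']
print(line_ends)    # ['hat']

# Lines ending with three dots.
dotted = re.findall(r".*\.{3}$", "Wait...\nDone.\nHmm...", re.MULTILINE)
print(dotted)       # ['Wait...', 'Hmm...']
```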
Lookaheads and lookbehinds
Lookaheads and lookbehinds are rules that allow you to specify following and preceding elements to your patterns. Their syntax is a bit tricky. Both lookaheads and lookbehinds can be either positive (follow or precede) or negative (don’t follow or precede).
- Negative lookahead: c(?!d) c not followed by d.
- Positive lookahead: c(?=d) c followed by d.
- Negative lookbehind: (?<!c)d d not preceded by c.
- Positive lookbehind: (?<=c)d d preceded by c.
For example: we’ll use a negative lookahead to match any word starting with a ‘d’ not followed by an ‘o’.
d(?!o)[a-zA-Z]+ # Will match:
# deal
# dial
# dentist
# will not match:
# dog
# donor
Note: In the previous examples we used literals to simplify them, but you can also use sequences, character classes or subpatterns. For example: “^[a-zA-Z](?!o)[a-zA-Z]” will match the first two letters of a line, as long as the second letter isn’t an ‘o’.
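The four kinds of lookaround can be compared with the Python sketch below. To make the effect visible we use re.sub to turn the matched letter into uppercase; the short sample strings are ours. Notice that the lookaround itself is never part of the match.

```python
import re

# Negative lookahead from the example above.
words = "deal dial dentist dog donor"
not_do = re.findall(r"d(?!o)[a-zA-Z]+", words)
print(not_do)  # ['deal', 'dial', 'dentist']

# Positive lookahead: 'c' followed by 'd'.
pos_ahead = re.sub(r"c(?=d)", "C", "cd ce")
# Negative lookahead: 'c' not followed by 'd'.
neg_ahead = re.sub(r"c(?!d)", "C", "cd ce")
# Positive lookbehind: 'd' preceded by 'c'.
pos_behind = re.sub(r"(?<=c)d", "D", "cd ad")
# Negative lookbehind: 'd' not preceded by 'c'.
neg_behind = re.sub(r"(?<!c)d", "D", "cd ad")

print(pos_ahead, "|", neg_ahead)    # Cd ce | cd Ce
print(pos_behind, "|", neg_behind)  # cD ad | cd aD
```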
Logic
You can perform some basic logic operations in your patterns by either negating the contents of a character class or using an ‘or’ operator.
Negation
Negation is performed by placing the ^ symbol at the start of a character class, right after the opening bracket. For example, [^aeiou] will match every character except lowercase vowels. Similarly, [^a-zA-Z] will match any character that isn’t a letter.
Note 1: the symbol for negation is the same symbol used to specify the location of a character class. The difference is that the negation ^ goes inside the class brackets while the positioning ^ goes outside and immediately before the class brackets.
Note 2: you only use the negation symbol once, and you don’t need an ‘or’ symbol inside a class. For example, [^ai] will match any character that isn’t an a or an i. Writing [^a|i] would also exclude the literal | character, because | has no special meaning inside brackets.
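A minimal Python sketch of negated classes, run on two made-up strings. The “not a letter” class is written here as [^a-zA-Z].

```python
import re

# Every character of 'regex' that is not a lowercase vowel.
consonants = re.findall(r"[^aeiou]", "regex")
print(consonants)   # ['r', 'g', 'x']

# Every character that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2!")
print(non_letters)  # ['1', ' ', '2', '!']
```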
Negating specific words
You can negate specific words with a negative lookaround. For example, ^(?!.*cat).*$ will match only the lines that don’t contain ‘cat’.
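A common way of writing this idiom is ^(?!.*cat).*$, and the Python sketch below applies it line by line to a made-up text. The re.MULTILINE flag is our addition so that ^ and $ work on each line.

```python
import re

lines = "a cat sat\nthe dog ran\nconcatenate\nbirds fly"

# The lookahead fails on any line that contains 'cat' anywhere,
# so only the remaining lines are matched in full.
without_cat = re.findall(r"^(?!.*cat).*$", lines, re.MULTILINE)
print(without_cat)  # ['the dog ran', 'birds fly']
```

The line ‘concatenate’ is rejected too, because the pattern looks for the letters ‘cat’ and not for a separate word.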
Or
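The | symbol specifies an or operation. It is written outside character classes; inside brackets it is just a literal | character.
- Alternation: a|b a or b. You can limit its scope with a subpattern: gr(a|e)y will match ‘gray’ or ‘grey’.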
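Alternation between whole words can be sketched in Python as follows. The sample strings are ours, and the (?: ) form is a group that limits the scope of the | without capturing anything.

```python
import re

# Either the word 'cat' or the word 'dog'.
pets = re.findall(r"cat|dog", "cat, dog, cow")
print(pets)   # ['cat', 'dog']

# The alternation only applies to the letter inside the group.
greys = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(greys)  # ['gray', 'grey']
```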
Subpatterns and backreferences
Subpatterns
Subpatterns are one of the most useful and powerful tools you can use when making complex regular expressions. Basically, they let you capture different portions of your matches and store said matches inside numbered variables. They’re extremely useful when you use them along with replacement commands. A subpattern is created by simply enclosing any part of your main pattern within ( ).
For example: c(a)(n) will match the word ‘can’ and capture the ‘a’ in the variable $1 and the ‘n’ in the variable $2.
If you’re using a tool that allows you to work with the output of a regex, then you could use these variables to perform other operations.
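In Python the captured values are read from the match object instead of $1 and $2, and a replacement string refers to them as \1 and \2. The sketch below shows both; the ‘user@host’ string is a made-up example.

```python
import re

match = re.search(r"c(a)(n)", "I can")
whole = match.group(0)   # the full match
first = match.group(1)   # first subpattern
second = match.group(2)  # second subpattern
print(whole, first, second)  # can a n

# Using the captures in a replacement: swap the two sides of the '@'.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")
print(swapped)  # host@user
```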
Backreferences
Subpatterns become an even more powerful tool when used alongside backreferences (also called retroreferences). Backreferences allow you to use something captured by a subpattern and reuse it within the same pattern.
Backreferences are numbered. For example, the backreference \1 will refer to the first subpattern in your pattern while \2 to the second subpattern, and so on.
The c(at) (h)as a \2\1 # Will match:
# 'The cat has a hat'
As we can see in the prior example, we have two subpatterns, (at) and (h), and we can invoke them in the pattern by using their respective numbered backreferences.
You can repeat backreferences as many times as you wish:
([a-c])x\1x\1 # Will match:
# axaxa, bxbxb and cxcxc.
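Both backreference examples can be verified with this Python sketch. The failing samples (‘bat’, ‘axbxa’ and ‘dxdxd’) are ours.

```python
import re

pattern = r"The c(at) (h)as a \2\1"

# \2 repeats 'h' and \1 repeats 'at', so the last word must be 'hat'.
hat = re.search(pattern, "The cat has a hat") is not None
bat = re.search(pattern, "The cat has a bat") is not None
print(hat, bat)  # True False

words = ["axaxa", "bxbxb", "cxcxc", "axbxa", "dxdxd"]

# \1 must repeat the exact letter captured by ([a-c]).
repeated = [w for w in words if re.fullmatch(r"([a-c])x\1x\1", w)]
print(repeated)  # ['axaxa', 'bxbxb', 'cxcxc']
```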