My goal in this article is to teach you everything you ever wanted to know about Regular Expressions. Regular Expressions are used to decide whether a sequence of characters matches what you are searching for. You use a series of codes to define what you are looking for.
Because these codes are so short and look confusing when they are crammed together, people think Regular Expressions are too hard to use. My mission in this article and the next is to make them easy to use. Promise!
The Basics of Regular Expressions
Let’s say you are given a 10,000-page document and you’re asked to retrieve every street address from that document. That is a very easy task with a Regular Expression, which will be known as Regex from now on.
I’m going to start off simple. Let’s say you just need the street address and that every city is the same and there are no apartments or suites. You can trust that what you are looking for will be of the form 123 Main St. You are also told that:
The Regex you would use to define these addresses would be:
To find a sequence of characters, you have to define the rules that will always be true for them and then turn those rules into an expression. Here are codes used to represent different types of characters:
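Here is a minimal sketch of that idea in Python, using the standard `re` module. The character codes in the comments are standard ones in Python's flavor of Regex. The address rule itself is only an assumption for illustration (a house number, one capitalized street name, and a short suffix such as St or Ave), and the names `sample` and `ADDRESS` are made up for this example.

```python
import re

# Common character codes in Python's re module:
#   \d  a digit                 \D  anything that is not a digit
#   \w  a word character        \W  anything that is not a word character
#       (letter, digit, underscore)
#   \s  whitespace              \S  anything that is not whitespace
#   .   any character except a newline

sample = "Ship to 123 Main St or 4567 Oak Ave today."

digits = re.findall(r"\d", "Apt 42")        # each digit on its own
words = re.findall(r"\w+", "123 Main St")   # runs of word characters

# Hypothetical address rule (an assumption, not a general address parser):
# 1+ digits, whitespace, a capitalized street name, whitespace, a suffix.
ADDRESS = re.compile(r"\d+\s[A-Z][a-z]+\s(?:St|Ave|Rd|Blvd)\b")

addresses = ADDRESS.findall(sample)

print(digits)     # ['4', '2']
print(words)      # ['123', 'Main', 'St']
print(addresses)  # ['123 Main St', '4567 Oak Ave']
```

Real addresses are messier than this, so treat the pattern as a starting point that you tighten as you learn more rules about your data.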
Searching for a Name
If you were tasked with searching through a mountain of documents for anyone named Jennifer, how would you do that? You can search for literal text as well as codes, and you would do that with this expression:
‘Jennifer\s\w+\s’ This will search for the word Jennifer followed by a whitespace character, 1 or more word characters and then another whitespace character. The plus sign (+) stands for 1 or more of the code that precedes it. In this case I’m stating that I’m looking for 1 or more word characters (\w), meaning letters, digits or underscores.
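To see that expression at work, here is a small Python sketch. The sample sentence and the variable names are invented for the example.

```python
import re

# Literal text "Jennifer", whitespace, 1+ word characters, whitespace.
pattern = re.compile(r"Jennifer\s\w+\s")

text = "Call Jennifer Lopez today, then email Jennifer Aniston tomorrow."
matches = pattern.findall(text)
print(matches)  # ['Jennifer Lopez ', 'Jennifer Aniston ']

# The trailing \s is strict: a name followed by a period does not match.
print(pattern.search("Hello Jennifer Lopez."))  # None
```

Notice that each match includes the trailing whitespace, and that a name at the very end of a sentence is missed. That is the kind of detail worth checking against your own documents.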
There are other codes like the plus (+):
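As a quick illustration, the sketch below runs the repetition codes that Python's `re` module supports against a few made-up strings. The test words are arbitrary and only there to show how each code behaves.

```python
import re

# Repetition codes in Python's re module:
#   *      zero or more        +      one or more
#   ?      zero or one         {n}    exactly n
#   {n,}   n or more           {n,m}  between n and m

star = re.findall(r"go*gle", "ggle gogle google gooogle")
plus = re.findall(r"go+gle", "ggle gogle google gooogle")
optional = re.findall(r"colou?r", "color colour colouur")
exact = re.findall(r"\b\d{3}\b", "7 42 123 4567")
ranged = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")

print(star)      # ['ggle', 'gogle', 'google', 'gooogle']
print(plus)      # ['gogle', 'google', 'gooogle']
print(optional)  # ['color', 'colour']
print(exact)     # ['123']
print(ranged)    # ['42', '123']
```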
Some Characters Need Special Care
While we can search for literal terms, like we did with Jennifer above, some characters require escaping. By escaping I mean they must be preceded by a backslash. The dollar sign ($) and period (.) are two such examples. You could search for a dollar amount with this regex:
Other characters that need to be escaped with a backslash include:
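One way to write such a pattern, sketched in Python. The amount format is an assumption: a dollar sign, one or more digits, a period, and exactly two digits. The name `PRICE` is made up for the example.

```python
import re

# \$ is a literal dollar sign and \. is a literal period.
# Assumed format: $, 1+ digits, a period, exactly two digits.
PRICE = re.compile(r"\$\d+\.\d{2}")

prices = PRICE.findall("Lunch was $12.50 and coffee was $3.75, not 4.00.")
print(prices)  # ['$12.50', '$3.75']

# Left unescaped, the period matches any character at all:
loose = re.findall(r"3.75", "3x75 and 3.75")
print(loose)   # ['3x75', '3.75']
```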
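In Python's flavor of Regex, the characters with special meaning are the ones collected in `SPECIAL` below. Other tools may differ slightly, so take this as a sketch for Python only. The standard `re.escape` function adds the backslashes for you when you want to search for a piece of text exactly as written. The sample strings are invented.

```python
import re

# Characters that carry special meaning in a Python pattern.
SPECIAL = r".^$*+?{}[]\|()"

# Each one matches itself once it has been escaped.
all_escapable = all(re.fullmatch(re.escape(ch), ch) for ch in SPECIAL)

literal = "Total (USD): $5.00 [paid]"
receipt = "Receipt. Total (USD): $5.00 [paid] Thanks!"

# Unescaped, the parentheses, dollar sign and brackets are read as codes.
unescaped = re.search(literal, receipt)

# Escaped, every character is taken literally.
found = re.search(re.escape(literal), receipt)

print(all_escapable)   # True
print(unescaped)       # None
print(found.group())   # Total (USD): $5.00 [paid]
```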
How to Search for Specific White Space
If you want to search for specific white space, you use the following codes:
Just place them in the Regex as if they were any other character.
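As a sketch, here are the most common white space codes in Python's `re` module applied to a made-up row of tab-separated text. The variable names are only for illustration.

```python
import re

# Specific white space codes in Python's re module:
#   \t  tab        \n  newline (line feed)    \r  carriage return
#   A plain space in a pattern matches a plain space.

row = "name\tcity\nAnn\tOslo\r\n"

# Two words separated by exactly one tab.
tab_pairs = re.findall(r"\w+\t\w+", row)

# A line ending, with or without a carriage return in front of it.
line_endings = re.findall(r"\r?\n", row)

print(tab_pairs)     # ['name\tcity', 'Ann\tOslo']
print(line_endings)  # ['\n', '\r\n']
```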
Match One of a Couple of Characters
What could you do if you wanted to search for commonly misspelled words? Calendar is commonly misspelled, and here is how you could search for both Calendar and Calender.
calend[ae]r : This regex will come back positive if the word is spelled either way. Only one of the letters inside the square brackets will be used, however. Square brackets can also be used to search for a range of characters, like these examples:
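Here is a short Python sketch of square brackets in action, first with the calendar example and then with a few ranges. The test strings are made up.

```python
import re

# One character from the set: a or e.
spellings = re.findall(r"calend[ae]r", "calendar calender calendor")

# Ranges: any lowercase letter, any hexadecimal digit.
lowercase = re.findall(r"[a-z]+", "abc DEF ghi")
hex_digits = re.findall(r"[0-9A-Fa-f]+", "ff 1A zz")

# A caret right after the opening bracket negates the set.
not_digits = re.findall(r"[^0-9]+", "ab12cd")

print(spellings)   # ['calendar', 'calender']
print(lowercase)   # ['abc', 'ghi']
print(hex_digits)  # ['ff', '1A']
print(not_digits)  # ['ab', 'cd']
```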
Searching for Missing Jennifers
Remember when you searched through thousands of pages to find everyone named Jennifer? Well, you missed a few. Don’t worry, we can easily find the Jens and Jennys with the vertical bar code. The vertical bar (|) is read as the word OR in Regex. To find all the Jennifers, we can use this code instead:
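Note: The code \b matches at a word boundary, the position just before or just after a whole word. It matches a position and does not consume a space or any other character. \B matches wherever there is no word boundary, such as between two letters inside a word.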
The following code would match all of the different Jennifers and also return the last name.
This code is stating: return the first and last name if the first name:
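A possible version of such a search, sketched in Python. The exact pattern is an assumption on my part: it accepts Jennifer, Jenny or Jen as the first name, then one whitespace character, then a last name made of word characters. The sample sentence is invented.

```python
import re

# Group 1: one of three first names. Group 2: the last name.
FIRST_AND_LAST = r"\b(Jennifer|Jenny|Jen)\s(\w+)"

text = "Jen Smith met Jenny Jones, Jennifer Lopez and Jenkins Brown."
names = re.findall(FIRST_AND_LAST, text)
print(names)
# [('Jen', 'Smith'), ('Jenny', 'Jones'), ('Jennifer', 'Lopez')]

# \b requires a word boundary, \B requires that there is none.
whole_word = re.findall(r"\bJen\b", "Jen Jenny Jennifer")
inside_word = re.findall(r"Jen\B", "Jen Jenny Jennifer")
print(whole_word)   # ['Jen']
print(inside_word)  # ['Jen', 'Jen']
```

Jenkins is skipped because the first name must be followed by whitespace, and none of the three options is followed by whitespace there.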
Using Search Codes Multiple Times
Did you notice in the last example how I surrounded the first name options with parentheses ()? By surrounding parts of a search in parentheses, you can then call for that part again with a backslash (\), followed by a number representing its position in the Regex.
So since this was the first time parentheses were used in the Regex, I can use it again with \1. The next parenthesized block would be referenced with \2 and so on up to \9. How groups after the ninth are referenced depends on the Regex flavor you are using; in some tools they are surrounded with angle brackets, as in <\10>.
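A classic use of \1 is finding a word that was accidentally typed twice. The Python sketch below is one way to do it. The name `DOUBLED` and the sample sentence are made up for the example.

```python
import re

# (\w+) captures a word, \s+ skips white space, and \1 demands
# the very same text that group 1 just captured.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

text = "This is is a test of the the parser."

repeats = [m.group(1) for m in DOUBLED.finditer(text)]
print(repeats)  # ['is', 'the']

# The same number works in the replacement text of re.sub.
fixed = DOUBLED.sub(r"\1", text)
print(fixed)    # This is a test of the parser.
```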
It would also be useful to grab just the text that lies between tags in HTML code. You could do that with the following code:
Are you starting to see why people get confused by Regexes? I’ll break this down for you:
I trust you can understand the rest.
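For comparison, here is one way to write a tag-matching pattern in Python. This is a sketch under simple assumptions: tags are well formed and the same tag name is not nested inside itself. It is not a full HTML parser. The name `TAG` and the sample markup are invented.

```python
import re

# <(\w+)    opening tag name, captured as group 1
# [^>]*>    any attributes, up to the closing angle bracket
# (.*?)     the text between the tags, as little as possible (group 2)
# </\1>     a closing tag with the same name as group 1
TAG = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

html = '<h1>Title</h1><p class="x">Hello <b>world</b></p>'

pairs = [(m.group(1), m.group(2)) for m in TAG.finditer(html)]
print(pairs)
# [('h1', 'Title'), ('p', 'Hello <b>world</b>')]
```

The question mark after the star makes the match stop at the first closing tag that fits instead of the last one.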
Last Few Codes
You can also reference the beginning of a line of text with the caret symbol (^). So if you wanted to capture any sentence that starts with “The cat”, you’d use this code:
Try figuring that out on your own. You can reference the end of a line of text with the dollar sign ($) in the same way.
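If you want to check your answer, here is a Python sketch of both anchors. In Python, ^ and $ only match at each line when the `re.MULTILINE` flag is set; other tools may behave differently by default. The sample text is made up.

```python
import re

text = "The cat sat on the mat.\nA dog barked.\nThe cat ran away."

# ^ anchors the match to the start of a line.
starts = re.findall(r"^The cat.*", text, re.MULTILINE)

# $ anchors the match to the end of a line.
ends = re.findall(r".*away\.$", text, re.MULTILINE)

# Without MULTILINE, ^ only matches at the start of the whole text.
first_only = re.findall(r"^The cat.*", text)

print(starts)      # ['The cat sat on the mat.', 'The cat ran away.']
print(ends)        # ['The cat ran away.']
print(first_only)  # ['The cat sat on the mat.']
```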
That’s All Folks… For Now
That’s all I can go over on Regexes in this article. I will continue in the next article to show you how to apply them in real-world code with tons of examples.
Till Next Time