# Regular Expressions

My goal in this article is to teach you everything you ever wanted to know about Regular Expressions. Regular Expressions are used to decide if a set of characters match what you are searching for. You use a series of codes to define what you are looking for.

Because these codes are so terse and look confusing when they are crammed together, people think Regular Expressions are too hard to use. My mission is to make them easy to use in this article and the next. Promise!

I’m going to provide a lot of code you can use. The language I’m using to demonstrate Regular Expressions in this tutorial is Python. I previously talked about Regular Expressions in JavaScript as well. If you’d like me to go over how Regular Expressions are different in other languages, just leave a comment below.

The Basics of Regular Expressions

Let’s say you are given a 10,000 page document and you’re asked to retrieve every street address from that document. That is a very easy task with a Regular Expression, which will be known as Regex from now on.

I’m going to start off simple. Let’s say you just need the street address and that every city is the same and there are no apartments or suites. You can trust that what you are looking for will be of the form 123 Main St. You are also told that:

• House numbers will be no longer than 5 digits in length
• Street names are exactly 1 word
• Every address is either a St. or Ave. and a period is always used

The Regex you would use to define these addresses would be:

• \d{1,5} – Between 1 and 5 digits
• \s – Followed by a whitespace character
• \w+ – Followed by 1 or more word characters (the street name)
• \s – Another whitespace character
• \w+ – Followed by 1 or more word characters (St or Ave)
• \. – Followed by a period

Put together, the full expression is \d{1,5}\s\w+\s\w+\.
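Here is a minimal sketch of how those pieces might be used in Python. The sample text is made up for illustration, and I'm assuming the simplified address rules listed above hold for every address.

```python
import re

# Pattern assembled from the pieces above: 1-5 digits, whitespace,
# a one-word street name, whitespace, a street type, and a literal period.
ADDRESS_PATTERN = re.compile(r"\d{1,5}\s\w+\s\w+\.")

# Hypothetical sample text, invented for this example.
sample_text = "Ship to 123 Main St. and bill to 45678 Oak Ave. by Friday."

addresses = ADDRESS_PATTERN.findall(sample_text)
print(addresses)  # ['123 Main St.', '45678 Oak Ave.']
```

Real addresses are messier than this, so treat the pattern as a starting point rather than a finished address parser.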

To find a sequence of characters, you have to define the rules that will always be true for them and then turn those rules into an expression. Here are codes used to represent different types of characters:

• \d – This represents any digit
• \D – This represents anything that isn’t a digit
• \s – This represents anything considered white space (space, tab, newline, etc.)
• \S – This represents anything not considered white space
• \w – This represents any word character (a letter, a digit or an underscore)
• \W – This represents anything that is not a word character
• . – Matches any character, except a line break
• \b – Matches the boundary at the start or end of a whole word (a position, not a character)
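To see a few of these codes in action, here is a small Python sketch. The sample string is hypothetical, and I'm only using `re.findall` to list what each code picks up.

```python
import re

# Hypothetical sample string for illustration.
text = "Room 42, floor 7"

digits = re.findall(r"\d", text)            # every single digit
non_space_runs = re.findall(r"\S+", text)   # runs of non-whitespace characters
words = re.findall(r"\b\w+\b", text)        # whole words between word boundaries

print(digits)          # ['4', '2', '7']
print(non_space_runs)  # ['Room', '42,', 'floor', '7']
print(words)           # ['Room', '42', 'floor', '7']
```

Notice that \S+ keeps the comma attached to 42, while \w+ drops it because a comma is not a word character.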

Searching for a Name

If you were tasked with searching through a mountain of documents for anyone named Jennifer, how would you do that? You can search for literal text right alongside the codes above, using this expression:

‘Jennifer\s\w+\s’ This will search for the word Jennifer followed by a whitespace character, 1 or more word characters and then another whitespace character. The plus sign (+) stands for 1 or more of the code that precedes it. In this case I’m stating that I’m looking for 1 or more word characters (\w).
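A quick sketch of that search in Python might look like this. The sentence is made up, and I'm assuming every last name is a single word followed by white space.

```python
import re

# Hypothetical sample sentence.
text = "Call Jennifer Lopez today and email Jennifer Aniston tomorrow."

matches = re.findall(r"Jennifer\s\w+\s", text)
print(matches)  # ['Jennifer Lopez ', 'Jennifer Aniston ']
```

The trailing \s is part of the match, so each result ends with a space, and a name that sits right before a period or at the very end of the text would be skipped.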

There are other codes like the plus (+):

• ? – Signifies you are looking for 0 or 1 repetitions of the code that precedes
• * – Signifies you expect 0 or more repetitions
• {n} – Used when you expect a specific number (n) of repetitions
• {x,y} – Used when you expect between (x) and (y) repetitions
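Here is a short Python sketch that tries each of these repetition codes once. All of the sample strings are invented for the example.

```python
import re

# ? : the "u" may appear 0 or 1 times
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ["color", "colour", "colouur"]]

# * : "a" followed by 0 or more "b"
star = re.findall(r"ab*", "a ab abbb")

# {n} : exactly 3 digits in a row
exact = re.findall(r"\d{3}", "12 345 6789")

# {x,y} : whole numbers that are 2 to 3 digits long
ranged = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")

print(optional_u)  # [True, True, False]
print(star)        # ['a', 'ab', 'abbb']
print(exact)       # ['345', '678']
print(ranged)      # ['22', '333']
```

The {n} example grabs 678 out of 6789 because nothing in that pattern says the digits must stand alone. The \b codes in the last pattern are what keep 4444 out.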

Some Characters Need Special Care

While we can search for literal terms, like we did with Jennifer above, some characters require escaping. By escaping I mean they must be preceded by a backslash. The dollar sign ($) and period (.) are two such examples. You could search for a dollar amount with this regex:

\$\d*\.\d{2}

• Looking for a dollar sign
• Followed by 0 or more digits
• Followed by a period
• Followed by 2 digits
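As a rough sketch, this is how a dollar amount search could look in Python. The prices are made up, and I'm assuming amounts never contain thousands separators.

```python
import re

# Escaped dollar sign, 0 or more digits, escaped period, exactly 2 digits.
PRICE_PATTERN = re.compile(r"\$\d*\.\d{2}")

# Hypothetical sample text.
text = "Coffee is $3.50, a bagel is $12.00, and gum is $.99."

prices = PRICE_PATTERN.findall(text)
print(prices)  # ['$3.50', '$12.00', '$.99']
```

Because \d* allows zero digits, an amount like $.99 is still found.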

Other characters that need to be escaped with a backslash include:

• (
• )
• *
• +
• ?
• [
• \
• ^
• {
• |
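If you would rather not escape each character by hand, Python's `re.escape` function can do it for you. This is just a sketch with a made-up string to show the idea.

```python
import re

# Hypothetical literal text that is full of special characters.
literal = "1+1=2 (really?)"

# Used as-is, the + means "1 or more", so the literal text is not found.
unescaped_result = re.search(r"1+1=2", "1+1=2")

# re.escape adds the backslashes so the text is matched character for character.
escaped = re.escape(literal)
found = re.search(escaped, "They say 1+1=2 (really?) is true")

print(unescaped_result)  # None
print(found.group())     # 1+1=2 (really?)
```

This is handy when the text you are searching for comes from a user or a file and you can't know in advance which special characters it contains.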

How to Search for Specific White Space

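If you want to search for specific white space or control characters, you use the following codes:

• \e – Escape (supported by some Regex engines, but not by Python’s re module, where you would write \x1b instead)
• \f – Form Feed
• \n – Newline
• \r – Carriage Return
• \t – Horizontal Tab

Just place them in the code as if they were any other character.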
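Here is a small Python sketch using two of these codes. The tab separated line and the Windows style text are hypothetical samples.

```python
import re

# Split a tab separated line into its fields.
line = "name\tage\tcity"
fields = re.split(r"\t", line)

# Turn Windows line endings (carriage return + newline) into plain newlines.
windows_text = "first\r\nsecond\r\n"
unix_text = re.sub(r"\r\n", "\n", windows_text)

print(fields)           # ['name', 'age', 'city']
print(repr(unix_text))  # 'first\nsecond\n'
```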

Match One of a Couple of Characters

What could you do if you wanted to search for commonly misspelled words? Calendar is commonly misspelled, and here is how you could search for both calendar and calender.

calend[ae]r : This regex will come back positive if the word is spelled either way. Only one of the letters inside the square brackets will be used, however. Square brackets can also be used to match a range of characters, like these examples:

• [a-z] : This would match any lower case letter
• [0-9] : This would match any digit
• [A-Fa-z1-4] : This would match uppercase letters from A to F, all lowercase letters and the digits 1 to 4
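A quick Python sketch of both ideas, using made-up sample text:

```python
import re

# Hypothetical sentence with two accepted spellings and one that is neither.
text = "Mark your calendar, or your calender, but not your calendor."
spellings = re.findall(r"calend[ae]r", text)

# Each character is tested against the set on its own.
in_set = re.findall(r"[A-Fa-z1-4]", "G5aB2")

print(spellings)  # ['calendar', 'calender']
print(in_set)     # ['a', 'B', '2']
```

G and 5 are left out of the second result because G falls outside A to F and 5 falls outside 1 to 4.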

Searching for Missing Jennifers

Remember when you searched through thousands of pages to find everyone named Jennifer? Well, you missed a few. Don’t worry, we can easily find the Jens and Jennys with the vertical bar code. The vertical bar (|) is read as the word OR in Regex. To find all the Jennifers, we can use this code instead:

(Jennifer|Jenny|Jen)\s\w+\b

Note: The code \b matches the boundary at the start or end of a whole word. \B matches any position that is not a word boundary, such as between two letters inside a word.
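Here is a sketch of the vertical bar and the two boundary codes in Python. The names and sentences are invented, and I'm assuming the first and last name are separated by a single white space character.

```python
import re

# Hypothetical sample sentence.
text = "Jen Smith met Jenny Jones and Jennifer Brown."

# One group for the first name options, one group for the last name.
NAME_PATTERN = re.compile(r"\b(Jennifer|Jenny|Jen)\s(\w+)\b")
pairs = NAME_PATTERN.findall(text)
print(pairs)  # [('Jen', 'Smith'), ('Jenny', 'Jones'), ('Jennifer', 'Brown')]

# \b finds "cat" as a whole word, \B finds it buried inside another word.
sample = "concatenate cat"
whole_word_start = re.search(r"\bcat\b", sample).start()
inside_word_start = re.search(r"\Bcat\B", sample).start()
print(whole_word_start, inside_word_start)  # 12 3
```

With two sets of parentheses in the pattern, `findall` returns a pair for every match instead of a single string.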

The following code would also match all of the different Jennifers and return the first and last name together.

‘(Je(?:nnifer|nny|n)\s\w+\s)’

This code is stating, return the first and last name if the first name:

• Starts with ‘Je’ and then either…
• Ends with the letters ‘nnifer’ or …
• Ends with the letters ‘nny’ or just ‘n’
• The inner parentheses starting with ?: group the three endings without capturing them. Square brackets would not work here, because they match single characters from a set rather than whole alternatives
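Square brackets and parentheses behave very differently when a vertical bar is involved, and a quick comparison in Python makes that visible. This is only a sketch with hypothetical names: a bracketed set such as [nnifer|nny|n] matches any single character out of the set, while a parenthesized group chooses between whole alternatives.

```python
import re

# Hypothetical names, each followed by a trailing space so the final \s can match.
names = ["Jen Smith ", "Jenny Jones ", "Jennifer Brown ", "Jeffrey Black "]

# A character set: 1 to 6 characters drawn from n, i, f, e, r, y and |.
set_pattern = re.compile(r"(Je[nnifer|nny|n]{1,6}\s\w+\s)")

# A non-capturing group: exactly one of the three endings.
group_pattern = re.compile(r"(Je(?:nnifer|nny|n)\s\w+\s)")

set_hits = [n for n in names if set_pattern.search(n)]
group_hits = [n for n in names if group_pattern.search(n)]

print(set_hits)    # all four names, including 'Jeffrey Black '
print(group_hits)  # only the Jen, Jenny and Jennifer entries
```

The set version accepts Jeffrey because f, r, e and y are all characters in the set, which is probably not what anyone searching for Jennifer wants.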

Using Search Codes Multiple Times

Did you notice in the last example how I surrounded the search with parentheses ()? By surrounding part of a search in parentheses, you can then call for the text it matched with a backslash (\), followed by a number representing its position in the Regex.

Since that was the first set of parentheses used in the Regex, I can use it again with \1. The next parenthesized group would be referenced with \2 and so on. In Python, these numbered references go up to \99, and in a replacement string you can write \g<10> to make a two-digit reference unambiguous.
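A common use for \1 is spotting a word that was typed twice in a row. Here is a minimal Python sketch with a made-up sentence, which also shows the \g<1> form used in a replacement string.

```python
import re

# Hypothetical sentence with two doubled words.
text = "This is is a test of the the system."

# (\w+) captures a word, \1 demands the very same word again after white space.
DOUBLED_WORD = re.compile(r"\b(\w+)\s\1\b")

repeated = DOUBLED_WORD.findall(text)
print(repeated)  # ['is', 'the']

# In the replacement, \g<1> puts back a single copy of the captured word.
cleaned = DOUBLED_WORD.sub(r"\g<1>", text)
print(cleaned)  # This is a test of the system.
```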

It would also be useful to grab just the text that lies between tags in HTML code. You could do that with the following code:

‘<\w+>(.+)</\w+>’

Are you starting to see why people get confused by Regexes? I’ll break this down for you:

• Everything is surrounded with quotes
• < : Matches an opening angle bracket (angle brackets don’t need to be escaped)
• \w+ : 1 or more word characters (the tag name)
• > : The closing angle bracket
• (.+) : Capture 1 or more characters and store them in \1
• </ : The angle bracket and forward slash that begin the closing tag

I trust you can understand the rest.
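Here is a rough Python sketch of grabbing the text between tags. The HTML snippet is made up, and I'm assuming simple tags with no attributes. A Regex like this is fine for quick jobs, but it is not a full HTML parser.

```python
import re

# Hypothetical HTML snippet with two tagged pieces of text on one line.
html = "<b>bold text</b> and <i>italic text</i>"

# .+ is greedy, so it runs all the way to the last closing tag.
greedy = re.findall(r"<\w+>(.+)</\w+>", html)

# .+? is lazy, so it stops at the first closing tag it can find.
lazy = re.findall(r"<\w+>(.+?)</\w+>", html)

print(greedy)  # ['bold text</b> and <i>italic text']
print(lazy)    # ['bold text', 'italic text']
```

Adding a question mark after the plus sign is usually what you want when there is more than one tag on a line.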

Last Few Codes

You can also reference the beginning of a line of text with the caret symbol (^). So if you wanted to capture any sentence that starts with “The cat”, you’d use this code:

‘^The cat\s\w*\.’

Try figuring that out on your own. You can reference the end of a line of text with the dollar sign ($) in the same way.
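One detail worth knowing in Python: by default the caret and dollar sign only look at the start and end of the whole string. To have them work on every line, pass the `re.MULTILINE` flag. A small sketch with made-up text:

```python
import re

# Hypothetical three line text.
text = "The cat sat.\nA dog ran.\nThe cat slept."

# Without the flag, ^ only matches at the very start of the text.
first_only = re.findall(r"^The cat\s\w*\.", text)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
every_line = re.findall(r"^The cat\s\w*\.", text, re.MULTILINE)
line_endings = re.findall(r"\w+\.$", text, re.MULTILINE)

print(first_only)    # ['The cat sat.']
print(every_line)    # ['The cat sat.', 'The cat slept.']
print(line_endings)  # ['sat.', 'ran.', 'slept.']
```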

That’s All Folks… For Now

That’s all I can go over on Regexes in this article. I will continue in the next article to show you how to apply them in real world code with tons of examples.

Till Next Time

Think Tank

### 26 Responses to “Regular Expressions”

1. Mike Joes says:

This was really helpful! Thanks 🙂 i’ll be coming up with lots of Questions on this 😛

2. Daniel says:

Oh my God your tutorials are amazing! Wow that’s so much! I’ve been shying away from TextMate’s RegEx features because it was simply terrifying!

Now I’m gonna use them as a matter of habit 🙂

3. chris says:

Thanks for the tutorial! Now it begins to make sense to me 🙂

You’re very welcome. Thanks for taking the time to show your appreciation

4. John Smith says:

Thank you so much for the tutorials, they were really helpful. regular expressions aren’t that difficult after all ;). I have a question though, how can I write a standard expression for detecting hexadecimal numbers from text ? I mean numbers like ‘0x0f4’, ‘0acdadecf822eeff32aca5830e438cb54aa722e3’, ‘8BADF00D’ are all hex numbers but I can’t seem to find a common pattern to be able to write a regular expression. any help ?

You could create a regex that grabs all strings of characters if they contain both numbers and letters? You could refine this to only words with numbers and a – f? I know that wouldn’t work if only characters are used.

There aren’t many words that can be made with just a, b, c, d, e and f. Just don’t count these as hex codes.

If you don’t count these as hex codes that gets you even better results.

You will normally be able to grab the codes by looking for patterns in the strings that normally surround hex codes. Maybe those codes are normally preceded by a :, 0x, #, etc.

I hope that helps? Thanks for the fun question 🙂

• John Smith says:

Thanks for your reply. I think I’ll have to specify different patterns for matching these expressions. For example re.compile (pat1 | pat2 | pat3). I don’t think there can be any one standard regex to find all the hex numbers from a given text file.

You’re probably right, but the information that surrounds the hex code normally has a pattern. Here is a tutorial in which I grabbed pretty complicated stuff from a site Regex on Steroids

5. Peter Osariemen says:

My Comment

You’re simply awesome. It’s nice Listening to you on video, as well as reading your articles. Thanks, and remain blessed forever.

Thank you very much for the very kind words 🙂 That makes me very happy and I’ll continue making videos for as long as I can. I’m glad you enjoy them

6. Anonymous says:

Explained in a simple and lucid fashion !!
Thanks man 😀

8. Roger Gosselin says:

I love you (I’m not gay).

That’s amazing, very interesting article. It was a pleasure to see you on video as well as to read your article. Thank you so much. I need more examples about Regex. Is there any video or article containing more examples? E.g. if we want to search for a word only as a name, or numbers written with characters… what are the codes?
thank you so much, you are awesome 🙂 🙂

10. Dennis Kean says:

In the above article you probably meant to say “precedes” but you wrote “proceeds” in 4 different instances.

” \b – Matches for a space that proceeds or follows a whole word ”
” The plus sign (+) stands for 1 or more of the code that proceeds it.”
” ? – Signifies you are looking for 0 or 1 repetitions of the code that proceeds”
” Note: The code \b will match for any space that proceeds or follows a whole word.”

(Please, feel free to delete this post when you read it I don’t want to affect your great work adversely. You invited me and I’m glad to help.)

And once again, Regex, something I avoided for years, suddenly is so clear! You are a unique individual, Derek. Humble and open minded you deserve my respect.

Dennis

• Derek Banas says:

Thank you Dennis. I fixed the errors. Sorry about that. I’m happy I was able to help with regular expressions 🙂

11. Mudabbir says:

Jazakallah! The idea of explaining them vertically instead of in a single line helped me. Thank you so much.

12. Siri says:

Derek, you are awesome. I cannot express my appreciation in words but let me tell you I am very grateful to you for the awesome tutorials by an excellent tutor. Your teaching skills are very unique. May GOD bless you!

• Derek Banas says:

Thank you for the kind compliments 🙂 May God bless you as well.

13. Fabiano Clavé says:

Hi Derek!

Just wondering if you would consider to make a series of tutorials aiming the public aspiring to become System Administrators? More specifically for Linux environment. I know there are tons of them out there already but you are the best teacher ever!
If so I have some suggestions for topics.
Another thing but also related to that. I would love to watch your approach on LFS ( Linux From Scratch ) Project. It’s there, but again, I think you’re the one who can really make that stick on our brains.

Whatever your comments on that will be, in my opinion you already are the coolest and best teacher for many things I’ve learned from you! So thanks a lot for that!

Take care!

• Derek Banas says:

Thank you for all the nice compliments 🙂 I have been using Linux forever. I’ll see what I can put together. Thanks for the request.

14. Deborah says:

Hi Derek. Keep the good work up. I’m moving into Bioinformatics from a Biochem background. You can’t imagine how helpful your tutorials are.
Thanks.

• Derek Banas says:

Thank you 🙂 That is very exciting! I’m glad that I’m able to help.