The power of search: Regular Expressions

Years ago I was teaching myself JavaScript when I got to regular expressions. That was when I knew coding was the right choice for me, because there is something very pure and magical about them. I learned and loved the power of this small language that lives inside another programming language. It scans the data you feed it (typically a string) and identifies the patterns or groups of letters, digits, and other characters you specify. The core rules of regular expressions are simple and fairly uniform across programming languages, much like “for”, “while”, and “return”. There are plenty of online tools for experimenting with regular expressions, or “regex”, such as Rubular (regex in Ruby), regex101, and RegExr. A regular expression is usually meant to be more meta and less literal: it describes a pattern in the data rather than one fixed piece of text. It can be literal, though, for example when you are looking for exactly one specific phone number. This post will focus on the Python flavor of regex, specifically the re module.

Python ships with a built-in regular expression module, and like any other module you have to import it into your file before you can use it. This fits the single responsibility principle and separation of concerns, and it lets us load only the functionality our program needs.

Simply put ‘import re’ at the top of your Python file and you are ready to use the regular expressions module.

If you are working in Visual Studio Code (if you’re not, you should be), you can press Ctrl+F to bring up the find prompt and then Alt+R to turn on regular expression mode and search your file. This is a good way to test your regexes and make sure they work properly, and it reminds me of using the browser console to test out JavaScript. I will be working with Anaconda in a Jupyter notebook, but all of the commands and regex patterns shown here work the same way in any Python environment.

Let’s create some searchable text and then start using regex so we can see what I’m writing about.

import module, initialize and assign search_text
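Here is a minimal sketch of that setup. The sample sentence is a made-up stand-in (an assumption, not necessarily the exact text used for this post), chosen so that it contains ‘ok’ twice, starts with ‘YO’, and includes a phone number:

```python
import re

# Hypothetical sample text for illustration only.
# It contains 'ok' twice, starts with 'YO', and includes a phone number.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."
```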

Once we import re, we can use the re.compile function. It creates a regular expression object with helper methods that let us access the data.

re.compile method
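As a small sketch, compiling the literal pattern ‘ok’ might look like this. The variable name find is just a label for the compiled pattern object:

```python
import re

# Compile the literal pattern 'ok' into a reusable pattern object.
find = re.compile('ok')

print(find)          # re.compile('ok')
print(find.pattern)  # ok
```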

Let’s call the finditer method on this regex object. It takes the text you want to search as its argument.

output of the matches from find.finditer(search_text)
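A sketch of that call, using the same made-up sample text (an assumption, so the span numbers only apply to this sentence). finditer returns an iterator of match objects, so it is turned into a list here to make it easy to loop over and inspect:

```python
import re

# Hypothetical sample text for illustration only.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."

find = re.compile('ok')

# finditer yields one re.Match object per non-overlapping match.
matches = list(find.finditer(search_text))
for match in matches:
    print(match)

# <re.Match object; span=(18, 20), match='ok'>
# <re.Match object; span=(31, 33), match='ok'>
```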

This scans the text we pass in and finds every non-overlapping match of the regex pattern. Each result shows match='ok' along with span=(start, end). The span gives us the start and end index of the match within the string. We can use it to access the data directly from the string:
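For example, a minimal sketch with the same made-up sample text (an assumption) that slices the original string using the span of the first match:

```python
import re

# Hypothetical sample text for illustration only.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."

find = re.compile('ok')

# Take the first match and unpack its span into start and end indexes.
first = next(find.finditer(search_text))
start, end = first.span()

# Slicing the original string with the span gives back the matched text.
print(search_text[start:end])  # ok
```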

We pull ‘ok’ straight out of the string, and it is the first ‘ok’ in our search_text. The pattern we compiled contains no special characters, so it looks for exactly the text we passed to re.compile. What happens when we want to match only at the beginning of the string, at a word boundary? Let’s see.

Looking for literally ‘\b^YO’
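A sketch of that attempt, again with the made-up sample text (an assumption). Note that the pattern is written as an ordinary string, without any prefix:

```python
import re

# Hypothetical sample text for illustration only.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."

# Ordinary (non-raw) string: Python turns '\b' into a backspace
# character before the re module ever sees the pattern.
find = re.compile('\b^YO')

matches = list(find.finditer(search_text))
for match in matches:
    print(match)  # never runs, because there are no matches
```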

As you see, nothing prints! Curious that regex picked up nothing. What is happening is that Python processes the escape sequences in an ordinary string before re ever sees it, so ‘\b’ becomes a backspace character, and the pattern ends up looking for a backspace that is nowhere in our search_text. For \b to be treated as a word boundary, we need to write the pattern as a raw string by putting an r in front of the opening quote. Then the backslash reaches re untouched, and re treats \b as a word boundary and ^ as the start of the string. Let’s try again with a raw string:

re.compile passed a raw string (note the r prefix)
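A sketch of the raw-string version, using the same made-up sample text (an assumption):

```python
import re

# Hypothetical sample text for illustration only.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."

# The r prefix keeps the backslash intact, so re receives \b
# (word boundary) followed by ^ (start of string) and then YO.
find = re.compile(r'\b^YO')

matches = list(find.finditer(search_text))
for match in matches:
    print(match)

# <re.Match object; span=(0, 2), match='YO'>
```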

We get back a re.Match object covering index 0 through 2 of our search string, with match='YO'! Now that we are matching, let’s use regex to find something useful. Forms commonly ask for phone numbers and emails, and regex is well suited to identifying these patterns. What is common to all phone numbers and emails?

U.S. phone numbers are commonly written as 3 digits followed by a dash, then 3 digits and another dash, and then 4 digits. There are a few ways to match this. Let’s google a regex character cheat sheet and look for what will match it.

Hopefully you found something that looks like this:

r'\d{3}[-.]?\d{3}-?\d{4}'

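Above we have ‘\d{3}’, which looks for 3 digits in a row. Next we have ‘[-.]?’. The brackets define a character class, and most characters inside them lose their special meaning, so this matches either a literal - or a literal . character. The ? after it makes that separator optional, so the pattern continues on to the next part whether or not a separator is present. The ‘-?’ further along does the same for a single optional dash. Let’s test it out: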

We were able to identify our phone number
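A sketch of that test, with the made-up sample text (an assumption, so the phone number and span are only illustrative):

```python
import re

# Hypothetical sample text for illustration only.
search_text = "YO, is everything ok? It seems ok to me. Call 555-123-4567."

# 3 digits, optional '-' or '.', 3 digits, optional '-', 4 digits.
find = re.compile(r'\d{3}[-.]?\d{3}-?\d{4}')

matches = list(find.finditer(search_text))
for match in matches:
    print(match)

# <re.Match object; span=(46, 58), match='555-123-4567'>
```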

This can be used for validation, to check that your users are typing in well-formed telephone numbers.
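One possible way to sketch such a check is with fullmatch, which only succeeds when the entire input fits the pattern. The helper name is_phone_number and the sample inputs are hypothetical, and the check only confirms the format, not that the number actually exists:

```python
import re

PHONE_PATTERN = re.compile(r'\d{3}[-.]?\d{3}-?\d{4}')

def is_phone_number(text):
    """Return True only if the whole string fits the phone pattern."""
    # fullmatch rejects input with extra characters before or after.
    return PHONE_PATTERN.fullmatch(text) is not None

print(is_phone_number('555-123-4567'))       # True
print(is_phone_number('555.1234567'))        # True
print(is_phone_number('555-123-456'))        # False
print(is_phone_number('call 555-123-4567'))  # False
```

Using finditer or search here would accept the last input, because those methods are happy to find the number anywhere inside a longer string.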