If you’ve dealt with text-based data before, you are probably no stranger to how miserable a messy dataset can make your life. Most of the world’s data comes in unstructured form, an ugly truth that everyone learns sooner or later. In this post, we will talk about what RegEx (regular expression) is, what you can do with RegEx, and some specific examples with a free RegEx tool.
What Is a Regular Expression (RegEx)
“A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. ‘find and replace’-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text-processing utilities ed, an editor, and grep, a filter.” This excerpt from Wikipedia defines a regular expression. Here, ed is a line editor for the Unix operating system, grep is a command-line utility for searching plain-text data sets for lines matching a regular expression, and a filter is a program or subroutine that processes a stream and produces another stream.
As obscure as it sounds, the concept is actually quite easy to understand. Say you want to find a certain movie on Netflix. You’d probably search with the title of the movie, or even part of the title. Netflix’s search engine would then look for any movie whose title matches what you typed into the search box and show you a list of matching results. Regular expressions are like the words you used to search for the movie you want to find.
Essentially, regular expressions are text patterns that you can use to match elements or replace elements throughout strings of text. RegEx can be more powerful than you think because of how incredibly flexible it is for cleansing text-based data.
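To make "match or replace" concrete, here is a minimal Python sketch using the standard `re` module. The sample sentence and both patterns are made up for illustration and do not come from any particular dataset.

```python
import re

# Hypothetical sample text for illustration.
text = "Order #1234 shipped on 2021-03-05; order #5678 shipped on 2021-03-09."

# Match: collect every run of digits that follows a '#'.
order_ids = re.findall(r"#(\d+)", text)

# Replace: rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY using the captured parts.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(order_ids)
print(reformatted)
```

The same two operations, finding and substituting, cover most day-to-day data cleaning tasks.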
What You Can Do with RegEx
In short, regular expressions can be used to match HTML tags and extract the data in HTML documents.
Common RegEx Use Cases
Regular expressions are really helpful for matching common patterns of text, such as emails, phone numbers, zip codes, etc.
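As a rough illustration, the Python sketch below matches US-style zip codes and phone numbers. Both patterns are simplified assumptions; real-world formats vary far more than this, so treat them as starting points rather than validators.

```python
import re

# Simplified, illustrative patterns (assumptions, not complete validators).
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")           # 12345 or 12345-6789
US_PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")  # 555-123-4567, 555.123.4567

sample = "Call 555-123-4567 or write to Springfield, IL 62704-1234."

zip_codes = ZIP_CODE.findall(sample)
phones = US_PHONE.findall(sample)

print(zip_codes)
print(phones)
```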
HTML is essentially made up of strings, and what makes regular expressions so powerful is that a single expression can match many different strings. Admittedly, using regular expressions to parse HTML often leads to mistakes, such as missing closing tags or mismatching tags. Programmers are more likely to use dedicated HTML parsers such as PHPQuery, BeautifulSoup, or html5lib-Python. However, if you want to quickly match HTML tags, regular expressions are an incredibly convenient tool for identifying patterns in HTML documents. Anyone who wants to extract web data is strongly encouraged to learn regular expressions, because they can greatly improve work efficiency and productivity.
Let’s look at a few examples of regular expressions to match HTML tags.
Regular expression to match :
We can match a variety of HTML tags by using such a regular expression and therefore easily extract data in HTML documents.
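For instance, a pattern for anchor tags can pull out link targets and link text. The Python sketch below assumes simple, well-formed markup with double-quoted href values; the HTML snippet and the `LINK` name are hypothetical, and a real parser remains the safer choice for messy pages.

```python
import re

# Hypothetical HTML fragment for illustration.
html = (
    '<ul><li><a href="/a.html">First</a></li>'
    '<li><a class="x" href="/b.html">Second</a></li></ul>'
)

# <a ... href="URL" ...>TEXT</a>
# [^>]* skips any other attributes; (.*?) lazily captures the link text.
LINK = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

links = LINK.findall(html)
for url, label in links:
    print(url, label)
```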
You can also check this Regular Expressions Cheat Sheet to have a quick reference for RegEx.
Also, here are some popular online RegEx testing and debugging tools to help generate or verify the right expressions:
If you need to scrape and reformat web data at the same time, download Octoparse, a free RegEx tool that’s ready to use. Just open the software and click on the “Tools” icon on the sidebar menu.
Free RegEx Tool – Octoparse
With Octoparse, the best web scraping tool, you can use RegEx to match or replace characters in a field value and refine the extracted data directly.
The Octoparse RegEx tool is a built-in tool that offers a handy way to generate regular expressions automatically by setting up various criteria. It is especially helpful if you know little about regular expression syntax.
Case 2: Write RegEx to extract specific info (like email, websites, etc)
If you want to extract emails from the source code (especially for some URLs sharing different structures), you can use the RegEx below directly to match the email. You can test and debug your own regular expressions right away with the tool.
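A commonly used, deliberately loose email pattern is sketched below in Python. It is an assumption for illustration, not the expression built into any particular tool, and it will not cover every address that the email standards allow.

```python
import re

# Loose, illustrative email pattern: local part, '@', domain, dot, top-level domain.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical page source for illustration.
source = (
    '<p>Contact <a href="mailto:sales@example.com">sales@example.com</a> '
    "or support@example.org.</p>"
)

# A set removes the duplicate that appears in both the href and the link text.
emails = sorted(set(EMAIL.findall(source)))
print(emails)
```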
Regular expressions, ever versatile, will help us locate HTML tags in a string today.
Pattern matching HTML strings serves at least one crucial function in web dev: sanitizing user input. Allowing user-submitted strings opens one’s application to significant vulnerability. Suppose, for example, that some ne’er-do-well on the internet submitted a comment that includes a <script> tag. Regular expressions allow us to match HTML tags in a string, because HTML tags conform to a certain pattern:
begin and end with brackets (<>)
contain a string name consisting of one or more lowercase letters, like p, a, div, strong, script
contain zero or more attributes, such as class="btn", src="https://gist.github.com/steal_your_data.js", or href="https://github.com/gavin-asay"
either be accompanied by a closing tag in brackets with a slash and its tag name, e.g., </p>
or be a self-closing tag, which has one or more whitespace characters, then a slash before the closing bracket (>).
So, to pick out an HTML tag, we write a regex that can account for these various possibilities. Consider this regex:
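Pieced together from the segments discussed in the walkthrough that follows, the pattern would look roughly like `/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/`. Treat that as a reconstruction rather than a verbatim copy. The sketch below expresses it in Python, where no surrounding slashes are needed and `/` does not have to be escaped; the `TAG` and `describe` names are hypothetical.

```python
import re

# Reconstructed from the walkthrough (an assumption, not a verbatim copy).
TAG = re.compile(r"^<([a-z]+)([^>]+)*(?:>(.*)</\1>|\s+/>)$")

def describe(s):
    """Return the three capturing groups if `s` is a single tag, else None."""
    m = TAG.match(s)
    return None if m is None else m.groups()

print(describe('<div class="btn">hi</div>'))
print(describe("<br />"))
print(describe("<div>hi</span>"))
```

Keep test inputs short: a group like `([^>]+)*` nests one quantifier inside another, which can become slow on long strings that fail to match.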
If that looks like gibberish, that’s because a regex often does at first glance. It takes some time to break down a lengthy regex and make sense of its pattern. Let’s break this regex down piece by piece.
In JavaScript and several other languages, a regex literal is enclosed in forward slashes. The language recognizes this syntax as denoting a regular expression.
When you see a caret ^ at the beginning of the regex, it means the beginning of the string we’re comparing. Thus, only an HTML tag found immediately at the start of our string will fit the pattern. (Note that we also have a character that matches the end of the string, which we’ll discuss later.)
Square brackets [] mark a character class. Any character within the brackets will match the pattern. In this case, we match any lowercase letter from a to z. Note that for letters, regex is case sensitive. If we wanted to match capital letters as well, our character class would be [A-Za-z]. If we only wanted to match a handful of characters, we could use [abc123] to match only lowercase a, b, c, or the digits, 1, 2, and 3.
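A quick Python sketch of the three character classes just mentioned, run against made-up sample strings:

```python
import re

# Each findall returns every single character that belongs to the class.
lower = "".join(re.findall(r"[a-z]", "Hello World 42"))       # lowercase letters only
letters = "".join(re.findall(r"[A-Za-z]", "Hello World 42"))  # both cases
handful = "".join(re.findall(r"[abc123]", "a1b2c3d4"))        # only the listed characters

print(lower)
print(letters)
print(handful)
```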
The plus sign + is a quantifier. It describes how many times the previous character class can be repeated. Plus means one or more times. That means we must have at least one character that matches [a-z], but two or any quantity beyond that will also match. Other quantifiers include the asterisk *, meaning zero or more times (essentially making the character class optional), and the question mark ?, meaning zero or one time.
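The difference between the three quantifiers is easiest to see on empty and very short inputs. The following Python sketch uses a hypothetical helper, `fully_matches`, that checks whether an entire string fits a pattern.

```python
import re

def fully_matches(pattern, text):
    """True if the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None

# + needs at least one character; * and ? also accept the empty string.
print(fully_matches(r"[a-z]+", "div"), fully_matches(r"[a-z]+", ""))
print(fully_matches(r"[a-z]*", "div"), fully_matches(r"[a-z]*", ""))
# ? allows at most one character.
print(fully_matches(r"[a-z]?", "a"), fully_matches(r"[a-z]?", "ab"))
```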
Finally, you’ll notice that this segment is enclosed in parentheses ( ). Parentheses mark a capturing group. This means that the regex will remember the segment of the pattern matching everything inside those parentheses. We can refer back to this capturing group later. JavaScript will also keep track of the contents of this capturing group.
And what about the first capturing group? That’s all of the letters, so a, div, or p would be the capturing group. That’s our tag name, which we’re keeping track of now.
You’ll notice that we’re isolating a second capturing group.
Last time we saw a caret ^, it denoted the start of the string. Within a character class, however, ^ has a different meaning: to exclude a character from the class. We’re excluding > here, and that’s the only definition of this class. If a character class only describes exclusions, then any character EXCEPT the excluded characters will match. Any character that isn’t >, including letters, digits, symbols, and whitespace, matches this character class.
As before, + matches one or more non-> characters.
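A negated class is handy for "everything up to the next delimiter". The Python sketch below, using made-up input strings, shows `[^>]+` stopping right before the first `>`.

```python
import re

# [^>]+ consumes characters until it reaches a '>'.
rest = re.match(r"[^>]+", ' class="btn" id="go">Click')
before_bracket = rest.group(0)

# The same idea with a different exclusion list: every non-vowel in "regex".
non_vowels = re.findall(r"[^aeiou]", "regex")

print(repr(before_bracket))
print(non_vowels)
```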
Like we mentioned above, the asterisk * matches zero or more times. Thus, our second capturing group ([^>]+)* is optional and will include any collection of one or more non-> characters. What is this very flexible pattern looking for? Anything that comes after the tag name and before the closing bracket >. That includes the tag’s attributes: anything like classes or ids, href, src, or flags like selected or disabled.
Take a tag such as <option value="United States" selected>. The first capturing group ([a-z]+) grabs the tag name (option) and remembers it for later. The second capturing group ([^>]+)* matches all of the attributes and flags ( value="United States" selected). That’s stored as well.
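To see both groups at work, the Python sketch below applies just the first part of the pattern to an opening option tag built from the attributes quoted above. The input string is a hypothetical example.

```python
import re

# Only the opening part of the pattern: start of string, '<', tag name, attributes.
opening = re.match(r"^<([a-z]+)([^>]+)*", '<option value="United States" selected>')

tag_name = opening.group(1)    # contents of the first capturing group
attributes = opening.group(2)  # contents of the second capturing group

print(repr(tag_name))
print(repr(attributes))
```

Note that the second group keeps the leading space, because a space is also a non-> character.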
Here we have another group that begins with ?:. These characters ?: denote a non-capturing group. A string must match everything inside a non-capturing group, but this group will not be remembered later. You’ll notice that there are capturing groups within this non-capturing group. It’s those sub-groups that we’ll be more concerned with.
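The effect of ?: shows up in what the regex engine hands back afterwards. In this small Python sketch with a made-up pattern, both expressions accept the same text, but only the capturing version remembers the repeated part.

```python
import re

# (ab)+ is remembered as group 1; (?:ab)+ is matched but not stored.
capturing = re.match(r"(ab)+(c)", "ababc").groups()
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

print(capturing)
print(non_capturing)
```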
The first character matched in this segment is >, signifying the end of the HTML tag. But why does the end of the tag appear in the middle of the regex?
Next is the third capturing group (.*). The period . matches any character (by default, any character except a line break). So, following the complete HTML tag, the third capturing group matches any string, or no string at all.
What is \/ supposed to be? Programmers will recognize the backslash \ as escaping the following character. To match a forward slash /, we need to escape it. This is because / is a functional character in a regex literal, marking the beginning and end of the pattern.
What about \1? We don’t need to escape digits, do we? An escaped digit is a backreference to the contents of a capturing group. Capturing group 1 matched the tag name. This doesn’t simply repeat the pattern of capturing group 1; it matches the exact same text that capturing group 1 found. Thus, if the tag name was div, \1 must also match div; it can’t match span or any other tag name.
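The distinction between repeating a pattern and repeating the matched text is easy to check. The Python sketch below compares a backreference with a second copy of `[a-z]+`; both pattern names are hypothetical.

```python
import re

# \1 must equal whatever group 1 captured; a second [a-z]+ accepts any tag name.
WITH_BACKREF = re.compile(r"<([a-z]+)>(.*)</\1>")
WITHOUT_BACKREF = re.compile(r"<([a-z]+)>(.*)</[a-z]+>")

paired = WITH_BACKREF.fullmatch("<div>text</div>") is not None
mismatched_strict = WITH_BACKREF.fullmatch("<div>text</span>") is not None
mismatched_loose = WITHOUT_BACKREF.fullmatch("<div>text</span>") is not None

print(paired, mismatched_strict, mismatched_loose)
```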
Putting this segment together, we match >(.*)<\/\1>. You’ve likely caught on that this segment finds the closing tag that pairs with the opening tag we found previously. (.*) allows for any text that comes in between them. That means it can match any text or enclosed tags!
The pipe | separates alternate patterns. >(.*)<\/\1> is a valid pattern, but what follows | can match in its place.
An escaped letter is a shorthand for a commonly used character class. Here, \s matches any whitespace character: a space, a tab, or a newline character. Other useful classes include \w (any word character, [a-zA-Z0-9_]) and \d (any digit, [0-9]).
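Here is a short Python sketch of the three shorthand classes on made-up strings. One caveat: for Python text strings, \w and \d also match non-ASCII letters and digits unless the `re.ASCII` flag is set, so the bracketed equivalents above describe the ASCII case.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")          # runs of digits
words = re.findall(r"\w+", "snake_case and kebab-case")   # runs of word characters
pieces = re.split(r"\s+", "a  b\tc\nd")                   # split on any whitespace run

print(digits)
print(words)
print(pieces)
```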
Altogether, this alternate pattern matches one or more whitespace characters, then /, then >. The alternate to a separate closing tag is, naturally, the /> found in self-closing tags like <br /> or <hr />.
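The alternation can be tried on its own with a cut-down pattern. The Python sketch below is a simplification for illustration (the `ENDING` name and the plain `>` first alternative are assumptions), and it shows one consequence of requiring `\s+`: a self-closing tag written without a space is not accepted.

```python
import re

# Tag name followed by either a plain '>' or whitespace plus '/>'.
ENDING = re.compile(r"^<[a-z]+(?:>|\s+/>)$")

plain = ENDING.match("<br>") is not None
spaced = ENDING.match("<br />") is not None
unspaced = ENDING.match("<br/>") is not None

print(plain, spaced, unspaced)
```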
Finally, the dollar sign $ matches the end of the string. Then / closes out the regex pattern.
Given a string str, the task is to check whether it is a valid HTML tag or not by using a regular expression. A valid HTML tag must satisfy the following conditions:
It should start with an opening bracket (<).
It may be followed by double-quoted or single-quoted strings.
It should not contain an unpaired double quote, an unpaired single quote, or a closing bracket (>) outside of a quoted string.
It should end with a closing bracket (>).
Input: str = "<input value = '>'>"
Output: true
Explanation: The given string satisfies all the above mentioned conditions.
Input: str = "<br/>"
Output: true
Explanation: The given string satisfies all the above mentioned conditions.
Input: str = "br/>"
Output: false
Explanation: The given string doesn’t start with an opening bracket "<". Therefore, it is not a valid HTML tag.
Input: str = "<'br/>"
Output: false
Explanation: The given string has one unpaired single quote, which is not allowed. Therefore, it is not a valid HTML tag.
Input: str = "<input value => >"
Output: false
Explanation: The given string has a closing bracket (>) that is not enclosed in single or double quotes, which is not allowed. Therefore, it is not a valid HTML tag.
Approach: The idea is to use Regular Expression to solve this problem. The following steps can be followed to compute the answer.
Get the String.
Create a regular expression to check for a valid HTML tag, as mentioned below: <("[^"]*"|'[^']*'|[^'">])*>
< represents that the string should start with an opening bracket (<).
( represents the start of the group.
"[^"]*" represents that the string may contain a double-quoted string.
| represents or.
'[^']*' represents that the string may contain a single-quoted string.
| represents or.
[^'">] represents any single character other than a single quote, a double quote, or >.
) represents the end of the group.
* represents 0 or more repetitions of the group.
> represents that the string should end with a closing bracket (>).
Below is the implementation of the above approach:
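A minimal Python sketch of this approach follows. It assumes the regex is assembled from the pieces listed above and that the whole string must be a single tag (hence `re.fullmatch`). The function name `is_valid_html_tag` and the sample inputs are hypothetical.

```python
import re

# '<', then zero or more of:
#   a double-quoted string | a single-quoted string | any char except ' " >
# and finally '>'.
HTML_TAG = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")

def is_valid_html_tag(s):
    """Return True if the whole string is one tag by the rules above."""
    if not s:
        return False
    return HTML_TAG.fullmatch(s) is not None

# Illustrative inputs covering each rule.
for sample in ["<input value = '>'>", "<br/>", "br/>", "<'br/>", "<input value => >"]:
    print(repr(sample), is_valid_html_tag(sample))
```

Note that this only checks the bracket and quote structure. It does not verify that the tag name or attributes are real HTML.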