Any html tag regex
How to validate HTML tag using Regular Expression
Given a string str, the task is to check whether it is a valid HTML tag by using a regular expression.
The valid HTML tag must satisfy the following conditions:
- It should start with an opening tag ( <).
- It may be followed by any number of double-quoted strings, single-quoted strings, or other characters.
- It should not contain an unmatched double quote, an unmatched single quote, or a closing tag (>) outside a single- or double-quoted string.
- It should end with a closing tag (>).
Input: str = "<input value = '>'>";
Output: true
Explanation: The given string satisfies all the above mentioned conditions.
Input: str = "<br/>";
Output: true
Explanation: The given string satisfies all the above mentioned conditions.
Input: str = "br/>";
Output: false
Explanation: The given string doesn’t start with an opening tag (<). Therefore, it is not a valid HTML tag.
Input: str = "<'br/>";
Output: false
Explanation: The given string has one unmatched single quote, which is not allowed. Therefore, it is not a valid HTML tag.
Input: str = "<input value => >";
Output: false
Explanation: The given string has a closing tag (>) that is not enclosed in single or double quotes, which is not allowed. Therefore, it is not a valid HTML tag.
Approach: The idea is to use Regular Expression to solve this problem. The following steps can be followed to compute the answer.
- Get the String.
- Create a regular expression to check for a valid HTML tag as shown below:
  regex = <("[^"]*"|'[^']*'|[^'">])*>
- Where:
- < represents the string should start with an opening tag (<).
- ( represents the starting of the group.
- "[^"]*" represents the string should allow a double-quoted string.
- | represents or.
- '[^']*' represents the string should allow a single-quoted string.
- | represents or.
- [^'">] represents any single character other than a single quote, a double quote, or ">".
- ) represents the ending of the group.
- * represents 0 or more.
- > represents the string should end with a closing tag (>).
Below is the implementation of the above approach:
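The original listing did not survive here, so what follows is only a minimal JavaScript sketch of the approach, not the article's own code. The function name isValidHtmlTag is hypothetical, and the ^ and $ anchors are an added assumption so that the whole string, rather than a substring, has to match the pattern.

```javascript
// Pattern assembled from the pieces listed above:
// "<", then any number of double-quoted strings, single-quoted strings,
// or characters other than ', " and >, then ">".
// The ^ and $ anchors are an assumption added to force a whole-string match.
const HTML_TAG_PATTERN = /^<("[^"]*"|'[^']*'|[^'">])*>$/;

// isValidHtmlTag is a hypothetical helper name, not taken from the article.
function isValidHtmlTag(str) {
  // Anything that is not a non-empty string is treated as invalid.
  if (typeof str !== "string" || str.length === 0) {
    return false;
  }
  return HTML_TAG_PATTERN.test(str);
}

console.log(isValidHtmlTag("<input value = '>'>")); // true
console.log(isValidHtmlTag("<br/>"));               // true
console.log(isValidHtmlTag("br/>"));                // false
console.log(isValidHtmlTag("<'br/>"));              // false
console.log(isValidHtmlTag("<input value => >"));   // false
```

Note that this only checks the shape described by the conditions above; it does not check that the tag name or attributes are real HTML.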
gavin-asay / regex_html_tag.md
Regular expressions, ever versatile, will help us locate HTML tags in a string today.
Pattern matching HTML strings serves at least one crucial function in web dev: sanitizing user input. Allowing user-submitted strings opens one’s application to significant vulnerability. Suppose, for example, some ne’er-do-well on the internet submitted a comment that includes a <script> tag. Regular expressions allow us to match HTML tags in a string, because HTML tags conform to a certain pattern:
- begin and end with brackets (<>)
- contain a string name consisting of one or more lowercase letters, like p, a, div, strong, script
- contain zero or more attributes, such as class="btn", src="https://gist.github.com/steal_your_data.js", or href="https://github.com/gavin-asay"
- be accompanied by a closing tag in brackets with a slash and its tag name, e.g., </p> or </div>, or be a self-closing tag, which has one or more whitespace characters, then a slash before the closing bracket (>)
So, to pick out an HTML tag, we write a regex that can account for these various possibilities. Consider this regex:
/^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/
If that looks like gibberish, that’s because a regex often does at first glance. It takes some time to break down a lengthy regex and make sense of its pattern. Let’s break this regex down piece by piece.
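Before taking the pattern apart, it can help to see it run. The sketch below assumes the regex under discussion is the one the walkthrough describes piece by piece; the variable name tagPattern and the sample strings are illustrative only.

```javascript
// Assumed reconstruction of the pattern explained in this walkthrough.
const tagPattern = /^<([a-z]+)([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$/;

// An opening tag with an attribute, some content, and a matching closing tag.
console.log(tagPattern.test('<div class="btn">Click</div>')); // true

// The closing tag name must equal the opening tag name.
console.log(tagPattern.test("<p>Hello</span>")); // false

// A self-closing tag needs whitespace before the slash.
console.log(tagPattern.test("<br />")); // true
console.log(tagPattern.test("<br/>"));  // false

// The tag must start at the very beginning of the string.
console.log(tagPattern.test("Hello <p>x</p>")); // false
```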
In JavaScript, a regex literal is enclosed in forward slashes / /. The language recognizes this syntax as denoting a regular expression.
When you see a caret ^ at the beginning of the regex, it means the beginning of the string we’re comparing. Thus, only an HTML tag found immediately at the start of our string will fit the pattern. (Note that we also have a character that matches the end of the string, which we’ll discuss later.)
Square brackets [] mark a character class. Any single character listed within the brackets will match the pattern. In this case, [a-z] matches any lowercase letter from a to z. Note that for letters, regex is case sensitive. If we wanted to match capital letters as well, our character class would be [A-Za-z]. If we only wanted to match a handful of characters, we could use [abc123] to match only lowercase a, b, c, or the digits 1, 2, and 3.
The plus sign + is a quantifier. It describes how many times the previous character class can be repeated. Plus means one or more times. That means we must have at least one character that matches [a-z], but two or any quantity beyond that will also match. Other quantifiers include the asterisk *, meaning zero or more times (essentially making the character class optional), while a question mark ? means zero or one time.
Finally, you’ll notice that this segment, ([a-z]+), is enclosed in parentheses ( ). Parentheses mark a capturing group. This means that the regex will remember the part of the string that matched everything inside those parentheses. We can refer back to this capturing group later. JavaScript will also keep track of the contents of this capturing group.
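A short sketch of how the character classes and quantifiers above behave. The variable names and the ^ and $ anchors are added here purely for illustration, so that each test looks at the whole string.

```javascript
// One or more, zero or more, and zero or one lowercase letters.
const oneOrMore = /^[a-z]+$/;
const zeroOrMore = /^[a-z]*$/;
const zeroOrOne = /^[a-z]?$/;

console.log(oneOrMore.test("div"));  // true
console.log(oneOrMore.test(""));     // false: + needs at least one letter
console.log(oneOrMore.test("Div"));  // false: [a-z] is case sensitive
console.log(zeroOrMore.test(""));    // true: * allows zero letters
console.log(zeroOrOne.test("a"));    // true
console.log(zeroOrOne.test("ab"));   // false: ? allows at most one letter

// Widening or narrowing the class changes what is accepted.
console.log(/^[A-Za-z]+$/.test("Div"));    // true
console.log(/^[abc123]+$/.test("cab321")); // true
console.log(/^[abc123]+$/.test("d"));      // false
```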
And what about the first capturing group? It holds all of the letters of the tag name, so a, div, or p would be its contents. That’s our tag name, which we’re keeping track of now.
You’ll notice that we’re isolating a second capturing group: ([^>]+)*.
Last time we saw a caret ^, it denoted the start of the string. At the start of a character class, however, ^ has a different meaning: it excludes the listed characters from the class. We’re excluding > here, and that’s the only definition of this class. If a character class only describes exclusions, then any character EXCEPT the excluded characters will match. Any character that isn’t >, including letters, digits, symbols, and whitespace, matches this character class.
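As a small illustration of a negated character class, the sketch below checks strings against [^>]. The variable name notClosingBracket and the sample strings are hypothetical.

```javascript
// Matches a string made up entirely of characters other than ">".
const notClosingBracket = /^[^>]+$/;

console.log(notClosingBracket.test(' class="btn" id=main')); // true
console.log(notClosingBracket.test("a > b"));                 // false

// Without anchors, [^>]+ grabs everything up to the first ">".
console.log("<p>text".match(/[^>]+/)[0]); // "<p"
```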
As before, + matches one or more non-> characters.
As we mentioned above, the asterisk * matches zero or more times. Thus, our second capturing group ([^>]+)* is optional and will include any run of one or more non-> characters. What is this very flexible pattern looking for? Anything that comes after the tag name and before the closing bracket >. That includes the tag’s attributes: anything like classes or ids, href, src, or flags like selected or disabled.
Take <option value="United States" selected> as an example. The first capturing group ([a-z]+) grabs the tag name (option) and remembers it for later. The second capturing group ([^>]+)* matches all of the attributes and flags ( value="United States" selected). That’s stored as well.
Here we have another group, (?:>(.*)<\/\1>|\s+\/>), that begins with ?:. These characters ?: denote a non-capturing group. A string must match everything inside a non-capturing group, but this group will not be remembered later. You’ll notice that there are capturing groups within this non-capturing group. It’s those subgroups that we’ll be more concerned with.
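To see what the two capturing groups hold, here is a sketch that applies only the first part of the pattern to an opening tag. The shortened pattern openingTag and the variable names are assumptions made for the example.

```javascript
// Only the opening-tag portion of the pattern: name, then optional attributes.
const openingTag = /^<([a-z]+)([^>]+)*>/;

const withAttributes = '<option value="United States" selected>'.match(openingTag);
console.log(withAttributes[1]); // "option"
console.log(withAttributes[2]); // ' value="United States" selected'

// With no attributes, the optional second group does not participate,
// so JavaScript reports it as undefined.
const bare = "<p>".match(openingTag);
console.log(bare[1]); // "p"
console.log(bare[2]); // undefined
```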
The first character matched in this segment is >, signifying the end of the HTML tag. But why does the end of the tag appear in the middle of the regex?
Next is the third capturing group (.*). The period . matches any character except a line break. So, following the complete opening tag, the third capturing group matches any string, or no string at all.
What is \/ supposed to be? Programmers will recognize the backslash \ as escaping the following character. To match a forward slash /, we need to escape it. This is because / is a functional character in a regex literal, marking the beginning and end of the pattern.
What about \1? We don’t need to escape digits, do we? An escaped digit is a backreference to the contents of a capturing group. Capturing group 1 matched the tag name. This doesn’t simply repeat the pattern of capturing group 1; it matches the exact same text that capturing group 1 found. Thus, if the tag name was div, \1 must also match div; it can’t match span or any other tag name.
Putting this segment together, we match >(.*)<\/\1>. You’ve likely caught on that this segment finds the closing tag that pairs with the opening tag we found previously. (.*) allows for any text that comes in between them. That means it can match any text or enclosed tags!
The pipe | separates alternate patterns. >(.*)<\/\1> is a valid pattern, but what follows | can match instead of >(.*)<\/\1>.
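The backreference is easier to see in a pared-down pattern. The sketch below drops the attribute group and the self-closing alternative, so pairedTag is a simplified illustration rather than the full regex.

```javascript
// \1 must repeat exactly the text captured by the first group.
const pairedTag = /^<([a-z]+)>(.*)<\/\1>$/;

console.log(pairedTag.test("<div>hi</div>"));  // true
console.log(pairedTag.test("<div>hi</span>")); // false

// (.*) happily swallows nested tags between the outer pair.
const nested = "<div><p>nested</p></div>".match(pairedTag);
console.log(nested[1]); // "div"
console.log(nested[2]); // "<p>nested</p>"
```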
An escaped letter is a shorthand for a commonly used character class. Here, \s matches any whitespace character, such as a space, tab, or newline character. Other useful classes include \w (any word character [a-zA-Z0-9_]) and \d (any digit [0-9]).
Altogether, this alternate pattern matches one or more whitespace characters, then /, then >. The alternative to a separate closing tag is, naturally, the /> found in self-closing tags like <br /> or <img />.
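A quick sketch of the shorthand classes in use. The sample strings are made up for the example.

```javascript
console.log(/\s/.test("\t"));            // true: a tab is whitespace
console.log(/^\w+$/.test("tag_name1"));  // true: letters, digits, underscore
console.log(/^\w+$/.test("tag-name"));   // false: "-" is not a word character
console.log(/^\d+$/.test("2024"));       // true
console.log(/^\d+$/.test("20a4"));       // false

// \s+ collapses any run of whitespace into a single space.
console.log("a \t\n b".replace(/\s+/g, " ")); // "a b"
```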
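The alternation inside a non-capturing group can be sketched with a much smaller pattern. Here tagEnd is a hypothetical pattern that accepts either a plain closing bracket or the whitespace-slash-bracket ending of a self-closing tag.

```javascript
// (?: ... ) groups the two alternatives without capturing them.
const tagEnd = /^<br(?:>|\s+\/>)$/;

console.log(tagEnd.test("<br>"));   // true: first alternative
console.log(tagEnd.test("<br />")); // true: second alternative
console.log(tagEnd.test("<br/>"));  // false: \s+ needs at least one space

// A non-capturing group adds nothing to the match array.
console.log("<br />".match(tagEnd).length); // 1
```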
Finally, the dollar sign $ matches the end of the string. Then / closes out the regex pattern.
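The effect of the two anchors can be checked in isolation. This is a minimal sketch with made-up sample strings.

```javascript
// Without anchors, the pattern may match anywhere in the string.
console.log(/<p>/.test("Hello <p>"));  // true

// ^ pins the match to the start of the string.
console.log(/^<p>/.test("Hello <p>")); // false

// ^ and $ together require the whole string to fit the pattern.
console.log(/^<p>$/.test("<p>"));          // true
console.log(/^<p>$/.test("<p> trailing")); // false
```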