
Remove HTML Tags From String in Python

While collecting data, we often need to process text that contains HTML tags. In this article, we will discuss different ways to remove HTML tags from a string in Python.

Remove HTML tags from string in Python Using Regular Expressions

Regular expressions are a convenient way to process text data. We can also remove HTML tags from a string in Python using regular expressions. For this, we can use the sub() function defined in the re module.

The sub() function takes the pattern of the sub-string that needs to be replaced as its first argument, the replacement string as its second argument, and the original string as its third argument.


After execution, it returns the modified string by replacing all the occurrences of the substring given as the first input argument with the substring given as the second input argument in the original string.
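As a quick illustration of the argument order described above, here is a minimal sketch that uses re.sub() on a string without any HTML. The sample text and the variable names are made up for the example.

```python
import re

text = "Order 66 shipped 3 items on day 12."

# re.sub(pattern, replacement, string): every match of the pattern
# is replaced; the original string itself is left unchanged.
masked = re.sub(r"\d+", "#", text)
print(masked)

# If the pattern never matches, the string comes back as is.
unchanged = re.sub(r"\d+", "#", "no digits here")
print(unchanged)
```

The same call shape is used for tag removal: only the pattern and the replacement change.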

To remove HTML tags from a string in Python using the sub() function, we will first define a pattern that represents all the HTML tags. For this, we will create a pattern that matches all the characters inside an HTML tag, that is, everything from an opening < to the next closing >. A commonly used pattern for this is <.*?>.

After creating the pattern, we will substitute each substring matching the pattern with an empty string "" using the sub() function. In this way, we can remove the HTML tags from any given string in Python.

Following is the source code to remove HTML tags from string in python using the sub() method.

I am a sentence inside an HTML string.

I am just another sentence written by Aditya.
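The approach described above can be sketched as follows. This is a minimal example under stated assumptions: the sample markup, the function name remove_html_tags, and the variable names are illustrative, and the pattern <.*?> is the common non-greedy tag pattern rather than anything more robust.

```python
import re

def remove_html_tags(html_text):
    # "<", then as few characters as possible, then ">"
    return re.sub(r"<.*?>", "", html_text)

# Hypothetical sample input with two paragraphs
sample = (
    "<p>I am a sentence inside an HTML string.</p>\n"
    "<p>I am just another sentence written by Aditya.</p>"
)

cleaned = remove_html_tags(sample)
print(cleaned)
```

A pattern like this only strips the markup. It does not decode entities such as &amp; and it can be confused by a > inside an attribute value, so a real parser is usually the safer choice for untrusted HTML.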


Remove HTML tags from string in Python Using the lxml Module

Instead of using regular expressions, we can also use the lxml module to remove HTML tags from a string in Python. For this, we will first parse the original string using the fromstring() function from lxml.html.

The fromstring() function takes the original string as input and returns the root element of the parsed document. We can then extract the text by calling the text_content() method on that element, leaving behind the HTML tags. The text_content() method returns an object of lxml.etree._ElementUnicodeResult data type. Therefore, we need to convert the output to a string using the str() function.

You can observe this in the following example.

I am a sentence inside an HTML string.

I am just another sentence written by Aditya.

Remove HTML tags from string in Python Using the BeautifulSoup Module

Like the lxml module, the BeautifulSoup module also provides us with various functions to process text data. To remove HTML tags from a string using the BeautifulSoup module, we can use the BeautifulSoup() constructor and the get_text() method.

In this approach, we first parse the string that contains HTML tags by calling BeautifulSoup(). It takes the original string as its first argument and, optionally, the name of the parser to use (for example 'html.parser') as its second argument. It returns a BeautifulSoup object representing the parsed document. We can invoke the get_text() method on this object to get the output string.

The following program demonstrates how to remove HTML tags from string in python using the BeautifulSoup module.


Last updated: Feb 19, 2023

# Remove the HTML tags from a String in Python

Use the re.sub() method to remove the HTML tags from a string.

The re.sub() method will remove all of the HTML tags in the string by replacing them with empty strings.

The code sample uses a regular expression to strip the HTML tags from a string.

The re.sub method returns a new string that is obtained by replacing the occurrences of the pattern with the provided replacement.

If the pattern isn’t found, the string is returned as is.

The angle brackets < and > match the opening and closing characters of an HTML tag.

The dot . matches any character except a newline character.

Adding a question mark ? after the quantifier makes it perform a non-greedy or minimal match.

For example, when matched against the string <a> b <c>, the regular expression <.*?> will match only <a>, whereas the greedy <.*> would match the entire string.

In its entirety, the regular expression matches all opening and closing HTML tags.
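To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample strings are made up for the example. It also shows the consequence of the dot not matching newlines: a tag that spans two lines is only removed when the re.DOTALL flag is passed.

```python
import re

sample = "<a> b <c>"

# Greedy: .* runs to the last ">" in the string.
greedy_matches = re.findall(r"<.*>", sample)
# Non-greedy: .*? stops at the first ">" after each "<".
lazy_matches = re.findall(r"<.*?>", sample)
print(greedy_matches)
print(lazy_matches)

# A tag that spans two lines. By default "." does not match "\n",
# so the opening tag is left in place unless re.DOTALL is used.
multiline = '<a\nhref="#">link</a>'
without_dotall = re.sub(r"<.*?>", "", multiline)
with_dotall = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)
print(repr(without_dotall))
print(repr(with_dotall))
```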

# Remove the HTML tags from a String using xml.etree.ElementTree

You can also use the xml.etree.ElementTree module to strip the HTML tags from a string.

The fromstring method parses an XML section from a string constant and returns an Element instance.

The itertext method creates a text iterator that we can join with the str.join() method.
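A minimal sketch of this approach is shown below. The function name remove_html_tags and the sample markup are assumptions made for the example. Keep in mind that xml.etree.ElementTree is an XML parser, so the input has to be well-formed: a single root element and every tag closed.

```python
import xml.etree.ElementTree as ET

def remove_html_tags(markup):
    # Parse the string into an Element, then join every text
    # fragment (element text and tails) in document order.
    root = ET.fromstring(markup)
    return "".join(root.itertext())

sample = "<div><p>First line.</p><p>Second <b>bold</b> line.</p></div>"
plain = remove_html_tags(sample)
print(plain)

# Typical HTML that is not well-formed XML (an unclosed <br>)
# is rejected with a ParseError.
try:
    remove_html_tags("<p>line one<br>line two</p>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)
```

Because of this strictness, the approach fits best when the markup is known to be XHTML-like. For arbitrary HTML, a lenient parser such as html.parser, lxml, or BeautifulSoup is a better fit.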

# Remove the HTML tags from a String using lxml

You can also use the lxml module to strip the HTML tags from a string.

Make sure you have the module installed by running the following command.

pip install lxml

# or with pip3
pip3 install lxml

Now you can import and use the lxml module to strip the HTML tags from the string.

The text_content method returns the text content of the parsed element with all markup removed.

# Remove the HTML tags from a String using BeautifulSoup

You can also use the BeautifulSoup4 module to remove the HTML tags from a string.

Make sure you have the module installed to be able to run the code sample.

pip install lxml
pip install beautifulsoup4

# or with pip3
pip3 install lxml
pip3 install beautifulsoup4

Now you can import and use the BeautifulSoup module to strip the HTML tags from the string.

The text attribute on the BeautifulSoup object returns the text content of the string, excluding the HTML tags.

# Remove the HTML tags from a String using HTMLParser in Python

This is a four-step process:

  1. Extend from the HTMLParser class from the html.parser module.
  2. Implement the handle_data method to get the data between the HTML tags.
  3. Store the data in a list on the class instance.
  4. Call the get_data() method on an instance of the class.

The remove_html_tags function takes a string and strips the HTML tags from the supplied string.

We extended from the HTMLParser class. The code snippet is very similar to the one used internally by the Django framework.

The HTMLParser class is used to find tags and other markup and call handler functions.

The data between the HTML tags is passed from the parser to the derived class by calling self.handle_data() .

When convert_charrefs is set to True , character references automatically get converted to the corresponding Unicode character.

If convert_charrefs is set to False , character references are passed by calling the self.handle_entityref() or self.handle_charref() methods.
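The effect of convert_charrefs can be seen in a small sketch. The class name CharrefLogger, the helper parse, and the sample string are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

class CharrefLogger(HTMLParser):
    """Records text data and any character references it is handed."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.data = []
        self.refs = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_entityref(self, name):
        # Named reference such as &amp; (only called when convert_charrefs=False)
        self.refs.append(name)

    def handle_charref(self, name):
        # Numeric reference such as &#62; (only called when convert_charrefs=False)
        self.refs.append("#" + name)

def parse(text, convert_charrefs):
    parser = CharrefLogger(convert_charrefs)
    parser.feed(text)
    parser.close()
    return "".join(parser.data), parser.refs

sample = "<p>Fish &amp; chips &#62; salad</p>"
converted = parse(sample, True)
raw = parse(sample, False)
print(converted)
print(raw)
```

With convert_charrefs=True the references arrive already decoded inside the text. With False they are reported separately, so a tag stripper that only implements handle_data would silently drop them.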

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

The remove_html_tags() function takes a string that contains HTML tags and returns a new string where all opening and closing HTML tags have been removed.

The function instantiates the class and feeds the string containing the html tags to the parser.

Lastly, we call the get_data() method on the instance to join the list of strings into a single string that doesn’t contain any HTML tags.
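Putting the four steps together, a minimal sketch could look like the following. The names remove_html_tags and get_data come from the description above. The class name TagStripper, the attribute fragments, and the sample string are assumptions made for the example.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; for us
        super().__init__(convert_charrefs=True)
        self.fragments = []

    def handle_data(self, data):
        # Called by the parser with the text found between tags
        self.fragments.append(data)

    def get_data(self):
        # Join the collected text fragments into a single string
        return "".join(self.fragments)

def remove_html_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_data()

result = remove_html_tags(
    "<div><p>First &amp; second.</p><span>Third.</span></div>"
)
print(result)
```

Calling close() after feed() is a defensive choice: it makes sure trailing text that the parser may be holding back is passed to handle_data before the fragments are joined.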


Python: 5 ways to remove HTML tags from a string

This concise, example-based article will walk you through some different approaches to stripping HTML tags from a given string in Python (to get plain text).

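The raw HTML string we will use in the examples to come is shown below (the link and image URLs are placeholders):

html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

As you can see, it contains several common HTML tags like <title>, <h1>, <p>, <a>, self-closing ones like <img />, <br />, <hr />, and a sample comment. The reason we use such a long HTML string is to make sure that our methods can work well in many different scenarios. If the test HTML string is too short and simple, potential pitfalls might be overlooked.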

Using lxml

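lxml is a powerful tool for processing HTML and XML. It's fast, safe, and reliable. This is an external package, so we need to install it first:

pip install lxml

from lxml import etree

# Sample HTML (link and image URLs are placeholders)
html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

def remove_html_tags(text):
    # Parse with lxml's lenient HTML parser, then serialize only the text nodes
    parser = etree.HTMLParser()
    tree = etree.fromstring(text, parser)
    return etree.tostring(tree, encoding='unicode', method='text')

plain_text = remove_html_tags(html_string)
print(plain_text.strip())

Output:

Sling Academy This is a heading Some meaningless text Sample link Sample link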

Using Regular Expressions

You can use the re module to create a pattern that matches any text inside < and >, and then use the re.sub() method to replace them with empty strings.

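import re

# Sample HTML (link and image URLs are placeholders)
html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

def remove_html_tags(text):
    # Non-greedy match: "<", as few characters as possible, then ">"
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

result = remove_html_tags(html_string)

# print the result without leading and trailing white spaces
print(result.strip())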

The output looks exactly like what we got with the previous method:

Sling Academy This is a heading Some meaningless text Sample link Sample link

Using BeautifulSoup

This solution involves using the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.

pip install beautifulsoup4
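from bs4 import BeautifulSoup

# Sample HTML (link and image URLs are placeholders)
html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

def remove_html_tags(text):
    # Parse with Python's built-in parser and return only the text
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

print(remove_html_tags(html_string).strip())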

The result is the same plain text as in the previous examples (pass strip=True to get_text() if you also want the whitespace around each text fragment removed):

Sling Academy This is a heading Some meaningless text Sample link Sample link

Using a for loop and if…else statements

This technique is very flexible, and you can customize it as needed. All we need is a for loop, some if...else statements, and some basic string operations.

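# Sample HTML (link and image URLs are placeholders)
html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

def remove_html_tags(text):
    inside_tag = False
    result = ''
    for char in text:
        if char == '<':
            # A tag starts: skip characters until it closes
            inside_tag = True
        elif char == '>':
            # The tag ends: resume copying characters
            inside_tag = False
        else:
            if not inside_tag:
                result += char
    return result

print(remove_html_tags(html_string).strip())

Output:

Sling Academy This is a heading Some meaningless text Sample link Sample link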

Using HTMLParser

This solution makes use of the built-in html.parser module in Python for parsing HTML and extracting the text. However, it’s a little bit longer in comparison to the preceding approaches.

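from html.parser import HTMLParser

class HTMLTagRemover(HTMLParser):
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, data):
        # Collect the text found between tags
        self.result.append(data)

    def get_text(self):
        return ''.join(self.result)

def remove_html_tags(text):
    remover = HTMLTagRemover()
    remover.feed(text)
    return remover.get_text()

# Sample HTML (link and image URLs are placeholders)
html_string = """
<html>
  <head>
    <title>Sling Academy</title>
  </head>
  <body>
    <!-- A sample comment -->
    <h1>This is a heading</h1>
    <p>Some meaningless text</p>
    <a href="https://www.example.com">Sample link</a>
    <a href="https://www.example.com">Sample link</a>
    <img src="https://www.example.com/image.png" alt="sample image" />
    <br />
    <hr />
  </body>
</html>
"""

print(remove_html_tags(html_string).strip())

Output:

Sling Academy This is a heading Some meaningless text Sample link Sample link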

That’s it. Happy coding & have a nice day!
