Sling Academy

html-stripper 0.3

A simple package to extract text from (even broken/invalid) HTML. No dependencies, it just uses Python’s internal HTMLParser with a few tweaks.

Подробности проекта

Ссылки проекта

Статистика

Метаданные

Лицензия: GNU General Public License v3 (GPLv3)

Требует: Python >=3.6

Сопровождающие

Классификаторы

История выпусков Уведомления о выпусках | Лента RSS

Загрузка файлов

Загрузите файл для вашей платформы. Если вы не уверены, какой выбрать, узнайте больше об установке пакетов.

Source Distribution

Uploaded 30 июл. 2020 г. source

Хеши для html_stripper-0.3.tar.gz

Хеши для html_stripper-0.3.tar.gz
Алгоритм Хеш-дайджест
SHA256 b9ea66bc75d00adc06447f3c3a278899c10cf12fad0c0faab39457057b4056b9 Копировать
MD5 50dfb87e9e4fe54b52f35dfff89cca5e Копировать
BLAKE2b-256 21e0c6b141679eed08bb139a7a82f36ed30336b15d69c9b2c4a735549a53efad Копировать

Помощь

О PyPI

Внесение вклада в PyPI

Использование PyPI

Разработано и поддерживается сообществом Python’а для сообщества Python’а.
Пожертвуйте сегодня!

PyPI», «Python Package Index» и логотипы блоков являются зарегистрированными товарными знаками Python Software Foundation.

Источник

Python stripping html tags

Last updated: Feb 19, 2023
Reading time · 4 min

banner

# Table of Contents

# Remove the HTML tags from a String in Python

Use the re.sub() method to remove the HTML tags from a string.

The re.sub() method will remove all of the HTML tags in the string by replacing them with empty strings.

remove html tags from string

The code sample uses a regular expression to strip the HTML tags from a string.

The re.sub method returns a new string that is obtained by replacing the occurrences of the pattern with the provided replacement.

If the pattern isn’t found, the string is returned as is.

The brackets < and >match the opening and closing characters of an HTML tag.

The dot . matches any character except a newline character.

Adding a question mark ? after the qualifier makes it perform a non-greedy or minimal match.

For example, using the regular expression <.*?>will match only .

In its entirety, the regular expression matches all opening and closing HTML tags.

# Remove the HTML tags from a String using xml.etree.ElementTree

You can also use the xml.etree.ElementTree module to strip the HTML tags from a string.

remove html tags from string using element tree

The fromstring method parses an XML section from a string constant and returns an Element instance.

The itertext method creates a text iterator that we can join with the str.join() method.

# Remove the HTML tags from a String using lxml

You can also use the lxml module to strip the HTML tags from a string.

Make sure you have the module installed by running the following command.

Copied!
pip install lxml # 👇️ or pip3 pip3 install lxml

Now you can import and use the lxml module to strip the HTML tags from the string.

remove html tags from string using lxml

The text_content method removes all markup from a string.

# Remove the HTML tags from a String using BeautifulSoup

You can also use the BeautifulSoup4 module to remove the HTML tags from a string.

Make sure you have the module installed to be able to run the code sample.

Copied!
pip install lxml pip install beautifulsoup4 # 👇️ or pip3 pip3 install lxml pip3 install beautifulsoup4

Now you can import and use the BeautifulSoup module to strip the HTML tags from the string.

The text attribute on the BeautifulSoup object returns the text content of the string, excluding the HTML tags.

# Remove the HTML tags from a String using HTMLParser in Python

This is a four-step process:

  1. Extend from the HTMLParser class from the html.parser module.
  2. Implement the handle_data method to get the data between the HTML tags.
  3. Store the data in a list on the class instance.
  4. Call the get_data() method on an instance of the class.

The remove_html_tags function takes a string and strips the HTML tags from the supplied string.

We extended from the HTMLParser class. The code snippet is very similar to the one used internally by the django module.

The HTMLParser class is used to find tags and other markup and call handler functions.

The data between the HTML tags is passed from the parser to the derived class by calling self.handle_data() .

When convert_charrefs is set to True , character references automatically get converted to the corresponding Unicode character.

If convert_charrefs is set to False , character references are passed by calling the self.handle_entityref() or self.handle_charref() methods.

The str.join method takes an iterable as an argument and returns a string which is the concatenation of the strings in the iterable.

The remove_html_tags() function takes a string that contains HTML tags and returns a new string where all opening and closing HTML tags have been removed.

The function instantiates the class and feeds the string containing the html tags to the parser.

Lastly, we call the get_data() method on the instance to join the list of strings into a single string that doesn’t contain any HTML tags.

# Additional Resources

You can learn more about the related topics by checking out the following tutorials:

I wrote a book in which I share everything I know about how to become a better, more efficient programmer.

Источник

Python: 5 ways to remove HTML tags from a string

This concise, example-based article will walk you through some different approaches to stripping HTML tags from a given string in Python (to get plain text).

The raw HTML string we will use in the examples to come is shown below:

html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

"""

As you can see, it contains several common HTML tags like , , ,

, self-closing ones like
,


, and a sample comment. The reason we use such a long HTML string is to make sure that our methods can work well in many different scenarios. If the test HTML string is too short and simple, potential pitfalls might be overlooked.

Using lxml

lxml is a powerful tool for processing HTML and XML. It’s fast, safe, and reliable. This is an external package, so we need to install it first:

from lxml import etree html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

""" def remove_html_tags(text): parser = etree.HTMLParser() tree = etree.fromstring(text, parser) return etree.tostring(tree, encoding='unicode', method='text') plan_text = remove_html_tags(html_string) print(plan_text.strip())
Sling Academy This is a heading Some meaningless text Sample link Sample link

Using Regular Expressions

You can use the re module to create a pattern that matches any text inside < and >, and then use the re.sub() method to replace them with empty strings.

import re html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

""" def remove_html_tags(text): clean = re.compile('<.*?>') return re.sub(clean, '', text) result = remove_html_tags(html_string) # print the result without leading and trailing white spaces print(result.strip())

The output looks exactly as what we got after using the previous method:

Sling Academy This is a heading Some meaningless text Sample link Sample link

Using BeautifulSoup

This solution involves using the popular BeautifulSoup library, which provides convenient methods to parse and manipulate HTML.

pip install beautifulsoup4
from bs4 import BeautifulSoup html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

""" def remove_html_tags(input): soup = BeautifulSoup(input, 'html.parser') return soup.get_text() print(remove_html_tags(html_string).strip())

Still, the same plain text you got in the previous examples, but the indentation is automatically removed:

Sling Academy This is a heading Some meaningless text Sample link Sample link

Using a for loop and if…else statements

This technique is super flexible, and you can customize it as needed. Our weapons are just a for loop, some if. else statements, and some basic string operations.

html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

""" def remove_html_tags(text): inside_tag = False result = '' for char in text: if char == '': inside_tag = False else: if not inside_tag: result += char return result print(remove_html_tags(html_string).strip())
Sling Academy This is a heading Some meaningless text Sample link Sample link

Using HTMLParser

This solution makes use of the built-in html.parser module in Python for parsing HTML and extracting the text. However, it’s a little bit longer in comparison to the preceding approaches.

from html.parser import HTMLParser class HTMLTagRemover(HTMLParser): def __init__(self): super().__init__() self.result = [] def handle_data(self, data): self.result.append(data) def get_text(self): return ''.join(self.result) def remove_html_tags(text): remover = HTMLTagRemover() remover.feed(text) return remover.get_text() html_string = """     

This is a heading

Some meaningless text

Sample link Sample link sample image

""" print(remove_html_tags(html_string).strip())
Sling Academy This is a heading Some meaningless text Sample link Sample link

That’s it. Happy coding & have a nice day!

Источник

strip-tags

To run against just specific areas identified by CSS selectors:

strip-tags  -i input.html > output.txt

This can be called with multiple selectors:

cat input.html  strip-tags   > output.txt

To return just the first element on the page that matches one of the selectors, use —first :

cat input.html  strip-tags .content --first > output.txt

To remove content contained by specific selectors — e.g. the section of a page, use -r or —remove :

cat input.html  strip-tags -r nav > output.txt

To minify whitespace — reducing multiple space and tab characters to a single space, and multiple newlines and spaces to a maximum of two newlines — add -m or —minify :

cat input.html  strip-tags -m > output.txt

You can also run this command using python -m like this:

Keeping the markup for specified tags

When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags —

This is the heading

for example — to provide extra hints to the model.

The -t/—keep-tag option can be passed multiple times to specify tags that should be kept.

This example looks at the section of https://datasette.io/ and keeps the tags around the list items and elements:

curl -s https://datasette.io/ | strip-tags header -t h1 -t li 

All attributes will be removed from the tags, except for the id= and class= attribute since those may provide further useful hints to the language model.

The href attribute on links, the alt attribute on images and the name and value attributes on meta tags are kept as well.

You can also specify a bundle of tags. For example, strip-tags -t hs will keep the tag markup for all levels of headings.

The following bundles can be used:

As a Python library

You can use strip-tags from Python code too. The function signature looks like this:

This has tags

And whitespace too

strip-tags —help

Usage: strip-tags [OPTIONS] [SELECTORS]. Strip tags from HTML, optionally from areas identified by CSS selectors Example usage: cat input.html | strip-tags > output.txt To run against just specific areas identified by CSS selectors: cat input.html | strip-tags .entry .footer > output.txt Options: --version Show the version and exit. -r, --remove TEXT Remove content in these selectors -i, --input FILENAME Input file -m, --minify Minify whitespace -t, --keep-tag TEXT Keep these --all-attrs Include all attributes on kept tags --first First element matching the selectors --help Show this message and exit. 

Development

To contribute to this tool, first checkout the code. Then create a new virtual environment:

 strip-tags python -m venv venv  venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e To run the tests:

Источник

Читайте также:  Dictionary in python for loop
Оцените статью