How to scrape only visible webpage text with BeautifulSoup?
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of tags and html comments which I don’t want. I can’t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage. So, how should I find all visible text excluding scripts, comments, css etc.?
11 Answers 11
from bs4 import BeautifulSoup from bs4.element import Comment import urllib.request def tag_visible(element): if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']: return False if isinstance(element, Comment): return False return True def text_from_html(body): soup = BeautifulSoup(body, 'html.parser') texts = soup.findAll(text=True) visible_texts = filter(tag_visible, texts) return u" ".join(t.strip() for t in visible_texts) html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read() print(text_from_html(html))
For recent BS4 (at least) you could identify comments with isinstance(element, Comment) instead of matching with a regex.
In the visible function, the elif for finding comments didn’t seem to work. i had to update it to elif isinstance(element,bs4.element.Comment): . I also added ‘meta’ to the list of parents.
The above filter has a lot of \n in the result, add the following code to eliminate white spaces and new lines: elif re.match(r»[\s\r\n]+»,str(element)): return False
The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.
html = open('21storm.html').read() soup = BeautifulSoup(html) [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])] visible_text = soup.getText()
If str(element) fails with encoding problems, you should try unicode(element) instead if you are using Python 2.
import urllib from bs4 import BeautifulSoup url = "https://www.yahoo.com" html = urllib.urlopen(url).read() soup = BeautifulSoup(html) # kill all script and style elements for script in soup(["script", "style"]): script.extract() # rip it out # get text text = soup.get_text() # break into lines and remove leading and trailing space on each lines = (line.strip() for line in text.splitlines()) # break multi-headlines into a line each chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) # drop blank lines text = '\n'.join(chunk for chunk in chunks if chunk) print(text.encode('utf-8'))
If I try this on the url imfuna.com it only returns 6 words (Imfuna Property Inventory and Inspection Apps) despite the fact there is much more text/words on the page. any ideas why this answer doesn’t work for that url? @bumpkin
I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.
I had a similar problem to get rendered content, or the visible content in a typical browser. In particular I had many perhaps atypical cases to work with such a simple example below. In this case the non displayable tag is nested in a style tag, and is not visible in many browsers that I have checked. Other variations exist such as defining a class tag setting display to none. Then using this class for the div.
lots of text here
even headings
One solution posted above is:
html = Utilities.ReadFile('simple.html') soup = BeautifulSoup.BeautifulSoup(html) texts = soup.findAll(text=True) visible_texts = filter(visible, texts) print(visible_texts) [u'\n', u'\n', u'\n\n lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']
This solution certainly has applications in many cases and does the job quite well generally but in the html posted above it retains the text that is not rendered. After searching SO a couple solutions came up here BeautifulSoup get_text does not strip all tags and JavaScript and here Rendered HTML to plain text using Python
I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data.
One answer here from @Helge was about using nltk of all things.
import nltk %timeit nltk.clean_html(html) was returning 153 us per loop
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
betterHTML = html.decode(errors='ignore') %timeit html2text.html2text(betterHTML) %3.09 ms per loop
Python:Getting text from html using Beautifulsoup
I am trying to extract the ranking text number from this link link example: kaggle user ranking no1. More clear in an image: I am using the following code:
def get_single_item_data(item_url): sourceCode = requests.get(item_url) plainText = sourceCode.text soup = BeautifulSoup(plainText) for item_name in soup.findAll('h4',): print(item_name.string) item_url = 'https://www.kaggle.com/titericz' get_single_item_data(item_url)
The result is None . The problem is that soup.findAll(‘h4’,<'data-bind':"text: rankingText">) outputs: [ ] but in the html of the link when inspecting this is like:
1st
. It can be seen in the image: Its clear that the text is missing. How can I overpass that? Edit: Printing the soup variable in the terminal I can see that this value exists: So there should be a way to access through soup . Edit 2: I tried unsuccessfully to use the most voted answer from this stackoverflow question. Could be a solution around there.'data-bind':"text:>
The url in your code is different from the one in your linked example; this was confusing. I would suggest changing the one in your code to match the example or vice versa.
4 Answers 4
If you aren’t going to try browser automation through selenium as @Ali suggested, you would have to parse the javascript containing the desired information. You can do this in different ways. Here is a working code that locates the script by a regular expression pattern, then extracts the profile object, loads it with json into a Python dictionary and prints out the desired ranking:
import re import json from bs4 import BeautifulSoup import requests response = requests.get("https://www.kaggle.com/titericz") soup = BeautifulSoup(response.content, "html.parser") pattern = re.compile(r"profile: (<.*>),", re.MULTILINE | re.DOTALL) script = soup.find("script", text=pattern) profile_text = pattern.search(script.text).group(1) profile = json.loads(profile_text) print profile["ranking"], profile["rankingText"]
Extract content within a tag with BeautifulSoup
But it returned nothing. In addition, I’m also having problem with the following extracting the My home address :
Address: My home address
I’m also using the same method to search for the text=»Address: » but how do I navigate down to the next line and extract the content of
4 Answers 4
The contents operator works well for extracting text from
s = 'My home address ' soup = BeautifulSoup(s) td = soup.find('td') #My home address td.contents #My home address
s = 'Address: ' soup = BeautifulSoup(s) td = soup.find('td').find('b') #Address: td.contents #Address:
>>> s = 'Name: Hello world ' >>> soup = BeautifulSoup(s) >>> hello = soup.find(text='Name: ') >>> hello.next u'Hello world' next and previous let you move through the document elements in the order they were processed by the parser while sibling methods work with the parse tree
Sorry for the multiple comments as i didn’t know the return key actually posted the comment. I was thinking if there’s a better method to do this just in case if there’s a similar text which is «Name: «.
you can check for hello.parent.parent.name or hello.parent.parent.attrs or anything else you can latch onto
Use the below code to get extract text and content from html tags with python beautifulSoup
s = 'Example information ' # your raw html soup = BeautifulSoup(s) #parse html with BeautifulSoup td = soup.find('td') #tag of interest Example information td.text #Example information # clean text from html
Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation would greatly improve its long-term value by showing why this is a good solution to the problem and would make it more useful to future readers with other, similar questions. Please edit your answer to add some explanation, including the assumptions you’ve made.
I decided to use .text since the user wanted to extract plain text from the html. After the user parses the the html with the Beautiful soup python library, he can use ‘id’, «class» or any other identifier to find the tag or html element of interest and after doing this, if he wants plain text within any of the selected tag, he can use .text on the tag as I decribed above