BeautifulSoup and lxml.html — what to prefer? [duplicate]
I am working on a project that will involve parsing HTML. After searching around, I found two probable options: BeautifulSoup and lxml.html. Is there any reason to prefer one over the other? I have used lxml for XML some time back and I feel I will be more comfortable with it; however, BeautifulSoup seems to be much more common. I know I should use the one that works for me, but I was looking for personal experiences with both.
4 Answers
The simple answer, imo, is that if you trust your source to be well-formed, go with the lxml solution. Otherwise, BeautifulSoup all the way.
This answer is three years old now; it’s worth noting, as Jonathan Vanasco does in the comments, that BeautifulSoup4 now supports using lxml as the internal parser, so you can use the advanced features and interface of BeautifulSoup without most of the performance hit, if you wish (although I still reach straight for lxml myself — perhaps it’s just force of habit :)).
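For concreteness, here is a minimal sketch of that combination (assuming both beautifulsoup4 and lxml are installed; the markup is a made-up fragment):

    # BeautifulSoup interface on top, lxml doing the actual parsing.
    from bs4 import BeautifulSoup

    markup = "<p>Hello <b>world</p>"         # note the unclosed <b> tag
    soup = BeautifulSoup(markup, "lxml")     # pass "lxml" to use it as the backend
    print(soup.b.get_text())                 # -> world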
I see. I will go with lxml, then; my HTML comes from a robust website, so I can (hopefully) depend on it being well-formed.
Yes, I would, certainly if you are already familiar with lxml and have no "pure Python" requirements (as on Google App Engine). Personally, I haven't had any problems processing pages with lxml.html (on the contrary, I have been able to process pages that gave BeautifulSoup trouble), except once, when I had to explicitly provide the correct character encoding because lxml "trusted" the incorrect HTTP headers/HTML meta tags. Also note that ElementSoup lets lxml.html use the BeautifulSoup parser should it be necessary.
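As a side note on that encoding problem, lxml.html lets you override the detected encoding by passing an explicit parser. A rough sketch, where the file name and encoding are assumptions for illustration:

    # Force a known-correct encoding when the page's headers/meta tags lie.
    import lxml.html

    raw = open("page.html", "rb").read()               # hypothetical saved copy of the page
    parser = lxml.html.HTMLParser(encoding="utf-8")    # override whatever lxml would guess
    root = lxml.html.fromstring(raw, parser=parser)
    print(root.findtext(".//title"))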
This question popped up because of a recent edit. I just wanted to note that BeautifulSoup4 supports using lxml as the underlying parser, so now you can basically get almost the speed of lxml (with just a minor hit) along with all the bonuses of BeautifulSoup.
In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data out of poorly-formed HTML or XML.
The lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting:
BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust: it can deal with a "soup" of malformed tags by using regular expressions, whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume this also applies to BeautifulSoup itself, not just to the soupparser for lxml.
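A minimal sketch of the fallback pattern the lxml documentation is describing (the helper name is mine, and lxml.html.soupparser requires BeautifulSoup to be installed):

    # Try the fast libxml2-based parser first; fall back to the soup-based one.
    from lxml import etree
    import lxml.html
    import lxml.html.soupparser

    def parse_html(text):
        try:
            return lxml.html.fromstring(text)
        except etree.ParserError:                        # lxml gave up (rare, e.g. empty input)
            return lxml.html.soupparser.fromstring(text)

    root = parse_html("<p>some <b>tag soup")
    print(lxml.html.tostring(root))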
They also show how to benefit from BeautifulSoup's encoding detection while still parsing quickly with lxml:
    >>> from BeautifulSoup import UnicodeDammit

    >>> def decode_html(html_string):
    ...     converted = UnicodeDammit(html_string, isHTML=True)
    ...     if not converted.unicode:
    ...         raise UnicodeDecodeError(
    ...             "Failed to detect encoding, tried [%s]",
    ...             ', '.join(converted.triedEncodings))
    ...     # print converted.originalEncoding
    ...     return converted.unicode

    >>> root = lxml.html.fromstring(decode_html(tag_soup))
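The snippet above targets the old BeautifulSoup 3 API. On BeautifulSoup 4 the same helper lives in the bs4 package with slightly different names; a rough equivalent, under that assumption:

    # BeautifulSoup 4 variant: isHTML -> is_html, .unicode -> .unicode_markup.
    from bs4 import UnicodeDammit
    import lxml.html

    def decode_html_bs4(html_bytes):
        converted = UnicodeDammit(html_bytes, is_html=True)
        if converted.unicode_markup is None:
            raise ValueError("Could not detect the document encoding")
        return converted.unicode_markup

    root = lxml.html.fromstring(decode_html_bs4(b"<p>caf\xc3\xa9</p>"))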
In the words of BeautifulSoup's creator,
That’s it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.
I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.
The C libraries libxml2 and libxslt have huge benefits: standards-compliant, full-featured, and fast. Fast! FAST! lxml is a new Python binding for libxml2 and libxslt.
Beautiful Soup and Table Scraping — lxml vs html parser
I would like to know why the code below works with "html.parser" but prints back None if I change "html.parser" to "lxml".
    #!/usr/bin/python
    from bs4 import BeautifulSoup
    from urllib import urlopen

    webpage = urlopen('http://www.thewebpage.com')
    soup = BeautifulSoup(webpage, "html.parser")

    table = soup.find('table', )
    print table
2 Answers
Short answer.
If you have already installed lxml, just use it.
html.parser: BeautifulSoup(markup, "html.parser")
- Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.)
- Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)
lxml: BeautifulSoup(markup, "lxml")
- Advantages: Very fast, Lenient
- Disadvantages: External C dependency
html5lib: BeautifulSoup(markup, "html5lib")
- Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
- Disadvantages: Very slow, External Python dependency
There is a dedicated section in the BeautifulSoup documentation called Differences between parsers; it states that:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
The differences become clear on HTML documents that are not well-formed.
The moral is just that you should use the parser that works in your particular case.
Also note that you should always explicitly specify which parser you are using. This will help you avoid surprises when running the code on different machines or in different virtual environments.
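To make both points concrete, here is a small experiment you can run; the trees shown in the comments are roughly what the BeautifulSoup documentation reports for this fragment, so treat them as approximate. It also illustrates why a find() call can return a result under one parser and None under another:

    # One invalid fragment, three parsers, three different parse trees.
    from bs4 import BeautifulSoup

    fragment = "<a></p>"
    print(BeautifulSoup(fragment, "html.parser"))  # roughly: <a></a>
    print(BeautifulSoup(fragment, "lxml"))         # roughly: <html><body><a></a></body></html>
    print(BeautifulSoup(fragment, "html5lib"))     # roughly: <html><head></head><body><a><p></p></a></body></html>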
Parsing HTML in python — lxml or BeautifulSoup? Which of these is better for what kinds of purposes?
From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I’ve chosen BeautifulSoup for a project I’m working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I’ve heard that lxml is faster. So I’m wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?
possible duplicate of BeautifulSoup and lxml.html — what to prefer? I've written a detailed answer there; I reposted it here because this question is a duplicate.
Sorry, I meant to close the other one. I have now flagged the other one. I thought it didn't matter where to raise the flag, on the older one or the newer one.
7 Answers
Pyquery provides the jQuery selector interface to Python (using lxml under the hood).
It’s really awesome, I don’t use anything else anymore.
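A tiny sketch of what that looks like in practice (the markup and selectors are placeholders):

    # jQuery-style CSS selectors in Python, with lxml doing the parsing underneath.
    from pyquery import PyQuery as pq

    doc = pq("<div><p class='greeting'>Hello</p><p>Bye</p></div>")
    print(doc("p.greeting").text())   # -> Hello
    print(doc("p").eq(1).text())      # -> Bye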
This works better than bs4. I've had some problems with bs4 where even diagnose() would not work 🙁
For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.
Quoting from the linked page:
Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, «malformed start tag» errors, and «bad end tag» errors. This page explains what happened, how the problem will be addressed, and what you can do right now.
This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.
tl;dr
Use 3.2.0 instead.
IMHO this is misleading: careful reading of that page reveals that lxml is just an alternative for the problematic version 3.1.0, whose problems were fixed in 3.2.0, and now there is even a version 4 on the way, released just 2 months ago, so the module is hardly "no longer actively maintained". Please modify the answer.
Good to see BeautifulSoup getting maintained again. 3.2.0 was released in November 2010, almost a year after this answer. 🙂
I doubt this should still be the accepted answer today. Everything here is pretty much useless information now (other than for nostalgic/historic purposes).