BeautifulSoup is one of the most popular Python libraries for parsing HTML and XML documents. But there is often confusion around whether BeautifulSoup itself parses the documents, or whether it uses other parsers like lxml and html.parser under the hood.
BeautifulSoup Doesn't Actually Parse Documents
The key thing to understand is that BeautifulSoup provides a nice API for navigating and searching an HTML/XML document, but it doesn't contain a parser itself. It uses other parsers to actually convert the raw document data into a parsable structure.
The most common parsers BeautifulSoup can use are:

- html.parser: Python's built-in HTML parser. No extra dependency, reasonable speed, moderately lenient.
- lxml: a very fast C-based parser that handles both HTML and XML. Requires installing the lxml package.
- html5lib: parses pages the same way a web browser does. The most lenient of the three, but also the slowest.
By default, BeautifulSoup will pick the best parser installed on your system, preferring lxml if it's available and falling back to the built-in html.parser otherwise.
from bs4 import BeautifulSoup
soup = BeautifulSoup(my_html_doc)  # auto-selects the best available parser (newer versions warn when you don't name one)
You can also explicitly state which parser you want it to use:
soup = BeautifulSoup(my_html_doc, 'lxml')
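Naming the parser explicitly also makes your code portable: 'html.parser' ships with the standard library, so the snippet below runs without installing lxml or html5lib. A minimal sketch (the HTML string is invented for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, made-up document to parse.
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Explicitly request the stdlib parser -- no extra dependencies,
# and no guessing (or warnings) from BeautifulSoup.
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())  # Hello
print(soup.p.get_text())   # World
```

Pinning the parser this way also guarantees everyone running your code gets the same document tree, rather than whichever tree the locally installed "best" parser happens to produce.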
Practical Implications
The main thing this means in practice is that if you want BeautifulSoup to handle badly broken HTML, you may need to explicitly use html5lib, which repairs markup the same way a web browser does and is the most forgiving of the common parsers.
It also means that if you're doing heavy parsing, installing and using lxml can speed things up considerably, since it is far faster than the pure-Python html.parser.
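The parser choice is easy to see on invalid markup, because each parser repairs it differently. A small sketch using only the stdlib parser (the input string is invented; the lxml and html5lib behaviors described in the comments follow the differences the BeautifulSoup documentation describes):

```python
from bs4 import BeautifulSoup

# "<a></p>" is invalid HTML: a stray closing </p> with no opening tag.
# Different parsers repair it differently:
#   - html.parser simply drops the stray </p>
#   - lxml would also wrap the result in <html><body>...</body></html>
#   - html5lib would go further and insert an empty <p> inside the <a>,
#     because that's what a browser would do
soup = BeautifulSoup("<a></p>", "html.parser")

print(soup)            # <a></a>
print(soup.find("p"))  # None -- html.parser created no <p> tag
```

So the tree you get for messy real-world pages genuinely depends on which parser you picked, not on BeautifulSoup itself.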
BeautifulSoup is really just an interface to other parsers. It provides great methods and Pythonic idioms for navigating, searching, and modifying parsed document trees. But it doesn't contain an HTML parser itself - it offloads that work to other specialized parsers.
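Whichever parser builds the tree, that navigation and modification API is the same. A brief sketch (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul id="langs">
  <li class="lang">Python</li>
  <li class="lang">Rust</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching: find_all() filters by tag name and attributes.
names = [li.get_text() for li in soup.find_all("li", class_="lang")]
print(names)  # ['Python', 'Rust']

# Modifying: the parsed tree is mutable.
soup.ul["class"] = "languages"

# CSS selectors work on the modified tree too.
print(soup.select("ul.languages li")[0].get_text())  # Python
```

These methods behave identically no matter which underlying parser produced the tree, which is exactly the separation of concerns the article describes.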