The html.parser module built into Python allows you to parse HTML documents and extract data from them. This cheatsheet covers everything you need to get the most out of the module.
Getting Started
Import the HTMLParser class:
from html.parser import HTMLParser
Create a parser class inheriting from HTMLParser:
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Encountered a start tag: {tag}")

    def handle_data(self, data):
        print(f"Encountered some data: {data}")

parser = MyHTMLParser()
Feed some HTML to the parser:
html = """<html><head><title>Test Parser</title></head><body><h1>Hello World!</h1></body></html>"""
parser.feed(html)
The parser will call methods to handle tags and data during parsing.
Parsing HTML/XML
The parser is designed for HTML, but it can often process simple, well-formed XML as well. Here is an example XML document:
<note>
<to>George</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget the meeting!</body>
</note>
Parsing simple XML like this follows the same process as HTML, using the same parser and methods, although html.parser is not a true XML parser.
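As a minimal sketch of that idea (the NoteParser name is illustrative), the same subclass pattern collects element names and text from the note document:

```python
from html.parser import HTMLParser

class NoteParser(HTMLParser):
    """Collects element names and stripped text from a simple XML document."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())

xml_doc = """<note>
<to>George</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget the meeting!</body>
</note>"""

note_parser = NoteParser()
note_parser.feed(xml_doc)
# note_parser.tags is ['note', 'to', 'from', 'heading', 'body']
```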
Parsing Strategies
There are several approaches and packages available for parsing HTML and XML in Python:
Built-in HTML Parser
BeautifulSoup
Regular Expressions
XML Parsers
In many cases Python's built-in HTML parser is your best choice for basic to intermediate parsing needs.
Parsing Document Fragments
You don't need to feed the parser a full HTML document; it works on any document fragment:
fragment = """<div><p>This is a <b>fragment</b></p><p>Of <i>HTML</i> without metadata</p></div>"""
parser.feed(fragment)
Useful for parsing HTML snippets from larger documents or templates.
Asynchronous Parsing
The parser's blocking feed() call can be used asynchronously with Python's asyncio by running it in an executor:
import asyncio

async def parse_async(html):
    parser = MyHTMLParser()
    # Run the blocking feed() call in a worker thread
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, parser.feed, html)
    return parser

# Await inside a coroutine to parse without blocking the event loop
parsed = await parse_async(some_html)
Helpful for parsing multiple pages concurrently without blocking.
Parsing Methods
These are the main parsing methods you can override in a subclass:
handle_starttag(tag, attrs)
Called for each starting tag:
def handle_starttag(self, tag, attrs):
    print(f"Encountered start tag: {tag}")
    attrs_str = "".join([f' {name}="{value}"' for name, value in attrs])
    print(f"<{tag}{attrs_str}>")
Attributes are passed in as a list of (name, value) tuples.
handle_endtag(tag)
Called for each ending tag:
def handle_endtag(self, tag):
    print(f"Encountered end tag: {tag}")
handle_data(data)
Called for text blocks between tags:
def handle_data(self, data):
    print(f"Encountered data: {data}")
Useful for extracting text content.
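For example, a minimal text extractor (the TextExtractor name is illustrative) that skips the contents of script and style elements:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<p>Hello <b>world</b><script>var x = 1;</script></p>")
text = " ".join(extractor.parts)  # "Hello world"
```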
handle_comment(data)
Called for HTML comments:
def handle_comment(self, data):
    print(f"Comment: {data}")
handle_entityref(name)
Called for named entity references like &gt; (only when the parser is created with convert_charrefs=False; by default references are decoded and passed to handle_data).
The name does not include the '&' or ';' delimiters.
def handle_entityref(self, name):
    print(f"Found entity ref: {name}")
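A quick sketch (note the parser must be constructed with convert_charrefs=False, otherwise entities never reach this handler):

```python
from html.parser import HTMLParser

class EntityParser(HTMLParser):
    """Records the names of named entity references."""
    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events
        super().__init__(convert_charrefs=False)
        self.entities = []

    def handle_entityref(self, name):
        self.entities.append(name)

entity_parser = EntityParser()
entity_parser.feed("<p>Fish &amp; chips &gt; soup</p>")
# entity_parser.entities is ['amp', 'gt']
```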
handle_charref(name)
Called for numeric character references like &#62; (again, only when convert_charrefs=False).
The name is the numeric part of the reference without the '&#' and ';' delimiters, e.g. '62' or 'x3E'; it is not the decoded character.
def handle_charref(self, name):
    print(f"Found character reference to: {name}")
handle_decl(data)
Called for doctype declarations such as <!DOCTYPE html>. (An XML declaration like <?xml ... ?> is reported through handle_pi instead.)
For example:
def handle_decl(self, data):
    print(f"Declaration: {data}")
These cover all the major parsing events.
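As a consolidated sketch, one subclass can record every event type along with its position via the built-in getpos() method:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Records (event, value, position) tuples for every parsing event."""
    def __init__(self):
        # convert_charrefs=False so entity/charref events are visible
        super().__init__(convert_charrefs=False)
        self.events = []

    def _log(self, kind, value):
        self.events.append((kind, value, self.getpos()))

    def handle_starttag(self, tag, attrs): self._log("start", tag)
    def handle_endtag(self, tag): self._log("end", tag)
    def handle_data(self, data): self._log("data", data)
    def handle_comment(self, data): self._log("comment", data)
    def handle_entityref(self, name): self._log("entity", name)
    def handle_charref(self, name): self._log("charref", name)
    def handle_decl(self, decl): self._log("decl", decl)

logger = EventLogger()
logger.feed("<!DOCTYPE html><p>Hi &amp; bye</p>")
kinds = [event for event, value, pos in logger.events]
# kinds is ['decl', 'start', 'data', 'entity', 'data', 'end']
```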
Extracting Data
Store extracted data in your parser subclass instance:
from html.parser import HTMLParser
class MyParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = MyParser()
parser.feed(some_html)

for link in parser.links:
    print(link)
You have full access to the extracted data after parsing completes.
Parsing Attributes
Tag attributes are passed as a list of (name, value) tuples to start tag methods:
def handle_starttag(self, tag, attrs):
    print(f"<{tag}>")
    for name, value in attrs:
        print(f"  {name}={value}")
    print(f"</{tag}>")
Convenient for iterating over every attribute.
You can also convert the attribute list to a dictionary for access by name:
def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)
    print(attrs.get('class'))  # Class attr, or None if absent
Parsing Trees
The parser does not build a tree itself; instead, the order and nesting of its events mirror the document tree.
The tree-related handlers are:
handle_startendtag(tag, attrs) # Called for empty (self-closed) tags like <img/>
handle_starttag(tag, attrs) # On opening tag
handle_endtag(tag) # On closing tag
The nesting of start/end calls represents the tree structure.
This lets you build your own tree, or pull data from specific branches, while parsing.
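One way to capture that structure is a stack of open elements. A minimal sketch (the TreeBuilder name is illustrative) that records each element as a nested [tag, children] list:

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds nested [tag, children] lists from start/end tag events."""
    def __init__(self):
        super().__init__()
        self.root = ["document", []]
        self._stack = [self.root]  # Stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self._stack[-1][1].append(node)  # Attach to current parent
        self._stack.append(node)         # Descend into the new element

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()            # Ascend back to the parent

builder = TreeBuilder()
builder.feed("<ul><li>one</li><li>two</li></ul>")
# builder.root is ['document', [['ul', [['li', []], ['li', []]]]]]
```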
Error Handling
In Python 3 the parser is tolerant by design: feed() does not raise on malformed markup (the old strict mode and the HTMLParseError exception were removed in Python 3.5). Broken input is simply handled as best the parser can:
parser.feed("<html><asdd<<</html")  # No exception; bogus markup surfaces as odd events
Handle unexpected structure inside your own handler methods, and use self.getpos() to locate where a surprising event occurred.
Advanced Techniques
There are several more advanced techniques available as well:
Parser Subclasses
Create parser subclasses targeted for specific parsing goals:
class LinkParser(HTMLParser):
    # Custom logic to find <a> links
    ...

class ImageParser(HTMLParser):
    # Custom logic to find <img> tags
    ...
Small, focused subclasses like these can be reused without altering each other's logic.
Web Scraping
Import your parser into a web scraper to parse pages:
import requests
from example import MyParser  # Your parser module

def scrape(url):
    page = requests.get(url)
    parser = MyParser()
    parser.feed(page.text)
    return parser.data
Brings together fetching and parsing logic.
Asynchronous Parsing
Import asyncio to parse multiple pages concurrently:
import asyncio

async def parse(url):
    page = await fetch_page(url)  # Your async HTTP fetch
    parser = MyParser()
    parser.feed(page)
    return parser.data

async def main(urls):
    return await asyncio.gather(*(parse(url) for url in urls))

urls = ['url1', 'url2', ...]
data = asyncio.run(main(urls))
Takes advantage of asynchronous IO for faster parsing.
XML Integration
You can round-trip XML through xml.dom.minidom before feeding it to the parser:
import xml.dom.minidom

xml_source = """<note>...</note>"""  # Don't shadow the xml module name
dom = xml.dom.minidom.parseString(xml_source)
pretty = dom.toprettyxml()  # Re-serialized, pretty-printed XML (not HTML)
parser.feed(pretty)  # html.parser tolerates simple XML like this
This lets you normalize XML before handing it to the HTML parser.
You can also use dedicated XML parsers like xml.etree.ElementTree from the standard library, or the third-party lxml.
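For real XML work, the standard library's xml.etree.ElementTree is usually the better tool. A minimal sketch using the note document from earlier:

```python
import xml.etree.ElementTree as ET

xml_doc = """<note>
<to>George</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget the meeting!</body>
</note>"""

# Parse the string into an element tree and query it directly
root = ET.fromstring(xml_doc)
recipient = root.findtext("to")    # "George"
reminder = root.findtext("body")   # "Don't forget the meeting!"
```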
Parsing Tips
Here are some handy tips for using the html parser effectively:
Sanitize Input
Use a library to sanitize input before parsing:
import bleach
dirty_html = get_tainted_input()
clean_html = bleach.clean(dirty_html)
parser.feed(clean_html)
Avoids security issues from malicious input.
Improve Performance
The parser is already quite fast; its only constructor option is convert_charrefs (default True), which folds entity and character references into regular handle_data calls.
Handle Encoding
HTMLParser takes no encoding argument; feed() expects str, not bytes. Decode bytes yourself before parsing:
parser = HTMLParser()
parser.feed(raw_bytes.decode('utf-8'))
This avoids problems caused by encoding mismatches.
Debug Errors
feed() will not raise on malformed HTML, so debugging means inspecting events. Call self.getpos() inside a handler to get the (line, column) of surprising input:
def handle_data(self, data):
    line, col = self.getpos()
    print(f"Data at line {line}, column {col}: {data!r}")
Odd or unexpected events usually indicate malformed input documents.
Validate Documents
Check that a document is valid HTML before parsing. The html5validator package (a command-line wrapper around the Nu HTML checker) validates files on disk:
html5validator --root site/
Validating input first can help narrow down parsing surprises.
Use Cases
Some examples of common use cases:
Web Scraping
Harvesting data from websites:
import requests
from html.parser import HTMLParser

class ScraperParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_data(self, data):
        self.items.append(data)

parser = ScraperParser()
parser.feed(requests.get("https://example.com").text)
print(parser.items)
RSS/Atom Feeds
Parse syndicated feed content:
from urllib import request

feed = request.urlopen("https://example.com/feed")
parser = FeedParser()  # Your HTMLParser subclass for feed markup
parser.feed(feed.read().decode('utf-8'))  # urlopen().read() returns bytes
print(f"Most recent item: {parser.items[0]}")
Email Parsing
Extract data from HTML email content:
import email
import imaplib

mail = imaplib.IMAP4_SSL("imap.example.com")
mail.login(user, password)
mail.select("INBOX")
_, data = mail.fetch(message_id, "(RFC822)")
msg = email.message_from_bytes(data[0][1])
for part in msg.walk():
    if part.get_content_type() == "text/html":
        parser = EmailParser()  # Your HTMLParser subclass
        parser.feed(part.get_payload(decode=True).decode())
        print(parser.get_links())
Static Site Generators
Use parsed HTML to produce static sites:
class SiteParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_data(self, data):
        self.pages.append(data)

parser = SiteParser()
parser.feed(template_html)

for page in parser.pages:
    with open(f"{page}.html", "w") as f:
        f.write(render(page))  # render() is your own template function
Automates site generation without a dynamic backend.
HTML Processing
Manipulate and process HTML documents:
class Process(HTMLParser):
    """Re-emits the document, adding disabled to every <button>."""
    def __init__(self):
        super().__init__()
        self.output = []
    def handle_starttag(self, tag, attrs):
        if tag == "button":
            attrs = attrs + [("disabled", "disabled")]
        attrs_str = "".join(f' {name}="{value}"' for name, value in attrs)
        self.output.append(f"<{tag}{attrs_str}>")
    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")
    def handle_data(self, data):
        self.output.append(data)
    def get_html(self):
        return "".join(self.output)

parser = Process()
parser.feed(html)
processed_html = parser.get_html()
Modify, sanitize, or transform HTML programmatically.
Test HTML Output
Verify HTML generation:
expected_html = """
<html>
<body>
Hello world!
</body>
</html>
"""
generator = MyHtmlGenerator()  # Your HTML generator under test
parser = TestParser()          # Your HTMLParser subclass that collects body text
parser.feed(generator.output())
assert parser.body == "Hello world!"
Confirm generated HTML matches expectations.