This cheatsheet covers the core BeautifulSoup 4 API with practical examples.
Installation
First, install the beautifulsoup4 package:
pip install beautifulsoup4
Then import the BeautifulSoup class:
from bs4 import BeautifulSoup
Creating a BeautifulSoup Object
Parse HTML string:
html = "<p>Example paragraph</p>"
soup = BeautifulSoup(html, 'html.parser')
Parse from file:
with open("index.html") as file:
    soup = BeautifulSoup(file, 'html.parser')
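html.parser is built into Python, but BeautifulSoup can also use faster or more lenient third-party parsers if they are installed (lxml and html5lib are separate pip packages):
soup = BeautifulSoup(html, 'lxml') # fast C parser
soup = BeautifulSoup(html, 'html5lib') # parses the way a web browser does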
BeautifulSoup Object Types
When parsing documents and navigating the parse trees, you will encounter the following main object types:
Tag
A Tag corresponds to an HTML or XML tag in the original document:
soup = BeautifulSoup('<p>Hello World</p>', 'html.parser')
p_tag = soup.p
p_tag.name # 'p'
p_tag.string # 'Hello World'
Tags contain nested Tags and NavigableStrings.
NavigableString
A NavigableString represents text content without tags:
soup = BeautifulSoup('Hello World', 'html.parser')
text = soup.string
text # 'Hello World'
type(text) # bs4.element.NavigableString
BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. It is the root of the tree:
soup = BeautifulSoup('<html>...</html>', 'html.parser')
soup.name # '[document]'
soup.head # <head> Tag element
Comment
Comments in HTML are also available as Comment objects:
import re
soup = BeautifulSoup('<!-- This is a comment -->', 'html.parser')
comment = soup.find(string=re.compile('This is'))
type(comment) # bs4.element.Comment
Knowing these core object types helps when analyzing, searching, and navigating parsed documents.
Searching the Parse Tree
By Name
HTML:
<div>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</div>
Python:
paragraphs = soup.find_all('p')
# [<p>Paragraph 1</p>, <p>Paragraph 2</p>]
By Attributes
HTML:
<div id="content">
<p>Paragraph 1</p>
</div>
Python:
div = soup.find(id="content")
# <div id="content">...</div>
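find() and find_all() also accept keyword arguments and an attrs dict for filtering by attribute; class_ takes a trailing underscore because class is a reserved word in Python (the "news" value below is only an illustration):
soup.find_all('p', class_='content') # by CSS class
soup.find_all(attrs={'data-category': 'news'}) # by arbitrary attribute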
By Text
HTML:
<p>This is some text</p>
Python:
p = soup.find(string="This is some text")
# 'This is some text' (a NavigableString, not the <p> tag itself)
Searching with CSS Selectors
CSS selectors provide a very powerful way to search for elements within a parsed document.
Some examples of CSS selector syntax:
By Tag Name
Select all <p> tags:
soup.select("p")
By ID
Select element with ID "main":
soup.select("#main")
By Class Name
Select elements with class "article":
soup.select(".article")
By Attribute
Select tags with a "data-category" attribute:
soup.select("[data-category]")
Descendant Combinator
Select paragraphs inside divs:
soup.select("div p")
Child Combinator
Select direct children paragraphs:
soup.select("div > p")
Adjacent Sibling
Select an h2 immediately following an h1:
soup.select("h1 + h2")
General Sibling
Select every h2 sibling that follows an h1:
soup.select("h1 ~ h2")
By Text
Select elements containing text:
soup.select(":contains('Some text')")
By Attribute Value
Select input with type submit:
soup.select("input[type='submit']")
Pseudo-classes
Select first paragraph:
soup.select("p:first-of-type")
Chaining
Select first article paragraph:
soup.select("article > p:nth-of-type(1)")
Accessing Data
HTML:
<p class="content">Some text</p>
Python:
p = soup.find('p')
p.name # "p"
p.attrs # {"class": ["content"]}
p.string # "Some text"
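Attribute values can also be read like dictionary keys, and get_text() collects all nested text:
p['class'] # ['content'] - class is always a list
p.get('id') # None if the attribute is missing
p.get_text() # 'Some text', including text of any children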
The Power of find_all()
The find_all() method is the most commonly used search method in BeautifulSoup.
Returns All Matches
all_paras = soup.find_all('p')
This gives you all paragraphs on a page.
Flexible Queries
You can pass a wide range of queries to find_all(): a tag name, a list of names, a regular expression, an attribute filter, or a custom function.
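For example, all of these are standard find_all() query forms:
import re
soup.find_all('p') # by tag name
soup.find_all(['a', 'img']) # any tag name in a list
soup.find_all(re.compile('^h[1-6]$')) # tag names matching a regex
soup.find_all(id='content') # by attribute value
soup.find_all(lambda tag: tag.has_attr('href')) # by custom function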
Useful Features
Some useful things you can do with find_all() include limiting the number of results, restricting the search to direct children, and matching text content.
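For instance, with the limit, recursive and string parameters:
soup.find_all('p', limit=2) # stop after two matches
soup.find_all('p', recursive=False) # direct children only
soup.find_all(string='Some text') # match text nodes instead of tags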
Why It's Useful
In summary, find_all() is the workhorse of BeautifulSoup searching.
Whenever you need to get a collection of elements from a parsed document, find_all() is usually the tool to reach for.
Navigating Trees
Beyond searching, you can traverse up, down, and sideways through the tree using navigation attributes such as parent, children, and next_sibling.
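A minimal sketch using a small example document:
soup = BeautifulSoup('<div><p>One</p><p>Two</p></div>', 'html.parser')
p = soup.find('p')
p.parent.name # 'div' - go up
p.next_sibling # <p>Two</p> - go sideways
list(soup.div.children) # [<p>One</p>, <p>Two</p>] - go down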
Modifying the Parse Tree
BeautifulSoup provides several methods for editing and modifying the parsed document tree.
HTML:
<p>Original text</p>
Python:
p = soup.find('p')
p.string = "New text"
Edit Tag Names
Change an existing tag name:
tag = soup.find('span')
tag.name = 'div'
Edit Attributes
Add, modify or delete attributes of a tag:
tag['class'] = 'header' # set attribute
tag['id'] = 'main'
del tag['class'] # delete attribute
Edit Text
Change text of a tag:
tag.string = "New text"
Append text to a tag:
tag.append("Additional text")
Insert Tags
Insert a new tag:
new_tag = soup.new_tag("h1")
tag.insert_before(new_tag)
Delete Tags
Remove a tag entirely:
tag.extract() # removes the tag and returns it
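If you will not need the removed tag again, decompose() deletes it and destroys its contents:
tag.decompose()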
Wrap/Unwrap Tags
Wrap a new tag around an existing one:
tag.wrap(soup.new_tag('div'))
Or unwrap a tag, keeping only its contents:
tag.unwrap()
Modifying the parse tree is very useful for cleaning up scraped data or extracting the parts you need.
Outputting HTML
Input HTML:
<p>Hello World</p>
Python:
print(soup.prettify())
# <p>
# Hello World
# </p>
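For compact output without the added whitespace, convert the soup (or any tag) to a string:
str(soup) # '<p>Hello World</p>'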
Integrating with Requests
Fetch a page:
import requests
res = requests.get("https://example.com")
soup = BeautifulSoup(res.text, 'html.parser')
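In practice you will usually want a timeout and a status check as well; a minimal sketch (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

res = requests.get("https://example.com", timeout=10)
res.raise_for_status() # raise an exception on 4xx/5xx responses
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.title.string)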
Parsing Only Parts of a Document
When dealing with large documents, you may want to parse only a fragment rather than the whole thing. BeautifulSoup allows for this using SoupStrainers.
There are a few ways to parse only parts of a document:
By Tag Name
Parse only specific tags, such as just the tables:
from bs4 import SoupStrainer
only_tables = SoupStrainer("table")
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_tables)
This will parse only the <table> tags from the document. The same works for any tag name:
only_divs = SoupStrainer("div")
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_divs)
By Function
Pass a function to test whether a string should be parsed:
def is_short_string(string):
    return len(string) < 20
only_short_strings = SoupStrainer(string=is_short_string)
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_short_strings)
This filters the parsed strings based on their text content.
By Attributes
Parse tags that contain specific attributes:
has_data_attr = SoupStrainer(attrs={"data-category": True})
soup = BeautifulSoup(doc, 'html.parser', parse_only=has_data_attr)
Multiple Conditions
You can combine multiple conditions in one strainer:
strainer = SoupStrainer("div", id="main")
soup = BeautifulSoup(doc, 'html.parser', parse_only=strainer)
This will parse only <div> tags with id "main".
Parsing only the parts you need can help reduce memory usage and improve performance when scraping large documents.
Dealing with Encoding
When parsing documents, you may encounter encoding issues. Here are some ways to handle them:
Specify at Parse Time
Pass the from_encoding parameter when creating the soup:
soup = BeautifulSoup(doc, 'html.parser', from_encoding='utf-8')
This handles any decoding needed when initially parsing the document.
Encode Tag Contents
You can encode the contents of a tag:
tag.string.encode("utf-8")
Use this when outputting tag strings.
Encode Entire Document
To encode the entire BeautifulSoup document:
soup.encode("utf-8")
This returns a byte string with the encoded document.
Pretty Print with Encoding
Specify an encoding when pretty printing output (this returns bytes rather than str):
print(soup.prettify(encoding="utf-8"))
Unicode Dammit
BeautifulSoup's UnicodeDammit class can detect and convert incoming documents to Unicode:
from bs4 import UnicodeDammit
dammit = UnicodeDammit(doc)
soup = BeautifulSoup(dammit.unicode_markup, 'html.parser')
This converts even poorly encoded documents to Unicode.
Properly handling encoding ensures your scraped data is decoded and output correctly when using BeautifulSoup.