While BeautifulSoup is mainly designed for parsing HTML, it can also handle XML documents quite well with just a little configuration. Here's how to leverage BeautifulSoup for scraping and analyzing XML files or responses.
Loading the XML
Loading an XML document into a BeautifulSoup object is the same process as with HTML:
from bs4 import BeautifulSoup
with open("file.xml") as f:
data = f.read()
soup = BeautifulSoup(data, "xml")
Notice that we explicitly tell BeautifulSoup to use the "xml" parser instead of the default HTML parser.
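The "xml" parser is provided by lxml, so lxml needs to be installed alongside BeautifulSoup (for example with pip install lxml). As a quick sketch, parsing an XML string directly works the same way; the tiny document below is just an illustration:
from bs4 import BeautifulSoup
doc = "<note><to>Ada</to><body>Hello</body></note>"  # small illustrative document
soup = BeautifulSoup(doc, "xml")
print(soup.note.body.text)  # prints "Hello"
Unlike the HTML parsers, the XML parser keeps tag names case-sensitive.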
Navigating the Tree
You can navigate and search the parsed XML tree using the same methods as HTML:
titles = soup.find_all("title")
first_title = titles[0]
print(first_title.text)
The tag and attribute names will match those defined in the XML.
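For instance, assuming the document contains book elements like the catalog example later in this section, you can drill into the tree with dot notation, read attributes with dictionary-style access, and iterate over child tags; this is just a sketch:
first_book = soup.find("book")                 # first <book> element, assuming one exists
print(first_book["id"])                        # dictionary-style access to an attribute
print(first_book.title.text)                   # dot notation finds the first <title> inside it
for child in first_book.find_all(recursive=False):
    print(child.name)                          # names of the direct child tags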
Finding by Attributes
Searching by attributes works the same:
songs = soup.find_all("song", {"length": "short"})
This finds all song elements whose length attribute is set to "short".
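Attribute filters can also be passed as keyword arguments or as regular expressions; the song tags and length values here are just the illustrative names from the snippet above:
import re
short_songs = soup.find_all("song", length="short")                 # keyword-argument form
some_songs = soup.find_all("song", length=re.compile("short|medium"))  # regex match on the attribute value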
Modifying the Tree
You can also modify and add to the XML tree:
new_tag = soup.new_tag("priority")
new_tag.string = "urgent"
first_title.append(new_tag)
This adds a new priority element containing the text "urgent" inside the first title tag.
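Beyond appending new tags, BeautifulSoup supports other edits such as setting attributes and deleting elements. A brief sketch, reusing the illustrative song tags from above:
new_tag["level"] = "high"                        # set an attribute on the new <priority> tag
first_short = soup.find("song", {"length": "short"})
if first_short is not None:
    first_short.decompose()                      # remove the tag and its contents from the tree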
Outputting XML
To output the modified XML document, use prettify():
print(soup.prettify())
This will print out the new XML with indentation.
You can also convert a BeautifulSoup XML object back into a string, perform additional processing, and write it back out to a file.
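For example, a minimal sketch of writing the serialized XML back to disk (the output filename here is arbitrary):
xml_text = str(soup)                             # compact serialization; soup.prettify() gives indented output
with open("modified.xml", "w", encoding="utf-8") as f:
    f.write(xml_text)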
Here is an example demonstrating parsing an XML file with BeautifulSoup and extracting some data:
from bs4 import BeautifulSoup
xml = """
<catalog>
<book id="1">
<author>Mark Twain</author>
<title>The Adventures of Huckleberry Finn</title>
<genre>Novel</genre>
<price>7.99</price>
</book>
<book id="2">
<author>J.K. Rowling</author>
<title>Harry Potter and the Philosopher's Stone</title>
<genre>Fantasy</genre>
<price>6.99</price>
</book>
</catalog>
"""
# Load XML and parse
soup = BeautifulSoup(xml, "xml")
# Find all book tags
books = soup.find_all('book')
# Print out author and title for each book
for book in books:
    author = book.find("author").text
    title = book.find("title").text
    print(f"{title} by {author}")
This would print:
The Adventures of Huckleberry Finn by Mark Twain
Harry Potter and the Philosopher's Stone by J.K. Rowling
We locate the book tags with find_all(), then pull the author and title text out of each one.
Here is an example of parsing the XML and displaying the extracted book data in a table using BeautifulSoup and Pandas:
from bs4 import BeautifulSoup
import pandas as pd
xml = """
<catalog>
<book id="1">
<author>Mark Twain</author>
<title>The Adventures of Huckleberry Finn</title>
<genre>Novel</genre>
<price>7.99</price>
</book>
<book id="2">
<author>J.K. Rowling</author>
<title>Harry Potter and the Philosopher's Stone</title>
<genre>Fantasy</genre>
<price>6.99</price>
</book>
</catalog>
"""
soup = BeautifulSoup(xml, 'xml')
books = []
for book in soup.find_all('book'):
    book_data = {
        "id": book['id'],
        "author": book.find('author').text,
        "title": book.find('title').text,
        "genre": book.find('genre').text,
        "price": float(book.find('price').text)
    }
    books.append(book_data)
df = pd.DataFrame(books)
print(df)
We extract each book's fields into a dictionary, collect the dictionaries in a list, then convert the list to a pandas DataFrame for a clean tabular display.
This provides a simple way to parse XML and view the extracted data in table format using Python. The DataFrame could also easily be output to CSV or other formats.
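For instance, a one-line sketch of saving the table to CSV (the filename is arbitrary):
df.to_csv("books.csv", index=False)  # write the DataFrame to a CSV file without the index column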
Here is an example of using BeautifulSoup to parse an RSS feed and save the extracted data to a CSV file:
import requests
from bs4 import BeautifulSoup
import csv
feed_url = "<https://www.example.com/feed.rss>"
response = requests.get(feed_url)
soup = BeautifulSoup(response.content, "xml")
items = soup.find_all("item")
# Write the extracted fields to a CSV file
with open("feed.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Title", "Link", "Published"])
    for item in items:
        title = item.find("title").text
        link = item.find("link").text
        pub_date = item.find("pubDate").text
        csv_writer.writerow([title, link, pub_date])
This loads and parses the RSS feed, then extracts the title, link, and publish date from each item element in the feed.
We write this data out row by row into a CSV file using the csv module.
The end result is a feed.csv file containing the feed's entries in tabular format.
This demonstrates how BeautifulSoup can easily parse and extract data from XML formats like RSS into structured datasets readable by other programs.