When using BeautifulSoup for web scraping in Python, you'll need to load the target HTML document into a BeautifulSoup object to start parsing and extracting data. Here's how to properly read an HTML file from disk using BeautifulSoup.
Opening the File
First, open the HTML file in read-binary mode:
with open("page.html", "rb") as file:
    html_doc = file.read()
The "rb" mode will read the HTML as raw bytes, which BeautifulSoup needs.
Creating the BeautifulSoup Object
Pass the raw HTML bytes into the BeautifulSoup constructor:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
This creates a BeautifulSoup object containing the document structure.
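From here you can start pulling data out of the tree. A small sketch (the tags shown are just examples; your document may contain different elements):

# Print the page title, if the document has one.
print(soup.title.string if soup.title else "No <title> found")

# Collect the href attribute of every link in the document.
links = [a.get("href") for a in soup.find_all("a")]
print(links)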
Choosing a Parser
The examples above use Python's built-in html.parser, which ships with the standard library. If they are installed, BeautifulSoup can also use third-party parsers such as lxml (generally faster) or html5lib (very lenient, parses markup the way a browser does).
For example:
soup = BeautifulSoup(html_doc, "lxml")
Direct String Input
For short samples, you can also pass a raw HTML string directly:
html_str = "<h1>Hello World</h1>"
soup = BeautifulSoup(html_str, "html.parser")
Great for testing code snippets.
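For instance, you can sanity-check a CSS selector against a tiny inline snippet before running it on a real page (the markup and selector here are made up for illustration):

snippet = '<ul><li class="item">one</li><li class="item">two</li></ul>'
soup = BeautifulSoup(snippet, "html.parser")
print([li.get_text() for li in soup.select("li.item")])  # ['one', 'two']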
Limitations
One limitation is that Beautiful Soup only parses the HTML it is given; it does not execute any JavaScript. For pages that build their content dynamically, a browser-automation tool such as Selenium may be needed to render the page first.
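A rough sketch of that workflow, assuming Selenium and a compatible browser driver are available (recent Selenium versions can manage the driver automatically) and using a placeholder URL: the browser renders the page, and the resulting HTML is handed to BeautifulSoup.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                     # assumes Chrome is installed
driver.get("https://example.com/dynamic-page")  # placeholder URL
rendered_html = driver.page_source              # HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(rendered_html, "html.parser")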
Overall, BeautifulSoup makes it very straightforward to load an HTML document ready for parsing and extraction. With the file loaded into a BeautifulSoup object, you can start navigating the tree and pulling out the data you need.