The get_text() method in Python's BeautifulSoup library extracts text from HTML and XML documents. It has a few nuances that are worth understanding before you rely on it for web scraping or text extraction.
What get_text() Does
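The get_text() method returns all the text in a document, or beneath a single tag, as one string with the markup removed. For example: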
from bs4 import BeautifulSoup
html = '<p>This is a <b>paragraph</b> with <a href="#">a link</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())
# Outputs: This is a paragraph with a link.
So it extracts just the raw text content.
Stripping Whitespace
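By default, get_text() keeps the whitespace found in the document. You can use the strip=True argument to strip leading and trailing whitespace from each piece of text.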
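A minimal sketch of whitespace stripping follows. The sample HTML and the variable names are made up for illustration. Note that strip=True trims each piece of text separately and then joins the pieces with the separator, which is empty by default, so adjacent words can run together unless a separator is supplied.

```python
from bs4 import BeautifulSoup

# Illustrative markup with extra whitespace around the text nodes.
html = "<p>\n   Hello,   <b> world </b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# Default: whitespace from the document is preserved as-is.
raw = soup.get_text()

# strip=True trims every text node and drops the empty ones,
# then joins what is left with the (empty) default separator.
stripped = soup.get_text(strip=True)

# Supplying a separator keeps the trimmed pieces apart.
spaced = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(stripped))  # 'Hello,world'
print(repr(spaced))    # 'Hello, world'
```

Combining separator=" " with strip=True is usually the more readable choice for prose.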
Handling Nested Tags
html = '<div><p>Paragraph 1</p><p>Paragraph 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())
# Outputs: Paragraph 1Paragraph 2
# (the pieces are joined with an empty separator by default)
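Because get_text() joins the text of nested tags with an empty string unless told otherwise, getting one paragraph per line requires the separator argument. The sketch below reuses the markup from the example above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<div><p>Paragraph 1</p><p>Paragraph 2</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default separator is the empty string, so the paragraphs run together.
joined = soup.get_text()

# A newline separator puts each text node on its own line.
lines = soup.get_text(separator="\n")

# stripped_strings yields each non-empty text node, already trimmed,
# which is handy when the pieces are needed as a list.
pieces = list(soup.stripped_strings)

print(repr(joined))  # 'Paragraph 1Paragraph 2'
print(repr(lines))   # 'Paragraph 1\nParagraph 2'
print(pieces)        # ['Paragraph 1', 'Paragraph 2']
```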
The text of both <p> tags is extracted.
Invisible Text
Some text, such as the contents of <script> and <style> tags, is ignored by default. This depends on the library version and parser: it applies to Beautiful Soup 4.9.0 and later when the html.parser or lxml parser is in use.
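Since the default handling of script and style contents varies with the Beautiful Soup version and parser, a defensive approach is to remove those tags explicitly before extracting text. This is a sketch under that assumption, not the only way to do it; the sample page and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative page with non-visible content in <style> and <script>.
html = """<html><head><style>p { color: red; }</style>
<script>var tracking = 1;</script></head>
<body><p>Visible text</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object like a function is shorthand for find_all().
# decompose() removes each matched tag and its contents from the tree.
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(strip=True)
print(text)  # Visible text
```

Note that get_text() has no knowledge of CSS, so elements hidden through styling are still returned as text.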
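Multiple vs First Text Nodes
Calling get_text() on a tag that contains several text nodes returns all of them joined into one string. While get_text() always gathers every text node, the .string attribute only returns a value when the tag has a single child, and is None otherwise.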
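The difference between collecting every text node and taking only the first one can be sketched as follows. The markup is the paragraph from the first example; the variable names are illustrative, and next(p.strings) is one possible way to get the first text node rather than the only one.

```python
from bs4 import BeautifulSoup

html = '<p>This is a <b>paragraph</b> with <a href="#">a link</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# get_text() concatenates every text node beneath the tag.
all_text = p.get_text()

# .strings yields the text nodes one at a time, in document order,
# so next() gives just the first one.
first = next(p.strings)

# .string is None when a tag has more than one child...
only = p.string

# ...but returns the text when the tag has a single text child.
bold = p.b.string

print(repr(all_text))  # 'This is a paragraph with a link.'
print(repr(first))     # 'This is a '
print(only)            # None
print(repr(bold))      # 'paragraph'
```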
Conclusion