When scraping web pages, you'll often want to extract just the text content without all the surrounding HTML tags. Here's how to use BeautifulSoup to cleanly strip out tags and isolate the text.
The get_text() Method
The simplest way is to use the get_text() method:
from bs4 import BeautifulSoup
html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())
# "Example text with some tags"
This strips out all tags and returns just the text.
Stripping Tags from Strings
You can also call get_text() on a single tag instead of the whole document:
text = soup.b
print(text.get_text())
# "with"
Use this when you only need the text of one element.
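One detail worth checking when working with single elements is the difference between the .string attribute and get_text(). The sketch below reuses the sample HTML from this article; the variable names single, mixed, and full are only illustrative.

```python
from bs4 import BeautifulSoup

html = "<p>Example text <b>with</b> <i>some</i> tags</p>"
soup = BeautifulSoup(html, "html.parser")

# <b> has exactly one child (a string), so .string returns it.
single = soup.b.string
print(single)  # with

# <p> has several children (strings and tags), so .string is None.
mixed = soup.p.string
print(mixed)  # None

# get_text() works on any tag, however many children it has.
full = soup.p.get_text()
print(full)  # Example text with some tags
```

Because .string can be None, calling a method on it without checking first can raise an AttributeError. Calling get_text() on the tag itself avoids that.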
Removing Whitespace
To also strip excess whitespace and newline characters:
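print(soup.get_text(" ", strip=True))
# "Example text with some tags"
The strip=True argument trims whitespace from the start and end of each piece of text and drops pieces that end up empty. The first argument is the separator used to join the pieces; without it they are joined with nothing in between, and the words run together.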
Extracting HTML Attributes
To extract specific HTML attributes from tags:
for link in soup.find_all('a'):
    print(link.get('href'))  # Prints attribute value
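This loops through every <a> tag in the document and prints the value of its href attribute. If a tag has no href, get() returns None.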
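The sample HTML used earlier in this article contains no links, so here is a self-contained sketch with a few <a> tags, one of which deliberately lacks an href. The URLs and variable names are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle anchor has no href attribute.
html = (
    '<p><a href="https://example.com/a">A</a> '
    '<a name="top">no href</a> '
    '<a href="/b">B</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# get() returns None when the attribute is missing.
all_hrefs = [link.get("href") for link in soup.find_all("a")]
print(all_hrefs)  # ['https://example.com/a', None, '/b']

# href=True restricts the search to tags that have the attribute,
# so indexing with link["href"] cannot raise a KeyError here.
present = [link["href"] for link in soup.find_all("a", href=True)]
print(present)  # ['https://example.com/a', '/b']
```

Filtering with href=True is handy when the results feed straight into code that expects a string for every link.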
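In summary, get_text() removes tags and returns plain text, strip=True trims surrounding whitespace, and get() retrieves attribute values from tags.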