Email addresses are often hidden on websites to avoid spam bots. But sometimes you need to contact someone and can't easily find their email. This is where Python web scraping can help uncover those hidden emails.
We'll use the
Inspecting the Page
First, we'll use the browser's inspector to examine the page and find potential emails. Often they're obfuscated in the HTML or JavaScript.
import requests
from bs4 import BeautifulSoup
import re
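url = "http://example.com"
# Fetch the page; the timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')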
Now search through the HTML for email-like patterns. An email address has a local part, an @ symbol, and a domain name.
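Before reaching for a regex, it can be worth checking the page's mailto: links, since those hold addresses in a predictable place. The following is a minimal sketch, not the only way to do it: it uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra installs, and the MailtoCollector class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class MailtoCollector(HTMLParser):
    """Collect addresses from <a href="mailto:..."> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("mailto:"):
                # Drop the scheme and any "?subject=..." query, then percent-decode.
                address = unquote(value[len("mailto:"):].split("?", 1)[0]).strip()
                if address and address not in self.addresses:
                    self.addresses.append(address)


# Made-up sample markup; in the script above you would feed response.text instead.
sample_html = """
<p>Questions? <a href="mailto:jane.doe@example.com?subject=Hello">Email Jane</a></p>
<p><a href="MAILTO:support%40example.com">Support</a></p>
<p><a href="/contact">Contact form</a></p>
"""

collector = MailtoCollector()
collector.feed(sample_html)
mailto_addresses = collector.addresses
print(mailto_addresses)
```

Addresses that never appear in a mailto: link still need a pattern search over the page source.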
Writing the Regex
We can write a regex to match common email formats. The pattern below covers the local part (plain or quoted), the @ symbol, and the domain, written either as a dotted name such as example.com or as a bracketed IP address.
email_regex = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
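# The pattern lists only lowercase letters, so match case-insensitively
emails = re.findall(email_regex, response.text, re.IGNORECASE)
This scans the raw HTML of the page and extracts every substring matching the email pattern. Because the pattern contains no capturing groups, re.findall returns each full match as a string.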
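Obfuscated addresses often slip past a direct match, for example when the @ is written as an HTML entity or spelled out as "[at]". The sketch below shows one possible clean-up pass before matching. It makes several assumptions: the obfuscation styles it handles are only the ones listed in the comments, SIMPLE_EMAIL is a deliberately simplified stand-in for the full regex above, and find_emails and the sample string are hypothetical.

```python
import html
import re

# Simplified pattern for illustration only; it is much looser than the full
# regex above and will miss unusual but valid addresses.
SIMPLE_EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+", re.IGNORECASE)

# Assumed obfuscation styles: "[at]", "(at)", "[dot]", "(dot)", with optional spaces.
AT_TOKEN = re.compile(r"\s*[\[(]\s*at\s*[\])]\s*", re.IGNORECASE)
DOT_TOKEN = re.compile(r"\s*[\[(]\s*dot\s*[\])]\s*", re.IGNORECASE)


def find_emails(raw_html):
    """Return unique, lowercased addresses in first-seen order."""
    text = html.unescape(raw_html)   # "&#64;" -> "@", "&#46;" -> "."
    text = AT_TOKEN.sub("@", text)   # "jane [at] example" -> "jane@example"
    text = DOT_TOKEN.sub(".", text)  # "example [dot] com" -> "example.com"
    seen = {}
    for match in SIMPLE_EMAIL.findall(text):
        seen.setdefault(match.lower(), None)  # dict preserves insertion order
    return list(seen)


# Made-up sample; in the script above you would pass response.text.
sample = (
    "<p>Sales: sales&#64;example&#46;com</p>"
    "<p>Jane: jane.doe [at] example [dot] com</p>"
    "<p>Again: Sales@Example.com</p>"
)
found = find_emails(sample)
print(found)
```

Lowercasing before de-duplication treats Sales@Example.com and sales@example.com as the same contact, which is usually what you want in a contact list. Replacing "(at)" tokens can also produce false positives on ordinary prose, so review the results by hand.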
Handling JavaScript
For JavaScript heavy sites,
Web scraping takes trial and error. Inspecting the pages, writing regular expressions, and handling JavaScript can uncover those hidden email contacts. With some Python and perseverance, you can find what you need.