In this tutorial you'll build robust web scrapers using libraries like BeautifulSoup, learn techniques for overcoming real-world scraping challenges, and pick up best practices for large-scale scraping.
You'll gain the skills to scrape complex sites and handle issues like rate limits, blocks, and JavaScript-rendered pages.
Why Python for Web Scraping?
Python is a popular language for web scraping thanks to its readable syntax, gentle learning curve, and mature ecosystem of scraping libraries.
In contrast, languages like C++ require more effort for basic scraping tasks. JavaScript platforms like Node.js can be complex for beginners when building scraping scripts.
Python's simplicity, power, and interoperability make it ideal for scraping needs. Its high-quality libraries let you start scraping at scale quickly.
Best Python Web Scraping Libraries
Some of the most popular and robust Python libraries for web scraping are:
BeautifulSoup
Scrapy
Selenium
lxml
pyquery
Prerequisites
To follow along with the code examples in this article, you will need Python 3 and pip installed on your machine.
Virtual Environment (Recommended)
While optional, we highly recommend creating a virtual environment for the project:
python -m venv my_web_scraping_env
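Then activate it; the exact command depends on your operating system:

# macOS / Linux
source my_web_scraping_env/bin/activate

# Windows
my_web_scraping_env\Scripts\activate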
The Libraries
We will primarily be using the Requests and BeautifulSoup libraries, along with Python's built-in os module:
pip install requests beautifulsoup4
This will fetch the libraries from PyPI and install them locally.
With the prerequisites installed, you are all set! Let's start scraping.
Let's pick a target website
For demonstration purposes, we will be scraping the List of dog breeds page on Wikimedia Commons to extract information about various dog breeds.
The rationale behind choosing this page is that it serves well-structured, tabular data (breed name, group, local name and a photograph) on a static page, which makes it an ideal first scraping target.
Other structured, public pages make great scraping practice too, and the concepts covered will be applicable across any site.
Write the scraping code
Let's now closely examine the full code to understand how to systematically scrape data from the dog breeds webpage.
# Full code
import os
import requests
from bs4 import BeautifulSoup

url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

# Headers to masquerade as a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

# Download page HTML using requests
response = requests.get(url, headers=headers)

# Check valid response received
if response.status_code == 200:

    # Parse HTML using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the main breeds table by its classes
    table = soup.find('table', {'class': 'wikitable sortable'})

    # Initialize data lists to store scraped info
    names = []
    groups = []
    local_names = []
    photographs = []

    # Create directory to store images
    os.makedirs('dog_images', exist_ok=True)

    # Loop through rows, omitting the header
    for row in table.find_all('tr')[1:]:

        # Extract the cells in each row
        columns = row.find_all(['td', 'th'])

        name = columns[0].find('a').text.strip()
        group = columns[1].text.strip()

        # Extract local name if it exists
        span_tag = columns[2].find('span')
        local_name = span_tag.text.strip() if span_tag else ''

        # Extract photo URL if it exists
        img_tag = columns[3].find('img')
        photograph = img_tag['src'] if img_tag else ''

        # Image URLs may be protocol-relative (start with //); add the scheme if so
        if photograph.startswith('//'):
            photograph = 'https:' + photograph

        # Download + save the image if a URL exists
        if photograph:
            response = requests.get(photograph)
            if response.status_code == 200:
                image_filename = os.path.join('dog_images', f'{name}.jpg')
                with open(image_filename, 'wb') as img_file:
                    img_file.write(response.content)

        names.append(name)
        groups.append(group)
        local_names.append(local_name)
        photographs.append(photograph)

    print(names)
    print(groups)
    print(local_names)
    print(photographs)
The imports bring in the libraries that provide HTTP request functionality (requests), HTML parsing (BeautifulSoup from bs4) and file-system access (os).
The os module is used to create the dog_images directory and build file paths for the downloaded photos, while requests fetches pages and images and Beautiful Soup parses the returned HTML.
Together they form a very handy toolkit for scraping!
Downloading the page
We first construct the target URL and request headers, then download the page with a requests GET call:
url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
response = requests.get(url, headers=headers)
We also set up a custom User-Agent header so the request looks like it is coming from a regular desktop browser rather than the default python-requests client.
After getting the response, we can check the status code to ensure we received a proper HTML document:
if response.status_code == 200:
    # Success!
    print(response.text)
In case of an error status (e.g. 404 or 500), we do not proceed with scraping and instead handle the failure.
Parsing the HTML
Since we received a valid HTML response, we can parse the text content using Beautiful Soup:
soup = BeautifulSoup(response.text, 'html.parser')
Beautiful Soup transforms the messy HTML into a parse tree that mirrors the DOM structure of tags, attributes and text. We can use CSS selectors and traversal methods to quickly isolate the data we need from this tree.
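To get a feel for the parse tree, here is a small illustrative snippet using the soup object we just created; the specific tags queried (title, h1, a) are only examples:

# The page title tag and its text
print(soup.title)
print(soup.title.text)

# The first h1 heading on the page
print(soup.find('h1').text)

# How many links the page contains
print(len(soup.find_all('a')))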
The Magic of Selectors for Data Extraction
One of the most magical parts of web scraping with Python's BeautifulSoup library is using CSS selectors to extract specific content from HTML pages.
Selectors allow us to visually target the tags enclosing the data we want scraped. BeautifulSoup makes selecting elements a breeze.
For example, consider extracting book titles from this snippet:
<div class="book-listing">
<img src="/covers/harry-potter.jpg">
<span class="title">Harry Potter and the Goblet of Fire</span>
<span class="rating">9.1</span>
</div>
<div class="book-listing">
<img src="/covers/lord-of-the-rings.jpg">
<span class="title">The Fellowship of the Ring</span>
<span class="rating">9.3</span>
</div>
We can directly target the title spans nested inside each book listing:
soup.select("div.book-listing > span.title")
This says: find all span elements with the class title that are direct children of div elements with the class book-listing.
And voila, we have exactly the titles isolated:
[<span class="title">Harry Potter and the Goblet of Fire</span>,
<span class="title">The Fellowship of the Ring</span>]
We can then pull the text out of each matched tag to isolate just the title strings.
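For example, a quick list comprehension over the selector results (an illustrative snippet):

titles = [tag.get_text() for tag in soup.select("div.book-listing > span.title")]
print(titles)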
['Harry Potter and the Goblet of Fire', 'The Fellowship of the Ring']
Selectors provide incredible precision during data extraction by leveraging the innate hierarchy of structured HTML tags.
Some other examples of selectors:
# Select by id attribute
soup.select("#book-title")

# Attribute equality match
soup.select('a[href="/login"]')

# Attribute prefix match (class starts with "title")
soup.select('span[class^="title"]')

# Select direct children
soup.select("ul > li")
As you can see, by mastering the different selector types and combining multiple selectors where needed, you gain immense power to zero in on and extract the exact data you need from any HTML document, eliminating nearly all guesswork. Let's get back to the task at hand now…
Finding the table
Looking at the raw HTML, we notice the breed data lives in a table element carrying the classes wikitable and sortable.
We can simply select this using:
table = soup.find('table', {'class': 'wikitable sortable'})
This searches the parse tree for the first table tag whose class attribute matches wikitable sortable and returns it.
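Equivalently, the same table could be grabbed with a CSS selector via select_one, which some find more readable:

table = soup.select_one('table.wikitable.sortable')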
Extracting all the fields
With the table isolated, we loop through every row, skipping the header row:
for row in table.find_all('tr')[1:]:
    columns = row.find_all(['td', 'th'])
    name = columns[0].find('a').text.strip()
    group = columns[1].text.strip()
Here, row.find_all(['td', 'th']) collects every cell in the row, whether it is a data cell or a header cell.
Using positional indexes in this columns list, we can extract the data within each cell cleanly:
name = columns[0].find('a').text.strip()
This grabs the anchor tag inside the first cell and extracts its text, stripping surrounding whitespace to get the breed name.
Similarly for cells containing just text:
group = columns[1].text.strip()
We fetch the raw text of the second cell and strip whitespace to get the breed's group.
The power of CSS selectors in quickly isolating specific tags, ids, classes or attributes makes data extraction very precise and straightforward in Beautiful Soup.
Downloading and saving the images
After scraping the textual data like name and group in each row, we check the last cell for an image link:
img_tag = columns[3].find('img')
photograph = img_tag['src'] if img_tag else ''
This looks for an img tag in the fourth cell and, if one exists, reads its src attribute.
We can then download and save images using this url if present:
if photograph:
    response = requests.get(photograph)
    image_filename = os.path.join('dog_images', f'{name}.jpg')
    with open(image_filename, 'wb') as img_file:
        img_file.write(response.content)
We reuse requests to download the image bytes and write them to a file named after the breed inside the dog_images directory.
And that's it! By combining requests and Beautiful Soup, we extracted the name, group, local name and photograph for every breed on the page.
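As a possible refinement, large images can be streamed to disk in chunks instead of being held fully in memory. Here is a sketch using requests' stream mode and a timeout; the download_image helper is our own illustration, not part of the script above:

def download_image(url, filepath):
    # Stream the response so large files are written in chunks
    with requests.get(url, stream=True, timeout=10) as resp:
        if resp.status_code == 200:
            with open(filepath, 'wb') as f:
                for chunk in resp.iter_content(chunk_size=8192):
                    f.write(chunk)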
Alternative libraries and tools for web scraping
While requests and BeautifulSoup form the most popular combination, here are some alternatives worth considering:
Scrapy
An open-source, modular scraping framework meant for large-scale crawling. It handles throttling, cookies and request scheduling out of the box, with proxy rotation available through middleware. Recommended for complex needs.
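For a taste of the framework, here is a minimal sketch of a Scrapy spider scraping the same breeds table; the spider name, selectors and output fields are illustrative assumptions, and error handling is omitted:

import scrapy

class DogBreedsSpider(scrapy.Spider):
    # Minimal sketch: yields one item per breed row
    name = "dog_breeds"
    start_urls = ["https://commons.wikimedia.org/wiki/List_of_dog_breeds"]

    def parse(self, response):
        # Skip the header row, then emit name and group for each breed
        for row in response.css("table.wikitable.sortable tr")[1:]:
            cells = row.css("td, th")
            yield {
                "name": cells[0].css("a::text").get(default="").strip(),
                "group": cells[1].css("::text").get(default="").strip(),
            }

It can be run with scrapy runspider dog_breeds_spider.py -o breeds.json, which writes the scraped items to a JSON file.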
Selenium
Performs actual browser automation by controlling Chrome, Firefox and other browsers. Enables scraping dynamic content that renders via JavaScript, at the cost of a more complex setup.
pyppeteer
Headless browser automation, like Selenium, driven through Python code. Good for JavaScript-rendered websites.
pyquery
Offers jQuery-style element selection. Scraping code looks very clean thanks to a chaining syntax similar to jQuery's.
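A minimal sketch of the pyquery style, reusing the book-listing markup from earlier (assumes the pyquery package is installed):

from pyquery import PyQuery as pq

html = '<div class="book-listing"><span class="title">Harry Potter and the Goblet of Fire</span></div>'
doc = pq(html)

# jQuery-style chained selection
print(doc('div.book-listing span.title').text())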
lxml
A very fast XML/HTML parser. Great when raw parsing performance is critical.
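And a minimal lxml sketch using an XPath query on the same kind of markup:

from lxml import html

page = html.fromstring('<div class="book-listing"><span class="title">The Fellowship of the Ring</span></div>')

# XPath query returning the title text nodes
titles = page.xpath('//span[@class="title"]/text()')
print(titles)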
Challenges of Web Scraping in the real world: Some tips & best practices
While basic web scraping is easy, building robust, production-grade, scalable crawlers brings its own challenges:
Handling Dynamic Content
Many websites rely heavily on JavaScript to render content dynamically, and static scraping then fails. Solutions: use browser automation tools like Selenium, or scraper-specific solutions like Scrapy's Splash integration.
Here is a simple Hello World example to handle dynamic content using Selenium browser automation:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize the Chrome webdriver
driver = webdriver.Chrome()

# Load the page
driver.get("https://example.com")

# Implicitly wait up to 10 seconds when locating elements,
# giving dynamic JS content time to load
driver.implicitly_wait(10)

# Selenium can extract dynamically loaded elements
print(driver.title)

# Selenium allows clicking buttons, triggering JS events
driver.find_element(By.ID, "dynamicBtn").click()

# Inputs can be handled as well
search = driver.find_element(By.NAME, 'search')
search.send_keys('Automate using Selenium')
search.submit()

# Tear down the browser when done
driver.quit()
The key capabilities offered by Selenium here are:
- Launches a real Chrome browser to load JavaScript
- Finds elements only available after execution of JS
- Can interact with the page by clicking, entering text and so on, thereby triggering JavaScript events
- Experience mimics an actual user browsing dynamically generated content
Together this allows handling complex sites driven primarily by JavaScript. Selenium provides full programmatic control over a real browser, so the scraper sees the same fully rendered content a user would.
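When an element appears only after a specific script runs, an explicit wait is usually more reliable than an implicit one. Here is a brief sketch using Selenium's WebDriverWait and expected conditions; it assumes the driver and the dynamicBtn element from the example above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the dynamic button is present in the DOM
button = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamicBtn"))
)
button.click()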
Getting Blocked
Websites often block scrapers by banning IP ranges or by detecting characteristic bot activity through heuristics. Solutions: slow down requests, properly mimic browsers, and rotate user agents and proxies.
Rate Limiting
Servers fight overload by restricting the number of requests served per unit of time. Hitting these limits leads to temporary bans or denied requests. Solutions: honor crawl delays, use proxies and ration requests appropriately.
Many websites have protection mechanisms that temporarily block scrapers when they detect too many frequent requests coming from a single IP address.
We can avoid tripping these rate limits by adding throttling, proxies and random delays to our code. Here is sample code to handle rate limiting while scraping:
import time
import random
from urllib.request import ProxyHandler, build_opener

# List of free public proxies
PROXIES = ["104.236.141.243:8080", "104.131.178.157:8085"]

def get_request():
    # Pause 5-15 seconds between requests randomly
    time.sleep(random.randint(5, 15))

    # Route the request through a randomly chosen proxy
    proxy = random.choice(PROXIES)
    opener = build_opener(ProxyHandler({'https': f'http://{proxy}'}))
    resp = opener.open("https://example.com")
    return resp

for i in range(50):
    response = get_request()
    print("Request Success")
Here each request first waits for a random interval before executing. This prevents continuous rapid requests.
We also route each request through a randomly chosen proxy server, distributing traffic across different IP addresses.
Together, throttling down overall crawl pace and distributing requests over different proxy IPs prevents hitting site-imposed rate limits.
Additional improvements like automatically detecting rate limit warnings in responses and reacting accordingly can enhance the scraper's resilience further.
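One such improvement is reacting to explicit rate-limit signals. The sketch below, an illustrative addition rather than part of the script above, backs off when a response comes back with HTTP 429 and respects a numeric Retry-After header when present:

import time
import requests

def get_with_backoff(url, max_retries=3):
    # Retry with increasing delays when the server signals rate limiting
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server-suggested wait time, else back off exponentially
        retry_after = resp.headers.get("Retry-After")
        wait = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt * 10
        time.sleep(wait)
    return resp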
Rotating User Agents
Websites often try to detect and block scraping bots by tracking characteristic user agent strings.
To prevent blocks, it is good practice to rotate multiple well-disguised user agents randomly to mimic a real browser flow.
Here is sample code to pick a random desktop user agent from a predefined list using Python's random library before making each request:
import requests
import random

# List of desktop user agents
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 OPR/43.0.2442.991",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7"
]

# Pick a random user agent string
user_agent = random.choice(user_agents)

# Set request headers with the chosen user agent before making the request
headers = {"User-Agent": user_agent}

url = "https://example.com"  # example target URL
response = requests.get(url, headers=headers)
By varying the user agent across requests and runs, websites have a much tougher time profiling the traffic as coming from an automated bot with a static user agent. This lets the scraper fly under the radar without getting blocked.
Additional enhancements include maintaining a much larger, regularly updated pool of user agent strings and keeping the other request headers consistent with whichever agent was chosen. With effective user agent rotation, scrapers enjoy better longevity before site administrators can profile and actively block them.
Browser Fingerprinting
Beyond simplistic user agent checks, websites have adopted advanced browser fingerprinting techniques to identify bots.
This involves browser attribute profiling: collecting information about device screen size, installed fonts, browser plugins and so on, which together form a browser fingerprint.
These fingerprints tend to remain largely consistent, stable and unique for standard tool-based bots and automation software.
Dynamic websites track fingerprints of scrapers accessing them. By detecting known crawler fingerprints they can block them even if the user agents are rotated constantly.
Minimizing detection risks
The main way to minimize exposing a scraper fingerprint is to randomize the attributes a site can measure, such as screen resolution, fonts, user agent and other request headers, on every session.
Essentially, by mimicking the natural randomness and variability of genuine user browsers, a scraper avoids being profiled as anything other than another standard browser.
Here is a code example to dynamically modify browser attributes to avoid fingerprinting:
from selenium import webdriver
import random

# List of common screen resolutions
screen_res = [(1366, 768), (1920, 1080), (1024, 768)]

# List of common font families
font_families = ["Arial", "Times New Roman", "Verdana"]

# Pick a random resolution
width, height = random.choice(screen_res)

# Create Chrome options
opts = webdriver.ChromeOptions()

# Set a random screen resolution
opts.add_argument(f"--window-size={width},{height}")

# Set a random user agent
opts.add_argument("--user-agent=Mozilla/5.0...")

# Vary the advertised font list (illustrative; not an officially documented Chrome switch)
random_fonts = random.choices(font_families, k=2)
opts.add_argument(f'--font-list="{random_fonts[0]};{random_fonts[1]}"')

# Initialize the driver with the randomized options
driver = webdriver.Chrome(options=opts)

# Access the webpage
target_url = "https://example.com"  # example target URL
driver.get(target_url)

# The webpage sees every scraper session originating
# from a distinct, unpredictable browser profile
Here we randomly configure our Selenium-controlled Chrome instance with a different screen size, user agent and font set per session.
And here is how you can randomize request fingerprints using Python Requests:
import requests
import random

# Device profiles
desktop_config = {
    'user-agent': 'Mozilla/5.0...',
    'accept-language': ['en-US,en', 'en-GB,en'],
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'upgrade-insecure-requests': '1',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'cache-control': 'max-age=0'
}

mobile_config = {
    'user-agent': 'Mozilla/5.0... Mobile',
    'accept-language': ['en-US,en', 'en-GB,en'],
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'x-requested-with': 'mark.via.gp',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.example.com/',
    'accept-encoding': 'gzip, deflate, br',
    'cache-control': 'max-age=0'
}

device_profiles = [desktop_config, mobile_config]

def build_headers():
    # Pick a random device profile per request
    profile = random.choice(device_profiles)
    headers = {
        'User-Agent': profile['user-agent'],
        'Accept-Language': random.choice(profile['accept-language']),
        # ... remaining headers copied from the chosen profile
    }
    return headers
Now, instead of hard-coding a single identity, the scraper randomly selects from plausible device profiles, including several identifying request headers, providing the realistic, human-like variation needed to avoid fingerprint tracking.
Parsing Complex HTML
Scrape targets often have complex HTML structures, obfuscated tags and heavy client-side rendering logic that break naive parsers. Solutions: careful inspection of the rendered source, using robust parsers like lxml, and refining selectors.
Here are some common kinds of bad HTML that scrape targets exhibit, and techniques to handle them:
Improper Nesting
HTML can often have incorrectly nested tags:
<b>Latest News <p>Impact of oil prices fall...</b></p>
Solution: Use a parser like lxml that handles bad nesting and uneven tags more robustly.
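Swapping the parser backend is a one-argument change in Beautiful Soup (assuming the lxml package is installed):

from bs4 import BeautifulSoup

broken_html = "<b>Latest News <p>Impact of oil prices fall...</b></p>"

# The lxml backend tolerates the improper nesting while building the tree
soup = BeautifulSoup(broken_html, "lxml")
print(soup.p.text)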
Broken Markup
Tags could be unclosed:
<div>
<span class="title">Python Web Scraping <span>
Lorem ipsum...
</div>
Solution: Target the tag by a stable attribute such as its class and let the parser recover the text despite the unclosed tag:
title = soup.find("span", class_="title").text
Non-standard Elements
Vendor specific unrecognized custom tags may exist:
<album>
<cisco:song>Believer</cisco:song>
</album>
Solution: Search for the namespaced tag by its full name:
song = soup.find("cisco:song").text
Non-text Content
Tables, images embedded between text tags:
<p>
Trending Now
<table>...</table>
</p>
Solution: Extract only the direct text nodes of the paragraph, ignoring the embedded elements:
paras = soup.find("p").find_all(string=True, recursive=False)
This picks out only the text nodes that are direct children of the paragraph, ignoring the table and any other nested elements.
As you can see, liberal use of selectors along with robust parsers provides the tools to handle even badly designed HTML and extract the required data reliably.
Other guidelines worth following include respecting robots.txt and site terms, honoring crawl delays, caching pages you have already fetched, and adding retries with sensible backoff.
Adopting these practices ensures reliable, resilient and responsible scraping operations.
Conclusion
In this comprehensive guide, we took an in-depth look at web scraping using Python. We covered fetching pages with Requests, parsing and selecting data with Beautiful Soup, downloading images, handling dynamic JavaScript content with Selenium, and dealing with real-world hurdles like rate limits, user agent rotation, browser fingerprinting and messy HTML.
By learning the core scraping paradigms, structuring your code properly and applying these optimization techniques, extracting accurate web data at scale in Python becomes an achievable skill!
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started; check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.