Web scraping is a useful technique for programmatically extracting data from websites. Often you need to scrape multiple pages from a site to gather complete information. In this article, we will see how to scrape multiple pages using Python and the BeautifulSoup library.
Prerequisites
To follow along, you'll need Python installed, plus the requests and beautifulsoup4 packages:
pip install requests beautifulsoup4
Import Modules
We'll need the requests library to fetch pages and BeautifulSoup from bs4 to parse the HTML:
import requests
from bs4 import BeautifulSoup
Define Base URL
We'll be scraping the Copyblogger blog. Its listing pages follow a simple numbered URL pattern:
https://copyblogger.com/blog/
https://copyblogger.com/blog/page/2/
https://copyblogger.com/blog/page/3/
Let's define a base URL pattern:
base_url = 'https://copyblogger.com/blog/page/{}/'
The {} placeholder lets us substitute each page number using str.format().
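For example, formatting in a page number produces a complete page URL:
url = base_url.format(2)
print(url)  # https://copyblogger.com/blog/page/2/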
Specify Number of Pages
Next, we'll specify how many pages we want to scrape. Let's scrape the first 5 pages:
num_pages_to_scrape = 5
Loop Through Pages
We can now loop from 1 to num_pages_to_scrape and build each page's URL:
for page_num in range(1, num_pages_to_scrape + 1):
    # Construct the URL for this page
    url = base_url.format(page_num)
    # Code to scrape each page goes here
Send Request and Check Response
Inside the loop, we'll use requests.get() to fetch each page.
We'll check that the response status code is 200 to ensure the request succeeded:
response = requests.get(url)
if response.status_code == 200:
    pass  # Scrape the page (covered next)
else:
    print(f"Failed to retrieve page {page_num}")
Parse HTML Using BeautifulSoup
If the request succeeds, we can parse the HTML using BeautifulSoup:
soup = BeautifulSoup(response.text, 'html.parser')
This creates a BeautifulSoup object representing the parsed page, which we can search and navigate.
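As a quick illustration, the soup object can be queried directly (the lookups here are just examples, not part of our scraper):
print(soup.title.text)       # the page's <title> text
links = soup.find_all('a')   # every link on the page
print(len(links))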
Extract Data
Now, within the loop, we can use BeautifulSoup methods like find() and find_all() to extract data from the page.
For example, to get all the article elements:
articles = soup.find_all('article')
We can loop through the articles and extract information such as the title, URL, and author.
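As a minimal sketch using the title selector from the full code below (it matches Copyblogger's theme markup), with a guard in case an article doesn't match:
for article in articles:
    heading = article.find('h2', class_='entry-title')
    if heading is None:
        continue  # skip anything that doesn't match the expected markup
    print(heading.text.strip())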
Full Code
Our full code to scrape 5 pages looks like:
import requests
from bs4 import BeautifulSoup

base_url = 'https://copyblogger.com/blog/page/{}/'
num_pages_to_scrape = 5

for page_num in range(1, num_pages_to_scrape + 1):
    url = base_url.format(page_num)
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        for article in articles:
            # Extract data from each article (selectors match Copyblogger's markup)
            title = article.find('h2', class_='entry-title').text.strip()
            author = article.find('div', class_='post-author').find('a').text.strip()
            print(title)
            print(author)
    else:
        print(f"Failed to retrieve page {page_num}")
This allows us to scrape and extract data from multiple pages sequentially. The full code can be extended to scrape any number of pages.
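For example, one way to extend it is to keep going until a page comes back empty instead of hard-coding a page count. A minimal sketch, reusing the imports, base_url, and selectors from above:
all_titles = []
page_num = 1
while True:
    response = requests.get(base_url.format(page_num))
    if response.status_code != 200:
        break  # no more pages, or the request failed
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = [h.text.strip() for h in soup.find_all('h2', class_='entry-title')]
    if not titles:
        break  # an empty page usually means we've run past the last one
    all_titles.extend(titles)
    page_num += 1
print(f"Collected {len(all_titles)} titles across {page_num - 1} pages")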
Summary
Web scraping enables collecting large datasets that can be analyzed programmatically. With the techniques covered here, you can scrape and extract information from multiple pages of a website in Python.
# Updated full code
import requests
from bs4 import BeautifulSoup

base_url = 'https://copyblogger.com/blog/page/{}/'
num_pages_to_scrape = 5

for page_num in range(1, num_pages_to_scrape + 1):
    url = base_url.format(page_num)
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = soup.find_all('article')
        for article in articles:
            # Extract the article title
            title = article.find('h2', class_='entry-title').text.strip()
            # Extract the article URL
            article_url = article.find('a', class_='entry-title-link')['href']
            # Extract the author's name
            author_name = article.find('div', class_='post-author').find('a').text.strip()
            # Find the categories container div
            categories_container = article.find('div', class_='entry-categories')
            # Extract the categories
            if categories_container:
                categories = [cat.text.strip() for cat in categories_container.find_all('a')]
            else:
                categories = []
            # Print extracted information
            print("Title:", title)
            print("URL:", article_url)
            print("Author:", author_name)
            print("Categories:", categories)
            print("\n")
    else:
        print(f"Failed to retrieve page {page_num}")
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This lets you scrape at scale without the headache of IP blocks. Proxies API has a free tier to get you started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.