Scrapy is one of the easiest tools you can use to scrape and spider a website.
Today, let's look at one of the most common patterns in large-scale scraping projects, such as scraping a list of articles or blog posts: pagination. Typically, a single page shows only 10 or 20 items, and you will want to pull in all the pages as automatically as possible.
We will take the CopyBlogger blog as our example and see if we can run through all its pages without much sweat. We will use Scrapy for this because once we have this basic infrastructure in place, we can build almost anything on top of it.
Here is how the CopyBlogger blog section looks:
You can see that there are about 10 posts on each page and about 329 pages in total.
First, install Scrapy if you haven't already:
pip install scrapy
Once installed, we will add a simple file with some barebones code like so...
# -*- coding: utf-8 -*-
import scrapy


# A plain Spider is enough: we follow the next-page link ourselves,
# without CrawlSpider rules.
class SimpleNextPage(scrapy.Spider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        # Extraction code will go here.
        pass
Let's examine this code before we proceed...
The allowed_domains list restricts all further crawling to the domains specified here.
start_urls is the list of URLs to crawl... for us, in this example, we only need one URL.
The LOG_LEVEL setting makes Scrapy's output less verbose so it is easier to read.
The parse(self, response) method is the default callback: Scrapy calls it with the response for each URL in start_urls once it has been downloaded successfully. This is where we write the code to extract the data we want.
Now let's see what we can write in the parse function...
For this let's find the CSS patterns that we can use as selectors for finding the next page link on any page.
We will not use the page links titled 1,2,3 for this. It makes more sense to find the link inside the 'Next Page' button. It should then ALWAYS lead us to the next page reliably.
When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the link is inside an LI element with the CSS class pagination-next. This is good enough for us. We can just select this using the CSS selector function like this...
nextpage = response.css('.pagination-next').extract()
This gives us the HTML of the whole LI element, though, 'Next Page' label and all. What we need is the href of the 'a' tag inside the LI tag. So we modify it to this...
nextpage = response.css('.pagination-next a::attr(href)').extract()
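To make it concrete what that selector picks out, here is a small standard-library stand-in that runs without Scrapy. It is not how Scrapy implements CSS selectors, just a rough sketch of the same idea: collect the href of every 'a' tag nested inside an element carrying the pagination-next class. The sample markup and the NextLinkFinder class are made up for illustration and are only a guess at the shape of the real page.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags nested inside an element with class 'pagination-next'.

    Hypothetical helper for illustration. It assumes well-formed markup with
    no unclosed void tags (such as <br>) inside the pagination element.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while we are inside the pagination-next element
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.hrefs.append(attrs["href"])
        elif "pagination-next" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1


# Made-up pagination markup, loosely modelled on the description above.
SAMPLE_HTML = """
<ul>
  <li class="active"><a href="/blog/">1</a></li>
  <li><a href="/blog/page/2/">2</a></li>
  <li class="pagination-next"><a href="https://copyblogger.com/blog/page/2/">Next Page</a></li>
</ul>
"""

finder = NextLinkFinder()
finder.feed(SAMPLE_HTML)
print(finder.hrefs)
```

Like extract(), this produces a list, and the numbered page links are ignored because they sit outside the pagination-next element.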
The moment we have the URL, we can ask Scrapy to fetch it. Since extract() returns a list, we pass its first element:
yield scrapy.Request(nextpage[0], callback=self.parse_next_page)
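Two things can trip up that line: on the last page the list is empty, so nextpage[0] raises an IndexError, and on some sites the href is relative rather than a full URL. A minimal sketch of a guard for both cases follows, using only the standard library. The helper name next_page_url is made up for this example. Inside a spider, Scrapy's own response.urljoin() does the same joining against the page URL.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Return the absolute URL of the next page, or None when there is none.

    Hypothetical helper: `hrefs` is the list returned by
    response.css('.pagination-next a::attr(href)').extract().
    """
    if not hrefs:
        # Last page: the selector matched nothing, so there is nothing to follow.
        return None
    # urljoin() leaves absolute URLs untouched and resolves relative ones.
    return urljoin(current_url, hrefs[0])


print(next_page_url("https://copyblogger.com/blog/", ["/blog/page/2/"]))
print(next_page_url("https://copyblogger.com/blog/page/2/", []))
```

In parse, you would then yield a scrapy.Request only when the helper returns something other than None.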
So the whole code looks like this:
# -*- coding: utf-8 -*-
import scrapy


class SimpleNextPage(scrapy.Spider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        print('Current page ' + response.url)
        # extract() returns a list of hrefs; it is empty when there is no Next Page link
        nextpage = response.css('.pagination-next a::attr(href)').extract()
        if nextpage:
            yield scrapy.Request(nextpage[0], callback=self.parse_next_page)

    def parse_next_page(self, response):
        print('Fetched next page ' + response.url)
Let's save it as SimpleNextPage.py and run it with the parameters below, which tell Scrapy to ignore robots.txt and to send a web browser's User-Agent string...
scrapy runspider SimpleNextPage.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
When you run it, it should print the URL of the current page followed by the URL of the next page it fetched.
We don't have to stop there. Let's make this function recursive. For that, we can do away with the parse_next_page function altogether and have the parse function handle every next-page response itself.
Here is the final code:
# -*- coding: utf-8 -*-
import scrapy


class SimpleNextPage(scrapy.Spider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        print('Current page ' + response.url)
        # extract() returns a list of hrefs; it is empty when there is no Next Page link
        nextpage = response.css('.pagination-next a::attr(href)').extract()
        if nextpage:
            # Hand the next page back to parse(); the crawl ends on the last page
            yield scrapy.Request(nextpage[0], callback=self.parse)
And the result: the spider fetches every page in turn, and you can parse, scrape, or otherwise process each one as it arrives.
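If the "recursive" part feels a little magical, it may help to see the control flow without Scrapy in the way. The toy model below is only a sketch: SITE is a made-up mapping from each page to its next-page link, and crawl() is a very rough stand-in for Scrapy's engine, which in reality schedules requests asynchronously. What it does show is why the crawl stops on its own: the last page yields no new request.

```python
def parse(url, site):
    """Mimic the spider's parse(): report the page, then request the next one if any."""
    yield ("page", url)
    next_url = site.get(url)
    if next_url is not None:
        yield ("request", next_url)


def crawl(start_url, site):
    """Toy stand-in for the engine: keep feeding requested URLs back into parse()."""
    visited = []
    pending = [start_url]
    while pending:
        url = pending.pop(0)
        for kind, value in parse(url, site):
            if kind == "page":
                visited.append(value)
            else:
                pending.append(value)
    return visited


# Hypothetical three-page site: each key maps to its "Next Page" link.
SITE = {
    "/blog/": "/blog/page/2/",
    "/blog/page/2/": "/blog/page/3/",
    "/blog/page/3/": None,
}

pages = crawl("/blog/", SITE)
print(pages)
```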
Scaling Scrapy
The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed, you will find that sooner or later your access is restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser's User-Agent string to the web server so it doesn't block you.
Like this:
-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" \
-s ROBOTSTXT_OBEY=False
In more advanced implementations, you will even need to rotate this string so the site can't tell it's the same browser! Welcome to web scraping.
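Here is a minimal sketch of what rotating the string could look like. The list of User-Agent values and the helper names are only examples, not a vetted or current set, and picking one at random is just one possible strategy.

```python
import random

# Example values only; a real project would maintain a larger, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


def pick_user_agent(rng=random):
    """Return one User-Agent string chosen at random from the pool."""
    return rng.choice(USER_AGENTS)


def headers_for_request(rng=random):
    """Build a headers dict carrying a randomly chosen User-Agent."""
    return {"User-Agent": pick_user_agent(rng)}


print(headers_for_request())
```

In a spider, a dict like this can be passed as the headers argument of scrapy.Request so that each request goes out with its own User-Agent.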
If we get a little bit more advanced, you will realize that the site can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
- With millions of high speed rotating proxies located all over the world,
- With our automatic IP rotation,
- With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions),
- With our automatic CAPTCHA solving technology,
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
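From Python, the same request URL could be put together like this. This is a sketch based only on the curl line above: the endpoint and the key and url parameters are taken from it, while the proxied_url helper name is made up. Percent-encoding the target URL is our assumption about good practice for query strings, so that a target containing its own ? or & survives intact.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the curl example above.
API_ENDPOINT = "http://api.proxiesapi.com/"


def proxied_url(api_key, target_url):
    """Build the API request URL for fetching `target_url` through the proxy.

    Hypothetical helper; the target URL is percent-encoded (an assumption,
    the curl example passes it unencoded).
    """
    query = urlencode({"key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


print(proxied_url("API_KEY", "https://example.com"))
```

In the spider, the resulting URL would take the place of the page URL in scrapy.Request. Note that allowed_domains would then need to include the API's domain, or those requests would be filtered out as offsite.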
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.