May 4th, 2020
Scraping an Entire Blog with Scrapy

Scrapy is one of the easiest tools you can use to scrape, and also spider, a website.

One of the most common applications of web scraping, judging by the patterns we see with many of our customers at Proxies API, is scraping blog posts. Today let's look at how we can build a simple scraper to pull out and save blog posts from a blog like CopyBlogger.

Here is how the CopyBlogger blog section looks.

You can see that there are about 10 posts on this page. We will try and scrape them all.

First, we need to install scrapy if you haven't already.

pip install scrapy

Once installed, we will add a simple file with some barebones code like so.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
import urllib


class SimpleNextPage(CrawlSpider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        pass

Let's examine this code before we proceed...

The allowed_domains list restricts all further crawling to the domains specified here.

start_urls is the list of URLs to crawl. For this example, we only need one URL.

The LOG_LEVEL setting makes the Scrapy output less verbose so it is not confusing.

The parse(self, response) function is called by Scrapy after every successful URL crawl. This is where we write the code to extract the data we want.
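If you just want to confirm the callback is wired up before writing any real extraction logic, you could temporarily have parse yield something trivial; a purely illustrative sketch:

    def parse(self, response):
        # illustrative only: emit the URL and HTTP status of each crawled page
        yield {'url': response.url, 'status': response.status}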

Now let's see what we can write in the parse function...

For this let's find the CSS patterns that we can use as selectors for finding the blog posts on this page.

When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function like this.

titles = response.css('.entry-title').extract()

This will give us the headline. We also need the href from the 'a' tag, which has the class entry-title-link, so we extract that as well:

links = response.css('.entry-title  a::attr(href)').extract()
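Incidentally, the headline text can also be pulled straight out of the selector with ::text instead of cleaning up the HTML fragment later with BeautifulSoup. A minimal alternative, assuming the headline text sits directly inside that link:

titles_text = response.css('.entry-title a::text').extract()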

So let's put this all together.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule

from bs4 import BeautifulSoup
import urllib


class blogCrawlerSimple(CrawlSpider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        #yield response
        titles = response.css('.entry-title').extract()       
        links = response.css('.entry-title  a::attr(href)').extract()

        
        for item in zip(titles, links):
            all_items = {
                'title' : BeautifulSoup(item[0], 'html.parser').text,
                'link' : item[1],
            }
            #found the link now lets scrape it... 
            yield scrapy.Request(item[1], callback=self.parse_blog_post)


            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)

Let's save it as BlogCrawler.py and then run it with these parameters, which tell Scrapy to ignore robots.txt and to simulate a web browser.

scrapy runspider BlogCrawler.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
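If you would rather not pass these flags on the command line every time, the same settings can also live inside the spider via custom_settings; a minimal sketch:

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
        'ROBOTSTXT_OBEY': False,
    }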

When you run it, you should see it pull out the title and link of each post on the page and then fetch every one of those posts.

Those are all the blog posts. Let's save them into files.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule

from bs4 import BeautifulSoup
import urllib


class blogCrawlerSimple(CrawlSpider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        #yield response
        titles = response.css('.entry-title').extract()       
        links = response.css('.entry-title  a::attr(href)').extract()


        
        for item in zip(titles, links):
            all_items = {
                'title' : BeautifulSoup(item[0], 'html.parser').text,
                'link' : item[1],
            }
            #found the link now lets scrape it... 
            yield scrapy.Request(item[1], callback=self.parse_blog_post)


            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)
        filename = 'storage/' + response.url.split("/")[3] + '.html'
        print('Saved post as: ' + filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        return

When you run it now, it will save all the blog posts into the storage folder.
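One small thing to watch out for: the spider assumes a storage folder already exists next to the script, and the open() call will fail if it doesn't. You can create it by hand, or add something like this near the top of the file:

import os

# make sure the output folder exists before the spider starts writing to it
os.makedirs('storage', exist_ok=True)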

But if you look a little closer, there are 320-odd pages like this on CopyBlogger. We need a way to paginate through them and fetch them all.

When we inspect this in the Google Chrome inspect tool, we can see that the link is inside an LI element with the CSS class pagination-next. This is good enough for us. We can just select this using the CSS selector function like this...

nextpage = response.css('.pagination-next').extract()

This will give us the text 'Next Page' though. What we need is the href in the 'a' tag inside the LI tag. So we modify it to this...

nextpage = response.css('.pagination-next a::attr(href)').extract()

In fact, the moment we have the URL, we can ask Scrapy to fetch its contents like this:

yield scrapy.Request(nextpage[0], callback=self.parse_next_page)
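For the full crawl, though, we want every subsequent listing page to be handled exactly like the first one, so in the final spider we point the callback back at parse itself and guard against the last page, which has no next link:

        nextpage = response.css('.pagination-next a::attr(href)').extract()
        if nextpage:
            # recurse: the next listing page goes through this same parse method
            yield scrapy.Request(nextpage[0], callback=self.parse)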

So the whole code looks like this:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule

from bs4 import BeautifulSoup
import urllib


class blogCrawler(CrawlSpider):
    name = 'blogCrawler'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        #yield response
        titles = response.css('.entry-title').extract()       
        links = response.css('.entry-title  a::attr(href)').extract()

        nextpage = response.css('.pagination-next a::attr(href)').extract()

        if nextpage:
            # recurse: crawl the next listing page with this same parse method
            yield scrapy.Request(nextpage[0], callback=self.parse)


        
        for item in zip(titles, links):
            all_items = {
                'title' : BeautifulSoup(item[0], 'html.parser').text,
                'link' : item[1],
            }
            #found the link now lets scrape it... 
            yield scrapy.Request(item[1], callback=self.parse_blog_post)


            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post ' + response.url)
        filename = 'storage/' + response.url.split("/")[3] + '.html'
        print('Saved post as: ' + filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        return 


    def parse_next_page(self, response):
        print('Fetched next page ' + response.url)

And when you run it, it should download every blog post ever written on CopyBlogger onto your system.
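If you also want to keep the title and link items the spider yields (not just the saved HTML files), Scrapy's built-in feed export can write them to a file for you; the output filename here is just an example:

scrapy runspider BlogCrawler.py -o posts.json -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False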

Scaling Scrapy

The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speeds, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a user agent string to the web server so it doesn't block you.

Like this:

-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" \
-s ROBOTSTXT_OBEY=False

In more advanced implementations, you will even need to rotate this string so the website can't tell it's the same browser! Welcome to web scraping.
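One straightforward way to rotate it is a small downloader middleware that assigns a random user agent to every request. Here is a minimal sketch; the class name, module path and the list of user agent strings are placeholders you would adapt to your own project:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
]

class RandomUserAgentMiddleware:
    # Scrapy calls process_request for every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Enable it through the DOWNLOADER_MIDDLEWARES setting (for example {'yourmodule.RandomUserAgentMiddleware': 400}, where yourmodule is wherever you put the class) and each request will go out with a different browser signature.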

If we get a little more advanced, you will realize that the web server can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
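The same call from Python, for example with the requests library (API_KEY is a placeholder for your own key):

import requests

API_KEY = 'API_KEY'  # placeholder: replace with your Proxies API key
target_url = 'https://copyblogger.com/blog/'

# the rotating proxy fetches target_url on our behalf and returns the HTML
response = requests.get('http://api.proxiesapi.com/',
                        params={'key': API_KEY, 'url': target_url})
print(response.status_code)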

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
