Scrapy is one of the most accessible tools you can use to scrape and spider a website.
One of the most common applications of web scraping, according to the patterns we see with many of our customers at Proxies API, is scraping blog posts. Today let's look at how we can build a simple scraper to pull out and save blog posts from a blog like CopyBlogger.
Here is how the CopyBlogger blog section looks.
You can see that there are about ten posts on this page. We will try and scrape them all.
First, install Scrapy if you haven't already. The later examples also use BeautifulSoup to strip HTML tags, so install that as well.
pip install scrapy beautifulsoup4
Once installed, we will add a simple file with some barebones code like so.
# -*- coding: utf-8 -*-
import scrapy


# No crawl rules are defined, so a plain Spider is enough
class SimpleNextPage(scrapy.Spider):
    name = 'SimpleNextPage'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]
    custom_settings = {
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        # The extraction code will go here
        pass
Let's examine this code before we proceed.
The allowed_domains list restricts all further crawling to the domains specified here.
start_urls is the list of URLs to crawl. For us, in this example, we only need one URL.
The LOG_LEVEL setting makes the Scrapy output less verbose, so it is not confusing.
The def parse(self, response): function is the default callback. Scrapy calls it with the response for each URL in start_urls that it downloads successfully. Here is where we can write our code to extract the data we want.
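To get a feel for what allowed_domains does, here is a minimal standard-library sketch of the idea: a URL passes if its host is one of the listed domains or a subdomain of one. The is_allowed helper is hypothetical and written for illustration only; it is not Scrapy's own code, and Scrapy's real filtering may differ in details.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["copyblogger.com"]
print(is_allowed("https://copyblogger.com/blog/", allowed))        # True
print(is_allowed("https://www.copyblogger.com/blog/", allowed))    # True
print(is_allowed("https://example.com/copyblogger.com", allowed))  # False
```

The last call shows why the check looks at the host and not at the whole URL string: the domain name appearing in the path does not count.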
Now let's see what we can write in the parse function.
For this, let's find the CSS patterns that we can use as selectors for finding the blog posts on this page.
When we inspect this in the Google Chrome inspect tool (right-click on the page in Chrome and click Inspect to bring it up), we can see that the article headlines are always inside an H2 tag with the CSS class entry-title. This is good enough for us. We can just select this using the CSS selector function like this.
titles = response.css('.entry-title').extract()
This will give us the HTML of each headline element. We will strip the tags later to get the plain headline text. We also need the href of the 'a' tag inside it, which has the class entry-title-link, so let's extract that as well.
links = response.css('.entry-title a::attr(href)').extract()
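If you want to check what these two selectors should produce without running a crawl, the sketch below walks a small hand-written HTML fragment and collects each headline together with its link. Note the assumptions: the sample markup is made up to match the structure described above (it is not copied from the live site), and the code uses Python's standard-library HTMLParser as a stand-in, not Scrapy's selectors.

```python
from html.parser import HTMLParser

# Hand-written sample markup (an assumption, not copied from the live site)
SAMPLE_HTML = """
<article><h2 class="entry-title">
  <a class="entry-title-link" href="https://copyblogger.com/first-post/">First Post</a>
</h2></article>
<article><h2 class="entry-title">
  <a class="entry-title-link" href="https://copyblogger.com/second-post/">Second Post</a>
</h2></article>
"""


class HeadlineParser(HTMLParser):
    """Collect title/link pairs from <h2 class="entry-title"> elements."""

    def __init__(self):
        super().__init__()
        self.posts = []         # finished {'title': ..., 'link': ...} dicts
        self._in_title = False  # True while inside an entry-title <h2>
        self._text = []         # text fragments of the current headline
        self._link = None       # href of the current headline's <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and "entry-title" in (attrs.get("class") or "").split():
            self._in_title, self._text, self._link = True, [], None
        elif tag == "a" and self._in_title and self._link is None:
            self._link = attrs.get("href")

    def handle_data(self, data):
        if self._in_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_title:
            self.posts.append(
                {"title": "".join(self._text).strip(), "link": self._link}
            )
            self._in_title = False


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
for post in parser.posts:
    print(post["title"], "->", post["link"])
```

Because each title is read from the same element as its link, the two can never get out of step, which is something to keep in mind when pairing two separate lists with zip.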
So let's put this all together.
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup


class blogCrawlerSimple(scrapy.Spider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        # Raw HTML of each headline element, and the href of each headline link
        titles = response.css('.entry-title').extract()
        links = response.css('.entry-title a::attr(href)').extract()
        for title_html, link in zip(titles, links):
            all_items = {
                # Strip the tags to keep only the headline text
                'title': BeautifulSoup(title_html, 'html.parser').text,
                'link': link,
            }
            # Found the link, now let's scrape it...
            yield scrapy.Request(link, callback=self.parse_blog_post)
            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post', response.url)
Let's save it as BlogCrawlerSimple.py and then run it with these parameters, which tell Scrapy to ignore robots.txt and to send a web browser's User-Agent string.
scrapy runspider BlogCrawlerSimple.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
When you run it, it should print a "Fetched blog post" line with the URL of each post it visits.
Those are all the blog posts. Let's save them into files.
# -*- coding: utf-8 -*-
import os

import scrapy
from bs4 import BeautifulSoup


class blogCrawlerSimple(scrapy.Spider):
    name = 'blogCrawlerSimple'
    allowed_domains = ['copyblogger.com']
    start_urls = [
        'https://copyblogger.com/blog/',
    ]

    def parse(self, response):
        titles = response.css('.entry-title').extract()
        links = response.css('.entry-title a::attr(href)').extract()
        for title_html, link in zip(titles, links):
            all_items = {
                'title': BeautifulSoup(title_html, 'html.parser').text,
                'link': link,
            }
            # Found the link, now let's scrape it...
            yield scrapy.Request(link, callback=self.parse_blog_post)
            yield all_items

    def parse_blog_post(self, response):
        print('Fetched blog post', response.url)
        # Make sure the output folder exists before writing into it
        os.makedirs('storage', exist_ok=True)
        # The fourth "/"-separated piece of the URL is the post slug
        filename = 'storage/' + response.url.split("/")[3] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('Saved post as :', filename)
When you run it now, it will save each blog post as an HTML file in the storage folder.
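The file name comes from response.url.split("/")[3], which assumes the post slug is the first piece of the URL path. As a hedged alternative, here is a small sketch that takes the last non-empty path segment and falls back to "index" when the path is empty. The filename_for helper and the example URLs are hypothetical, and note that it behaves differently from the original for nested paths such as /blog/some-post/.

```python
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url, folder="storage"):
    """Build a local .html path from the last non-empty segment of the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    slug = segments[-1] if segments else "index"
    return Path(folder) / (slug + ".html")


print(filename_for("https://copyblogger.com/some-post/"))
print(filename_for("https://copyblogger.com/blog/some-post/?utm=1"))
print(filename_for("https://copyblogger.com/"))
```

Using urlparse also keeps query strings out of the file name, since only the path part of the URL is inspected.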
Scaling Scrapy
The example above is fine for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser's User-Agent string to the web server, so it doesn't block you.
Like this:
-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" \
-s ROBOTSTXT_OBEY=False
In more advanced implementations, you will even need to rotate this string, so the site can't tell it is the same browser! Welcome to web scraping.
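Rotating the string can be as simple as cycling through a list. The sketch below is one way to do it with the standard library; the next_headers helper is hypothetical, and only the first User-Agent string comes from this article, while the other two are illustrative assumptions.

```python
import itertools

# Illustrative User-Agent strings: the first is the one used earlier in this
# article, the other two are assumptions added for the example.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

# Endless round-robin iterator over the list above
_agent_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers carrying the next User-Agent in round-robin order."""
    return {"User-Agent": next(_agent_cycle)}


for _ in range(4):
    print(next_headers()["User-Agent"][:40])
```

In the spider, the returned dict could be passed as the headers argument of scrapy.Request, so each request goes out with a different string.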
If we get a little bit more advanced, you will realize that the site can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
- With millions of high speed rotating proxies located all over the world
- With our automatic IP rotation
- With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
- With our automatic CAPTCHA solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed through a simple API, like below, from any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
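The same request URL can be built in Python. This is a minimal sketch that only constructs the URL: the endpoint and the key and url parameter names are taken from the curl example above, the proxied_url helper is hypothetical, and percent-encoding the target URL is our assumption about what is safest when the target has its own query string.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://api.proxiesapi.com/"  # endpoint taken from the curl example


def proxied_url(target_url, api_key="API_KEY"):
    """Wrap target_url in a Proxies API request URL, percent-encoding the query."""
    return API_ENDPOINT + "?" + urlencode({"key": api_key, "url": target_url})


print(proxied_url("https://example.com"))
# http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com
```

The resulting URL can then be fetched with any HTTP client in place of the original one. Replace API_KEY with your own key.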
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.