Scrapy is one of the most accessible tools you can use to scrape and also spider a website with very little effort.
One of the most common applications of web scraping, going by the patterns we see with many of our customers at Proxies API, is scraping images from websites. Today let's look at how we can build a simple scraper to pull out and save all the images from a website like The New York Times.
Here is how the NYT Home page looks.
Now let's try and download all those interesting images that define the world every day.
First, we need to install scrapy if you haven't already.
pip install scrapy
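The spider below also uses BeautifulSoup to tidy up the image captions, so install that too if you don't already have it.
pip install beautifulsoup4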
Once installed, we will add a simple file with some barebones code like so.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
import urllib


class crawlImages(CrawlSpider):
    name = 'crawlImages'
    allowed_domains = ['nytimes.com', 'nyt.com']
    start_urls = [
        'https://www.nytimes.com/',
    ]

    def parse(self, response):
        pass
Let's examine this code before we proceed.
The allowed_domains list restricts all further crawling to the domains specified here. We need nyt.com because that's where the images are hosted, as you will see below.
start_urls is the list of URLs to crawl. For this example, we only need one URL.
The parse(self, response) function is called by Scrapy after every successful URL crawl. This is where we write the code to extract the data we want.
Now let's see what we can write in the parse function...
We need the CSS selector for images. It's pretty obvious. We can just use the IMG tag and get the SRC attribute for the image path and the ALT attribute for the caption of the image if we want.
links = response.css('img::attr(src)').extract()
titles = response.css('img::attr(alt)').extract()
This will give us the image URLs and their captions.
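If you want to check these selectors before writing the whole spider, you can try them out in the Scrapy shell first, something like this (the USER_AGENT override is just so the page renders as it would for a browser):
scrapy shell -s USER_AGENT="Mozilla/5.0" https://www.nytimes.com/
>>> response.css('img::attr(src)').extract()[:5]
>>> response.css('img::attr(alt)').extract()[:5]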
So let's put this all together.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
import urllib


class crawlImages(CrawlSpider):
    name = 'crawlImages'
    allowed_domains = ['nytimes.com', 'nyt.com']
    start_urls = [
        'https://www.nytimes.com/',
    ]

    def parse(self, response):
        # Grab the caption (ALT) and the image path (SRC) of every IMG tag
        titles = response.css('img::attr(alt)').extract()
        links = response.css('img::attr(src)').extract()
        print('##########')
        for item in zip(titles, links):
            all_items = {
                # BeautifulSoup strips any stray markup from the caption
                'title': BeautifulSoup(item[0], 'html.parser').text,
                'link': item[1]
            }
            print(item[1])
            yield all_items
Let's save it as crawlImages.py and then run it with these parameters, which tell Scrapy to ignore robots.txt and to simulate a web browser.
scrapy runspider crawlImages.py -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
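As a side note, if you'd rather not pass these settings on the command line every time, Scrapy also lets you set them on the spider itself through the custom_settings class attribute; the rest of the spider stays the same. A minimal sketch:
class crawlImages(CrawlSpider):
    name = 'crawlImages'
    # Equivalent to the -s flags above, kept inside the spider
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
        'ROBOTSTXT_OBEY': False,
    }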
Either way, when you run it, it should print out all the image URLs it finds.
Those are all the images. Now let's save them into files. For that, we need to use scrapy.Request to fetch the photos one after another, extract the file name from each URL, and save the image to the local disk, as shown below.
# -*- coding: utf-8 -*-
import os
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
import urllib


class crawlImages(CrawlSpider):
    name = 'crawlImages'
    allowed_domains = ['nytimes.com', 'nyt.com']
    start_urls = [
        'https://www.nytimes.com/',
    ]

    def parse(self, response):
        titles = response.css('img::attr(alt)').extract()
        links = response.css('img::attr(src)').extract()
        print('##########')
        for item in zip(titles, links):
            all_items = {
                'title': BeautifulSoup(item[0], 'html.parser').text,
                'link': item[1]
            }
            # print(item[1])
            # Fetch each image; urljoin handles any relative image paths
            yield scrapy.Request(response.urljoin(item[1]), callback=self.parse_image)
            yield all_items

    def parse_image(self, response):
        print('^^^ fetched image : ' + response.url)
        # Make sure the storage folder exists, then build a local file name
        # from the last part of the URL, minus any query string
        os.makedirs('storage', exist_ok=True)
        filename1 = 'storage/' + response.url.split('/')[-1]
        filename = filename1.split('?')[0]
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('^^^ Saved image as: ' + filename)
        return
When you run it now, it will save all the images into the storage folder, like so.
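As an aside, for bigger jobs you might not want to write the files yourself at all; Scrapy ships with a built-in ImagesPipeline that handles the downloading and file naming for you (it needs the Pillow library installed). A rough sketch of how that could look, with a hypothetical spider name:
# Sketch: letting Scrapy's built-in ImagesPipeline save the files (requires Pillow)
import scrapy

class imagePipelineSpider(scrapy.Spider):
    name = 'imagePipelineSpider'
    allowed_domains = ['nytimes.com', 'nyt.com']
    start_urls = ['https://www.nytimes.com/']
    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': 'storage',  # downloaded files land under this folder
    }

    def parse(self, response):
        for src in response.css('img::attr(src)').extract():
            # The pipeline downloads every URL listed under 'image_urls'
            yield {'image_urls': [response.urljoin(src)]}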
Scaling Scrapy
The example above is OK for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed, you will find that sooner or later your access will be restricted. Web servers can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a user agent string to the web server so it doesn't block you.
Like this.
-s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36" -s ROBOTSTXT_OBEY=False
In more advanced implementations, you will even need to rotate this string, so the website can't tell it's the same browser! Welcome to web scraping.
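One straightforward way to rotate it is a small downloader middleware that picks a random User-Agent for every request. A rough sketch (the file name, class name, and the second UA string here are just placeholders):
# middlewares.py - sketch of a random User-Agent downloader middleware
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

You would then enable it with the DOWNLOADER_MIDDLEWARES setting, for example 'DOWNLOADER_MIDDLEWARES': {'middlewares.RandomUserAgentMiddleware': 543} in custom_settings.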
If we get a little bit more advanced, you will realize that the website can simply block your IP, ignoring all your other tricks. It is a bummer, and this is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
- With millions of high speed rotating proxies located all over the world
- With our automatic IP rotation
- With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
- With our automatic CAPTCHA solving technology
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed with a simple API like the one below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.