Write sophisticated spiders
It is a breeze to write full-blown spiders quickly with Scrapy. Here is one that can download all the images from a Wikipedia page.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

i = 1

class MySpider(CrawlSpider):
    name = 'Wikipedia'
    allowed_domains = ['en.wikipedia.org', 'upload.wikimedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Lists_of_animals']

    rules = (
        # This rule follows any link containing .jpg and clears the deny_extensions
        # default, which would otherwise filter out image URLs
        Rule(LinkExtractor(allow=(r'\.jpg',), deny_extensions=set(),
                           tags=('img',), attrs=('src',),
                           canonicalize=True, unique=True),
             follow=False, callback='parse_item'),
    )

    def parse_item(self, response):
        global i
        i = i + 1
        self.logger.info('Found image - %s', response.url)
        flname = 'image' + str(i) + '.jpg'
        with open(flname, 'wb') as image_file:
            image_file.write(response.body)
        self.logger.info('Saved image as - %s', flname)
        item = scrapy.Item()
        return item
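Assuming you save the spider as wikipedia_images.py (the filename is arbitrary), you can run it without creating a full Scrapy project by using scrapy runspider; the downloaded image files land in the working directory.
scrapy runspider wikipedia_images.py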
Here is another that navigates http://quotes.toscrape.com and collects quotes by following the pagination links.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
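If this one is saved as quotes_spider.py (again, the filename is arbitrary), scrapy runspider runs it the same way, and the yielded dictionaries show up as scraped items in the log.
scrapy runspider quotes_spider.py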
Use selectors to extract content
The example above uses both CSS and XPath selectors to extract text. Between them, you can extract just about any content you will find on the web. Scrapy also makes it easy to test your selectors interactively in the Scrapy shell.
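To make the equivalence concrete, here is a tiny standalone snippet (the HTML fragment is made up purely for illustration) that pulls the same text once with a CSS selector and once with an XPath expression:
from scrapy.selector import Selector

# A throwaway HTML fragment, used only to compare the two selector styles
html = '<div class="quote"><span class="text">"A quote"</span><small class="author">Someone</small></div>'
sel = Selector(text=html)

print(sel.css('small.author::text').get())                  # CSS   -> 'Someone'
print(sel.xpath('//small[@class="author"]/text()').get())   # XPath -> 'Someone'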
Do interactive testing in the Scrapy shell
The interactive shell is one of my favorite things about Scrapy. One of the most time-consuming parts of scraping is writing the correct selectors to get the data you want, and the fastest way to test and iterate on them is the shell. You can invoke it like this.
scrapy shell http://example.com
It loads the contents of example.com into the response object, which you can now query like so.
response.xpath('//title/text()')
This returns a selector list containing the title of the page.
[<Selector xpath='//title/text()' data=u'Example Domain'>]
To get the headline of the page, you can use a CSS selector.
response.css('h1::text').get()
This will print
Out[10]: u'Example Domain'
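Two other shell helpers are worth knowing: fetch() loads a new page into the same session, and view() opens the current response in your browser (the quotes URL below is just reused from the earlier example).
# Load a different page without leaving the shell
fetch('http://quotes.toscrape.com/tag/humor/')

# Try selectors against the new response
response.css('div.quote span.text::text').getall()

# Open what Scrapy actually downloaded in your browser
view(response)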
Export data in many ways and store it in different systems
After you have successfully run your spiders and extracted the data, you will need to export it in whatever format your project requires, and you may also have to save it to various locations such as the local disk or Amazon S3.
Scrapy has built-in support for the following export formats:
a. JSON
b. JSONLINES
c. CSV
d. XML
e. Pickle
f. Marshal
and it can store the exported data in any of these locations (see the example after the list):
g. Local storage
h. Amazon S3
i. FTP
j. Standard output
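For example, continuing with the quotes_spider.py filename assumed earlier, you can write the scraped items straight to a file from the command line, or configure feeds in settings.py with the FEEDS setting (the S3 bucket name below is a placeholder, and S3 feeds additionally need botocore and AWS credentials configured):
scrapy runspider quotes_spider.py -o quotes.json

# settings.py
FEEDS = {
    'quotes.jsonl': {'format': 'jsonlines'},
    's3://my-bucket/quotes.csv': {'format': 'csv'},  # placeholder bucket
}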
Use the Signals API to get notified when certain events occur
The Signals API is super useful in controlling, monitoring, and reporting the behavior of your spiders.
The following code demonstrates how you can subscribe to the spider_closed signal.
from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Register spider_closed() to run when the spider_closed signal fires
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
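The same pattern works for the other built-in signals. As a rough sketch (the spider name and the items_seen counter are made up for illustration), you could count items as they are scraped by connecting to the item_scraped signal:
from scrapy import signals, Spider

class CountingSpider(Spider):
    # Illustrative spider; only the signal wiring matters here
    name = 'counting'
    start_urls = ['http://quotes.toscrape.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.items_seen = 0
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        return spider

    def item_scraped(self, item, response, spider):
        # Fires once for every item that makes it through the pipelines
        self.items_seen += 1
        self.logger.info('Items scraped so far: %d', self.items_seen)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}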
The author is the founder of Proxies API, the rotating proxies service.