First of all, when you are writing a web crawler or web scraper, it is always advisable to use a framework like Scrapy or Nutch so you don't have to reinvent the wheel for every little capability you need. Frameworks like these make your crawler more stable, easier to write and modify, and flexible enough to play well with other systems. For this article, we will set these frameworks aside, since they are not the external tools we are talking about. Once you have finished making the crawler, you will need tools like the ones below to stabilize, scale, and monitor it.
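To give a sense of how little boilerplate a framework leaves you with, here is a minimal Scrapy spider sketch. The spider name, start URL, and CSS selectors are illustrative (they come from Scrapy's own tutorial site), not from any particular project.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # Follow pagination and let Scrapy handle scheduling, deduplication,
        # retries, and politeness settings for us
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Run it with scrapy runspider spider.py and the framework takes care of request scheduling, throttling, and output pipelines: exactly the wheel you don't want to reinvent.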
A Rotating Proxy Service
Using Scrapy or Nutch, you can make a damn good crawler, but you will still be prone to getting blocked by target websites. If you are trying to crawl big networks like Amazon, Yelp, or Twitter, the block is only a few hundred pages away. Once your IP gets blocked, you will need some sort of rotating proxy to get past these restrictions. Proxies API (I am the founder) is a service that routes your requests through millions of proxies, and it also handles, behind the scenes, all the other factors that get your crawler blocked.
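In code, routing through a rotating proxy service usually amounts to a single extra hop. The endpoint and parameter names below are placeholders in the spirit of such a service, not the documented Proxies API interface, so check the provider's docs for the real ones.

import requests

PROXY_ENDPOINT = "http://api.proxiesapi.com/"  # hypothetical endpoint
AUTH_KEY = "YOUR_AUTH_KEY"                     # hypothetical parameter name

def fetch_via_proxy(target_url):
    # Each request goes out through a different IP on the provider's side,
    # so hitting the same site hundreds of times is far less likely to get blocked.
    response = requests.get(
        PROXY_ENDPOINT,
        params={"auth_key": AUTH_KEY, "url": target_url},
        timeout=30,
    )
    response.raise_for_status()
    return response.text

html = fetch_via_proxy("https://www.example.com/some-product-page")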
Website Pattern Change Detection Tool
For want of a better name, let's call it that. Crawlers break if the website goes down. They also break if the website shifts its URLs around. So we need something that can track these changes.
The scraper will also break if the HTML inside the website changes. Not the content: the content can be dynamic, but the HTML structure has to stay more or less the same for our scrapers not to break.
It is a good idea to keep track of HTML changes, especially in the portions of the webpage we are scraping. Internally at Proxies API, we use a tool called urlwatch (https://github.com/thp/urlwatch).
It's open-source and super useful.
You can add a list of URLs to track and also specify the element inside the HTML to track...
url: http://example.org/
filter: element-by-id:something
You can use XPath.
url: https://example.net/
filter: xpath:/body
Or even CSS selectors. Suppose you are scraping the New York Times and pulling all elements with the class name 'article'; you can set up a tracker specifying that CSS class like below.
url: https://nytimes.com/
filter: css:.article
The tool also tracks connection errors and HTTP error codes like 408, 429, 500, 502, 503, and 504, and sends you an alert through various channels so you can act on them early and make sure your scraper doesn't break.
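If you ever need a stripped-down version of this kind of check inside your own code, the idea boils down to the sketch below: fetch the page, isolate the fragment you scrape, hash it, and compare against the last run. The selector, state file, and alert function are placeholders; a tool like urlwatch does all of this (plus the diffing and reporting) for you.

import hashlib
import pathlib

import requests
from bs4 import BeautifulSoup

STATE_FILE = pathlib.Path("last_hash.txt")  # where the previous run's hash lives

def alert(message):
    # Placeholder: wire this up to email, Slack, PagerDuty, etc.
    print("ALERT:", message)

def check_for_changes(url, css_selector):
    response = requests.get(url, timeout=30)

    # The error codes that usually mean the target is down or throttling us
    if response.status_code in (408, 429, 500, 502, 503, 504):
        alert(f"{url} returned HTTP {response.status_code}")
        return

    # Hash only the fragment we actually scrape, so changes elsewhere on
    # the page don't raise false alarms
    soup = BeautifulSoup(response.text, "html.parser")
    fragment = "".join(str(el) for el in soup.select(css_selector))
    digest = hashlib.sha256(fragment.encode("utf-8")).hexdigest()

    previous = STATE_FILE.read_text() if STATE_FILE.exists() else ""
    if previous and previous != digest:
        alert(f"Tracked HTML under '{css_selector}' changed on {url}")
    STATE_FILE.write_text(digest)

check_for_changes("https://nytimes.com/", ".article")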
A Queueing System
When you are building large-scale web scrapers, a queueing system to handle tasks asynchronously is essential. For example, once the scraper has fetched the data, you might want a summarizer algorithm to go to work on it, or a term extractor to pull out key terms, without getting in the way of the busy crawler engine. These tasks might even be handled by scripts on another server, or a set of servers that act as workers picking up tasks when they are free. A queueing system helps manage the allocation of work: multiple buckets of tasks with different priorities, speed and concurrency limits for each bucket, and so on.
RabbitMQ is probably the most popular message queueing system. It's open-source, platform-independent, can handle on the order of 20,000 messages per second, and ships with a built-in monitoring and management dashboard.
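As a sketch of how the hand-off works with RabbitMQ and its Python client pika: the crawler does nothing more than a cheap publish. The queue name and message shape here are illustrative, not from any particular setup.

import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Durable queue so pending tasks survive a broker restart
channel.queue_declare(queue="scraped_pages", durable=True)

def enqueue_page(url, html):
    # The busy crawler engine only publishes; summarizers and term extractors
    # on other servers consume from this queue at their own pace.
    channel.basic_publish(
        exchange="",
        routing_key="scraped_pages",
        body=json.dumps({"url": url, "html": html}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent message
    )

enqueue_page("https://example.com/page/1", "<html>...</html>")
connection.close()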
A Third-Party Cron Service API
In a system this large, you will want to set up schedules that are triggered at run time. For example, your crawler might find that a particular page is taking too long to load or is down. You can detect this in your code and quickly set up a cron job as a callback that retries the page later.
It's always handy to have a service that can schedule calls to your scripts at different times, intervals, and frequencies at run time with a single API call. Multiple services do this; EasyCron is one of them.
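Scheduling that retry from inside the crawler then becomes a single REST call. The endpoint and parameter names below are assumptions in the spirit of an EasyCron-style API, not its documented interface; look up the real ones before using this.

import requests

CRON_API = "https://api.example-cron.com/add"  # hypothetical endpoint
API_TOKEN = "YOUR_TOKEN"

def schedule_retry(page_url, every_minutes=30):
    # Ask the cron service to hit our own retry webhook on a schedule,
    # passing along the page that timed out.
    callback = f"https://crawler.example.com/retry?page={page_url}"
    response = requests.get(
        CRON_API,
        params={
            "token": API_TOKEN,
            "url": callback,
            "cron_expression": f"*/{every_minutes} * * * *",  # roughly every N minutes
        },
        timeout=15,
    )
    response.raise_for_status()
    return response.json()

schedule_retry("https://example.com/slow-page")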