A lot of programmers think that building a web crawler/scraper is all about the code. It's mostly not. In our opinion, the robustness of your project depends mainly on other factors. Here are the boxes you need to check to build something that truly scales, that you can rely on, that stays consistent, rarely breaks, is easy to diagnose and debug, and doesn't send you a midnight emergency alert.
Use a framework: Use BeautifulSoup or Scrapy or Nutch. Anything. Anything with thousands of lines of code, written by hundreds of coders who have run large web scraping projects for years, to take care of all the weird exceptions that happen when you are dealing with something as unpredictable as the web. If your scraper is hand-coded, I am sorry to say this, but you have hard times coming.
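To make that concrete, here is a minimal sketch of a Scrapy spider. The site, selectors, and field names below are illustrative placeholders (the Scrapy sandbox site), not something from this article; the point is how little you have to write when the framework handles retries, redirects, and broken HTML for you.

```python
# Minimal Scrapy spider sketch; site and selectors are placeholders.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # You describe what to extract; Scrapy schedules, retries,
        # throttles, and follows links for you.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination if it exists.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```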
Take a lot of measures to pretend to be human: send realistic headers, rotate user agents, and randomize the delays between requests.
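Here is a minimal sketch of those "act human" measures using requests. The User-Agent strings, headers, and delay range are assumptions for illustration; tune them to your own crawl.

```python
# Sketch: rotate a realistic User-Agent and add human-ish pauses.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def polite_get(url):
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    # Wait a random, human-ish interval before each request.
    time.sleep(random.uniform(2, 6))
    return requests.get(url, headers=headers, timeout=30)
```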
Build multiple checks into your web crawler that monitor its health and fire an alert to you when something looks wrong.
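One way to do this is a post-crawl health check that compares what you got against what you expect and pushes an alert when the numbers look off. The webhook URL, field name, and thresholds below are hypothetical; wire it to whatever alerting you actually use (email, Slack, PagerDuty).

```python
# Sketch of a crawl health check; the webhook and thresholds are placeholders.
import requests

ALERT_WEBHOOK = "https://example.com/alerts"  # placeholder endpoint


def check_crawl_health(items, expected_min=100):
    problems = []
    # Far fewer items than usual often means you were blocked or the site changed.
    if len(items) < expected_min:
        problems.append(f"only {len(items)} items scraped, expected >= {expected_min}")
    # A spike in empty fields usually means a selector silently broke.
    empty = sum(1 for item in items if not item.get("title"))
    if items and empty > len(items) * 0.1:
        problems.append(f"{empty} items missing 'title', selectors may have broken")
    if problems:
        requests.post(
            ALERT_WEBHOOK,
            json={"text": "Crawler health check failed: " + "; ".join(problems)},
            timeout=10,
        )
    return not problems
```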
Here is a list of places where your web crawler will probably fail; a sketch of how to handle several of these follows the list.
- When the web pages don't load
- When your internet connection is down
- When the content at the URL has moved
- When you are shown a CAPTCHA challenge
- When the web page changes its HTML and your scraping stops working
- When some fields you scrape are empty some of the time and there is no handler for that
- When the web pages take a long time to load
- When the website has blocked you completely
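Here is a defensive fetch-and-parse sketch that covers several of the failure modes above: timeouts, transient errors, moved or missing pages, and occasionally empty fields. The URL handling and selectors are illustrative assumptions, not a complete solution.

```python
# Sketch: timeouts, retries with backoff, moved/missing pages, empty fields.
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))


def fetch_title(url):
    try:
        # Hard connect/read timeouts so slow pages don't hang the whole crawl.
        resp = session.get(url, timeout=(5, 30))
    except requests.RequestException as exc:
        # Covers DNS failures, dropped connections, and exhausted retries.
        return {"url": url, "error": str(exc)}
    if resp.history:
        # The content moved; note the redirect so you can update your seed URLs.
        print(f"{url} redirected to {resp.url}")
    if resp.status_code in (404, 410):
        return {"url": url, "error": f"page gone ({resp.status_code})"}
    soup = BeautifulSoup(resp.text, "html.parser")
    tag = soup.select_one("h1.title")
    # Fields can be empty some of the time; record None instead of crashing.
    return {"url": url, "title": tag.get_text(strip=True) if tag else None}
```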
Just use a professional third-party rotating proxy service to avoid the inevitable IP block. There is no getting around it. We have tried for years.
We have tried it all. That's why we built Proxies API. It rotates proxies across a pool of a couple of million private residential IPs, rotates user agents, retries requests automatically, solves CAPTCHAs, and renders AJAX pages. These are problems you should not be solving and won't be able to solve without massive investment.
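From your code's side, using a rotating proxy usually just means pointing requests at the provider's gateway. The endpoint, credentials, and port below are placeholders; use whatever your provider's documentation specifies.

```python
# Sketch: route requests through a rotating-proxy gateway (placeholder values).
import requests

PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"  # placeholder gateway


def fetch_via_proxy(url):
    # The gateway picks a different outbound IP per request on the provider's side.
    return requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        timeout=60,
    )


html = fetch_via_proxy("https://example.com/").text
```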
Rotating user agents? Yes, it will cut you some slack for a while, but your IP address is right there for them to block, and they will.
Using 4-5 IPs from different machines or servers? Sure, for a while. After that, they will ALL get blocked.
Using publicly available proxies? Sure. They have a life span of about five minutes, because every other hacker around the world is onto them.