Feb 9th, 2021

The Differences Between Newbie & Pro-Level Web Scraper Coders

1. Checks & Balances

Newbie
A newbie uses no checks and balances. If it works on my machine, it should work in production.

Pro
a. A pro has carefully looked at every imaginable breaking point in the code and asks whether any of them can bring the whole operation down: the web server blocking his IP, rate limiting him, or changing its markup, the internet going down, the disk filling up, and so on.
b. A pro builds in alerts and packs the essential debugging information into them so failures are easy to diagnose (see the first sketch at the end of this post).

2. Code & Architecture

Newbie
A newbie spends too much time on code and too little time on the architecture.

Pro
A pro spends a lot of time researching and experimenting with different frameworks and libraries like Scrapy, Puppeteer, Selenium, Beautiful Soup, etc. to see which one suits his current needs best.

3. Framework

Newbie
A newbie doesn't use a framework because it is not in his 'favorite' programming language, and writes code without following any best practices.

Pro
A pro knows that a framework may have a small learning curve, but that cost is quickly offset by all the abstractions it provides.

4. Being Like a Bot

Newbie
A newbie doesn't work hard enough on 'pretending to be human'.

Pro
A pro's scraper behaves more human than an actual human: randomized delays, realistic headers, believable navigation (see the request sketch at the end of this post).

5. Choosing a Proxy

Newbie
A newbie uses the free proxy servers available on the internet.

Pro
A pro doesn't want a free lunch. If the project is important, he knows he cannot realistically build and maintain a rotating proxy infrastructure himself, so he opts for a service like Proxies API.

6. Expect the Unexpected

Newbie
A newbie doesn't factor in that the target website might change its code.

Pro
A pro expects it. He puts a timestamp on every website he has written a scraper for and writes a 'Hello World' test case for each one that should pass no matter what; if it ever fails, he sends himself an alert to update his code (see the canary-test sketch at the end of this post).

7. Scraping Process

Newbie
A newbie uses RegEx or some similarly rudimentary way to scrape data.

Pro
CSS selectors or XPath are the way to retrieve data predictably: the target HTML can change in many ways and the code will probably still work (see the selector sketch at the end of this post).

8. Normalization of Data

Newbie
A newbie doesn't normalize the data he downloads.

Pro
Downloading from multiple websites means duplicate records, the same data in multiple formats, etc. A pro puts in normalization code to make sure the end data looks as uniform as possible (see the normalization sketch at the end of this post).

9. Crawling Speed

Newbie
A newbie doesn't work on scaling his spiders: using concurrency, running multiple spiders with Scrapyd, or using rotating proxies to make more requests per second.

Pro
A pro is always looking to make the crawling process faster and more reliable (see the concurrency sketch at the end of this post).

10. IP Blockage

Newbie
A newbie doesn't believe he will ever get IP blocked, until he is.

Pro
A pro treats it as almost guaranteed, especially on big sites like Amazon, Reddit, Yelp, etc. He puts in measures like Proxies API (rotating proxies) to negate this risk almost completely (see the rotating-proxy sketch at the end of this post).

The author is the founder of Proxies API, a proxy rotation API service.
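Sketches referenced above

For point 1 (Checks & Balances): a minimal sketch of a fetch wrapper that pushes essential debugging info to an alerting endpoint whenever something breaks. The webhook URL and the send_alert helper are placeholders, not part of the original post; wire them to whatever alerting channel you actually use.

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

# Placeholder: point this at your Slack/PagerDuty/email-gateway webhook.
ALERT_WEBHOOK = "https://example.com/alert-webhook"

def send_alert(message):
    """Push essential debugging info to an alerting endpoint (placeholder)."""
    try:
        requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)
    except requests.RequestException:
        log.exception("Could not deliver alert: %s", message)

def fetch(url):
    """Fetch a page; on any failure, alert with the context needed to debug it."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        send_alert(f"Scraper failure: url={url} error={exc!r}")
        raise
```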
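For point 4 (Being Like a Bot): a rough illustration of 'pretending to be human' with randomized delays and browser-like headers. The User-Agent strings and the google.com Referer are stand-ins for whatever your own traffic should look like; this is one possible approach, not the author's exact setup.

```python
import random
import time
import requests

# Assumption: a small pool of real-looking User-Agent strings; keep these current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]

def polite_get(url, session=None):
    """Fetch a URL with a randomized pause and browser-like headers."""
    session = session or requests.Session()
    time.sleep(random.uniform(2, 7))  # pause as if a person were reading the previous page
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",  # plausible navigation source
    }
    return session.get(url, headers=headers, timeout=30)
```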
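For point 6 (Expect the Unexpected): one way to implement the 'Hello World' test case per site. The URL and selector in CANARIES are hypothetical; the idea is that the check is trivial enough that it should always pass unless the target markup has changed.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical targets: map each site you scrape to a selector that should always match.
CANARIES = {
    "https://example.com/products": "div.product-title",
}

def run_canaries():
    """Re-run a trivial extraction per site; flag any site where it stops passing."""
    for url, selector in CANARIES.items():
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        if not soup.select(selector):
            # send_alert(...) from the earlier sketch would go here
            print(f"ALERT: selector {selector!r} no longer matches on {url} - markup changed?")

if __name__ == "__main__":
    run_canaries()
```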
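For point 7 (Scraping Process): the same made-up HTML fragment extracted once with a CSS selector (Beautiful Soup) and once with XPath (lxml), the two approaches the post recommends over RegEx.

```python
from bs4 import BeautifulSoup
from lxml import html

PAGE = """
<div class="listing">
  <span class="price">$19.99</span>
  <h2 class="title">Blue Widget</h2>
</div>
"""

# CSS selector via Beautiful Soup
soup = BeautifulSoup(PAGE, "html.parser")
price = soup.select_one("div.listing span.price").text

# The same data via XPath with lxml
tree = html.fromstring(PAGE)
title = tree.xpath('//div[@class="listing"]/h2[@class="title"]/text()')[0]

print(price, title)  # $19.99 Blue Widget
```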
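For point 8 (Normalization of Data): a sketch of the kind of normalization and de-duplication code meant here. The field names and accepted formats are assumptions; real scrapers will need site-specific rules.

```python
from datetime import datetime

def normalize_price(raw):
    """'$1,299.00', '1299 USD', ' 1,299 '  ->  1299.0"""
    cleaned = raw.replace("$", "").replace("USD", "").replace(",", "").strip()
    return float(cleaned)

def normalize_date(raw):
    """Accept a few common formats seen across sites and emit ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {raw!r}")

def deduplicate(records, key="url"):
    """Drop records whose key field has already been seen."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique
```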
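For point 9 (Crawling Speed): a minimal thread-pool example that fetches pages concurrently instead of one after another (error handling omitted for brevity). In Scrapy the equivalent knob is a setting such as CONCURRENT_REQUESTS; the URLs below are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

URLS = [f"https://example.com/page/{i}" for i in range(1, 51)]  # placeholder targets

def fetch(url):
    return url, requests.get(url, timeout=30).status_code

# Run many requests in parallel rather than sequentially.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, url) for url in URLS]
    for future in as_completed(futures):
        url, status = future.result()
        print(url, status)
```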
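For points 5 and 10 (proxies and IP blockage): how a rotating proxy is typically plugged into requests. The gateway address and credentials are hypothetical; your provider's documentation (Proxies API included) will have the exact endpoint and call format.

```python
import requests

# Hypothetical gateway: a rotating-proxy service usually hands you one endpoint
# that routes each request through a different exit IP.
PROXY = "http://username:password@rotating-gateway.example.com:8080"

def fetch_via_proxy(url):
    """Route the request through the rotating proxy so repeated requests
    come from different IPs and per-IP blocks or rate limits stop mattering."""
    return requests.get(
        url,
        proxies={"http": PROXY, "https": PROXY},
        timeout=60,
    )

resp = fetch_via_proxy("https://httpbin.org/ip")
print(resp.text)  # the reported origin IP should differ between runs
```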