Web scraping refers to the process of systematically crawling or spidering one or more websites and then extracting relevant data that is usually not intended for consumption by software programs. The difficulty lies in the fact that the web is haphazard, originally designed for humans, varied in implementation, and the relevant data is rarely defined semantically.
So here is a comprehensive look at all the approaches to web crawling and web scraping.
Before you embark on a web scraping project, you would be well advised to know all the options you have at your fingertips. We wish we had such a resource when we started doing web crawling projects 18 years ago. In this article, we would like to bring everything there is to know together at a high level (understanding only) so you can make the right decisions on the stack to use, the approach to take, the pitfalls to be wary of, and the best practices to know.
There are four main approaches to Web Scraping:
- Web scraping from the browser
- Specialized web scraping
- Web scraping using automated tools
- Web scraping using programming languages
Scraping inside Google Chrome using plugins:
If you want to scrape only a few pages once in a while, scraping directly from the browser could be a great idea.
Here are a few that do the trick:
Web Scraper Chrome plugin:
This plugin allows you to scrape multiple pages using CSS selectors and XPath and enables you to export the output as a CSV file. It has around 300k users on Chrome.
Get it Here.
Scraper plugin
Uses XPath to extract data from the website to Excel. Get it Here.
We use this plugin all the time at Proxies API for quickly pulling out data from places like Crunchbase, AngelList, etc.
Data scraper plugin
Data Scraper offers 500 free scrapes per month and exports scraped data into Excel. Get it Here.
Instant Data Scraper
Instant Data Scraper is an automated data extraction tool for any website. It uses AI to predict which data is most relevant on an HTML page and allows saving it to Excel or CSV files (XLS, XLSX, CSV).
Instant Data Scraper claims to have been tested on YP, Yelp, eBay, Amazon, etc.
Get it Here.
Specialized Web Scraping
Scraping data from LinkedIn:
If you need to scrape data from LinkedIn, here are the options:
- Phantombuster | Mainly useful for scraping LinkedIn profiles
- Oxylabs | LinkedIn data collection
- eMail Prospector | Extracts LinkedIn leads and finds email addresses
- One2Lead | Server-side automation tool
- Estrattoredati | LinkedIn email extraction
- Egrabber | Extracts LinkedIn leads and finds contact info
- GetProspect | Email finder along with LinkedIn profile URL, prospect name, position, and company
- Skrapp | B2B email finder and lead extractor Chrome extension
- Expandi | Software for LinkedIn automation
Generic Email scraper:
If, however, you already know the name and the organization of the person whose email you are looking for, then the easiest thing to do is to use pattern matching to guess the address and a pseudo email send to verify it. There are many tools for that:
One of our favorites is open source, which helps if you have some development resources. Even otherwise, you can use it for free Here, and the source code is Here.
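To make the pattern-matching half of that approach concrete, here is a minimal Python sketch that generates the most common corporate address patterns from a name and a domain. The pattern list and function name are our own illustration rather than anything taken from the tools above, and the verification step is only described in a comment because a reliable SMTP check needs to handle greylisting and catch-all domains.

def guess_email_patterns(first, last, domain):
    """Return common corporate email address guesses (illustrative only)."""
    f, l = first.lower(), last.lower()
    return [
        f"{f}.{l}@{domain}",    # john.smith@example.com
        f"{f}{l}@{domain}",     # johnsmith@example.com
        f"{f[0]}{l}@{domain}",  # jsmith@example.com
        f"{f}@{domain}",        # john@example.com
        f"{f}_{l}@{domain}",    # john_smith@example.com
        f"{l}.{f}@{domain}",    # smith.john@example.com
    ]

# Each guess would then be verified with a "pseudo send": connect to the
# domain's MX server over SMTP and issue RCPT TO without sending a message.
print(guess_email_patterns("John", "Smith", "example.com"))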
Scrape emails from websites:
Another technique is to find all the email addresses published anywhere on the web for a given website, when you know only the site and not the names of the people working there.
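Under the hood, tools like these crawl pages and pull out anything that matches an email pattern. Here is a minimal single-page Python sketch of the idea (the URL and regex are illustrative only); real tools crawl the whole site and search engine results as well.

import re
import requests

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrape_emails(url):
    """Fetch a single page and return the unique email addresses found in it."""
    html = requests.get(url, timeout=10).text
    return sorted(set(EMAIL_RE.findall(html)))

print(scrape_emails("https://example.com/contact"))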
There are quite a few tools that do this at scale. Here are the top 3:
Web Scraping using Automated Tools
These are services that host the web scrapers themselves on their cloud, so you have nothing to run and nothing to deploy.
Sometimes you want to scrape a large set of data regularly, and the Chrome plugins won't cut it. You will need an automated, online, cloud-based data scraping setup. Here are a couple of options:
Octoparse: Provides scalable solutions for both regular and enterprise companies. Can handle AJAX, has a wizard mode to make things easier, and can scale with simultaneous connections.
ParseHub: Uses machine learning to scrape data efficiently. Has a desktop app and a Chrome extension, and you can set up five scraping tasks for free.
Visual Web Scraper: Not a cloud solution, but its Windows desktop app can scale and run on a schedule. It also offers up to 50,000 web pages in the free plan itself.
AI Web scraping:
The creators of Scrapy have a beta program where they are trying to scale data extraction (without having to write custom extractor code by hand for each website) by using AI, focused explicitly on the areas of e-commerce and media scraping. Their new developer extraction API is designed for article extraction and real-time e-commerce scraping.
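As a rough illustration of how a hosted extraction API of this kind is typically called (the endpoint, payload fields, and key below are placeholders we made up for the sketch, not the vendor's actual API, so check their documentation), the request/response flow tends to look like this:

import requests

# Hypothetical endpoint and credentials, for illustration only.
API_ENDPOINT = "https://autoextract.example.com/v1/extract"
API_KEY = "YOUR_API_KEY"

# Ask the service to extract a structured article from a URL
query = [{"url": "https://example.com/some-article", "pageType": "article"}]
resp = requests.post(API_ENDPOINT, auth=(API_KEY, ""), json=query, timeout=60)

# The service returns structured fields (headline, author, body, ...)
# instead of raw HTML, so no per-site extractor code is needed.
for result in resp.json():
    print(result.get("article", {}).get("headline"))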
Web scraping in various languages:
Cheap web scraping with AWS
If you are a developer, you can set up a quick and dirty serverless web scraping stack using AWS Lambda and Chalice. Here is an excellent example of how to do it in Python.
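Here is a rough sketch of what such a Lambda-backed scraper can look like with Chalice; the app name, route, and the page being fetched are our own placeholders, and a real project would use a proper HTML parser instead of a regex.

# app.py - deploy with "chalice deploy" (assumes the Chalice CLI is installed)
import re
import urllib.request

from chalice import Chalice

app = Chalice(app_name="tiny-scraper")  # app name is our own example

@app.route("/title")
def scrape_title():
    # Fetch a page inside the Lambda function and pull out its <title> tag
    html = urllib.request.urlopen("https://example.com").read().decode("utf-8")
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return {"title": match.group(1).strip() if match else None}

Running chalice deploy then gives you an HTTPS endpoint backed by Lambda, so the scraper runs only when it is called and you pay per invocation.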
Node JS web scraping
jQuery experts can use the server-side jQuery implementation, Cheerio JS.
Using it means you can use standard CSS selectors to search for elements and extract data from them.
You can do that as simply as in the example below:
const cheerio = require('cheerio')

// Load an HTML snippet - this returns a Cheerio instance
const $ = cheerio.load('<p id="example">These are the droids you are looking for</p>')

// We can use the same calls as jQuery
const txt = $('#example').text()
console.log(txt)
// will print: "These are the droids you are looking for"
Find out more here in the official documentation.
Or get started using this excellent article.
If you want to speed up learning and study practical, working code, you can check out a New York Times scraper called Mongo Scraper Here.
Web Scraping in Golang:
The Colly library provides a super simple way to scrape data from any website in a structured way.
Here is the code to find all the links in a web page:
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}
Visit the Colly framework home page.
Simple web scraping using jQuery
jQuery can be used to quickly load AJAX content, then parse it and find the elements you want, like below:
// Get HTML from page
$.get('http://example.com/', function(html) {
    // Loop through elements you want to scrape content from
    $(html).find("ul").find("li").each(function() {
        var text = $(this).text();
        // Do something with content
    });
});
Web Scraping in PHP
Goutte is one of the best libraries to make complex tasks like extracting data, clicking on links, and even submitting forms super simple.
Once you include Goutte like so:
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;
$client = new Client(HttpClient::create(['timeout' => 60]));
You can then use code as simple as this:
Clicking on links:
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
To extract the data:
// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
print $node->text()."\n";
});
To submit forms:
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
print $node->text()."\n";
});
Web scraping with Python (using Scrapy)
Why Scrapy?
Scrapy allows you to:
- Scale to a large number of pages and websites easily.
- Abstract away all the complex fetching and parsing operations.
- Run the crawler in multiple processes.
For example, to fetch all the links on a website, all you have to do is:
import scrapy

class ExtractEverything(scrapy.Spider):
    name = "extract"

    def start_requests(self):
        urls = ['https://proxiesapi.com']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield every link found on the page
        for href in response.css('a::attr(href)').extract():
            yield {'link': response.urljoin(href)}
You can also extract data using XPath very quickly. The code below retrieves all the articles in the Mashable news feed.
import scrapy

class MySpider(scrapy.Spider):
    name = 'Myspider'

    # list of allowed domains
    allowed_domains = ['mashable.com']

    # starting url for scraping
    start_urls = ['https://mashable.com/category/rss/']

    # setting the format and location of the output csv file
    custom_settings = {
        'FEED_URI': 'mashable.csv',
        'FEED_FORMAT': 'csv'
    }

    def parse(self, response):
        response.selector.remove_namespaces()

        # Extract each article's title and link from the RSS feed
        titles = response.xpath('//item/title/text()').extract()
        links = response.xpath('//item/link/text()').extract()

        for item in zip(titles, links):
            scraped_info = {
                'title': item[0],
                'link': item[1]
            }
            yield scraped_info
Web scraping in C:
If you want the raw speed of C and don't mind the extra coding overhead, you can use it to download and parse HTML in any way you want.
There is libcurl to get the URLs, libtidy to convert to valid XML, and finally, libxml to parse it and extract data from it. You can also use Qt as shown Here.
Here is a fast, threaded library that abstracts this process to some extent. Get it Here.
Puppeteer Web Scraping:
It's a reality that many websites load content dynamically using AJAX calls back to the server, so the content is not visible in the original HTML. If you have to scrape this kind of web page, you need to render the DOM and execute all the JavaScript on it.
In these cases, one of the options is to use the Puppeteer Node library.
Puppeteer uses Chromium behind the scenes to render the web page and allows you to control and extract the contents inside it. It also has the added advantage of running headless (which means no browser window is visible, but for all practical purposes a full browser is running), and you can even take visual screenshots of a webpage.
For example, here is how you can render the Hacker News home page and save it as a PDF:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
await page.pdf({path: 'hn.pdf', format: 'A4'});
await browser.close();
})();
You can try this code Here, or get the Puppeteer library Here.
If you want to use Puppeteer at scale, you can use the browserless automation tool.
Browserless takes care of all the dependencies, sandboxing, and management of the web browser, allowing it to scale to hundreds of ready instances inside a Docker container. You can connect remotely using Puppeteer and control browser instances to extract data at scale. We use browserless and Puppeteer behind the scenes in one of the functions we offer at Proxies API to serve hundreds of customers a few hundred thousand web pages a day via our rotating proxies.
Programmatically scraping social networks like Facebook, LinkedIn, Twitter, etc.
There are ready-made tools that you can use to do this, but in almost all these cases, you are likely to get IP blocked by the big networks. It's best to use a professional rotating proxy service like Proxies API along with the libraries below.
Facebook scraper:
Ultimate Facebook Scraper is a comprehensive open-source tool written in Python that scrapes almost everything about a user's Facebook profile: uploaded photos, tagged photos, videos, the friends list and their profile photos, and all public posts/statuses available on the user's timeline.
LinkedIn scraper:
A couple of open source tools do this well:
Twitter Scraper:
Twitter has an API, but it is notorious for usage and rate limits.
These two open-source libraries help you bypass them easily. Use them with a rotating proxy like Proxies API if you need to scale.
Instagram Scraper:
This one is in Python. Get it Here.
and this is in PHP. Get it Here.
Reddit Scraper:
Overcoming IP Blocks
The best way to overcome the inevitable IP block in any web scraping project is by using a Rotating Proxy Service like Proxies API.
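If you are scraping from code, a rotating proxy usually plugs in as a standard HTTP/HTTPS proxy. Here is a minimal Python sketch of the idea; the proxy host, port, and credentials below are placeholders, so substitute whatever your provider gives you.

import requests

# Placeholder credentials and endpoint - every provider documents its own.
PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8080"
proxies = {"http": PROXY, "https": PROXY}

# Each request goes out through the proxy pool, so the target site sees a
# rotating set of IP addresses instead of repeatedly seeing (and blocking) yours.
resp = requests.get("https://news.ycombinator.com", proxies=proxies, timeout=30)
print(resp.status_code, len(resp.text))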