Cheerio is a fast, flexible library for parsing and manipulating HTML in Node.js, and it is widely used for web scraping. This cheat sheet is a quick reference to its most common syntax and capabilities.
Installation
Install via npm:
npm install cheerio
Or Yarn:
yarn add cheerio
Loading HTML
Load markup into Cheerio for parsing:
From String:
const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello</h2>');
From File:
const fs = require('fs');
const $ = cheerio.load(fs.readFileSync('index.html', 'utf8'));
From URL:
const axios = require('axios');
const $ = cheerio.load((await axios.get('https://example.com')).data); // the HTML is in .data
From JSON (Cheerio treats the string as plain text inside <body>; it does not parse the JSON):
const data = {foo: 'bar'};
const $ = cheerio.load(JSON.stringify(data));
Selectors
Query DOM elements using CSS selector syntax:
IDs:
$('#my-id');
Classes:
$('.my-class');
Tags:
$('ul'); // <ul>
$('li'); // <li>
Attributes:
$('a[target=_blank]');
Multiple Classes:
$('.class1.class2');
Wildcards:
$('*'); // All elements
Chained:
$('.outer').find('.inner');
Pseudo Selectors:
$('a:first');
$('div:last');
$('li:nth-child(3)');
$('a:contains("text")');
DOM Traversal
Navigate between nodes:
Parents:
$('.child').parent();
Children:
$('.parent').children();
Siblings:
$('.first-child').next();
$('.last-child').prev();
Filtering:
$('.parent').filter('.special').text();
Traverse Up:
$('.child').closest('.ancestor');
$('.child').parentsUntil('.grandparent');
Traverse Down:
$('.parent').find('.child');
DOM Manipulation
Modify elements and content:
Set Text:
$('h1').text('New Text');
Set HTML:
$('button').html('<b>Save</b>');
Add Class:
$('.box').addClass('blue');
Remove Class:
$('.box').removeClass('blue');
Toggle Class:
$('.box').toggleClass('highlighted');
Set Attributes:
$('input[type="text"]').attr('name', 'username');
Append:
$('ul').append('<li class="new">New</li>');
Prepend:
$('ul').prepend('<li class="new">New</li>');
Before:
$('li.third').before('<li class="second">Second</li>');
After:
$('li.third').after('<li class="fourth">Fourth</li>');
Remove:
$('.deleted').remove();
Wrap Inner:
$('.message').wrapInner('<b></b>');
Unwrap:
$('b').unwrap();
Information
Extract info from elements:
Text:
$('h1').text();
HTML:
$('div').html();
Value:
$('input[name=first_name]').val();
Attribute:
$('a').attr('href');
Data Attribute:
$('.user').data('id');
Looping
Iterate through elements:
Each:
$('li').each((i, el) => {
  // element logic
});
Map:
const urls = $('li a').map((i, el) => $(el).attr('href')).get();
Reduce:
// Cheerio selections have no reduce(); convert to a plain array first
const total = $('.product').toArray().reduce((sum, el) => {
  const price = Number($(el).data('price'));
  return sum + price;
}, 0);
Filter:
const special = $('.product').filter((i, el) => {
  return $(el).data('special');
}).get();
Output
Render final output:
Full HTML:
$.html();
Outer HTML (the element itself plus its contents; .html() on a selection returns only the inner HTML):
$.html($('.box'));
Text:
$('.message').text();
JSON:
JSON.stringify($('.box').map((i, el) => {
  // map to object
}).get());
Save File:
fs.writeFileSync('page.html', $.html());
HTTP Response:
res.send($.html());
Plugins
Extend functionality:
Images:
const images = require('cheerio-image-loader')
images($, '.product img')
.then(/* ... */)
Videos:
const videos = require('cheerio-video')
videos($).attr('src', 'https://example.com/trailer.mp4')
SVG:
const svg = require('cheerio-svg-parser')
svg.parse($.html()).svg() // SVG DOM
Debugging
Log and inspect output:
Elements:
console.log($('.item'));
HTML:
console.log($.html());
JSON:
console.log(JSON.stringify($('.item').map((i, el) => {
  return $(el).text();
}).get()));
Node REPL:
const repl = require('repl');
repl.start('> ').context.$_ = $;
Rate Limiting
Control request speed:
Simple Delay:
await new Promise(resolve => setTimeout(resolve, 1000));
Queue:
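import PQueue from 'p-queue'; // p-queue v7+ ships as an ES module
const queue = new PQueue({ concurrency: 2 }); // at most 2 tasks run at once
queue.add(async () => {
  // Request code
});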
Bottleneck:
const Bottleneck = require('bottleneck');
const limiter = new Bottleneck({
  minTime: 1000 // wait at least 1000 ms between job starts
});
limiter.schedule(async () => {
  // Request code
});
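When a dependency is not wanted, the simple-delay idea can be wrapped in a small helper that runs requests one at a time. This is a minimal sketch using only built-ins; runWithDelay and the stand-in tasks are hypothetical names, and real tasks would perform HTTP requests.

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run async tasks one at a time, waiting delayMs between consecutive tasks.
async function runWithDelay(tasks, delayMs) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(delayMs); // no delay before the first task
    results.push(await tasks[i]());
  }
  return results;
}

// Usage with stand-in tasks instead of real network calls.
const demo = runWithDelay(
  [async () => 'page 1', async () => 'page 2', async () => 'page 3'],
  100
);
demo.then((pages) => console.log(pages)); // [ 'page 1', 'page 2', 'page 3' ]
```

Running strictly in sequence is the most conservative option; a queue or Bottleneck is a better fit once some concurrency is acceptable.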
Caching
Save responses:
In-Memory:
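const cache = new Map();
async function fetchCached(url) {
  // Return the cached body if this URL was fetched before
  if (cache.has(url)) return cache.get(url);
  const resp = await fetch(url);
  const html = await resp.text(); // cache the body text, not the Response object
  cache.set(url, html);
  return html;
}
const html = await fetchCached('https://example.com');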
Redis:
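const redis = require('redis');
const client = redis.createClient();
async function fetchCachedRedis(url) {
  if (!client.isOpen) await client.connect(); // node-redis v4+ needs an explicit connect
  const key = `cache:${url}`;
  const cached = await client.get(key);
  if (cached) return cached;
  const html = await (await fetch(url)).text(); // cache the body text, not the Response object
  await client.set(key, html, { EX: 3600 }); // expire after one hour
  return html;
}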
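An in-memory cache usually also needs an expiry so that stale pages are eventually refetched. The following sketch adds a time-to-live around any loader function; withCache, loadText, and fakeLoad are hypothetical names, and the stand-in loader avoids real network traffic.

```javascript
// Wrap any async loader with an in-memory cache whose entries expire after ttlMs.
// loadText is a hypothetical function that takes a URL and resolves to a string.
function withCache(loadText, ttlMs) {
  const entries = new Map(); // url -> { value, expiresAt }
  return async function cachedLoad(url) {
    const hit = entries.get(url);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // still fresh
    const value = await loadText(url);
    entries.set(url, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}

// Stand-in loader that counts calls instead of touching the network.
let calls = 0;
const fakeLoad = async (url) => {
  calls += 1;
  return `<html>${url}</html>`;
};

const load = withCache(fakeLoad, 60000);
const done = (async () => {
  const first = await load('https://example.com');
  const second = await load('https://example.com'); // served from the cache
  return { first, second, calls };
})();
done.then((result) => console.log(result.calls)); // 1
```

Caching the body text rather than a response object keeps the cached value reusable, since a response body can only be read once.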
Best Practices
Tips for effective web scraping:
Real World Examples
Common use cases:
That covers the most commonly used parts of Cheerio's syntax and capabilities. Keep this reference handy to scrape the web more effectively!