This Node.js code scrapes article data from the Hacker News homepage using the axios and cheerio modules. Here's a high-level overview of what it does:
- Sends a GET request to the Hacker News URL using axios
- Loads the returned HTML content into cheerio
- Parses the page and extracts article data by targeting DOM elements
- Prints out the scraped data to the console
In this beginner tutorial, we'll go through the code step-by-step to understand how it works under the hood.
This is the page we are talking about…
Install Required Node Modules
To run this web scraper, you need to have Node.js installed and install the axios and cheerio modules:
npm install axios cheerio
Axios handles making the HTTP requests while cheerio allows querying and manipulating the returned HTML just like jQuery.
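To see what "just like jQuery" means in practice, here is a tiny standalone sketch (the HTML snippet is made up for illustration):

const cheerio = require('cheerio');

// Load a fragment of HTML and query it with CSS selectors
const $ = cheerio.load('<ul><li class="item">One</li><li class="item">Two</li></ul>');

$('.item').each((index, el) => {
  console.log($(el).text()); // prints "One", then "Two"
});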
Make HTTP Request for Hacker News Page
First, we import the axios and cheerio modules:
const axios = require('axios');
const cheerio = require('cheerio');
Next, we define the URL of the page we want to scrape - the Hacker News homepage:
const url = 'https://news.ycombinator.com/';
We make a GET request using axios to fetch the content of this URL:
axios.get(url)
The axios .get() method returns a promise that resolves with a response object containing the status code and response data.
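For example, the basic shape of the request and its handlers looks like this (a minimal sketch, separate from the full scraper below):

const axios = require('axios');

axios.get('https://news.ycombinator.com/')
  .then((response) => {
    console.log(response.status);      // e.g. 200
    console.log(response.data.length); // length of the raw HTML string
  })
  .catch((error) => {
    console.error('Request failed:', error.message);
  });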
Load HTML and Parse with Cheerio
In the .then() handler, we check if the status code is 200, meaning the request succeeded.
We then load the HTML content from response.data into Cheerio using the cheerio.load() method. This parses the document and allows us to query it with a jQuery-style syntax.
if (response.status === 200) {
  const $ = cheerio.load(response.data);
}
Find Table Rows to Scrape
Inspecting the page with your browser's developer tools, you can notice that the items are housed inside tr tags with the class "athing", each followed immediately by a second row holding the points, author, and comment links.

We grab all the tr table row elements on the page, as we'll loop through them to extract article data:

const rows = $('tr');
Track Current Article and Row
As we loop through the rows, we need to keep track of the current article we're extracting data from and whether we're on an article row or a details row:
let currentArticle = null;
let currentRowType = null;
Iterate Through Rows to Scrape Articles
We loop through each of the rows:
rows.each((index, row) => {
  const $row = $(row);
  // row parsing code
});
Inside the loop, we check if the row has a class named "athing", indicating it's an article row:

if ($row.attr('class') === 'athing') {
  currentArticle = $row;
  currentRowType = 'article';
}

If not, we check if we're on a details row following an article row:

} else if (currentRowType === 'article') {
  // Extract article data
}
Extract Article Data
Inside this branch, we extract the data we want, like the title, URL, points, and author, by targeting elements and retrieving their text and attributes:
const titleElem = currentArticle.find('.title');
const titleLink = titleElem.find('a').first();
const articleTitle = titleLink.text().trim();
const articleUrl = titleLink.attr('href');
// ...
Extracting Title
Let's break this down in detail. We first find the elements with class "title" inside currentArticle using .find():

const titleElem = currentArticle.find('.title');

The article row contains two cells with this class - the rank number and the headline - so instead of reading titleElem.text(), which would include the rank, we take the first link inside it, get its text content with .text(), trim any whitespace, and save it to articleTitle:

const titleLink = titleElem.find('a').first();
const articleTitle = titleLink.text().trim();

The key things to understand are that .find() searches only within the selected element, .text() returns an element's text content, and .trim() strips surrounding whitespace.
Extracting URL
For the article URL, we use the same link and access its href attribute:

const articleUrl = titleLink.attr('href');

Here we use .attr() to read an attribute's value rather than an element's text. And similarly for points, author, comments, etc.!
Print Scraped Article Data
Finally, we print out all the extracted data so we can see the scraped article info:
console.log('Title:', articleTitle);
console.log('URL:', articleUrl);
// etc.

This prints the article details to the console, separated by dashes.
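Run against the live page, each article prints as a small block; with illustrative placeholder values, the output looks roughly like this:

Title: Example Article Title
URL: https://example.com/article
Points: 100 points
Author: exampleuser
Timestamp: 2024-01-01T00:00:00
Comments: 42 comments
--------------------------------------------------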
Full Code
Here is the complete code for reference:
const axios = require('axios');
const cheerio = require('cheerio');

// Define the URL of the Hacker News homepage
const url = 'https://news.ycombinator.com/';

// Send a GET request to the URL
axios.get(url)
  .then((response) => {
    // Check if the request was successful (status code 200)
    if (response.status === 200) {
      // Load the HTML content of the page using Cheerio
      const $ = cheerio.load(response.data);

      // Find all rows in the table
      const rows = $('tr');

      // Initialize variables to keep track of the current article and row type
      let currentArticle = null;
      let currentRowType = null;

      // Iterate through the rows to scrape articles
      rows.each((index, row) => {
        const $row = $(row);

        if ($row.attr('class') === 'athing') {
          // This is an article row
          currentArticle = $row;
          currentRowType = 'article';
        } else if (currentRowType === 'article') {
          // This is the details row
          if (currentArticle) {
            const titleElem = currentArticle.find('.title');
            if (titleElem.length) {
              // The article row has two ".title" cells (rank and headline),
              // so read the title and URL from the story link itself
              const titleLink = titleElem.find('a').first();
              const articleTitle = titleLink.text().trim();
              const articleUrl = titleLink.attr('href');
              const subtext = $row.find('.subtext');
              const points = subtext.find('.score').text();
              const author = subtext.find('.hnuser').text();
              const timestamp = subtext.find('.age').attr('title');
              const commentsElem = subtext.find('a:contains("comments")');
              const comments = commentsElem.text().trim() || '0';

              // Print the extracted information
              console.log('Title:', articleTitle);
              console.log('URL:', articleUrl);
              console.log('Points:', points);
              console.log('Author:', author);
              console.log('Timestamp:', timestamp);
              console.log('Comments:', comments);
              console.log('-'.repeat(50)); // Separating articles
            }
          }
          // Reset the current article and row type
          currentArticle = null;
          currentRowType = null;
        } else if ($row.attr('style') === 'height:5px') {
          // This is the spacer row, skip it
          return;
        }
      });
    } else {
      console.log('Failed to retrieve the page. Status code:', response.status);
    }
  })
  .catch((error) => {
    console.error('An error occurred:', error);
  });

This works great as a learning exercise, but it is easy to see that a scraper running from a single IP is prone to getting blocked. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must. Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. The whole thing can be accessed by a simple API like below in any programming language. In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so:
curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
The API responds with the full HTML of the target page:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
...

We have a running offer of 1000 API calls completely free. Register and get your free API key.