Web scraping is the process of programmatically extracting data from websites. This is often done by sending HTTP requests to a target site, then parsing the HTML response to identify and extract relevant information.
In this article, we'll walk through Ruby code that scrapes titles, URLs, vote counts, authors, timestamps, and comment counts from the popular Hacker News site. The code utilizes the Nokogiri library for HTML parsing and OpenURI for sending HTTP requests.
This is the page we are talking about: the Hacker News front page at https://news.ycombinator.com/.
Prerequisites
Before running the web scraper, you'll need Ruby installed along with the Nokogiri library. OpenURI ships with Ruby's standard library, so only Nokogiri needs a separate install:
gem install nokogiri
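If you manage dependencies with Bundler instead, a minimal Gemfile works just as well (a sketch; the gem entry is all you need, version pinning is optional):

# Gemfile
source 'https://rubygems.org'

gem 'nokogiri'

Then run bundle install and you're ready to go.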
Walkthrough
Now let's dive into how the web scraper code works:
require 'open-uri'
require 'nokogiri'
First, we require the two libraries we need: OpenURI to fetch pages over HTTP and Nokogiri to parse the returned HTML.
url = "https://news.ycombinator.com/"
Next, we define the URL of the Hacker News homepage that we want to scrape.
page_content = URI.open(url).read
We use URI.open to send a GET request to the URL and read the response body into a string.
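Some sites reject requests that arrive without a recognizable User-Agent header. OpenURI accepts request headers as options, so a more defensive fetch might look like this (the header value and the error handling are illustrative additions, not part of the original script):

begin
  page_content = URI.open(url, "User-Agent" => "Mozilla/5.0 (compatible; HNScraper/1.0)").read
rescue OpenURI::HTTPError => e
  # e.g. 403 Forbidden or 429 Too Many Requests
  warn "Request failed: #{e.message}"
end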
doc = Nokogiri::HTML(page_content)
Here we pass the HTML content to Nokogiri's HTML parser, which builds a document object that we can query with CSS selectors.
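A quick sanity check that parsing worked, using Nokogiri's at_css to grab the first matching element (the expected output assumes the page title hasn't changed):

puts doc.at_css('title')&.text  # => "Hacker News"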
Inspecting the page
Using your browser's developer tools, you can notice that the items are housed inside <tr> tags with the class athing, each followed by a second row holding the score, author, timestamp, and comment count.

The headlines on Hacker News are contained in table rows, so we begin by selecting every <tr> in the document:

rows = doc.css('tr')

As we loop through the rows, we'll use these variables to keep track of whether we're currently processing an article row or a detail row:
current_article = nil
current_row_type = nil
We iterate through the rows using rows.each. A row whose class is athing is a headline row; we store it and mark the row type so the next iteration knows a details row is coming:

rows.each do |row|
  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'
  elsif current_row_type == 'article'
    # This is the details row
    if current_article
      title_elem = current_article.css('span.title a')
      if title_elem
        article_title = title_elem.text
        article_url = title_elem[0]['href']
        ...
      end
    end
    current_article = nil
    current_row_type = nil
  end
end
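As an aside, Nokogiri's next_element makes the headline/details pairing more direct if you'd rather avoid the state variables. This is only a sketch, and it assumes each tr.athing is immediately followed by its details row:

# Pair each headline row directly with the row that follows it
doc.css('tr.athing').each do |article|
  details = article.next_element  # the details row
  title_link = article.at_css('span.title a')
  next unless title_link && details

  puts title_link.text
  puts details.css('span.score').text
end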
The next row after a headline contains additional details like score, author, etc. So if current_row_type is 'article', we know the row we're looking at is the details row for the headline we just stored.

Focusing now on the key data extraction using CSS selectors:

title_elem = current_article.css('span.title a')
article_title = title_elem.text
article_url = title_elem[0]['href']

Here css('span.title a') matches the anchor element inside the headline's title span. From the matched element, .text gives us the headline text and [0]['href'] gives us the link URL.
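One wrinkle worth noting (not handled in the original script): for Ask HN and other self posts the href is relative, e.g. item?id=12345, so you may want to absolutize it. A small sketch using the standard library's URI.join:

# Turn relative hrefs like "item?id=12345" into absolute URLs
absolute_url = URI.join(url, article_url).to_s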
The code continues on using additional selectors to extract the score, author, and comment count from the details row:

subtext = row.css('td.subtext')
points = subtext.css('span.score').text
author = subtext.css('a.hnuser').text
comments_elem = subtext.css('a:contains("comments")')
comments = comments_elem.any? ? comments_elem.text : '0'

In each case, the selector is scoped to the row's td.subtext cell, and we fall back to a default when the element is missing (stories with no comments have no "comments" link).
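These fields come back as display strings such as "123 points" or "45 comments". If you need numbers, Ruby's String#to_i, which parses leading digits, is enough (a small illustrative step, not in the original script):

points_value = points.to_i      # "123 points" -> 123
comments_count = comments.to_i  # "45 comments" -> 45, "0" -> 0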
And the final output prints each field scraped for every headline:

Title: ...
URL: ...
Points: ...
Author: ...
Full code:

require 'open-uri'
require 'nokogiri'

# Define the URL of the Hacker News homepage
url = "https://news.ycombinator.com/"

# Send a GET request to the URL and read the content
page_content = URI.open(url).read

# Parse the HTML content of the page using Nokogiri
doc = Nokogiri::HTML(page_content)

# Find all rows in the table
rows = doc.css('tr')

# Initialize variables to keep track of the current article and row type
current_article = nil
current_row_type = nil

# Iterate through the rows to scrape articles
rows.each do |row|
  if row['class'] == 'athing'
    # This is an article row
    current_article = row
    current_row_type = 'article'
  elsif current_row_type == 'article'
    # This is the details row
    if current_article
      title_elem = current_article.css('span.title a')
      if title_elem
        article_title = title_elem.text     # Get the text of the anchor element
        article_url = title_elem[0]['href'] # Get the href attribute of the anchor element

        subtext = row.css('td.subtext')
        points = subtext.css('span.score').text
        author = subtext.css('a.hnuser').text
        timestamp = subtext.css('span.age')[0]['title']
        comments_elem = subtext.css('a:contains("comments")')
        comments = comments_elem.any? ? comments_elem.text : '0'

        # Print the extracted information
        puts "Title: #{article_title}"
        puts "URL: #{article_url}"
        puts "Points: #{points}"
        puts "Author: #{author}"
        puts "Timestamp: #{timestamp}"
        puts "Comments: #{comments}"
        puts "-" * 50 # Separating articles
      end
    end

    # Reset the current article and row type
    current_article = nil
    current_row_type = nil
  elsif row['style'] == 'height:5px'
    # This is the spacer row, skip it
    next
  end
end
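A note on the timestamp: the title attribute of span.age holds a raw timestamp string rather than a friendly date. If you want a Ruby Time object, the standard library's Time.parse can usually convert it. A sketch, assuming the attribute stays in a parseable format (Hacker News has changed its markup before, so treat this as illustrative):

require 'time'

# The attribute may carry extra data after the date; keep only the first token
posted_at = Time.parse(timestamp.split(' ').first)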
This is great as a learning exercise, but it is easy to see that even a single proxy server is prone to getting blocked, since it uses one IP. In a scenario where you need thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must; otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, and more automatically for you. The whole thing can be accessed by a simple API like below in any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

which returns the target page's HTML, for example:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
...

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so:

curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
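To plug this into the Ruby scraper from this article, only the fetch line changes. A sketch, where PROXIES_API_KEY is a placeholder environment variable name:

require 'open-uri'
require 'cgi'

api_key = ENV['PROXIES_API_KEY'] # placeholder; use your own key
target  = "https://news.ycombinator.com/"

# Fetch through the rotating proxy API instead of hitting the site directly
page_content = URI.open("http://api.proxiesapi.com/?key=#{api_key}&url=#{CGI.escape(target)}").read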