Web scraping allows you to automatically extract data from websites - it's useful for collecting large volumes of data for analysis. Here we'll scrape article titles and links from the New York Times homepage.
Prerequisites
Before scraping any site, we need:
Ruby installed on your machine
The Nokogiri gem for parsing HTML (net/http ships with Ruby's standard library)
You can install these on Windows, Linux, or macOS.
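Assuming Ruby is already set up, Nokogiri can typically be added with a single RubyGems command (your setup may use Bundler instead):
gem install nokogiri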
Walkthrough
Here's how the NYTimes scraper works:
First, we require the libraries we need:
require 'net/http'
require 'nokogiri'
Next we set the target URL to scrape:
url = 'https://www.nytimes.com/'
We define a user agent header that mimics a browser's request - this helps avoid getting blocked:
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
Using Net::HTTP, we send a GET request to the URL with our headers:
response = Net::HTTP.get_response(URI(url), headers)
We check if the request was successful via the status code:
if response.code == "200"
If good, we parse the HTML using Nokogiri:
doc = Nokogiri::HTML(response.body)
Inspecting the page
We now use Inspect Element in Chrome to see how the page's HTML is structured.
You can see that the articles are contained inside section tags with the class story-wrapper.
We grab article sections using a CSS selector:
article_sections = doc.css('section.story-wrapper')
Within these, we find titles and links via more selectors:
title_element = article_section.at_css('h3.indicate-hover')
link_element = article_section.at_css('a.css-9mylee')
If found, we extract and store them:
article_title = title_element.text.strip
article_link = link_element['href']
article_titles << article_title
article_links << article_link
Finally, we print the scraped data:
article_titles.zip(article_links).each do |title, link|
  puts "Title: #{title}"
  puts "Link: #{link}"
end
And that's the gist of how this scraper works!
Here is the full code:
require 'net/http'
require 'nokogiri'

# URL of The New York Times website
url = 'https://www.nytimes.com/'

# Define a user-agent header to simulate a browser request
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

# Send an HTTP GET request to the URL
response = Net::HTTP.get_response(URI(url), headers)

# Check if the request was successful (status code 200)
if response.code == "200"
  # Parse the HTML content of the page
  doc = Nokogiri::HTML(response.body)

  # Find all article sections with class 'story-wrapper'
  article_sections = doc.css('section.story-wrapper')

  # Initialize lists to store the article titles and links
  article_titles = []
  article_links = []

  # Iterate through the article sections
  article_sections.each do |article_section|
    # Check if the article title element exists
    title_element = article_section.at_css('h3.indicate-hover')
    # Check if the article link element exists
    link_element = article_section.at_css('a.css-9mylee')

    # If both title and link are found, extract and append
    if title_element && link_element
      article_title = title_element.text.strip
      article_link = link_element['href']
      article_titles << article_title
      article_links << article_link
    end
  end

  # Print or process the extracted article titles and links
  article_titles.zip(article_links).each do |title, link|
    puts "Title: #{title}"
    puts "Link: #{link}"
    puts
  end
else
  puts "Failed to retrieve the web page. Status code: #{response.code}"
end
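To try it out, save the script locally and run it with the Ruby interpreter (the filename here is just an example):
ruby nyt_scraper.rb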
Practical Considerations
Handling errors - We check the status code and handle cases where the page wasn't retrieved properly; network failures can also be rescued explicitly, as shown in the sketch after this list.
Adaptability - The CSS selectors could be tweaked to scrape other parts of articles.
Blocking - Rotating user agents helps avoid getting blocked by sites.
Legalities - Do check a website's terms before scraping to avoid issues!
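To illustrate the error-handling point above, here is a minimal sketch of wrapping the request so network failures don't crash the scraper. The timeout values and the exact exception list are assumptions for this example, not part of the original script:
require 'net/http'

url = 'https://www.nytimes.com/'

begin
  uri = URI(url)
  # Explicit timeouts so a slow or unreachable server fails fast
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true,
                             open_timeout: 5, read_timeout: 10) do |http|
    http.get(uri.request_uri, { "User-Agent" => "Mozilla/5.0" })
  end

  if response.code == "200"
    puts "Fetched #{response.body.bytesize} bytes"
  else
    puts "Unexpected status: #{response.code}"
  end
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNREFUSED => e
  puts "Request failed: #{e.class} - #{e.message}"
end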
Key Takeaways
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser making every request!
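One simple way to do that is to keep a small pool of browser-like strings and pick one at random per request. The strings below are illustrative placeholders, not a recommended list:
require 'net/http'

# A small pool of browser-like User-Agent strings (illustrative examples)
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
]

# Pick a different user agent on each request
headers = { "User-Agent" => USER_AGENTS.sample }
response = Net::HTTP.get_response(URI('https://www.nytimes.com/'), headers)
puts response.code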
Go a little further, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.