Step 1: Introduction
Imagine you're researching the best Chinese restaurants in San Francisco, and you want to gather data from Yelp to make an informed decision. Web scraping can be your secret weapon in this quest. In this article, we'll walk through the process of scraping Yelp business listings step by step.
We'll be using Ruby and Nokogiri, powerful tools for web scraping. So, if you're a beginner in web scraping, don't worry; we've got you covered.
Step 2: Set Up the Environment
Before we dive into the code, you'll need to make sure you have Ruby installed on your system. If you haven't already, head over to the official Ruby website (https://www.ruby-lang.org/en/documentation/installation/) to download and install Ruby.
Additionally, we'll be using the Nokogiri gem (library) to parse HTML. The net/http and uri libraries ship with Ruby's standard library, so Nokogiri is the only gem you need to install. Open your terminal and run:
gem install nokogiri
Now, create a new Ruby script file in your preferred code editor and name it something like yelp_scraper.rb (any filename ending in .rb will do).
Step 3: Import Necessary Libraries
In our Ruby script, we'll start by importing the necessary libraries: net/http sends the HTTP requests, uri builds and encodes URLs, and nokogiri parses the HTML we get back.
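At the top of your script, add the same three require lines that appear in the full code at the end of this article:
require 'net/http'
require 'uri'
require 'nokogiri'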
Now, let's move on to the next step.
Step 4: Define the Yelp Search URL
Our first task is to define the URL of the Yelp search page we want to scrape. In our case, we're searching for Chinese restaurants in San Francisco, so our URL looks like this:
url = "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>"
But before we proceed, we need to URL-encode this URL so it can safely be passed along as a query parameter in the next step. The URI.escape method used in older tutorials was removed in Ruby 3.0, so we use URI.encode_www_form_component instead:
encoded_url = URI.encode_www_form_component(url)
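If you want a quick sanity check, print the result; every reserved character (:, /, ?, =, &) should now appear percent-encoded:
puts encoded_url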
Now, let's move on to the next step where we'll handle premium proxies to bypass Yelp's anti-bot mechanisms.
Step 5: Generate the API URL with Premium Proxies
Here's where things get interesting. Yelp, like many websites, has defenses against web scraping. To circumvent these measures and ensure uninterrupted scraping, we'll use premium proxies.
In our code, we construct an API URL that routes our request through ProxiesAPI, passing the encoded Yelp URL along as a parameter:
api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=#{encoded_url}"
Make sure to replace YOUR_AUTH_KEY with your own ProxiesAPI auth key.
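If you'd rather not hard-code the key in your script, a minimal sketch (assuming you export an environment variable named PROXIES_API_KEY yourself; the variable name is illustrative) looks like this:
auth_key = ENV.fetch("PROXIES_API_KEY", "YOUR_AUTH_KEY") # hypothetical variable name
api_url = "http://api.proxiesapi.com/?premium=true&auth_key=#{auth_key}&url=#{encoded_url}"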
Step 6: Set Up Request Headers
Before we make the request to Yelp, we need to simulate a browser request. This is essential to avoid being flagged as a bot. To do this, we define a headers hash:
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" => "en-US,en;q=0.5",
  "Referer" => "https://www.google.com/", # Simulate a referrer
}
Note that we deliberately don't set Accept-Encoding ourselves: Net::HTTP only decompresses gzip/deflate responses automatically when it is allowed to set that header itself, so adding it manually would leave response.body compressed.
These headers make your requests look more like they're coming from a real web browser.
Step 7: Send an HTTP GET Request
With our URL, premium proxies, and headers in place, it's time to send an HTTP GET request to the Yelp search page. We use the Net::HTTP class from Ruby's standard library:
uri = URI.parse(api_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true if uri.scheme == 'https'
request = Net::HTTP::Get.new(uri.request_uri, headers)
response = http.request(request)
We'll also check if the request was successful by examining the response status code.
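For a quick check right after the request, you can look at either the numeric status code or the response class (this is optional; the full script does the check in Step 12):
puts response.code                    # e.g. "200"
puts response.is_a?(Net::HTTPSuccess) # true for any 2xx response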
Step 8: Save the HTML Response
Now that we've made a successful request, we need to save the HTML response to a file for further analysis. In our code, we create a file named yelp_html.html and write the response body into it:
File.open("yelp_html.html", "w", encoding: "utf-8") do |file|
  file.write(response.body)
end
It's important to preserve the data as it is for accuracy and future reference.
Step 9: Parsing HTML with Nokogiri
With the Yelp HTML data saved, we can move on to parsing it with Nokogiri. Nokogiri will help us navigate the HTML structure and extract the information we need.
doc = Nokogiri::HTML(response.body)
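As a quick, optional sanity check that the page parsed, you can print a couple of things from the document (not part of the final script):
puts doc.title            # the page's <title>, if one came back
puts doc.css('a').length  # how many links Nokogiri found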
Now, let's tackle the next step: extracting business information.
Step 10: Extract Business Information
Our goal is to extract details like the business name, rating, number of reviews, price range, and location for each listing on the Yelp page. Let's walk through this step by step.
Inspecting the page
When we inspect the page, we can see that each listing's container div has the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x.
Inside our code, we find all the listings using CSS selectors:
listings = doc.css('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x')
These CSS classes are specific to Yelp's HTML structure and may change over time, so make sure they match the current structure.
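Before looping, it's worth printing how many containers matched (the full script does this too); a count of zero usually means the class names have changed or the page didn't load as expected:
puts listings.length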
Now, we loop through each listing and extract the information field by field, always checking that an element exists before reading it. Here's a detailed breakdown of each extraction:
business_name_elem = listing.at_css('a.css-19v1rkv')
business_name = business_name_elem ? business_name_elem.text : "N/A"
rating_elem = listing.at_css('span.css-gutk1c')
rating = rating_elem ? rating_elem.text : "N/A"
price_range_elem = listing.at_css('span.priceRange__09f24__mmOuH')
price_range = price_range_elem ? price_range_elem.text : "N/A"
span_elements = listing.css('span.css-chan6m')
num_reviews = "N/A"
location = "N/A"
if span_elements.length >= 2
  num_reviews = span_elements[0].text.strip
  location = span_elements[1].text.strip
elsif span_elements.length == 1
  text = span_elements[0].text.strip
  if text.match?(/^\d+$/)
    num_reviews = text
  else
    location = text
  end
end
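If you plan to do more than print, one option is to collect each listing into a hash as you go; a minimal sketch (not part of the original script) initializes a results array before the loop and appends to it after the fields above are extracted:
results = [] # initialize once, before the listings loop

# inside the loop, after extracting the fields above:
results << {
  name: business_name,
  rating: rating,
  reviews: num_reviews,
  price: price_range,
  location: location
}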
Now that we've successfully extracted the business information, it's time to move to the next step: printing this information.
Step 11: Printing Extracted Information
For each business listing, we print the extracted information using the puts method:
puts "Business Name: #{business_name}"
puts "Rating: #{rating}"
puts "Number of Reviews: #{num_reviews}"
puts "Price Range: #{price_range}"
puts "Location: #{location}"
puts "=" * 30
This code ensures that the extracted data is presented clearly.
Step 12: Error Handling
While we've covered a lot of ground, it's crucial to handle potential errors gracefully. In case the request to Yelp fails, we check the response status code and print an error message.
if response.code.to_i == 200
  # Continue with data extraction and printing
else
  puts "Failed to retrieve data. Status Code: #{response.code}"
end
This ensures that if something goes wrong, you'll know about it.
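If you want to go a step further than just reporting the failure, here's a minimal retry sketch (assuming transient errors are worth a couple of attempts; this is not part of the original script):
max_attempts = 3
attempt = 0
begin
  attempt += 1
  response = http.request(request)
  raise "HTTP #{response.code}" unless response.code.to_i == 200
rescue StandardError => e
  if attempt < max_attempts
    sleep attempt # simple linear backoff
    retry
  end
  puts "Giving up after #{attempt} attempts: #{e.message}"
end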
Step 13: Conclusion and Next Steps
Congratulations! You've successfully scraped Yelp business listings using Ruby and Nokogiri while bypassing Yelp's anti-bot mechanisms with premium proxies. The key takeaways: URL-encode the target URL before handing it to the proxy API, send browser-like headers, save the raw HTML before parsing, guard every extraction against missing elements, and remember that Yelp's CSS class names change over time, so your selectors will need occasional maintenance.
Now that you have the data, you can use it for various purposes, such as data analysis, visualization, or simply making an informed decision about where to enjoy some delicious Chinese cuisine in San Francisco.
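For example, here is a minimal sketch of writing the results to a CSV file, assuming you collected each listing into a results array of hashes as suggested in Step 10 (the filename and hash keys are illustrative):
require 'csv'

CSV.open("yelp_chinese_sf.csv", "w") do |csv|
  csv << ["Business Name", "Rating", "Number of Reviews", "Price Range", "Location"]
  results.each do |r|
    csv << [r[:name], r[:rating], r[:reviews], r[:price], r[:location]]
  end
end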
Full Code:
require 'net/http'
require 'uri'
require 'nokogiri'

# URL of the Yelp search page
url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"

# URL-encode the URL (URI.escape was removed in Ruby 3.0)
encoded_url = URI.encode_www_form_component(url)

# API URL with the encoded Yelp URL
api_url = "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=#{encoded_url}"

# Define browser-like headers to simulate a real browser request.
# Accept-Encoding is left unset so Net::HTTP decompresses gzip/deflate for us.
headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" => "en-US,en;q=0.5",
  "Referer" => "https://www.google.com/", # Simulate a referrer
}

# Send an HTTP GET request to the URL with the headers
uri = URI.parse(api_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true if uri.scheme == 'https'
request = Net::HTTP::Get.new(uri.request_uri, headers)
response = http.request(request)

# Save the raw HTML response for later reference
File.open("yelp_html.html", "w", encoding: "utf-8") do |file|
  file.write(response.body)
end

# Check if the request was successful (status code 200)
if response.code.to_i == 200
  # Parse the HTML content of the page using Nokogiri
  doc = Nokogiri::HTML(response.body)

  # Find all the listing containers
  listings = doc.css('div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x')
  puts listings.length

  # Loop through each listing and extract information
  listings.each do |listing|
    # Check if the business name exists
    business_name_elem = listing.at_css('a.css-19v1rkv')
    business_name = business_name_elem ? business_name_elem.text : "N/A"

    # Only print the information if a business name was found
    if business_name != "N/A"
      # Check if the rating exists
      rating_elem = listing.at_css('span.css-gutk1c')
      rating = rating_elem ? rating_elem.text : "N/A"

      # Check if the price range exists
      price_range_elem = listing.at_css('span.priceRange__09f24__mmOuH')
      price_range = price_range_elem ? price_range_elem.text : "N/A"

      # Find all <span> elements inside the listing
      span_elements = listing.css('span.css-chan6m')

      # Initialize num_reviews and location as "N/A"
      num_reviews = "N/A"
      location = "N/A"

      if span_elements.length >= 2
        # The first <span> is the number of reviews, the second is the location
        num_reviews = span_elements[0].text.strip
        location = span_elements[1].text.strip
      elsif span_elements.length == 1
        # With only one <span>, decide whether it holds reviews or a location
        text = span_elements[0].text.strip
        if text.match?(/^\d+$/)
          num_reviews = text
        else
          location = text
        end
      end

      # Print the extracted information
      puts "Business Name: #{business_name}"
      puts "Rating: #{rating}"
      puts "Number of Reviews: #{num_reviews}"
      puts "Price Range: #{price_range}"
      puts "Location: #{location}"
      puts "=" * 30
    end
  end
else
  puts "Failed to retrieve data. Status Code: #{response.code}"
end