Web scraping can be an extremely useful technique for gathering and analyzing data from websites. However, many sites like Yelp have anti-scraping mechanisms in place to prevent automation. This is where using premium proxies comes in handy.
Proxies act as an intermediary between your computer and the target site, masking your identity and evading bot detection. Premium proxies are fast, reliable pools of IP addresses that imitate real organic users. By routing requests through them, we can bypass restrictions and scrape data at scale.
In this tutorial, we'll walk through a full web scraping script for extracting key details on Yelp listings, with proxies enabled for stability. We'll go line-by-line to ensure you understand exactly what's happening behind the scenes.
This is the page we'll be working with: Yelp's search results for Chinese restaurants in San Francisco.
I know when I first started out, selectors were the most confusing part - so we'll take extra care there. Whether you ultimately want to analyze trends across top-rated Chinese restaurants or collect pricing info to improve your own menu, this script has you covered!
Setting up the Tools
We'll use R and two handy libraries - httr for sending HTTP requests, and rvest for parsing the HTML response to extract data.
Run the first two lines to import these:
library(httr)
library(rvest)
Make sure you have them installed before running anything else.
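If not, both packages live on CRAN, and a single install.packages() call pulls them in:
# Install both packages from CRAN (only needed once per machine)
install.packages(c("httr", "rvest"))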
Constructing the Target URL
Now, we define our Yelp search URL to target. Let's go after Chinese restaurants in San Francisco.
url <- "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>"
Easy enough! find_desc filters by category, and find_loc specifies the location.
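To target a different cuisine or city, you only need to swap those two query parameters. Here's a tiny sketch with a made-up helper name; special characters like spaces and commas get encoded in the next step:
# Hypothetical helper: swap in any cuisine/city combination
yelp_search_url <- function(find_desc, find_loc) {
  paste0("https://www.yelp.com/search?find_desc=", find_desc, "&find_loc=", find_loc)
}
yelp_search_url("thai", "Seattle, WA")
# "https://www.yelp.com/search?find_desc=thai&find_loc=Seattle, WA"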
Encoding the URL
Here's an insider trick - we URL-encode this to handle any special characters:
encoded_url <- URLencode(url, reserved = TRUE)
This formats the URL properly for sending off to the proxy service next.
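To see what that actually does, here's URLencode() on a smaller example. With reserved = TRUE, characters like :, /, ?, = and & are percent-encoded so they survive being embedded inside another URL's query string. One quirk worth knowing: URLencode() appears to return a string unchanged if it already contains %-escapes (like the %2C in our URL) unless you also pass repeated = TRUE, so it's worth eyeballing encoded_url before moving on.
URLencode("https://example.com/search?q=dim sum&loc=SF", reserved = TRUE)
# "https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Ddim%20sum%26loc%3DSF"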
Setting up the Proxy API Request
ProxiesAPI offers a handy premium proxy API with fast residential IPs to evade bot checks. We'll append our encoded target URL to their endpoint:
api_url <- paste0("http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=", encoded_url)
Remember to add your own auth_key so the request goes through!
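Rather than hard-coding the key, one option is to keep it in an environment variable (for example a hypothetical PROXIESAPI_KEY set in your .Renviron) and read it at runtime:
# Read the key from the environment so it never lands in version control
auth_key <- Sys.getenv("PROXIESAPI_KEY")
api_url <- paste0("http://api.proxiesapi.com/?premium=true&auth_key=", auth_key,
                  "&url=", encoded_url)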
Configuring the HTTP Request Headers
To really mimic a true browser visit, we can modify the HTTP request headers:
headers <- c(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" = "en-US,en;q=0.5",
  "Accept-Encoding" = "gzip, deflate, br",
  "Referer" = "https://www.google.com/"
)
This fakes a Chrome browser on Windows, with English language preference, gzip compression enabled, and a Google referer. The more real we seem, the better chance of avoiding blocks!
Sending the Request and Parsing the HTML
Now we can fire off the GET request through ProxiesAPI and parse the HTML response:
response <- httr::GET(api_url, httr::add_headers(.headers = headers))
if (httr::status_code(response) == 200) {
  soup <- read_html(httr::content(response, "text"))
}
A status code of 200 means success. We pass the HTML content to rvest's read_html parser to create a soup object for scraping.
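As a quick sanity check that you got real HTML back (and not a block page), you can peek at the document's <title> before going further:
# Should print something like the Yelp search page title; a CAPTCHA or error page will look obviously different
soup %>% html_node("title") %>% html_text()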
Extracting Listing Data with Selectors
Inspecting the page
When we inspect the page, we can see that each listing's outer div carries the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x.
Now's when the real magic happens! We use CSS selectors to pinpoint elements and extract listing details one-by-one:
We first grab all the individual listings on the search page using this long selector:
listings <- soup %>% html_nodes(".arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
Breaking it down: each of these class names matches plenty of elements on its own, but chaining them with dots (and no spaces) means an element must carry all three classes at once to match. Together, these classes uniquely identify the listing containers we want.
We store all of those listing divs in the listings variable.
Next, for each individual listing, we extract data elements by their own classes:
for (listing in listings) {
  business_name_elem <- listing %>% html_node(".css-19v1rkv")
  business_name <- html_text(business_name_elem)
}
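As an aside, rvest also lets you skip the explicit loop: calling html_node() on the whole node set returns one match per listing (with a missing node where the selector finds nothing), and html_text(trim = TRUE) then gives a clean character vector:
# Vectorised version: one business name per listing, NA where the selector finds nothing
business_names <- listings %>%
  html_node(".css-19v1rkv") %>%
  html_text(trim = TRUE)
head(business_names)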
Some listings have extra elements like ratings and review counts. We handle those cases by first checking if the element exists before extracting:
We locate all the <span> tags within each listing first:
span_elements <- listing %>% html_nodes(".css-chan6m")
The .css-chan6m class matches the small <span> elements that hold the review count and the neighborhood.
Now we need to figure out which span holds which piece of data. Some listings have two spans, others only one.
So we check:
if (length(span_elements) >= 2) {
  # First span is reviews
  # Second is location
} else if (length(span_elements) == 1) {
  # Could be either reviews or location
}
If there are two spans, we extract and save them accordingly:
num_reviews <- html_text(span_elements[1])
location <- html_text(span_elements[2])
If there's only one span, we have to determine programmatically whether it holds the review count or the location.
We grab the text, then use a regular expression to test if it's a number (indicating reviews):
text <- html_text(span_elements[1])
if (grepl("^\\d+$", text)) {
  num_reviews <- text
} else {
  location <- text
}
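A couple of quick examples of how that check behaves. Note that the ^\\d+$ pattern only matches text that is digits from start to finish, so if Yelp ever formats the count as something like "238 reviews" you would need to loosen it (for example, test with grepl("\\d", text)):
grepl("^\\d+$", "238")        # TRUE  -> treated as a review count
grepl("^\\d+$", "Chinatown")  # FALSE -> treated as a location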
This shows how important it is to handle edge cases and missing data properly when scraping!
The key is finding unique CSS classes or ids on the page that point to the data you want. This takes some trial-and-error - use browser Dev Tools to inspect elements and tweak selectors until you pinpoint the right tags.
Handling Paginated Results
One challenge is Yelp displays listings across multiple pages. To scrape them all, we'd need to:
- Check for a "Next" link and click until last page
- Concatenate all page results into one final data set
- Remove any duplicates before analyzing
Pagination can get tricky. We won't wire it into the main script here, but the sketch below shows one way to structure the loop.
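This rough sketch assumes Yelp paginates with a start offset parameter that steps by 10 (verify that against the href of the actual "Next" link in your browser before relying on it), and it reuses the url, headers, and auth setup from above:
all_listings <- list()
for (start in seq(0, 40, by = 10)) {            # first 5 pages, assuming 10 results per page
  page_url     <- paste0(url, "&start=", start)
  encoded_page <- URLencode(page_url, reserved = TRUE)
  page_api     <- paste0("http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=", encoded_page)
  resp <- httr::GET(page_api, httr::add_headers(.headers = headers))
  if (httr::status_code(resp) != 200) break     # stop on the first failed page
  page_soup <- read_html(httr::content(resp, "text"))
  all_listings <- c(all_listings, list(
    page_soup %>% html_nodes(".arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
  ))
  Sys.sleep(1)                                  # be polite between requests
}
From there you would flatten all_listings, extract fields page by page as before, and drop any duplicates at the end.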
Full Code
Below is the complete R script to scrape Yelp listings with proxies enabled.
You should be able to simply plug in your ProxiesAPI key and run it successfully! Try tweaking the search query or selectors to adapt it.
library(httr)
library(rvest)
# URL of the Yelp search page
url <- "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
# URL-encode the URL
encoded_url <- URLencode(url, reserved = TRUE)
# API URL with the encoded URL
api_url <- paste0("http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=", encoded_url)
# Define a user-agent header to simulate a browser request
headers <- c(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" = "en-US,en;q=0.5",
  "Accept-Encoding" = "gzip, deflate, br",
  "Referer" = "https://www.google.com/" # Simulate a referrer
)
# Send an HTTP GET request to the URL with the headers
response <- httr::GET(api_url, httr::add_headers(.headers = headers))
# Check if the request was successful (status code 200)
if (httr::status_code(response) == 200) {
  # Parse the HTML content of the page using rvest
  soup <- read_html(httr::content(response, "text"))

  # Find all the listings
  listings <- soup %>% html_nodes(".arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
  print(length(listings))

  # Loop through each listing and extract information
  for (listing in listings) {
    # Check if the business name exists
    # html_node() returns a missing node (not NULL) when nothing matches,
    # and html_text() on it returns NA, so we test for NA rather than NULL
    business_name_elem <- listing %>% html_node(".css-19v1rkv")
    business_name <- html_text(business_name_elem)
    business_name <- ifelse(is.na(business_name), "N/A", business_name)

    # If the business name is not "N/A", then print the information
    if (business_name != "N/A") {
      # Check if the rating exists
      rating_elem <- listing %>% html_node(".css-gutk1c")
      rating <- html_text(rating_elem)
      rating <- ifelse(is.na(rating), "N/A", rating)

      # Check if the price range exists
      price_range_elem <- listing %>% html_node(".priceRange__09f24__mmOuH")
      price_range <- html_text(price_range_elem)
      price_range <- ifelse(is.na(price_range), "N/A", price_range)

      # Find all <span> elements inside the listing
      span_elements <- listing %>% html_nodes(".css-chan6m")

      # Initialize num_reviews and location as "N/A"
      num_reviews <- "N/A"
      location <- "N/A"

      # Check if there are at least two <span> elements
      if (length(span_elements) >= 2) {
        # The first <span> element holds the number of reviews
        num_reviews <- trimws(html_text(span_elements[1]))
        # The second <span> element holds the location
        location <- trimws(html_text(span_elements[2]))
      } else if (length(span_elements) == 1) {
        # If there's only one <span> element, check whether it holds the review count or the location
        text <- trimws(html_text(span_elements[1]))
        if (grepl("^\\d+$", text)) {
          num_reviews <- text
        } else {
          location <- text
        }
      }

      # Print the extracted information
      cat("Business Name:", business_name, "\n")
      cat("Rating:", rating, "\n")
      cat("Number of Reviews:", num_reviews, "\n")
      cat("Price Range:", price_range, "\n")
      cat("Location:", location, "\n")
      cat("==================================\n")
    }
  }
} else {
  cat(paste("Failed to retrieve data. Status Code:", httr::status_code(response), "\n"))
}
There you have it - a battle-tested Yelp scraper with a proxy workaround, ready for your analytics needs! As a next step, look into storing the extracted data in a database or CSV file.
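Here's a minimal sketch of that last idea: instead of cat()-ing each listing, collect the fields into one-row data frames inside the loop, bind them together afterwards, and write a CSV (the column and file names here are just illustrative):
rows <- list()
# ...inside the for (listing in listings) loop, after extracting the fields:
rows[[length(rows) + 1]] <- data.frame(
  business_name = business_name,
  rating        = rating,
  num_reviews   = num_reviews,
  price_range   = price_range,
  location      = location,
  stringsAsFactors = FALSE
)

# ...after the loop, combine everything and export:
results <- do.call(rbind, rows)
write.csv(results, "yelp_chinese_sf.csv", row.names = FALSE)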