Scraping Yelp Business Listings with Scala

Introduction

This step-by-step guide shows beginners how to scrape Yelp business listings with Scala. We fetch a search results page through ProxiesAPI, parse it with Jsoup, and extract each business's name, rating, number of reviews, price range, and location. The same techniques apply to many other websites as well.

The page we will scrape is the Yelp search results page for Chinese restaurants in San Francisco, CA.

Prerequisites

Before we embark on our data extraction journey, ensure you have the following prerequisites in place:

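Scala: Make sure Scala is installed on your system. If not, download it from the official Scala website.
Libraries: The code depends on scalaj-http for HTTP requests and Jsoup for HTML parsing, so add both to your build.
ProxiesAPI: This code sends its requests through ProxiesAPI to get past Yelp's anti-bot measures. You'll need a premium ProxiesAPI account; sign up and obtain your API key.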
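If you build with sbt, the two libraries imported by the complete code could be declared roughly as follows. This is only a sketch: the Scala version and library versions shown are assumptions, so check for the current releases before using them.

```scala
// build.sbt
// All version numbers below are assumptions; substitute current releases.
scalaVersion := "2.13.12"

libraryDependencies ++= Seq(
  "org.scalaj" %% "scalaj-http" % "2.4.2", // HTTP client used for the request
  "org.jsoup"  %  "jsoup"       % "1.15.3" // HTML parser used for extraction
)
```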

Plan of Action

Here's our comprehensive plan to extract data from Yelp business listings:

URL Encoding: We'll start by encoding the Yelp search URL to handle special characters properly.
Simulate Browser Request: To avoid detection as a bot by Yelp, we'll send an HTTP GET request with simulated browser headers.
Save HTML Content: We'll save the HTML content to a file for further analysis.
HTML Parsing: We'll use Jsoup, a powerful HTML parsing library, to parse the HTML content.
Data Extraction: The heart of the process – we'll extract relevant information from the HTML.

Now, let's delve into each step with detailed explanations.

Step 1: URL Encoding

The Yelp search URL is passed to ProxiesAPI as the value of a query parameter, so it must be percent-encoded first. Otherwise characters such as ?, & and = would be read as part of the API URL itself.

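val url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")
// ProxiesAPI request URL wrapping the encoded Yelp URL; replace YOUR_AUTH_KEY with your key
val apiUrl = s"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=$encodedUrl"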
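To see what the encoding step actually does, here is a small self-contained sketch. The helper name buildApiUrl is hypothetical and introduced only for illustration; the endpoint format is the one used in the complete code at the end of this article.

```scala
import java.net.{URLDecoder, URLEncoder}

object ApiUrlSketch {
  // Hypothetical helper: wraps a target URL in a ProxiesAPI request URL.
  def buildApiUrl(authKey: String, targetUrl: String): String = {
    val encoded = URLEncoder.encode(targetUrl, "UTF-8")
    s"http://api.proxiesapi.com/?premium=true&auth_key=$authKey&url=$encoded"
  }

  def main(args: Array[String]): Unit = {
    val yelpUrl = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    val encoded = URLEncoder.encode(yelpUrl, "UTF-8")

    // Reserved characters are replaced: ':' -> %3A, '/' -> %2F, '?' -> %3F,
    // '=' -> %3D, '&' -> %26, '+' -> %2B, and '%' itself -> %25
    println(encoded)

    // Decoding restores the original URL exactly
    println(URLDecoder.decode(encoded, "UTF-8") == yelpUrl) // prints "true"

    println(buildApiUrl("YOUR_AUTH_KEY", yelpUrl))
  }
}
```

Note that the already-encoded comma (%2C) becomes %252C. That is expected: the proxy decodes the parameter once and forwards the original URL to Yelp.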

Step 2: Simulating a Browser Request

Yelp employs anti-bot mechanisms, so we'll simulate a browser request by sending an HTTP GET request with headers. This is essential to avoid detection.

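import scalaj.http.{Http, HttpOptions}

val headers = Map(
  "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Accept-Language" -> "en-US,en;q=0.5",
  // "br" is left out because scalaj-http only decompresses gzip and deflate responses
  "Accept-Encoding" -> "gzip, deflate",
  "Referer" -> "https://www.google.com/"
)

// apiUrl is the ProxiesAPI request URL that wraps encodedUrl (see the complete code at the end)
val response = Http(apiUrl).headers(headers).option(HttpOptions.allowUnsafeSSL).asString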
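Requests routed through a proxy can fail intermittently, so it may be worth retrying before giving up. The following is a minimal sketch, not part of the original scraper: Fetched and fetchWithRetry are hypothetical names, and the actual HTTP call is passed in as a function so the retry logic stays independent of the HTTP library.

```scala
import scala.annotation.tailrec

object FetchSketch {
  // Minimal stand-in for an HTTP response: status code plus body.
  final case class Fetched(code: Int, body: String)

  // Calls `fetch` up to `maxAttempts` times, pausing `delayMs` between tries.
  // Returns the first 2xx result, or the last response if every attempt fails.
  @tailrec
  def fetchWithRetry(maxAttempts: Int, delayMs: Long)(fetch: () => Fetched): Fetched = {
    val result = fetch()
    val ok = result.code >= 200 && result.code < 300
    if (ok || maxAttempts <= 1) result
    else {
      Thread.sleep(delayMs)
      fetchWithRetry(maxAttempts - 1, delayMs)(fetch)
    }
  }
}

// Possible usage with scalaj-http (assumes apiUrl and headers from the steps above):
//   val page = FetchSketch.fetchWithRetry(3, 2000L) { () =>
//     val r = Http(apiUrl).headers(headers).asString
//     FetchSketch.Fetched(r.code, r.body)
//   }
```

If requests time out rather than fail outright, scalaj-http's timeout(connTimeoutMs, readTimeoutMs) method on the request can raise its fairly short default timeouts.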

Step 3: Saving HTML Content

Before we proceed, we save the HTML content to a file. This step ensures we have a local copy for analysis and future reference.

import java.io.{File, PrintWriter}

val htmlContent = response.body
val writer = new PrintWriter(new File("yelp_html.html"), "UTF-8")
// Close the writer even if the write fails
try writer.write(htmlContent) finally writer.close()
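A saved copy also lets you rerun the parsing steps without sending another request. Here is a minimal sketch of that idea using java.nio; the object and method names (HtmlCacheSketch, save, load) are hypothetical and not part of the original code.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object HtmlCacheSketch {
  // Writes the HTML to `path` as UTF-8, replacing any existing file.
  def save(path: Path, html: String): Unit = {
    Files.write(path, html.getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Returns the saved HTML, or None if no file exists at `path` yet.
  def load(path: Path): Option[String] =
    if (Files.exists(path))
      Some(new String(Files.readAllBytes(path), StandardCharsets.UTF_8))
    else None
}

// Possible usage: parse the local copy when it exists, otherwise fetch first.
//   val path = java.nio.file.Paths.get("yelp_html.html")
//   val html = HtmlCacheSketch.load(path).getOrElse {
//     val body = response.body // from Step 2
//     HtmlCacheSketch.save(path, body)
//     body
//   }
```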

Step 4: HTML Parsing

Now it's time to bring in Jsoup, a popular HTML parsing library. Jsoup.parse turns the HTML string into a Document tree that we can query with CSS selectors.

val document: Document = Jsoup.parse(htmlContent)
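To get a feel for what the parsed Document offers before pointing it at Yelp, here is a small sketch on made-up markup. The HTML and class names below are invented for illustration and are far simpler than a real Yelp page.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

object ParseSketch {
  // Invented sample markup, not taken from Yelp.
  val sampleHtml: String =
    """<html><head><title>Sample results</title></head>
      |<body>
      |  <div class="card featured"><a class="name" href="/biz/one">Golden Wok</a></div>
      |  <div class="card"><a class="name" href="/biz/two">Jade Garden</a></div>
      |</body></html>""".stripMargin

  def main(args: Array[String]): Unit = {
    val document: Document = Jsoup.parse(sampleHtml)

    println(document.title())                                   // Sample results
    println(document.select("div.card").size())                 // 2
    // Chaining classes (div.card.featured) requires ALL of them on the element
    println(document.select("div.card.featured a.name").text()) // Golden Wok
    // selectFirst returns the first match, or null when nothing matches
    println(document.selectFirst("a.name").attr("href"))        // /biz/one
  }
}
```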

Step 5: Data Extraction

With the document parsed, we can extract the data. The process breaks down as follows:

Inspecting the page

Inspecting the page in the browser's developer tools shows that each listing sits in a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG and css-1qn0b6x. We select on all three:

val listings: Elements = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")

The listings variable now holds a collection of HTML elements representing individual business listings on Yelp's page.

For each listing, we extract details like business name, rating, number of reviews, price range, and location. Here's how:

Business Name: We find the business name element and extract its text.

val businessNameElem: Element = listing.selectFirst("a.css-19v1rkv")
val businessName: String = if (businessNameElem != null) businessNameElem.text() else "N/A"

Rating: Similarly, we locate and extract the rating.

val ratingElem: Element = listing.selectFirst("span.css-gutk1c")
val rating: String = if (ratingElem != null) ratingElem.text() else "N/A"

Price Range: Extracting the price range, if available.

val priceRangeElem: Element = listing.selectFirst("span.priceRange__09f24__mmOuH")
val priceRange: String = if (priceRangeElem != null) priceRangeElem.text() else "N/A"
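The three lookups above repeat the same null check. One way to factor it out is to wrap the result in Option, as in the sketch below. The helper name textOrNA is hypothetical, and the sample markup is invented; only the class names come from the selectors above.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

object FieldSketch {
  // Returns the trimmed text of the first element matching `cssQuery`
  // inside `parent`, or "N/A" when nothing matches or the text is empty.
  def textOrNA(parent: Element, cssQuery: String): String =
    Option(parent.selectFirst(cssQuery)) // null becomes None
      .map(_.text().trim)
      .filter(_.nonEmpty)
      .getOrElse("N/A")

  def main(args: Array[String]): Unit = {
    // Invented listing markup with a name and rating but no price range
    val listing: Element = Jsoup.parse(
      """<div><a class="css-19v1rkv">Golden Wok</a><span class="css-gutk1c">4.5</span></div>"""
    ).body()

    println(textOrNA(listing, "a.css-19v1rkv"))                 // Golden Wok
    println(textOrNA(listing, "span.css-gutk1c"))               // 4.5
    println(textOrNA(listing, "span.priceRange__09f24__mmOuH")) // N/A
  }
}
```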

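Number of Reviews and Location: Both values use the same span class. When a listing has at least two such spans, the first is the number of reviews and the second is the location. When it has only one, we treat it as the number of reviews if it is all digits and as the location otherwise.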

val spanElements: Elements = listing.select("span.css-chan6m")

var numReviews: String = "N/A"
var location: String = "N/A"

// Check for at least two <span> elements
if (spanElements.size() >= 2) {
    numReviews = spanElements.get(0).text().trim
    location = spanElements.get(1).text().trim
} else if (spanElements.size() == 1) {
    // Handle cases with only one <span> element
    val text: String = spanElements.get(0).text().trim
    if (text.nonEmpty && text.forall(_.isDigit)) {
        numReviews = text
    } else {
        location = text
    }
}
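The same decision logic can be written as a pure function over the span texts, which makes it easy to test without any HTML. This is a sketch under the same assumptions as the code above (first span is the review count, second is the location); the name reviewsAndLocation is hypothetical.

```scala
object SpanSketch {
  // Takes the texts of the matching <span> elements, in document order,
  // and returns (numReviews, location), using "N/A" for missing values.
  def reviewsAndLocation(texts: Seq[String]): (String, String) =
    texts.map(_.trim) match {
      // Two or more spans: first is the review count, second the location
      case Seq(first, second, _*) => (first, second)
      // A single empty span carries no information
      case Seq(only) if only.isEmpty => ("N/A", "N/A")
      // A single all-digit span is taken to be the review count
      case Seq(only) if only.forall(_.isDigit) => (only, "N/A")
      // Any other single span is taken to be the location
      case Seq(only) => ("N/A", only)
      case _ => ("N/A", "N/A")
    }
}

// Possible usage with Jsoup (Scala 2.13), assuming spanElements from above:
//   import scala.jdk.CollectionConverters._
//   val (numReviews, location) =
//     SpanSketch.reviewsAndLocation(spanElements.eachText().asScala.toSeq)
```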

Practical Considerations and Challenges

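Premium Proxies: A premium ProxiesAPI account is needed to get past Yelp's anti-bot mechanisms reliably. Free proxies might not work.
Selector Clarification: Selectors can be puzzling for newcomers. They are patterns that locate specific HTML elements, and the ones in this code are copied as-is from the page's markup. Class names such as css-19v1rkv look auto-generated, so they may change when Yelp updates its site; if the scraper finds no listings, re-inspect the page and update them.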
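If exact class names prove brittle, one option is to match only the readable part of a class name with Jsoup's attribute-substring selector, [class*=...]. The sketch below uses invented markup in which the hash suffix of the price class differs from the one in this article; whether Yelp's real markup keeps these readable prefixes stable is an assumption.

```scala
import org.jsoup.Jsoup

object SelectorSketch {
  // Invented markup: the priceRange class has a different hash suffix
  // than the selector used in the article.
  val sampleHtml: String =
    """<div class="arrange-unit__09f24__rqHTg arrange-unit-fill__09f24__CUubG css-1qn0b6x">
      |  <span class="priceRange__zzzzz__AbCdE">$$</span>
      |</div>""".stripMargin

  def main(args: Array[String]): Unit = {
    val document = Jsoup.parse(sampleHtml)

    // The exact class selector no longer matches once the suffix changes
    println(document.select("span.priceRange__09f24__mmOuH").size()) // 0

    // Matching on the readable part of the class name still works
    println(document.select("span[class*=priceRange]").text())       // $$
    println(document.select("div[class*=arrange-unit-fill]").size()) // 1
  }
}
```

The trade-off is precision: a substring match can also pick up unrelated elements whose class happens to contain the same text, so check the match count against what you see on the page.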

Main Takeaways

Encoding URLs and simulating browser requests are fundamental for web scraping.

Jsoup simplifies HTML parsing in Scala, making data extraction a breeze.

Selectors are like search patterns that help locate and extract data from HTML elements.

Premium proxies are your best bet when scraping Yelp or similar websites.

Next Steps

Having successfully extracted data from Yelp listings, you can now broaden your web scraping knowledge. Explore advanced techniques, tackle pagination, or automate data storage for more complex projects.
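As a starting point for pagination, the sketch below generates one search URL per results page. It rests on two assumptions that are not confirmed anywhere in this article and should be checked against the live site: that Yelp pages its search results with a start query parameter, and that each page holds 10 listings. The name pageUrls is hypothetical.

```scala
object PaginationSketch {
  // Builds one URL per results page.
  // ASSUMPTION: the site accepts an offset in a `start` query parameter
  // and shows `pageSize` listings per page. `baseUrl` must already
  // contain a query string, since the parameter is appended with '&'.
  def pageUrls(baseUrl: String, pages: Int, pageSize: Int = 10): Seq[String] =
    (0 until pages).map { i =>
      if (i == 0) baseUrl else s"$baseUrl&start=${i * pageSize}"
    }

  def main(args: Array[String]): Unit = {
    val base = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    // Each URL would then be encoded and sent through ProxiesAPI as in Steps 1 and 2
    pageUrls(base, pages = 3).foreach(println)
  }
}
```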

Always respect websites' terms of service and adhere to ethical and legal guidelines when scraping.

Here's the complete code for your reference:

import java.io.{File, PrintWriter}
import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}
import org.jsoup.select.Elements
import scalaj.http.{Http, HttpOptions}

object YelpScraper {
  def main(args: Array[String]): Unit = {
    // URL of the Yelp search page
    val url = "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"

    // URL-encode the URL
    val encodedUrl = java.net.URLEncoder.encode(url, "UTF-8")

    // API URL with the encoded Yelp URL
    val apiUrl = s"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=$encodedUrl"

    // Define headers to simulate a browser request
    val headers = Map(
      "User-Agent" -> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
      "Accept-Language" -> "en-US,en;q=0.5",
      // "br" is left out because scalaj-http only decompresses gzip and deflate responses
      "Accept-Encoding" -> "gzip, deflate",
      "Referer" -> "https://www.google.com/"
    )

    // Send an HTTP GET request to the URL with the headers
    val response = Http(apiUrl).headers(headers).option(HttpOptions.allowUnsafeSSL).asString

    // Check if the request was successful (status code 200)
    if (response.isSuccess) {
      // Save the HTML content to a file
      val htmlContent = response.body
      val outputFile = new File("yelp_html.html")
      val writer = new PrintWriter(outputFile)
      writer.write(htmlContent)
      writer.close()

      // Parse the HTML content using Jsoup
      val document: Document = Jsoup.parse(htmlContent)

      // Find all the listings
      val listings: Elements = document.select("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")

      println(s"Number of Listings: ${listings.size()}")

      // Loop through each listing and extract information
      listings.forEach { listing: Element =>
        // Extract each field from the current listing

        // Check if business name exists
        val businessNameElem: Element = listing.selectFirst("a.css-19v1rkv")
        val businessName: String = if (businessNameElem != null) businessNameElem.text() else "N/A"

        // If business name is not "N/A," then print the information
        if (businessName != "N/A") {
          // Check if rating exists
          val ratingElem: Element = listing.selectFirst("span.css-gutk1c")
          val rating: String = if (ratingElem != null) ratingElem.text() else "N/A"

          // Check if price range exists
          val priceRangeElem: Element = listing.selectFirst("span.priceRange__09f24__mmOuH")
          val priceRange: String = if (priceRangeElem != null) priceRangeElem.text() else "N/A"

          // Find all <span> elements inside the listing
          val spanElements: Elements = listing.select("span.css-chan6m")

          // Initialize numReviews and location as "N/A"
          var numReviews: String = "N/A"
          var location: String = "N/A"

          // Check if there are at least two <span> elements
          if (spanElements.size() >= 2) {
            // The first <span> element is for Number of Reviews
            numReviews = spanElements.get(0).text().trim

            // The second <span> element is for Location
            location = spanElements.get(1).text().trim
          } else if (spanElements.size() == 1) {
            // If there's only one <span> element, check if it's for Number of Reviews or Location
            val text: String = spanElements.get(0).text().trim
            if (text.nonEmpty && text.forall(_.isDigit)) {
              numReviews = text
            } else {
              location = text
            }
          }

          // Print the extracted information
          println(s"Business Name: $businessName")
          println(s"Rating: $rating")
          println(s"Number of Reviews: $numReviews")
          println(s"Price Range: $priceRange")
          println(s"Location: $location")
          println("=" * 30)
        }
      }
    } else {
      println(s"Failed to retrieve data. Status Code: ${response.code}")
    }
  }
}
