Web scraping refers to the automated extraction of data from websites. In this guide, we'll walk through an example of scraping business listing data from Yelp to perform further analysis.
This is the page we are working with: a Yelp search for Chinese restaurants in San Francisco, where each result card shows the business name, star rating, review count, price range, and neighborhood.
Use Case
Why would someone want to scrape Yelp? Here are some examples of what you can do with the scraped data:
- Market research: see how many businesses of a given type operate in a city and how they are rated
- Competitor analysis: track ratings, review counts, and price ranges across businesses in a niche
- Lead generation: build lists of local businesses along with their locations
- Data analysis practice: collect a real-world dataset to analyze further
The code we will go through scrapes key details like business name, rating, price range, number of reviews, and location for each listing on a Yelp search results page. Let's dive in!
The Code
We will break down this Go program section-by-section to understand how it works under the hood:
// Import the packages we need
import (
"fmt"
"io"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"github.com/PuerkitoBio/goquery"
)
First we import all the necessary packages: fmt for output, io and os for reading the response body and writing it to disk (io.ReadAll and os.WriteFile superseded the deprecated io/ioutil helpers as of Go 1.16), net/http and net/url for building and sending the request, strconv for parsing numbers, strings for string helpers, and goquery for querying the HTML with CSS selectors.
Constructing the URLs
// Yelp URL to scrape
yelpURL := "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>"
// URL-encode the string
encodedURL := url.QueryEscape(yelpURL)
// ProxiesAPI URL
apiURL := "<http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=>" + encodedURL
We begin by defining the Yelp URL that we want to scrape data from with a search query.
The Yelp URL is then percent-encoded with url.QueryEscape so it can travel safely as a query parameter of the proxy API URL. The proxy is needed to bypass Yelp's bot detection and scraping restrictions.
NOTE: You would need to sign up for a proxy service like ProxiesAPI to obtain an auth key. Proxy rotation is necessary for stable scraping of sites like Yelp.
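To see what the encoding actually does, here is a tiny standalone illustration (a shortened version of our URL, just for demonstration):
// url.QueryEscape percent-encodes reserved characters so the whole
// Yelp URL can ride inside the proxy URL's query string
fmt.Println(url.QueryEscape("https://www.yelp.com/search?find_desc=chinese"))
// Output: https%3A%2F%2Fwww.yelp.com%2Fsearch%3Ffind_desc%3Dchinese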
Setting Request Headers
// Browser-like request headers
headers := map[string]string{
"User-Agent": "Mozilla/5.0...",
"Accept-Language": "en-US,en;q=0.5",
// Accept-Encoding is deliberately left unset: if you set it yourself,
// Go's transport stops decompressing gzip responses automatically
"Referer": "https://www.google.com/",
}
// Create HTTP client
client := &http.Client{}
// Build GET request
req, err := http.NewRequest("GET", apiURL, nil)
if err != nil {
panic(err)
}
// Add headers
for key, value := range headers {
req.Header.Set(key, value)
}
We set headers like User-Agent, Accept-Language, and Referer so the request looks like it comes from a real browser instead of a script.
The HTTP client and GET request are created to go through the ProxiesAPI service rather than hitting Yelp directly.
Making the Request
// Send request
resp, err := client.Do(req)
// Handle errors
if err != nil {
panic(err)
}
// Close response body
defer resp.Body.Close()
We send the request and make sure to close the response body when done.
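One optional hardening step, not in the original code: give the client a timeout so a slow or stalled proxy response can't hang the program indefinitely (this requires importing time):
// A client that gives up after 30 seconds (an arbitrary choice)
client := &http.Client{Timeout: 30 * time.Second}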
Processing the Response
Inspecting the page
When we inspect the page, we can see that each listing is wrapped in a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x. These are obfuscated, machine-generated class names, so expect them to change whenever Yelp ships a new build.
// Read response body
body, err := io.ReadAll(resp.Body)
// Write HTML to a file so we can inspect it offline
err = os.WriteFile("yelp_html.html", body, 0644)
// 200 OK status?
if resp.StatusCode == 200 {
// Parse HTML
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
// Find listings by the full class chain we spotted above
listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
The HTML response content is read and saved to a file, which is handy for inspecting the raw markup offline while debugging selectors.
We check that the status code returned is 200 OK before proceeding to extract data.
This is where the key action happens: selecting elements from the HTML document using goquery!
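goquery can pull attributes as well as text. As a small sketch of my own (using the same a.css-19v1rkv selector as the loop below), this is how you could grab each listing's link inside the Each callback, where item refers to the current listing:
// Attr returns the attribute value plus a bool reporting
// whether the attribute actually exists on the element
if href, ok := item.Find("a.css-19v1rkv").Attr("href"); ok {
fmt.Println("Link:", href)
}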
// Loop through listings
listings.Each(func(index int, item *goquery.Selection) {
// Extract name
nameSel := item.Find("a.css-19v1rkv")
businessName := nameSel.Text()
// Extract rating
ratingSel := item.Find("span.css-gutk1c")
rating := ratingSel.Text()
// Extract price range
priceRangeSel := item.Find("span.priceRange__09f24__mmOuH")
priceRange := priceRangeSel.Text()
// Extracting number of reviews and location
numReviews := "N/A"
location := "N/A"
spanElements := item.Find("span.css-chan6m")
if spanElements.Length() >= 2 {
numReviews = spanElements.Eq(0).Text()
location = spanElements.Eq(1).Text()
} else if spanElements.Length() == 1 {
// Only one span: if its text parses as an integer, treat it as the
// review count; otherwise assume it's the location
text := spanElements.Eq(0).Text()
if _, err := strconv.Atoi(text); err == nil {
numReviews = text
} else {
location = text
}
}
// Print data
fmt.Println("Name:", businessName)
fmt.Println("Rating:", rating)
fmt.Println("Number of Reviews:", numReviews)
fmt.Println("Price Range:", priceRange)
fmt.Println("Location:", location)
})
For each business listing, we use CSS selectors to find and extract specific pieces of data:
- a.css-19v1rkv for the business name
- span.css-gutk1c for the star rating
- span.priceRange__09f24__mmOuH for the price range
- span.css-chan6m for both the review count and the location, told apart by whether the text parses as a number
The key things that can trip beginners up are:
- Yelp's class names are machine-generated and change often, so a selector that works today may silently match nothing tomorrow
- A selector can match zero elements, so check Length() before trusting Text() (see the helper sketch below)
- One class (css-chan6m here) can be reused for different fields, so you need disambiguation logic
With practice, you build knowledge and intuition for writing robust scrapers!
Finally, we print out the extracted data from each listing.
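One pattern that helps with the zero-match problem flagged above is a small helper function; this is a sketch of my own, not part of the original program:
// textOrNA returns the trimmed text of the first element matching
// selector within item, or "N/A" if nothing matched
func textOrNA(item *goquery.Selection, selector string) string {
sel := item.Find(selector)
if sel.Length() == 0 {
return "N/A"
}
return strings.TrimSpace(sel.Text())
}
Inside the loop, businessName := textOrNA(item, "a.css-19v1rkv") then replaces several lines of existence-check boilerplate per field.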
And that's it! By going through this code walkthrough, you should have a solid grasp of the fundamentals of scraping Yelp listings using Go.
Some challenges you may face:
- Yelp ships new markup and fresh obfuscated class names regularly, which breaks selectors
- Bot detection, CAPTCHAs, and IP blocks, which is why the requests are routed through a rotating proxy
- Pagination, since a single search page only shows the first batch of results
- Rate limits, so keep request volume low and respectful
But the core concepts remain the same. Feel free to build on this starter scraper for your own projects!
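For example, if you wanted the results in a spreadsheet rather than on stdout, a minimal sketch using the standard encoding/csv package could look like this (the field names match the loop above; the filename is arbitrary):
// Before the loop: create the output file and CSV writer
file, err := os.Create("listings.csv")
if err != nil {
panic(err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
writer.Write([]string{"Name", "Rating", "Reviews", "Price", "Location"})
// Inside the Each callback: one row per listing
writer.Write([]string{businessName, rating, numReviews, priceRange, location})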
Here is the full code:
package main
import (
"fmt"
"io"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
// Yelp URL
yelpURL := "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
// URL-encode the Yelp URL
encodedURL := url.QueryEscape(yelpURL)
// API URL
apiURL := "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encodedURL
// Browser-like request headers
headers := map[string]string{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Accept-Language": "en-US,en;q=0.5",
// Accept-Encoding is left unset on purpose: setting it manually stops
// Go's transport from transparently decompressing gzip responses
"Referer": "https://www.google.com/",
}
// Create HTTP client and request
client := &http.Client{}
req, err := http.NewRequest("GET", apiURL, nil)
if err != nil {
panic(err)
}
// Add headers to the request
for key, value := range headers {
req.Header.Set(key, value)
}
// Perform the HTTP GET request
resp, err := client.Do(req)
if err != nil {
panic(err)
}
defer resp.Body.Close()
// Read the response body
body, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
// Write response to file
err = os.WriteFile("yelp_html.html", body, 0644)
if err != nil {
panic(err)
}
// Check if the request was successful
if resp.StatusCode == 200 {
// Load the HTML document
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
if err != nil {
panic(err)
}
// Find all listings
listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
fmt.Println("Listings found:", listings.Length())
// Loop through each listing
listings.Each(func(index int, item *goquery.Selection) {
// Extracting business name
businessName := "N/A"
if nameSel := item.Find("a.css-19v1rkv"); nameSel.Length() > 0 {
businessName = nameSel.Text()
}
// Extracting rating
rating := "N/A"
if ratingSel := item.Find("span.css-gutk1c"); ratingSel.Length() > 0 {
rating = ratingSel.Text()
}
// Extracting price range
priceRange := "N/A"
if priceRangeSel := item.Find("span.priceRange__09f24__mmOuH"); priceRangeSel.Length() > 0 {
priceRange = priceRangeSel.Text()
}
// Extracting number of reviews and location
numReviews := "N/A"
location := "N/A"
spanElements := item.Find("span.css-chan6m")
if spanElements.Length() >= 2 {
numReviews = spanElements.Eq(0).Text()
location = spanElements.Eq(1).Text()
} else if spanElements.Length() == 1 {
text := spanElements.Eq(0).Text()
if _, err := strconv.Atoi(text); err == nil {
numReviews = text
} else {
location = text
}
}
// Print the extracted information
fmt.Println("Business Name:", businessName)
fmt.Println("Rating:", rating)
fmt.Println("Number of Reviews:", numReviews)
fmt.Println("Price Range:", priceRange)
fmt.Println("Location:", location)
fmt.Println(strings.Repeat("=", 30))
})
} else {
fmt.Printf("Failed to retrieve data. Status Code: %d\n", resp.StatusCode)
}
}
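To try it out, drop the code into main.go, pull in goquery with Go modules, and run it (the module name here is just a placeholder; remember to replace YOUR_AUTH_KEY with your actual key):
go mod init yelp-scraper
go get github.com/PuerkitoBio/goquery
go run main.go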