Web scraping refers to the automated extraction of data from websites. In this guide, we'll walk through an example of scraping business listing data from Yelp to perform further analysis.
This is the page we are working with: a Yelp search for Chinese restaurants in San Francisco, where each result card shows the business name, star rating, review count, price range, and neighborhood.
Use Case
Why would someone want to scrape Yelp? Here are some examples of what you can do with the scraped data:
- Market research: see how many businesses of a given type operate in a city and how they are rated
- Competitor analysis: track ratings, review counts, and price ranges across businesses in a niche
- Lead generation: build lists of local businesses along with their locations
- Data analysis practice: collect a real-world dataset to analyze further
The code we will go through scrapes key details like business name, rating, price range, number of reviews, and location for each listing on a Yelp search results page. Let's dive in!
The Code
We will break down this Go program section-by-section to understand how it works under the hood:
// Import the packages we need
import (
"fmt"
"io"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"github.com/PuerkitoBio/goquery"
)
First we import all the necessary packages: fmt for output, io and os for reading the response body and writing it to disk (io.ReadAll and os.WriteFile superseded the deprecated io/ioutil helpers as of Go 1.16), net/http and net/url for building and sending the request, strconv for parsing numbers, strings for string helpers, and goquery for querying the HTML with CSS selectors.
Constructing the URLs
// Yelp URL to scrape
yelpURL := "<https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA>"
// URL-encode the string
encodedURL := url.QueryEscape(yelpURL)
// ProxiesAPI URL
apiURL := "<http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=>" + encodedURL
We begin by defining the Yelp URL that we want to scrape data from with a search query.
The Yelp URL is then percent-encoded with url.QueryEscape so it can travel safely as a query parameter of the proxy API URL. The proxy is needed to bypass Yelp's bot detection and scraping restrictions.
NOTE: You would need to sign up for a proxy service like ProxiesAPI to obtain an auth key. Proxy rotation is necessary for stable scraping of sites like Yelp.
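To see what the encoding actually does, here is a tiny standalone illustration (a shortened version of our URL, just for demonstration):
// url.QueryEscape percent-encodes reserved characters so the whole
// Yelp URL can ride inside the proxy URL's query string
fmt.Println(url.QueryEscape("https://www.yelp.com/search?find_desc=chinese"))
// Output: https%3A%2F%2Fwww.yelp.com%2Fsearch%3Ffind_desc%3Dchinese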
Setting Request Headers
// Browser-like request headers
headers := map[string]string{
"User-Agent": "Mozilla/5.0...",
"Accept-Language": "en-US,en;q=0.5",
// Accept-Encoding is deliberately left unset: if you set it yourself,
// Go's transport stops decompressing gzip responses automatically
"Referer": "https://www.google.com/",
}
// Create HTTP client
client := &http.Client{}
// Build GET request
req, err := http.NewRequest("GET", apiURL, nil)
if err != nil {
panic(err)
}
// Add headers
for key, value := range headers {
req.Header.Set(key, value)
}
We set headers like User-Agent, Accept-Language, and Referer so the request looks like it comes from a real browser instead of a script.
The HTTP client and GET request are created to go through the ProxiesAPI service rather than hitting Yelp directly.
Making the Request
// Send request
resp, err := client.Do(req)
// Handle errors
if err != nil {
panic(err)
}
// Close response body
defer resp.Body.Close()
We send the request and make sure to close the response body when done.
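One optional hardening step, not in the original code: give the client a timeout so a slow or stalled proxy response can't hang the program indefinitely (this requires importing time):
// A client that gives up after 30 seconds (an arbitrary choice)
client := &http.Client{Timeout: 30 * time.Second}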
Processing the Response
Inspecting the page
When we inspect the page, we can see that each listing is wrapped in a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x. These are obfuscated, machine-generated class names, so expect them to change whenever Yelp ships a new build.
// Read response body
body, err := io.ReadAll(resp.Body)
// Write HTML to a file so we can inspect it offline
err = os.WriteFile("yelp_html.html", body, 0644)
// 200 OK status?
if resp.StatusCode == 200 {
// Parse HTML
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
// Find listings by the full class chain we spotted above
listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
The HTML response content is read and saved to a file, which is handy for inspecting the raw markup offline while debugging selectors.
We check that the status code returned is 200 OK before proceeding to extract data.
This is where the key action happens: selecting elements from the HTML document using goquery!
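goquery can pull attributes as well as text. As a small sketch of my own (using the same a.css-19v1rkv selector as the loop below), this is how you could grab each listing's link inside the Each callback, where item refers to the current listing:
// Attr returns the attribute value plus a bool reporting
// whether the attribute actually exists on the element
if href, ok := item.Find("a.css-19v1rkv").Attr("href"); ok {
fmt.Println("Link:", href)
}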
// Loop through listings
listings.Each(func(index int, item *goquery.Selection) {
// Extract name
nameSel := item.Find("a.css-19v1rkv")
businessName := nameSel.Text()
// Extract rating
ratingSel := item.Find("span.css-gutk1c")
rating := ratingSel.Text()
// Extract price range
priceRangeSel := item.Find("span.priceRange__09f24__mmOuH")
priceRange := priceRangeSel.Text()
// Extracting number of reviews and location
numReviews := "N/A"
location := "N/A"
spanElements := item.Find("span.css-chan6m")
if spanElements.Length() >= 2 {
numReviews = spanElements.Eq(0).Text()
location = spanElements.Eq(1).Text()
} else if spanElements.Length() == 1 {
// Only one span: if its text parses as an integer, treat it as the
// review count; otherwise assume it's the location
text := spanElements.Eq(0).Text()
if _, err := strconv.Atoi(text); err == nil {
numReviews = text
} else {
location = text
}
}
// Print data
fmt.Println("Name:", businessName)
fmt.Println("Rating:", rating)
fmt.Println("Number of Reviews:", numReviews)
fmt.Println("Price Range:", priceRange)
fmt.Println("Location:", location)
})
For each business listing, we use CSS selectors to find and extract specific pieces of data:
- a.css-19v1rkv for the business name
- span.css-gutk1c for the star rating
- span.priceRange__09f24__mmOuH for the price range
- span.css-chan6m for both the review count and the location, told apart by whether the text parses as a number
The key things that can trip beginners up are:
- Yelp's class names are machine-generated and change often, so a selector that works today may silently match nothing tomorrow
- A selector can match zero elements, so check Length() before trusting Text() (see the helper sketch below)
- One class (css-chan6m here) can be reused for different fields, so you need disambiguation logic
With practice, you build knowledge and intuition for writing robust scrapers!
Finally, we print out the extracted data from each listing.
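One pattern that helps with the zero-match problem flagged above is a small helper function; this is a sketch of my own, not part of the original program:
// textOrNA returns the trimmed text of the first element matching
// selector within item, or "N/A" if nothing matched
func textOrNA(item *goquery.Selection, selector string) string {
sel := item.Find(selector)
if sel.Length() == 0 {
return "N/A"
}
return strings.TrimSpace(sel.Text())
}
Inside the loop, businessName := textOrNA(item, "a.css-19v1rkv") then replaces several lines of existence-check boilerplate per field.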
And that's it! By going through this code walkthrough, you should have a solid grasp of the fundamentals of scraping Yelp listings using Go.
Some challenges you may face:
- Yelp ships new markup and fresh obfuscated class names regularly, which breaks selectors
- Bot detection, CAPTCHAs, and IP blocks, which is why the requests are routed through a rotating proxy
- Pagination, since a single search page only shows the first batch of results
- Rate limits, so keep request volume low and respectful
But the core concepts remain the same. Feel free to build on this starter scraper for your own projects!
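For example, if you wanted the results in a spreadsheet rather than on stdout, a minimal sketch using the standard encoding/csv package could look like this (the field names match the loop above; the filename is arbitrary):
// Before the loop: create the output file and CSV writer
file, err := os.Create("listings.csv")
if err != nil {
panic(err)
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
writer.Write([]string{"Name", "Rating", "Reviews", "Price", "Location"})
// Inside the Each callback: one row per listing
writer.Write([]string{businessName, rating, numReviews, priceRange, location})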
Here is the full code:
package main
import (
"fmt"
"io"
"net/http"
"net/url"
"os"
"strconv"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
// Yelp URL
yelpURL := "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
// URL-encode the Yelp URL
encodedURL := url.QueryEscape(yelpURL)
// API URL
apiURL := "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encodedURL
// Browser-like request headers
headers := map[string]string{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"Accept-Language": "en-US,en;q=0.5",
// Accept-Encoding is left unset on purpose: setting it manually stops
// Go's transport from transparently decompressing gzip responses
"Referer": "https://www.google.com/",
}
// Create HTTP client and request
client := &http.Client{}
req, err := http.NewRequest("GET", apiURL, nil)
if err != nil {
panic(err)
}
// Add headers to the request
for key, value := range headers {
req.Header.Set(key, value)
}
// Perform the HTTP GET request
resp, err := client.Do(req)
if err != nil {
panic(err)
}
defer resp.Body.Close()
// Read the response body
body, err := io.ReadAll(resp.Body)
if err != nil {
panic(err)
}
// Write response to file
err = os.WriteFile("yelp_html.html", body, 0644)
if err != nil {
panic(err)
}
// Check if the request was successful
if resp.StatusCode == 200 {
// Load the HTML document
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
if err != nil {
panic(err)
}
// Find all listings
listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
fmt.Println("Listings found:", listings.Length())
// Loop through each listing
listings.Each(func(index int, item *goquery.Selection) {
// Extracting business name
businessName := "N/A"
if nameSel := item.Find("a.css-19v1rkv"); nameSel.Length() > 0 {
businessName = nameSel.Text()
}
// Extracting rating
rating := "N/A"
if ratingSel := item.Find("span.css-gutk1c"); ratingSel.Length() > 0 {
rating = ratingSel.Text()
}
// Extracting price range
priceRange := "N/A"
if priceRangeSel := item.Find("span.priceRange__09f24__mmOuH"); priceRangeSel.Length() > 0 {
priceRange = priceRangeSel.Text()
}
// Extracting number of reviews and location
numReviews := "N/A"
location := "N/A"
spanElements := item.Find("span.css-chan6m")
if spanElements.Length() >= 2 {
numReviews = spanElements.Eq(0).Text()
location = spanElements.Eq(1).Text()
} else if spanElements.Length() == 1 {
text := spanElements.Eq(0).Text()
if _, err := strconv.Atoi(text); err == nil {
numReviews = text
} else {
location = text
}
}
// Print the extracted information
fmt.Println("Business Name:", businessName)
fmt.Println("Rating:", rating)
fmt.Println("Number of Reviews:", numReviews)
fmt.Println("Price Range:", priceRange)
fmt.Println("Location:", location)
fmt.Println(strings.Repeat("=", 30))
})
} else {
fmt.Printf("Failed to retrieve data. Status Code: %d\n", resp.StatusCode)
}
}
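To try it out, drop the code into main.go, pull in goquery with Go modules, and run it (the module name here is just a placeholder; remember to replace YOUR_AUTH_KEY with your actual key):
go mod init yelp-scraper
go get github.com/PuerkitoBio/goquery
go run main.go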