Web scraping is the process of automatically extracting data from websites using code. It allows you to harvest and analyze content from the web on a large scale. Go is a great language for writing web scrapers thanks to its fast performance and concise syntax.
In this tutorial, we'll walk through a simple Go program to scrape news article headlines and links from the New York Times homepage. Along the way, we'll learn web scraping concepts that apply to many projects regardless of language or site.
Overview
Here's what our scraper will do:
- Send a GET request to retrieve the NYT homepage
- Parse the HTML content
- Use Go's goquery library to find all article containers
- Extract the headline and link from each
- Print the scraped data
Now let's break it down section-by-section!
Imports
We import three packages:
import (
"fmt"
"net/http"
"github.com/PuerkitoBio/goquery"
)
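The first two packages come from Go's standard library. goquery is a third-party package, so if you don't already have it, fetch it first:

go get github.com/PuerkitoBio/goquery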
Struct to Store Scraped Data
We define a struct called Article with two fields to hold each headline and its link:
type Article struct {
Title string
Link string
}
Main Function
The entry point of execution is the main function:
func main() {
}
All web scraping logic will go inside here.
Constructing the HTTP Request
To make a request, we need a URL and user agent header:
url := "https://www.nytimes.com/"
userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
We use a real browser's user agent instead of Go's default Go-http-client value, so the site is less likely to reject the request as coming from a bot.
Next, we create a client, build the GET request, and execute it:
client := &http.Client{}
req, err := http.NewRequest("GET", url, nil)
req.Header.Set("User-Agent", userAgent)
resp, err := client.Do(req)
We handle any errors and close the response body when done.
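Spelled out, the error handling, the deferred close, and a status-code check look like this (the same code appears in the full listing at the end of the post):

if err != nil {
    panic(err)
}
defer resp.Body.Close()

if resp.StatusCode != 200 {
    fmt.Println("Failed to retrieve web page. Status code:", resp.StatusCode)
    return
}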
Parsing the HTML
The website content lives in resp.Body. We hand it to goquery, which parses the HTML into a document we can query:
doc, err := goquery.NewDocumentFromReader(resp.Body)
Inspecting the page
We now use Chrome's inspect element tool to see how the page's HTML is structured…
You can see that the articles are contained inside section tags with the class story-wrapper.
Extracting Data
goquery allows jQuery-style element selection and traversal. We find all article containers, loop through them, and pull what we need:
var articles []Article
doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
    title := s.Find("h3.indicate-hover").Text()
    link, _ := s.Find("a.css-9mylee").Attr("href")
    article := Article{Title: title, Link: link}
    articles = append(articles, article)
})
Here we use the class selectors of key elements to target titles and links.
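Note that Attr returns a second boolean that reports whether the attribute actually exists. A slightly more defensive version of the loop body (a sketch, not required for the tutorial) could skip containers that are missing a link or a headline:

title := s.Find("h3.indicate-hover").Text()
link, ok := s.Find("a.css-9mylee").Attr("href")
if !ok || title == "" {
    // Skip containers that do not have both a headline and a link.
    return
}
articles = append(articles, Article{Title: title, Link: link})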
Printing Output
Finally, we can print or store the scraped data:
for _, a := range articles {
fmt.Println(a.Title)
fmt.Println(a.Link)
}
And we have a working scraper!
Challenges You May Encounter
There are a few common issues to look out for with web scrapers: the site's HTML structure can change and break your selectors, servers may detect and block traffic that looks automated, and sending too many requests too quickly can get you rate limited or banned.
Practical solutions include setting a realistic User-Agent (as we did above), adding delays between requests, retrying failed requests, and keeping your selectors in one place so they are easy to update when the page changes. A sketch of the delay-and-retry idea follows below.
With diligence, these can be overcome.
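For example, a hypothetical helper that pauses between attempts and retries failed requests might look like this (it assumes the client, url, and userAgent variables from above, plus the fmt, net/http, and time imports):

// fetchWithRetry is an illustrative helper: it retries a GET request a few
// times, pausing between attempts so we do not hammer the server.
func fetchWithRetry(client *http.Client, url, userAgent string, attempts int) (*http.Response, error) {
    var lastErr error
    for i := 0; i < attempts; i++ {
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("User-Agent", userAgent)
        resp, err := client.Do(req)
        if err == nil && resp.StatusCode == 200 {
            return resp, nil
        }
        if err == nil {
            resp.Body.Close()
            lastErr = fmt.Errorf("status code %d", resp.StatusCode)
        } else {
            lastErr = err
        }
        time.Sleep(2 * time.Second) // be polite between attempts
    }
    return nil, lastErr
}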
Next Steps
Some ideas for building on this: save the scraped articles to a JSON or CSV file (sketched below), extract additional fields such as summaries or timestamps, scrape other sections of the site, or run the scraper on a schedule to track headlines over time.
Web scraping is a learn-by-doing skill. Experiment with different sites and data points to grow!
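As a small illustration of the first idea, here is a sketch of writing the articles slice from above to a JSON file; it assumes the encoding/json and os packages are added to the imports:

// Write the scraped articles to articles.json with indented output.
file, err := os.Create("articles.json")
if err != nil {
    panic(err)
}
defer file.Close()

enc := json.NewEncoder(file)
enc.SetIndent("", "  ")
if err := enc.Encode(articles); err != nil {
    panic(err)
}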
Full Code
For reference, here is the complete code from the beginning:
// full code from above
package main
import (
"fmt"
"net/http"
"github.com/PuerkitoBio/goquery"
)
// Article struct to store title and link
type Article struct {
Title string
Link string
}
func main() {
// URL and user agent
url := "https://www.nytimes.com/"
userAgent := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
// Create HTTP client
client := &http.Client{}
// Build GET request
req, err := http.NewRequest("GET", url, nil)
if err != nil {
panic(err)
}
// Set user agent
req.Header.Set("User-Agent", userAgent)
// Send request
resp, err := client.Do(req)
if err != nil {
panic(err)
}
defer resp.Body.Close()
// Check status code
if resp.StatusCode != 200 {
fmt.Println("Failed to retrieve web page. Status code:", resp.StatusCode)
return
}
// Parse response HTML
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
panic(err)
}
// Find articles
var articles []Article
doc.Find("section.story-wrapper").Each(func(i int, s *goquery.Selection) {
// Extract title and link
title := s.Find("h3.indicate-hover").Text()
link, _ := s.Find("a.css-9mylee").Attr("href")
// Create article
article := Article{Title: title, Link: link}
// Append article to results
articles = append(articles, article)
})
// Print articles
for _, a := range articles {
fmt.Println(a.Title)
fmt.Println(a.Link)
}
}
In more advanced implementations you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
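A minimal sketch of that idea is to keep a small pool of user agent strings and pick one per request (the strings below are just illustrative examples, and the math/rand package would need to be imported):

// A handful of example user agent strings to rotate through.
userAgents := []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
}

// Pick a different one for each request.
req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])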
As you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy service, Proxies API, provides a simple API that can solve all IP blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed via a simple API, as shown below, from any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
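In Go, for example, the same call could look like this (a sketch using only the standard library; replace API_KEY with your own key):

// Fetch a target page through the Proxies API endpoint shown above.
target := "https://example.com"
apiURL := "http://api.proxiesapi.com/?key=API_KEY&url=" + url.QueryEscape(target)

resp, err := http.Get(apiURL)
if err != nil {
    panic(err)
}
defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)
if err != nil {
    panic(err)
}
fmt.Println(string(body))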
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.