Web scraping is the process of automatically collecting structured data from websites. It can be useful for getting data off the web and into a format you can work with in applications. In this tutorial, we'll walk through scraping a Wikipedia table with Golang.
Specifically, we'll scrape the table listing all the presidents of the United States from this page:
https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
The target is the main table on that page, which lists each president along with their number, name, term, party, and vice president.

Our final program will fetch the page, parse that table, and print each president's details. This provides a blueprint for scraping and structuring data from any web page with tables.

Let's get started!
First we import the packages we'll need:
import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)
Next we'll define the URL of the Wikipedia page we want to scrape:
url := "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
Now we need to make the HTTP request. Many web servers flag automated scrapers by the absence of the headers a normal browser sends, so we'll simulate a real browser's headers:
headers := map[string]string{
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}
This will fool the server into thinking a real browser is making the request.
We'll create a new HTTP client and build a GET request with our headers:
client := &http.Client{}
req, err := http.NewRequest("GET", url, nil)
if err != nil {
    log.Fatal(err)
}
for key, value := range headers {
    req.Header.Set(key, value)
}
Now we can send the request and get the response:
response, err := client.Do(req)
if err != nil {
    log.Fatal(err)
}
defer response.Body.Close()
We close the response body when done to prevent resource leaks.
Let's check if the request succeeded with a 200 status code:
if response.StatusCode == 200 {
    // Parsing logic here
} else {
    fmt.Println("Failed to retrieve page. Status code:", response.StatusCode)
}
If successful, we can parse the HTML using goquery, which allows jQuery-style element selection.
First we load the response HTML into a goquery document:
doc, err := goquery.NewDocumentFromReader(response.Body)
if err != nil {
    log.Fatal(err)
}
We can now search for elements by class, id, tag name and so on. Let's find the presidents table. Inspecting the page in the browser's developer tools shows that the table carries the classes wikitable and sortable:
table := doc.Find("table.wikitable.sortable")
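Find returns every element that matches the selector. If a page happened to contain more than one wikitable sortable table, you could narrow the selection with goquery's First or Eq, for example:

// Keep only the first matching table (or pick one by index with .Eq(n)).
table := doc.Find("table.wikitable.sortable").First()

On the presidents page this isn't strictly necessary, but it makes the selection explicit.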
We initialize a slice to store our scraped data:
data := [][]string{}
Then we iterate through the table rows, skipping the header:
table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {
    if rowIdx == 0 {
        return
    }
    // extract row data here
})
Inside this callback, we extract and store the text of each cell. We select both th and td cells, because some cells in each row (such as the leading number) are header cells rather than data cells; this also matches the complete program at the end:

rowData := []string{}
row.Find("th,td").Each(func(colIdx int, col *goquery.Selection) {
    rowData = append(rowData, col.Text())
})
data = append(data, rowData)
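Note that col.Text() returns the raw cell text, which on Wikipedia often includes newlines, footnote markers, and stray whitespace. Here is a small optional cleanup helper, using only the standard strings package (cleanCell is our own name):

// cleanCell trims the cell text and collapses internal whitespace.
// It does not remove footnote markers such as "[a]"; that would need extra handling.
func cleanCell(s string) string {
    return strings.Join(strings.Fields(s), " ")
}

// usage inside the Each callback:
// rowData = append(rowData, cleanCell(col.Text()))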
Finally, we can print out the scraped president data:
for _, presidentData := range data {
    if len(presidentData) < 4 {
        continue // skip any short or malformed rows
    }
    fmt.Println("Name: ", presidentData[2])
    fmt.Println("Term: ", presidentData[3])
    // etc
}
And we have a working Wikipedia scraper in Go!
Some ideas for next steps: write the scraped rows out to a CSV file (a short sketch follows the full listing below), pull additional columns, or point the same approach at other Wikipedia tables.

Here is the complete program:
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Define the URL of the Wikipedia page
    url := "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

    // Define a user-agent header to simulate a browser request
    headers := map[string]string{
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    }

    // Create an HTTP client and build the GET request with our headers
    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    for key, value := range headers {
        req.Header.Set(key, value)
    }

    // Send the HTTP GET request
    response, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()

    // Check if the request was successful (status code 200)
    if response.StatusCode == 200 {
        // Parse the HTML content of the page using goquery
        doc, err := goquery.NewDocumentFromReader(response.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Find the table with the specified class names
        table := doc.Find("table.wikitable.sortable")

        // Initialize an empty slice to store the table data
        data := [][]string{}

        // Iterate through the rows of the table
        table.Find("tr").Each(func(rowIdx int, row *goquery.Selection) {
            // Skip the header row
            if rowIdx == 0 {
                return
            }
            // Extract the text of each cell and append it to the row data
            rowData := []string{}
            row.Find("th,td").Each(func(colIdx int, col *goquery.Selection) {
                rowData = append(rowData, col.Text())
            })
            data = append(data, rowData)
        })

        // Print the scraped data for all presidents
        for _, presidentData := range data {
            // Skip rows that don't have the expected number of cells
            if len(presidentData) < 8 {
                continue
            }
            fmt.Println("President Data:")
            fmt.Println("Number:", presidentData[0])
            fmt.Println("Name:", presidentData[2])
            fmt.Println("Term:", presidentData[3])
            fmt.Println("Party:", presidentData[5])
            fmt.Println("Election:", presidentData[6])
            fmt.Println("Vice President:", presidentData[7])
            fmt.Println()
        }
    } else {
        fmt.Println("Failed to retrieve the web page. Status code:", response.StatusCode)
    }
}
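As a first next step, here is a minimal sketch of writing the scraped rows to a CSV file with the standard library's encoding/csv package. It assumes the data slice from the program above; the file name presidents.csv is just our choice:

// Write the scraped rows to presidents.csv.
// Add "encoding/csv" and "os" to the imports.
file, err := os.Create("presidents.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

writer := csv.NewWriter(file)
defer writer.Flush()

for _, row := range data {
    if err := writer.Write(row); err != nil {
        log.Fatal(err)
    }
}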
In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell the requests are coming from the same browser!
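Here is a minimal sketch of what that rotation can look like, picking a random string from a small pool for each request (the pool below is just an example, and it needs math/rand in the imports):

// A small pool of example User-Agent strings to rotate through.
userAgents := []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
}
// Pick one at random for this request.
req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])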
If you get a little more advanced, you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
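From Go, the same call is just an HTTP GET to that endpoint with the target page URL-escaped. A sketch, assuming you replace API_KEY with your own key and add net/url to the imports:

// Fetch a page through the rotating proxy endpoint shown above.
// (If you paste this into the main function above, rename the url variable to avoid shadowing net/url.)
target := "https://example.com"
apiURL := "http://api.proxiesapi.com/?key=API_KEY&url=" + url.QueryEscape(target)
resp, err := http.Get(apiURL)
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()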
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.