Wikipedia is a gold mine containing extensive information on just about every topic imaginable. As one of the most popular websites globally, it contains structured data that can be incredibly useful for analysis if extracted properly. This is where web scraping comes in.
In this article, we will walk through a complete example of how to scrape data from Wikipedia pages using R. We will extract information on all the Presidents of the United States and print it out.
This is the table we are talking about
Here's a peek at the key things you'll learn:
And much more! By the end, you'll have hands-on experience with the end-to-end process.
The best way to learn web scraping is by getting our hands dirty with some code. So without further ado, let's get scraping!
Step 1: Import Libraries
We will leverage a few handy R libraries that make scraping very easy:
library(httr) # for sending HTTP requests to get webpages
library(rvest) # for parsing and extracting HTML content
library(xml2) # for wrangling XML/HTML
Let's go through the purpose of each:
httr: Provides useful functions for creating and sending HTTP requests to fetch resources like HTML pages. We don't want to deal with HTTP at a low level, so this abstracts it away.
rvest: Built on top of xml2 and httr, this provides very useful tools for parsing, selecting, and extracting content from the HTML and XML documents we fetch. Our best friend for scraping!
xml2: Useful for wrangling and processing XML/HTML documents once extracted.
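So in a nutshell: httr fetches the page, rvest parses it and extracts the content, and xml2 handles the underlying XML/HTML documents.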
This combination is very powerful!
Step 2: Define the URL
We need to pass a URL into httr to actually fetch the webpage. Let's define the URL of the Wikipedia page we want to scrape:
url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
Specifically, we will be scraping the List of Presidents page which contains tables with plenty of structured data on all US presidents.
Step 3: Create a User-Agent Header
Websites can identify who is sending requests by checking the User-Agent - a text header that contains info about the software/client making the request.
To mimic a real browser:
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
This makes Wikipedia think a Chrome browser is accessing it. Doing this helps avoid blocks since Wikipedia doesn't like scraping bots!
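As a minimal sketch of the two ways httr can carry a User-Agent, the block below builds the request configuration without sending anything. The second agent string and its contact address are hypothetical placeholders, not values from this walkthrough.

```r
library(httr)

# Option 1: a named header passed through add_headers(), as in this step.
browser_ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
header_config <- add_headers("User-Agent" = browser_ua)

# Option 2: httr's user_agent() helper. The string below is a hypothetical,
# self-describing agent with a placeholder contact address.
ua_config <- user_agent("my-r-scraper/0.1 (contact: you@example.com)")

# Both are request configuration objects that can be passed to GET().
class(header_config)
class(ua_config)
```

Either object can be handed to GET() as an extra argument, for example GET(url, ua_config).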
Step 4: Send HTTP Request
Now we can fetch the page by sending a GET request:
response <- GET(url, add_headers(headers))
This will return an HTTP response object containing the status code, headers, and most importantly - the HTML content!
Step 5: Check if Request Succeeded
It's good practice to ensure the request was successful before trying to extract data.
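if (http_status(response)$category == "Success") {
  # Success! Extract data
} else {
  cat("Failed to retrieve page. Status code:", status_code(response), "\n")
}
We simply check whether the status category is "Success" (a 2xx code such as 200); note that httr capitalizes the category name. If the request failed, we print the status code, which comes from status_code() because http_status() does not return a code field.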
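To see what http_status() reports without sending a request, it can be called on a bare numeric status code. The sketch below does that for two codes; is_success() is a hypothetical helper added only for illustration.

```r
library(httr)

# http_status() accepts a numeric status code as well as a response object.
ok <- http_status(200)
missing_page <- http_status(404)

ok$category            # "Success" (capitalized)
missing_page$category  # "Client error"

# Hypothetical helper: treat any 2xx code as a success.
is_success <- function(code) code >= 200 && code < 300

is_success(200)
is_success(404)
```

On a real response, status_code(response) returns the numeric code that such a helper would take.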
Step 6: Parse the HTML
Since the request succeeded, we can parse the HTML using read_html():
webpage <- read_html(response)
This parses the raw HTML into an xml document that rvest can now query!
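The parsing step can be tried offline on a small inline document. The HTML string below is a hypothetical stand-in for the Wikipedia page, used only to show what read_html() returns and how rvest queries it.

```r
library(rvest)

# A tiny stand-in document (hypothetical) instead of the live page.
html <- "<html><head><title>Demo page</title></head>
<body><h1>Presidents</h1><p>First paragraph.</p></body></html>"

# read_html() accepts literal HTML as well as a URL or an httr response.
page <- read_html(html)

class(page)                           # "xml_document" "xml_node"
html_text(html_node(page, "title"))   # "Demo page"
html_text(html_node(page, "h1"))      # "Presidents"
```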
Step 7: Extract the Table
On the Wikipedia page, all president data sits within a single table element.
Inspecting the page
When we inspect the page, we can see that the table has two classes: wikitable and sortable.
We can use an XPath query to extract just this table node:
table <- html_node(webpage, xpath="//table[contains(@class, 'wikitable sortable')]")
This says - find the table tag with a class attribute containing "wikitable sortable".
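Here is a minimal sketch of this selection on a small hypothetical page with two tables, so it can be run without network access. It also shows a CSS selector as an alternative to the XPath query, and converts the selected node with rvest's html_table() as a quick sanity check. The table contents are made up for the example.

```r
library(rvest)

# Hypothetical page with two tables; only the second has both classes.
html <- '<html><body>
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>No.</th><th>Name</th></tr>
  <tr><td>1</td><td>George Washington</td></tr>
  <tr><td>2</td><td>John Adams</td></tr>
</table>
</body></html>'
page <- read_html(html)

# XPath: substring match on the class attribute, as in this step.
by_xpath <- html_node(page, xpath = "//table[contains(@class, 'wikitable sortable')]")

# CSS: requires both classes, regardless of their order in the attribute.
by_css <- html_node(page, "table.wikitable.sortable")

html_attr(by_xpath, "class")
html_attr(by_css, "class")

# Sanity check: the row of th cells becomes the column names.
presidents <- html_table(by_css)
presidents
```

The XPath version depends on the two class names appearing next to each other in that order, while the CSS version does not, which can matter if the page markup changes.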
Step 8: Extract Table Rows
We can now grab all the tr nodes within this table with:
rows <- html_nodes(table, "tr")
This selects all the table row elements we want to extract. Note:
The first row contains the headers, so it must not be treated as data. html_table(), used in the next step, takes care of this by turning a header row of th cells into column names.
Step 9: Convert Table to Data Frame
To automatically parse the table into a data frame:
data <- html_table(table, fill = TRUE)
This uses the html_table() function from rvest. Because table is already the table node, it is passed in directly, and the header row becomes the column names rather than a data row.
Step 10: Print Extracted Data
Finally, we can iterate through the rows and print president data:
for (i in seq_len(nrow(data))) {
  cat("Number:", data[[i, 1]], "\n")
  cat("Name:", data[[i, 3]], "\n")
  cat("Term:", data[[i, 4]], "\n")
  # ...
}
The column positions used here depend on the current layout of the Wikipedia table, so check names(data) if the output looks off.
And we have successfully scraped Wikipedia in R!
Full code:
library(httr)
library(rvest)
library(xml2)
# Define the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
# Define a user-agent header to simulate a browser request
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
# Send an HTTP GET request to the URL with the headers
response <- GET(url, add_headers(headers))
# Check if the request was successful (any 2xx status code)
if (http_status(response)$category == "Success") {
  # Parse the HTML content of the page
  webpage <- read_html(response)
  # Find the table with the specified class name
  table <- html_node(webpage, xpath = "//table[contains(@class, 'wikitable sortable')]")
  # Extract the rows from the table (header row included)
  rows <- html_nodes(table, "tr")
  # Convert the table to a data frame; the header row becomes the column names
  data <- html_table(table, fill = TRUE)
  # Print the scraped data for all presidents
  for (i in seq_len(nrow(data))) {
    cat("President Data:\n")
    cat("Number:", data[[i, 1]], "\n")
    cat("Name:", data[[i, 3]], "\n")
    cat("Term:", data[[i, 4]], "\n")
    cat("Party:", data[[i, 6]], "\n")
    cat("Election:", data[[i, 7]], "\n")
    cat("Vice President:", data[[i, 8]], "\n\n")
  }
} else {
  cat("Failed to retrieve the web page. Status code:", status_code(response), "\n")
}
Key Takeaways
Let's recap what we learned: how to send a request with httr, verify that it succeeded, parse the HTML, locate a table with an XPath query, and convert it into a data frame with rvest.
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser.
Going a little further, you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with 1000 free API calls on offer, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that can solve IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with it.
The whole thing can be accessed through a simple API call like the one below, in any programming language:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
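The User-Agent rotation mentioned above can be sketched in R as follows. This is only an illustration under stated assumptions: the pool of agent strings and the helper name random_user_agent() are hypothetical, and nothing is sent over the network here.

```r
library(httr)

# Hypothetical pool of User-Agent strings; a real project would maintain its own.
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
)

# Pick one agent at random and wrap it in an httr request configuration.
random_user_agent <- function(pool = user_agents) {
  user_agent(sample(pool, 1))
}

# Usage (not run here, since it needs network access):
# response <- GET(url, random_user_agent())
```

Calling random_user_agent() once per request gives each request a freshly chosen agent from the pool.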