Web scraping is the process of extracting data from websites. It can be useful for getting data that is not available through an API or that would take a long time to collect manually.
In this article, we'll walk through a full code example for scraping Wikipedia to get data on all the US presidents. Our use case will be to print out the number, name, term dates, party, election year, and vice president for each president.
The data we want lives in the big sortable table on that page, which lists each president's number, name, term, party, election, and vice president.
Importing Jsoup
First we import the Jsoup Java library, which we'll use to connect to and parse content from the Wikipedia page:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
Jsoup handles a lot of the nitty-gritty of HTTP requests and HTML parsing for us. We just need to tell it which page to scrape.
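If you're using Maven, Jsoup is available on Maven Central and can be pulled in with a dependency declaration like the one below (the version shown is just an example; check Maven Central for the latest release):

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```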
Defining the URL
We define the Wikipedia URL we want to scrape. Specifically this is the page listing all US presidents:
String url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
Setting a User Agent
Next we set a user agent header to simulate a real browser request. Many websites block scrapers, so this makes our request look more legitimate:
String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
Getting the HTML Document
Now we use Jsoup to connect to the URL and get the HTML document. We pass the user agent we defined:
Document document = Jsoup.connect(url).userAgent(userAgent).get();
The document contains all the HTML from the Wikipedia page.
Extracting the Presidents Table
Next we want to extract the presidents table.
Inspecting the page
When we inspect the page, we can see that the table has the classes wikitable and sortable.
We use a CSS selector to find the first table element with both classes:
Element table = document.select("table.wikitable.sortable").first();
We initialize an empty StringBuilder to hold the scraped data:
StringBuilder output = new StringBuilder();
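As a quick aside, StringBuilder lets us accumulate text without creating a new String on every concatenation, and its append calls can be chained, which is the style used in the full listing below. A small self-contained sketch (the class and method names are just for illustration):

```java
public class BuilderDemo {
    // Build a "key: value" line with chained appends, avoiding intermediate Strings
    static String label(String key, String value) {
        return new StringBuilder().append(key).append(": ").append(value).append("\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(label("Number", "1"));
    }
}
```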
Looping Through Table Rows
Now we loop through the rows of the table. We skip the first row since that is the header. For each row, we grab the data cells:
Elements rows = table.select("tr");
for (Element row : rows.subList(1, rows.size())) {
    Elements columns = row.select("td, th");
    // extract data from cells
}
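The subList(1, size()) idiom simply drops the first element of the list, the header row, without copying anything. Here is the same pattern as a stdlib-only sketch, using plain strings in place of Jsoup Elements (the names are illustrative):

```java
import java.util.List;

public class SkipHeaderDemo {
    // subList(1, size()) returns a view of everything after the header row
    static List<String> skipHeader(List<String> rows) {
        return rows.subList(1, rows.size());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("header", "row1", "row2");
        for (String row : skipHeader(rows)) {
            System.out.println(row);
        }
    }
}
```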
Inside the loop, we extract the text from the cells we care about - the number, name, term, party, etc. We append labels and values to the output.
String number = columns.get(0).text();
output.append("Number: ").append(number).append("\n");
Printing the Scraped Data
After the loop, we print out the full scraped president data!
System.out.println(output.toString());
And that's it! We've now written a full Wikipedia scraper to extract president data.
Key Takeaways
You could extend this scraper to get more data, export the data to JSON/CSV, store it in a database, and more!
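For instance, a CSV export could be sketched like this. The escaping here is deliberately minimal (quote every field, double embedded quotes), and the class and helper names are just for illustration; a real implementation would likely use a CSV library such as OpenCSV:

```java
import java.util.List;

public class CsvExportSketch {
    // Quote a field and escape embedded quotes (minimal CSV escaping)
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join a row of fields into one CSV line
    static String toCsvRow(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow(List.of("number", "name", "term", "party")));
        System.out.println(toCsvRow(List.of("1", "George Washington", "1789-1797", "Unaffiliated")));
    }
}
```

From there, writing the lines to disk is one call to Files.writeString.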
Full code below:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class WikipediaScraper {
    public static void main(String[] args) {
        // Define the URL of the Wikipedia page
        String url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
        try {
            // Define a user-agent header to simulate a browser request
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";
            // Send an HTTP GET request to the URL with the headers
            Document document = Jsoup.connect(url).userAgent(userAgent).get();
            // Find the first table with the specified class names
            Element table = document.select("table.wikitable.sortable").first();
            if (table == null) {
                System.err.println("Could not find the presidents table on the page.");
                return;
            }
            // Initialize a StringBuilder to accumulate the table data
            StringBuilder output = new StringBuilder();
            // Iterate through the rows of the table, skipping the header row
            Elements rows = table.select("tr");
            for (Element row : rows.subList(1, rows.size())) {
                Elements columns = row.select("td, th");
                // Rows collapsed by rowspan (e.g. a president's second term) have fewer cells; skip them
                if (columns.size() < 8) {
                    continue;
                }
                // Extract data from each column and append it to the output
                String number = columns.get(0).text();
                String name = columns.get(2).text();
                String term = columns.get(3).text();
                String party = columns.get(5).text();
                String election = columns.get(6).text();
                String vicePresident = columns.get(7).text();
                output.append("President Data:\n");
                output.append("Number: ").append(number).append("\n");
                output.append("Name: ").append(name).append("\n");
                output.append("Term: ").append(term).append("\n");
                output.append("Party: ").append(party).append("\n");
                output.append("Election: ").append(election).append("\n");
                output.append("Vice President: ").append(vicePresident).append("\n\n");
            }
            // Print the scraped data for all presidents
            System.out.println(output);
        } catch (IOException e) {
            System.err.println("Failed to retrieve the web page: " + e.getMessage());
        }
    }
}
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
Go a little further, though, and you will realize that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, more often than not, make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP-blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.