Hacker News is a popular social news website in the technology and startup community. In this beginner tutorial, we will scrape key details from Hacker News article listings using Kotlin, printing each article's title, URL, points, author, timestamp, and comment count.
This is the page we are talking about…
Imports
Let's look at the imports needed:
import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup
These dependencies allow us to retrieve the Hacker News page and parse its content.
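If you manage the project with Gradle, a minimal dependency block might look like this (the coordinates are the official ones for OkHttp and Jsoup; the version numbers are just recent examples, so use whatever is current):
dependencies {
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
    implementation("org.jsoup:jsoup:1.17.2")
}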
Fetching the Page
First we define the Hacker News homepage URL:
val url = "https://news.ycombinator.com/"
Next we create an OkHttpClient instance:
val client = OkHttpClient()
And build a simple GET request for the URL:
val request = Request.Builder()
    .url(url)
    .build()
Finally we execute the request and handle the response:
client.newCall(request).execute().use { response ->
    // Parse response here
}
This sends the GET request; we access the response inside the lambda, and the use block guarantees the response is closed when we are done.
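Hacker News is generally tolerant of plain requests, but many sites reject clients that send no browser-like headers. If you ever need to set one, OkHttp's Request.Builder accepts headers directly; a sketch (the user-agent string below is only an example, not something the original code uses):
val request = Request.Builder()
    .url(url)
    .header("User-Agent", "Mozilla/5.0 (compatible; MyScraper/1.0)")
    .build()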
Parsing the Page with Jsoup
Inside the response handler, we first parse the HTML:
val html = response.body!!.string()
val document = Jsoup.parse(html)
This loads up a Jsoup Document object we can now query to extract data.
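As a quick sanity check, you can query the document right away. For example, printing the page title (just an illustration, not part of the scraper):
println(document.title()) // prints "Hacker News"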
Scraping Rows with Selectors
Inspecting the page
You can notice that each article item is housed inside a <tr> tag with the class "athing".
Jsoup uses CSS-style selectors to find elements. Let's get all <tr> rows from the table:
val rows = document.select("tr")
Matches:
<tr>...</tr>
<tr class="athing">...</tr>
...
We iterate over the rows, keeping track of the current article and row type:
var currentArticle = ""
var currentRowType = ""

for (row in rows) {
    // Check the type of row
    // Extract data
    // Update the current* variables
}
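As an aside (not part of the original walkthrough): Jsoup can pair each article row with the row that follows it via nextElementSibling(), which avoids the state variables entirely. A minimal sketch:
for (article in document.select("tr.athing")) {
    // On Hacker News, the details row immediately follows each article row
    val details = article.nextElementSibling()
    val title = article.selectFirst("span.titleline a")?.text() ?: ""
    val points = details?.selectFirst("span.score")?.text() ?: "0"
    println("$title ($points)")
}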
Getting Article Rows
We first check if a row has the "athing" class - this denotes an article listing:
if (row.hasClass("athing")) {
    currentArticle = row.toString()
    currentRowType = "article"
}
We save the full HTML of the article row to scrape details next.
Scraping Article Details
After an article row, the next row contains the remaining details like points, author, and comments (the title and URL live in the article row we just saved). We handle this case:
} else if (currentRowType == "article") {
    // Extract article details here

    // Reset the current* variables
    currentArticle = ""
    currentRowType = ""
}
Inside here we use other selectors to extract specific parts.
Title and URL
// The titleline span lives in the saved article row, so we re-parse it
val articleElem = Jsoup.parse(currentArticle)
val titleElem = articleElem.selectFirst("span.titleline")
if (titleElem != null) {
    val articleTitle = titleElem.select("a").text()
    val articleUrl = titleElem.select("a").attr("href")
}
Matches:
<span class="titleline">
  <a href="item?id=37497275">Ask HN: Am I the only one still using Vim?</a>
</span>
We specifically get the link text and href attribute.
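One thing to watch: hrefs like item?id=37497275 are relative to the site root. Jsoup can resolve them to absolute URLs if you supply a base URI at parse time. A sketch, assuming the same html string as before:
val docWithBase = Jsoup.parse(html, "https://news.ycombinator.com/")
val firstLink = docWithBase.selectFirst("span.titleline a")
println(firstLink?.absUrl("href")) // e.g. https://news.ycombinator.com/item?id=37497275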
Other Details
Similarly, we fetch points, author, timestamp, and comments:
val subtext = row.selectFirst("td.subtext")
val points = subtext?.selectFirst("span.score")?.text() ?: "0"
val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""
val commentsElem = subtext?.selectFirst("a:contains(comments)")
val comments = commentsElem?.text() ?: "0"
Using safe calls (?.) with Elvis-operator defaults, a missing field falls back to "0" or an empty string instead of throwing an exception.
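Note that points comes back as a raw string like "123 points". If you need it as a number, a small sketch using the Kotlin standard library (the variable name is ours, not from the original code):
val pointCount = points.substringBefore(" ").toIntOrNull() ?: 0
println(pointCount) // e.g. 123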
Printing Extracted Article Details
Finally, we print all extracted details:
println("Title: $articleTitle")
println("URL: $articleUrl")
println("Points: $points")
println("Author: $author")
println("Timestamp: $timestamp")
println("Comments: $comments")
println("-".repeat(50)) // Separator
This outputs each article's details! The full code to scrape Hacker News in Kotlin is below.
import okhttp3.OkHttpClient
import okhttp3.Request
import org.jsoup.Jsoup
fun main() {
    // Define the URL of the Hacker News homepage
    val url = "https://news.ycombinator.com/"

    // Create an OkHttpClient instance
    val client = OkHttpClient()

    // Create a GET request
    val request = Request.Builder()
        .url(url)
        .build()

    // Send the GET request and handle the response
    client.newCall(request).execute().use { response ->
        if (response.isSuccessful) {
            // Parse the HTML content of the page using Jsoup
            val html = response.body!!.string()
            val document = Jsoup.parse(html)

            // Find all rows in the table
            val rows = document.select("tr")

            // Initialize variables to keep track of the current article and row type
            var currentArticle = ""
            var currentRowType = ""

            // Iterate through the rows to scrape articles
            for (row in rows) {
                if (row.hasClass("athing")) {
                    // This is an article row
                    currentArticle = row.toString()
                    currentRowType = "article"
                } else if (currentRowType == "article") {
                    // This is the details row
                    if (currentArticle.isNotEmpty()) {
                        // The title and URL live in the saved article row, so re-parse it
                        val titleElem = Jsoup.parse(currentArticle).selectFirst("span.titleline")
                        if (titleElem != null) {
                            val articleTitle = titleElem.select("a").text()
                            val articleUrl = titleElem.select("a").attr("href")

                            // Points, author, timestamp, and comments live in the details row
                            val subtext = row.selectFirst("td.subtext")
                            val points = subtext?.selectFirst("span.score")?.text() ?: "0"
                            val author = subtext?.selectFirst("a.hnuser")?.text() ?: ""
                            val timestamp = subtext?.selectFirst("span.age")?.attr("title") ?: ""
                            val commentsElem = subtext?.selectFirst("a:contains(comments)")
                            val comments = commentsElem?.text() ?: "0"

                            // Print the extracted information
                            println("Title: $articleTitle")
                            println("URL: $articleUrl")
                            println("Points: $points")
                            println("Author: $author")
                            println("Timestamp: $timestamp")
                            println("Comments: $comments")
                            println("-".repeat(50)) // Separating articles
                        }
                    }
                    // Reset the current article and row type
                    currentArticle = ""
                    currentRowType = ""
                } else if (row.attr("style") == "height:5px") {
                    // This is the spacer row, skip it
                    continue
                }
            }
        } else {
            println("Failed to retrieve the page. Status code: ${response.code}")
        }
    }
}
Hopefully this gives a good look at using real-world libraries like OkHttp and Jsoup, along with CSS selectors, to easily extract content from websites.
This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since it uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must. Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly. Hundreds of our customers have successfully solved the headache of IP blocks with a simple API. The whole thing can be accessed by a simple API like the one below, in any programming language. In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key.