The New York Times homepage contains dozens of article links that get updated throughout the day. If you want to grab those article titles and links programmatically for further analysis or processing, web scraping is a handy approach.
In this post, we’ll walk through Kotlin code that:
- Sends an HTTP request to retrieve the NYTimes homepage HTML
- Parses the HTML content using JSoup
- Extracts all article titles and links into lists
- Prints out the results
Follow along and you’ll end up with a working web scraper for this specific site. Then you can adapt the concepts for your own projects.
Sending the Initial Request
We kick things off by importing the libraries we need and defining our target URL:
import io.ktor.client.*
import io.ktor.client.engine.okhttp.*
import io.ktor.client.request.*
import org.jsoup.Jsoup
val url = "<https://www.nytimes.com/>"
To actually request this URL, we use the Ktor HTTP client. This handles all the low-level network communication for us.
Insider trick: We create the client using the OkHttp engine because it's fast and efficient:
val client = HttpClient(OkHttp)
Before sending the request, we add a custom User-Agent header. This pretends we're a real web browser. Without it, some sites may block automated scraper bots.
val headers = mapOf("User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
Finally, we make the GET request and store the HTML content:
val responseText = client.get<String>(url) {
    headers.forEach { (name, value) ->
        header(name, value)
    }
}
So with just a few lines of code, we've retrieved the latest homepage HTML from The Times!
Parsing the Content with JSoup
Now that we have the raw HTML content, we need to parse it to extract the bits we want - the article titles and links.
For this, we use JSoup - a handy Java library for working with HTML and XML.
We pass the HTML string into JSoup's parse() method to get a Document object:
val doc = Jsoup.parse(responseText)
This document gives us a DOM we can traverse with CSS selector queries to pinpoint exactly what we need.
Extracting the Articles
Inspecting the page
We open Chrome's Inspect Element tool to see how the markup is structured.
You can see that each article is contained in a section tag with the class story-wrapper.
We can grab all of them through this selector:
val articleSections = doc.select("section.story-wrapper")
Then we iterate through each section and find the title and link elements inside:
for (articleSection in articleSections) {
    val titleElement = articleSection.selectFirst("h3.indicate-hover")
    val linkElement = articleSection.selectFirst("a.css-9mylee")
    // extract title and link...
}
We check that both elements exist before extracting and storing the title text and link URL, as shown below.
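Here the null check matters because selectFirst() returns null when nothing matches. Assuming we've also created two mutable lists, articleTitles and articleLinks (as in the full code at the end of the post), the loop body looks like this:

// selectFirst() returns null when no element matches, so guard before using the results
if (titleElement != null && linkElement != null) {
    articleTitles.add(titleElement.text().trim())   // article headline text
    articleLinks.add(linkElement.attr("href"))      // article URL from the href attribute
}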
Finally, we print out the results:
for (i in articleTitles.indices) {
    println("Title: ${articleTitles[i]}")
    println("Link: ${articleLinks[i]}")
    println()
}
And we've successfully scraped the latest articles from the homepage!
The full code is included below to use as a reference.
Key Takeaways
- Ktor's HTTP client (with the OkHttp engine) fetches the homepage, and a browser-like User-Agent header helps avoid basic bot blocking.
- JSoup parses the raw HTML into a document you can query with CSS selectors such as section.story-wrapper.
- The class names used here (story-wrapper, indicate-hover, css-9mylee) are tied to the current NYTimes markup and may need updating when the site changes.
Full Code
Here is the complete code for this New York Times scraper:
import io.ktor.client.*
import io.ktor.client.engine.okhttp.*
import io.ktor.client.request.*
import org.jsoup.Jsoup
suspend fun main() {
    // URL of The New York Times website
    val url = "https://www.nytimes.com/"

    // Define a user-agent header to simulate a browser request
    val headers = mapOf(
        "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )

    // Create an HTTP client using the OkHttp engine
    val client = HttpClient(OkHttp)

    try {
        // Send an HTTP GET request to the URL with headers
        val responseText = client.get<String>(url) {
            headers.forEach { (name, value) ->
                header(name, value)
            }
        }

        // Parse the HTML content of the page using JSoup
        val doc = Jsoup.parse(responseText)

        // Find all article sections with class 'story-wrapper'
        val articleSections = doc.select("section.story-wrapper")

        // Initialize lists to store the article titles and links
        val articleTitles = mutableListOf<String>()
        val articleLinks = mutableListOf<String>()

        // Iterate through the article sections
        for (articleSection in articleSections) {
            // Find the article title element (null if not present)
            val titleElement = articleSection.selectFirst("h3.indicate-hover")
            // Find the article link element (null if not present)
            val linkElement = articleSection.selectFirst("a.css-9mylee")

            // If both title and link are found, extract and append
            if (titleElement != null && linkElement != null) {
                val articleTitle = titleElement.text().trim()
                val articleLink = linkElement.attr("href")
                articleTitles.add(articleTitle)
                articleLinks.add(articleLink)
            }
        }

        // Print or process the extracted article titles and links
        for (i in articleTitles.indices) {
            println("Title: ${articleTitles[i]}")
            println("Link: ${articleLinks[i]}")
            println()
        }
    } catch (e: Exception) {
        println("Failed to retrieve the web page. Exception: $e")
    } finally {
        client.close()
    }
}
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client making every request.
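A minimal sketch of that idea, assuming a small hand-picked pool of User-Agent strings (the extra strings below are illustrative placeholders, not a vetted list):

// Pick a random User-Agent from a small pool for each request
val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)

val responseText = client.get<String>(url) {
    header("User-Agent", userAgents.random())
}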
Get a little more advanced, though, and you'll find the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it's where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the 1000 free API calls we're currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.