Web scraping is the process of extracting data from websites automatically through code. It allows you to harvest and use public data in all kinds of beneficial ways.
In this article, we'll walk through a full example of scraping article titles and links from the home page of The New York Times (NYT). The home page doesn't expose this article data in a form we can query directly, so web scraping provides a way to get it.
Why Scrape The New York Times?
The NYT publishes high-quality, timely articles across many topics. Scraping them allows you to tap into this great content for your own projects, such as news monitoring, trend analysis, or research. Many ideas are possible once the data has been extracted!
Step 1: Set Up Imports and Modules
We first need to import the .NET namespaces and define a module for our scraper code:
Imports System.Net
Imports System.IO
Imports HtmlAgilityPack
Module Program
Wrapping our code in a module allows it to be called from other parts of the application.
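Putting that together, the skeleton looks like this (a minimal sketch; the scraping logic from the following steps goes inside Main):
Imports System.Net
Imports System.IO
Imports HtmlAgilityPack

Module Program
    Sub Main()
        ' Scraping code from the steps below goes here
    End Sub
End Module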
Step 2: Create Request to NYT Website
To make a request to any web page, we need to specify the URL. For The New York Times home page:
Dim url As String = "https://www.nytimes.com/"
We should also define a user-agent header that identifies our program as a browser. This gets around blocks some sites place on scraping bots:
Dim userAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Then we can construct an HttpWebRequest with these values:
Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
request.UserAgent = userAgent
This request will simulate a browser visiting the URL.
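Optionally, the request can be tuned further. Here is a sketch of two settings that are often useful; the values are just illustrative:
' Optional: fail fast instead of hanging on a slow response (value in milliseconds)
request.Timeout = 10000
' Optional: advertise the content types we accept, as a browser would
request.Accept = "text/html,application/xhtml+xml"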
Step 3: Send Request and Get Response
To actually call the URL, we use the GetResponse() method on the request:
Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
This will connect to the URL and return an HttpWebResponse object containing the page content.
We should check that the request was successful by looking at the status code:
If response.StatusCode = HttpStatusCode.OK Then
' Request succeeded
End If
Status code 200 means everything went smoothly. Other codes indicate an error happened.
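One caveat worth knowing: HttpWebRequest raises a WebException for 4xx and 5xx responses rather than returning them, so in practice you may want to wrap the call. A minimal sketch:
Try
    Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    ' ... work with the response here ...
Catch ex As WebException
    ' Thrown for network failures and non-success status codes (e.g. a 403 from a bot block)
    Console.WriteLine("Request failed: " & ex.Message)
End Try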
Step 4: Load HTML into Parser
Now that we have the raw HTML content from the page, we need to parse it to extract the articles. The Html Agility Pack (HAP) library allows easy parsing and querying of HTML in .NET.
We load the response stream into an HtmlDocument object:
Dim htmlDoc As New HtmlDocument()
htmlDoc.Load(response.GetResponseStream())
This document represents a structured tree of elements that we can now explore using XPath queries.
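As a quick sanity check that parsing worked, we can pull out the page title (a minimal sketch):
' Grab the <title> element to confirm the document parsed
Dim titleNode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//title")
If titleNode IsNot Nothing Then
    Console.WriteLine("Page title: " & titleNode.InnerText)
End If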
Step 5: Use XPath to Extract Articles
Inspecting the page
Using Chrome's Inspect Element tool, we can see how the page markup is structured: each article is contained in a section tag with the class story-wrapper.
HAP has many options to target elements - we'll use XPath queries here since they work well for scraping structured data.
First we get all section nodes whose class contains story-wrapper:
Dim articleSections As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//section[contains(@class, 'story-wrapper')]")
This finds the key sections containing articles on NYT's home page.
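Note that SelectNodes returns Nothing when no elements match (for example, if the NYT class names change), so it is worth guarding before iterating:
If articleSections Is Nothing Then
    Console.WriteLine("No story-wrapper sections found - the page markup may have changed.")
    Return
End If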
We iterate over these sections and use more XPath queries to extract the title and link inside each:
For Each articleSection As HtmlNode In articleSections
    Dim titleElement As HtmlNode = articleSection.SelectSingleNode(".//h3[contains(@class, 'indicate-hover')]")
    Dim linkElement As HtmlNode = articleSection.SelectSingleNode(".//a[contains(@class, 'css-9mylee')]")
Next
These queries target the specific elements in each section that contain the data we want.
Step 6: Store Results in Lists
As we extract each title and link, we can store them in lists:
Dim articleTitles As New List(Of String)
Dim articleLinks As New List(Of String)
'...
articleTitles.Add(titleElement.InnerText.Trim())
articleLinks.Add(linkElement.GetAttributeValue("href", ""))
These lists give us easy access to the scraped data for any processing or output we want.
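If you prefer a single object model over two parallel lists, a small record type works just as well. A sketch, declared alongside the Module:
' A simple record type pairing each title with its link
Public Class Article
    Public Property Title As String
    Public Property Link As String
End Class

' Collect results into one list instead of two
Dim articles As New List(Of Article)()
articles.Add(New Article With {.Title = titleElement.InnerText.Trim(), .Link = linkElement.GetAttributeValue("href", "")})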
Step 7: Print/Process Results
Finally, we can print the article titles and links:
For i As Integer = 0 To articleTitles.Count - 1
Console.WriteLine("Title: " & articleTitles(i))
Console.WriteLine("Link: " & articleLinks(i))
Console.WriteLine()
Next
This will output each article scraped from the homepage.
The full code can be found at the end of this article.
Key Takeaways
The key steps to scrape structured article data are:
- Identify target site and elements
- Craft web request with user agent
- Parse page HTML
- Extract data with XPath
- Store in object model
- Output/process results
From here you could expand to scrape additional fields, save to a database, or integrate with other systems. Web scraping opens up many possibilities!
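For instance, persisting the results to a CSV file takes only a few lines. A rough sketch using the lists from Step 6:
' Write one "title,link" row per article, quoting titles so embedded commas are safe
Dim csvRows As New List(Of String)
For i As Integer = 0 To articleTitles.Count - 1
    csvRows.Add("""" & articleTitles(i).Replace("""", """""") & """," & articleLinks(i))
Next
File.WriteAllLines("articles.csv", csvRows)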
Next Steps
To extend this simple scraper, you could add error handling and retries, pull additional fields such as summaries or images, run it on a schedule, or save the results to a database.
Web scraping brings the vast content of the web to your fingertips! Let us know if you have any other questions.
Full code:
Imports System.Net
Imports System.IO
Imports HtmlAgilityPack

Module Program
    Sub Main()
        ' URL of The New York Times website
        Dim url As String = "https://www.nytimes.com/"

        ' Define a user-agent header to simulate a browser request
        Dim userAgent As String = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

        ' Create an HTTP request with the user-agent header
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.UserAgent = userAgent

        ' Send an HTTP GET request to the URL (Using ensures the response is disposed)
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            ' Check if the request was successful (status code 200)
            If response.StatusCode = HttpStatusCode.OK Then
                ' Create an HtmlDocument to parse the HTML content
                Dim htmlDoc As New HtmlDocument()
                htmlDoc.Load(response.GetResponseStream())

                ' Find all article sections with class 'story-wrapper'
                Dim articleSections As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//section[contains(@class, 'story-wrapper')]")

                ' Initialize lists to store the article titles and links
                Dim articleTitles As New List(Of String)()
                Dim articleLinks As New List(Of String)()

                ' Iterate through the article sections
                If articleSections IsNot Nothing Then
                    For Each articleSection As HtmlNode In articleSections
                        ' Check if the article title element exists
                        Dim titleElement As HtmlNode = articleSection.SelectSingleNode(".//h3[contains(@class, 'indicate-hover')]")
                        ' Check if the article link element exists
                        Dim linkElement As HtmlNode = articleSection.SelectSingleNode(".//a[contains(@class, 'css-9mylee')]")
                        ' If both title and link are found, extract and append
                        If titleElement IsNot Nothing AndAlso linkElement IsNot Nothing Then
                            Dim articleTitle As String = titleElement.InnerText.Trim()
                            Dim articleLink As String = linkElement.GetAttributeValue("href", "")
                            articleTitles.Add(articleTitle)
                            articleLinks.Add(articleLink)
                        End If
                    Next
                End If

                ' Print or process the extracted article titles and links
                For i As Integer = 0 To articleTitles.Count - 1
                    Console.WriteLine("Title: " & articleTitles(i))
                    Console.WriteLine("Link: " & articleLinks(i))
                    Console.WriteLine()
                Next
            Else
                Console.WriteLine("Failed to retrieve the web page. Status code: " & response.StatusCode.ToString())
            End If
        End Using
    End Sub
End Module
In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same client making every request!
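A minimal sketch of that idea: keep a small pool of User-Agent strings and pick one at random per request (the strings here are just examples):
' A small pool of browser User-Agent strings (illustrative examples)
Dim userAgents As String() = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
}
Dim rng As New Random()
request.UserAgent = userAgents(rng.Next(userAgents.Length))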
Get a little more advanced, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.
Overcoming IP Blocks
Investing in a private rotating proxy service like Proxies API can make the difference between a headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with our current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Our rotating proxy server, Proxies API, provides a simple API that can solve IP-blocking problems instantly. Hundreds of our customers have made the headache of IP blocks go away this way.
The whole thing can be accessed through a simple API, from any programming language:
curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.