While debates around the ethics of web scraping continue, the practice remains a useful way for developers to extract data from websites. In this beginner-focused tutorial, we'll walk through a full code example for scraping key details from real estate listings on Realtor.com using Jsoup, an HTML-parsing library for Java. Our examples are written in Scala, which runs on the JVM and can use Jsoup directly.
Getting Set Up
Before we dive into the code, you'll need to install Jsoup if you don't already have it. You can add this dependency in your project's build tool, such as Gradle or Maven. Here's an example for Gradle:
implementation 'org.jsoup:jsoup:1.14.3'
And for Maven:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.14.3</version>
</dependency>
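Since the code samples in this post are written in Scala, sbt users can add the same dependency with this line in `build.sbt` (note the single `%`, because Jsoup is a plain Java artifact):

```
libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"
```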
Now we're ready to scrape!
Connecting to the Page
Let's explore what's happening section-by-section:
// Define the URL of the Realtor.com search page
val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"
We specify the exact Realtor URL that we want to scrape. This will contain the listings for San Francisco when visited in a browser.
// Set the User-Agent header
val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
Next we set the User-Agent header to mimic a Chrome browser visit. Many sites check this header to determine if the visitor is a real browser or an automated program.
// Fetch the HTML content of the page
val doc: Document = Jsoup.connect(url).userAgent(userAgent).get()
We use Jsoup to connect to the Realtor URL, passing in that User-Agent string we set. Jsoup downloads (or "fetches") the full HTML content from that page and stores it for us to work with in the returned Document object.
Tip: I find it helpful to think of this like browsing to the page and doing "View Source" to see all the underlying HTML. Jsoup handles that part for us programmatically.
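If you want to experiment with Documents and selectors before fetching anything over the network, Jsoup can also build a Document directly from a raw HTML string with `Jsoup.parse`. Here's a minimal offline sketch; the HTML snippet is invented for illustration, not Realtor.com's actual markup:

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

object ParseDemo {
  def main(args: Array[String]): Unit = {
    // A tiny stand-in for the HTML a real fetch would return
    val html = """<html><body>
                 |  <div class="card-price">$950,000</div>
                 |</body></html>""".stripMargin

    // Jsoup.parse yields the same Document type as Jsoup.connect(...).get()
    val doc: Document = Jsoup.parse(html)

    println(doc.selectFirst("div.card-price").text()) // prints "$950,000"
  }
}
```

This is handy for unit-testing your selectors against saved copies of a page, so you don't hammer the live site while debugging.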
Now let's move on to the most critical part - actually extracting information from that HTML using CSS selectors!
Extracting Data with Selectors
Inspecting the element
When we inspect the page in Chrome's dev tools, we can see that each listing block is wrapped in a div with the class value used in the selector below.
// Find all the listing blocks using the provided class name.
// select returns a Jsoup Elements (a java.util.List), so we convert
// it to a Scala Seq with .asScala (import scala.jdk.CollectionConverters._)
val listingBlocks: Seq[Element] = doc.select("div.BasePropertyCard_propertyCardWrap__J0xUj").asScala.toSeq
The main listings on Realtor.com are contained in div elements with the class BasePropertyCard_propertyCardWrap__J0xUj. This is an auto-generated class name, so expect it to change whenever the site is rebuilt; your scraper will need updating when it does.
We use Jsoup's select method with a tag.className CSS selector to grab every element on the page that matches.
Tip: You can discover class names to target by inspecting elements in your browser's dev tools. The styles and classes applied to each element are visible there.
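To see how select behaves on repeated elements, here's a small self-contained sketch. The markup is made up for the example; only the Jsoup calls mirror what we do against the real page:

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object SelectDemo {
  def main(args: Array[String]): Unit = {
    val html = """<div class="card">A</div>
                 |<div class="card">B</div>
                 |<div class="other">C</div>""".stripMargin
    val doc = Jsoup.parse(html)

    // select matches every element with class "card"; Elements is a
    // java.util.List, so .asScala lets us use a Scala for-comprehension
    val cards = doc.select("div.card").asScala
    for (card <- cards) println(card.text())
    // prints:
    // A
    // B
  }
}
```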
Now we can iterate through each listing:
for (listingBlock <- listingBlocks) {
// Extract data from each listingBlock
}
And inside that loop, we use additional selectors to pull text from specific tags:
val brokerName: String = listingBlock.selectFirst("span.BrokerTitle_titleText__20u1P").text()
This selector finds the first span element with the class BrokerTitle_titleText__20u1P inside the listing block and returns its text content.
Let's break down that selector: span targets the tag name, the dot introduces a class constraint, and BrokerTitle_titleText__20u1P is the (auto-generated) class value the element must carry.
We use very similar selectors to extract other data points like status, price, beds, baths etc.:
val status: String = listingBlock.selectFirst("div.message").text()
val price: String = listingBlock.selectFirst("div.card-price").text()
// And so on...
Notice how on some elements we target a class (div.message, div.card-price), while on others we use an attribute selector like li[data-testid=property-meta-beds], matching on a data-testid attribute instead of a class.
Here are a few key advantages of using selectors: they reuse the syntax you already know from CSS, they're far more concise than walking the DOM tree manually, and a single expression can combine tags, classes, and attributes to pinpoint exactly the element you need.
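To make the class-versus-attribute distinction concrete, here's a short offline sketch. The list markup is invented for the example, loosely modeled on the patterns above:

```scala
import org.jsoup.Jsoup

object AttrVsClassDemo {
  def main(args: Array[String]): Unit = {
    val html = """<ul>
                 |  <li data-testid="property-meta-beds">3 beds</li>
                 |  <li class="message">For Sale</li>
                 |</ul>""".stripMargin
    val doc = Jsoup.parse(html)

    // Attribute selector: match on a data-* attribute value
    println(doc.selectFirst("li[data-testid=property-meta-beds]").text()) // prints "3 beds"

    // Class selector: match on the class attribute
    println(doc.selectFirst("li.message").text()) // prints "For Sale"
  }
}
```

Attribute selectors like data-testid are often more stable than auto-generated class names, since they're frequently added for the site's own test suites.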
Now you have a high-level look at how Jsoup connects to pages and uses CSS selector queries to extract key listing details! Let's look at the full code for reference...
Full Code
import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}
import scala.jdk.CollectionConverters._

object RealtorScraper {

  // selectFirst returns null when nothing matches, so wrap it in an Option
  // to avoid NullPointerExceptions on listings with missing fields
  private def textOf(parent: Element, selector: String): String =
    Option(parent.selectFirst(selector)).map(_.text()).getOrElse("")

  def main(args: Array[String]): Unit = {
    // Define the URL of the Realtor.com search page
    val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

    // Set the User-Agent header
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

    try {
      // Fetch the HTML content of the page
      val doc: Document = Jsoup.connect(url).userAgent(userAgent).get()

      // Find all the listing blocks using the provided class name
      // (Elements is a java.util.List, so convert it for Scala iteration)
      val listingBlocks: Seq[Element] =
        doc.select("div.BasePropertyCard_propertyCardWrap__J0xUj").asScala.toSeq

      // Loop through each listing block and extract information
      for (listingBlock <- listingBlocks) {
        // Extract the broker name
        val brokerName = textOf(listingBlock, "span.BrokerTitle_titleText__20u1P")

        // Extract the status (e.g., For Sale)
        val status = textOf(listingBlock, "div.message")

        // Extract the price
        val price = textOf(listingBlock, "div.card-price")

        // Extract other details like beds, baths, sqft, and lot size
        val beds = textOf(listingBlock, "li[data-testid=property-meta-beds]")
        val baths = textOf(listingBlock, "li[data-testid=property-meta-baths]")
        val sqft = textOf(listingBlock, "li[data-testid=property-meta-sqft]")
        val lotSize = textOf(listingBlock, "li[data-testid=property-meta-lot-size]")

        // Extract the address
        val address = textOf(listingBlock, "div.card-address")

        // Print the extracted information
        println(s"Broker: $brokerName")
        println(s"Status: $status")
        println(s"Price: $price")
        println(s"Beds: $beds")
        println(s"Baths: $baths")
        println(s"Sqft: $sqft")
        println(s"Lot Size: $lotSize")
        println(s"Address: $address")
        println("-" * 50) // Separator between listings
      }
    } catch {
      case e: Exception =>
        println(s"Failed to retrieve the page. Error: ${e.getMessage}")
    }
  }
}
As you work on more scrapers, the process will become second nature: connect, select elements, extract data. But it takes worked examples like this walkthrough to fully demystify what's happening behind the scenes.
Hopefully as a beginner you now feel equipped to start writing scrapers using Jsoup and CSS selectors!