In this article, we will learn how to scrape property listings from Booking.com using PHP. We will use common PHP libraries to fetch the HTML content and then parse and extract key information like property name, location, ratings, etc.
Prerequisites
To follow along, you will need PHP and Composer installed.
Installing Dependencies
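We will use two PHP packages, Guzzle (guzzlehttp/guzzle) to send HTTP requests and DomCrawler (symfony/dom-crawler) to parse the HTML, plus symfony/css-selector, which DomCrawler needs for the CSS selectors passed to filter().
Install them using Composer:
composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector
This will download the packages into the vendor directory.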
Including Dependencies
At the top of your PHP script, include the Composer autoloader and the packages:
require __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
The autoloader will load the classes when needed.
Defining the Target URL
We will scrape listings from this URL on Booking.com:
$url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';
You can modify the parameters as needed.
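Rather than editing the query string by hand, the parameters can be assembled with PHP's built-in http_build_query(). The following is a minimal sketch; buildSearchUrl() is a hypothetical helper name, and the parameter names are simply the ones that appear in the URL above.

```php
<?php
// Hypothetical helper: builds the search URL from named parameters.
function buildSearchUrl(string $destination, string $checkin, string $checkout, int $adults): string
{
    // http_build_query() URL-encodes each value (spaces become "+").
    $query = http_build_query([
        'ss' => $destination,
        'checkin' => $checkin,
        'checkout' => $checkout,
        'group_adults' => $adults,
    ], '', '&');

    return 'https://www.booking.com/searchresults.en-gb.html?' . $query;
}

$url = buildSearchUrl('New York', '2023-03-01', '2023-03-05', 2);
echo $url, "\n";
```

With these arguments the helper reproduces the URL used in this article, so changing the city or dates only means changing the function arguments.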
Setting User Agent
We need to set a browser-like User Agent string, since Guzzle otherwise identifies itself with its own default one:
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';
Fetching the HTML Page
Use Guzzle to send a GET request and get the response:
$client = new Client(['headers' => ['User-Agent' => $userAgent]]);
$response = $client->request('GET', $url);
$html = (string) $response->getBody();
We configure Guzzle with the User Agent header, fetch the page, and cast the response body (a stream object) to a string so it can be passed to the parser.
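Requests can fail (timeouts, 4xx or 5xx responses), and by default Guzzle reports these as exceptions. The sketch below shows one way to handle that, assuming Guzzle 7; fetchHtml() is a hypothetical helper, the 10-second timeout is an arbitrary choice, and MockHandler is used only so the example runs without touching the network.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Hypothetical helper: returns the page HTML, or null if the request failed.
function fetchHtml(Client $client, string $url): ?string
{
    try {
        $response = $client->request('GET', $url, ['timeout' => 10]);
        return (string) $response->getBody();
    } catch (GuzzleException $e) {
        // Covers connection errors as well as 4xx/5xx responses.
        error_log('Request failed: ' . $e->getMessage());
        return null;
    }
}

// Queue two canned responses in place of real network calls.
$mock = new MockHandler([
    new Response(200, [], '<html><body>ok</body></html>'),
    new Response(403),
]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

$first = fetchHtml($client, 'https://example.com/');  // the 200 response body
$second = fetchHtml($client, 'https://example.com/'); // null, because 403 throws
```

In the real script you would drop the mock handler and pass the client configured with the User Agent header instead.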
Parsing the HTML
Use DomCrawler to parse the HTML:
$crawler = new Crawler($html);
This creates a Crawler instance with the document structure.
Extracting Property Cards
The property cards have a data-testid="property-card" attribute, so we can select them with a CSS attribute selector:
$cards = $crawler->filter('div[data-testid="property-card"]');
This extracts all divs with that attribute into a Crawler collection.
Looping Through Cards
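Loop through the cards with each(), which passes every card to the callback as its own Crawler. A plain foreach would yield raw DOM elements, which have no filter() method.
$cards->each(function (Crawler $card) {
    // Extract data from $card
});
Inside the callback we can extract information from each card.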
Extracting Property Name
The title is in an h3 element:
$title = $card->filter('h3')->text();
Get the text content of the h3 with text().
Extracting Location
The location is in a span with the data-testid="address" attribute:
$location = $card->filter('span[data-testid="address"]')->text();
Filter by the data-testid attribute.
Extracting Rating
Get the aria-label attribute of the rating div:
$rating = $card->filter('div.e4755bbd60')->attr('aria-label');
Filter by the CSS class name. Class names like this look auto-generated, so they may change when the site updates its markup.
Extracting Review Count
Get the text of the review count element:
$reviewCount = $card->filter('div.abf093bdfe')->text();
Again filter by class name.
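The review count comes back as text rather than a number. As a rough sketch, assuming the text looks something like "1,234 reviews" (the exact format is an assumption, not something confirmed here), it can be converted to an integer with a small hypothetical helper:

```php
<?php
// Hypothetical helper: pulls the first number out of text like "1,234 reviews".
function parseReviewCount(string $text): ?int
{
    // Match the first run of digits, allowing thousands separators inside it.
    if (preg_match('/\d+(?:[,.\s]\d{3})*/', $text, $m) !== 1) {
        return null; // no digits found
    }

    // Strip the separators and convert what is left to an integer.
    return (int) preg_replace('/\D/', '', $m[0]);
}

var_dump(parseReviewCount('1,234 reviews'));  // int(1234)
var_dump(parseReviewCount('87 reviews'));     // int(87)
var_dump(parseReviewCount('No reviews yet')); // NULL
```

If the element also contains other numbers before the count, the pattern would need to be tightened to match the real text.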
Extracting Description
Get the description text:
$description = $card->filter('div.d7449d770c')->text();
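Note that text() throws an exception when the selector matches nothing, so a single card without, say, a rating would stop the whole script. One way to guard against that is to check count() first. The sketch below uses a hypothetical textOrNull() helper and made-up sample markup, not real Booking.com HTML.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper: trimmed text of the first match, or null if none.
function textOrNull(Crawler $node, string $selector): ?string
{
    $match = $node->filter($selector);
    return $match->count() > 0 ? trim($match->text()) : null;
}

// Made-up sample markup; the second card has no address element.
$html = <<<'HTML'
<html><body>
<div data-testid="property-card"><h3>Hotel A</h3><span data-testid="address">Manhattan, New York</span></div>
<div data-testid="property-card"><h3>Hotel B</h3></div>
</body></html>
HTML;

$crawler = new Crawler($html);

// each() returns an array of whatever the callback returns for each card.
$rows = $crawler->filter('div[data-testid="property-card"]')->each(
    function (Crawler $card): array {
        return [
            'name' => textOrNull($card, 'h3'),
            'location' => textOrNull($card, 'span[data-testid="address"]'),
        ];
    }
);

print_r($rows);
```

The same check works for attr(): test count() on the filtered node before reading the attribute.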
Printing the Data
Print out the extracted information:
echo "Name: $title\n";
echo "Location: $location\n";
echo "Rating: $rating\n";
echo "Review Count: $reviewCount\n";
echo "Description: $description\n\n";
This prints the key details for each property listing card.
You can also store the data in an array instead of printing.
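As an illustration of storing the data, the sketch below appends one associative array per listing and then encodes the result as JSON. The values in $scraped are made up and stand in for what the loop would extract from each card.

```php
<?php
$listings = [];

// Made-up values standing in for the data extracted from each card.
$scraped = [
    ['Hotel A', 'Manhattan, New York', 'Scored 8.5', '1,234 reviews', 'Near Times Square'],
    ['Hotel B', 'Brooklyn, New York', 'Scored 7.9', '321 reviews', 'Quiet street'],
];

foreach ($scraped as [$title, $location, $rating, $reviewCount, $description]) {
    // Append one record per listing instead of echoing it.
    $listings[] = [
        'name' => $title,
        'location' => $location,
        'rating' => $rating,
        'review_count' => $reviewCount,
        'description' => $description,
    ];
}

// Encode the collected records, e.g. to save them to a file later.
$json = json_encode($listings, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
echo $json, "\n";
```

From here the JSON string can be written to disk with file_put_contents(), or the array can be passed on to whatever processes the results.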
Full Script
Here is the full scraping script:
<?php
require __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';
$userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';
// Send the request with a browser-like User-Agent header
$client = new Client(['headers' => ['User-Agent' => $userAgent]]);
$response = $client->request('GET', $url);
// getBody() returns a stream, so cast it to a string for the Crawler
$html = (string) $response->getBody();
$crawler = new Crawler($html);
$cards = $crawler->filter('div[data-testid="property-card"]');
// each() passes every card to the callback as its own Crawler
$cards->each(function (Crawler $card) {
    $title = $card->filter('h3')->text();
    $location = $card->filter('span[data-testid="address"]')->text();
    $rating = $card->filter('div.e4755bbd60')->attr('aria-label');
    $reviewCount = $card->filter('div.abf093bdfe')->text();
    $description = $card->filter('div.d7449d770c')->text();
    echo "Name: $title\n";
    echo "Location: $location\n";
    echo "Rating: $rating\n";
    echo "Review Count: $reviewCount\n";
    echo "Description: $description\n\n";
});
This script scrapes and prints key details from Booking.com property listings using PHP and common libraries like Guzzle and DomCrawler. The same technique can be applied to other sites that serve their listings as HTML, with the selectors adjusted to each site's markup.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with PHP libraries like Guzzle and DomCrawler, you can scrape data at scale without getting blocked.