Introduction
Scraping business listings from Yelp can provide useful data about local businesses, their reviews, price ranges, locations, and more. This information can power business intelligence tools, market analysis, lead generation, and other applications.
In this comprehensive guide, we'll walk through a full Objective-C scraper to extract key details on Chinese restaurant listings in San Francisco from the Yelp website.
The target page is the Yelp search results for "chinese" in San Francisco, CA.
Here's the exact data we'll pull from each listing: business name, rating, number of reviews, price range, and location.
We'll route requests through ProxiesAPI to get past Yelp's anti-scraper protections. As we'll see, premium proxies that rotate IP addresses are essential for scraping sites like Yelp without quickly getting blocked.
Install Dependencies
Let's quickly cover installing the dependencies we'll need:
TFHpple
This Objective-C library parses HTML/XML documents and allows XPath queries to extract data.
pod 'TFHpple'
The scraper also relies on Foundation and other standard Objective-C libraries.
With the imports and dependencies handled, let's get to the data extraction!
Encode the Target URL
We first construct the target URL pointing to Yelp listings in San Francisco:
NSString *urlString = @"https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
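Next we percent-encode the entire string so it can be passed as a single query parameter value. URLQueryAllowedCharacterSet would leave characters such as & and = unescaped, which would split the Yelp URL into separate ProxiesAPI parameters, so we allow only alphanumeric characters:
NSString *encodedURLString = [urlString stringByAddingPercentEncodingWithAllowedCharacters:[NSCharacterSet alphanumericCharacterSet]];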
This encoded URL will be embedded in the request to ProxiesAPI.
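Because the Yelp URL travels as the value of a single query parameter, reserved characters inside it (such as & and =) need to be escaped as well. Here is a minimal sketch of a strict encoder. The helper name EncodeForQueryValue is hypothetical and not part of the original scraper:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): percent-encodes every
// character except ASCII letters and digits, so the result can be embedded
// as one query parameter value without being split on & or =.
static NSString *EncodeForQueryValue(NSString *raw) {
    NSCharacterSet *allowed = [NSCharacterSet alphanumericCharacterSet];
    return [raw stringByAddingPercentEncodingWithAllowedCharacters:allowed];
}
```

Decoding the result once with stringByRemovingPercentEncoding should give back the original Yelp URL, which is what the proxy service is expected to do before fetching the page.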
Use Premium Proxies
To avoid immediately getting blocked by Yelp's bot detection, we'll use the premium proxy API from ProxiesAPI:
NSString *apiURLString = [NSString stringWithFormat:@"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=%@", encodedURLString];
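Key things to note: premium=true asks for the premium proxies, auth_key carries your ProxiesAPI key, and url carries the encoded Yelp URL.
Each request goes through a different proxy IP, so to Yelp the traffic looks like it comes from ordinary users.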
Set HTTP Headers
We next construct a dictionary of request headers that mimic a real Chrome browser:
NSDictionary *headers = @{
@"User-Agent": @"Mozilla/5.0...",
@"Accept-Language": @"en-US,en;q=0.5",
@"Accept-Encoding": @"gzip, deflate, br",
@"Referer": @"https://www.google.com/"
};
No conversion step is needed: NSMutableURLRequest accepts this dictionary directly through its allHTTPHeaderFields property.
Mimicking a real browser via headers decreases the chances of getting flagged as a bot.
Construct NSURLRequest
We assemble all the pieces into an NSMutableURLRequest:
NSURLComponents *components = [NSURLComponents componentsWithString:apiURLString];
NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:components.URL];
// Attach the browser-like headers defined above
request.allHTTPHeaderFields = headers;
request.HTTPMethod = @"GET";
This request points to the ProxiesAPI URL, includes our browser-like headers, and performs a GET.
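As a sketch of the request-building step in isolation, the function below copies a header dictionary onto a GET request using setValue:forHTTPHeaderField:. The name MakeGETRequest is a hypothetical helper introduced here for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): builds a GET request
// for `urlString` and copies every entry of `headers` onto it.
static NSMutableURLRequest *MakeGETRequest(NSString *urlString, NSDictionary *headers) {
    NSURL *url = [NSURL URLWithString:urlString];
    if (url == nil) {
        return nil; // the string could not be parsed as a URL
    }
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    request.HTTPMethod = @"GET";
    [headers enumerateKeysAndObjectsUsingBlock:^(NSString *name, NSString *value, BOOL *stop) {
        [request setValue:value forHTTPHeaderField:name];
    }];
    return request;
}
```

Reading a header back with valueForHTTPHeaderField: is a quick way to confirm that it was actually attached before the request is sent.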
Make the HTTP Request
With our request prepped, we kick it off:
NSURLSession *session = [NSURLSession sharedSession];
NSURLSessionDataTask *task = [session dataTaskWithRequest:request
completionHandler:...];
[task resume];
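The completion block handles the asynchronous response: it logs any transport error, checks for a 200 status code, and passes the returned HTML to TFHpple for parsing.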
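The checks performed in the completion block can be pulled into a small function, which makes them easy to exercise without a network call. This is only a sketch, and HTMLFromResponse is a hypothetical name, not something the original scraper defines:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): returns the response
// body as a UTF-8 string when the request succeeded with HTTP 200, else nil.
static NSString *HTMLFromResponse(NSData *data, NSURLResponse *response, NSError *error) {
    if (error != nil || data == nil) {
        return nil; // transport-level failure (timeout, DNS, and so on)
    }
    if (![response isKindOfClass:[NSHTTPURLResponse class]]) {
        return nil; // not an HTTP response, so there is no status code to check
    }
    NSHTTPURLResponse *http = (NSHTTPURLResponse *)response;
    if (http.statusCode != 200) {
        return nil; // blocked, rate limited, or otherwise unsuccessful
    }
    return [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
}
```

Inside the completion handler, a nil result would mean "log and stop", while a non-nil result is the HTML to parse.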
Now the fun begins - using XPath to extract fields!
Extract Business Listings
With the HTML loaded into a TFHpple parser, we can query it with XPath.
Inspecting the page
When we inspect the page, we can see that each listing sits in a div with the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x.
First we grab all the listings containers:
NSArray *listings = [parser searchWithXPathQuery:@"//div[contains(@class,'arrange-unit__09f24__rqHTg')]"];
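Key things to note: contains(@class, ...) matches a substring of the class attribute, so the query still works when the div carries additional classes. The class names look auto-generated, so expect them to change whenever Yelp updates its markup.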
Then we loop through each listing:
for (TFHppleElement *listing in listings) {
// Extract data for this listing
}
Inside the loop, we use class-name lookups and XPath queries to extract each data field.
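The contains(@class, ...) pattern used for the listing containers can be tried offline on a small, well-formed snippet. The sketch below uses Foundation's NSXMLDocument (available on macOS only) rather than TFHpple, and CountDivsWithClass is a hypothetical helper name. It is meant to illustrate the XPath expression, not to replace the parser used in this guide:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): counts <div> elements
// whose class attribute contains `classFragment`. NSXMLDocument is strict,
// so this works only on well-formed markup, unlike TFHpple.
static NSUInteger CountDivsWithClass(NSString *xml, NSString *classFragment) {
    NSError *error = nil;
    NSXMLDocument *doc = [[NSXMLDocument alloc] initWithXMLString:xml options:0 error:&error];
    if (doc == nil) {
        return 0; // the markup could not be parsed
    }
    NSString *query = [NSString stringWithFormat:@"//div[contains(@class,'%@')]", classFragment];
    NSArray *nodes = [doc nodesForXPath:query error:&error];
    return nodes.count; // nil (query failure) yields 0
}
```

A div with two classes and a div with only the target class both match, while a div with an unrelated class does not.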
Extract Business Name
For the business name, we grab the child element with the class css-19v1rkv:
TFHppleElement *businessNameElement = [listing firstChildWithClassName:@"css-19v1rkv"];
NSString *businessName = [businessNameElement text];
This neatly returns just the business name string!
Extract Rating, Reviews, Price, Location
The other fields require more nuanced XPath queries:
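// Rating
TFHppleElement *ratingElement = [listing firstChildWithClassName:@"css-gutk1c"];
NSString *rating = [ratingElement text];
// Price Range
TFHppleElement *priceRangeElement = [listing firstChildWithClassName:@"priceRange__09f24__mmOuH"];
NSString *priceRange = [priceRangeElement text];
// Number of Reviews and Location share the same span class
NSArray *spanElements = [listing searchWithXPathQuery:@"//span[contains(@class,'css-chan6m')]"];
NSString *numReviews = @"N/A";
NSString *location = @"N/A";
if ([spanElements count] >= 2) {
numReviews = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
location = [[spanElements[1] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
}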
We have to handle cases where fields are missing or contain unpredictable whitespace in the HTML.
But ultimately we extract and print all the pieces we need:
NSLog(@"Business Name: %@", businessName);
NSLog(@"Rating: %@", rating);
// etc...
The full code handles edge cases and surfaces everything in an easy-to-process structure.
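One way to get an easy-to-process structure is to normalize each field and collect the results in a dictionary per listing. The sketch below is an assumption about how that could look. CleanField and MakeListingRecord are hypothetical helpers, and the dictionary keys are chosen here for illustration only:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption for illustration): trims surrounding
// whitespace and substitutes "N/A" when the value is missing or empty.
static NSString *CleanField(NSString *raw) {
    NSString *trimmed = [raw stringByTrimmingCharactersInSet:
                         [NSCharacterSet whitespaceAndNewlineCharacterSet]];
    return trimmed.length > 0 ? trimmed : @"N/A"; // nil also lands here
}

// Hypothetical helper: bundles the five extracted fields into one record.
static NSDictionary *MakeListingRecord(NSString *name, NSString *rating,
                                       NSString *numReviews, NSString *priceRange,
                                       NSString *location) {
    return @{
        @"name": CleanField(name),
        @"rating": CleanField(rating),
        @"numReviews": CleanField(numReviews),
        @"priceRange": CleanField(priceRange),
        @"location": CleanField(location)
    };
}
```

Because every value ends up as a non-nil string, an array of these records can be passed to NSJSONSerialization to write the results out as JSON.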
Key Takeaways
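Scraping Yelp listings relies heavily on rotating premium proxies to avoid IP blocks, browser-like request headers, XPath and class-name selectors that match the current page markup, and defensive handling of missing fields.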
With these key ingredients, you can build robust Yelp scrapers in Objective-C and other languages.
Next Steps
To expand on this project:
Hopefully this gives you a firm handle on tackling third-party sites like Yelp. Happy scraping!
Full Objective-C Code
Here again is the full scraper code:
#import <Foundation/Foundation.h>
#import "TFHpple.h"
int main(int argc, const char * argv[]) {
@autoreleasepool {
NSString *urlString = @"https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA";
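// Percent-encode everything except letters and digits so that & and = inside
// the Yelp URL are not read as separate ProxiesAPI parameters
NSString *encodedURLString = [urlString stringByAddingPercentEncodingWithAllowedCharacters:[NSCharacterSet alphanumericCharacterSet]];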
// API URL with the encoded URL
NSString *apiURLString = [NSString stringWithFormat:@"http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=%@", encodedURLString];
// Define user-agent header and other headers
NSDictionary *headers = @{
@"User-Agent": @"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
@"Accept-Language": @"en-US,en;q=0.5",
@"Accept-Encoding": @"gzip, deflate, br",
@"Referer": @"https://www.google.com/"
};
// Create an NSURLComponents object to build the URL
NSURLComponents *components = [NSURLComponents componentsWithString:apiURLString];
// Create a mutable request with the URL and attach the headers directly
NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:components.URL];
request.allHTTPHeaderFields = headers;
request.HTTPMethod = @"GET";
// Send an HTTP GET request
NSURLSession *session = [NSURLSession sharedSession];
NSURLSessionDataTask *task = [session dataTaskWithRequest:request completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
if (error) {
NSLog(@"Failed to retrieve data. Error: %@", error.localizedDescription);
} else {
NSHTTPURLResponse *httpResponse = (NSHTTPURLResponse *)response;
if (httpResponse.statusCode == 200) {
NSString *htmlString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
// Save the HTML to a file (optional)
[htmlString writeToFile:@"yelp_html.html" atomically:YES encoding:NSUTF8StringEncoding error:nil];
// Parse the HTML content using TFHpple
TFHpple *parser = [TFHpple hppleWithHTMLData:data];
// Find all the listings
NSArray *listings = [parser searchWithXPathQuery:@"//div[contains(@class,'arrange-unit__09f24__rqHTg') and contains(@class,'arrange-unit-fill__09f24__CUubG') and contains(@class,'css-1qn0b6x')]"];
NSLog(@"Number of Listings: %ld", (long)[listings count]);
// Loop through each listing and extract information
for (TFHppleElement *listing in listings) {
// Extract information here
// Extract business name
TFHppleElement *businessNameElement = [listing firstChildWithClassName:@"css-19v1rkv"];
NSString *businessName = [businessNameElement text];
// Extract rating
TFHppleElement *ratingElement = [listing firstChildWithClassName:@"css-gutk1c"];
NSString *rating = [ratingElement text];
// Extract price range
TFHppleElement *priceRangeElement = [listing firstChildWithClassName:@"priceRange__09f24__mmOuH"];
NSString *priceRange = [priceRangeElement text];
// Extract number of reviews and location
NSArray *spanElements = [listing searchWithXPathQuery:@"//span[contains(@class,'css-chan6m')]"];
NSString *numReviews = @"N/A";
NSString *location = @"N/A";
if ([spanElements count] >= 2) {
numReviews = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
location = [[spanElements[1] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
} else if ([spanElements count] == 1) {
NSString *text = [[spanElements[0] text] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
if ([text integerValue] > 0) {
numReviews = text;
} else {
location = text;
}
}
// Print the extracted information
NSLog(@"Business Name: %@", businessName);
NSLog(@"Rating: %@", rating);
NSLog(@"Number of Reviews: %@", numReviews);
NSLog(@"Price Range: %@", priceRange);
NSLog(@"Location: %@", location);
NSLog(@"===========================");
}
} else {
NSLog(@"Failed to retrieve data. Status Code: %ld", (long)httpResponse.statusCode);
}
}
}];
[task resume];
[[NSRunLoop currentRunLoop] run];
}
return 0;
}
To try it out, install TFHpple and insert your own ProxiesAPI auth key. Let me know if any part needs more explanation.