Web scraping lets you extract data from websites programmatically. Often you need to scrape multiple pages of a site to gather complete information. In this article, we'll see how to scrape multiple pages in Objective-C using NSURLSession for the requests and the hpple library (TFHpple, which wraps XPathQuery) for parsing.
Prerequisites
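To follow along, you'll need an Objective-C toolchain such as Xcode and the hpple library, which provides the TFHpple parser used in the code below.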
Import Libraries
We'll need the following imports:
#import <Foundation/Foundation.h>
#import "TFHpple.h" // hpple's XPath-based HTML parser (declares TFHpple)
Define Base URL
We'll scrape a blog whose listing pages follow a numbered URL pattern:
<https://copyblogger.com/blog/>
<https://copyblogger.com/blog/page/2/>
<https://copyblogger.com/blog/page/3/>
Let's define the base URL pattern:
NSString *baseURL = @"https://copyblogger.com/blog/page/%d/";
The %d placeholder is replaced with the page number when we build each URL.
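To make the pattern concrete, here is a minimal sketch that expands a format string into a list of page URLs. The helper name PageURLs is hypothetical (it is not part of Foundation or of this article's code), and it assumes the format string contains exactly one %d placeholder.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not from the article): builds the
// listing URLs for pages 1..numPages from a format string with one %d.
static NSArray<NSURL *> *PageURLs(NSString *format, int numPages) {
    NSMutableArray<NSURL *> *urls = [NSMutableArray array];
    for (int page = 1; page <= numPages; page++) {
        NSString *urlString = [NSString stringWithFormat:format, page];
        NSURL *url = [NSURL URLWithString:urlString];
        if (url != nil) {  // URLWithString: returns nil for malformed strings
            [urls addObject:url];
        }
    }
    return urls;
}
```

Calling PageURLs(@"https://copyblogger.com/blog/page/%d/", 3) yields three NSURL objects ending in /page/1/, /page/2/, and /page/3/. Building the list up front keeps URL construction separate from the networking code.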
Specify Number of Pages
Next, we'll specify how many pages to scrape. Let's scrape the first 5 pages:
int numPages = 5;
Loop Through Pages
We can now loop from 1 to numPages, building the URL for each page:
for (int page = 1; page <= numPages; page++) {
    // Construct URL
    NSString *urlString = [NSString stringWithFormat:baseURL, page];
    NSURL *url = [NSURL URLWithString:urlString];
    // Code to scrape each page
}
Send Request and Parse HTML
Inside the loop, we'll send a request and parse the HTML:
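NSURLSessionTask *task = [[NSURLSession sharedSession] dataTaskWithURL:url completionHandler:^(NSData * _Nullable data, NSURLResponse * _Nullable response, NSError * _Nullable error) {
    // Parse only when the request succeeded and returned a body
    if (error == nil && data != nil) {
        TFHpple *xpathParser = [TFHpple hppleWithHTMLData:data];
    }
}];
[task resume];
This gives us an XPath parser for the page. Data tasks run asynchronously, so the completion handler is called later, on a background queue, once the response arrives.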
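One detail worth checking before parsing: NSURLSession reports transport failures through the error parameter, but an HTTP error status such as 404 still arrives with a nil error. The sketch below shows one way to guard against that. ShouldParseResponse is a hypothetical helper name, and treating only 2xx statuses as parseable is an assumption you may want to relax.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not from the article): returns YES
// only when the response looks safe to hand to the HTML parser.
static BOOL ShouldParseResponse(NSData *data, NSURLResponse *response, NSError *error) {
    if (error != nil || data.length == 0) {
        return NO;  // transport failure, or no body to parse
    }
    if ([response isKindOfClass:[NSHTTPURLResponse class]]) {
        NSInteger status = ((NSHTTPURLResponse *)response).statusCode;
        return status >= 200 && status < 300;  // accept only 2xx responses
    }
    return NO;  // not an HTTP response, so there is no status to check
}
```

Inside the completion handler, the check `if (ShouldParseResponse(data, response, error))` could stand in for the plain `error == nil` test.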
Extract Data
Now within the completion handler we can use XPath queries to extract data from each page:
NSArray *articles = [xpathParser searchWithXPathQuery:@"//article"];
for (TFHppleElement *article in articles) {
    // Extract data from article. hpple elements are queried with
    // searchWithXPathQuery:, which returns an array, so take the first match.
    // The leading // matches anywhere inside the article's markup.
    NSString *title = [[[article searchWithXPathQuery:@"//h2[@class='entry-title']"] firstObject] content];
    NSString *link = [[[article searchWithXPathQuery:@"//a[@class='entry-title-link']"] firstObject] objectForKey:@"href"];
    NSString *author = [[[article searchWithXPathQuery:@"//div[@class='post-author']/a"] firstObject] content];
    // Print extracted data (a value is nil when its query found no match)
    NSLog(@"%@", title);
    NSLog(@"%@", link);
    NSLog(@"%@", author);
}
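Scraped values often need light cleanup before use: text can carry surrounding whitespace, and href attributes may be relative. Here is a minimal sketch of both steps using only Foundation. CleanText and AbsoluteLink are hypothetical helper names, and whether this site's links are relative is not something the article states.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption): trims surrounding whitespace and
// newlines from scraped text, mapping nil to an empty string.
static NSString *CleanText(NSString *raw) {
    if (raw == nil) {
        return @"";
    }
    return [raw stringByTrimmingCharactersInSet:
                    [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}

// Hypothetical helper (an assumption): resolves an href against the URL of
// the page it was found on. Absolute hrefs pass through unchanged.
static NSString *AbsoluteLink(NSString *href, NSURL *pageURL) {
    if (href.length == 0) {
        return nil;  // nothing to resolve
    }
    NSURL *resolved = [NSURL URLWithString:href relativeToURL:pageURL];
    return resolved.absoluteString;
}
```

For example, resolving @"/blog/post/" against https://copyblogger.com/blog/page/2/ gives https://copyblogger.com/blog/post/.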
Full Code
Our full code to scrape 5 pages is:
#import <Foundation/Foundation.h>
#import "TFHpple.h"

int main(void) {
    @autoreleasepool {
        NSString *baseURL = @"https://copyblogger.com/blog/page/%d/";
        int numPages = 5;
        // Data tasks are asynchronous, so track them and wait before exiting
        dispatch_group_t group = dispatch_group_create();
        for (int page = 1; page <= numPages; page++) {
            NSString *urlString = [NSString stringWithFormat:baseURL, page];
            NSURL *url = [NSURL URLWithString:urlString];
            dispatch_group_enter(group);
            NSURLSessionTask *task = [[NSURLSession sharedSession] dataTaskWithURL:url completionHandler:^(NSData * _Nullable data, NSURLResponse * _Nullable response, NSError * _Nullable error) {
                if (error == nil && data != nil) {
                    TFHpple *xpathParser = [TFHpple hppleWithHTMLData:data];
                    NSArray *articles = [xpathParser searchWithXPathQuery:@"//article"];
                    for (TFHppleElement *article in articles) {
                        // Element queries return arrays; take the first match
                        NSString *title = [[[article searchWithXPathQuery:@"//h2[@class='entry-title']"] firstObject] content];
                        NSString *link = [[[article searchWithXPathQuery:@"//a[@class='entry-title-link']"] firstObject] objectForKey:@"href"];
                        NSString *author = [[[article searchWithXPathQuery:@"//div[@class='post-author']/a"] firstObject] content];
                        NSMutableArray *categories = [NSMutableArray new];
                        for (TFHppleElement *cat in [article searchWithXPathQuery:@"//div[@class='entry-categories']/a"]) {
                            if ([cat content] != nil) {  // addObject: raises on nil
                                [categories addObject:[cat content]];
                            }
                        }
                        NSLog(@"%@", title);
                        NSLog(@"%@", link);
                        NSLog(@"%@", author);
                        NSLog(@"%@", categories);
                    }
                } else {
                    NSLog(@"Request for page %d failed: %@", page, error);
                }
                dispatch_group_leave(group);
            }];
            [task resume];
        }
        // Block until every completion handler has run
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    }
    return 0;
}
This lets us scrape and extract data from multiple pages. The loop starts one request per page, and because the requests are asynchronous, results can arrive in any order. To scrape more pages, raise numPages.
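If the pages need to be processed strictly in order, or you want a pause between requests, one option is to wait for each request to finish before starting the next. The sketch below does this with a semaphore. FetchSequentially and its delaySeconds parameter are hypothetical names, and blocking the calling thread like this is only reasonable in a command-line tool or on a background thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (an assumption, not from the article): fetches URLs
// one at a time, in order, sleeping after each request. Returns one entry
// per URL: the NSData body on success, or NSNull on failure.
static NSArray *FetchSequentially(NSArray<NSURL *> *urls, NSTimeInterval delaySeconds) {
    NSMutableArray *results = [NSMutableArray arrayWithCapacity:urls.count];
    for (NSURL *url in urls) {
        dispatch_semaphore_t done = dispatch_semaphore_create(0);
        __block NSData *body = nil;
        NSURLSessionDataTask *task = [[NSURLSession sharedSession]
            dataTaskWithURL:url
          completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
              if (error == nil) {
                  body = data;
              }
              dispatch_semaphore_signal(done);  // wake the waiting thread
          }];
        [task resume];
        // Block until this request finishes before starting the next one
        dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
        [results addObject:(body != nil ? (id)body : (id)[NSNull null])];
        [NSThread sleepForTimeInterval:delaySeconds];  // simple politeness delay
    }
    return results;
}
```

The returned array lines up index for index with the input URLs, so page order is preserved at the cost of a longer total run time.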
Summary
Web scraping enables collecting large datasets programmatically. With the techniques here, you can scrape and extract information from multiple pages of a website in Objective-C.
While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With Proxies API handling the requests and an HTML parsing library such as hpple handling extraction, you can scrape data at scale without getting blocked.