When scraping or testing websites protected by Cloudflare, you may encounter redirect loops that prevent accessing the final HTML page. This occurs because Cloudflare checks for bots and blocks automated requests to protect sites from abuse. However, there are ways to properly configure HtmlUnit to bypass these protections.
The Cloudflare Challenge
Many sites use Cloudflare to protect against DDoS attacks, spam bots, and other threats. Cloudflare acts as a reverse proxy, sitting in front of the origin web server and applying rules to filter requests.
One of the techniques Cloudflare employs is checking for browser characteristics like cookies, headers, and JavaScript execution. Requests lacking these human-like qualities may be flagged as bots and blocked or endlessly redirected.
This causes problems for tools like HtmlUnit that request pages programmatically. Out of the box, HtmlUnit does not mimic a real browser closely enough to get past Cloudflare.
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("https://example.cloudflare.com");
// Endless redirect loops or access denied
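To make the failure mode concrete, here is a minimal sketch of how a redirect loop can be detected. It uses only the JDK and a simulated redirect table instead of real network calls; the `RedirectTracer` class, the `trace` method, and the `/check` path are hypothetical names chosen for illustration, not part of HtmlUnit or Cloudflare.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RedirectTracer {
    /**
     * Follows redirects in a simulated table (URL -> redirect target) and
     * returns the URLs visited in order. Throws if a URL repeats or if more
     * than maxHops redirects are followed.
     */
    static List<String> trace(String start, Map<String, String> redirects, int maxHops) {
        Set<String> seen = new LinkedHashSet<>();
        String current = start;
        while (current != null) {
            if (!seen.add(current)) {
                throw new IllegalStateException("Redirect loop at " + current);
            }
            if (seen.size() > maxHops + 1) {
                throw new IllegalStateException("Too many redirects");
            }
            current = redirects.get(current); // null means no further redirect
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical loop: the site bounces between two URLs forever.
        Map<String, String> loop = Map.of(
            "https://example.cloudflare.com", "https://example.cloudflare.com/check",
            "https://example.cloudflare.com/check", "https://example.cloudflare.com");
        try {
            trace("https://example.cloudflare.com", loop, 10);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A real client applies the same idea to the `Location` headers it receives: remember every URL already visited and stop as soon as one repeats.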
Configuring the WebClient
To properly imitate a browser, our WebClient needs several settings adjusted:
User Agent
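We must present a real desktop or mobile browser user agent string. In HtmlUnit the user agent belongs to the BrowserVersion passed to the WebClient constructor, not to the client options:
BrowserVersion browser = new BrowserVersion.BrowserVersionBuilder(BrowserVersion.CHROME).setUserAgent("Mozilla/5.0...").build();
WebClient webClient = new WebClient(browser);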
JavaScript
Enable JavaScript execution in the client:
webClient.getOptions().setJavaScriptEnabled(true);
Cookies
Allow cookies and maintain them across page requests:
CookieManager cookieManager = new CookieManager();
webClient.setCookieManager(cookieManager);
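The mechanism behind "maintaining cookies across requests" can be sketched with the JDK alone. Note the assumption: the example below uses `java.net.CookieManager` from the standard library, which is a different class from HtmlUnit's `CookieManager` despite the shared name. The `CookieJarDemo` class, the `roundTrip` method, and the `session=abc123` cookie are hypothetical and only illustrate the store-then-replay cycle.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

class CookieJarDemo {
    /**
     * Stores the Set-Cookie headers of a response for the given URI, then
     * returns the Cookie header values the jar would attach to the next
     * request for the same URI.
     */
    static List<String> roundTrip(URI uri, List<String> setCookieHeaders) {
        // JDK cookie jar (not HtmlUnit's CookieManager), accepting all cookies.
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        try {
            jar.put(uri, Map.of("Set-Cookie", setCookieHeaders));
            Map<String, List<String>> outgoing = jar.get(uri, Map.of());
            return outgoing.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create("https://example.cloudflare.com/");
        // Hypothetical cookie set by the first response.
        System.out.println(roundTrip(uri, List.of("session=abc123; Path=/")));
    }
}
```

A client without a cookie store arrives at every request looking like a first-time visitor, which is one way to end up being redirected again and again.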
Caching
Set up cache storage to mimic browser resource caching. The cache is held by the WebClient itself rather than by its options:
webClient.setCache(new Cache());
Getting Past the Redirects
With a tuned WebClient, we can now request pages behind Cloudflare:
WebClient webClient = new WebClient();
// Apply configurations listed above
HtmlPage page = webClient.getPage("https://example.cloudflare.com");
// Access granted, extract page content
However, on some sites, an initial redirect to an intermediate URL occurs before landing on the true destination page:
https://example.cloudflare.com
-> https://example.cloudflare.com/?__cf_chl_f_tk=HASH
-> https://www.example.com/home.html
We need to follow these hops programmatically:
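webClient.getOptions().setRedirectEnabled(true);
HtmlPage page1 = webClient.getPage("https://example.cloudflare.com");
// Give scripts on the intermediate page time to run and trigger the next hop
webClient.waitForBackgroundJavaScript(10_000);
HtmlPage page2 = (HtmlPage) page1.getEnclosingWindow().getTopWindow().getEnclosedPage();
// Extract content from final page2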
Now page2 contains the true protected page content past Cloudflare.
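Before trusting the result, it can help to check whether the client is still sitting on an intermediate hop. The following is a minimal sketch that relies only on the `__cf_chl_f_tk` query parameter shown in the hop list above; the `ChallengeUrls` class and `isChallengeHop` method are hypothetical names, and other parameter names may be used on other sites.

```java
import java.net.URI;

class ChallengeUrls {
    // Parameter name taken from the example redirect chain above.
    private static final String CHALLENGE_PARAM = "__cf_chl_f_tk";

    /** Returns true if the URL's query string contains the challenge parameter. */
    static boolean isChallengeHop(String url) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return false; // no query string at all
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            if (name.equals(CHALLENGE_PARAM)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isChallengeHop("https://example.cloudflare.com/?__cf_chl_f_tk=HASH"));
        System.out.println(isChallengeHop("https://www.example.com/home.html"));
    }
}
```

With HtmlUnit, the string to test would come from `page2.getUrl().toString()`. If the check returns true, the client has not yet reached the destination page.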
Dealing with Bot Detection
Some sites also run custom JavaScript designed to catch automation tools. For example:
// Site JS code
var start = new Date().getTime();
while (new Date().getTime() < start + 1000); // Delay
var took = new Date().getTime() - start; // Elapsed time in ms
if (took < 1000) {
// Flag as bot
}
We can install a customized JavaScript engine to work around traps like this. HtmlUnit's engine class is JavaScriptEngine, and a subclass of it can change how scripts are executed:
JavaScriptEngine engine = new JavaScriptEngine(webClient); // or a custom subclass
webClient.setJavaScriptEngine(engine);