Solving Cloudflare Redirect Loops with HtmlUnit in Java

When scraping or testing websites protected by Cloudflare, you may run into redirect loops that keep you from reaching the final HTML page. This happens because Cloudflare screens for bots and blocks or challenges automated requests to protect sites from abuse. Configuring HtmlUnit to behave more like a real browser can get past these checks in many cases.

The Cloudflare Challenge

Many sites use Cloudflare to protect against DDoS attacks, spam bots, and other threats. Cloudflare acts as a reverse proxy, sitting in front of the origin web server and applying rules to filter requests.

One of the techniques Cloudflare employs is checking for browser characteristics like cookies, headers, and JavaScript execution. Requests lacking these human-like qualities may be flagged as bots and blocked or endlessly redirected.

This causes problems for tools like HtmlUnit that request pages programmatically. Out of the box, HtmlUnit does not mimic a real browser closely enough to get past Cloudflare.

WebClient webClient = new WebClient();

HtmlPage page = webClient.getPage("https://example.cloudflare.com"); 
// Endless redirect loops or access denied
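To make the failure mode concrete, a redirect loop can be recognized from the sequence of URLs a client has visited. The following is a minimal sketch using only the Java standard library; `RedirectLoopDetector` and `firstRepeated` are hypothetical names, not part of HtmlUnit, and the list of visited URLs is assumed to have been collected by your own code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not part of HtmlUnit): finds the first URL visited twice. */
final class RedirectLoopDetector {

    private RedirectLoopDetector() {}

    /** Returns the first URL that repeats in the chain, or empty if there is no loop. */
    static Optional<String> firstRepeated(List<String> visitedUrls) {
        Set<String> seen = new HashSet<>();
        for (String url : visitedUrls) {
            // Set.add returns false when the URL was already recorded
            if (!seen.add(url)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative chain: the client is bounced back to where it started
        List<String> chain = List.of(
                "https://example.cloudflare.com",
                "https://example.cloudflare.com/?__cf_chl_f_tk=HASH",
                "https://example.cloudflare.com");
        System.out.println(firstRepeated(chain).orElse("no loop"));
    }
}
```

Stopping as soon as a URL repeats avoids hammering the site while the client is stuck in a loop.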

Configuring the WebClient

To properly imitate a browser, our WebClient needs tweaking. Here are key areas to address:

User Agent

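HtmlUnit takes its User-Agent header from the BrowserVersion passed to the WebClient constructor, so we start from a real desktop browser profile and, if needed, override the agent string:

BrowserVersion browser = new BrowserVersion.BrowserVersionBuilder(BrowserVersion.CHROME)
    .setUserAgent("Mozilla/5.0...")
    .build();
WebClient webClient = new WebClient(browser);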

JavaScript

Make sure JavaScript execution is enabled in the client (it is by default), since Cloudflare's checks depend on it:

webClient.getOptions().setJavaScriptEnabled(true);

Cookies

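Allow cookies and reuse the same WebClient for every request so that cookies set during the challenge are sent back on later requests:

CookieManager cookieManager = new CookieManager();
cookieManager.setCookiesEnabled(true);
webClient.setCookieManager(cookieManager);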

Caching

The WebClient keeps an in-memory cache of static resources. Size it to mimic browser resource caching:

// 100 entries is an arbitrary example value
webClient.getCache().setMaxSize(100);

Getting Past the Redirects

With a tuned WebClient, many Cloudflare-fronted sites become reachable. For example:

WebClient webClient = new WebClient();

// Apply configurations listed above

HtmlPage page = webClient.getPage("https://example.cloudflare.com");

// Access granted, extract page content

However, on some sites, an initial redirect to an intermediate URL occurs before landing on the true destination page:

https://example.cloudflare.com 
   -> https://example.cloudflare.com/?__cf_chl_f_tk=HASH
   -> https://www.example.com/home.html

We need to follow these hops programmatically:

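webClient.getOptions().setRedirectEnabled(true);

HtmlPage page1 = webClient.getPage("https://example.cloudflare.com");

// Give the challenge script time to run and navigate the window (timeout is an example value)
webClient.waitForBackgroundJavaScript(10_000);

// The window may now hold a different page than the one first returned
HtmlPage page2 = (HtmlPage) page1.getEnclosingWindow().getTopWindow().getEnclosedPage();

// Extract content from final page2

If the challenge passed, page2 now holds the protected page content behind Cloudflare.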
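Before extracting content, it can help to confirm that the client is no longer sitting on the intermediate URL. The sketch below uses only the Java standard library and assumes, based solely on the example chain above, that a query parameter starting with `__cf_chl_` marks a challenge hop. `ChallengeUrlCheck` and `looksLikeChallenge` are hypothetical names.

```java
import java.net.URI;

/** Hypothetical helper: guesses whether a URL is an intermediate challenge hop. */
final class ChallengeUrlCheck {

    // Assumption taken from the example chain above, not from Cloudflare documentation
    private static final String CHALLENGE_PARAM_PREFIX = "__cf_chl_";

    private ChallengeUrlCheck() {}

    static boolean looksLikeChallenge(String url) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return false; // no query string at all
        }
        // Check each name=value pair for the challenge prefix
        for (String pair : query.split("&")) {
            if (pair.startsWith(CHALLENGE_PARAM_PREFIX)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeChallenge("https://example.cloudflare.com/?__cf_chl_f_tk=HASH"));
        System.out.println(looksLikeChallenge("https://www.example.com/home.html"));
    }
}
```

With HtmlUnit, the string to test could come from `page2.getUrl().toString()`. If it still looks like a challenge URL, waiting longer or retrying is likely more useful than parsing the page.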

Dealing with Bot Detection

Some sites also run custom JavaScript designed to catch automation tools. For example:

// Site JS code

var start = new Date().getTime();

while (new Date().getTime() < start + 1000); // Busy-wait for one second

var took = new Date().getTime() - start;

if (took < 1000) {
  // The delay was skipped: flag as bot
}

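HtmlUnit really executes this loop, so the check normally passes as long as scripts are allowed to finish. When a site needs different script behavior, the JavaScript engine can be replaced, for example with a customized subclass of JavaScriptEngine:

// Install a JavaScript engine; substitute a custom JavaScriptEngine subclass here if needed
JavaScriptEngine engine = new JavaScriptEngine(webClient);
webClient.setJavaScriptEngine(engine);

// Keep loading the page even if a detection script throws
webClient.getOptions().setThrowExceptionOnScriptError(false);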
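To see why a timing trap like the one above separates real script execution from an environment that skips the delay, the same logic can be sketched in Java with an injectable clock. This is an illustration only: `TimingTrap`, `flaggedAsBot`, and the fake clock are hypothetical and unrelated to HtmlUnit's API.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical model of the site's timing check, using an injectable clock. */
final class TimingTrap {

    static final long EXPECTED_DELAY_MS = 1000;

    private TimingTrap() {}

    /** Runs the delay and flags the caller if less time passed than the delay should take. */
    static boolean flaggedAsBot(LongSupplier clockMillis, Runnable delay) {
        long start = clockMillis.getAsLong();
        delay.run();
        long took = clockMillis.getAsLong() - start;
        return took < EXPECTED_DELAY_MS;
    }

    public static void main(String[] args) {
        AtomicLong fakeClock = new AtomicLong(0);

        // An environment that really performs the delay advances the clock by one second
        boolean honest = flaggedAsBot(fakeClock::get, () -> fakeClock.addAndGet(1000));

        // An environment that skips the delay leaves the clock unchanged
        boolean skipped = flaggedAsBot(fakeClock::get, () -> { });

        System.out.println(honest + " " + skipped); // prints "false true"
    }
}
```

The takeaway is that shortcuts which skip or stub out page scripts are exactly what such traps detect, so letting the scripts run to completion is usually the safer choice.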

Key Takeaways

Cloudflare blocking can cause scraping and testing tools like HtmlUnit to be endlessly redirected or denied access.

Configuring the WebClient to emulate a browser (browser version and user agent, JavaScript, cookies, caching) gets past these checks on many sites.

Additional tweaks to follow redirects and override JS bot detection may be needed on some sites.

With the right setup, HtmlUnit can programmatically access sites shielded by Cloudflare.
