When trying to download images from websites protected by Cloudflare, you may run into issues using Python's requests library. Cloudflare sits in front of a site as a reverse proxy, shielding it from DDoS attacks and abusive bots and hiding the origin server's IP address. Its bot checks can also block requests from scraping and downloading tools.
Fortunately, there are some workarounds to download images from Cloudflare-protected sites with Python requests. In this guide, I'll walk through the key steps and code to handle Cloudflare protection and successfully download images.
Understanding Cloudflare Protection
When a site is protected by Cloudflare, all traffic gets routed through Cloudflare's network first before reaching the origin server. This hides the real IP address of the site.
Cloudflare also runs browser integrity checks to detect bots and abusive traffic, using JavaScript challenges and a cookie check. A client that does not pass these checks receives a challenge or block page instead of the content it asked for.
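To make the blocking behaviour concrete, here is a minimal sketch of a heuristic for spotting a challenge response. The function name looks_like_cloudflare_challenge is hypothetical, and the signals it checks (a 403 or 503 status together with a "Server: cloudflare" header, or a "cf-mitigated: challenge" header) are assumptions about how challenge pages commonly present themselves, not a guaranteed contract.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristic only: guess whether a response is a Cloudflare challenge or block page."""
    # Header names are case-insensitive, so normalise them before looking anything up.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Assumption: challenged responses carry this header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Assumption: a 403/503 served by Cloudflare is probably a block or challenge page.
    served_by_cloudflare = "cloudflare" in normalized.get("server", "")
    return served_by_cloudflare and status_code in (403, 503)
```

With Requests, this could be called as looks_like_cloudflare_challenge(response.status_code, response.headers) before treating the body as image data.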
Bypassing Cloudflare with a Browser Session
The easiest method is to reuse a browser session that has already passed the Cloudflare checks. We can extract its verified session cookies and headers and include them in our Requests calls.
Here is an example code snippet:
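import requests

session = requests.Session()

# Cookies and headers copied from a browser session that already passed the checks.
# cookie_dict_from_browser and headers_dict_from_browser are plain dicts you fill in.
session.cookies.update(cookie_dict_from_browser)
session.headers.update(headers_dict_from_browser)

response = session.get("https://example.com/image.jpg")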
This leverages the prior verification done by the browser to bypass Cloudflare. Just make sure to extract the cookies and headers from the same browser session.
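One way to obtain such a cookie dictionary is to copy the Cookie request header from the browser's developer tools and parse it. The sketch below uses only the standard library; the helper name cookie_header_to_dict and the sample cookie values are made up for illustration, and real cookie names and values will differ from site to site.

```python
from http.cookies import SimpleCookie
from typing import Dict


def cookie_header_to_dict(cookie_header: str) -> Dict[str, str]:
    """Turn a raw 'name=value; name2=value2' Cookie header into a plain dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical header value copied from the browser; not real credentials.
browser_cookie_header = "cf_clearance=abc123; session_id=xyz789"
cookie_dict_from_browser = cookie_header_to_dict(browser_cookie_header)
```

The resulting dictionary can then be passed to session.cookies.update().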
Using a Proxy Service
Another option is to route your requests through a proxy service. The proxy funnels your traffic through residential IPs, which are less likely to be flagged by Cloudflare's bot detection than datacenter addresses.
Some popular proxy services are Luminati (now Bright Data) and Oxylabs. Here's an example using proxies with Requests:
import requests
proxies = {
"http": "http://lum-customer-hl_3a***:[email protected]:22225",
"https": "http://lum-customer-hl_3a***:[email protected]:22225"
}
response = requests.get("https://example.com/image.jpg", proxies=proxies)
This can be effective, but proxy services can get expensive for large volumes of traffic.
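Proxy credentials often contain characters such as "@" or ":" that break a proxy URL unless they are percent-encoded. Here is a small sketch of a helper that builds the proxies dictionary safely. The function name build_proxies and the host, port, and credentials are placeholders, not values from any real provider.

```python
from typing import Dict
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> Dict[str, str]:
    """Build a Requests-style proxies dict with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    # The same HTTP proxy endpoint is used for both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values for illustration only.
proxies = build_proxies("example-user", "p@ss:word", "proxy.example.com", 8080)
```

The returned dictionary is passed to Requests in the usual way, for example requests.get(url, proxies=proxies).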
Modifying Request Headers
We can also mimic a real browser's headers to try and bypass Cloudflare:
import requests

# Headers copied from a desktop Chrome browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com/image.jpg", headers=headers)
This spoofs a Chrome User-Agent string. Headers alone are often not enough, though: Cloudflare's JavaScript challenges can still detect that no real browser is running.
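Because a blocked request can come back as an HTML challenge page rather than an error, it helps to confirm that the body really is an image before saving it. The following is a rough sketch under stated assumptions: looks_like_image is a hypothetical helper, and it only recognises JPEG, PNG, GIF, and WebP by their leading bytes, so other formats would need extra signatures.

```python
def looks_like_image(content_type: str, body: bytes) -> bool:
    """Check the Content-Type header and the leading bytes of a downloaded body."""
    if not content_type.lower().startswith("image/"):
        return False

    signatures = (
        b"\xff\xd8\xff",       # JPEG
        b"\x89PNG\r\n\x1a\n",  # PNG
        b"GIF87a",             # GIF (1987 variant)
        b"GIF89a",             # GIF (1989 variant)
    )
    if body.startswith(signatures):
        return True

    # WebP files are RIFF containers: "RIFF", a 4-byte size, then "WEBP".
    return body[:4] == b"RIFF" and body[8:12] == b"WEBP"
```

With Requests, a call might look like looks_like_image(response.headers.get("Content-Type", ""), response.content), writing response.content to a file opened in "wb" mode only when it returns True.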
Using a Headless Browser
Lastly, we can combine Requests with a headless browser driven by Selenium. The browser handles JavaScript rendering and verification challenges transparently:
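import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Load the page so the browser can run any JavaScript checks
driver.get("https://example.com")
cookies = driver.get_cookies()  # list of dicts, one per cookie
driver.quit()

session = requests.Session()
# Copy each browser cookie into the Requests session
for cookie in cookies:
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie["domain"])

response = session.get("https://example.com/image.jpg")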
The headless Chrome browser will execute any JavaScript checks before we extract the cookies to apply to a Requests session.
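Clearance cookies are commonly reported to be tied to the User-Agent that earned them, so it is probably safest to carry the browser's User-Agent over along with its cookies. The sketch below assumes that behaviour. The helper session_settings_from_browser is hypothetical, and the sample data only mimics the shape of what driver.get_cookies() and driver.execute_script("return navigator.userAgent") return.

```python
from typing import Dict, Iterable, Mapping, Tuple


def session_settings_from_browser(
    selenium_cookies: Iterable[Mapping[str, str]], user_agent: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert Selenium-style cookie dicts and a User-Agent into Requests-ready dicts."""
    cookies = {cookie["name"]: cookie["value"] for cookie in selenium_cookies}
    headers = {"User-Agent": user_agent}
    return cookies, headers


# Made-up sample data shaped like Selenium's output.
sample_cookies = [{"name": "cf_clearance", "value": "abc123", "domain": ".example.com"}]
sample_user_agent = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
cookies, headers = session_settings_from_browser(sample_cookies, sample_user_agent)
```

Both dictionaries can then be applied with session.cookies.update(cookies) and session.headers.update(headers).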
Key Takeaways
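Here are the key points to keep in mind when downloading images behind Cloudflare protection with Python Requests: reuse the cookies and headers of a browser session that has already passed the checks; route requests through a residential proxy service if the budget allows; send browser-like headers, knowing they will not get past JavaScript challenges on their own; and use a headless browser to clear those challenges, then hand its cookies to a Requests session.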
Overcoming Cloudflare protection requires understanding and working around the bot detection and mitigation systems. Combining the above techniques can help successfully download images protected by Cloudflare.