CAPTCHAs are one of the biggest annoyances when scraping the web. Those squiggly letter and image puzzles are designed to halt bots in their tracks. But with the right approach, you can sneak past CAPTCHAs undetected.
In this article, we'll use Python libraries to automatically solve CAPTCHAs so you can focus on extracting the data you want.
The Problem with CAPTCHAs
Websites use CAPTCHAs as a way to distinguish humans from bots. CAPTCHAs act as a roadblock that can grind your scraping efforts to a frustrating halt.
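Common CAPTCHA implementations include distorted-text puzzles, image-selection challenges, and reCAPTCHA widgets like the one targeted in the example below.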
These challenges are effective at stopping most scrapers in their tracks. But there are still ways around them with Python.
Automatically Solving CAPTCHAs with 2Captcha
To bypass CAPTCHAs in our scraper, we can leverage API-based CAPTCHA solving services like 2Captcha.
2Captcha has a large network of human solvers that can quickly solve CAPTCHAs via its API. This allows us to integrate real-time CAPTCHA solving into our scripts.
Here's an example using 2Captcha with Python:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from twocaptcha import TwoCaptcha

driver = uc.Chrome()
driver.get("https://example.com")
# Get the CAPTCHA site key ('captchaElement' is a placeholder id; use the one on your target page)
sitekey = driver.find_element(By.ID, 'captchaElement').get_attribute('data-sitekey')
# Set up the 2Captcha client
api_key = '2CAPTCHA_API_KEY'
solver = TwoCaptcha(api_key)
# Solve the CAPTCHA (blocks until 2Captcha returns a result)
print("Solving captcha...")
response = solver.recaptcha(sitekey=sitekey, url=driver.current_url)
# Submit the solution; passing the token as an argument avoids building JavaScript by string concatenation
driver.execute_script("document.getElementById('g-recaptcha-response').innerHTML = arguments[0];", response['code'])
We use undetected-chromedriver to reduce the chance of bot detection while navigating to the target page.
2Captcha handles solving the CAPTCHA behind the scenes and returns the solution token in response['code']. We inject this token into the page's g-recaptcha-response field so the challenge is treated as solved.
This allows us to scrape uninterrupted without having to manually solve endless CAPTCHAs!
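Solving services can time out or return an error, so in practice the solve call is often wrapped in a retry. The following is a minimal sketch under that assumption. The helper name `solve_with_retry` and its parameters are hypothetical and not part of the 2Captcha library; it accepts any zero-argument callable that returns the solver's response and raises on failure.

```python
import time


def solve_with_retry(solve, attempts=3, delay=2.0, sleep=time.sleep):
    """Call `solve()` until it succeeds or `attempts` is exhausted.

    `solve` is any zero-argument callable that returns the solver's response
    and raises an exception on failure. Hypothetical helper, not a 2Captcha API.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return solve()
        except Exception as exc:  # broad on purpose; narrow to your client's exception types
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"CAPTCHA not solved after {attempts} attempts") from last_error
```

With the script above, the call would look like `response = solve_with_retry(lambda: solver.recaptcha(sitekey=sitekey, url=driver.current_url))`. The `sleep` parameter exists only so the waiting can be replaced in tests.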
Conclusion
By incorporating 2Captcha or similar services, you can easily bypass even the toughest CAPTCHAs when scraping.
Just be sure to follow a website's robots.txt directives and terms of service. Automating CAPTCHA solving can be controversial if done excessively on certain sites.
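The robots.txt check can be automated with Python's standard library. Here is a minimal sketch, assuming you already have the file's text; the helper name `is_allowed`, the sample rules, and the `my-scraper` user agent are illustrative placeholders, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

Against a live site, `RobotFileParser` can also download the file itself via `set_url()` followed by `read()`. Note that robots.txt does not cover a site's terms of service, which still need to be read separately.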
With the techniques covered here, you'll be prepared to scrape intelligently at scale and overcome one of the top bot detection methods on the web.
Rather than building and managing your own CAPTCHA-solving infrastructure, you can use a service like Proxies API that handles this complexity for you.
With Proxies API, you make a simple API request with the target URL. It handles the proxies, browser rendering, and CAPTCHA solving, and returns the rendered HTML. No need to orchestrate the numerous steps required for reliable CAPTCHA solving.
For example:
curl "http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://targetpage.com"
This takes care of all the headaches of automation. No proxies, browsers, or captcha solving services to manage.
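The same request can be made from Python. This is a rough sketch using only the standard library; the endpoint and the `key`, `render`, and `url` parameters come from the curl example, while the function names and the timeout are assumptions of this sketch. The response format is not documented here, so the body is simply returned as text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "http://api.proxiesapi.com/"  # endpoint taken from the curl example


def build_request_url(api_key, target_url, render=True):
    """Build the request URL, percent-encoding the target so its own query string survives."""
    params = {"key": api_key}
    if render:
        params["render"] = "true"
    params["url"] = target_url
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_rendered_html(api_key, target_url, timeout=60):
    """Fetch the target page through the API and return the response body as text."""
    with urlopen(build_request_url(api_key, target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Encoding the target URL matters once it carries its own query string: without it, an `&` inside the target would be read as the start of a new API parameter.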
Proxies API offers 1000 free API calls to get started. Check it out if you need to integrate robust captcha solving and proxy rotation in your projects.