Web scraping can often lead to getting blocked from websites due to too many requests coming from a single IP address. One technique to avoid this is using proxies to rotate your IP address with each request.
This article will cover how to find free proxies and successfully rotate through them in your Python code to avoid getting blocked while web scraping.
What is a Proxy Server?
A proxy server acts as an intermediary between your computer and the website you are accessing. When you make a request through a proxy server, the website will see the proxy's IP address rather than your own.
This allows you to hide your real IP address and distribute requests across multiple IP addresses to avoid getting blocked.
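As a quick illustration, here is a minimal sketch of sending a single request through a proxy with the requests library; the proxy address below is a placeholder, not a real server:

import requests

# Placeholder proxy address; substitute a real one of your own.
proxy = "203.0.113.10:8080"

# requests routes both HTTP and HTTPS traffic through the given proxy.
r = requests.get("https://httpbin.org/ip",
                 proxies={"http": proxy, "https": proxy},
                 timeout=5)
print(r.json())  # the site sees the proxy's IP, not yours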
Finding Free Proxies
There are many websites that provide free proxy lists that you can use. However, most of these proxies are well known and likely already blocked by sites trying to prevent scraping.
A better approach is to scrape these free proxy sites yourself to get fresh proxies that are more likely to work. Here's how to scrape and collect free proxies in Python:
import requests
from bs4 import BeautifulSoup
def getProxies():
    r = requests.get('https://free-proxy-list.net/')
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.find('tbody')
    proxies = []
    # iterate over table rows, not raw child nodes, so text nodes are skipped
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if cells[4].text == 'elite proxy':
            # build an "ip:port" string from the first two columns
            proxies.append(':'.join([cells[0].text, cells[1].text]))
    return proxies
This scrapes the free proxy table on free-proxy-list.net, keeps only the rows marked "elite proxy", and returns a list of proxies in the ip:port format.
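A quick usage sketch (the exact output will vary, since the list is refreshed constantly):

proxies = getProxies()
print(f"Found {len(proxies)} elite proxies")
print(proxies[:3])  # e.g. ['203.0.113.10:8080', ...] in ip:port form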
Testing Proxies with Python
Once you have a list of potential proxies, you'll want to test them to verify they are working. Here is a simple way to test proxies:
import requests
def testProxy(proxy):
    try:
        # route the request through the proxy; a short timeout weeds out slow ones
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy},
                         timeout=1)
        print(r.json())
        print("Working!")
    except requests.exceptions.RequestException:
        print("Not working")
This makes a request to httpbin.org/ip, which returns the originating IP address. If the proxy works, the response will show the proxy's IP rather than your own.
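To see the contrast, you can hit the same endpoint without a proxy first; this short sketch assumes a direct internet connection:

import requests

# No proxies argument, so httpbin echoes your real IP address.
print(requests.get("https://httpbin.org/ip", timeout=5).json())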
You can test a list of proxies quickly using Python's concurrent.futures module:
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(testProxy, proxies)
This tests all of the proxies concurrently and prints the result for each one.
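If you want the working proxies collected into a list rather than just printed, one sketch (using a hypothetical checkProxy variant that returns the proxy on success) looks like this:

import concurrent.futures
import requests

def checkProxy(proxy):
    # Return the proxy string if it responds in time, otherwise None.
    try:
        requests.get("https://httpbin.org/ip",
                     proxies={"http": proxy, "https": proxy},
                     timeout=1)
        return proxy
    except requests.exceptions.RequestException:
        return None

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(checkProxy, proxies)

working = [p for p in results if p is not None]
print(f"{len(working)} of {len(proxies)} proxies are working")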
Rotating Proxies in Your Code
Once you have a list of working proxies, you can rotate through them in your scraper code to avoid getting blocked.
Here is an example using the random.choice function from Python's standard library:
import requests
from random import choice

proxies = []  # fill with your list of working proxies

headers = {"User-Agent": "Mozilla/5.0"}

for page in range(1, 11):
    proxy = choice(proxies)  # rotate: pick a fresh proxy for this request
    r = requests.get("https://example.com/" + str(page),
                     headers=headers,
                     proxies={"http": proxy, "https": proxy})
    # scraping logic here
This picks a random proxy from the list for each request, so consecutive requests come from different IP addresses.
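If you prefer rotating through the list in a fixed order instead of at random, itertools.cycle from the standard library gives you round-robin behavior; this is just a sketch of that alternative:

import requests
from itertools import cycle

proxy_pool = cycle(proxies)  # loops over the working list endlessly

for page in range(1, 11):
    proxy = next(proxy_pool)  # round-robin: take the next proxy in order
    r = requests.get("https://example.com/" + str(page),
                     headers={"User-Agent": "Mozilla/5.0"},
                     proxies={"http": proxy, "https": proxy})
    # scraping logic here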
The key points are to test your proxies before using them, to pick a fresh proxy for each request, and to send realistic headers such as a browser User-Agent. Together, these should allow you to scrape sites successfully without getting blocked.
Conclusion
Proxy rotation is important for avoiding blocks while web scraping. This article covered how to scrape fresh free proxies, test them for reliability, and rotate through them in your Python code.
While these tools are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.