I clearly remember the first time I used Puppeteer for a web scraping project. It was awesome - a headless Chrome browser following my script's commands, rapidly loading pages and extracting data at lightning speed.
Until it suddenly stopped with a bunch of 403 errors! The target site had apparently detected the scraping activity and blocked my server IP.
I tried varying the delays between requests, randomizing user agents, everything I could think of. But the errors persisted.
It was only after days of frustration that a network admin friend suggested trying out proxies. And that changed the scraping game for me forever!
In this post, I want to share my hard-won experience on the significance of using proxies with Puppeteer, explain the key concepts in simple terms, and take you through a journey of effectively leveraging them in your web automation projects.
We'll start from the basics and work up through progressively more complex proxy configurations, with real-world stories and lessons learned along the way.
By the end, you'll have all the insider tricks and actionable tips to master proxies with Puppeteer!
Why Proxies Matter for Web Scraping
The first question from readers at this point is usually:
What exactly are proxies and why do I need them for web scraping?
Let me use an analogy here. Suppose you want to check out a classmate's Facebook profile that you don't have access to. You could try viewing it while logged into your own account, but Facebook would just block you.
Instead, you ask a mutual friend to log into their account and access the profile for you, then share what they see.
A proxy server acts as that mutual friend - an intermediary that makes web requests on your behalf, fetches the responses, and relays them back to you, so the target website never sees you directly.
This helps overcome blocks and access restrictions in two ways:
- Anonymity: The requests appear to come from the proxy server instead of exposing your scraping code's IP, making it harder to detect and block
- Rotation: Using multiple proxy servers allows you to distribute requests across different IPs, avoiding rate limits and usage blocks on individual IPs
Now that you have an idea of what proxies do, let's look at how we can configure them in Puppeteer...
Launching Puppeteer with Proxies
The most basic way of using a proxy with Puppeteer is passing it via the --proxy-server launch argument:
const browser = await puppeteer.launch({
args: ['--proxy-server=1.2.3.4:8080']
});
This routes all traffic from the controlled browser instance through the specified proxy server at IP 1.2.3.4 and port 8080.
You can use this to set up Puppeteer with any standard HTTP, HTTPS or SOCKS proxy.
For example, here's how I configured it to work with a free SOCKS proxy I found for one of my early testing experiments:
const proxyUrl = 'socks5://164.68.118.25:59433';
const browser = await puppeteer.launch({
args: [`--proxy-server=${proxyUrl}`]
});
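Before scraping anything through a proxy like this, it's worth a quick sanity check that traffic really exits through the proxy's IP. Here's a minimal sketch, assuming httpbin.org/ip as the IP echo endpoint:
const puppeteer = require('puppeteer');

(async () => {
  const proxyUrl = 'socks5://164.68.118.25:59433'; // same free SOCKS proxy as above
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
  });
  const page = await browser.newPage();
  // httpbin.org/ip echoes back the caller's IP - it should match the proxy, not your machine
  await page.goto('https://httpbin.org/ip');
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();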
However, I soon realized two issues while using free public proxies:
- They are super slow and unreliable
- The single proxy's IP itself gets blocked after some scraping activity
Hence, I needed a way to integrate multiple proxies and keep rotating them...
Rotating Multiple Proxies to Avoid Blocks
The key to avoiding blocks and scraping effectively at scale is using a pool of proxies and switching them continuously.
Let me walk you through a nifty trick to achieve this proxy rotation with Puppeteer:
Step 1: Get a list of proxies from a provider
const proxies = [
'http://152.32.90.10:8080',
'http://98.162.19.15:3128',
// ...
];
Step 2: Randomly select one proxy URL for each Puppeteer launch
const randomProxy =
proxies[Math.floor(Math.random() * proxies.length)];
Step 3: Pass it to the puppeteer.launch() call via the --proxy-server argument
const browser = await puppeteer.launch({
args: [`--proxy-server=${randomProxy}`],
});
This ensures every new Puppeteer browser instance uses a different proxy from the pool, achieving rotation!
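Putting the three steps together, here's a minimal rotating-launcher sketch (the proxy list is the sample from Step 1, and scrapeWithRandomProxy is just an illustrative helper name):
const puppeteer = require('puppeteer');

const proxies = [
  'http://152.32.90.10:8080',
  'http://98.162.19.15:3128',
  // ...
];

// Launches a fresh browser through a randomly chosen proxy and scrapes one page
async function scrapeWithRandomProxy(url) {
  const randomProxy = proxies[Math.floor(Math.random() * proxies.length)];
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${randomProxy}`],
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await browser.close();
  }
}
Call it in a loop and every request goes out through a different exit IP from the pool.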
With this approach, I was able to scrape data from tens of thousands of listings from a real estate portal without getting IP banned, by rotating ~100 residential proxies.
Configuring Authentication for Premium Proxies
Many paid proxy services require authenticating to the proxy server before usage, especially for residential proxies.
The proxy URL in such cases looks like:
http://<username>:<password>@<proxy-IP>:<port>
However, Chrome does not directly support proxy auth information in the URL.
Instead, we need to use Puppeteer's page.authenticate() method to provide credentials:
await page.authenticate({ username: 'px123', password: 'p@55w0rd'});
For example, with the Bright Data residential proxy service, the proxy credentials follow this format (the account ID, zone password, and proxy host below are placeholders):
http://<account-id>-zone-static-country-us:<zone-password>@<proxy-host>:22225
Since Chrome only accepts the host and port in --proxy-server, the credentials are supplied through page.authenticate():
const browser = await puppeteer.launch({
  args: ['--proxy-server=<proxy-host>:22225']
});
const page = await browser.newPage();
await page.authenticate({
  username: '<account-id>-zone-static-country-us',
  password: '<zone-password>'
});
This allowed me to leverage over 70 million residential IPs for scraping at scale!
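One tidy way to manage authenticated proxies in practice is to keep the full URL (in the <username>:<password>@<proxy-IP>:<port> format shown above) in your config and split it with Node's built-in URL class. A small sketch reusing the px123 / p@55w0rd credentials from earlier:
const puppeteer = require('puppeteer');

(async () => {
  // Full proxy URL from config; note the '@' in the password is percent-encoded as %40
  const proxy = new URL('http://px123:p%4055w0rd@1.2.3.4:8080');

  const browser = await puppeteer.launch({
    // Chrome only accepts scheme://host:port here - credentials are stripped out
    args: [`--proxy-server=${proxy.protocol}//${proxy.host}`],
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: proxy.username,
    password: decodeURIComponent(proxy.password),
  });
  // ... scraping code ...
  await browser.close();
})();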
Advanced Proxy Chaining
In certain complex scenarios, you may need to tunnel or chain proxies for added privacy.
The proxy-chain package makes this easy to set up.
It launches an intermediary local proxy server through which you can route your Puppeteer traffic and forward it on to chained external proxies (the upstream proxy URL below is a placeholder):
const puppeteer = require('puppeteer');
const ProxyChain = require('proxy-chain');

(async () => {
  // Local intermediary proxy; every request gets forwarded to the upstream proxy below
  const server = new ProxyChain.Server({
    port: 8000,
    prepareRequestFunction: () => ({
      // Placeholder upstream proxy - replace with your real (authenticated) proxy URL
      upstreamProxyUrl: 'http://username:password@upstream-proxy.example.com:8080',
    }),
  });
  await server.listen();

  const proxyUrl = 'http://127.0.0.1:8000';
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
  });
  // Puppeteer code here...
  await browser.close();
  await server.close();
})();
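If all you need is to hide proxy credentials from Chrome (rather than build a full chain), proxy-chain's anonymizeProxy() helper is a lighter option: it starts a local, credential-free proxy that forwards to your authenticated upstream proxy. A minimal sketch, with a placeholder upstream URL:
const puppeteer = require('puppeteer');
const ProxyChain = require('proxy-chain');

(async () => {
  // Wraps the authenticated upstream proxy behind a local URL Chrome can use directly
  const localProxyUrl = await ProxyChain.anonymizeProxy('http://username:password@upstream-proxy.example.com:8080');

  const browser = await puppeteer.launch({
    args: [`--proxy-server=${localProxyUrl}`]
  });
  // ... scraping code ...
  await browser.close();

  // Shut down the local forwarding proxy (true = close any open connections)
  await ProxyChain.closeAnonymizedProxy(localProxyUrl, true);
})();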
I implemented a multi-layer proxy chain in one project to scrape data from a website that was actively blocking most scraping tools and proxies. This provided additional obscurity and made it tougher for them to trace back the actual scraper source!
Common Issues and Troubleshooting Tips
Of course, getting proxies set up smoothly with Puppeteer does take some trial and error.
Let me share solutions for some common proxy errors I faced:
1. 407 Authentication Required
This means your proxy needs authentication credentials.
Solution:
Use the page.authenticate() method shown earlier to supply the proxy username and password before navigating to any page.
2. ERR_CONNECTION_RESET or ERR_EMPTY_RESPONSE
The proxy server closed the connection prematurely.
Solution:
The proxy is likely overloaded or unusable. Rotate to a different proxy URL and retry the request - see the retry sketch after this list.
3. Proxy URL format incorrect
Puppeteer launches, but your script cannot connect through the proxy.
Solution:
Double-check your proxy URL format - the scheme (http://, https://, socks4:// or socks5://), IP and port must all be specified correctly, with no stray characters.
4. High latency and slow responses
This typically happens with free/public proxies. Residential proxies also tend to be slower.
Solution:
For best performance, use dedicated datacenter proxies instead. Major providers like Bright Data (formerly Luminati) and Oxylabs offer high-quality dedicated proxies.
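One pattern that helps with flaky proxies is wrapping navigation in a small retry loop that swaps in a new proxy on each failure. A minimal sketch, building on the rotation setup from earlier (the proxy list is again just sample data):
const puppeteer = require('puppeteer');

const proxies = [
  'http://152.32.90.10:8080',
  'http://98.162.19.15:3128',
];

async function scrapeWithRetry(url, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const proxy = proxies[Math.floor(Math.random() * proxies.length)];
    const browser = await puppeteer.launch({
      args: [`--proxy-server=${proxy}`],
    });
    try {
      const page = await browser.newPage();
      await page.goto(url, { timeout: 30000 });
      return await page.content();
    } catch (err) {
      // ERR_CONNECTION_RESET, ERR_EMPTY_RESPONSE, timeouts, etc. - rotate and retry
      console.warn(`Attempt ${attempt} via ${proxy} failed: ${err.message}`);
    } finally {
      await browser.close();
    }
  }
  throw new Error(`All ${maxAttempts} attempts failed for ${url}`);
}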
Proxy Selection Criteria
Through extensive experimentation with different proxies for web scraping, I realized that proxy quality and performance can vary widely.
Here are 5 key criteria I consider now while selecting proxy services:
1. Reliability - Choose established proxy providers with high uptime and availability guarantees
2. Latency - Ensure low ping times and latency, especially if rendering JavaScript
3. Location Diversity - For scraping country/region specific data, target proxies in those locales
4. Rotation Frequency - For heavy scraping, prefer rapidly changing IP addresses
5. Ease of Integration - Opt for ready-to-use proxy APIs that simplify integration
Besides configuring your own proxies, I recommend considering specialized proxy services with robust infrastructure.
Leveraging Proxies API for Uninterrupted Web Scraping
Handling proxies at scale for continuous scraping can get extremely complex - sourcing reliable IPs, rotating them, handling authentication, retrying failed requests, and monitoring for blocks.
All this overhead was ultimately why I built Proxies API - a full-service proxy solution specialized for web scraping.
It provides simple API access to a large network of fast residential and datacenter proxies to scrape any site with just a single function call:
scrape_with_proxy(URL, render=True)
Some of the key benefits unique to Proxies API are automatic IP rotation, full JavaScript rendering, and built-in captcha solving.
Here is an actual request to scrape a sample page:
https://api.proxiesapi.com/?key=xxx&url=https://www.example.com&render=true
This endpoint automatically rotates IPs and solves captchas in the background, providing the fully rendered HTML in the response.
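From Node.js, calling the endpoint is just an HTTP GET - here's a minimal sketch using the global fetch available in Node 18+ (the API key and target URL are placeholders, and the query parameters mirror the example request above):
(async () => {
  const apiKey = 'xxx'; // your Proxies API key
  const targetUrl = 'https://www.example.com';

  const response = await fetch(
    `https://api.proxiesapi.com/?key=${apiKey}&url=${encodeURIComponent(targetUrl)}&render=true`
  );
  const html = await response.text();
  console.log(html.slice(0, 500)); // peek at the rendered HTML
})();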
Our customers tend to get over 99% accuracy scraping even sophisticated websites compared to DIY proxies. The service frees them up to focus on their data models and business logic rather than proxies and scraping infrastructure.
Many developers and testers have solved the headache of IP blocks with our simple API. It can be called from any programming language - Python, Node.js, PHP, and so on.
We have a special offer for readers of 1000 free API calls to try it out! Just register for a free account and you're ready to start scraping at scale!