Configuring Proxies in rvest
The key library powering most R-based crawlers and scrapers is rvest. Luckily, it makes setting up proxies straightforward because it builds on httr, R's go-to HTTP package.
Let's walk through a simple example:
# Load libraries
library(rvest)
library(httr)
# Authenticated proxy url
my_proxy <- 'http://user:[email protected]:8080'
# Set this proxy for http requests
httr::set_config(httr::use_proxy(my_proxy))
# Test it
webpage <- read_html("http://httpbin.org/ip")
> print(webpage)
<html>
<head></head>
<body>
<pre>
{
"origin": "123.45.6.7"
}
</pre>
</body>
</html>
By setting the proxy URL directly in httr with set_config(use_proxy(...)), every request rvest makes from this point on is routed through the proxy.
A few pointers on the proxy URL itself: it combines the scheme (http://), optional credentials in user:pass@ form, the proxy IP, and the port.
You can confirm it works by reading a page that echoes the requesting IP, as we just did with httpbin.org.
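For instance, you can pull the reported IP straight out of the parsed page. A minimal sketch (the exact wrapping element can vary, so we simply grab all the text):
# Grab the page text, which is the JSON httpbin returns, and inspect the origin IP
ip_json <- html_text(webpage)
cat(ip_json)
# Expect to see the proxy's IP as "origin" rather than your own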
Next let's look at some alternate ways to configure proxies in rvest beyond just using a URL directly.
Setting Environment Variables
An approach I prefer for easier management is specifying proxies via environment variables. This keeps proxy credentials out of the code itself, which is especially useful when working collaboratively.
Here is how to configure environment variables for an authenticated HTTP proxy server:
# Set proxy environment variables (credentials embedded in the URL)
Sys.setenv(http_proxy  = "http://username:[email protected]:8080")
Sys.setenv(https_proxy = "http://username:[email protected]:8080")
# Confirm the variables are set
Sys.getenv(c("http_proxy", "https_proxy"))
Now rvest will automatically pick up these environment variables when sending requests, without you passing anything explicitly.
You can also set a no_proxy environment variable to disable proxying for certain hosts and IP ranges if needed.
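For example (list whatever hosts should bypass the proxy in your own setup):
# Skip the proxy for local addresses
Sys.setenv(no_proxy = "localhost,127.0.0.1")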
Using Separate Proxy Lists
Another flexible approach is maintaining a separate list of proxies and cycling through them to distribute requests.
Let's see this in action:
# List of proxies
proxies <- data.frame(
ip = c("123.45.6.7", "98.76.54.3"),
port = c(8080, 8080),
username = c("user1", "user2"),
password = c("", "pass#8")
)
# Function to pick a random proxy and configure httr with it
get_proxy_config <- function() {
  # Select random proxy (base R alternative to dplyr::sample_n)
  proxy <- proxies[sample(nrow(proxies), 1), ]
  # Build proxy url, including credentials only when a username is set
  url <- paste0("http://",
                ifelse(proxy$username == "",
                       "",
                       paste0(proxy$username, ":", proxy$password, "@")),
                proxy$ip, ":",
                proxy$port)
  # Set proxy for httr
  httr::set_config(httr::use_proxy(url))
}
# Usage
get_proxy_config()
webpage <- read_html("http://httpbin.org/ip")
The key advantages: requests get spread across several IPs, and the proxy list lives in one place, so proxies can be added or swapped without touching the scraping code.
Now that you understand how proxies can be configured for rvest in different ways, let's move on to an even more vital technique - rotating proxies dynamically for best results.
Why You Need to Rotate Proxies for Web Scraping
If proxies are essential to distribute scraping traffic across multiple IPs, wouldn't sticking to just a handful be enough?
Unfortunately, in my early experiments I found even proxy servers get blocked eventually if you hit sites hard enough!
The solution? Rotate amongst hundreds or even thousands of proxies automatically as you gather data. Let's dissect why this is critical:
1. Prevent Proxy Blocks
Even with a pool of 4-5 proxies, repeated hits from the same small set of IPs let sites profile and block them. Rotation fundamentally defeats this.
I've rotated amongst over 50K residential IPs simultaneously for months of continuous usage without tripping defenses on some large sites!
2. Improve Success Rates
Not all proxies work consistently, and many cheap ones fail often for various reasons. By dynamically picking only WORKING proxies for each request, success rates improve dramatically.
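A simple pattern is to retry a failed request with a fresh proxy. Here is a hedged sketch (fetch_with_retries is a hypothetical helper; it leans on the get_random_proxy() function we build later in this guide):
# Retry a request up to max_tries times, switching proxy after each failure
fetch_with_retries <- function(url, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    page <- tryCatch(read_html(url), error = function(err) NULL)
    if (!is.null(page)) return(page)
    get_random_proxy()  # Swap in another working proxy and try again
  }
  stop("All proxy attempts failed for: ", url)
}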
3. Adjust Location Targeting
Need to extract content from the Japanese catalog of an ecommerce store? Or compare pricing across the EU?
Rotating geo-targeted proxies lets you intelligently switch location context between requests drawing from a world-wide residential IP pool.
Clearly, for any serious scraping activity, automatically rotating amongst a large, reliable proxy pool is almost mandatory nowadays.
Manually checking and handling dead proxies can become nightmarish pretty quickly. Let's look at smart ways to programmatically rotate proxies using R.
Implementing Intelligent Proxy Rotation in rvest
Rotating proxies manually between requests may seem simple enough: just pick randomly from a populated list. But in high-volume scraping, dealing with failures, blocks, and location weighting gets tricky fast.
Let's construct a robust algorithm supporting dynamic rotation:
Step 1 - Prepare Proxy List
# Load proxy list from file (one IP:Port pair per line)
proxies <- read.csv("proxies.txt", header = FALSE, sep = ":",
                    stringsAsFactors = FALSE)
proxies <- data.frame(
  ip = proxies$V1,
  port = proxies$V2,
  stringsAsFactors = FALSE
)
I maintain a frequently updated proxy txt file collating free and paid proxies from multiple sources into a standard IP:Port format.
Even if initially all are marked as working, many invariably fail when actually used. The next steps filter these out.
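For reference, the file holds one IP:Port pair per line, for example (the addresses are placeholders):
123.45.6.7:8080
98.76.54.3:8080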
Step 2 - Validate List
# Start with every proxy unverified
proxies$status <- "unknown"
# Helper: test a single proxy, returning TRUE if it answers with a 200
check_proxy <- function(ip, port) {
  tryCatch({
    resp <- httr::GET("http://httpbin.org/ip",
                      httr::use_proxy(ip, port = as.integer(port)),
                      httr::timeout(10))
    httr::status_code(resp) == 200
  }, error = function(err) FALSE)
}
# Test each proxy and tag it as working or failed
for (i in seq_len(nrow(proxies))) {
  ok <- check_proxy(proxies$ip[i], proxies$port[i])
  proxies$status[i] <- if (ok) "working" else "failed"
}
# Keep the full list (with statuses) and pull out the working subset
library(dplyr)
working_proxies <- proxies %>%
  filter(status == "working")
This loops through each proxy, attempts a test request, and tags each one based on the response. We keep the full list (with statuses) so failed proxies can be retried later, and work from the working subset.
Step 3 - Pick Proxy Randomly
Now that we have a sanity checked list of active proxies, we can integrate it into our main scraper code:
# Select a random working proxy and configure httr to use it
get_random_proxy <- function() {
  proxy <- working_proxies[sample(nrow(working_proxies), 1), ]
  url <- paste0("http://", proxy$ip, ":", proxy$port)  # Construct proxy url
  httr::set_config(httr::use_proxy(url))               # Configure
}
# Usage
get_random_proxy()
webpage <- read_html("http://httpbin.org/ip")
Every scraper request triggers a fresh proxy IP leading to effortless rotation!
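Putting it together, a minimal scraping loop might look like this (the URL vector is purely illustrative):
# Rotate to a fresh working proxy before every request
urls <- c("http://httpbin.org/ip", "http://httpbin.org/ip")
pages <- lapply(urls, function(u) {
  get_random_proxy()   # New proxy for this request
  read_html(u)
})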
Step 4 - Recycle Failed Proxies
To make full use of resources, we should re-check failed proxies after some time as they could come back online.
Adding a scheduled job to re-validate and promoting recovered ones back to the working pool completes a robust IP rotating system for rvest.
# Re-check failed proxies and promote recovered ones back to the pool
retry_failed_proxies <- function() {
  failed <- which(proxies$status == "failed")
  for (i in failed) {
    # Re-run the same validation as in Step 2
    if (check_proxy(proxies$ip[i], proxies$port[i])) {
      proxies$status[i] <<- "working"
    }
  }
  # Refresh the working pool
  working_proxies <<- filter(proxies, status == "working")
}
# Schedule the re-check, e.g. every 4 hours
# (for production, a cron job or the cronR / taskscheduleR packages are a better fit)
repeat {
  Sys.sleep(4 * 60 * 60)   # Wait 4 hours
  retry_failed_proxies()
}
This revolutionized stability for my long-running commercial web scrapers!
Now that you understand how to configure and rotate proxies for optimal performance, let's go through some pro-tips and best practices worth implementing.
Pro Proxy Tips for Expert-Level Web Scraping in R
Over the years, I've learned many small proxy nuances through trial & error which have levelled up my scraping capabilities significantly.
Here are some pro suggestions worth incorporating:
Filter Proxy Locations
Certain sites serve customized homepage content and product catalogs based on visitor geo-location. It's invaluable then to filter proxy lists by country for consistency:
# Helper function to select a proxy from a given country
# (assumes the proxy data frame carries a `country` column with ISO codes)
get_country_proxy <- function(country_code) {
  # Filter proxy dataframe by country code
  country_proxies <- filter(proxies, country == country_code)
  # Other steps same as random selection
  proxy <- country_proxies[sample(nrow(country_proxies), 1), ]
  httr::set_config(httr::use_proxy(paste0("http://", proxy$ip, ":", proxy$port)))
}
# Usage
get_country_proxy("US")
webpage <- read_html("http://www.xyzshop.com/")
This extracts the US version reliably despite proxies switching.
Authenticate through Proxy
Sites that require login credentials before exposing data call for authenticating through the proxy and reusing the session:
# Construct authenticated proxy url (placeholders for your own credentials)
proxy <- "http://user:pass@IP:port"
# Add a user-agent and accept header for stealth
ua <- httr::user_agent("Mozilla/5.0")
headers <- httr::add_headers(Accept = "text/html")
# Login and extract cookies
login <- POST(url = "http://www.website.com/login",
              body = list(email = "[email protected]", password = "****"),
              ua, headers,
              httr::use_proxy(proxy))
# Store + reuse cookies for scraping
cookie <- httr::cookies(login)
# Scraping steps...
GET(url, ua, headers, httr::use_proxy(proxy),
    httr::set_cookies(.cookies = setNames(cookie$value, cookie$name)))
This logs in just once and reuses the authenticated session for subsequent scraping requests.
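If you prefer rvest's session objects, they accept the same httr proxy config and keep cookies for you automatically. A sketch, assuming the login form exposes email and password fields (the /account path is just an example):
# rvest session routed through the proxy; cookies persist across requests
sess <- session("http://www.website.com/login", httr::use_proxy(proxy))
form <- html_form(sess)[[1]]
form <- html_form_set(form, email = "[email protected]", password = "****")
sess <- session_submit(sess, form)
# Later requests reuse the authenticated, proxied session
page <- session_jump_to(sess, "http://www.website.com/account")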
Integrate Selenium for JS Sites
An extreme challenge I faced was scraping complex JavaScript-rendered sites like Facebook and LinkedIn to extract user profile info.
The RSelenium package, which brings the Selenium WebDriver to R, came to the rescue:
# Launch a headless Chrome browser through Selenium, routed via the proxy
# (assumes a Selenium server is already running on port 4445; on newer
#  Selenium versions the capability name may be "goog:chromeOptions")
library(RSelenium)
caps <- list(chromeOptions = list(
  args = c("--headless", paste0("--proxy-server=", proxy))
))
remDr <- remoteDriver$new(remoteServerAddr = "localhost", port = 4445L,
                          browserName = "chrome", extraCapabilities = caps)
remDr$open()
# Navigate to site
remDr$navigate("https://www.linkedin.com/feed/")
# Extract info (the CSS selector is illustrative)
profiles <- remDr$findElements(using = "css selector", "div.profile")
names <- sapply(profiles, function(x) x$getElementText())
The key learning here was configuring Selenium to route traffic through proxies. With that in place, handling any JS-heavy site was a breeze!
I hope these tips help you become an expert proxy handler in R! Let's conclude with some final words of wisdom.
Key Takeaways - The Ideal Proxy Setup for Web Scraping
After having configured proxies across dozens of scrapers over the years, here is what I think comprises an ideal setup: a large pool of reliable (ideally residential) IPs, automatic validation that weeds out dead proxies, rotation on every request, geo-targeting where content varies by location, and periodic re-checking of failed proxies so they rejoin the pool.
Getting all of the above right can become extremely complex, quickly eating into the time you actually have for data analysis.
My advice, therefore, is to outsource proxy management to a capable third-party service like Proxies API, which handles the heavy lifting behind simple APIs. Proxies API is my own SaaS service.
It takes care of sourcing, validating and rotating millions of residential IPs automatically in the backend, across footprints globally, while exposing a straightforward interface to manage proxies across all my scrapers in Python, R, PHP and more.
Definitely give Proxies API a spin with our 1000 request trial offer to simplify your scraping infrastructure.