For Ruby scrapers, open-uri makes fetching pages a breeze. But without proxies, your requests can still get blocked!
Let's look at how to configure proxies for use with open-uri.
Specifying Proxies in Open-URI
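The :proxy option takes the proxy URL as a string or URI object:
require 'open-uri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
proxy_url = 'http://proxy.example.com:8000'
URI.open('https://page.to.scrape/', proxy: proxy_url) { |f|
  # scrape page
}
We simply pass the proxy URL, and open-uri routes the request through the built-in Net::HTTP proxy support. Credentials embedded in the URL (username:password@host) are not used by this option.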
For authenticated proxies, we must pass the credentials separately:
proxy_url = 'http://proxy.example.com:8000'
username = 'proxyuser'
password = 'proxypass'
URI.open('https://page.to.scrape/',
  proxy_http_basic_authentication: [proxy_url, username, password]
) { |f|
  # scrape page
}
This allows using proxies that require authentication.
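Proxy providers often hand out a single URL with the credentials embedded, while open-uri wants them passed separately. A minimal sketch of splitting such a URL into the right option could look like the following. The helper name proxy_options is hypothetical (it is not part of open-uri), and the sketch assumes the credentials contain no percent-encoded characters.

```ruby
require 'uri'

# Hypothetical helper (not part of open-uri): split a proxy URL that may
# embed credentials into the option hash that URI.open understands.
# Assumes the credentials contain no percent-encoded characters.
def proxy_options(proxy_url)
  uri = URI.parse(proxy_url)
  bare_url = "#{uri.scheme}://#{uri.host}:#{uri.port}"

  if uri.user && uri.password
    # Authenticated proxy: [proxy URL, user name, password]
    { proxy_http_basic_authentication: [bare_url, uri.user, uri.password] }
  else
    # Open proxy: the plain :proxy option is enough
    { proxy: bare_url }
  end
end
```

The result can be splatted into the request, for example URI.open('https://page.to.scrape/', **proxy_options(url)).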
To disable proxies, we pass false or nil:
URI.open('https://page.to.scrape/', proxy: false) { |f|
  # scrape page with no proxy
}
If the :proxy option is omitted entirely, the proxy environment variables still apply by default.
Leveraging Environment Variables
Open-uri respects the standard proxy environment variables out of the box: http_proxy, https_proxy, and no_proxy.
For example:
export http_proxy="http://proxy.example.com:8000"
ruby -ropen-uri -e "URI.open('http://page.to.scrape') {...}"
This proxies all HTTP requests made with open-uri.
Note: The capitalized versions like HTTP_PROXY are read too, but Ruby prints a warning that HTTP_PROXY is discouraged and ignores it when running under CGI.
no_proxy exempts specific hosts or domains from the proxy.
So environment variables provide an easy mechanism for bulk proxy configuration.
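To check which proxy a given URL would end up with, one option is to call URI#find_proxy, the method open-uri relies on for environment lookups. The sketch below passes an explicit hash instead of the real environment so the outcome is predictable. The helper name proxy_for is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: report which proxy (if any) the environment
# variables in `env` select for `url`. Pass ENV to inspect the real
# environment. Returns the proxy URL as a string, or nil for a direct
# connection.
def proxy_for(url, env)
  proxy = URI.parse(url).find_proxy(env)
  proxy && proxy.to_s
end

# Usage (left as comments because find_proxy resolves the host name):
#   proxy_for('http://page.to.scrape/', ENV)
#   proxy_for('http://page.to.scrape/',
#             'http_proxy' => 'http://proxy.example.com:8000',
#             'no_proxy'   => 'internal.example.com')
```

Note that find_proxy also skips the proxy for loopback addresses, so requests to localhost go direct regardless of these variables.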
Working With HTTP Proxies
Open-uri only works with HTTP proxies; a SOCKS proxy URL is rejected with a "Non-HTTP proxy URI" error. Several request options are especially useful when traffic goes through such a proxy.
For example, we can configure timeouts:
URI.open('https://page.to.scrape/',
  read_timeout: 10, # seconds
  open_timeout: 5   # seconds
)
This keeps requests from hanging indefinitely when a proxy is slow or unresponsive.
For HTTPS requests:
URI.open('https://page.to.scrape/',
  ssl_ca_cert: '/path/to/ca.cert' # custom cert
)
Passing a custom CA cert may be required if the proxy uses a self-signed certificate for inspection.
Redirects can also be configured:
URI.open('http://page.to.scrape', redirect: true) # follow redirects (default)
URI.open('http://page.to.scrape', redirect: false) # raise OpenURI::HTTPRedirect instead of following
This helps when requests from the proxied IP are redirected differently than requests from a direct IP.
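With redirect: false, open-uri raises OpenURI::HTTPRedirect rather than following the Location header, which makes it possible to see where a proxied request is being sent. A minimal sketch, assuming a hypothetical helper named redirect_target:

```ruby
require 'open-uri'

# Hypothetical helper: return the URI a server redirects to, or nil when
# the response is not a redirect. Extra options (for example :proxy) are
# passed through to URI.open unchanged.
def redirect_target(url, **options)
  URI.open(url, **options, redirect: false) { nil }
rescue OpenURI::HTTPRedirect => e
  e.uri # the redirect destination reported by the server
end
```

Comparing redirect_target(url) with redirect_target(url, proxy: proxy_url) is one way to spot redirects that only appear for the proxied IP.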
Authentication and Authorization with Proxies
Web scraping with proxies also needs special care around authentication and authorization.
Open-uri provides an http_basic_authentication option for this:
URI.open('https://page.to.scrape',
  http_basic_authentication: ['username', 'password']
)
This handles HTTP basic auth with the target site's credentials.
For proxy authentication, use the proxy_http_basic_authentication option covered earlier.
A common mistake is confusing site auth vs proxy auth! Be sure to use the right credentials in the right place.
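To keep the two sets of credentials apart, it can help to build the options in one place. The following is only a sketch: the helper name auth_options and its keyword arguments are hypothetical.

```ruby
# Hypothetical helper: build open-uri options for a request that needs
# credentials for the target site, for the proxy, or for both.
def auth_options(site: nil, proxy_url: nil, proxy_credentials: nil)
  options = {}
  # Credentials for the site being scraped
  options[:http_basic_authentication] = site if site

  # open-uri accepts only one proxy option per request, so set exactly one
  if proxy_url && proxy_credentials
    options[:proxy_http_basic_authentication] = [proxy_url, *proxy_credentials]
  elsif proxy_url
    options[:proxy] = proxy_url
  end
  options
end
```

A call such as URI.open(url, **auth_options(site: ['username', 'password'], proxy_url: proxy_url, proxy_credentials: ['proxyuser', 'proxypass'])) then sends each pair to the right party.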
Advanced Proxy Usage with Open-URI
Open-URI provides some lesser-known options that make proxied requests easier to monitor.
Monitor download progress with a callback:
progress_proc = ->(size) do
  puts "Downloaded #{size} bytes"
end
URI.open(url, progress_proc: progress_proc)
Or get total size before downloading:
length_proc = ->(content_length) do
  # content_length is nil when the server sends no Content-Length header
  puts "Total size: #{content_length} bytes"
end
URI.open(url, content_length_proc: length_proc)
Streaming response bodies takes more work: open-uri buffers the whole response before handing it to your block, so processing page content as it downloads through the proxy means dropping down to Net::HTTP.
We can add failure handling by wrapping proxy requests in a retry mechanism, which limits the damage when a proxy goes bad.
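One way to sketch such a retry wrapper is to cycle through a list of proxies and move on to the next one when a request fails. The method name with_proxy_retries, the RETRYABLE_ERRORS list, and the proxy URLs in the usage comment are assumptions made for illustration; which errors are worth retrying depends on your proxies.

```ruby
require 'open-uri'
require 'net/http'

# Errors that usually point at the proxy or the connection rather than
# at our own code. This list is an assumption; adjust it as needed.
RETRYABLE_ERRORS = [
  Net::OpenTimeout, Net::ReadTimeout,
  Errno::ECONNRESET, Errno::ECONNREFUSED,
  SocketError, OpenURI::HTTPError
].freeze

# Hypothetical helper: yield each proxy in turn (cycling through the list)
# until the block succeeds or `attempts` tries have failed, then re-raise
# the last error.
def with_proxy_retries(proxies, attempts: 3)
  raise ArgumentError, 'no proxies given' if proxies.empty?

  last_error = nil
  proxies.cycle.first(attempts).each do |proxy|
    begin
      return yield(proxy)
    rescue *RETRYABLE_ERRORS => e
      last_error = e # remember the failure and try the next proxy
    end
  end
  raise last_error
end

# Usage (performs real requests, so it is left as a comment):
#   html = with_proxy_retries(['http://proxy1.example.com:8000',
#                              'http://proxy2.example.com:8000']) do |proxy|
#     URI.open('https://page.to.scrape/', proxy: proxy,
#              open_timeout: 5, read_timeout: 10, &:read)
#   end
```

Passing the proxy into the block keeps the wrapper independent of how the request itself is made.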
Common Errors and Troubleshooting Tips
Here are some frequent proxy errors along with troubleshooting suggestions:
407 Proxy Authentication Required - Use the correct proxy credentials via proxy_http_basic_authentication.
Connection reset by peer - The proxy server cut off mid-request. Try a different proxy or check for issues with your network/firewall.
SSL certificate verify failed - Pass the CA cert file via ssl_ca_cert to validate self-signed certs from the proxy when using HTTPS.
cef_filter_peer_reset - This obscure Chrome DevTools error indicates Chrome detected the request coming from a proxy and blocked it. Use proxy inspection tools to validate requests are proxying properly with the expected headers, SSL certificates etc.
Too many redirects - Adjust the redirect option depending on whether the proxy alters redirects compared to direct requests.
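Some of the errors above can be recognized in code by their exception class. The following is a rough sketch rather than a complete classifier; the helper name proxy_error_hint and the hint wording are hypothetical.

```ruby
require 'open-uri'
require 'openssl'

# Hypothetical helper: map an exception raised during a proxied request
# to a short troubleshooting hint.
def proxy_error_hint(error)
  case error
  when OpenURI::HTTPError
    # The message of an HTTPError starts with the status code
    if error.message.start_with?('407')
      'Proxy rejected the credentials: check proxy_http_basic_authentication.'
    else
      "Server answered with #{error.message}."
    end
  when Errno::ECONNRESET
    'Connection reset: try a different proxy or check your network/firewall.'
  when OpenSSL::SSL::SSLError
    'Certificate problem: pass the proxy CA file via ssl_ca_cert.'
  else
    "Unrecognized error: #{error.class}"
  end
end
```

A rescue clause around URI.open can log proxy_error_hint(e) before deciding whether to retry.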
The community-maintained net-http-cheat-sheet has more handy debugging tips relevant to proxying.