Using Proxies with Ruby's Open-URI for Web Scraping in 2024

For Ruby scrapers, open-uri makes fetching pages a breeze. Without proxies, though, your requests can still get blocked.

Let's look at how to configure proxies for use with open-uri.

Specifying Proxies in Open-URI

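The URI.open method in open-uri accepts a :proxy option to route requests via a proxy. (Kernel#open no longer opens URLs as of Ruby 3.0, so the examples here call URI.open.)

require 'open-uri'

proxy_url = 'http://username:password@proxy.example.com:8000'

URI.open('https://page.to.scrape/', proxy: proxy_url) { |f|
  # scrape page
}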

We simply pass the proxy URL including any auth details. This uses the built-in Net::HTTP::Proxy behind the scenes.

Proxy credentials can also be passed separately with :proxy_http_basic_authentication, which keeps them out of the URL:

proxy_url = 'http://proxy.example.com:8000'
username = 'proxyuser'
password = 'proxypass'

URI.open('https://page.to.scrape/',
  proxy_http_basic_authentication: [proxy_url, username, password]
) { |f|
  # scrape page
}

This allows using proxies that require authentication.

To disable proxies, we pass false or nil:

URI.open('https://page.to.scrape/', proxy: false) { |f|
  # scrape page with no proxy
}

With proxy: false, open-uri connects directly and ignores the proxy environment variables. They apply only when :proxy is omitted or set to true.
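open-uri raises an ArgumentError when :proxy and :proxy_http_basic_authentication are both given, so it can help to decide on the option in one place. Here is a minimal sketch of that idea. The helper name proxy_options is our own invention, not part of open-uri:

```ruby
require 'open-uri'

# Hypothetical helper (not part of open-uri): returns exactly one proxy
# option so :proxy and :proxy_http_basic_authentication never collide.
def proxy_options(proxy_url, username: nil, password: nil)
  # No proxy URL: connect directly and ignore the environment variables.
  return { proxy: false } if proxy_url.nil?

  if username && password
    { proxy_http_basic_authentication: [proxy_url, username, password] }
  else
    { proxy: proxy_url }
  end
end

# Usage (performs a real request, so it is left commented out):
# opts = proxy_options('http://proxy.example.com:8000',
#                      username: 'proxyuser', password: 'proxypass')
# URI.open('https://page.to.scrape/', **opts) { |f| puts f.status.first }
```

The returned hash is splatted straight into URI.open, so the scraping code never has to know which kind of proxy it is talking to.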

Leveraging Environment Variables

Open-uri respects standard proxy environment variables out-of-the-box:

http_proxy

https_proxy

ftp_proxy

no_proxy

For example:

export http_proxy="http://proxy.example.com:8000"
ruby -ropen-uri -e "URI.open('http://page.to.scrape') { |f| puts f.read }"

This proxies http:// requests made with open-uri. Requests to https:// URLs look at https_proxy instead.

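Note: The capitalized versions like HTTPS_PROXY work too. HTTP_PROXY is the exception: Ruby warns that it is discouraged and ignores it in CGI environments, so prefer the lowercase http_proxy.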

no_proxy lists the hosts and domains that should bypass the proxy.

So environment variables provide an easy mechanism for bulk proxy configuration.
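To check which proxy these variables would select for a given URL, we can call URI#find_proxy with a plain hash standing in for the environment. This is a small sketch. The helper name proxy_for and the sample hosts are illustrative assumptions:

```ruby
require 'uri'

# Hypothetical helper: report the proxy the standard environment variables
# would select for +url+. +env+ is a plain Hash standing in for ENV.
# Note that find_proxy may attempt a DNS lookup of the target host.
def proxy_for(url, env)
  proxy = URI(url).find_proxy(env) # a URI, or nil for a direct connection
  proxy && proxy.to_s
end

sample_env = {
  'http_proxy'  => 'http://proxy.example.com:8000',
  'https_proxy' => 'http://proxy.example.com:8443',
  'no_proxy'    => 'internal.example.com'
}

p proxy_for('http://page.to.scrape/', sample_env)       # picked from http_proxy
p proxy_for('https://page.to.scrape/', sample_env)      # picked from https_proxy
p proxy_for('http://internal.example.com/', sample_env) # excluded by no_proxy
```

Running this before a scrape is a cheap way to confirm that no_proxy entries and scheme-specific variables behave the way you expect.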

Working With HTTP Proxies

open-uri supports only HTTP proxies (a non-HTTP proxy URI such as socks5:// raises an error), so SOCKS proxies need a different library. Several other options are useful when requests go through a proxy.

For example, we can configure timeouts:

URI.open('https://page.to.scrape/',
  read_timeout: 10, # seconds
  open_timeout: 5   # seconds
)

This keeps requests from hanging on a slow or dead proxy.

For HTTPS requests:

URI.open('https://page.to.scrape/',
  ssl_ca_cert: '/path/to/ca.cert' # custom cert
)

Passing a custom CA cert may be required if the proxy uses a self-signed certificate for inspection.

Redirects can also be configured:

URI.open('http://page.to.scrape', redirect: true)  # follow redirects (default)
URI.open('http://page.to.scrape', redirect: false) # raise OpenURI::HTTPRedirect instead of following

This handles scenarios where the proxied IP receives different redirects compared to a direct IP.
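With redirect: false, open-uri raises OpenURI::HTTPRedirect, and the exception carries the URI the server wanted to send us to. The sketch below uses that to inspect a redirect without following it, which is handy for comparing proxied and direct behaviour. The helper name redirect_target is an assumption of ours:

```ruby
require 'open-uri'

# Hypothetical helper: fetch +url+ without following redirects and return
# the redirect target as a String, or nil when no redirect was issued.
# Extra open-uri options (proxy:, read_timeout:, ...) are passed through.
def redirect_target(url, **options)
  URI.open(url, **options, redirect: false) { |f| f.read }
  nil
rescue OpenURI::HTTPRedirect => e
  e.uri.to_s # the Location the server answered with
end

# Usage (performs real requests, so it is left commented out):
# direct  = redirect_target('http://page.to.scrape', proxy: false)
# proxied = redirect_target('http://page.to.scrape',
#                           proxy: 'http://proxy.example.com:8000')
# puts "Redirects differ: #{direct} vs #{proxied}" if direct != proxied
```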

Authentication and Authorization with Proxies

Web scraping with proxies also needs special care around authentication and authorization.

Open-uri provides a :http_basic_authentication option:

URI.open('https://page.to.scrape',
  http_basic_authentication: ['username', 'password']
)

This handles HTTP basic auth with the target site's credentials.

For proxy authentication, use the proxy_http_basic_authentication option covered earlier, which takes the proxy's own username and password.

A common mistake is confusing site auth vs proxy auth! Be sure to use the right credentials in the right place.
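To make the distinction concrete, here is a hedged sketch that sends both kinds of credentials in one request. The method name fetch_with_both_auths and its keyword arguments are hypothetical. Only the two open-uri options come from the text above:

```ruby
require 'open-uri'

# Hypothetical helper: site credentials and proxy credentials travel in
# separate options, so neither can leak into the other's header.
def fetch_with_both_auths(url, site_credentials:, proxy_url:, proxy_credentials:)
  URI.open(
    url,
    # [user, pass] sent to the target site in the Authorization header
    http_basic_authentication: site_credentials,
    # [proxy URL, user, pass] sent to the proxy in Proxy-Authorization
    proxy_http_basic_authentication: [proxy_url, *proxy_credentials]
  ) { |f| f.read }
end

# Usage (performs a real request, so it is left commented out):
# fetch_with_both_auths('http://page.to.scrape/',
#                       site_credentials: ['username', 'password'],
#                       proxy_url: 'http://proxy.example.com:8000',
#                       proxy_credentials: ['proxyuser', 'proxypass'])
```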

Advanced Proxy Usage with Open-URI

Open-URI provides some lesser-known options that also come in handy when scraping through proxies.

Monitor download progress with a callback:

progress_proc = -> (size) do
  puts "Downloaded #{size} bytes"
end

URI.open(url, progress_proc: progress_proc)

Or get total size before downloading:

length_proc = -> (content_length) do
  # content_length is nil when the server sends no Content-Length header
  puts "Total size: #{content_length} bytes"
end

URI.open(url, content_length_proc: length_proc)
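The two callbacks combine naturally into a percentage display. Below is a small sketch, assuming a helper we call progress_options (not an open-uri API) that returns both procs sharing one total:

```ruby
# Hypothetical helper: builds :content_length_proc and :progress_proc so
# that progress can be printed as a percentage when the size is known.
def progress_options(out: $stdout)
  total = nil
  {
    # Called once with the Content-Length, or nil if the header is missing.
    content_length_proc: ->(length) { total = length },
    # Called repeatedly with the number of bytes received so far.
    progress_proc: lambda do |received|
      if total && total > 0
        out.puts format('%d%% (%d/%d bytes)', received * 100 / total, received, total)
      else
        out.puts "#{received} bytes received"
      end
    end
  }
end

# Usage (performs a real request, so it is left commented out):
# require 'open-uri'
# URI.open('https://page.to.scrape/', **progress_options) { |f| f.read }
```

Slow or stalling percentages are often the first visible sign that a proxy is throttling the connection.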

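Streaming response bodies takes a bit more work. open-uri buffers the entire response before handing it to your block, and progress_proc only reports sizes, so processing page content as it downloads via the proxy means dropping down to Net::HTTP and reading the body in chunks.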

We can add failure handling by wrapping proxy requests in a retry mechanism, which limits the damage when a proxy goes bad.
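One way to sketch such a retry mechanism is to try each proxy in a list until one succeeds. Everything named here (fetch_via_proxies, retryable_errors, the timeout values) is an assumption made for illustration. The fetcher is injectable so the logic can be exercised without a network:

```ruby
require 'open-uri'
require 'net/http' # defines Net::OpenTimeout and Net::ReadTimeout

# Errors that usually point at a bad proxy rather than a bad scraper.
# This list is an assumption; extend it for your own setup.
def retryable_errors
  [OpenURI::HTTPError, Net::OpenTimeout, Net::ReadTimeout,
   SocketError, SystemCallError]
end

# Hypothetical helper: try each proxy in turn and return the first body.
# Re-raises the last error once every proxy has failed.
def fetch_via_proxies(url, proxies, fetcher: nil)
  fetcher ||= lambda do |target, proxy|
    URI.open(target, proxy: proxy, open_timeout: 5, read_timeout: 10, &:read)
  end
  last_error = nil
  proxies.each do |proxy|
    begin
      return fetcher.call(url, proxy)
    rescue *retryable_errors => e
      last_error = e
      warn "proxy #{proxy} failed: #{e.class}: #{e.message}"
    end
  end
  raise(last_error || ArgumentError.new('no proxies given'))
end

# Usage (performs real requests, so it is left commented out):
# fetch_via_proxies('https://page.to.scrape/',
#                   ['http://proxy1.example.com:8000',
#                    'http://proxy2.example.com:8000'])
```

Rotating to the next proxy instead of retrying the same one is a design choice. Adding a short sleep between attempts would be a reasonable extension.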

Common Errors and Troubleshooting Tips

Here are some frequent proxy errors along with troubleshooting suggestions:

407 Proxy Authentication Required - Use the correct proxy credentials via proxy_http_basic_authentication.

Connection reset by peer - The proxy server cut off mid-request. Try a different proxy or check for issues with your network/firewall.

SSL certificate verify failed - Pass the CA cert file to validate self-signed certs from the proxy when using HTTPS.

cef_filter_peer_reset - This obscure Chrome DevTools error indicates Chrome detected the request coming from a proxy and blocked it. Use proxy inspection tools to validate requests are proxying properly with the expected headers, SSL certificates etc.

Too many redirects - Adjust the redirect option depending on whether the proxy alters redirects compared to direct requests.
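Some of these cases can be recognised in code. The sketch below maps an exception to the matching hint from the list above. The helper name proxy_hint is hypothetical, and matching on class and message is only a heuristic, since the exact exception depends on where the failure occurs:

```ruby
require 'open-uri'
require 'openssl'

# Hypothetical helper: turn an exception from a proxied request into a
# troubleshooting hint. Unknown errors fall through to a generic message.
def proxy_hint(error)
  case error
  when OpenSSL::SSL::SSLError
    'Pass the CA cert file via ssl_ca_cert.'
  when Errno::ECONNRESET
    'Proxy dropped the connection; try another proxy or check the network/firewall.'
  else
    if error.message.start_with?('407')
      'Check the credentials given to proxy_http_basic_authentication.'
    else
      'No specific hint; inspect the error and the proxy logs.'
    end
  end
end

# Usage (performs a real request, so it is left commented out):
# begin
#   URI.open('https://page.to.scrape/', proxy: 'http://proxy.example.com:8000')
# rescue StandardError => e
#   warn proxy_hint(e)
# end
```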

The community-maintained net-http-cheat-sheet has more handy debugging tips that carry over to proxying, since open-uri is built on Net::HTTP.
