Overview of Scalaj.http
For the uninitiated, scalaj.http is a convenient Scala wrapper over Java's HttpURLConnection to make HTTP requests.
It has a simple API that lets you do stuff like:
import scalaj.http._
val response = Http("<http://www.example.com>").asString
print(response.body)
This simplicity makes it quite popular, and if proxies are configured correctly, you can scale your scraping efforts without headaches.
Let's see the common proxy server options available.
Know Your Proxies
There are largely two types of proxies in use:
- HTTP Proxy: These understand the HTTP protocol: plain HTTP requests are forwarded, and HTTPS requests are tunneled through the proxy with CONNECT. Sites see the proxy's IP/location, not yours, but plain HTTP traffic is visible to the proxy since it is unencrypted.
- SOCKS Proxy: These route arbitrary TCP traffic, including HTTP and HTTPS, without interpreting it. They add no encryption of their own; HTTPS stays encrypted end-to-end only because of TLS. They are more general-purpose, but offer no HTTP-specific handling.
Within these, you also have options like datacenter, residential, and mobile proxies, which mainly differ in how likely target sites are to flag and block them.
Now let's see how to configure them in scalaj.http.
Basic Proxy Setup in Scalaj.http
The simplest way is to use the proxy(host, port) method on a request:
import scalaj.http._
val proxyHost = "1234.myproxy.com"
val proxyPort = 8080
val response = Http("<http://www.example.com>")
.proxy(proxyHost, proxyPort)
.asString
Here my HTTP requests route through the proxy 1234.myproxy.com on port 8080. The target site sees the proxy's IP.
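A quick way to sanity-check this is to hit an IP-echo endpoint with and without the proxy and compare what it reports. The sketch below assumes httpbin.org/ip as the echo service (any similar service works) and reuses the placeholder proxy from above:
import scalaj.http._

// Ask an IP-echo service which address it sees (httpbin.org is just an example endpoint)
val direct = Http("http://httpbin.org/ip").asString
val viaProxy = Http("http://httpbin.org/ip")
  .proxy("1234.myproxy.com", 8080) // placeholder proxy from above
  .asString

println(direct.body)   // your own IP
println(viaProxy.body) // should show the proxy's IP instead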
We can also specify the proxy type explicitly via java.net.Proxy.Type; it defaults to HTTP:
val proxyHost = "1234.myproxy.com"
val proxyPort = 8080
val proxyType = Proxy.Type.SOCKS // Can also be HTTP
// Route requests over a SOCKS proxy
val response = Http("<http://www.example.com>")
.proxy(proxyHost, proxyPort, proxyType)
.asString
And that's it for basic configuration!
But in the real world, you often have to deal with stuff like authenticated proxies and HTTPS sites.
Dealing With Authentication
Many proxy providers require authentication to prevent abuse.
This involves dealing with the Proxy-Authorization header, since scalaj.http has no dedicated proxy-credentials API.
Here is how to authenticate your Scala proxy requests:
import java.util.Base64
import scalaj.http._

val proxyHost = "buyproxy.com"
val proxyPort = 3128

// Encode the proxy credentials for Basic auth
val encodedCreds = Base64.getEncoder.encodeToString("my_username:1234".getBytes("UTF-8"))

// Authenticate against the proxy
val response = Http("http://www.example.com")
  .proxy(proxyHost, proxyPort)
  .header("Proxy-Authorization", s"Basic $encodedCreds")
  .asString
We route through the proxy as before and attach the Base64-encoded credentials in a Proxy-Authorization header. The key thing is that this header is how the credentials reach the proxy for plain HTTP requests.
This took me hours to figure out through trial and error! But now I can use any authenticated proxy easily.
HTTPS Calls Over Proxy
Things get slightly tricky when using HTTPS sites compared to plain HTTP.
Many proxy providers only support tunneling traffic with the CONNECT method, which is what the JVM uses for HTTPS requests going through a proxy.
Proxy authentication also behaves differently for the two: the Proxy-Authorization header we set ourselves only applies to plain HTTP requests, so you can see errors like "407 Proxy Authentication Required" on one kind of site while the other works fine.
After banging my head debugging non-working proxies, I found the reason is that HTTPS uses the CONNECT tunneling method under the hood, and the tunnel is managed by HttpURLConnection itself.
So with HTTPS sites specifically:
import scalaj.http._
val proxyHost = "buyproxy.com"
val proxyPort = 3128
// Route HTTPS traffic through proxy
Http("<https://www.some-https-site.com>")
.proxy(proxyHost, proxyPort)
.asString
This lets my HTTPS requests tunnel safely through the proxy.
If you still get 407 authentication errors on HTTPS sites, the credentials are probably not reaching the CONNECT tunnel (see the sketch below); and if tunneling fails outright, reconsider your proxy choice, since not all providers support it.
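Because HttpURLConnection issues the CONNECT request itself, headers set on the scalaj.http request are not forwarded to the tunnel. A minimal sketch of supplying proxy credentials at the JVM level via java.net.Authenticator, reusing the placeholder credentials from the authentication section; the system property tweak is needed on newer JDKs:
import java.net.{Authenticator, PasswordAuthentication}

// Supply proxy credentials at the JVM level so CONNECT tunnels can authenticate
Authenticator.setDefault(new Authenticator {
  override def getPasswordAuthentication: PasswordAuthentication =
    if (getRequestorType == Authenticator.RequestorType.PROXY)
      new PasswordAuthentication("my_username", "1234".toCharArray) // placeholder creds
    else null
})

// Since JDK 8u111, Basic auth over CONNECT tunnels is disabled by default;
// clearing this property re-enables it (weigh the security trade-off first)
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "")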
Going Pro with Custom Transports
The transport layer is where the actual connections to the proxy get opened, and scalaj.http keeps it hidden inside HttpURLConnection.
For advanced use cases you can step up to Akka HTTP, which lets you plug in custom client transports for more control.
For example, this shows implementing a custom ClientTransport that rotates across multiple proxies randomly:
import java.net.InetSocketAddress
import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.http.scaladsl.Http.OutgoingConnection
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.settings.{ClientConnectionSettings, ConnectionPoolSettings}
import akka.http.scaladsl.{ClientTransport, Http}
import akka.stream.scaladsl.Flow
import akka.util.ByteString

// Container for proxy host/ports
case class ProxyEndpoint(host: String, port: Int)

// Custom transport that picks a random proxy for each new connection
class RandomProxyTransport extends ClientTransport {
  // Available proxies
  private val proxies = Vector(
    ProxyEndpoint("proxy1.com", 8000),
    ProxyEndpoint("proxy2.com", 8000),
    ProxyEndpoint("proxy3.com", 8000)
  )

  override def connectTo(
      host: String,
      port: Int,
      settings: ClientConnectionSettings
  )(implicit system: ActorSystem): Flow[ByteString, ByteString, Future[OutgoingConnection]] = {
    // Pick a random proxy
    val proxy = proxies(scala.util.Random.nextInt(proxies.size))
    // Connect via Akka HTTP's built-in proxy transport for the chosen proxy
    ClientTransport
      .httpsProxy(InetSocketAddress.createUnresolved(proxy.host, proxy.port))
      .connectTo(host, port, settings)
  }
}

// Usage:
implicit val system: ActorSystem = ActorSystem()
val settings = ConnectionPoolSettings(system).withTransport(new RandomProxyTransport())
Http().singleRequest(HttpRequest(uri = "https://www.example.com"), settings = settings)
This allows me to add logic for things like rotating proxies per connection, skipping proxies that keep failing, or preferring proxies in a particular region.
The sky's the limit!
How Akka HTTP Compares
The other popular Scala HTTP library is Akka HTTP.
It has some similarities with Scalaj in proxy configuration: you still point requests at a proxy host and port, and HTTPS traffic is tunneled through the proxy in much the same way.
However, in my experience, Akka HTTP has a steeper learning curve compared to the simplicity of Scalaj.
Another key difference is that Akka HTTP supports proxy authentication out-of-the-box while Scalaj required custom handling.
So if your use case is simple scraping, I found Scalaj faster to get off the ground with. But Akka HTTP offers richer features for complex use cases.
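To make the comparison concrete, here is roughly what an authenticated proxy looks like in Akka HTTP using its built-in ClientTransport.httpsProxy; the host, port, and credentials are the same placeholder values used earlier:
import java.net.InetSocketAddress
import akka.actor.ActorSystem
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.model.headers.BasicHttpCredentials
import akka.http.scaladsl.settings.ConnectionPoolSettings
import akka.http.scaladsl.{ClientTransport, Http}

implicit val system: ActorSystem = ActorSystem()

// Akka HTTP ships a proxy transport with credential support built in
val proxyAddress = InetSocketAddress.createUnresolved("buyproxy.com", 3128)
val proxyCreds = BasicHttpCredentials("my_username", "1234")
val proxyTransport = ClientTransport.httpsProxy(proxyAddress, proxyCreds)

val settings = ConnectionPoolSettings(system).withTransport(proxyTransport)
Http().singleRequest(HttpRequest(uri = "https://www.example.com"), settings = settings)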
Putting It All Together
Let's take stock of what we've learnt through an example use case.
Say I want to scrape user profile information from a site.
My IP kept getting blocked, so I signed up for a rotating residential proxy service.
It requires authentication, so I have credentials.
Here's how I can leverage all techniques learnt:
import java.net.{Authenticator, PasswordAuthentication}
import scalaj.http._

// Credentials
val username = "my_username"
val password = "my_secret"

// Register proxy credentials with the JVM so the HTTPS CONNECT tunnel can authenticate
Authenticator.setDefault(new Authenticator {
  override def getPasswordAuthentication: PasswordAuthentication =
    new PasswordAuthentication(username, password.toCharArray)
})
System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "")

// Pick a random proxy from the rotating pool (getRandProxy is provider-specific)
val proxyHost = getRandProxy()
val proxyPort = 8080

// Route requests over the rotating authenticated proxy
val response = Http("https://some-site.com/users/john_doe")
  .proxy(proxyHost, proxyPort)
  .asString

// Extract data..
val name = extractName(response.body)
println(name)
This lets me keep scraping day and night without IP blocks or usage limits!
Some key things: the proxy is picked fresh from the rotating pool for each run, the credentials are registered once with the JVM so the HTTPS CONNECT tunnel can authenticate, and the target site only ever sees the proxy's residential IPs.
Phew, that was quite the epic journey!
We went from the basics of using Scala and scalaj.http to leveraging proxies for effective large-scale web scraping without headaches.
Hopefully the practical tips shared here based on painful experience will help you avoid common scraper issues like captchas and blocks.
Of course, self-hosting and maintaining proxies introduces operational complexities. The authentication, tunneling, region rotation, and proxy refreshing needed to avoid blocks involve quite a bit of engineering.
My Proxies API service takes care of these complexities through a simple API.
With auto-rotating residential proxies across regions, captcha solving, and built-in retry logic, it lets me focus on writing scraping logic instead of proxy management.
It also seamlessly handles JavaScript rendering, cookies, headers, etc. Do check it out if you want to scrape at scale without headaches!