As a seasoned web scraper, I've learned that HTTP headers are the duct tape holding together your fragile scraping scripts. They identify your client, control caching, and help avoid detection. Crafting the right headers makes scraping feel effortless, while the wrong ones lead to frustration and failure.
That's why I always use Requests sessions when scraping with Python. They let you set up default headers just once, then apply them automatically across all your requests. No more repeating the same header code endlessly!
Sessions are magic, but you need to understand how they work to get the most out of them. In this guide, I'll share my hard-earned knowledge for using sessions effectively to handle headers.
Creating Persistent Scraping Sessions
First, import Requests and instantiate a session:
import requests
session = requests.Session()
The session carries state like cookies between requests automatically, and you can still pass headers on individual calls:
session.get(url, headers=headers)
But even better, we can set default headers on the session itself. These will automatically apply to every request through the session:
session.headers.update(headers)
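Here's the pattern in action (example.com is just a stand-in):
import requests

session = requests.Session()

# Default headers sent with every request made through this session
session.headers.update({'Accept-Language': 'en-US,en;q=0.9'})

# Per-request headers are merged on top of the session defaults
response = session.get('https://example.com', headers={'Referer': 'https://example.com/'})
print(response.request.headers)  # both headers were sent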
Default Headers - The Scraping Essentials
When scraping, I like to set a few headers on every session (I'll combine them into a single example after the list):
User Agent - I rotate between Chrome, Firefox, Safari and Edge user agents to appear human. Many sites reject or flag unfamiliar clients, and the default python-requests/x.y.z value is a dead giveaway.
Accept Language - Setting languages like en-US,en;q=0.9 matches what a real browser sends and influences which locale of the page you get back.
Referer - Populating the referer header fools sites into serving assets as if you came from a normal page view.
Accept Encoding - Scraper-friendly sites will gzip responses when you advertise gzip, deflate, saving bandwidth. (Requests advertises this by default and transparently decompresses the body.)
Other Headers - Depending on the site, you may need headers like Host, Origin, or Content-Type too.
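Putting those essentials together (the values here are illustrative, tune them per target):
scraping_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.google.com/',
}
session.headers.update(scraping_headers)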
Authentication - Staying Logged In
Many sites require login before accessing content. Sessions let you log in once, then keep accessing authorized pages:
session.auth = ('username', 'password')
This enables HTTP Basic auth, automatically adding the Authorization header to every request made through the session.
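For sites that use a login form instead, the session's cookie jar does the heavy lifting. A minimal sketch (the URL and field names are placeholders - inspect your target's actual form):
# Placeholder URL and form fields
session.post('https://example.com/login', data={'username': 'me', 'password': 'secret'})

# The session now holds the login cookies, so protected pages just work
profile = session.get('https://example.com/account')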
For APIs, you may have to pass OAuth tokens or custom authentication headers. Sessions simplify reusing these too:
# authenticate_and_get_token() stands in for your own token-fetching logic
token = authenticate_and_get_token()
session.headers['Authorization'] = f'Bearer {token}'
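Every subsequent call through the session then carries the bearer token (api.example.com is hypothetical):
data = session.get('https://api.example.com/v1/items').json()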
Dynamically Changing Headers
While default headers are great for boilerplate needs, we often have to tweak headers dynamically per request.
For example, scraping links sequentially calls for the Referer header to track the previous page, and rotating the User-Agent per request helps dodge fingerprinting.
No problem - headers passed directly to a request will override the session defaults:
user_agent = random_user_agent()
response = session.get(url, headers={'User-Agent': user_agent})
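Note that random_user_agent() isn't a library function; a minimal sketch might look like this:
import random

# Real browser UA strings; extend the pool to taste
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_user_agent():
    return random.choice(USER_AGENTS)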
My one caution: headers passed this way apply only to that single request - the session defaults remain unchanged for future requests. If you want a new value to stick, update session.headers instead.
Header Order Matters
HTTP headers are sent in a particular order, and some anti-bot systems fingerprint that order. Requests generally sends headers in the order you added them, so make sure any headers you need to come first get added first.
For example, appending User-Agent after a pile of custom headers produces an ordering no real browser uses, which sharper anti-bot systems can flag.
One pattern I follow is starting sessions with default headers, then appending conditional request-specific ones after.
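A rough sketch of that pattern (clearing the built-in defaults so the order is entirely ours):
session = requests.Session()
session.headers.clear()  # drop Requests' defaults so we control the order

# Added first, sent first - roughly matching a real browser
session.headers['User-Agent'] = 'Mozilla/5.0 ...'  # illustrative value
session.headers['Accept'] = 'text/html,application/xhtml+xml'
session.headers['Accept-Language'] = 'en-US,en;q=0.9'

# Conditional, request-specific headers get merged in afterwards
response = session.get('https://example.com', headers={'Referer': 'https://example.com/'})
print(list(response.request.headers))  # inspect the final order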
Debugging Headers
Sometimes scraping fails mysteriously due to headers you didn't expect or realize were missing.
To debug, attach a response hook to the session and log the headers:
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('scraper')

def log_headers(response, *args, **kwargs):
    # Each response keeps a reference to the request that produced it
    logger.debug('Request headers: %s', response.request.headers)
    logger.debug('Response headers: %s', response.headers)

# Requests only dispatches the 'response' hook event
session.hooks['response'].append(log_headers)
You can also explicitly print headers of responses and requests - super useful for debugging!
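For instance, every response object keeps a reference to the prepared request that produced it:
response = session.get('https://example.com')
print(response.request.headers)  # headers actually sent
print(response.headers)          # headers the server returned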
Advanced Scraping Patterns
Beyond the basics, there are powerful patterns leveraging sessions and headers for robust scraping: rotating proxies per session, mounting transport adapters for automatic retries, persisting cookie jars between runs, and much more. Sessions are the backbone enabling these advanced workflows.
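As one example, here's a sketch of the retry pattern using Requests' standard HTTPAdapter (the retry numbers are just a starting point):
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retry))
session.mount('http://', HTTPAdapter(max_retries=retry))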
While mastering scrapers takes experience, sessions and headers give you fine-grained control of HTTP traffic. Learn them well and you'll be able to scrape less like a skiddie and more like a pro!