Introduction
Scraping dynamic websites that require logging in can be tricky. Often you can log in initially, only to be logged out when you try to access other pages. This article walks through how to keep a session alive when scraping login-protected sites with Python's requests library.
Overview
Here's a quick overview of what we'll cover:
- Inspecting the login form with browser developer tools
- Sending the login POST request
- Keeping the session alive with a session object
- Hiding credentials in a separate file
- A full code example
Inspecting the Login Form
The first step is analyzing the login form and the POST request it sends. This can be done using the Network panel in your browser's developer tools:
Key Steps
- Open your browser's developer tools and switch to the Network panel
- Submit the login form
- Find the POST request sent to the login URL
- Note the request URL and the names of the form fields (e.g. username and password) in the payload
This will give us the information needed to mimic the login request in Python.
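Some login forms also include hidden fields, such as a CSRF token, that must be submitted along with the credentials. As a minimal sketch (using this article's placeholder URL, and assuming the site puts such values in standard hidden input fields), you can fetch the login page first and collect them with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

login_url = 'https://website.com/login'  # placeholder URL used in this article

# Fetch the login page and collect any hidden inputs the form expects
resp = requests.get(login_url)
soup = BeautifulSoup(resp.text, 'html.parser')
hidden_fields = {
    tag['name']: tag.get('value', '')
    for tag in soup.select('input[type=hidden]')
    if tag.has_attr('name')
}
# hidden_fields can then be merged into the login payload,
# e.g. payload.update(hidden_fields)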
Sending Login Request
We can now send a POST request to the login URL with the payload:
import requests

login_url = 'https://website.com/login'

payload = {
    'username': 'myusername',
    'password': 'mypassword'
}

response = requests.post(login_url, data=payload)
This will log us in. However, we are not yet maintaining the session: the cookies the server sets at login are discarded after this single request.
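Before moving on, it's worth checking that the login actually worked. A minimal sketch, assuming the logged-in page contains a 'Logout' link (a hypothetical marker; use whatever reliably appears only after login on your target site):
# A 200 status alone doesn't prove the login worked -- many sites return
# 200 with an error page. Look for a marker that only appears when
# logged in ('Logout' here is a hypothetical example).
if response.ok and 'Logout' in response.text:
    print('Logged in successfully')
else:
    print('Login may have failed -- inspect response.text')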
Keeping the Session Alive
To stay logged in across requests, we need to use a session object:
with requests.Session() as session:
    session.post(login_url, data=payload)
    r = session.get('https://website.com/restricted')
    # successful as we are logged in!
This will allow us to access restricted pages successfully after logging in.
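Under the hood, the session object stores the cookies the server sets at login and resends them with every subsequent request. You can inspect them to confirm the session cookie was captured (the cookie name varies by site and framework):
with requests.Session() as session:
    session.post(login_url, data=payload)
    # The session's cookie jar now holds whatever the server set at login,
    # e.g. a 'sessionid' cookie (the name is site-specific)
    for cookie in session.cookies:
        print(cookie.name, cookie.value)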
Hiding Credentials
It's good practice to keep credentials in a separate file:
# cred.py
username = 'myusername'
password = 'mypassword'

# main.py
import cred

payload = {
    'username': cred.username,
    'password': cred.password
}
This avoids exposing sensitive info when sharing your main code file. Just remember to exclude cred.py from version control (e.g. via .gitignore) so it never gets committed.
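An alternative that avoids a credentials file altogether is reading the values from environment variables. A minimal sketch (the SCRAPER_USER and SCRAPER_PASS variable names are assumptions; pick whatever fits your setup):
import os

# Read credentials from the environment (hypothetical variable names);
# set them in your shell first, e.g. export SCRAPER_USER=myusername
payload = {
    'username': os.environ['SCRAPER_USER'],
    'password': os.environ['SCRAPER_PASS']
}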
Full Code Example
Below is the full code for scraping a site with login using this approach:
import requests
from bs4 import BeautifulSoup
import cred

login_url = 'https://website.com/login'
restricted_page = 'https://website.com/restricted'

payload = {
    'username': cred.username,
    'password': cred.password
}

with requests.Session() as session:
    session.post(login_url, data=payload)
    r = session.get(restricted_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Continue scraping/parsing data from soup here...
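    # As a hypothetical illustration of that parsing step, extract headlines
    # (the 'article h2' selector is an assumption -- adapt it to the
    # target site's real markup):
    for heading in soup.select('article h2'):
        print(heading.get_text(strip=True))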
Summary
Using this approach, you can now scrape data from websites that require login with Python.
While these tools are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.
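For instance, requests supports routing traffic through proxies via its proxies setting, either per request or on the whole session. A minimal sketch (the proxy address and credentials are placeholders; substitute your provider's endpoint):
# Route the whole session through a proxy (placeholder address --
# substitute your proxy provider's host, port, and credentials)
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}

with requests.Session() as session:
    session.proxies.update(proxies)
    session.post(login_url, data=payload)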
Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.
This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.