URLs may seem like simple strings of text, but they actually contain a wealth of structured data. Being able to efficiently extract parts of a URL is an invaluable skill for any developer working with web technologies. In this guide, I'll walk you through 5 simple steps to extract hostnames, paths, query parameters, and more from URLs in your code.
1. Parse the URL into components
Most programming languages provide built-in libraries for parsing URLs. For example, in Python:
from urllib.parse import urlparse
url = 'https://www.example.com/path/to/page?foo=bar&baz=1'
parsed = urlparse(url)
This breaks the URL down into distinct components that we can access:
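print(parsed.scheme)   # 'https'
print(parsed.netloc)   # 'www.example.com'
print(parsed.path)     # '/path/to/page'
print(parsed.query)    # 'foo=bar&baz=1'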
So even with just the standard library, we can easily extract the key parts of a URL.
2. Get the query parameters
To get at the data in the query string, we use the parse_qs helper from the same module:
from urllib.parse import parse_qs
query = parse_qs(parsed.query)
print(query['foo'][0]) # 'bar'
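Keep in mind that parse_qs maps every key to a list of values, since a parameter can appear more than once. If you mostly deal with single-valued parameters, a small helper keeps call sites tidy; here's a minimal sketch that works on the query dict above (get_param is just an illustrative name):
def get_param(query, name, default=None):
    """Return the first value for a query parameter, or a default if it's absent."""
    values = query.get(name, [])
    return values[0] if values else default

print(get_param(query, 'baz'))             # '1'
print(get_param(query, 'missing', 'n/a'))  # 'n/a'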
3. Validate the hostname
Often you'll want to validate that a URL is intended for your site or API. To get the hostname:
print(parsed.hostname) # 'www.example.com'
And compare against a list of allowed hosts:
ALLOWED_HOSTS = ['www.example.com', 'example.com']
if parsed.hostname not in ALLOWED_HOSTS:
    raise ValueError('Invalid hostname')
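If the URL comes from user input (redirect targets, webhook endpoints, and so on), it's also worth checking the scheme before trusting it. Here's a minimal sketch along those lines; is_allowed_url is just an illustrative name:
from urllib.parse import urlparse

ALLOWED_HOSTS = ['www.example.com', 'example.com']

def is_allowed_url(url):
    """Accept only http(s) URLs whose hostname is on the allow list."""
    parsed = urlparse(url)
    return parsed.scheme in ('http', 'https') and parsed.hostname in ALLOWED_HOSTS

print(is_allowed_url('https://example.com/login'))  # True
print(is_allowed_url('ftp://example.com/login'))    # False (wrong scheme)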
4. Extract parts of the path
Paths can contain useful slugs and IDs. To extract a section:
path = parsed.path # '/path/to/page'
parts = path.strip('/').split('/')  # strip the leading slash so there's no empty first element
print(parts[2]) # 'page'
Use str.strip('/') before splitting, as above, so the leading slash doesn't leave an empty first element; negative indexing (parts[-1]) is also a handy way to grab the last segment.
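Path segments often carry record IDs. As a sketch, assuming a hypothetical route shaped like /users/<id>/posts, you could pull the ID out like this:
from urllib.parse import urlparse

parsed = urlparse('https://www.example.com/users/42/posts')
segments = parsed.path.strip('/').split('/')  # ['users', '42', 'posts']

if len(segments) >= 2 and segments[0] == 'users':
    user_id = int(segments[1])
    print(user_id)  # 42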
5. Reconstruct the URL
Once you've extracted the data you need, reconstruct the URL:
from urllib.parse import urlunparse
new_path = '/new/path'  # example: the modified path you want to swap in

url = urlunparse((
    parsed.scheme,
    parsed.netloc,  # netloc rather than hostname, so any port or credentials survive
    new_path,
    parsed.params,
    parsed.query,
    parsed.fragment,
))
Putting the pieces back together makes it easy to modify URLs programmatically.
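Since urlparse returns a named tuple, another handy pattern is to swap in a new component with _replace and call geturl() to rebuild the string:
from urllib.parse import urlparse

parsed = urlparse('https://www.example.com/path/to/page?foo=bar&baz=1')
new_url = parsed._replace(path='/new/path').geturl()
print(new_url)  # 'https://www.example.com/new/path?foo=bar&baz=1'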
Key Takeaways
With these basic tools, you can efficiently extract all kinds of data from URLs in your code to power your web scraping, APIs, redirects, and more!