Wget is a powerful command-line utility for downloading content from the web. This article will explore two main methods for harnessing Wget functionality in Python scripts:
- Using the Wget module
- Calling the Wget command via subprocess
Both approaches have their own pros and cons, which I'll cover in detail below. Overall, Wget is extremely useful for web scraping and automation tasks thanks to features like:
- Recursive downloading of entire site structures
- Resuming interrupted downloads
- Bandwidth throttling (rate limiting)
- Custom user agent strings and headers
Let's look at how we can leverage these capabilities by invoking Wget from Python.
Prerequisites
Before using Wget from Python, you will need:
- Python 3 installed on your system
- The wget command-line tool installed and available on your PATH (for the subprocess approach)
- The wget Python package, installed with pip install wget (for the module approach)
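Since the subprocess approach shells out to the wget binary, it can be worth verifying that the tool is actually available before running any downloads. A minimal sanity-check sketch using only the standard library:

import shutil

# Abort early if the wget executable cannot be found on PATH
if shutil.which('wget') is None:
    raise SystemExit('wget is not installed or not on PATH')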
Importing the Wget Module
Python's third-party wget module (installed with pip install wget) exposes a convenient download function:
import wget
url = 'http://example.com/file.pdf'
wget.download(url) # Download to local file
Some of the useful options this unlocks include:
- An out argument to choose the output filename or directory (see the example below)
- A bar argument to customize or silence the progress bar shown during the download
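For instance, the out argument lets you pick the destination path. A small sketch, where the output filename is just a placeholder:

import wget

url = 'http://example.com/file.pdf'

# Save under an explicit name instead of the one derived from the URL
filename = wget.download(url, out='report.pdf')
print(filename)  # path of the file that was written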
This wraps Wget functionality into simple Python method calls. However, we lose fine-grained control over arguments and options compared to the command line interface. When more configurability is needed, invoking Wget directly often works better.
Calling Wget via Subprocess
Python's subprocess module allows executing external programs from scripts and fetching the response data. Here is a simple example:
import subprocess
subprocess.run(['wget', 'http://example.com/files.zip'])
# Downloads files.zip from example.com
By passing a list argument, we can add any flags and options as additional elements:
subprocess.run([
'wget',
'--limit-rate=100k',
'--user-agent=Custom User Agent String',
'https://example.com/page.html'
])
This provides complete access to the full capabilities of the Wget CLI. Some things that may require using subprocess over the Wget module include:
- Recursive downloads and site mirroring
- Resuming partially completed downloads
- Rate limiting, custom user agents, and other header tweaks
- Any other flag that the module's simple API does not expose
The tradeoff is that subprocess introduces more complexity: we need to check exit codes, inspect output for errors, and manage long-running downloads ourselves.
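For instance, here is one way to check whether a download succeeded; the URL is a placeholder and the wget binary is assumed to be on PATH:

import subprocess

result = subprocess.run(
    ['wget', '-q', 'https://example.com/files.zip'],
    capture_output=True,
    text=True,
)

# wget exits with status 0 on success; anything else signals a failure
if result.returncode != 0:
    print('Download failed:', result.stderr)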
Wget Functionalities
Wget is packed with tons of useful features for downloading web content. Here are some of the main functions it provides:
Recursive Downloading
Wget can mirror entire website structures with the -r (recursive) flag:
subprocess.run([
'wget',
'-r',
'-N',
'-np',
'https://example.com'
])
This crawls all linked pages recursively. Helpful options include:
- -N: timestamping, so only files newer than the local copies are re-downloaded
- -np: no parent, which stops the crawl from ascending into parent directories
- -l: a depth limit for the recursion (for example, -l 2)
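To be gentler on the target server, the recursive flags can be combined with a delay between requests and a dedicated download directory. A rough sketch, with the directory name chosen as an example:

import subprocess

subprocess.run([
    'wget',
    '-r',              # recurse into linked pages
    '-np',             # never ascend to the parent directory
    '-N',              # skip files that are not newer than the local copy
    '--wait=1',        # pause one second between requests
    '-P', 'mirror',    # save everything under the mirror/ directory
    'https://example.com'
])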
Resuming Downloads
If a download gets interrupted, resume it with the -c (continue) flag:
subprocess.run([
'wget',
'-c',
'https://example.com/large_file.zip'
])
This keeps the bytes already downloaded and fetches only the remaining portion of the file.
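On a flaky connection you can go one step further and retry the same command a few times, letting -c pick up wherever the previous attempt stopped. A minimal sketch, with the retry count chosen arbitrarily:

import subprocess

url = 'https://example.com/large_file.zip'

# Retry up to three times; -c makes each attempt continue the partial file
for attempt in range(3):
    result = subprocess.run(['wget', '-c', url])
    if result.returncode == 0:
        break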
User Agents
Spoof a custom user agent string with the --user-agent option:
subprocess.run([
'wget',
'--user-agent=Custom Browser 1.0',
'https://example.com'
])
This helps avoid blocks from servers that restrict automated clients.
Rate Limiting
Throttle download speed with the --limit-rate option:
subprocess.run([
'wget',
'--limit-rate=3000k',
'https://example.com/video_file.mp4'
])
This prevents downloads from consuming too much bandwidth.
There are many more options: timestamping, custom headers, page requisites, and so on. Refer to the Wget documentation for the full list.
By scripting the command-line Wget tool, you can leverage any of these versatile options for custom web scraping jobs.
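As a closing illustration, here is a sketch that combines several of the options covered above in a single call; the values are placeholders to adapt to your own job:

import subprocess

subprocess.run([
    'wget',
    '--limit-rate=500k',                # keep bandwidth usage modest
    '--user-agent=Custom Browser 1.0',  # identify the client explicitly
    '-c',                               # resume if a partial file already exists
    '-P', 'downloads',                  # store results in the downloads/ directory
    'https://example.com/video_file.mp4'
])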
Conclusion
In summary, Python and Wget work extremely well together for web scraping tasks. The wget module provides a simple API for basic downloads, while subprocess grants full access to Wget's advanced configuration options. Combining these approaches unlocks the benefits of Wget behind an easy-to-use Python interface.
Some examples where using Wget from Python shines:
- Bulk-downloading large sets of files on a schedule
- Mirroring documentation or whole sites for offline use
- Resuming large downloads over unreliable connections
- Bandwidth-limited scraping jobs that run alongside other traffic