Web scraping refers to the automated extraction of data from websites. You may be wondering: why is it called "scraping" websites rather than simply extracting or collecting data?
The term has its origins in the early days of the web when websites were mostly static HTML pages. Developers would write programs to systematically download web pages and "scrape" the relevant data from the raw HTML. It was like scraping bits of information from different pages.
For example, back then a simple web scraper might:
1. Fetch the HTML of a product page
2. Use regular expressions to scrape the product title, description, and price
3. Store the scraped data in a database
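The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real scraper: the HTML snippet, CSS class names, and field patterns are all hypothetical, and the page is hard-coded in place of the HTTP fetch in step 1.

```python
import re
import sqlite3

# Step 1 (simulated): in an early scraper this HTML would come from an HTTP fetch
html = """
<html><body>
<h1 class="title">Acme Widget</h1>
<p class="description">A sturdy widget for everyday use.</p>
<span class="price">$19.99</span>
</body></html>
"""

# Step 2: regular expressions pull fields straight out of the raw markup
title = re.search(r'<h1 class="title">(.*?)</h1>', html).group(1)
description = re.search(r'<p class="description">(.*?)</p>', html).group(1)
price = float(re.search(r'<span class="price">\$([\d.]+)</span>', html).group(1))

# Step 3: store the scraped data in a database (in-memory here for the example)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, description TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (?, ?, ?)", (title, description, price))
row = conn.execute("SELECT title, price FROM products").fetchone()
```

Regex-on-HTML is exactly the brittle technique those early scrapers used; it breaks the moment the markup changes, which is part of why dedicated HTML parsers later took over.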
So web scraping meant programmatically extracting semi-structured data from HTML. The term stuck even as websites became more dynamic and scrapers evolved to render JavaScript-heavy pages with headless browsers before extracting data.
These days, web scraping serves many purposes: price monitoring, market research, content aggregation, and building datasets, among others.
However, while convenient, web scraping comes with caveats around site terms of service, data freshness, and scale limits. Well-behaved scrapers include throttling, caching, proxies, and user-agent rotation.
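Three of those safeguards (throttling, caching, and user-agent rotation) can be sketched in a small wrapper. The `PoliteFetcher` name, the user-agent pool, and the injected `download` callable are all illustrative assumptions; a real scraper would plug in an actual HTTP client and add proxy support.

```python
import random
import time

# Hypothetical pool of user-agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class PoliteFetcher:
    """Throttles requests, caches responses, and rotates user agents."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # minimum seconds between requests
        self._last_request = 0.0
        self._cache = {}             # naive URL -> response-body cache

    def headers(self):
        # A fresh random user agent for each request
        return {"User-Agent": random.choice(USER_AGENTS)}

    def throttle(self):
        # Sleep only as long as needed to honour the minimum delay
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()

    def fetch(self, url, download):
        # `download` is any callable that performs the actual HTTP request
        if url in self._cache:
            return self._cache[url]  # cached: no repeat hit on the site
        self.throttle()
        body = download(url, self.headers())
        self._cache[url] = body
        return body

fetcher = PoliteFetcher(min_delay=0.1)
# A stub downloader stands in for a real HTTP client in this sketch
result = fetcher.fetch("https://example.com/p/1", lambda url, h: f"page for {url}")
```

Injecting the downloader keeps the politeness logic testable without network access; in production it would wrap something like a `requests` session.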
In short: the name was coined when scrapers literally "scraped" bits of data out of raw HTML pages, and it stuck even as the techniques advanced!