Web scraping allows you to extract data from websites and save it in a structured format like Excel. With ChatGPT, you can generate Python code to scrape websites without any prior coding knowledge. In this article, we'll see how to use ChatGPT to scrape a book website into an Excel sheet.
Overview
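Here's a quick overview of the process we'll cover: generating the scraping code with ChatGPT, running it to extract the data, and exporting the results to an Excel file.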
Generate Scraping Code with ChatGPT
To start, copy the URL of the website you want to scrape. For this example, we'll use a books website.
Next, go to ChatGPT and enter this prompt:
Generate Python code to scrape the title, link and price of all books from this URL into variables: [paste URL here]
ChatGPT will provide Python code to scrape the requested data from the site. It will look something like this:
import requests
from bs4 import BeautifulSoup

url = '[paste URL here]'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

titles = []
links = []
prices = []

for book in soup.find_all('div', class_='book'):
    title = book.h2.text
    link = book.a['href']
    price = book.find('span', class_='price').text
    titles.append(title)
    links.append(link)
    prices.append(price)
This code uses the Requests library to download the webpage content, then BeautifulSoup to parse the HTML and extract the data we want into lists.
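The tag and class names in generated code (here div.book, h2 and span.price) are ChatGPT's guesses and may not match the real page. If one of them is missing on any entry, a line like book.h2.text raises an AttributeError. The sketch below is one possible way to guard against that: it moves the parsing into a function (parse_books is a name made up for this example) and skips incomplete entries. The sample HTML is invented purely for illustration.

```python
from bs4 import BeautifulSoup


def parse_books(html):
    """Return a list of {'title', 'link', 'price'} dicts found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for book in soup.find_all("div", class_="book"):
        heading = book.find("h2")
        anchor = book.find("a", href=True)
        price = book.find("span", class_="price")
        # Skip entries that are missing any of the three fields
        if heading is None or anchor is None or price is None:
            continue
        books.append({
            "title": heading.get_text(strip=True),
            "link": anchor["href"],
            "price": price.get_text(strip=True),
        })
    return books


# Invented sample markup: the second entry has no price and is skipped
sample_html = """
<div class="book"><h2>Book One</h2><a href="/one">View</a><span class="price">$10.00</span></div>
<div class="book"><h2>No price here</h2><a href="/two">View</a></div>
"""
print(parse_books(sample_html))
```

Keeping the parsing separate from the download also means you can test it on a saved copy of the page without hitting the site again.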
Run the Code to Extract Data
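Copy the ChatGPT-generated code into a Python file and run it. If the libraries are not installed yet, install them first with pip install requests beautifulsoup4. The script scrapes the website and collects the extracted data in the titles, links and prices lists; it does not print anything on its own, so add a print() call for each list if you want to see the results in the terminal.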
You can modify the code as needed - for example, to extract additional data points or iterate through paginated content.
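As a rough sketch of the pagination case, the helper below looks for a "next" link and turns it into an absolute URL. It assumes the site marks that link up as an a tag inside li class="next"; that markup, and the function name next_page_url, are assumptions for this example, so check the real page's HTML before relying on it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed markup: <li class="next"><a href="page-2.html">next</a></li>
    next_link = soup.select_one("li.next a[href]")
    if next_link is None:
        return None
    # Relative links are resolved against the page they appear on
    return urljoin(current_url, next_link["href"])


# Invented sample markup for illustration
page_html = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
print(next_page_url(page_html, "https://example.com/catalogue/page-1.html"))
```

In a real script you would download a page, scrape it, call next_page_url on the same HTML, and repeat until it returns None.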
Format and Output Data to Excel
To get the scraped data into an Excel sheet, modify the Python script to:
- Import the Pandas library
- Create a Pandas DataFrame from the extracted data lists
- Use the to_excel() method to export the DataFrame to an Excel file
Here is how the script would look:
# Imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Scraping code
# Create DataFrame
df = pd.DataFrame({'Title': titles, 'Link': links, 'Price': prices})
# Export to Excel
df.to_excel('books.xlsx', index=False)
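Now when you run the script, it will generate an Excel file named books.xlsx with the scraped data! Note that pandas needs the openpyxl package installed (pip install openpyxl) to write .xlsx files.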
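Scraped prices arrive as text (for example '£51.77'), so Excel will treat the column as strings and you won't be able to sort or sum it. If you want numbers instead, one option is to convert the values before building the DataFrame. This is a minimal sketch that assumes prices use a dot as the decimal separator and a comma for thousands; parse_price is a helper invented for this example.

```python
import re


def parse_price(text):
    """Convert a price string such as '£51.77' to a float, or None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


# Invented sample values for illustration
sample_prices = ["£51.77", "£1,053.74", "N/A"]
numeric_prices = [parse_price(p) for p in sample_prices]
print(numeric_prices)
```

You could then pass the converted list as the 'Price' column when creating the DataFrame.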
Tips for Web Scraping with ChatGPT
Full Python Code for Scraping Books Website
Here is the full Python code, generated with ChatGPT, to scrape a books website into an Excel sheet:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

url = 'https://books.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

titles = []
prices = []
links = []

for book in soup.find_all('article', class_='product_pod'):
    # Get title
    title = book.find('h3').find('a')['title']
    titles.append(title)
    # Get price
    price = book.find(class_='price_color').get_text()
    prices.append(price)
    # Get link (the href is relative, so resolve it against the page URL)
    link = book.find('h3').find('a')['href']
    links.append(urljoin(url, link))

# Create dataframe and export to Excel
df = pd.DataFrame({'Title': titles, 'Price': prices, 'Link': links})
df.to_excel('books.xlsx', index=False)
This script scrapes the book title, price, and link from each book on the homepage. It stores the data in lists then exports to an Excel file.
So that's how you can use ChatGPT to generate web scraping code and output the data to Excel with little or no coding experience!
ChatGPT heralds an exciting new era in intelligent automation!
However, this approach also has some limitations:
A more robust solution is to use a dedicated web scraping API such as Proxies API.
With features like automatic IP rotation, user-agent rotation and CAPTCHA solving, Proxies API makes robust web scraping easy via a simple API:
curl "https://api.proxiesapi.com/?key=API_KEY&url=targetsite.com"
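If you'd rather make the same call from Python, something like the sketch below should work. It only uses the endpoint and the key and url parameters shown in the curl command above; the function names are made up for this example, and what the API sends back is not covered here, so the fetch function simply returns the response body as text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl example; no other parameters are assumed
API_ENDPOINT = "https://api.proxiesapi.com/"


def build_request_url(api_key, target_url):
    """Build the API request URL, percent-encoding the target URL."""
    query = urlencode({"key": api_key, "url": target_url})
    return f"{API_ENDPOINT}?{query}"


def fetch_via_api(api_key, target_url, timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(build_request_url(api_key, target_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Only builds the URL; replace API_KEY with a real key before calling fetch_via_api
print(build_request_url("API_KEY", "https://books.toscrape.com"))
```

Encoding the target URL matters once it contains its own query string, since an unencoded & would otherwise be read as a separate API parameter.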
Get started now with 1000 free API calls to supercharge your web scraping!