Skip to content

Emmanuel10701/Data_Scraping

Repository files navigation

E-commerce Reviews Scraping

This Python project scrapes customer reviews from an e-commerce website (or a local HTML file) and saves the extracted data into both CSV and Excel formats. It uses libraries like BeautifulSoup, Pandas, Requests, and openpyxl to achieve this.

Requirements

Before using this project, ensure that Python is installed on your machine, and the necessary libraries are set up:

  1. Python Installation:

    • Make sure you have Python installed on your system. You can download it from python.org.
    • After installation, verify by running the following command in your terminal:
      python --version
      or
      python3 --version
      This should print the Python version (e.g., Python 3.x.x).
  2. Library Installation: The following Python libraries are required to run this project:

    • pandas: For handling and saving the scraped data.
    • beautifulsoup4: For parsing and extracting data from HTML.
    • requests: For sending HTTP requests to scrape data from a live URL.
    • openpyxl: For saving data to an Excel file.

    To install the required libraries, open your terminal and run the following command:

    pip install pandas beautifulsoup4 requests openpyxl

Usage

  1. Scraping from a Live Website:

    • Update the url variable in the script with the target e-commerce website URL.
    • Run the script, and it will fetch and parse the reviews.
  2. Scraping from a Local HTML File:

    • Save the e-commerce page as an HTML file.
    • Update the script to read from the local file instead of making an HTTP request.
  3. Saving the Data:

    • The script extracts key information such as review text, rating, author, and date.
    • The data is then saved into both reviews.csv and reviews.xlsx.

Example Output

A sample of the extracted data:

Author Rating Review Date
JohnDoe 5 "Great product!" 2024-03-10
JaneSmith 4 "Good value for money." 2024-03-11

Notes

  • Ensure compliance with the website's robots.txt and terms of service before scraping.
  • If the website uses JavaScript to load reviews dynamically, consider using Selenium or Scrapy for advanced scraping techniques.

Future Enhancements

  • Implement multi-threading for faster scraping.
  • Support for additional data formats (JSON, SQLite database).
  • Integration with sentiment analysis for review insights.