🚀 Extract, Transform, and Load (ETL) real-time Reddit data, analyze sentiment, and store it in a star-schema data warehouse.
This project extracts trending Reddit posts from a subreddit, performs sentiment analysis, and stores structured data in a PostgreSQL data warehouse using a star schema.
✅ ETL Pipeline: Extracts, cleans, and loads real-time Reddit data.
✅ Sentiment Analysis: Uses TextBlob's polarity scoring to categorize sentiment as Positive, Neutral, or Negative (a minimal sketch follows this feature list).
✅ Data Warehouse: Stores posts in a structured star schema for optimized querying.
✅ SQL Optimization: Enables fast queries for insights into top trends and discussions.
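The sentiment step boils down to thresholding TextBlob's polarity score. Here is a minimal sketch of that idea; the `categorize_sentiment` helper and the ±0.05 cutoff are illustrative assumptions, not necessarily what the project's code uses:

```python
from textblob import TextBlob

def categorize_sentiment(text: str) -> str:
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.05:
        return "Positive"
    if polarity < -0.05:
        return "Negative"
    return "Neutral"

print(categorize_sentiment("This release is fantastic"))  # likely "Positive"
```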
🔹 Python – For data extraction, transformation, and loading
🔹 Reddit API (PRAW) – Fetches live Reddit data
🔹 PostgreSQL – Stores structured data
🔹 SQLAlchemy – Connects Python to PostgreSQL
🔹 TextBlob – Performs sentiment analysis
🔹 Pandas – Data transformation
🔹 Git – Version control
┌──────────────────────────┐
│    Reddit API (PRAW)     │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│   Python ETL Script      │
│  - Extract: Fetch posts  │
│  - Transform: Clean data │
│  - Sentiment Analysis    │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│ PostgreSQL (Star Schema) │
│  - fact_posts            │
│  - dim_authors           │
│  - dim_subreddit         │
│  - dim_date              │
└────────────┬─────────────┘
             ▼
┌──────────────────────────┐
│  SQL Queries & Insights  │
│  - Top trending posts    │
│  - Sentiment breakdown   │
│  - Most active users     │
└──────────────────────────┘
git clone https://github.com/yourusername/reddit-trend-tracker.git
cd reddit-trend-tracker
Ensure you have Python 3.8+ installed. Then, install the required libraries:
pip install praw pandas textblob sqlalchemy psycopg2
Make sure to set up your Reddit API credentials (client ID, client secret, and user agent) as described in the PRAW documentation.
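For reference, authenticating with PRAW typically looks like the snippet below. The credential values are placeholders for the ones you create at https://www.reddit.com/prefs/apps (you can also keep them in a `praw.ini`, as the PRAW docs describe):

```python
import praw

# Placeholder credentials – replace with the values from your Reddit app.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-trend-tracker by u/your_username",
)

# Sanity check: without a username/password PRAW runs in read-only mode.
print(reddit.read_only)
```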
- Open PostgreSQL and create a new database:
CREATE DATABASE reddit_dw;
- Run the SQL schema script to set up tables:
psql -U your_user -d reddit_dw -f setup_schema.sql
**`fact_posts`**

| Column | Type | Description |
|---|---|---|
| post_id | VARCHAR(20) | Reddit post ID |
| author_id | INT | Foreign key to dim_authors |
| subreddit_id | INT | Foreign key to dim_subreddit |
| score | INT | Upvotes for the post |
| num_comments | INT | Number of comments on the post |
| sentiment | VARCHAR(20) | Positive, Negative, or Neutral |
| date_id | INT | Foreign key to dim_date |
**`dim_authors`**

| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Unique author ID |
| author_name | VARCHAR(255) | Reddit username |
**`dim_subreddit`**

| Column | Type | Description |
|---|---|---|
| id | SERIAL (PK) | Unique subreddit ID |
| subreddit_name | VARCHAR(255) | Subreddit name |
**`dim_date`**

| Column | Type | Description |
|---|---|---|
| date_id | SERIAL (PK) | Unique date ID |
| date_value | DATE | Actual date |
| year | INT | Year of post |
| month | INT | Month of post |
| day_of_week | INT | Day of the week |
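The actual DDL lives in `setup_schema.sql`. As a hedged illustration of the same star schema, here is a SQLAlchemy Core sketch that would create equivalent tables; column names follow the tables above, while treating `post_id` as the fact table's primary key and the connection string are assumptions:

```python
from sqlalchemy import (
    create_engine, MetaData, Table, Column,
    Integer, String, Date, ForeignKey,
)

metadata = MetaData()

dim_authors = Table(
    "dim_authors", metadata,
    Column("id", Integer, primary_key=True),   # SERIAL surrogate key
    Column("author_name", String(255)),
)

dim_subreddit = Table(
    "dim_subreddit", metadata,
    Column("id", Integer, primary_key=True),
    Column("subreddit_name", String(255)),
)

dim_date = Table(
    "dim_date", metadata,
    Column("date_id", Integer, primary_key=True),
    Column("date_value", Date),
    Column("year", Integer),
    Column("month", Integer),
    Column("day_of_week", Integer),
)

fact_posts = Table(
    "fact_posts", metadata,
    Column("post_id", String(20), primary_key=True),
    Column("author_id", Integer, ForeignKey("dim_authors.id")),
    Column("subreddit_id", Integer, ForeignKey("dim_subreddit.id")),
    Column("score", Integer),
    Column("num_comments", Integer),
    Column("sentiment", String(20)),
    Column("date_id", Integer, ForeignKey("dim_date.date_id")),
)

# Assumed connection string – adjust user, password, and host as needed.
engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost/reddit_dw")
metadata.create_all(engine)  # creates the tables if they do not already exist
```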
Run the Python script to fetch and process Reddit posts:
python extract_transform.py
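The repository's `extract_transform.py` is the source of truth; the sketch below only illustrates what such a script might do. The subreddit name, post limit, and the intermediate CSV filename are assumptions:

```python
import pandas as pd
import praw
from textblob import TextBlob

# Placeholder credentials – see the PRAW setup step above.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-trend-tracker by u/your_username",
)

rows = []
for post in reddit.subreddit("dataengineering").hot(limit=100):  # assumed subreddit and limit
    polarity = TextBlob(post.title).sentiment.polarity
    rows.append({
        "post_id": post.id,
        "author_name": str(post.author) if post.author else "[deleted]",
        "subreddit_name": str(post.subreddit),
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
        "sentiment": "Positive" if polarity > 0.05 else "Negative" if polarity < -0.05 else "Neutral",
    })

df = pd.DataFrame(rows)
df.to_csv("reddit_posts.csv", index=False)  # assumed hand-off file for the load step
print(f"Extracted and transformed {len(df)} posts")
```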
After extracting data, load it into your database:
python load_to_postgres.py
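Again, the real logic is in `load_to_postgres.py`. As a hedged sketch, the load step could land the extract in a staging table and then populate the star schema; the staging-table approach, connection string, and file name are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumed connection string and hand-off file – adjust to your environment.
engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost/reddit_dw")
df = pd.read_csv("reddit_posts.csv")

# Land the raw extract in a staging table first.
df.to_sql("staging_posts", engine, if_exists="replace", index=False)

# Populate one dimension from staging; the other dimensions and fact_posts
# follow the same pattern (resolve surrogate keys, then insert).
with engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO dim_authors (author_name)
        SELECT DISTINCT s.author_name
        FROM staging_posts s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_authors d WHERE d.author_name = s.author_name
        )
    """))

print(f"Loaded {len(df)} rows into staging_posts")
```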
-- Top 5 trending posts by score
SELECT p.post_id, a.author_name, p.score
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.id
ORDER BY p.score DESC
LIMIT 5;
-- Sentiment breakdown across all posts
SELECT sentiment, COUNT(*) AS post_count
FROM fact_posts
GROUP BY sentiment
ORDER BY post_count DESC;
-- Top 10 most active users by post count
SELECT a.author_name, COUNT(*) AS post_count
FROM fact_posts p
JOIN dim_authors a ON p.author_id = a.id
GROUP BY a.author_name
ORDER BY post_count DESC
LIMIT 10;
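If you prefer to explore these results from Python rather than `psql`, pandas can run the same SQL against the warehouse; a small sketch (connection string is an assumption):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://your_user:your_password@localhost/reddit_dw")

sentiment_breakdown = pd.read_sql(
    "SELECT sentiment, COUNT(*) AS post_count "
    "FROM fact_posts GROUP BY sentiment ORDER BY post_count DESC",
    engine,
)
print(sentiment_breakdown)
```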
- Enhance Sentiment Analysis: Use NLTK or a machine learning model instead of TextBlob.
- Streamline with Apache Airflow: Automate ETL pipeline scheduling (a minimal DAG sketch follows this list).
- Deploy on AWS: Store processed data in Amazon S3, run queries using AWS Athena.
- Create a Web Dashboard: Use Tableau or Power BI to visualize Reddit trends.
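For the Airflow idea, a daily DAG could simply chain the two scripts. This is only a sketch assuming Airflow 2.4+; the DAG id, schedule, and script paths are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_trend_tracker",  # assumed DAG id
    schedule="@daily",              # assumed schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_transform = BashOperator(
        task_id="extract_transform",
        bash_command="python /opt/reddit-trend-tracker/extract_transform.py",  # assumed path
    )
    load = BashOperator(
        task_id="load_to_postgres",
        bash_command="python /opt/reddit-trend-tracker/load_to_postgres.py",   # assumed path
    )

    extract_transform >> load  # run the load only after extraction succeeds
```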
🌟 Ready to dive into the data? Clone, run, and explore the trends!
📌 GitHub Repository: 🔗 Link to Repo