Skip to content

End-to-end data warehousing project integrating APIs, ETL workflows, and PostgreSQL for analytics and reporting.

License

Notifications You must be signed in to change notification settings

melinteflxrin/SoftServe-BigData-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Warehousing Project:
Health, Fitness & Nutrition Analytics

This project is a comprehensive data warehousing solution designed for a hypothetical health-tech startup.
It demonstrates best practices in ETL, data governance, privacy, and analytics with interactive dashboards and role-based access control.


Table of contents

  1. Description
  2. Business Requirements & Goals
  3. Reports, Dashboards & KPIs
  4. Data Warehouse Design, Tables & Sources
    4.1 APIs and Data Sources
    4.2 ETL Process
    4.3 Schemas
  5. Database Administration & Data Governance
  6. Graphical User Interface (GUI)

1. Description

This repository showcases an end-to-end data warehouse solution designed for a fictional health-tech startup.
The scenario centers on a mobile application that helps users monitor their health and wellness by tracking activity, sleep, and nutrition data, with a data engineer responsible for building and maintaining the pipelines that enable secure, reliable analysis of this information.

The project demonstrates how to:

  • Integrate and process data from multiple sources (including synthetic data and public APIs).
  • Implement robust ETL pipelines for cleaning, validating, and transforming raw data into analytics-ready tables.
  • Enforce strong data governance, privacy, and security practices, including the separation of PII and non-PII data.
  • Provide interactive dashboards and key performance indicators (KPIs) for user activity, sleep, nutrition, and goal achievement.
  • Apply role-based access control to protect sensitive data.

Note:
This project is for educational and demonstration purposes only. The app and its data are entirely fictional and intended to showcase best practices in data engineering, warehousing, and analytics.


2. Business Requirements & Goals

Business Requirements

  • Integrate and process user health data (activity, sleep, nutrition) from multiple sources.
  • Ensure data privacy by separating and protecting PII and non-PII data.
  • Maintain high data quality and secure, role-based access.
  • Support analytics and reporting with trusted, well-structured data.

Core Business Goals

  • Provide actionable insights on user health and goal achievement.
  • Demonstrate best practices in data governance and lifecycle management.
  • Enable interactive dashboards and KPIs for key health metrics.

3. Reports, Dashboards & KPIs

Note:
Before generating dashboards, you must first extract, transform, and load the data by running the following scripts in order:

  1. src/extract/healthapp.py
  2. src/transform/transform.py
  3. src/load/load.py

Interactive dashboards can be generated by running the provided script:

python src/dashboard/dashboard.py

This script (src/dashboard/dashboard.py) automatically creates and visualizes dashboards for your key health and fitness metrics using the trusted views.

Dashboards and KPIs included:

  • Goal Achievement:

  • Nutrition:

    • User's favourite food and typical meal time
    • Average macronutrient distribution per user
  • Activity:

  • All dashboards:

    • Displayed together for easy comparison
Dashboard Example Output

Dashboard Example
Dashboard example shown for 10 generated users across 7 days


4. Data Warehouse Design, Tables & Sources

4.1 APIs and Data Sources

  • HealthApp: Synthetic data is generated using the healthapp.py script.
    • To generate raw data, run the following command:
      python src/extract/healthapp.py
    • This script creates the necessary schemas and tables in the raw schema (if they do not already exist) and populates them with raw user data (see raw.user_data), for activity, sleep, nutrition, and goals.
  • USDA API: Nutritional data for food items is fetched dynamically via API calls within the healthapp.py script.

4.2 ETL Process

Show ETL Pipeline Diagram

ETL Pipeline
Overview of the ETL pipeline: Extract, Transform, Load.


The ETL pipeline consists of the following stages, each with a dedicated script:

1. Extract

Show Sample Raw User Data

Sample Raw Data
Example of the raw.user_data table as generated by the extract script.

Show Sample Raw Nutrition Log

Sample Nutrition Log
Example of the raw.nutrition_log table as generated by the extract script.


2. Transform

Show Sample Staging User Profile

Sample Staging User Profile
Example of the staging.dim_user_profile table after transformation.

Show Sample Staging Nutrition Log

Sample Staging Nutrition Log
Example of the staging.fact_nutrition_log table after transformation.


3. Load

Note:
Each script will automatically create the required tables in its schema if they do not already exist.

Show Sample Trusted Nutrition

Sample Trusted Nutrition Data
Example of the trusted.nutrition_data table after loading.

Show Sample Trusted Goals

Sample Trusted Goals Data
Example of the trusted.goals_data table after loading.


4.3 Schemas

The data warehouse is organized into three schemas: raw, staging, and trusted. Each schema serves a specific purpose in the ETL pipeline:

Raw Schema

  • Purpose: The initial storage for raw, unprocessed data directly extracted from the sources.

raw.user_data

Column Name Data Type Description
record_id SERIAL PRIMARY KEY Unique identifier for each record.
user_id INTEGER Unique identifier for the user.
name VARCHAR(100) User's name.
age INTEGER User's age.
weight_kg NUMERIC(4,1) User's weight in kilograms.
height_cm NUMERIC(4,1) User's height in centimeters.
gender VARCHAR(10) User's gender.
calorie_goal INTEGER Daily calorie goal for the user.
macro_goal JSON JSON object containing macro goals (carbs, protein, fat).
activity_start TIMESTAMP Start time of the activity.
activity_type VARCHAR(50) Type of activity (e.g., walking, running).
steps INTEGER Daily step count
heart_rate INTEGER Heart rate during the activity.
calories_burned INTEGER Calories burned during the day.
sleep_start TIMESTAMP Start time of sleep.
sleep_end TIMESTAMP End time of sleep.
sleep_quality_score INTEGER Quality score of sleep.
goal_type VARCHAR(50) Type of goal (e.g., calories burned, steps taken).
goal_target INTEGER Target value for the goal.
created_at TIMESTAMP Timestamp when the record was created.

raw.nutrition_log

Column Name Data Type Description
nutrition_id SERIAL PRIMARY KEY Unique identifier for the nutrition record.
user_id INTEGER Unique identifier for the user.
date DATE Date of the nutrition log.
food_item VARCHAR(255) Name of the food item.
meal_type VARCHAR(100) Type of meal (e.g., breakfast, lunch).
calories_per_100g INTEGER Calories per 100 grams of the food item.
carbs_per_100g INTEGER Carbohydrates per 100 grams of the food item.
protein_per_100g INTEGER Protein per 100 grams of the food item.
fat_per_100g INTEGER Fat per 100 grams of the food item.

Staging Schema

  • Purpose: Stores cleaned and transformed data, ready for further processing.
Show Staging Schema Star Diagram

Staging Star Diagram
Star diagram for the staging schema, illustrating fact and dimension tables and their relationships.

staging.dim_user_profile

Column Name Data Type Description
user_id BIGINT PRIMARY KEY Unique identifier for the user.
name VARCHAR(255) User's name.
age INTEGER User's age.
weight_kg DECIMAL(4,1) User's weight in kilograms.
height_cm DECIMAL(4,1) User's height in centimeters.
gender VARCHAR(50) User's gender.
calorie_goal INTEGER Daily calorie goal for the user.
carbs_goal INTEGER Daily carbohydrate goal for the user.
protein_goal INTEGER Daily protein goal for the user.
fat_goal INTEGER Daily fat goal for the user.

staging.dim_food_item

Column Name Data Type Description
food_item_id BIGINT PRIMARY KEY Unique identifier for the food item.
food_item VARCHAR(255) Name of the food item.
calories_per_100g DECIMAL(4,0) Calories per 100 grams of the food item.
carbs_per_100g DECIMAL(3,0) Carbohydrates per 100 grams of the food item.
protein_per_100g DECIMAL(3,0) Protein per 100 grams of the food item.
fat_per_100g DECIMAL(3,0) Fat per 100 grams of the food item.

staging.fact_activity_log

Column Name Data Type Description
activity_id BIGINT PRIMARY KEY Unique identifier for the activity record.
user_id BIGINT Unique identifier for the user.
timestamp TIMESTAMP Timestamp of the activity.
activity_type VARCHAR(100) Type of activity (e.g., walking, running).
steps INTEGER Number of steps taken during the activity.
heart_rate INTEGER Heart rate during the activity.
calories_burned INTEGER Calories burned during the activity.

staging.fact_sleep_log

Column Name Data Type Description
sleep_id BIGINT PRIMARY KEY Unique identifier for the sleep record.
user_id BIGINT Unique identifier for the user.
date DATE Date of the sleep record.
sleep_start TIMESTAMP Start time of sleep.
sleep_end TIMESTAMP End time of sleep.
sleep_duration_hours DECIMAL(5,1) Duration of sleep in hours.
sleep_quality_score INTEGER Quality score of sleep.

staging.fact_nutrition_log

Column Name Data Type Description
nutrition_id BIGINT PRIMARY KEY Unique identifier for the nutrition record.
user_id BIGINT Unique identifier for the user.
date DATE Date of the nutrition log.
food_item_id BIGINT Foreign key to the food item dimension.
meal_type VARCHAR(100) Type of meal (e.g., breakfast, lunch).

staging.fact_goals_log

Column Name Data Type Description
goal_id BIGINT PRIMARY KEY Unique identifier for the goal record.
user_id BIGINT Unique identifier for the user.
date DATE Date of the goal record.
goal_type VARCHAR(100) Type of goal (e.g., calories burned, steps taken).
target_value INTEGER Target value for the goal.
actual_value INTEGER Actual value achieved for the goal.
status VARCHAR(50) Status of the goal (e.g., achieved, not achieved).

Trusted Schema

  • Purpose: Stores the final, fully processed data that is ready for analytics and reporting.
Show Trusted Schema Star Diagram

Trusted Star Diagram
Star diagram for the trusted schema, illustrating the relationships between tables.

trusted.nutrition_data

Column Name Data Type Description
nutrition_id BIGINT PRIMARY KEY Unique identifier for the nutrition record.
user_id BIGINT Unique identifier for the user (foreign key to PII table).
date DATE Date of the nutrition log.
food_item VARCHAR(255) Name of the food item.
meal_type VARCHAR(100) Type of meal (e.g., breakfast, lunch).
calories_per_100g DECIMAL(4,0) Calories per 100 grams of the food item.
carbs_per_100g DECIMAL(3,0) Carbohydrates per 100 grams of the food item.
protein_per_100g DECIMAL(3,0) Protein per 100 grams of the food item.
fat_per_100g DECIMAL(3,0) Fat per 100 grams of the food item.

trusted.activity_data

Column Name Data Type Description
activity_id BIGINT PRIMARY KEY Unique identifier for the activity record.
user_id BIGINT Unique identifier for the user (foreign key to PII table).
timestamp TIMESTAMP Timestamp of the activity.
activity_type VARCHAR(100) Type of activity (e.g., walking, running).
steps INT Number of steps taken during the activity.
heart_rate INT Heart rate during the activity.
calories_burned INT Calories burned during the activity.

trusted.sleep_data

Column Name Data Type Description
sleep_id BIGINT PRIMARY KEY Unique identifier for the sleep record.
user_id BIGINT Unique identifier for the user (foreign key to PII table).
date DATE Date of the sleep record.
sleep_start TIMESTAMP Start time of sleep.
sleep_end TIMESTAMP End time of sleep.
sleep_duration_hours DECIMAL(5,1) Duration of sleep in hours.
sleep_quality_score INTEGER Quality score of sleep.

trusted.goals_data

Column Name Data Type Description
goal_id BIGINT PRIMARY KEY Unique identifier for the goal record.
user_id BIGINT Unique identifier for the user (foreign key to PII table).
date DATE Date of the goal record.
goal_type VARCHAR(100) Type of goal (e.g., calories burned, steps).
target_value INT Target value for the goal.
actual_value INT Actual value achieved for the goal.
status VARCHAR(50) Status of the goal (e.g., achieved, not achieved, pending).

trusted.user_profile

Column Name Data Type Description
user_id BIGINT PRIMARY KEY Unique identifier for the user.
name VARCHAR(255) User's name (PII, access restricted).
age INT User's age (PII, access restricted).
gender VARCHAR(50) User's gender (PII, access restricted).

Note:
All analytics and reporting are performed using the trusted fact tables, which only reference users by user_id.
PII is only accessible to authorized roles.


5. Database Administration & Data Governance

5.1 Database Administration


5.2 Data Governance

  • Data Privacy & PII Handling:
    • PII data (e.g., user names, ages) is stored only in trusted.user_data_pii and managed via scripts in sql/data/pii/.
    • Non-PII data is stored in trusted.user_data_non_pii (see sql/data/non_pii/), where user identifiers are hashed and ages are grouped.
    • Access to PII tables is strictly limited to users with the dba_role.
  • Data Lifecycle Management:
    • Old or inactive records are archived using scripts in sql/data/archive/.
    • Archived records are deleted from the main tables only after successful archival, ensuring no data loss.
  • Data Quality Assurance:
    • All trusted tables enforce strong data quality constraints (NOT NULL, CHECK, valid value lists) at the schema level.
    • ETL scripts filter out incomplete or invalid records before loading into trusted tables.
  • Security & Access Control:
    • Role-based access is enforced at the schema and table level.
    • PII data is never exposed to analytics or reporting users.
  • Documentation:
    • All scripts and policies are documented in this README for transparency and auditability.


6. Graphical User Interface (GUI)

A simple GUI is included to make running the ETL pipeline and generating dashboards more user-friendly.

How it Works

  • The GUI is built with Tkinter.
  • Launch it with:
    python src/interface/main.py
  • The interface will prompt you to enter:
    • Number of Users
    • Number of Days
    • Database Host, Port, User, Password, and Name
    • USDA API Key

When you click "Run Pipeline", the GUI passes your input as command-line arguments and environment variables to the ETL scripts. The scripts are then run in sequence: extract, transform, load, and dashboard generation.

File location:
src/interface/main.py

Important:

  • If you use the GUI, your input for number of users and days will be used for that run.
  • If you run the extract script directly (e.g., python src/extract/healthapp.py), it will use the default global variables (NO_USERS and NO_DAYS) defined in the script.

Note:
For larger numbers of users or days, the pipeline will take longer to complete.
This is because the USDA API is called to retrieve food information for every user and day, which can be time-consuming.

Example GUI window:

Show Example GUI window

GUI Example
The GUI for running the ETL pipeline and dashboards with custom parameters.


About

End-to-end data warehousing project integrating APIs, ETL workflows, and PostgreSQL for analytics and reporting.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages