This project is a comprehensive data warehousing solution designed for a hypothetical health-tech startup.
It demonstrates best practices in ETL, data governance, privacy, and analytics with interactive dashboards and role-based access control.
- Description
- Business Requirements & Goals
- Reports, Dashboards & KPIs
- Data Warehouse Design, Tables & Sources
4.1 APIs and Data Sources
4.2 ETL Process
4.3 Schemas
- Database Administration & Data Governance
- Graphical User Interface (GUI)
This repository showcases an end-to-end data warehouse solution designed for a fictional health-tech startup.
The scenario centers on a mobile application that helps users monitor their health and wellness by tracking activity, sleep, and nutrition data, with a data engineer responsible for building and maintaining the pipelines that enable secure, reliable analysis of this information.
The project demonstrates how to:
- Integrate and process data from multiple sources (including synthetic data and public APIs).
- Implement robust ETL pipelines for cleaning, validating, and transforming raw data into analytics-ready tables.
- Enforce strong data governance, privacy, and security practices, including the separation of PII and non-PII data.
- Provide interactive dashboards and key performance indicators (KPIs) for user activity, sleep, nutrition, and goal achievement.
- Apply role-based access control to protect sensitive data.
Note:
This project is for educational and demonstration purposes only. The app and its data are entirely fictional and intended to showcase best practices in data engineering, warehousing, and analytics.
- Integrate and process user health data (activity, sleep, nutrition) from multiple sources.
- Ensure data privacy by separating and protecting PII and non-PII data.
- Maintain high data quality and secure, role-based access.
- Support analytics and reporting with trusted, well-structured data.
- Provide actionable insights on user health and goal achievement.
- Demonstrate best practices in data governance and lifecycle management.
- Enable interactive dashboards and KPIs for key health metrics.
Note:
Before generating dashboards, you must first extract, transform, and load the data by running the following scripts in order:
Interactive dashboards can be generated by running the provided script:
python src/dashboard/dashboard.py
This script (src/dashboard/dashboard.py
) automatically creates and visualizes dashboards for your key health and fitness metrics using the trusted views.
Dashboards and KPIs included:
-
Goal Achievement:
- % of users who have achieved their goals
trusted.vw_pct_users_achieved_goals
: Calculates the percentage of user goals (across all goal types) that have been achieved.
- % of users who have achieved their goals
-
Nutrition:
- User's favourite food and typical meal time
trusted.vw_user_favourite_food
: Shows each user's most frequently consumed food and the meal time they usually eat it.
- Average macronutrient distribution per user
trusted.vw_user_avg_macros
: Displays the average intake of calories, carbs, protein, and fat per user.
- User's favourite food and typical meal time
-
Activity:
- Daily average calories burned per user
trusted.vw_user_daily_avg_calories_burned
: Reports the average number of calories burned per user per day.
- Daily average calories burned per user
-
All dashboards:
- Displayed together for easy comparison
- HealthApp: Synthetic data is generated using the
healthapp.py
script.- To generate raw data, run the following command:
python src/extract/healthapp.py
- This script creates the necessary schemas and tables in the
raw
schema (if they do not already exist) and populates them with raw user data (seeraw.user_data
), for activity, sleep, nutrition, and goals.
- To generate raw data, run the following command:
- USDA API: Nutritional data for food items is fetched dynamically via API calls within the
healthapp.py
script.
The ETL pipeline consists of the following stages, each with a dedicated script:
- Purpose: Generate and collect raw data from synthetic sources and public APIs.
- Code:
src/extract/healthapp.py
- Generates synthetic user/activity/nutrition data and loads it into the
raw
schema. - Tables created in
raw
schema:
- Generates synthetic user/activity/nutrition data and loads it into the
- Supporting code:
src/extract/search_foods_api.py
: Fetches nutrition data from the USDA API.src/extract/load_data_from_csv.py
: Loads food and activity types from CSV files.
- How to run:
python src/extract/healthapp.py
Show Sample Raw Nutrition Log
Example of the raw.nutrition_log
table as generated by the extract script.
- Purpose: Clean, validate, and transform raw data into analytics-ready staging tables.
- Code:
src/transform/transform.py
- Reads from the
raw
schema, processes the data, and loads it into thestaging
schema. - Tables created in
staging
schema:
- Reads from the
- How to run:
python src/transform/transform.py
Show Sample Staging User Profile
Example of the staging.dim_user_profile
table after transformation.
Show Sample Staging Nutrition Log
Example of the staging.fact_nutrition_log
table after transformation.
- Purpose: Load fully processed data from staging into the trusted schema for analytics and reporting.
- Code:
src/load/load.py
- Moves data from the
staging
schema into thetrusted
schema, enforcing all business and data quality rules. - Tables created in
trusted
schema:
- Moves data from the
- How to run:
python src/load/load.py
Note:
Each script will automatically create the required tables in its schema if they do not already exist.
The data warehouse is organized into three schemas: raw, staging, and trusted. Each schema serves a specific purpose in the ETL pipeline:
- Purpose: The initial storage for raw, unprocessed data directly extracted from the sources.
Column Name | Data Type | Description |
---|---|---|
record_id |
SERIAL PRIMARY KEY |
Unique identifier for each record. |
user_id |
INTEGER |
Unique identifier for the user. |
name |
VARCHAR(100) |
User's name. |
age |
INTEGER |
User's age. |
weight_kg |
NUMERIC(4,1) |
User's weight in kilograms. |
height_cm |
NUMERIC(4,1) |
User's height in centimeters. |
gender |
VARCHAR(10) |
User's gender. |
calorie_goal |
INTEGER |
Daily calorie goal for the user. |
macro_goal |
JSON |
JSON object containing macro goals (carbs, protein, fat). |
activity_start |
TIMESTAMP |
Start time of the activity. |
activity_type |
VARCHAR(50) |
Type of activity (e.g., walking, running). |
steps |
INTEGER |
Daily step count |
heart_rate |
INTEGER |
Heart rate during the activity. |
calories_burned |
INTEGER |
Calories burned during the day. |
sleep_start |
TIMESTAMP |
Start time of sleep. |
sleep_end |
TIMESTAMP |
End time of sleep. |
sleep_quality_score |
INTEGER |
Quality score of sleep. |
goal_type |
VARCHAR(50) |
Type of goal (e.g., calories burned, steps taken). |
goal_target |
INTEGER |
Target value for the goal. |
created_at |
TIMESTAMP |
Timestamp when the record was created. |
Column Name | Data Type | Description |
---|---|---|
nutrition_id |
SERIAL PRIMARY KEY |
Unique identifier for the nutrition record. |
user_id |
INTEGER |
Unique identifier for the user. |
date |
DATE |
Date of the nutrition log. |
food_item |
VARCHAR(255) |
Name of the food item. |
meal_type |
VARCHAR(100) |
Type of meal (e.g., breakfast, lunch). |
calories_per_100g |
INTEGER |
Calories per 100 grams of the food item. |
carbs_per_100g |
INTEGER |
Carbohydrates per 100 grams of the food item. |
protein_per_100g |
INTEGER |
Protein per 100 grams of the food item. |
fat_per_100g |
INTEGER |
Fat per 100 grams of the food item. |
- Purpose: Stores cleaned and transformed data, ready for further processing.
Show Staging Schema Star Diagram
Star diagram for the staging schema, illustrating fact and dimension tables and their relationships.
Column Name | Data Type | Description |
---|---|---|
user_id |
BIGINT PRIMARY KEY |
Unique identifier for the user. |
name |
VARCHAR(255) |
User's name. |
age |
INTEGER |
User's age. |
weight_kg |
DECIMAL(4,1) |
User's weight in kilograms. |
height_cm |
DECIMAL(4,1) |
User's height in centimeters. |
gender |
VARCHAR(50) |
User's gender. |
calorie_goal |
INTEGER |
Daily calorie goal for the user. |
carbs_goal |
INTEGER |
Daily carbohydrate goal for the user. |
protein_goal |
INTEGER |
Daily protein goal for the user. |
fat_goal |
INTEGER |
Daily fat goal for the user. |
Column Name | Data Type | Description |
---|---|---|
food_item_id |
BIGINT PRIMARY KEY |
Unique identifier for the food item. |
food_item |
VARCHAR(255) |
Name of the food item. |
calories_per_100g |
DECIMAL(4,0) |
Calories per 100 grams of the food item. |
carbs_per_100g |
DECIMAL(3,0) |
Carbohydrates per 100 grams of the food item. |
protein_per_100g |
DECIMAL(3,0) |
Protein per 100 grams of the food item. |
fat_per_100g |
DECIMAL(3,0) |
Fat per 100 grams of the food item. |
Column Name | Data Type | Description |
---|---|---|
activity_id |
BIGINT PRIMARY KEY |
Unique identifier for the activity record. |
user_id |
BIGINT |
Unique identifier for the user. |
timestamp |
TIMESTAMP |
Timestamp of the activity. |
activity_type |
VARCHAR(100) |
Type of activity (e.g., walking, running). |
steps |
INTEGER |
Number of steps taken during the activity. |
heart_rate |
INTEGER |
Heart rate during the activity. |
calories_burned |
INTEGER |
Calories burned during the activity. |
Column Name | Data Type | Description |
---|---|---|
sleep_id |
BIGINT PRIMARY KEY |
Unique identifier for the sleep record. |
user_id |
BIGINT |
Unique identifier for the user. |
date |
DATE |
Date of the sleep record. |
sleep_start |
TIMESTAMP |
Start time of sleep. |
sleep_end |
TIMESTAMP |
End time of sleep. |
sleep_duration_hours |
DECIMAL(5,1) |
Duration of sleep in hours. |
sleep_quality_score |
INTEGER |
Quality score of sleep. |
Column Name | Data Type | Description |
---|---|---|
nutrition_id |
BIGINT PRIMARY KEY |
Unique identifier for the nutrition record. |
user_id |
BIGINT |
Unique identifier for the user. |
date |
DATE |
Date of the nutrition log. |
food_item_id |
BIGINT |
Foreign key to the food item dimension. |
meal_type |
VARCHAR(100) |
Type of meal (e.g., breakfast, lunch). |
Column Name | Data Type | Description |
---|---|---|
goal_id |
BIGINT PRIMARY KEY |
Unique identifier for the goal record. |
user_id |
BIGINT |
Unique identifier for the user. |
date |
DATE |
Date of the goal record. |
goal_type |
VARCHAR(100) |
Type of goal (e.g., calories burned, steps taken). |
target_value |
INTEGER |
Target value for the goal. |
actual_value |
INTEGER |
Actual value achieved for the goal. |
status |
VARCHAR(50) |
Status of the goal (e.g., achieved, not achieved). |
- Purpose: Stores the final, fully processed data that is ready for analytics and reporting.
Show Trusted Schema Star Diagram
Star diagram for the trusted schema, illustrating the relationships between tables.
Column Name | Data Type | Description |
---|---|---|
nutrition_id |
BIGINT PRIMARY KEY |
Unique identifier for the nutrition record. |
user_id |
BIGINT |
Unique identifier for the user (foreign key to PII table). |
date |
DATE |
Date of the nutrition log. |
food_item |
VARCHAR(255) |
Name of the food item. |
meal_type |
VARCHAR(100) |
Type of meal (e.g., breakfast, lunch). |
calories_per_100g |
DECIMAL(4,0) |
Calories per 100 grams of the food item. |
carbs_per_100g |
DECIMAL(3,0) |
Carbohydrates per 100 grams of the food item. |
protein_per_100g |
DECIMAL(3,0) |
Protein per 100 grams of the food item. |
fat_per_100g |
DECIMAL(3,0) |
Fat per 100 grams of the food item. |
Column Name | Data Type | Description |
---|---|---|
activity_id |
BIGINT PRIMARY KEY |
Unique identifier for the activity record. |
user_id |
BIGINT |
Unique identifier for the user (foreign key to PII table). |
timestamp |
TIMESTAMP |
Timestamp of the activity. |
activity_type |
VARCHAR(100) |
Type of activity (e.g., walking, running). |
steps |
INT |
Number of steps taken during the activity. |
heart_rate |
INT |
Heart rate during the activity. |
calories_burned |
INT |
Calories burned during the activity. |
Column Name | Data Type | Description |
---|---|---|
sleep_id |
BIGINT PRIMARY KEY |
Unique identifier for the sleep record. |
user_id |
BIGINT |
Unique identifier for the user (foreign key to PII table). |
date |
DATE |
Date of the sleep record. |
sleep_start |
TIMESTAMP |
Start time of sleep. |
sleep_end |
TIMESTAMP |
End time of sleep. |
sleep_duration_hours |
DECIMAL(5,1) |
Duration of sleep in hours. |
sleep_quality_score |
INTEGER |
Quality score of sleep. |
Column Name | Data Type | Description |
---|---|---|
goal_id |
BIGINT PRIMARY KEY |
Unique identifier for the goal record. |
user_id |
BIGINT |
Unique identifier for the user (foreign key to PII table). |
date |
DATE |
Date of the goal record. |
goal_type |
VARCHAR(100) |
Type of goal (e.g., calories burned, steps). |
target_value |
INT |
Target value for the goal. |
actual_value |
INT |
Actual value achieved for the goal. |
status |
VARCHAR(50) |
Status of the goal (e.g., achieved, not achieved, pending). |
Column Name | Data Type | Description |
---|---|---|
user_id |
BIGINT PRIMARY KEY |
Unique identifier for the user. |
name |
VARCHAR(255) |
User's name (PII, access restricted). |
age |
INT |
User's age (PII, access restricted). |
gender |
VARCHAR(50) |
User's gender (PII, access restricted). |
Note:
All analytics and reporting are performed using the trusted fact tables, which only reference users byuser_id
.
PII is only accessible to authorized roles.
- Create DBA Roles:
- Use the scripts in
sql/roles/
to create thedba_role
and grant access to thetrusted
schema and its views:grant_usage_trusted_schema.sql
grant_select_trusted_views.sql
grant_access_pii_data.sql
(restricts PII access to DBAs)grant_access_non_pii_data.sql
(grants analytics access to non-PII data)grant_access_archive_user_data_pii.sql
(restricts archive access to DBAs)
- Use the scripts in
- Optimize DB Performance:
- Use
explain_query_execution.sql
to analyze and optimize query performance.
- Use
- Schema Organization:
- Schema Organization:
- All scripts for table creation, data insertion, archiving, and deletion are organized by data sensitivity in
sql/data/
.
- Data Privacy & PII Handling:
- PII data (e.g., user names, ages) is stored only in
trusted.user_data_pii
and managed via scripts insql/data/pii/
. - Non-PII data is stored in
trusted.user_data_non_pii
(seesql/data/non_pii/
), where user identifiers are hashed and ages are grouped. - Access to PII tables is strictly limited to users with the
dba_role
.
- PII data (e.g., user names, ages) is stored only in
- Data Lifecycle Management:
- Old or inactive records are archived using scripts in
sql/data/archive/
. - Archived records are deleted from the main tables only after successful archival, ensuring no data loss.
- Old or inactive records are archived using scripts in
- Data Quality Assurance:
- All trusted tables enforce strong data quality constraints (
NOT NULL
,CHECK
, valid value lists) at the schema level. - ETL scripts filter out incomplete or invalid records before loading into trusted tables.
- All trusted tables enforce strong data quality constraints (
- Security & Access Control:
- Role-based access is enforced at the schema and table level.
- PII data is never exposed to analytics or reporting users.
- Documentation:
- All scripts and policies are documented in this README for transparency and auditability.
A simple GUI is included to make running the ETL pipeline and generating dashboards more user-friendly.
- The GUI is built with Tkinter.
- Launch it with:
python src/interface/main.py
- The interface will prompt you to enter:
- Number of Users
- Number of Days
- Database Host, Port, User, Password, and Name
- USDA API Key
When you click "Run Pipeline", the GUI passes your input as command-line arguments and environment variables to the ETL scripts. The scripts are then run in sequence: extract, transform, load, and dashboard generation.
File location:
src/interface/main.py
Important:
- If you use the GUI, your input for number of users and days will be used for that run.
- If you run the extract script directly (e.g.,
python src/extract/healthapp.py
), it will use the default global variables (NO_USERS
andNO_DAYS
) defined in the script.
Note:
For larger numbers of users or days, the pipeline will take longer to complete.
This is because the USDA API is called to retrieve food information for every user and day, which can be time-consuming.
Example GUI window: