Data Warehousing Project:
Health, Fitness & Nutrition Analytics

This project is a comprehensive data warehousing solution designed for a hypothetical health-tech startup.
It demonstrates best practices in ETL, data governance, privacy, and analytics with interactive dashboards and role-based access control.

Description
Business Requirements & Goals
Reports, Dashboards & KPIs
- Dashboard Example Output
Data Warehouse Design, Tables & Sources
4.1 APIs and Data Sources
4.2 ETL Process
4.3 Schemas
Database Administration & Data Governance
- 5.1 Database Administration
- 5.2 Data Governance
Graphical User Interface (GUI)
- GUI Example

1. Description

This repository showcases an end-to-end data warehouse solution designed for a fictional health-tech startup.
The scenario centers on a mobile application that helps users monitor their health and wellness by tracking activity, sleep, and nutrition data, with a data engineer responsible for building and maintaining the pipelines that enable secure, reliable analysis of this information.

The project demonstrates how to:

Integrate and process data from multiple sources (including synthetic data and public APIs).
Implement robust ETL pipelines for cleaning, validating, and transforming raw data into analytics-ready tables.
Enforce strong data governance, privacy, and security practices, including the separation of PII and non-PII data.
Provide interactive dashboards and key performance indicators (KPIs) for user activity, sleep, nutrition, and goal achievement.
Apply role-based access control to protect sensitive data.

Note:
This project is for educational and demonstration purposes only. The app and its data are entirely fictional and intended to showcase best practices in data engineering, warehousing, and analytics.

2. Business Requirements & Goals

Business Requirements

Integrate and process user health data (activity, sleep, nutrition) from multiple sources.
Ensure data privacy by separating and protecting PII and non-PII data.
Maintain high data quality and secure, role-based access.
Support analytics and reporting with trusted, well-structured data.

Core Business Goals

Provide actionable insights on user health and goal achievement.
Demonstrate best practices in data governance and lifecycle management.
Enable interactive dashboards and KPIs for key health metrics.

3. Reports, Dashboards & KPIs

Note:
Before generating dashboards, you must first extract, transform, and load the data by running the following scripts in order:

src/extract/healthapp.py

src/transform/transform.py

src/load/load.py

Interactive dashboards can be generated by running the provided script:

python src/dashboard/dashboard.py

This script (src/dashboard/dashboard.py) automatically creates and visualizes dashboards for your key health and fitness metrics using the trusted views.

Dashboards and KPIs included:

Goal Achievement:
- % of users who have achieved their goals
  - trusted.vw_pct_users_achieved_goals: Calculates the percentage of user goals (across all goal types) that have been achieved.
Nutrition:
- User's favourite food and typical meal time
  - trusted.vw_user_favourite_food: Shows each user's most frequently consumed food and the meal time they usually eat it.
- Average macronutrient distribution per user
  - trusted.vw_user_avg_macros: Displays the average intake of calories, carbs, protein, and fat per user.
Activity:
- Daily average calories burned per user
  - trusted.vw_user_daily_avg_calories_burned: Reports the average number of calories burned per user per day.
All dashboards:
- Displayed together for easy comparison

Dashboard Example Output

Dashboard example shown for 10 generated users across 7 days

4. Data Warehouse Design, Tables & Sources

4.1 APIs and Data Sources

HealthApp: Synthetic data is generated using the healthapp.py script.
- To generate raw data, run the following command:
```
python src/extract/healthapp.py
```
- This script creates the necessary schemas and tables in the raw schema (if they do not already exist) and populates them with raw user data (see raw.user_data), for activity, sleep, nutrition, and goals.
USDA API: Nutritional data for food items is fetched dynamically via API calls within the healthapp.py script.

4.2 ETL Process

Show ETL Pipeline Diagram

Overview of the ETL pipeline: Extract, Transform, Load.

The ETL pipeline consists of the following stages, each with a dedicated script:

1. Extract

Purpose: Generate and collect raw data from synthetic sources and public APIs.
Code: src/extract/healthapp.py
- Generates synthetic user/activity/nutrition data and loads it into the raw schema.
- Tables created in raw schema:
  - raw.user_data
  - raw.nutrition_log
Supporting code:
- src/extract/search_foods_api.py: Fetches nutrition data from the USDA API.
- src/extract/load_data_from_csv.py: Loads food and activity types from CSV files.
How to run:
```
python src/extract/healthapp.py
```

Show Sample Raw User Data

Example of the raw.user_data table as generated by the extract script.

Show Sample Raw Nutrition Log

Example of the raw.nutrition_log table as generated by the extract script.

2. Transform

Purpose: Clean, validate, and transform raw data into analytics-ready staging tables.
Code: src/transform/transform.py
- Reads from the raw schema, processes the data, and loads it into the staging schema.
- Tables created in staging schema:
How to run:
```
python src/transform/transform.py
```

Show Sample Staging User Profile

Example of the staging.dim_user_profile table after transformation.

Show Sample Staging Nutrition Log

Example of the staging.fact_nutrition_log table after transformation.

3. Load

Purpose: Load fully processed data from staging into the trusted schema for analytics and reporting.
Code: src/load/load.py
- Moves data from the staging schema into the trusted schema, enforcing all business and data quality rules.
- Tables created in trusted schema:
How to run:
```
python src/load/load.py
```

Note:
Each script will automatically create the required tables in its schema if they do not already exist.

Show Sample Trusted Nutrition

Example of the trusted.nutrition_data table after loading.

Show Sample Trusted Goals

Example of the trusted.goals_data table after loading.

4.3 Schemas

The data warehouse is organized into three schemas: raw, staging, and trusted. Each schema serves a specific purpose in the ETL pipeline:

Raw Schema

Purpose: The initial storage for raw, unprocessed data directly extracted from the sources.

`raw.user_data`

Column Name	Data Type	Description
`record_id`	`SERIAL PRIMARY KEY`	Unique identifier for each record.
`user_id`	`INTEGER`	Unique identifier for the user.
`name`	`VARCHAR(100)`	User's name.
`age`	`INTEGER`	User's age.
`weight_kg`	`NUMERIC(4,1)`	User's weight in kilograms.
`height_cm`	`NUMERIC(4,1)`	User's height in centimeters.
`gender`	`VARCHAR(10)`	User's gender.
`calorie_goal`	`INTEGER`	Daily calorie goal for the user.
`macro_goal`	`JSON`	JSON object containing macro goals (carbs, protein, fat).
`activity_start`	`TIMESTAMP`	Start time of the activity.
`activity_type`	`VARCHAR(50)`	Type of activity (e.g., walking, running).
`steps`	`INTEGER`	Daily step count
`heart_rate`	`INTEGER`	Heart rate during the activity.
`calories_burned`	`INTEGER`	Calories burned during the day.
`sleep_start`	`TIMESTAMP`	Start time of sleep.
`sleep_end`	`TIMESTAMP`	End time of sleep.
`sleep_quality_score`	`INTEGER`	Quality score of sleep.
`goal_type`	`VARCHAR(50)`	Type of goal (e.g., calories burned, steps taken).
`goal_target`	`INTEGER`	Target value for the goal.
`created_at`	`TIMESTAMP`	Timestamp when the record was created.

`raw.nutrition_log`

Column Name	Data Type	Description
`nutrition_id`	`SERIAL PRIMARY KEY`	Unique identifier for the nutrition record.
`user_id`	`INTEGER`	Unique identifier for the user.
`date`	`DATE`	Date of the nutrition log.
`food_item`	`VARCHAR(255)`	Name of the food item.
`meal_type`	`VARCHAR(100)`	Type of meal (e.g., breakfast, lunch).
`calories_per_100g`	`INTEGER`	Calories per 100 grams of the food item.
`carbs_per_100g`	`INTEGER`	Carbohydrates per 100 grams of the food item.
`protein_per_100g`	`INTEGER`	Protein per 100 grams of the food item.
`fat_per_100g`	`INTEGER`	Fat per 100 grams of the food item.

Staging Schema

Purpose: Stores cleaned and transformed data, ready for further processing.

Show Staging Schema Star Diagram

Star diagram for the staging schema, illustrating fact and dimension tables and their relationships.

`staging.dim_user_profile`

Column Name	Data Type	Description
`user_id`	`BIGINT PRIMARY KEY`	Unique identifier for the user.
`name`	`VARCHAR(255)`	User's name.
`age`	`INTEGER`	User's age.
`weight_kg`	`DECIMAL(4,1)`	User's weight in kilograms.
`height_cm`	`DECIMAL(4,1)`	User's height in centimeters.
`gender`	`VARCHAR(50)`	User's gender.
`calorie_goal`	`INTEGER`	Daily calorie goal for the user.
`carbs_goal`	`INTEGER`	Daily carbohydrate goal for the user.
`protein_goal`	`INTEGER`	Daily protein goal for the user.
`fat_goal`	`INTEGER`	Daily fat goal for the user.

`staging.dim_food_item`

Column Name	Data Type	Description
`food_item_id`	`BIGINT PRIMARY KEY`	Unique identifier for the food item.
`food_item`	`VARCHAR(255)`	Name of the food item.
`calories_per_100g`	`DECIMAL(4,0)`	Calories per 100 grams of the food item.
`carbs_per_100g`	`DECIMAL(3,0)`	Carbohydrates per 100 grams of the food item.
`protein_per_100g`	`DECIMAL(3,0)`	Protein per 100 grams of the food item.
`fat_per_100g`	`DECIMAL(3,0)`	Fat per 100 grams of the food item.

`staging.fact_activity_log`

Column Name	Data Type	Description
`activity_id`	`BIGINT PRIMARY KEY`	Unique identifier for the activity record.
`user_id`	`BIGINT`	Unique identifier for the user.
`timestamp`	`TIMESTAMP`	Timestamp of the activity.
`activity_type`	`VARCHAR(100)`	Type of activity (e.g., walking, running).
`steps`	`INTEGER`	Number of steps taken during the activity.
`heart_rate`	`INTEGER`	Heart rate during the activity.
`calories_burned`	`INTEGER`	Calories burned during the activity.

`staging.fact_sleep_log`

Column Name	Data Type	Description
`sleep_id`	`BIGINT PRIMARY KEY`	Unique identifier for the sleep record.
`user_id`	`BIGINT`	Unique identifier for the user.
`date`	`DATE`	Date of the sleep record.
`sleep_start`	`TIMESTAMP`	Start time of sleep.
`sleep_end`	`TIMESTAMP`	End time of sleep.
`sleep_duration_hours`	`DECIMAL(5,1)`	Duration of sleep in hours.
`sleep_quality_score`	`INTEGER`	Quality score of sleep.

`staging.fact_nutrition_log`

Column Name	Data Type	Description
`nutrition_id`	`BIGINT PRIMARY KEY`	Unique identifier for the nutrition record.
`user_id`	`BIGINT`	Unique identifier for the user.
`date`	`DATE`	Date of the nutrition log.
`food_item_id`	`BIGINT`	Foreign key to the food item dimension.
`meal_type`	`VARCHAR(100)`	Type of meal (e.g., breakfast, lunch).

`staging.fact_goals_log`

Column Name	Data Type	Description
`goal_id`	`BIGINT PRIMARY KEY`	Unique identifier for the goal record.
`user_id`	`BIGINT`	Unique identifier for the user.
`date`	`DATE`	Date of the goal record.
`goal_type`	`VARCHAR(100)`	Type of goal (e.g., calories burned, steps taken).
`target_value`	`INTEGER`	Target value for the goal.
`actual_value`	`INTEGER`	Actual value achieved for the goal.
`status`	`VARCHAR(50)`	Status of the goal (e.g., achieved, not achieved).

Trusted Schema

Purpose: Stores the final, fully processed data that is ready for analytics and reporting.

Show Trusted Schema Star Diagram

Star diagram for the trusted schema, illustrating the relationships between tables.

`trusted.nutrition_data`

Column Name	Data Type	Description
`nutrition_id`	`BIGINT PRIMARY KEY`	Unique identifier for the nutrition record.
`user_id`	`BIGINT`	Unique identifier for the user (foreign key to PII table).
`date`	`DATE`	Date of the nutrition log.
`food_item`	`VARCHAR(255)`	Name of the food item.
`meal_type`	`VARCHAR(100)`	Type of meal (e.g., breakfast, lunch).
`calories_per_100g`	`DECIMAL(4,0)`	Calories per 100 grams of the food item.
`carbs_per_100g`	`DECIMAL(3,0)`	Carbohydrates per 100 grams of the food item.
`protein_per_100g`	`DECIMAL(3,0)`	Protein per 100 grams of the food item.
`fat_per_100g`	`DECIMAL(3,0)`	Fat per 100 grams of the food item.

`trusted.activity_data`

Column Name	Data Type	Description
`activity_id`	`BIGINT PRIMARY KEY`	Unique identifier for the activity record.
`user_id`	`BIGINT`	Unique identifier for the user (foreign key to PII table).
`timestamp`	`TIMESTAMP`	Timestamp of the activity.
`activity_type`	`VARCHAR(100)`	Type of activity (e.g., walking, running).
`steps`	`INT`	Number of steps taken during the activity.
`heart_rate`	`INT`	Heart rate during the activity.
`calories_burned`	`INT`	Calories burned during the activity.

`trusted.sleep_data`

Column Name	Data Type	Description
`sleep_id`	`BIGINT PRIMARY KEY`	Unique identifier for the sleep record.
`user_id`	`BIGINT`	Unique identifier for the user (foreign key to PII table).
`date`	`DATE`	Date of the sleep record.
`sleep_start`	`TIMESTAMP`	Start time of sleep.
`sleep_end`	`TIMESTAMP`	End time of sleep.
`sleep_duration_hours`	`DECIMAL(5,1)`	Duration of sleep in hours.
`sleep_quality_score`	`INTEGER`	Quality score of sleep.

`trusted.goals_data`

Column Name	Data Type	Description
`goal_id`	`BIGINT PRIMARY KEY`	Unique identifier for the goal record.
`user_id`	`BIGINT`	Unique identifier for the user (foreign key to PII table).
`date`	`DATE`	Date of the goal record.
`goal_type`	`VARCHAR(100)`	Type of goal (e.g., calories burned, steps).
`target_value`	`INT`	Target value for the goal.
`actual_value`	`INT`	Actual value achieved for the goal.
`status`	`VARCHAR(50)`	Status of the goal (e.g., achieved, not achieved, pending).

`trusted.user_profile`

Column Name	Data Type	Description
`user_id`	`BIGINT PRIMARY KEY`	Unique identifier for the user.
`name`	`VARCHAR(255)`	User's name (PII, access restricted).
`age`	`INT`	User's age (PII, access restricted).
`gender`	`VARCHAR(50)`	User's gender (PII, access restricted).

Note:
All analytics and reporting are performed using the trusted fact tables, which only reference users by user_id.
PII is only accessible to authorized roles.

5. Database Administration & Data Governance

5.1 Database Administration

Create DBA Roles:
- Use the scripts in sql/roles/ to create the dba_role and grant access to the trusted schema and its views:
  - grant_usage_trusted_schema.sql
  - grant_select_trusted_views.sql
  - grant_access_pii_data.sql (restricts PII access to DBAs)
  - grant_access_non_pii_data.sql (grants analytics access to non-PII data)
  - grant_access_archive_user_data_pii.sql (restricts archive access to DBAs)
Optimize DB Performance:
- Use explain_query_execution.sql to analyze and optimize query performance.
Schema Organization:
- Schema Organization:
- All scripts for table creation, data insertion, archiving, and deletion are organized by data sensitivity in sql/data/.

5.2 Data Governance

Data Privacy & PII Handling:
- PII data (e.g., user names, ages) is stored only in trusted.user_data_pii and managed via scripts in sql/data/pii/.
- Non-PII data is stored in trusted.user_data_non_pii (see sql/data/non_pii/), where user identifiers are hashed and ages are grouped.
- Access to PII tables is strictly limited to users with the dba_role.
Data Lifecycle Management:
- Old or inactive records are archived using scripts in sql/data/archive/.
- Archived records are deleted from the main tables only after successful archival, ensuring no data loss.
Data Quality Assurance:
- All trusted tables enforce strong data quality constraints (NOT NULL, CHECK, valid value lists) at the schema level.
- ETL scripts filter out incomplete or invalid records before loading into trusted tables.
Security & Access Control:
- Role-based access is enforced at the schema and table level.
- PII data is never exposed to analytics or reporting users.
Documentation:
- All scripts and policies are documented in this README for transparency and auditability.

6. Graphical User Interface (GUI)

A simple GUI is included to make running the ETL pipeline and generating dashboards more user-friendly.

How it Works

The GUI is built with Tkinter.
Launch it with:
```
python src/interface/main.py
```
The interface will prompt you to enter:
- Number of Users
- Number of Days
- Database Host, Port, User, Password, and Name
- USDA API Key

When you click "Run Pipeline", the GUI passes your input as command-line arguments and environment variables to the ETL scripts. The scripts are then run in sequence: extract, transform, load, and dashboard generation.

File location:
src/interface/main.py

Important:

If you use the GUI, your input for number of users and days will be used for that run.
If you run the extract script directly (e.g., python src/extract/healthapp.py), it will use the default global variables (NO_USERS and NO_DAYS) defined in the script.

Note:
For larger numbers of users or days, the pipeline will take longer to complete.
This is because the USDA API is called to retrieve food information for every user and day, which can be time-consuming.

Example GUI window:

Show Example GUI window

The GUI for running the ETL pipeline and dashboards with custom parameters.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
assets		assets
sql		sql
src		src
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

License

melinteflxrin/SoftServe-BigData-Project

Folders and files

Latest commit

History

Repository files navigation

Data Warehousing Project: Health, Fitness & Nutrition Analytics

Table of contents

1. Description

2. Business Requirements & Goals

Business Requirements

Core Business Goals

3. Reports, Dashboards & KPIs

4. Data Warehouse Design, Tables & Sources

4.1 APIs and Data Sources

4.2 ETL Process

1. Extract

2. Transform

3. Load

4.3 Schemas

Raw Schema

raw.user_data

raw.nutrition_log

Staging Schema

staging.dim_user_profile

staging.dim_food_item

staging.fact_activity_log

staging.fact_sleep_log

staging.fact_nutrition_log

staging.fact_goals_log

Trusted Schema

trusted.nutrition_data

trusted.activity_data

trusted.sleep_data

trusted.goals_data

trusted.user_profile

5. Database Administration & Data Governance

5.1 Database Administration

5.2 Data Governance

6. Graphical User Interface (GUI)

How it Works

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0