Data Warehouse Design and Implementation

Presentation Youtube Link: https://youtu.be/_5-avdmH-Qw

Requirement

Objectives

The Data Warehouse system aims to:

Store and analyze weather data (temperature, humidity, rainfall, etc.).
Store and analyze soil salinity data for different regions, provinces.
Support decision-making in agriculture, particularly in mitigating climate change risks.

Scope

Geographical coverage: Southern Vietnam provinces only.
Collected data:
- Weather data
- Soil salinity data
Retention period: At least 4 years for trend analysis.
Update frequency: Daily, weekly...

General Structure

Three-Tier Architecture

Data Source

The data used in this project comes from 2 source:

Weather data: collected daily through MeteoSource API and is provided in JSON format. It typically includes parameters such as temperature, humidity, precipitation, and other atmospheric conditions.
Soil salinity data: obtained from a flat file in CSV format. This file serves as a weekly aggregation of soil salinity levels. This dataset is collected weekly from reports issued by the Southern Regional Hydrometeorological Center of Vietnam's website.

Create Data houseware Schema on PostgreSQL

Create new Database using pgADMIN

Open pgAdmin.
Go to Object => Create => Database.
Enter the database name (for example: Climeweather_dw).
Select encoding: UTF8.
Click Save.

Write SQL to create schema and tables

Run SQL queries in pgAdmin's Query Tool to create Schema tables

Insert sample data

Run SQL queries to insert sample data into dimension tables
Run Python scripts to insert sample data into cleaned staging area tables

Visualize ERD

Choose a table in Object Explorer
Click right-mouse button then choose ERD for table

The Schema for Data Warehouse is the same as this:

Design Staging Area

Staging Area will store raw data from many different sources (weather API, IoT, satellite sensor) and cleaned data before entering Data Warehouse.
Type of Tables in Staging Area:
- Raw data table: store un-processed data from Data Source
  - The data will be saved in the form of JSONB for flexibility.
  - There is an additional Processed column to mark the processed data or not.
- Cleaned data table: store processed data, the data is ready to be imported into Data Warehouse

Create tables in Staging area

Run SQL queries in pgAdmin's Query Tool to create Raw data Staging Tables and Cleaned data Staging Tables

ETL (Extract-Transform-Load) Technology

Apache Nifi (Compatible with real-time data, provide visual interface) for ETL from Data Source to Staging Area

Install and Start Apache Nifi (version 2.2.0)

Download Nifi: Visit the Apache NiFi Downloads page and choose the version suitable for your operating system (Windows, macOS, or Linux).
Extract the Files: Extract the .zip file.
Set up Java: Ensure you have Java 8 or Java 11 installed. Running java -version in terminal to check.
Configure NiFi: Navigate to the conf folder in the extracted NiFi directory. Edit the nifi.properties file to set up configurations like ports, repositories, and sensitive properties.
Start NiFi: Open Terminal then navigate to bin folder. Run .\nifi.cmd start.
Access the NiFi UI: Open a web browser and go to https://localhost:8443/nifi (default port).
Log in: username and password are automatically created by Nifi and stored in logs/nifi-app.log

ETL Pipeline

ETL from Data Source to Staging Area

ETL weather data (import processor into Nifi): execute every day
- Extract: fetch weather data from API
- Transform: change unit
- Load: load transformed weather data into staging_weather_raw table
ETL soil salinity data (import processor into Nifi): execute every week
- Extract: fetch salinity data from CSV file (weekly salinity report)
- Transform: replace missing values
- Load: load transformed salinity data into staging_salinity_raw table

Process raw data in Staging Area

Process raw data in raw data tables. Tranform JSONB data into cleaned data then insert them into cleaned data tables. Execute every week (import processor).

Tranform raw weather data from staging_weather_raw table into cleaned weather data then insert into staging_weather_cleaned table using Python script
Tranform raw salinity data from staging_salinity_raw table into cleaned salinity data then insert into staging_salinity_cleaned table using Python script

ETL from Staging Area to DWH

Import processor, which use Python scripts to process ETL into Nifi (execute every week)

Extract: get cleaned data from staging_weather_cleaned and staging_salinity_cleaned
Transform: handling missing data
- Missing Weather data: use moving average to handle missing value (weather data is collected in a short time - every day)
- Missing Salinity data: label as missing value
Load: load transform data to fact tables
Delete cleaned data in staging_weather_cleaned and staging_salinity_cleaned

OLAP Server

PostgreSQL is not an OLAP-specialized database, but it can still serve as an effective OLAP server in many use cases. Although it uses row-based storage and is optimized for transactional (OLTP) workloads, PostgreSQL supports powerful analytical features such as GROUP BY, ROLLUP, CUBE, window functions, CTEs, and partitioning. These capabilities allow it to handle multi-dimensional analysis and reporting tasks quite well, especially when dealing with medium-sized datasets (from megabytes to tens of gigabytes). Extensions like cstore_fdw, TimescaleDB, and pg_partman further enhance its ability to perform OLAP-like operations, including time-series analysis and columnar storage

BI Tool and OLAP operations

Power BI

Power BI is a strong BI (Business Intelligence) tool, which helps you analyze and visualize data from Data Warehouse easily.

Connect Power BI to PostgreSQL

Open Power BI → Home → Get Data → More...
Choose Database → PostgreSQL database → Connect
Input connection information:
- Server: localhost:5432
- Database: climeweather_dw
Choose DirectQuery (Query directly from DB when opening the report)
Input username and password
In Navigator dialog box, select all tables that will be used and click Load (Since DirectQuery is chosen, Power Bi does not download data, but send SQL query to Postgresql whenever you change the filter, create charts, or open Dashboard - Loading data in this step is actually loading metadata)

Data Model in Power BI

OLAP works on the relational data model, so it is necessary to organize data under the Star Schema model or Snowflake Schema with the tables
After loading tables, Power BI will automatically create Data Model. Change to Model View, the data model will be the same as follow

Set up Hierarchies in Power BI

Each dimension can have hierarchies, which define the levels of granularity.
Hierarchies allow users to drill down or roll up through levels of detail.
Steps to set up Hierarchies:
1. Select a dim table in Model View.
2. Choose the hierarchy column (eg: Year, Quarter, Month, Day).
3. Right-click on a column → Create Hierarchy.
4. Name Hierarchy (eg: "Date Hierarchy").
5. Pull other columns into Hierarchy to create decentralized level.

Create charts in Power BI

Go to Report View
In Visualizations, select desired type of chart
Pull the columns into the chart in Build visual:
- Axis (X axis)
- Values (Value)
- Legend (classification)
- ...
Format chart's and axis's titles in Format visual

OLAP Operations in Power BI

When using DirectQuery, each OLAP operator on Power Bi will send one or more SQL queries to Postgresql to process data and return the results

Roll-Up
- Insert a column chart (bar chart, line chart, matrix).
- Pull Hierarchy into the X axis.
- Press Drill Up to return to synthesis.
Drill-Down
- Insert a column chart (bar chart, line chart, matrix).
- Pull Hierarchy into the X axis.
- Enable Drill Down Mode so you can explore data at different levels at a point on X axis
- Press Expand all down one level so you can explore data at different levels at every point on X axis.
Slice
- Go to Visualizations tab → Select Slicer.
- Pull the column to filter into Slicer (eg: Year, Region, Category).
- When choosing a value in Slicer, the report will automatically update.
Dice
- Select a chart or data sheet.
- Go to the Filters tab.
- Add desired filtration conditions (eg: Category = 'Laptop', Region = 'Hanoi').

Visualize Map in Power BI

Create GeoJSON file:
- Download Vietnam GeoJSON file from link and save as this GeoJSON file
- Run Python script to filter necessary provinces
Convert GeoJSON to TopoJSON: use this website
Build the Shape Map visual
- Drag the Shape Map to Power BI Canvas
- Under the Build tab, drag location into the Location Area
- Under the Format tab, expand Map Setting, set Map type to Custom map then select the saved TopoJSON file in Add a map type

For more specific instructions, please follow this article

The result after visualize data fetching from Data Warehouse may be the same as follow:

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
ApacheNifi_processor		ApacheNifi_processor
PowerBI		PowerBI
Python_script		Python_script
SQL		SQL
asset		asset
Presentation.pdf		Presentation.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Data Warehouse Design and Implementation

Requirement

Objectives

Scope

General Structure

Three-Tier Architecture

Data Source

Create Data houseware Schema on PostgreSQL

Create new Database using pgADMIN

Write SQL to create schema and tables

Insert sample data

Visualize ERD

Design Staging Area

Create tables in Staging area

ETL (Extract-Transform-Load) Technology

Install and Start Apache Nifi (version 2.2.0)

ETL Pipeline

ETL from Data Source to Staging Area

Process raw data in Staging Area

ETL from Staging Area to DWH

OLAP Server

BI Tool and OLAP operations

Power BI

Connect Power BI to PostgreSQL

Data Model in Power BI

Set up Hierarchies in Power BI

Create charts in Power BI

OLAP Operations in Power BI

Visualize Map in Power BI

About

Uh oh!

Releases

Packages

Uh oh!

Languages

thuanvu301103/CO4031_Data_Warehouse_and_Data_Support_System_Assignment_Weather_Warehouse

Folders and files

Latest commit

History

Repository files navigation

Data Warehouse Design and Implementation

Requirement

Objectives

Scope

General Structure

Three-Tier Architecture

Data Source

Create Data houseware Schema on PostgreSQL

Create new Database using pgADMIN

Write SQL to create schema and tables

Insert sample data

Visualize ERD

Design Staging Area

Create tables in Staging area

ETL (Extract-Transform-Load) Technology

Install and Start Apache Nifi (version 2.2.0)

ETL Pipeline

ETL from Data Source to Staging Area

Process raw data in Staging Area

ETL from Staging Area to DWH

OLAP Server

BI Tool and OLAP operations

Power BI

Connect Power BI to PostgreSQL

Data Model in Power BI

Set up Hierarchies in Power BI

Create charts in Power BI

OLAP Operations in Power BI

Visualize Map in Power BI

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages