Presentation YouTube link: https://youtu.be/_5-avdmH-Qw
The Data Warehouse system aims to:
- Store and analyze weather data (temperature, humidity, rainfall, etc.).
- Store and analyze soil salinity data for different regions and provinces.
- Support decision-making in agriculture, particularly in mitigating climate change risks.
- Geographical coverage: Southern Vietnam provinces only.
- Collected data:
  - Weather data
  - Soil salinity data
- Retention period: At least 4 years for trend analysis.
- Update frequency: daily (weather data) and weekly (soil salinity data).
The data used in this project comes from two sources:
- Weather data: collected daily through the MeteoSource API and provided in JSON format. It typically includes parameters such as temperature, humidity, precipitation, and other atmospheric conditions.
- Soil salinity data: obtained from a flat file in CSV format that aggregates soil salinity levels weekly. The dataset comes from weekly reports published on the website of Vietnam's Southern Regional Hydrometeorological Center.
- Open `pgAdmin`.
- Go to `Object` => `Create` => `Database`.
- Enter the database name (for example: `Climeweather_dw`).
- Select encoding: `UTF8`.
- Click `Save`.
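Equivalently, the database can be created from a script; a minimal sketch using `psycopg2` (the credentials are placeholders):

```python
# Create the same database from a script instead of the pgAdmin UI.
# Placeholder credentials: adjust user/password/host for your setup.
import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres",
                        password="your_password", host="localhost", port=5432)
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE climeweather_dw ENCODING 'UTF8'")
conn.close()
```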
Run SQL queries in pgAdmin's Query Tool to create the schema tables:
- Run SQL queries to insert sample data into the dimension tables
- Run Python scripts to insert sample data into the cleaned staging area tables
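The project's SQL files define the real schema; purely for illustration, a hypothetical dimension table with a couple of sample rows might be created like this (table and column names here are assumptions, not the project's actual schema):

```python
# Hypothetical dimension table and sample rows - illustration only;
# the project's own SQL files are the authoritative schema.
import psycopg2

conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS dim_location (
            location_id SERIAL PRIMARY KEY,
            province    TEXT NOT NULL,
            region      TEXT NOT NULL
        )
    """)
    cur.execute("""
        INSERT INTO dim_location (province, region)
        VALUES ('Ben Tre', 'Mekong Delta'), ('Ca Mau', 'Mekong Delta')
    """)
conn.close()
```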
- Choose a table in `Object Explorer`
- Right-click it, then choose `ERD for table`
The schema of the Data Warehouse is as follows:
- The Staging Area stores raw data from many different sources (weather API, IoT, satellite sensors) as well as cleaned data before it enters the Data Warehouse.
- Types of tables in the Staging Area:
  - Raw data table: stores unprocessed data from the data sources
    - The data is saved as `JSONB` for flexibility.
    - An additional `Processed` column marks whether each row has been processed yet.
  - Cleaned data table: stores processed data that is ready to be imported into the Data Warehouse
Run SQL queries in pgAdmin's Query Tool to create the raw data staging tables and the cleaned data staging tables.
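A sketch of what these staging tables might look like, matching the `JSONB` plus `Processed` design above (exact column names are assumptions; the project's SQL files are authoritative):

```python
# Sketch of the raw staging tables: payloads stored as JSONB, with a
# processed flag so the weekly cleaning job can find new rows.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS staging_weather_raw (
    id          SERIAL PRIMARY KEY,
    payload     JSONB NOT NULL,          -- raw API response, kept as-is
    fetched_at  TIMESTAMPTZ DEFAULT now(),
    processed   BOOLEAN DEFAULT FALSE    -- marks rows already cleaned
);
CREATE TABLE IF NOT EXISTS staging_salinity_raw (
    id          SERIAL PRIMARY KEY,
    payload     JSONB NOT NULL,
    fetched_at  TIMESTAMPTZ DEFAULT now(),
    processed   BOOLEAN DEFAULT FALSE
);
"""

conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```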
Apache NiFi (compatible with real-time data, provides a visual interface) is used for ETL from the data sources to the Staging Area.
- Download NiFi: visit the Apache NiFi Downloads page and choose the version suitable for your operating system (Windows, macOS, or Linux).
- Extract the files: extract the `.zip` file.
- Set up Java: ensure you have Java 8 or Java 11 installed. Run `java -version` in a terminal to check.
- Configure NiFi: navigate to the `conf` folder in the extracted NiFi directory. Edit the `nifi.properties` file to set up configurations like ports, repositories, and sensitive properties.
- Start NiFi: open a terminal and navigate to the `bin` folder, then run `.\nifi.cmd start` (on Windows; on macOS/Linux use `./nifi.sh start`).
- Access the NiFi UI: open a web browser and go to `https://localhost:8443/nifi` (default port).
- Log in: the username and password are automatically generated by NiFi and stored in `logs/nifi-app.log`.
- ETL weather data (import the processor into NiFi): executes every day
  - Extract: fetch weather data from the API
  - Transform: change units
  - Load: load the transformed weather data into the `staging_weather_raw` table
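In plain Python rather than NiFi processors, this flow amounts to something like the sketch below (the endpoint, parameters, and credentials are illustrative, and the unit-change transform is left to the NiFi flow; consult the MeteoSource documentation for the real URL and fields):

```python
# Sketch of the daily weather ETL: extract JSON from the MeteoSource API
# and land it in staging_weather_raw. Endpoint/params are illustrative.
import json
import psycopg2
import requests

API_URL = "https://www.meteosource.com/api/v1/free/point"  # illustrative
params = {"place_id": "can-tho", "sections": "daily", "key": "YOUR_API_KEY"}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
payload = resp.json()

conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO staging_weather_raw (payload) VALUES (%s::jsonb)",
        (json.dumps(payload),),
    )
conn.close()
```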
- ETL soil salinity data (import the processor into NiFi): executes every week
  - Extract: fetch salinity data from the CSV file (the weekly salinity report)
  - Transform: replace missing values
  - Load: load the transformed salinity data into the `staging_salinity_raw` table
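The weekly salinity flow, sketched the same way (the file name, column layout, and the -1 sentinel are assumptions):

```python
# Sketch of the weekly salinity ETL: read the report CSV, replace missing
# values, and land each row as JSONB in staging_salinity_raw.
import json
import pandas as pd
import psycopg2

# Extract: read the weekly salinity report (hypothetical file name).
df = pd.read_csv("salinity_weekly_report.csv")

# Transform: replace missing values with a sentinel.
df = df.fillna(-1)

# The to_json/loads round-trip converts numpy types into plain JSON-safe values.
records = json.loads(df.to_json(orient="records"))

# Load: insert each row into the raw staging table.
conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    for record in records:
        cur.execute(
            "INSERT INTO staging_salinity_raw (payload) VALUES (%s::jsonb)",
            (json.dumps(record),),
        )
conn.close()
```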
Process the raw data in the raw data tables: transform the `JSONB` data into cleaned data, then insert it into the cleaned data tables. Executes every week (import the processor).
- Transform raw weather data from the `staging_weather_raw` table into cleaned weather data, then insert it into the `staging_weather_cleaned` table using a Python script
- Transform raw salinity data from the `staging_salinity_raw` table into cleaned salinity data, then insert it into the `staging_salinity_cleaned` table using a Python script
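A minimal sketch of what such a Python cleaning script might do for the weather data (the JSON keys and cleaned-table columns are assumptions; the salinity script would follow the same pattern):

```python
# Pull unprocessed JSONB rows, flatten the fields the warehouse needs,
# insert them into the cleaned table, then flip the processed flag.
import psycopg2

conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, payload FROM staging_weather_raw WHERE processed = FALSE")
    for row_id, payload in cur.fetchall():
        # Hypothetical JSON shape: {"daily": {"data": [{...}, ...]}}
        for day in payload.get("daily", {}).get("data", []):
            cur.execute(
                """INSERT INTO staging_weather_cleaned (day, temperature, humidity)
                   VALUES (%s, %s, %s)""",
                (day.get("day"), day.get("temperature"), day.get("humidity")),
            )
        cur.execute("UPDATE staging_weather_raw SET processed = TRUE WHERE id = %s",
                    (row_id,))
conn.close()
```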
Import the processor, which uses Python scripts for the ETL from the Staging Area into the Data Warehouse, into NiFi (executes every week):
- Extract: get the cleaned data from `staging_weather_cleaned` and `staging_salinity_cleaned`
- Transform: handle missing data
  - Missing weather data: use a moving average to fill missing values (weather data is collected at short intervals - every day)
  - Missing salinity data: label it as a missing value
- Load: load the transformed data into the fact tables
- Delete the cleaned data from `staging_weather_cleaned` and `staging_salinity_cleaned`
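A minimal sketch of the moving-average fill for missing weather values (the window size and column names are assumptions):

```python
# Fill gaps in a daily series with a centered moving average, as described above.
import pandas as pd

df = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=7, freq="D"),
    "temperature": [25.0, 26.1, None, 27.3, None, 26.8, 25.9],
})

# 3-day centered rolling mean over the known values; min_periods=1 lets the
# window produce a value even when a neighbor is missing.
rolling = df["temperature"].rolling(window=3, center=True, min_periods=1).mean()

# Fill only the gaps; observed values are kept untouched.
df["temperature"] = df["temperature"].fillna(rolling)
print(df)
```

Salinity gaps, being weekly, are only labeled as missing rather than interpolated, per the list above.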
PostgreSQL is not an OLAP-specialized database, but it can still serve as an effective OLAP server in many use cases. Although it uses row-based storage and is optimized for transactional (OLTP) workloads, PostgreSQL supports powerful analytical features such as `GROUP BY`, `ROLLUP`, `CUBE`, window functions, CTEs, and partitioning. These capabilities allow it to handle multi-dimensional analysis and reporting tasks quite well, especially when dealing with medium-sized datasets (from megabytes to tens of gigabytes). Extensions like `cstore_fdw`, `TimescaleDB`, and `pg_partman` further enhance its ability to perform OLAP-like operations, including time-series analysis and columnar storage.
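For instance, a `ROLLUP` query could compute per-province averages, per-region subtotals, and a grand total in one pass (table and column names follow the hypothetical star schema sketched earlier, not the project's actual one):

```python
# Example of the analytical SQL PostgreSQL supports out of the box.
import psycopg2

QUERY = """
SELECT l.region, l.province, AVG(f.temperature) AS avg_temp
FROM fact_weather f
JOIN dim_location l ON l.location_id = f.location_id
GROUP BY ROLLUP (l.region, l.province)
ORDER BY l.region NULLS LAST, l.province NULLS LAST;
"""

conn = psycopg2.connect(dbname="climeweather_dw", user="postgres",
                        password="your_password", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for region, province, avg_temp in cur.fetchall():
        print(region, province, avg_temp)
conn.close()
```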
Power BI is a strong BI (Business Intelligence) tool that makes it easy to analyze and visualize data from the Data Warehouse.
- Open `Power BI` → `Home` → `Get Data` → `More...`
- Choose `Database` → `PostgreSQL database` → `Connect`
- Input the connection information:
  - Server: `localhost:5432`
  - Database: `climeweather_dw`
- Choose `DirectQuery` (queries the database directly when the report is opened)
- Input the username and password
- In the `Navigator` dialog box, select all the tables that will be used and click `Load`. (Since `DirectQuery` is chosen, Power BI does not download the data; it sends SQL queries to PostgreSQL whenever you change a filter, create a chart, or open the dashboard. Loading data in this step actually loads only the metadata.)
- OLAP works on the relational data model, so the tables need to be organized in a Star Schema or Snowflake Schema.
- After loading the tables, Power BI will automatically create the data model. Switch to `Model View`; the data model should look like the following:
- Each dimension can have hierarchies, which define the levels of granularity.
- Hierarchies allow users to drill down or roll up through levels of detail.
- Steps to set up hierarchies:
  - Go to `Report View`
  - In `Visualizations`, select the desired type of chart
  - Pull the columns into the chart in `Build visual`:
    - Axis (X axis)
    - Values (Value)
    - Legend (classification)
    - ...
  - Format the chart's and axes' titles in `Format visual`
When using `DirectQuery`, each OLAP operation in Power BI sends one or more SQL queries to PostgreSQL to process the data and return the results; hand-written equivalents of such queries are sketched after the list below.
- Roll-Up
  - Insert a column chart (bar chart, line chart, matrix).
  - Pull the hierarchy into the X axis.
  - Press `Drill Up` to return to the aggregated view.
- Drill-Down
- Slice
- Dice
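As noted above, each of these operations becomes SQL under `DirectQuery`. Hand-written equivalents of a slice (fix one dimension value) and a dice (fix several) might look like this; Power BI generates its own SQL, and the fact and dimension names here are the hypothetical ones used earlier:

```python
# Illustrative SQL for a slice and a dice over a hypothetical salinity star schema.

SLICE = """
SELECT d.month, AVG(f.salinity) AS avg_salinity
FROM fact_salinity f
JOIN dim_date d     ON d.date_id = f.date_id
JOIN dim_location l ON l.location_id = f.location_id
WHERE l.province = 'Ben Tre'                -- slice: one province
GROUP BY d.month;
"""

DICE = """
SELECT l.province, d.year, AVG(f.salinity) AS avg_salinity
FROM fact_salinity f
JOIN dim_date d     ON d.date_id = f.date_id
JOIN dim_location l ON l.location_id = f.location_id
WHERE l.province IN ('Ben Tre', 'Ca Mau')   -- dice: subset of provinces...
  AND d.year IN (2023, 2024)                -- ...and a subset of years
GROUP BY l.province, d.year;
"""
```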
- Create the GeoJSON file:
  - Download the Vietnam GeoJSON file from link and save it as this GeoJSON file
  - Run a Python script to filter the necessary provinces (see the sketch after this list)
  - Convert the GeoJSON to TopoJSON: use this website
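A minimal sketch of the province-filtering script (the input file name and the property key that holds the province name are assumptions; inspect your GeoJSON to confirm them):

```python
# Keep only the Southern provinces from the national GeoJSON file.
import json

KEEP = {"Ben Tre", "Ca Mau", "Can Tho"}  # extend with the provinces you need

with open("vietnam.geojson", encoding="utf-8") as f:
    geo = json.load(f)

geo["features"] = [
    feat for feat in geo["features"]
    if feat["properties"].get("Name") in KEEP  # property key varies by source
]

with open("southern_vietnam.geojson", "w", encoding="utf-8") as f:
    json.dump(geo, f, ensure_ascii=False)
```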
- Build the Shape Map visual:
  - Drag the `Shape Map` onto the Power BI canvas
  - Under the `Build` tab, drag `location` into the `Location` area
  - Under the `Format` tab, expand `Map settings`, set `Map type` to `Custom map`, then select the saved TopoJSON file in `Add a map type`
For more specific instructions, please follow this article.
The result after visualizing the data fetched from the Data Warehouse may look like the following: