An end-to-end modern data engineering project: an ETL pipeline deployed on Google Cloud Platform, with BigQuery for data analysis and Looker Studio for an insight dashboard.
Languages:
- Python
- SQL
Google Cloud Platform:
- Cloud Storage
- Cloud Composer (Airflow)
- BigQuery
- Looker Studio
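At its core, the pipeline's transform step joins the merged Audible sales data with the daily conversion-rate file and derives a price in THB. A minimal sketch of that logic, assuming illustrative column names ("date", "price", "THBPrice") rather than the real schema in the CSVs:

```python
def convert_prices(book_rows, rate_by_date):
    """Join each sale with the conversion rate for its date and add a THB price.

    Column names ("date", "price") are illustrative, not the exact schema.
    """
    out = []
    for row in book_rows:
        rate = rate_by_date[row["date"]]  # look up that day's USD->THB rate
        out.append(dict(row, THBPrice=round(float(row["price"]) * rate, 2)))
    return out

# Tiny in-memory example standing in for the merged CSVs in Cloud Storage.
books = [{"date": "2021-04-01", "price": "10.0"},
         {"date": "2021-04-02", "price": "5.5"}]
rates = {"2021-04-01": 31.2, "2021-04-02": 31.4}
print(convert_prices(books, rates))
```

In the actual DAG this runs over the full CSVs under `/home/airflow/gcs/data/` before the result is loaded into BigQuery.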
The raw data and output files are too large to store in the repository, so they are hosted on Google Drive instead:
- Raw data: https://drive.google.com/drive/folders/13KLFWQbJXNKjIoQp4QL3Mahyfvn-isaA?usp=drive_link
- Output: https://drive.google.com/drive/folders/1c_BNYN2IqQGQFtJmo-gaGyE_LwBfY6kX?usp=drive_link
- The final output from Looker Studio can be accessed via the following link: View Dashboard. Note: The dashboard reads data from a static CSV file exported from BigQuery.
- Clone this repository:
git clone https://github.com/supakunz/Book-Revenue-Pipeline-GCP.git
- Navigate to the project folder and set up the environment variables:
cd Book-Revenue-Pipeline-GCP
- Create a .env file in the root directory.
- Add the following variables to the .env file, replacing the placeholder values with your own:
MYSQL_CONNECTION = mysql_default #file name in Data Storage --> <data_audible_data_merged.csv>
CONVERSION_RATE_URL = <your_api_url> #file name in Data Storage --> <data_conversion_rate.csv>
MYSQL_OUTPUT_PATH = /home/airflow/gcs/data/audible_data_merged.csv
CONVERSION_RATE_OUTPUT_PATH = /home/airflow/gcs/data/conversion_rate.csv
FINAL_OUTPUT_PATH = /home/airflow/gcs/data/output.csv
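Inside the DAG code, these variables would typically be read with `os.getenv`, falling back to sensible defaults when a variable is unset. A small sketch, assuming the defaults below are the Composer paths documented above (how the `.env` file is loaded into the environment is deployment-specific):

```python
import os

# Fall back to the documented Cloud Composer data-folder paths when unset.
MYSQL_OUTPUT_PATH = os.getenv(
    "MYSQL_OUTPUT_PATH", "/home/airflow/gcs/data/audible_data_merged.csv")
CONVERSION_RATE_OUTPUT_PATH = os.getenv(
    "CONVERSION_RATE_OUTPUT_PATH", "/home/airflow/gcs/data/conversion_rate.csv")
FINAL_OUTPUT_PATH = os.getenv(
    "FINAL_OUTPUT_PATH", "/home/airflow/gcs/data/output.csv")

print(FINAL_OUTPUT_PATH)
```

Note that `MYSQL_CONNECTION` names an Airflow connection (`mysql_default`) rather than a file path, so it is resolved through Airflow's connection store, not read as a path.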
Author: Supakun Thata (supakunt.thata@gmail.com)