Empty file added .env
Empty file.
32 changes: 32 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Local folders
.terraform/
.terraform.lock.hcl

# State files
*.tfstate
*.tfstate.*

# Log files
crash.log
crash.*.log

# Files for override and backup
override.tf
override.tf.json
*_override.tf
*_override.tf.json
*.tfvars
*.tfvars.json
*.auto.tfvars

# Backup files for Terraform
*.backup

# Plan files
*.plan

# Files for IDE
.vscode/
.idea/
*.iml
20 changes: 20 additions & 0 deletions Dockerfile.dockerfile
@@ -0,0 +1,20 @@
FROM apache/airflow:2.6.0-python3.9

COPY requirements.txt /tmp/requirements.txt

USER root

# Java 17 installation
RUN apt-get update && \
    apt-get install -y openjdk-17-jdk && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*


ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENV PATH="${JAVA_HOME}/bin:${PATH}"

USER airflow

# Python dependencies
RUN pip install --no-cache-dir -r /tmp/requirements.txt
94 changes: 8 additions & 86 deletions README.md
@@ -1,91 +1,13 @@
# 🚀 Cloud Data Engineer Challenge

Welcome to the **Cloud Data Engineer Challenge!** 🎉 This challenge is designed to evaluate your ability to work with **Infrastructure as Code (IaC), AWS data services, and data engineering workflows**, ensuring efficient data ingestion, storage, and querying.
## 1. Diagram for the solution

> [!NOTE]
> You can use **any IaC tool of your choice** (Terraform preferred, but alternatives are allowed). If you choose a different tool or a combination of tools, **justify your decision!**
![Architecture Diagram](case_diagram.png)

## ⚡ Challenge Overview
## 2. Step-by-step guide to deploy the solution

Your task is to deploy the following infrastructure on AWS:

> 🎯 **Key Objectives:**

- **An S3 bucket** that will receive data files as new objects.
- **A Lambda function** that is triggered by a `PUT` event in the S3 bucket.
- **The Lambda function must:**
- Process the ingested data and perform a minimal aggregation.
- Store the processed data in a **PostgreSQL database with PostGIS enabled**.
- Expose an API Gateway endpoint (`GET /aggregated-data`) to query and retrieve the aggregated data.
- **A PostgreSQL database** running in a private subnet with PostGIS enabled.
- **Networking must include:** VPC, public/private subnets, and security groups.
- **The Lambda must be in a private subnet** and use a NAT Gateway in a public subnet for internet access 🌍
- **CloudWatch logs** should capture Lambda execution details and possible errors.
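The event-driven flow in the objectives above can be sketched in a few lines of Python. This is a minimal illustration, not the reference solution: the handler structure, the choice of aggregation (average speed per airline, using the `current_speed_km_h` field from the example payloads), and the omitted S3/PostGIS I/O are all assumptions.

```python
import json


def extract_object_ref(event):
    """Pull bucket name and object key from a standard S3 PUT event record."""
    record = event["Records"][0]
    return record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]


def aggregate(rows):
    """Minimal aggregation: average speed per airline (illustrative choice)."""
    totals = {}
    for row in rows:
        name = row["airline"]
        count, total = totals.get(name, (0, 0.0))
        totals[name] = (count + 1, total + row["current_speed_km_h"])
    return {name: total / count for name, (count, total) in totals.items()}


def handler(event, context):
    bucket, key = extract_object_ref(event)
    # In the real function: read the object from S3, run aggregate(),
    # then write the result to the PostGIS-enabled PostgreSQL instance.
    return {"statusCode": 200, "body": json.dumps({"bucket": bucket, "key": key})}
```

The same `aggregate` function can be exercised locally against the files in `examples/` before any infrastructure is deployed.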

> [!IMPORTANT]
> Ensure that your solution is modular, well-documented, and follows best practices for security and maintainability.

## 📌 Requirements

### 🛠 Tech Stack

> ⚡ **Must Include:**

- **IaC:** Any tool of your choice (**Terraform preferred**, but others are allowed if justified).
- **AWS Services:** S3, Lambda, API Gateway, CloudWatch, PostgreSQL with PostGIS (RDS or self-hosted on EC2).

### 📄 Expected Deliverables

> 📥 **Your submission must be a Pull Request that includes:**

- **An IaC module** that deploys the entire architecture.
- **A `README.md`** with deployment instructions and tool selection justification.
- **A working API Gateway endpoint** that returns the aggregated data stored in PostgreSQL.
- **CloudWatch logs** capturing Lambda execution details.
- **Example input files** to trigger the data pipeline (placed in an `examples/` directory).
- **A sample event payload** (JSON format) to simulate the S3 `PUT` event.

> [!TIP]
> Use the `docs` folder to store any additional documentation or diagrams that help explain your solution.
> Mention any assumptions or constraints in your `README.md`.

## 🌟 Nice to Have

> 💡 **Bonus Points For:**

- **Data Quality & Validation**: Implementing **schema validation before storing data in PostgreSQL**.
- **Indexing & Query Optimization**: Using **PostGIS spatial indexing** for efficient geospatial queries.
- **Monitoring & Alerts**: Setting up **AWS CloudWatch Alarms** for S3 event failures or Lambda errors.
- **Automated Data Backups**: Creating periodic **database backups to S3** using AWS Lambda or AWS Backup.
- **GitHub Actions for validation**: Running **`terraform fmt`, `terraform validate`**, or equivalent for the chosen IaC tool.
- **Pre-commit hooks**: Ensuring linting and security checks before committing.
- **Docker for local testing**: Using **Docker Compose to spin up**:
- Running a local PostgreSQL database with PostGIS to simulate the cloud environment 🛠
- Providing a local S3-compatible service (e.g., MinIO) to test file ingestion before deployment 🖥
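The schema-validation bonus above can start as a plain required-field and type check before any database insert; a stdlib-only sketch, where the field list is taken from the example flight payloads and everything else is an assumption:

```python
# Required fields and their expected types (subset chosen for illustration).
REQUIRED_FIELDS = {
    "flight_id": int,
    "airline": str,
    "flight_status": str,
    "lat": (int, float),
    "lon": (int, float),
}


def validate(record):
    """Return a list of problems; an empty list means the record may be stored."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors
```

Records that fail validation can be logged to CloudWatch and skipped, rather than aborting the whole batch.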

> [!TIP]
> Looking for inspiration or additional ideas to earn extra points? Check out our **[Awesome NaNLABS repository](https://github.com/nanlabs/awesome-nan)** for reference projects and best practices! 🚀

## 📥 Submission Guidelines

> 📌 **Follow these steps to submit your solution:**

1. **Fork this repository.**
2. **Create a feature branch** for your implementation.
3. **Commit your changes** with meaningful commit messages.
4. **Open a Pull Request** following the provided template.
5. **Our team will review** and provide feedback.

## ✅ Evaluation Criteria

> 🔍 **What we'll be looking at:**

- **Correctness and completeness** of the **data pipeline**.
- **Use of best practices for event-driven processing** (S3 triggers, Lambda execution).
- **Data transformation & aggregation logic** implemented in Lambda.
- **Optimization for geospatial queries** using PostGIS.
- **Data backup & integrity strategies** (optional, e.g., automated S3 backups).
- **CI/CD automation using GitHub Actions and pre-commit hooks** (optional).
- **Documentation clarity**: Clear explanation of data flow, transformation logic, and infrastructure choices.

## 🎯 **Good luck and happy coding!** 🚀
1. Run `docker-compose up --build` to build the image and start the container.
2. Run `docker exec -it container-name /bin/sh` to open a shell inside the container.
3. Run `terraform init` to initialize the Terraform working directory.
4. Run `terraform plan` to review the execution plan for the resources.
5. Run `terraform apply` to deploy the changes to AWS.
Binary file added case_diagram.png
16 changes: 16 additions & 0 deletions docker-compose.yaml
@@ -0,0 +1,16 @@
services:
  terraform:
    image: hashicorp/terraform:light
    volumes:
      - ./terraform:/workspace
      - ~/.aws:/root/.aws
    env_file:
      - .env
    working_dir: /workspace
    entrypoint: /bin/sh
    tty: true
    stdin_open: true

networks:
  confluent:
    driver: bridge
86 changes: 86 additions & 0 deletions examples/flights.json
@@ -0,0 +1,86 @@
[
  {
    "actual_departure_time": "2025-07-15T09:10:45.223184",
    "actual_landed_time": null,
    "airline": "Delta Airlines",
    "arrival_airport": "John F. Kennedy International Airport",
    "arrival_city": "New York",
    "current_altitude_m": 10450,
    "current_location": {
      "latitude": 47.6097,
      "longitude": -55.3502
    },
    "current_speed_km_h": 902,
    "departure_airport": "Heathrow Airport",
    "departure_city": "London",
    "dest_lat": 40.6413,
    "dest_lon": -73.7781,
    "direction": 282.35,
    "distance_travelled": 0,
    "distance_travelled_km": 3120.5,
    "flight_id": 2,
    "flight_status": "active",
    "lat": 51.4700,
    "lon": -0.4543,
    "scheduled_arrival_time": "2025-07-15T12:45:00.000000",
    "scheduled_departure_time": "2025-07-15T09:00:00.223184",
    "speed": 905,
    "start_time": "Tue, 15 Jul 2025 09:10:45 GMT"
  },
  {
    "actual_departure_time": "2025-08-21T23:05:13.987654",
    "actual_landed_time": "2025-08-22T05:41:09.123456",
    "airline": "LATAM Airlines",
    "arrival_airport": "São Paulo/Guarulhos International Airport",
    "arrival_city": "São Paulo",
    "current_altitude_m": 0,
    "current_location": {
      "latitude": -23.4356,
      "longitude": -46.4731
    },
    "current_speed_km_h": 0,
    "departure_airport": "El Dorado International Airport",
    "departure_city": "Bogotá",
    "dest_lat": -23.4356,
    "dest_lon": -46.4731,
    "direction": 0.0,
    "distance_travelled": 4325.7,
    "distance_travelled_km": 4325.7,
    "flight_id": 3,
    "flight_status": "landed",
    "lat": 4.7016,
    "lon": -74.1469,
    "scheduled_arrival_time": "2025-08-22T05:30:00.000000",
    "scheduled_departure_time": "2025-08-21T23:00:00.000000",
    "speed": 0,
    "start_time": "Thu, 21 Aug 2025 23:05:13 GMT"
  },
  {
    "actual_departure_time": "2025-08-22T23:05:13.987654",
    "actual_landed_time": "2025-08-22T23:06:13.987654",
    "airline": "Singapore Airlines",
    "arrival_airport": "Changi Airport",
    "arrival_city": "Singapore",
    "current_altitude_m": 0,
    "current_location": {
      "latitude": 35.5494,
      "longitude": 139.7798
    },
    "current_speed_km_h": 0,
    "departure_airport": "Tokyo Haneda Airport",
    "departure_city": "Tokyo",
    "dest_lat": 1.3644,
    "dest_lon": 103.9915,
    "direction": 0.0,
    "distance_travelled": 0,
    "distance_travelled_km": 0.0,
    "flight_id": 4,
    "flight_status": "scheduled",
    "lat": 35.5494,
    "lon": 139.7798,
    "scheduled_arrival_time": "2025-09-30T04:45:00.000000",
    "scheduled_departure_time": "2025-09-29T23:55:00.000000",
    "speed": 0,
    "start_time": "Mon, 29 Sep 2025 23:55:00 GMT"
  }
]
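As a local smoke test, the sample records above can be grouped by `flight_status` without any AWS infrastructure; a minimal sketch (the grouping choice is an assumption, the field name comes from the file):

```python
import json
from collections import Counter


def status_counts(path):
    """Count flight records per flight_status in an examples file."""
    with open(path) as fh:
        flights = json.load(fh)
    return Counter(f["flight_status"] for f in flights)

# For examples/flights.json above this yields one record each of
# "active", "landed", and "scheduled".
```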