
This repository contains a project for processing Brazilian epidemiological data extracted from DATASUS. Its main goal is to prepare and structure the data for use in the HDI Disease Tracker API.

pedropaulodutra/hdi-disease-tracker

HDI - Disease Tracker

📌 Overview

HDI's Disease Tracker is an application that loads raw Parquet files, selects the relevant columns, cleans the data, and writes new, translated Parquet files. It is developed for use by Health Data Insights and serves as the foundation for the HDI - Disease Tracker API.

🚀 Features

  • 📁 Raw Data Loading: Reads previously extracted Parquet files that must be manually placed in the data/raw folder.
  • 🧼 Data Cleaning: Standardizes column names and values to ensure consistency.
  • 🔎 Column Selection: Keeps only the fields relevant to epidemiological analysis.
  • 🌍 Location Mapping: Adds region and municipality information based on IBGE tables.
  • 🏥 Healthcare Facility Mapping: Integrates CNES codes to retrieve hospital names and locations.
  • 📤 Data Export: Generates clean, translated Parquet files ready for use.
  • 🔄 Modular Pipeline: Structured in reusable modules for loading, cleaning, selecting, and exporting data.
  • 🧪 API-Ready: Serves as a data foundation for the HDI Disease Tracker API.
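The cleaning, column selection, and location mapping steps listed above could look roughly like the following pandas sketch. This is illustrative only and is not the code in `src/`: the column names (`sex`, `municipality_code`), the `SEX_MAP` dictionary, the `IBGE_LOOKUP` table, and the function names are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical value map; the project's real dictionaries live in src/maps/.
SEX_MAP = {"M": "Male", "F": "Female"}

# Hypothetical stand-in for an IBGE lookup table (illustrative rows only).
IBGE_LOOKUP = pd.DataFrame(
    {
        "municipality_code": ["355030", "330455"],
        "municipality": ["São Paulo", "Rio de Janeiro"],
        "region": ["Southeast", "Southeast"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and translate coded values."""
    out = df.rename(columns=lambda name: name.strip().lower())
    out["sex"] = out["sex"].map(SEX_MAP)
    return out


def select(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep only the columns needed for analysis."""
    return df[columns]


def add_location(df: pd.DataFrame) -> pd.DataFrame:
    """Attach municipality and region names via a left join on the code."""
    return df.merge(IBGE_LOOKUP, on="municipality_code", how="left")


# Small in-memory frame standing in for a raw Parquet file.
raw = pd.DataFrame(
    {
        " SEX ": ["M", "F"],
        "MUNICIPALITY_CODE": ["355030", "330455"],
        "UNUSED": [1, 2],
    }
)

result = add_location(select(clean(raw), ["sex", "municipality_code"]))
print(result)
```

A left join keeps every input row even when a municipality code is missing from the lookup table, so unmatched records end up with empty location fields instead of being dropped.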

Folder Structure

The project follows the structure below:

hdi-disease-tracker/
├── data/                              # Main data directory
│   ├── raw/                           # Stores raw (.parquet) files to be processed
│   └── processed/                     # Stores cleaned and transformed files
├── src/                               # Source code for the data pipeline
│   ├── maps/                          # Value mappings for standardization
│   │   ├── __init__.py                # Marks the directory as a Python package
│   │   ├── value_maps.py              # Contains dictionaries for value normalization
│   │   └── municipes_cnes_map.py      # Maps municipality names to CNES codes
│   ├── __init__.py                    # Initializes the `src` package
│   ├── data_cleaner.py                # Cleans and standardizes the dataset
│   ├── data_loader.py                 # Loads raw parquet files into DataFrames
│   ├── data_selector.py               # Selects relevant columns for analysis
│   └── data_converter.py              # Converts and saves final parquet output
├── main.py                            # Entry point script to run the pipeline
├── .gitignore                         # Specifies files/folders ignored by Git
├── LICENSE                            # Project license file
├── README.md                          # Project documentation
└── requirements.txt                   # Python dependencies
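
As a rough illustration of how the `data/raw` and `data/processed` folders relate, the sketch below lists the Parquet files waiting in `data/raw` and works out where each processed file would go. The helper names `find_raw_files` and `output_path_for` are hypothetical and do not come from the project. Reusing the input file name for the output is also an assumption.

```python
from pathlib import Path


def find_raw_files(project_root: Path) -> list:
    """Return the .parquet files in data/raw, sorted by name."""
    raw_dir = project_root / "data" / "raw"
    return sorted(raw_dir.glob("*.parquet"))


def output_path_for(raw_file: Path, project_root: Path) -> Path:
    """Build the matching path in data/processed, creating the folder if needed."""
    processed_dir = project_root / "data" / "processed"
    processed_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the processed file keeps the raw file's name.
    return processed_dir / raw_file.name


if __name__ == "__main__":
    root = Path(".")
    for raw_file in find_raw_files(root):
        print(raw_file, "->", output_path_for(raw_file, root))
```

Run from the repository root, this prints one line per raw file and nothing at all if `data/raw` is empty.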

🚀 Installation

To set up the project locally, follow these steps:

  1. Clone the repository:
    git clone https://github.com/PedroDutra86/hdi-disease-tracker.git
    cd hdi-disease-tracker
    
  2. Create and activate a virtual environment:
  • On Windows (PowerShell):
    python -m venv venv
    venv/Scripts/activate
  • On macOS/Linux:
    python3 -m venv venv
    source venv/bin/activate
    
  3. Install dependencies:
    pip install -r requirements.txt

⚒️ Usage

After setting up the environment and installing dependencies, run the project with:
    python src/data_converter.py

This will execute the full processing pipeline, including loading the data, selecting relevant columns, cleaning and translating information, and generating a new output file.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

📩 Contact

For questions or suggestions, reach out via:
