This project implements a **Retrieval-Augmented Generation (RAG) data ingestion pipeline** using Ray, OpenSearch, and PostgreSQL with pgvector for large-scale ML workloads.
rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py           # Converts JSONL to Parquet
│   ├── embeddings.py        # Handles embedding generation with Ray
│   ├── opensearch_store.py  # Stores embeddings in OpenSearch
│   ├── pgvector_store.py    # Stores embeddings in PostgreSQL
│   └── pipeline.py          # Main script to run ingestion pipeline
├── requirements.txt         # Python dependencies
└── README.md
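
The contents of `src/convert.py` are not reproduced here. As a rough sketch of what a JSONL-to-Parquet step could look like, the following assumes `pyarrow` is available and that each line of `data/raw/data.jsonl` is a flat JSON object. The function names `parse_jsonl` and `convert` are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def convert(src: Path, dst: Path) -> int:
    """Write the records in `src` to a Parquet file at `dst`; return the row count."""
    # Lazy import (assumption: pyarrow is the Parquet backend), so parse_jsonl
    # stays usable even where pyarrow is not installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    with src.open(encoding="utf-8") as f:
        records = list(parse_jsonl(f))
    table = pa.Table.from_pylist(records)  # column types are inferred from the records
    dst.parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, str(dst))
    return table.num_rows
```

With the layout above, a call such as `convert(Path("data/raw/data.jsonl"), Path("data/processed/data.parquet"))` would produce the processed file.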
- Distributed embedding generation with Ray
- Storage in OpenSearch for approximate nearest neighbor (ANN) retrieval
- Storage in PostgreSQL with pgvector for k-nearest-neighbor (k-NN) searches
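
To make the two retrieval paths above more concrete, here is a minimal sketch of the request shapes involved: a k-NN query body for the OpenSearch k-NN plugin and a parameterized pgvector query ordered by cosine distance (`<=>`). The index field `embedding`, the table `documents`, and its `id` and `text` columns are assumptions, not names taken from this project, and the helpers only build payloads without contacting either store.

```python
from typing import Sequence


def opensearch_knn_query(vector: Sequence[float], k: int = 5, field: str = "embedding") -> dict:
    """Build a search body for a `knn_vector` field (field name is hypothetical)."""
    return {"size": k, "query": {"knn": {field: {"vector": list(vector), "k": k}}}}


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format a vector in pgvector's text form, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"


def pgvector_knn_sql(table: str = "documents", column: str = "embedding") -> str:
    """Build SQL that returns the k rows closest to the query vector by cosine distance.

    Table and column names are hypothetical and interpolated directly, so they
    must come from trusted code, never from user input. The query vector and k
    are left as named placeholders for the database driver to bind.
    """
    return (
        f"SELECT id, text FROM {table} "
        f"ORDER BY {column} <=> %(query)s::vector LIMIT %(k)s"
    )
```

A driver call would then bind `{"query": to_pgvector_literal(vec), "k": 5}` to the SQL, while the OpenSearch body would be passed to the client's search call for the relevant index.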

Install the dependencies, convert the raw JSONL data to Parquet, then run the ingestion pipeline:

pip install -r requirements.txt
python src/convert.py
python src/pipeline.py
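
What `src/pipeline.py` does internally is not shown here. As a hedged illustration of the distributed embedding step it drives, the sketch below splits texts into batches and, when `use_ray=True`, dispatches each batch as a Ray task. `embed_batch` is a placeholder that returns toy vectors rather than real model embeddings, and all function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split an iterable into lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_batch(texts: List[str]) -> List[List[float]]:
    """Placeholder, NOT a real model: maps each text to a toy 2-dimensional vector."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def embed_all(texts: Iterable[str], batch_size: int = 2, use_ray: bool = False) -> List[List[float]]:
    """Embed every text, optionally fanning the batches out as Ray tasks."""
    batches = list(batched(texts, batch_size))
    if use_ray:
        import ray  # imported lazily so the local path works without Ray installed

        ray.init(ignore_reinit_error=True)
        remote_embed = ray.remote(embed_batch)
        # ray.get on a list of object refs returns results in submission order.
        results = ray.get([remote_embed.remote(b) for b in batches])
    else:
        results = [embed_batch(b) for b in batches]
    # Flatten per-batch results back into one list, preserving input order.
    return [vec for batch in results for vec in batch]
```

In a real pipeline the placeholder would be replaced by a model call, and the resulting vectors would be handed to the OpenSearch and pgvector storage modules.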