RAG Data Ingestion Pipeline

Overview

This project implements a scalable, modular Retrieval-Augmented Generation (RAG) data ingestion pipeline for large-scale ML workloads, using Ray for distributed processing and OpenSearch and PostgreSQL (with pgvector) as vector stores.

Project Structure:

rag-data-ingestion-pipeline/
├── data/
│   ├── raw/
│   │   └── data.jsonl
│   └── processed/
│       └── data.parquet
├── src/
│   ├── convert.py             # Converts JSONL to Parquet
│   ├── embeddings.py          # Generates embeddings with Ray
│   ├── opensearch_store.py    # Stores embeddings in OpenSearch
│   ├── pgvector_store.py      # Stores embeddings in PostgreSQL (pgvector)
│   └── pipeline.py            # Main script to run the ingestion pipeline
├── requirements.txt           # Python dependencies
└── README.md

Features

  • Distributed embedding generation with Ray (a minimal sketch follows this list)
  • Storage in OpenSearch for approximate nearest-neighbor (ANN) retrieval
  • Storage in PostgreSQL with pgvector for k-NN search via SQL
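
A minimal sketch of the distributed embedding step, assuming Ray Data and a sentence-transformers model. The model name, column names, batch size, and output path are illustrative assumptions; the project's actual logic lives in src/embeddings.py.

import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    """Loads the model once per actor and embeds batches of rows."""

    def __init__(self):
        # Hypothetical model choice; substitute whatever embeddings.py configures.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        # Assumes a "text" column in the processed Parquet file.
        batch["embedding"] = list(self.model.encode(batch["text"].tolist()))
        return batch

ray.init()
ds = ray.data.read_parquet("data/processed/data.parquet")
ds = ds.map_batches(Embedder, concurrency=4, batch_size=256, batch_format="pandas")
ds.write_parquet("data/processed/embedded")  # hypothetical output location

Wrapping the model in a callable class lets Ray run it in an actor pool, so the weights load once per worker instead of once per batch.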

Setup

1. Install dependencies

pip install -r requirements.txt
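
requirements.txt is the authoritative dependency list. Given the stack described above, it plausibly contains packages along these lines (an illustration, not the pinned file):

ray[data]
sentence-transformers
opensearch-py
psycopg2-binary
pgvector
pandas
pyarrow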

2. Convert Data

python src/convert.py
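
src/convert.py owns this step; sketched with pandas, and assuming one JSON object per line in the raw file, the equivalent operation is:

import pandas as pd

# Read line-delimited JSON and write it back out as Parquet.
df = pd.read_json("data/raw/data.jsonl", lines=True)
df.to_parquet("data/processed/data.parquet", index=False)  # needs pyarrow or fastparquet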

3. Run Ingestion Pipeline

python src/pipeline.py
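
src/pipeline.py ties the stages together. The sketch below shows what the two storage backends could look like, assuming records that already carry an "embedding" field from the embedding step; the index and table names, the 384-dimensional vectors, and the connection settings are illustrative assumptions, not values taken from the repository.

import numpy as np
import pandas as pd
import psycopg2
from opensearchpy import OpenSearch, helpers
from pgvector.psycopg2 import register_vector

# Records with "text" and "embedding" fields (produced by the embedding step).
rows = pd.read_parquet("data/processed/data.parquet").to_dict("records")

# --- OpenSearch: k-NN index for ANN retrieval ---
os_client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
os_client.indices.create(
    index="rag-docs",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384},
        }},
    },
    ignore=400,  # tolerate "index already exists"
)
helpers.bulk(os_client, (
    {"_index": "rag-docs",
     "_source": {"text": r["text"],
                 "embedding": [float(x) for x in r["embedding"]]}}
    for r in rows
))

# --- PostgreSQL + pgvector: k-NN search from SQL ---
conn = psycopg2.connect("dbname=rag user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS docs "
                "(id serial PRIMARY KEY, text text, embedding vector(384))")
register_vector(conn)  # adapts numpy arrays to the vector column type
with conn, conn.cursor() as cur:
    for r in rows:
        cur.execute("INSERT INTO docs (text, embedding) VALUES (%s, %s)",
                    (r["text"], np.asarray(r["embedding"])))

With data in both stores, retrieval can use OpenSearch's ANN index or query pgvector directly from SQL, matching the two storage features listed above.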
