Web RAG Pipeline

A complete web data Retrieval-Augmented Generation (RAG) pipeline built with TypeScript and Bun that scrapes news articles using Selenium, embeds them with Jina's cloud embeddings API, and stores semantic vectors in Qdrant vector database for fast similarity search and AI-powered applications.

Project Description

This project demonstrates an end-to-end pipeline that:

Scrapes headlines and summaries from news websites (currently Wired) using Selenium-powered browser automation.
Converts the scraped textual content into semantic vector embeddings via Jina's cloud neural embeddings API.
Uploads and indexes these embeddings with metadata (headline and summary) into a Qdrant cloud-managed vector database optimized for similarity search.
Enables downstream AI use cases such as semantic search, recommendation, or document retrieval over large volumes of web data.

Features

Modular scraper supporting expansion to multiple websites with clean separation.
Easy-to-use Jina cloud API integration for fast and accurate embedding generation.
Efficient, scalable storage and search with Qdrant vector database.
TypeScript and Bun-powered for fast development and execution.
Robust, production-ready practices including error handling and environment-based API configurations.

Prerequisites

Node.js or Bun installed (tested with Bun).
Docker installed (optional, if you want local Qdrant).
Jina.cloud API key.
Qdrant cloud API key and URL.
A modern browser and Chromedriver installed and accessible.

Setup Instructions

Clone the repository:

git clone https://github.com/your-username/web-rag-pipeline.git
cd web-rag-pipeline

Copy .env.example to .env and add your API keys:

JINA_API_KEY=your_jina_api_key QDRANT_API_KEY=your_qdrant_api_key QDRANT_URL=https://your-qdrant-cloud-instance

Install dependencies: bun install
Run the full pipeline: bun run main.ts

Project Structure

── main.ts # Orchestrates scraping → embedding → DB storage src/ ├── scraper.ts # Selenium web scraping module ├── embedder.ts # Jina cloud embedding API interface ├── qdrant.ts # Qdrant cloud client and ingestion logic

.env # API keys and configs package.json # Scripts and dependencies

Next Steps

Add more scrapers for other sources. Implement query and retrieval functionality over Qdrant. Add a frontend interface or API server for interactive search. Explore advanced embedding techniques or hybrid search with metadata filtering.

Summary

This pipeline serves as a foundational project for building powerful, scalable AI applications leveraging the synergy of web scraping, vector embedding, and similarity search.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
.gitignore		.gitignore
README.md		README.md
bun.lock		bun.lock
main.ts		main.ts
package.json		package.json
tsconfig.json		tsconfig.json
wired_articles.json		wired_articles.json
wired_articles_embedded.json		wired_articles_embedded.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Web RAG Pipeline

Project Description

Features

Prerequisites

Setup Instructions

Project Structure

Next Steps

Summary

About

Uh oh!

Releases

Packages

Languages

abhisheksuhag/web-rag-pipeline

Folders and files

Latest commit

History

Repository files navigation

Web RAG Pipeline

Project Description

Features

Prerequisites

Setup Instructions

Project Structure

Next Steps

Summary

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages