A Python-based solution that uses a LangChain/FAISS-powered pipeline to automatically fetch, chunk, embed, and index transcripts from YouTube videos, then uses a Groq-hosted LLM to enable natural-language Q&A over the video content, while logging the chat history for reproducibility and analysis.

sameer-at-git/Youtube-Chatbot-using-Transcripts-with-GroqLLM

YouTube Chatbot Q&A with Groq LLM

Python · YouTube Transcript API · LangChain · FAISS · HuggingFace · Groq


Example: Asking a Question from YouTube Transcript

from chat import ask_question
from vector import create_faiss_index

# Load FAISS index
faiss_index = create_faiss_index("index/faiss_index.json")

# Ask a question about the video content
answer = ask_question(faiss_index, "What is the main topic discussed?")
print(answer)

Example output:

Answer: The video discusses the advancements in nuclear fusion technology and its potential impact on future energy systems.

Project Description

This project allows you to:

  • Fetch YouTube video transcripts automatically.
  • Split the transcript into manageable chunks.
  • Generate vector embeddings using HuggingFace models.
  • Store and retrieve embeddings using FAISS vector store.
  • Query the transcript intelligently with a Groq-powered LLM (ChatGroq).
  • Save chat history in JSON format for later analysis.

The LLM is instructed to answer from the retrieved transcript text only, so responses stay grounded in the video content.
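To illustrate how answers can be restricted to the transcript, here is a minimal sketch of a context-only prompt. The function name `build_prompt` and the prompt wording are assumptions for illustration, not the project's actual prompt.

```python
def build_prompt(context_chunks, question):
    """Build a prompt that restricts the model to the supplied transcript context."""
    # Drop empty or whitespace-only chunks before joining.
    context = "\n\n".join(chunk.strip() for chunk in context_chunks if chunk.strip())
    return (
        "You are a helpful assistant.\n"
        "Answer ONLY from the transcript context below. "
        "If the context is insufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks; in the real pipeline these would come from the retriever.
prompt = build_prompt(
    ["Fusion reactors confine plasma with magnetic fields.", "  "],
    "How is the plasma confined?",
)
print(prompt)
```

The resulting string would then be passed to the LLM; telling the model to admit when the context is insufficient is one common way to reduce answers that go beyond the transcript.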


Features

  • Automatic transcript fetching from YouTube videos.
  • Recursive text splitting for long transcripts.
  • FAISS-based semantic search and retrieval.
  • HuggingFace all-MiniLM-L6-v2 embeddings for semantic encoding.
  • Integration with Groq LLM (llama-3.1-8b-instant) for question answering.
  • Logging of all user/AI interactions to chat_history.json.
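The idea behind recursive text splitting can be sketched in plain Python. This is a simplified stand-in, not LangChain's RecursiveCharacterTextSplitter: it applies no chunk overlap, and the function name `recursive_split` and its defaults are assumptions.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer
    ones for pieces that are still too long. No overlap is applied.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep growing the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma delta epsilon zeta eta theta"
print(recursive_split(sample, chunk_size=12))
```

Preferring paragraph and line breaks over hard cuts tends to keep related sentences in the same chunk, which usually helps retrieval quality.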

Tech Stack

  • Python 3.10+
  • YouTube Transcript API – fetch video transcripts.
  • LangChain libraries (langchain_text_splitters, langchain_community, langchain_groq)
    • RecursiveCharacterTextSplitter
    • FAISS vector store
    • HuggingFaceEmbeddings
    • ChatGroq (Groq LLM)
  • FAISS – fast similarity search for embeddings.
  • HuggingFace Embeddings – all-MiniLM-L6-v2 semantic embeddings.
  • Groq Cloud – access Groq LLMs via API key.
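As a rough picture of what similarity search over embeddings does, the sketch below runs an exhaustive nearest-neighbour search with Euclidean (L2) distance in plain Python. The toy vectors and the `top_k` helper are assumptions for illustration; FAISS performs this kind of search with optimized index structures, and the project's actual index settings are not shown here.

```python
import math


def top_k(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query (L2 distance)."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
chunk_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
query_vector = [1.0, 0.0, 0.0]

nearest = top_k(query_vector, chunk_vectors, k=2)
print(nearest)
```

The indices returned identify which transcript chunks would be handed to the LLM as context.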

Setup Guide

1. Clone the Repository

git clone <repo-url>
cd <repo-folder>

2. Install Dependencies

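pip install youtube-transcript-api langchain_text_splitters langchain_community langchain_groq

The FAISS vector store and HuggingFaceEmbeddings also need their backend packages, if they are not already installed:

pip install faiss-cpu sentence-transformers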

3. Set Up Groq API

  1. Sign up at Groq Cloud.
  2. Generate an API key from your dashboard.
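  3. When you run the script, you will be prompted for your Groq API key if the GROQ_API_KEY environment variable is not already set:

import getpass
import os

# Prompt for the key only when it is missing from the environment
if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")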

4. Run the Script

python main.py
  • Fetch a transcript from a YouTube video by providing the video ID.
  • The transcript will be split, embedded, and stored in FAISS.
  • You can query the video using natural language questions.
  • Chat interactions are automatically logged to chat_history.json.
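Since the script expects a video ID rather than a full link, a small helper can pull the ID out of common YouTube URL shapes. This is a sketch under assumptions: `extract_video_id` is a hypothetical helper, not part of the project, and it covers only the URL patterns listed in the code.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url_or_id):
    """Return the video ID from a YouTube URL, or the input itself if it is a bare ID."""
    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0]
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in ("embed", "shorts", "live"):
            return parts[1]
        return None
    # No recognised host: treat input without a URL scheme as a bare video ID.
    return url_or_id if not parsed.scheme else None


# Placeholder ID used purely for illustration.
print(extract_video_id("https://www.youtube.com/watch?v=abc123XYZ_-&t=10s"))
```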

Usage Example

from main import main_chain, invoke_with_logging

# Ask a question
response = invoke_with_logging(main_chain, "Can you summarize the video in less than 20 words?")
print(response)
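For readers curious how a logging wrapper of this kind might work, here is a minimal sketch. It is not the project's `invoke_with_logging`: the name `invoke_and_log`, the record fields, and the `EchoChain` stub are assumptions, and the only thing assumed about the chain is that it exposes an `invoke` method.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def invoke_and_log(chain, question, history_path="chat_history.json"):
    """Call chain.invoke(question), append the exchange to a JSON file, return the answer."""
    answer = chain.invoke(question)
    path = Path(history_path)
    # Load earlier entries if the log already exists, otherwise start fresh.
    history = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    history.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": str(answer),
    })
    path.write_text(json.dumps(history, ensure_ascii=False, indent=2), encoding="utf-8")
    return answer


class EchoChain:
    """Stand-in for a real chain so the sketch runs without an API key."""

    def invoke(self, question):
        return f"(stub answer to: {question})"


# Demonstrate with a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "chat_history.json"
    invoke_and_log(EchoChain(), "What is the video about?", log_file)
    invoke_and_log(EchoChain(), "Who is the speaker?", log_file)
    logged = json.loads(log_file.read_text(encoding="utf-8"))

print(len(logged))  # 2
```

Rewriting the whole file on each call keeps the log valid JSON at all times, at the cost of some overhead for very long histories.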


File Outputs

  • transcript_snippets.json – raw transcript snippets.
  • script.json – full transcript text.
  • chunks.json – transcript split into chunks for embeddings.
  • faiss_index/ – saved FAISS index for semantic search.
  • retriever_config.json – retriever settings.
  • chat_history.json – logged conversation history.
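The JSON outputs listed above could be produced along these lines. This is a sketch under assumptions: the snippet shape (dictionaries with a "text" key), the "transcript" key inside script.json, and the `save_outputs` helper are all hypothetical and may differ from what the script actually writes.

```python
import json
import tempfile
from pathlib import Path


def save_outputs(snippets, chunks, out_dir="."):
    """Write raw snippets, the joined transcript, and the chunks as JSON files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Join the per-caption snippets into one transcript string.
    script = " ".join(s["text"].strip() for s in snippets)
    payloads = {
        "transcript_snippets.json": snippets,
        "script.json": {"transcript": script},  # key name is an assumption
        "chunks.json": chunks,
    }
    for name, payload in payloads.items():
        with open(out / name, "w", encoding="utf-8") as f:
            json.dump(payload, f, ensure_ascii=False, indent=2)
    return script


# Demonstrate in a temporary directory with made-up snippets.
demo_snippets = [
    {"text": "hello ", "start": 0.0, "duration": 1.0},
    {"text": "world", "start": 1.0, "duration": 1.0},
]
with tempfile.TemporaryDirectory() as tmp:
    full_text = save_outputs(demo_snippets, ["hello world"], tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())

print(full_text)
print(written)
```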

Notes

  • The project uses FAISS for fast retrieval and HuggingFace embeddings for semantic similarity.
  • The Groq-hosted LLM (llama-3.1-8b-instant) answers questions using the retrieved transcript chunks as context.
  • All chat history is saved locally in JSON format for reproducibility.
  • Videos without captions are handled gracefully: the TranscriptsDisabled exception is caught.

License

MIT License
