Updated Oct 14, 2025
llm-training-data
Here are 8 public repositories matching this topic...
SyGra - Graph-oriented Synthetic data generation Pipeline
Updated Oct 17, 2025 - Python
The Open Email Marketing Dataset, which you can use without any restrictions.
Updated Jul 12, 2025
An open-source collection of datasets, guides, and rankings for B2B email marketing and lead generation. Your go-to resource for sales prospecting strategies.
Updated Sep 13, 2025
👨‍🏫 This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using Google Gemini APIs with a custom UI built on Streamlit.
Updated Jul 27, 2025 - Python
Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
Updated Oct 4, 2025 - HTML
Sample edition of The Stack Enriched, an annotated, secure, and optimized code dataset.
Updated Jul 19, 2025 - Python
🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.
Updated Jul 15, 2025 - Python