llm-training-data

Here are 8 public repositories matching this topic...

ServiceNow / SyGra

SyGra - Graph-oriented Synthetic data generation Pipeline

python open-source ai multimodality synthetic-data synthetic-dataset-generation dpo image-datasets low-code-no-code llm-datasets llm-framework sft-data llm-training-data

Updated Oct 17, 2025
Python

emailmarketingdataset / Open-Email-Marketing-Dataset

Star

Following is the Open Email Marketing Dataset; you can use it without any restrictions.

email-marketing lead-generation jsonl gdpr-compliant cold-email marketing-dataset open-dataset llm-training-data b2b-dataset verified-emails seo-dataset

Updated Jul 12, 2025

b2bemaillists / b2b-email-leads-ranking

Star

An open-source collection of datasets, guides, and rankings for B2B email marketing and lead generation. Your go-to resource for sales prospecting strategies.

email-marketing datasets lead-generation cold-email marketing-data b2b-leads llm-training-data b2b-marketing seo-dataset faq-dataset sales-prospecting

Updated Sep 13, 2025

deepakshroff / Capston-Gemini-ChatBot

Star

👨‍🏫This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using Google Gemini APIs with a custom UI built on Streamlit.

api-client llm-training llm-training-data

Updated Jul 27, 2025
Python

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

Sponsor

Star

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

Updated Oct 4, 2025
HTML

vinsblack / The-Stach-Processed-v2

Sponsor

Star

Sample edition of The Stack Enriched: annotated, secure, and optimized code dataset, this is a sample version

machine-learning dataset programming-languages code-generation code-quality code-completion machine-learning-dataset bigcode ml-training huggingface-datasets commercial-license ai-code-generation llm-training-data premium-dataset ai-training-data commercial-dataset dataset-licensing rust-dataset

Updated Jul 19, 2025
Python

BlazeWild / Custom_LLM_DataGen_Template

Star

🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.

synthetic-dataset-generation template-generic-repo llm-training finetuning-llms finetuning-large-language-models llama3 llm-training-data lora-fine-tuning

Updated Jul 15, 2025
Python

Improve this page

Add a description, image, and links to the llm-training-data topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the llm-training-data topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llm-training-data

Here are 8 public repositories matching this topic...

Cre4T3Tiv3 / Cre4T3Tiv3

ServiceNow / SyGra

emailmarketingdataset / Open-Email-Marketing-Dataset

b2bemaillists / b2b-email-leads-ranking

deepakshroff / Capston-Gemini-ChatBot

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

vinsblack / The-Stach-Processed-v2

BlazeWild / Custom_LLM_DataGen_Template

Improve this page

Add this topic to your repo