Collection of text2cypher datasets, evaluations, and finetuning instructions
Updated Jun 13, 2024 - Jupyter Notebook
Repository for organizing datasets and papers used in Open LLM.
A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks
SyGra - Graph-oriented Synthetic data generation Pipeline
A collection of recent open-source math datasets for training and evaluating Math LLMs
Efficiently fetch and perform sentiment analysis (Turkish Only) on eksisozluk.com entries using Rust
A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.
Synthetically Generating Intent-Aware Information-Seeking Dialogues! Useful for various tasks such as training/evaluating User Intent Predictors, with the possibility of training/evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.
Convert multi-speaker audio files to structured chat data for LLMs
Source code of many well-known Python repos, collected in this repo as pure localdocs for training code AI
LLM-Powered Dataset Creation Tool
WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)
Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.
A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.
Modifying and running the DCLM-Baseline data curation pipeline in German.
A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.