A comprehensive toolkit for building, training, fine-tuning, and deploying GPT-style language models with advanced data processing capabilities and CPU-friendly defaults.
LLMBuilder is a production-ready framework for training and fine-tuning Large Language Models (LLMs). Designed for developers, researchers, and AI engineers, LLMBuilder provides a full pipeline to go from raw text data to deployable, optimized LLMs, all running locally on CPUs or GPUs.
- Multi-Format Ingestion: Process HTML, Markdown, EPUB, PDF, and text files with intelligent extraction
- OCR Integration: Automatic OCR fallback for scanned PDFs using Tesseract
- Smart Deduplication: Both exact and semantic duplicate detection with configurable similarity thresholds
- Batch Processing: Parallel processing with configurable worker threads and batch sizes
- Multiple Algorithms: BPE, SentencePiece, Unigram, and WordPiece tokenizers
- Custom Training: Train tokenizers on your specific datasets with advanced configuration options
- Validation Tools: Built-in tokenizer testing and benchmarking utilities
- GGUF Pipeline: Convert trained models to GGUF format for llama.cpp compatibility
- Quantization Options: Support for F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0 quantization levels
- Batch Conversion: Convert multiple models with different quantization levels simultaneously
- Validation: Automatic output validation and integrity checking
- Template System: Pre-configured templates for different use cases (CPU, GPU, inference, etc.)
- Validation: Comprehensive configuration validation with detailed error reporting
- Override Support: Easy configuration customization with dot-notation overrides
- CLI Integration: Full command-line configuration management tools
- Complete Interface: Full command-line interface for all operations
- Interactive Modes: Guided setup and configuration for new users
- Progress Tracking: Real-time progress reporting with detailed logging
- Batch Operations: Support for processing multiple files and models
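As a rough illustration of how exact and semantic deduplication combine (this is a sketch of the general technique, not LLMBuilder's internal implementation), exact duplicates can be caught by hashing normalized text, while near-duplicates are dropped when a similarity score crosses a configurable threshold. Here word-set Jaccard similarity stands in for the embedding-based similarity a real semantic pass would use:

```python
import hashlib

def _jaccard(a, b):
    """Word-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedupe(lines, threshold=0.8):
    """Drop exact and near-duplicate lines.

    Exact duplicates: SHA-256 of lowercased, whitespace-collapsed text.
    Near-duplicates: word-set Jaccard similarity >= threshold
    (a cheap stand-in for embedding cosine similarity).
    """
    seen_hashes = set()
    kept, kept_word_sets = [], []
    for line in lines:
        norm = " ".join(line.lower().split())
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        words = set(norm.split())
        if any(_jaccard(words, prev) >= threshold for prev in kept_word_sets):
            continue  # near-duplicate
        seen_hashes.add(digest)
        kept.append(line)
        kept_word_sets.append(words)
    return kept
```

Note the quadratic scan over kept lines is only for illustration; production pipelines batch the similarity step, which is why batch size and GPU settings matter for the semantic pass.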
```bash
pip install llmbuilder

# Create a new project with default structure
llmbuilder init my_llm_project

# Navigate to your project directory
cd my_llm_project
```

This creates a structured project with the following directories:

- `data/raw` - Place your raw input files here (.txt, .pdf, .docx)
- `data/processed` - Processed text files
- `tokenizer` - Tokenizer files
- `models/checkpoints` - Training checkpoints
- `models/final` - Final trained models
- `configs` - Configuration files
- `outputs` - Output files
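If you ever need to recreate the same layout outside of `llmbuilder init` (say, in a script or CI job), a few lines of Python suffice; the directory names below simply mirror the list above:

```python
from pathlib import Path

PROJECT_DIRS = [
    "data/raw",            # raw input files (.txt, .pdf, .docx)
    "data/processed",      # processed text files
    "tokenizer",           # tokenizer files
    "models/checkpoints",  # training checkpoints
    "models/final",        # final trained models
    "configs",             # configuration files
    "outputs",             # output files
]

def scaffold(root="my_llm_project"):
    """Create the standard project layout under `root`."""
    for d in PROJECT_DIRS:
        Path(root, d).mkdir(parents=True, exist_ok=True)

scaffold()
```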
Complete documentation is available at: https://qubasehq.github.io/llmbuilder/
The documentation includes:
- Getting Started Guide - From installation to your first model
- User Guides - Comprehensive guides for all features
- CLI Reference - Complete command-line interface documentation
- Python API - Full API reference with examples
- Examples - Working code examples for common tasks
- FAQ - Answers to frequently asked questions
```bash
# Show help and available commands
llmbuilder --help

# Initialize a new project
llmbuilder init my_project

# Interactive welcome guide for new users
llmbuilder welcome

# Show package information and credits
llmbuilder info
```

```bash
# List available configuration templates
llmbuilder config templates

# Create a configuration from a template
llmbuilder config create --preset cpu_small -o configs/my_config.json

# Validate configuration with detailed reporting
llmbuilder config validate configs/my_config.json
```

```bash
# Process raw data files
llmbuilder data load -i data/raw -o data/processed/input.txt --clean

# Remove duplicates from your data
llmbuilder data deduplicate -i data/processed/input.txt -o data/processed/clean.txt --method both

# Train custom tokenizer
llmbuilder tokenizer train -i data/processed/clean.txt -o tokenizer/ --vocab-size 16000
```

```bash
# Train model
llmbuilder train model -d data/processed/clean.txt -t tokenizer/ -o models/checkpoints

# Interactive text generation setup
llmbuilder generate text --setup

# Generate text with custom parameters
llmbuilder generate text -m models/checkpoints/latest.pt -t tokenizer/ -p "Hello world" --temperature 0.8 --max-tokens 100
```

```bash
# Convert to GGUF format for deployment
llmbuilder export gguf models/checkpoints/latest.pt -o models/final/model.gguf -q Q8_0
```

```python
import llmbuilder as lb

# Load a preset config and build a model
cfg = lb.load_config(preset="cpu_small")
model = lb.build_model(cfg.model)

# Train (example; see examples/train_tiny.py for a runnable script)
from llmbuilder.data import TextDataset
dataset = TextDataset("./data/clean.txt", block_size=cfg.model.max_seq_length)
results = lb.train_model(model, dataset, cfg.training)

# Generate text
text = lb.generate_text(
    model_path="./checkpoints/model.pt",
    tokenizer_path="./tokenizers",
    prompt="Hello world",
    max_new_tokens=50,
)
print(text)
```

LLMBuilder provides flexible configuration management:
```bash
# List available templates
llmbuilder config templates

# Create a configuration from a template
llmbuilder config create --preset cpu_small -o configs/my_config.json

# Validate your configuration
llmbuilder config validate configs/my_config.json
```

- Python 3.8 or higher
- For PDF OCR Processing: Tesseract OCR
- For GGUF Model Conversion: llama.cpp or compatible tools
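To probe all of the optional dependencies at once from Python (rather than one shell one-liner per package), a small stdlib-only check works; the mapping below uses the packages' import names (e.g. `pymupdf` imports as `fitz`, `beautifulsoup4` as `bs4`) and is an illustrative helper, not part of LLMBuilder's API:

```python
from importlib.util import find_spec

# import name -> feature it unlocks
OPTIONAL_DEPS = {
    "fitz": "PDF text extraction (pymupdf)",
    "pytesseract": "OCR fallback for scanned PDFs",
    "ebooklib": "EPUB ingestion",
    "bs4": "HTML parsing (beautifulsoup4)",
    "sentence_transformers": "semantic deduplication",
}

def check_optional_deps():
    """Return {feature: installed?} for each optional dependency."""
    return {feat: find_spec(mod) is not None for mod, feat in OPTIONAL_DEPS.items()}

for feature, ok in check_optional_deps().items():
    print(("OK      " if ok else "MISSING ") + feature)
```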
Missing Optional Dependencies
```bash
# Check what's installed
python -c "import llmbuilder; print('LLMBuilder installed')"

# Install missing dependencies
pip install pymupdf pytesseract ebooklib beautifulsoup4 lxml sentence-transformers

# Verify specific features
python -c "import pytesseract; print('OCR available')"
python -c "import sentence_transformers; print('Semantic deduplication available')"
```

System Dependencies

```bash
# Tesseract OCR (for PDF processing)
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Verify Tesseract installation
tesseract --version
python -c "import pytesseract; pytesseract.get_tesseract_version()"
```

PDF Processing Problems
```bash
# Enable debug logging
export LLMBUILDER_LOG_LEVEL=DEBUG

# Common fixes:
# 1. Install language packs: sudo apt-get install tesseract-ocr-eng
# 2. Set Tesseract path: export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
# 3. Lower OCR threshold: --ocr-threshold 0.3
```

Memory Issues with Large Datasets
```bash
# Use configuration to optimize memory usage
llmbuilder config from-template cpu_optimized_config -o memory_config.json \
  --override data.ingestion.batch_size=50 \
  --override data.deduplication.batch_size=500 \
  --override data.deduplication.use_gpu_for_embeddings=false

# Process in smaller chunks
llmbuilder data load -i large_dataset/ -o processed.txt --batch-size 25 --workers 2
```

Semantic Deduplication Performance
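The `--override` flags above address keys by dot-notation paths into the nested configuration. Conceptually (this is an illustration of the mechanism, not LLMBuilder's own code), applying one override is a short walk down the config dictionary; a real CLI would additionally parse string values like `"50"` or `"false"` into typed values:

```python
def apply_override(config, dotted_key, value):
    """Set a nested config value addressed by a dot-notation path,
    e.g. ("data.ingestion.batch_size", 50) sets
    config["data"]["ingestion"]["batch_size"] = 50."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create missing sections
    node[leaf] = value
    return config

cfg = {"data": {"ingestion": {"batch_size": 1000}}}
apply_override(cfg, "data.ingestion.batch_size", 50)
apply_override(cfg, "data.deduplication.use_gpu_for_embeddings", False)
print(cfg)
```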
```bash
# GPU issues - disable GPU acceleration
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --no-gpu

# Slow processing - increase batch size
llmbuilder data deduplicate -i dataset.txt -o clean.txt --method semantic --batch-size 2000

# Memory issues - reduce embedding cache
llmbuilder config from-template basic_config -o config.json \
  --override data.deduplication.embedding_cache_size=5000
```

Missing llama.cpp
```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Add to PATH or specify location
export PATH=$PATH:/path/to/llama.cpp

# Alternative: Use Python package
pip install llama-cpp-python

# Test conversion
llmbuilder convert gguf --help
```

Conversion Failures
```bash
# Check available conversion scripts
llmbuilder convert gguf model.pt -o test.gguf --verbose

# Try different quantization levels
llmbuilder convert gguf model.pt -o test.gguf -q F16   # Less compression
llmbuilder convert gguf model.pt -o test.gguf -q Q8_0  # Balanced

# Increase timeout for large models
llmbuilder config from-template basic_config -o config.json \
  --override gguf_conversion.conversion_timeout=7200
```

Validation Errors
```bash
# Validate configuration with detailed output
llmbuilder config validate my_config.json --detailed

# Common fixes:
# 1. Vocab size mismatch - ensure model.vocab_size matches tokenizer_training.vocab_size
# 2. Sequence length issues - ensure data.max_length <= model.max_seq_length
# 3. Invalid quantization level - use: F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
```

Template Issues
```bash
# List available templates
llmbuilder config templates

# Create from working template
llmbuilder config from-template basic_config -o working_config.json

# Validate before use
llmbuilder config validate working_config.json
```
Apache-2.0