AI-powered semantic code search for your codebase. Find code by meaning, not just keywords.
- Semantic filtering temporarily disabled:
--type
,--scope
,--features
flags don't function with current chunking system - Chunking system changed: All files now use model-aware fixed-size chunking instead of previous approach
- Re-indexing recommended: Existing indexes should be rebuilt for optimal chunk sizes
- Semantic Search - Find code by meaning using vector embeddings and fixed-size chunking
- Multiple Providers - Local (Ollama) or cloud (VoyageAI) embeddings
- Smart Indexing - Incremental updates, git-aware, multi-project support
- Search Filtering - Filter by programming language, file paths, and similarity scores
- Multi-Language Support - Universal text processing for Python, JavaScript, TypeScript, Java, C#, Go, Kotlin, Groovy, Pascal/Delphi, SQL, C, C++, Rust, Swift, Ruby, Lua, HTML, CSS, YAML, XML
- CLI Interface - Simple commands with progress indicators
- AI Analysis - Integrates with Claude CLI for code analysis with semantic search
- Privacy Options - Full local processing or cloud for better performance
# Install the package
pipx install git+https://github.com/jsbattig/code-indexer.git
# Setup global registry (standalone command - requires sudo)
cidx setup-global-registry
python3 -m venv code-indexer-env
source code-indexer-env/bin/activate
pip install git+https://github.com/jsbattig/code-indexer.git
# Setup global registry (standalone command - requires sudo)
code-indexer setup-global-registry
- Python 3.9+
- Container Engine: Docker or Podman (for containerized services)
- Memory: 4GB+ RAM recommended
- Disk Space: ~500MB for base containers + vector data storage
Code Indexer supports both Docker and Podman container engines:
- Auto-detection: Automatically detects and uses available container engine
- Podman preferred: Uses Podman by default when available (better rootless support)
- Force Docker: Use
--force-docker
flag to force Docker usage - Rootless containers: Fully supports rootless container execution
Global Port Registry: The setup-global-registry
command configures system-wide port coordination at /var/lib/code-indexer/port-registry
, preventing conflicts when running multiple code-indexer projects simultaneously. This is required for proper multi-project support.
# 1. Setup global registry (once per system)
cidx setup-global-registry
# 2. Navigate to your project
cd /path/to/your/project
# 3. Start services and index code
code-indexer start # Auto-creates config if needed
code-indexer index # Smart incremental indexing
# 4. Search semantically
code-indexer query "authentication logic"
# 5. Search with filtering
code-indexer query "user" --language python --min-score 0.7
code-indexer query "save" --path "*/models/*" --limit 20
# 6. AI-powered analysis (requires Claude CLI)
code-indexer claude "How does auth work in this app?"
# Optional: Initialize with custom settings first
code-indexer init --embedding-provider voyage-ai --max-file-size 2000000
code-indexer start
code-indexer index
# Global system setup
code-indexer setup-global-registry # Setup global port registry (requires sudo)
code-indexer setup-global-registry --test-access --quiet # Test registry access
# Project initialization
code-indexer init # Initialize with default settings
code-indexer init --embedding-provider voyage-ai # Use VoyageAI instead of Ollama
code-indexer init --max-file-size 2000000 # Set 2MB file size limit
code-indexer init --setup-global-registry # Init + setup registry (legacy)
code-indexer init --create-override-file # Create .code-indexer-override.yaml
# Service lifecycle
code-indexer start # Start services (smart detection)
code-indexer start --force-docker # Force Docker instead of Podman
code-indexer start --force-recreate # Force recreate containers
code-indexer start --quiet # Silent mode
code-indexer start -m all-minilm-l6-v2 # Different Ollama model
code-indexer status # Check service status
code-indexer status --force-docker # Check Docker status specifically
code-indexer stop # Stop services (preserve data)
code-indexer stop --force-docker # Stop Docker services specifically
# Standard indexing
code-indexer index # Smart incremental indexing
code-indexer index --clear # Force full reindex
code-indexer index --reconcile # Reconcile disk vs database
code-indexer index --detect-deletions # Handle deleted files
code-indexer index --batch-size 25 # Custom batch size
code-indexer index --files-count-to-process 100 # Limit file count
code-indexer index --threads 4 # Custom thread count
# Real-time monitoring
code-indexer watch # Git-aware file watching
code-indexer watch --debounce 5.0 # Custom debounce delay
code-indexer watch --initial-sync # Full sync before watching
# Basic search
code-indexer query "search terms" # Semantic search
code-indexer query "auth" --limit 20 # More results
code-indexer query "function" --quiet # Only results, no headers
# Advanced filtering
code-indexer query "user" --language python # Filter by language
code-indexer query "save" --path "*/models/*" # Filter by path pattern
code-indexer query "function" --min-score 0.7 # Higher confidence matches
code-indexer query "database" --limit 15 # More results
code-indexer query "test" --min-score 0.8 # High-confidence matches
# Short alias
cidx query "search terms" # Same as code-indexer query
# Standard analysis
code-indexer claude "How does auth work?" # AI-powered analysis
code-indexer claude "Debug this" --limit 15 # Custom search limit
code-indexer claude "Analyze" --context-lines 200 # More context
code-indexer claude "Quick check" --quiet # Minimal output
code-indexer claude "Review code" --no-stream # No streaming output
# Advanced options
code-indexer claude "Test" --include-file-list # Include project file list
code-indexer claude "Legacy" --rag-first # Use legacy RAG-first approach
# Debugging
code-indexer claude "Test" --dry-run-show-claude-prompt # Show prompt without execution
code-indexer claude "Analyze" --show-claude-plan # Show tool usage tracking
# Quick cleanup (recommended)
code-indexer clean-data # Clear current project data
code-indexer clean-data --all-projects # Clear all projects data
code-indexer clean-data --force-docker # Use Docker for cleanup
# Complete removal
code-indexer uninstall # Remove current project completely
code-indexer uninstall --force-docker # Use Docker for removal
code-indexer uninstall --wipe-all # DANGEROUS: Complete system wipe
# Migration and maintenance
code-indexer clean-legacy # Migrate from legacy containers
code-indexer optimize # Optimize vector database
code-indexer force-flush # Force flush to disk (deprecated)
code-indexer force-flush --collection mycoll # Flush specific collection
# Configuration repair
code-indexer fix-config # Fix corrupted configuration
code-indexer fix-config --dry-run # Preview fixes
code-indexer fix-config --verbose # Detailed fix information
code-indexer fix-config --force # Apply without confirmation
# Claude integration setup
code-indexer set-claude-prompt # Set CIDX instructions in project CLAUDE.md
code-indexer set-claude-prompt --user-prompt # Set in global ~/.claude/CLAUDE.md
# Available on most commands
--force-docker # Force Docker instead of Podman
--verbose, -v # Verbose output
--config, -c PATH # Custom config file path
--path, -p PATH # Custom project directory
# Special global options
--use-cidx-prompt # Generate AI integration prompt
--format FORMAT # Output format (text, markdown, compact, comprehensive)
--output FILE # Save output to file
--compact # Generate compact prompt
cidx
βcode-indexer
(shorter alias for all commands)- Use
cidx
for faster typing:cidx start
,cidx query "search"
, etc.
Ollama (Default - Local)
code-indexer init --embedding-provider ollama
VoyageAI (Cloud)
export VOYAGE_API_KEY="your-key"
code-indexer init --embedding-provider voyage-ai
Performance Status Indicators (VoyageAI only)
During indexing, VoyageAI shows real-time performance status in the progress bar:
- β‘ Full speed - Running at maximum throughput
- π΄ Server throttling - VoyageAI API rate limits detected
Example: 15/100 files (15%) | 8.3 emb/s β‘ | 8 threads | main.py
The system runs at full speed by default. If rate limits are encountered, an error is displayed asking to reduce parallel_requests.
Configuration is stored in .code-indexer/config.json
:
file_extensions
: File types to indexexclude_dirs
: Directories to skipembedding_provider
: ollama or voyage-ai (determines chunk size automatically)max_file_size
: Maximum file size in bytes (default: 1MB)chunk_size
: Legacy setting (ignored, chunker uses model-aware sizing)chunk_overlap
: Legacy setting (ignored, chunker uses 15% of chunk size)
Code Indexer uses a model-aware fixed-size chunking approach optimized for different embedding models:
How it works:
- Model-optimized chunk sizes: Automatically selects optimal chunk size based on embedding model capabilities
- Consistent overlap: 15% overlap between adjacent chunks (across all models)
- Simple arithmetic: Next chunk starts at (chunk_size - overlap_size) from current start position
- Token optimization: Uses larger chunk sizes for models with higher token capacity
Model-Specific Chunk Sizes:
- voyage-code-3: 4,096 characters (leverages 32K token capacity)
- voyage-code-2: 4,096 characters (leverages 16K token capacity)
- voyage-large-2: 4,096 characters (leverages large context capacity)
- nomic-embed-text: 2,048 characters (512 tokens - Ollama limitation)
- Unknown models: 1,000 characters (conservative fallback)
Example chunking (voyage-code-3):
Chunk 1: characters 0-4095 (4096 chars)
Chunk 2: characters 3482-7577 (4096 chars, overlaps 614 chars with Chunk 1)
Chunk 3: characters 6964-11059 (4096 chars, overlaps 614 chars with Chunk 2)
Benefits:
- Model optimization: Uses larger chunks for high-capacity models (VoyageAI: 4096, Ollama: 2048)
- Better context: More complete code sections per chunk
- Efficiency: Fewer total chunks reduce storage requirements
- Model utilization: Takes advantage of each model's token capacity
Code Indexer provides fixed-size chunking with intelligent text processing for all supported languages:
Language | File Extensions | Text Processing |
---|---|---|
Python | .py |
Fixed-size chunks with consistent overlap |
JavaScript | .js , .jsx |
Fixed-size chunks with consistent overlap |
TypeScript | .ts , .tsx |
Fixed-size chunks with consistent overlap |
Java | .java |
Fixed-size chunks with consistent overlap |
C# | .cs |
Fixed-size chunks with consistent overlap |
Go | .go |
Fixed-size chunks with consistent overlap |
Kotlin | .kt , .kts |
Fixed-size chunks with consistent overlap |
Groovy | .groovy , .gradle , .gvy , .gy |
Fixed-size chunks with consistent overlap |
Pascal/Delphi | .pas , .pp , .dpr , .dpk , .inc |
Fixed-size chunks with consistent overlap |
SQL | .sql |
Fixed-size chunks with consistent overlap |
C | .c , .h |
Fixed-size chunks with consistent overlap |
C++ | .cpp , .cc , .cxx , .hpp , .hxx |
Fixed-size chunks with consistent overlap |
Rust | .rs |
Fixed-size chunks with consistent overlap |
Swift | .swift |
Fixed-size chunks with consistent overlap |
Ruby | .rb , .rake , .gemspec |
Fixed-size chunks with consistent overlap |
Lua | .lua |
Fixed-size chunks with consistent overlap |
HTML | .html , .htm |
Fixed-size chunks with consistent overlap |
CSS | .css , .scss , .sass |
Fixed-size chunks with consistent overlap |
YAML | .yaml , .yml |
Fixed-size chunks with consistent overlap |
XML | .xml , .xsd , .xsl , .xslt |
Fixed-size chunks with consistent overlap |
Model-Aware Chunking Benefits:
- Optimized chunk sizes: Each model gets its optimal chunk size (voyage-code-3: 4096, nomic-embed-text: 2048)
- Consistent overlap: 15% overlap between adjacent chunks (across all models)
- Fast processing: No complex parsing overhead, pure arithmetic operations
- Complete search results: Full code sections without truncation
- Model efficiency: Leverages each embedding model's capabilities
Code Indexer uses the following containerized services:
- Qdrant: Vector database for storing code embeddings
- Ollama (optional): Local language model service for embeddings when not using VoyageAI
- Data Cleaner: Containerized service for cleaning root-owned files during data removal operations
Code Indexer uses a comprehensive test infrastructure organized into three main categories for optimal maintainability and execution speed:
tests/
βββ unit/ # Fast unit tests organized by functionality
β βββ parsers/ # Language parser tests
β βββ chunking/ # Text chunking logic tests
β βββ config/ # Configuration management tests
β βββ services/ # Service layer unit tests
β βββ cli/ # CLI unit tests
β βββ git/ # Git operations tests
β βββ infrastructure/ # Core infrastructure tests
β βββ bugfixes/ # Bug fix regression tests
βββ integration/ # Integration tests with real services
β βββ performance/ # Performance and throughput tests
β βββ docker/ # Docker integration tests
β βββ multiproject/ # Multi-project workflow tests
β βββ services/ # Service integration tests
β βββ cli/ # CLI integration tests
βββ e2e/ # End-to-end workflow tests
β βββ git_workflows/ # Git workflow e2e tests
β βββ payload_indexes/ # Payload indexing e2e tests
β βββ providers/ # Provider switching tests
β βββ semantic_search/ # Search capability tests
β βββ claude_integration/ # Claude integration tests
β βββ display/ # UI and display tests
β βββ infrastructure/ # Infrastructure e2e tests
βββ shared/ # Shared test utilities and fixtures
βββ fixtures/ # Test data and fixtures
Tests are categorized by container requirements using pytest markers:
@pytest.mark.shared_safe
- Can use either Docker or Podman containers, data-only operations@pytest.mark.docker_only
- Requires Docker-specific features or configurations@pytest.mark.podman_only
- Requires Podman-specific features or rootless containers@pytest.mark.destructive
- Manipulates containers directly, requires isolation
Code Indexer supports both Docker and Podman with intelligent container management:
- Production containers:
code-indexer-ollama
,code-indexer-qdrant
(ports 11434, 6333) - Test containers:
code-indexer-test-*-docker/podman
(ports 50000+) - Automatic isolation: Test containers use different names, ports, and networks
- Safety measures: Tests never interfere with production container setups
# Run fast unit tests and integration tests
./ci-github.sh
# Run specific test categories
pytest tests/unit/ -v # Unit tests only
pytest tests/integration/ -v # Integration tests only
pytest tests/e2e/ -v # E2E tests only
# Run all tests including slow e2e tests
./full-automation.sh
# Run tests by container category
pytest -m shared_safe -v # Shared-safe tests only
pytest -m docker_only -v # Docker-specific tests only
pytest -m "not destructive" -v # Non-destructive tests only
git clone https://github.com/jsbattig/code-indexer.git
cd code-indexer
pip install -e ".[dev]"
# Run quick tests to verify setup
./ci-github.sh
# Optional: Run comprehensive tests (requires Docker/Podman)
./full-automation.sh
- Service persistence: Tests keep services running between executions for speed
- Prerequisite validation: Tests ensure required services are available at startup
- Clean data isolation: Uses
clean-data
for test isolation, not service shutdown - Automatic categorization: Tests are automatically categorized by container requirements
- Shared fixtures: Common test utilities available via
tests.shared.test_infrastructure
# Run linting (includes ruff, black, mypy)
./lint.sh
# Check for import issues and syntax errors
python -m py_compile src/code_indexer/**/*.py
MIT License
Issues and pull requests welcome!