Privacy-first intelligent document search powered by client-side AI
A sophisticated Progressive Web Application (PWA) that enables users to upload, process, and search through their documents using advanced AI embeddings, all running entirely in the browser for maximum privacy and security.
- 100% client-side processing - your documents never leave your device
- No server uploads, no cloud storage, no tracking
- Data stored locally in browser using IndexedDB
- Semantic search using Transformers.js with all-MiniLM-L6-v2 embeddings
- Context-aware results with relevance scoring and snippets
- Intelligent text chunking with configurable overlap
- PDF, DOCX, Markdown, and plain text files
- Robust text extraction with error handling
- Automatic file type detection
- Web Workers for non-blocking AI processing
- LanceDB vector storage for lightning-fast similarity search
- Progressive loading with real-time progress tracking
- Responsive design with Tailwind CSS v4
- Clean, modern interface with intuitive navigation
- Drag-and-drop file uploads with visual feedback
- Beautiful search results with highlighting and context
- Accessibility-first design (WCAG 2.1 AA compliant)
- React 18.3 with TypeScript 5.3 (strict mode)
- Vite 5.1 for lightning-fast development
- Tailwind CSS v4 with modern design system
- Zustand for efficient state management
- Transformers.js (all-MiniLM-L6-v2) for client-side embeddings
- LanceDB for high-performance vector storage
- Custom semantic chunking with overlap optimization
- Cosine similarity search with relevance scoring
- Vitest with comprehensive test suite (34 tests, 100% pass rate)
- TypeScript strict mode with zero compilation errors
- ESLint + Prettier for code quality
- ≥85% test coverage on critical paths
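The cosine similarity search with relevance scoring listed above can be illustrated with a minimal sketch. The `StoredChunk` and `ScoredChunk` types and the `rankChunks` function are hypothetical names chosen for this example; they are not taken from the project's source, and the actual search path (which goes through LanceDB) may differ.

```typescript
// Illustrative sketch only: all names here are assumptions, not the project's API.
interface StoredChunk {
  id: string;
  text: string;
  vector: number[]; // embedding, e.g. 384 values for all-MiniLM-L6-v2
}

interface ScoredChunk {
  id: string;
  text: string;
  score: number; // cosine similarity, in the range [-1, 1]
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query vector and keep the topK best matches.
function rankChunks(query: number[], chunks: StoredChunk[], topK = 5): ScoredChunk[] {
  return chunks
    .map(({ id, text, vector }) => ({ id, text, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A brute-force scan like this is linear in the number of chunks, which is usually acceptable for a personal document collection; a vector store can replace it when the collection grows.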
src/
├── components/ # Modular React components
│   ├── AppHeader.tsx # Navigation and branding
│   ├── DocumentUpload.tsx # File upload interface
│   ├── SearchInterface.tsx # Search input and filters
│   ├── SearchResults.tsx # Results display with highlighting
│   └── ProcessingProgress.tsx # Real-time processing status
├── lib/ # Core business logic
│   ├── embeddingService.ts # AI embedding generation
│   ├── vectorStorageService.ts # LanceDB vector operations
│   ├── textExtractor.ts # Multi-format text extraction
│   ├── textChunker.ts # Intelligent text segmentation
│   └── knowledgeSearchService.ts # Main search orchestration
├── workers/ # Background processing
│   └── embeddingWorker.ts # Web Worker for AI computations
└── __tests__/ # Comprehensive test suite
    ├── mockServices.test.ts # Core functionality tests
    ├── textChunker.test.ts # Text processing tests
    └── embeddingService.test.ts # AI integration tests
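To make the role of `textChunker.ts` more concrete, here is a minimal sketch of word-based chunking with overlap. It reuses the parameter names from the chunking configuration in this README (`maxWordsPerChunk`, `overlapWords`, `minChunkWords`), but the `chunkWords` function, the `ChunkConfig` type, and the choice to fold a too-short tail into the previous chunk are assumptions made for illustration, not the project's actual implementation.

```typescript
// Illustrative sketch only: chunkWords and ChunkConfig are hypothetical names.
interface ChunkConfig {
  maxWordsPerChunk: number; // Maximum words per chunk
  overlapWords: number; // Words shared between consecutive chunks
  minChunkWords: number; // Smallest tail allowed to stand as its own chunk
}

const DEFAULT_CONFIG: ChunkConfig = {
  maxWordsPerChunk: 500,
  overlapWords: 50,
  minChunkWords: 50,
};

function chunkWords(text: string, config: ChunkConfig = DEFAULT_CONFIG): string[] {
  const { maxWordsPerChunk, overlapWords, minChunkWords } = config;
  if (overlapWords >= maxWordsPerChunk) {
    throw new Error('overlapWords must be smaller than maxWordsPerChunk');
  }

  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = maxWordsPerChunk - overlapWords; // how far each chunk start advances
  const chunks: string[] = [];

  for (let start = 0; start < words.length; start += step) {
    const current = words.slice(start, start + maxWordsPerChunk);

    if (current.length < minChunkWords && chunks.length > 0) {
      // Assumed policy: a tail too short to stand alone is folded into the
      // previous chunk. Only words the previous chunk has not seen are added.
      const unseen = words.slice(start + overlapWords);
      chunks[chunks.length - 1] += ' ' + unseen.join(' ');
      break;
    }

    chunks.push(current.join(' '));
    if (start + maxWordsPerChunk >= words.length) break; // reached the end of the text
  }

  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some words twice.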
- Node.js 18+ (project uses 22.8.0)
- Modern browser with WebAssembly support
# Clone the repository
git clone https://github.com/Poolchaos/local-knowledge-search.git
cd local-knowledge-search
# Install dependencies
npm install
# Start development server
npm run dev
# Run comprehensive test suite
npm run test
# Run tests with coverage
npm run test:coverage
# Watch mode for development
npm run test:watch
# Create optimized production build
npm run build
# Preview production build locally
npm run preview
The project uses all-MiniLM-L6-v2 by default. To use a different model:
// src/lib/embeddingService.ts
const MODEL_NAME = 'Xenova/your-preferred-model';
Configure chunking parameters in textChunker.ts:
const DEFAULT_CONFIG = {
  maxWordsPerChunk: 500, // Maximum words per chunk
  overlapWords: 50, // Overlap between chunks
  minChunkWords: 50 // Minimum viable chunk size
};
LanceDB configuration in vectorStorageService.ts:
const VECTOR_DIMENSIONS = 384; // Match your embedding model
const TABLE_NAME = 'document_embeddings';
- Search through research papers and academic documents
- Find relevant passages across multiple sources
- Organize and query personal knowledge base
- Navigate large technical documentation sets
- Search meeting notes and project documents
- Find relevant information across contracts and reports
- Search through journal entries and notes
- Find information across diverse document types
- Build your personal search engine
- Share knowledge bases without privacy concerns
- Search through team documents locally
- Maintain document confidentiality
- No data transmission: All processing happens in your browser
- Local storage only: Documents stored in IndexedDB
- No tracking: Zero analytics or telemetry
- No accounts: No registration or authentication required
- Content Security Policy (CSP) headers
- Secure handling of file uploads
- Input validation and sanitization
- Error boundaries for graceful failures
- Sub-second search for typical document collections
- Optimized embeddings with 384-dimension vectors
- Efficient chunking with smart overlap handling
- Progressive loading for large documents
- Moderate memory footprint (~100-300MB for typical use)
- Efficient CPU usage with Web Worker isolation
- Optimized bundle size with code splitting
- Battery-friendly background processing
- 34 tests across 5 test files
- Unit tests for core algorithms
- Integration tests for user workflows
- Mock services for reliable, fast testing
- TypeScript strict mode with zero errors
- ESLint with no warnings on new code
- All tests must pass before commits
- Performance benchmarks for search operations
- Core document processing and search
- Multi-format file support
- Comprehensive test suite
- Professional UI/UX
- Advanced search filters (date, file type, size)
- Document similarity recommendations
- Export search results and summaries
- Custom embedding model selection
- OCR support for scanned documents
- Audio transcription and search
- Collaborative sharing (still privacy-first)
- Browser extension for web page search
- Plugin architecture for custom processors
- Integration with popular note-taking apps
- Mobile-optimized PWA features
- Offline-first synchronization
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Implement changes with tests
- Ensure all tests pass (npm run test)
- Commit using semantic format (feat: add amazing feature)
- Push and create Pull Request
- TypeScript strict mode required
- Comprehensive tests for new features
- ESLint and Prettier compliance
- Semantic commit messages
- Documentation updates
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for the incredible Transformers.js library
- LanceDB team for high-performance vector storage
- React and Vite communities for amazing developer tools
- Tailwind CSS for the beautiful design system
Built with ❤️ for privacy-conscious knowledge workers
Experience the future of document search: private, powerful, and completely under your control.