"We've always defined ourselves by the ability to overcome the impossible."
TARS is a completely offline multimodal AI agent inspired by the TARS robot from Interstellar. Built with LangGraph orchestration, Phi-4-mini language models, Vosk speech recognition, BLIP vision processing, and intelligent query decomposition.
TARS runs entirely on your local machine, ensuring zero data transmission and complete privacy. The system features advanced subquery decomposition, multi-agent routing, response stitching, and real-time multimodal processing.
- Complete Offline Operation: No internet required after initial setup
- Multimodal Processing: Vision, speech, and text capabilities
- Privacy-First: All processing happens locally
- Intelligent Query Decomposition: Automatically splits complex multi-part queries
- Response Stitching: Combines responses from multiple AI agents
- Real-time Processing: Live audio transcription and vision analysis
- Phi-4-mini-instruct: Local 3.8B-parameter model for sophisticated reasoning
- DistilGPT: Fallback language model for lightweight processing
- Semantic Intent Classification: Deep contextual understanding beyond keyword matching
- Conversation Memory: Maintains context across extended interactions
- Personality Engine: Configurable personality modes (professional, witty, technical)
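
As a rough illustration of how a personality engine and conversation memory like the ones above can fit together, here is a minimal sketch. The names (`PERSONALITY_PROMPTS`, `ConversationMemory`) and prompt wording are hypothetical, not taken from the TARS codebase.

```python
# Illustrative sketch only; names and prompts are hypothetical, not the actual TARS API.
from collections import deque

PERSONALITY_PROMPTS = {
    "professional": "You are TARS. Answer precisely and formally.",
    "witty":        "You are TARS. Humor setting: 75%. Keep answers sharp and dry.",
    "technical":    "You are TARS. Prefer technical depth and concrete detail.",
}

class ConversationMemory:
    """Keeps the last N exchanges so the LLM sees recent context."""
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append({"user": user_msg, "assistant": assistant_msg})

    def as_prompt(self, personality: str = "professional") -> str:
        history = "\n".join(
            f"User: {t['user']}\nTARS: {t['assistant']}" for t in self.turns
        )
        return f"{PERSONALITY_PROMPTS[personality]}\n{history}"
```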
- Vosk Speech Recognition: Offline STT with multiple model support
- Advanced VAD: Hybrid Voice Activity Detection (WebRTC + Silero)
- Real-time Transcription: Live audio processing with managed latency
- Multi-TTS Engines: pyttsx3, gTTS, eSpeak with voice consistency
- Audio Enhancement: Noise suppression, gain control, format conversion
- Buffer Management: Intelligent audio buffering prevents infinite loops
Note: The Vosk model is lightweight and may have transcription accuracy limitations
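
To give a feel for the hybrid VAD idea above (a fast WebRTC gate followed by a Silero confirmation), a minimal sketch might look like the following. Frame sizes, thresholds, and the `torch.hub` loading path are assumptions, not the project's actual settings.

```python
# Hybrid VAD sketch: WebRTC as a fast gate, Silero as confirmation.
# Frame size, threshold, and model loading path are assumptions.
import numpy as np
import torch
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                   # webrtcvad accepts 10/20/30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples

fast_vad = webrtcvad.Vad(2)                     # aggressiveness 0 (lenient) to 3 (strict)
silero, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")  # cached after first download

def is_speech(frame: np.ndarray) -> bool:
    """frame: float32 mono audio in [-1, 1], FRAME_SAMPLES long."""
    pcm16 = (frame * 32767).astype(np.int16).tobytes()
    if not fast_vad.is_speech(pcm16, SAMPLE_RATE):
        return False                            # cheap rejection of obvious silence
    # Recent Silero releases expect 512-sample chunks at 16 kHz, so zero-pad the frame.
    chunk = np.zeros(512, dtype=np.float32)
    chunk[: len(frame)] = frame
    prob = silero(torch.from_numpy(chunk), SAMPLE_RATE).item()
    return prob > 0.5                           # confirmation threshold (assumed)
```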
- BLIP Image Captioning: Advanced image description and analysis
- YOLOv8n Object Detection: Real-time object identification
- Haar Cascade Face Detection: Face detection and counting
- Real-time Camera Integration: Live video processing capabilities
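
To show how these three vision components can compose, here is a hedged sketch using the standard Hugging Face BLIP checkpoint, Ultralytics YOLOv8n, and an OpenCV Haar cascade. Model paths and thresholds are assumptions rather than the project's exact configuration.

```python
# Illustrative vision pipeline sketch (paths and thresholds are assumptions).
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from ultralytics import YOLO

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
detector = YOLO("models/yolov8n.pt")
faces = cv2.CascadeClassifier("models/haarcascade_frontalface.xml")

def describe(frame_bgr):
    """Caption the frame, list detected objects, and count faces."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    inputs = blip_processor(images=Image.fromarray(rgb), return_tensors="pt")
    caption = blip_processor.decode(blip_model.generate(**inputs)[0], skip_special_tokens=True)

    results = detector(frame_bgr, verbose=False)[0]
    objects = [detector.names[int(c)] for c in results.boxes.cls]

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    n_faces = len(faces.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))

    return f"{caption}. Objects: {objects}. Faces: {n_faces}."
```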
- LangGraph Orchestration: Event-driven state management
- Thread Coordination: Managed threading with health monitoring
- Fault Tolerance: Auto-restart, graceful degradation, error recovery
- Performance Monitoring: Real-time metrics and health dashboards
- Modular Design: Clean separation of concerns, easy testing
- Configuration Management: Environment-based settings
```mermaid
graph TD
    UI[User Input] --> Keyboard[Keyboard - Text]
    UI --> Audio[Microphone - Audio]
    Audio --> VAD["VAD & Audio Processor"] --> STT[Speech-to-Text - STT]
    Keyboard --> TextHandler[Text Input Handler]
    STT --> LangGraph[LangGraph State Manager]
    TextHandler --> LangGraph
    LangGraph --> Router[Router Node - Intent Classification]
    Router --> QA[QA Node - LLM Orchestrator]
    Router --> RobotNode[Robot Node - Robot Controller]
    Router --> VisionNode[Vision Node]
    Router --> Speech[Speech Node]
    QA --> LLM[LLM Orchestrator - Local/Fallback]
    LLM --> Persona[Personality Controller]
    LLM --> TextResp[Text Response]
    TextResp --> TTS[TTS Manager]
    RobotNode --> RobotCtrl[Robot Controller]
    RobotNode --> ConvHist[Conversation History]
    RobotCtrl --> TTS
    TTS --> Spoken[Spoken Response]
    Spoken --> User
    VisionNode --> VisionModel[Vision Model - BLIP, Face]
    VisionModel --> TTS
    Speech --> TTS
```
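
The routing shown in the diagram maps naturally onto a LangGraph `StateGraph` with a router node and conditional edges. The sketch below is a simplified illustration with hypothetical node functions and a keyword-based stand-in for intent classification; it is not the project's actual graph definition.

```python
# Simplified LangGraph routing sketch (node functions are hypothetical stand-ins).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class TarsState(TypedDict):
    query: str
    intent: str
    response: str

def router(state: TarsState) -> TarsState:
    # TARS uses semantic intent classification; keyword matching stands in here.
    q = state["query"].lower()
    intent = "vision" if "see" in q else "robot" if "move" in q else "qa"
    return {**state, "intent": intent}

def qa_node(state: TarsState) -> TarsState:
    return {**state, "response": f"LLM answer to: {state['query']}"}

def vision_node(state: TarsState) -> TarsState:
    return {**state, "response": "Description of the current camera frame"}

def robot_node(state: TarsState) -> TarsState:
    return {**state, "response": "Acknowledged. Executing movement command."}

graph = StateGraph(TarsState)
graph.add_node("router", router)
graph.add_node("qa", qa_node)
graph.add_node("vision", vision_node)
graph.add_node("robot", robot_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", lambda s: s["intent"],
                            {"qa": "qa", "vision": "vision", "robot": "robot"})
for node in ("qa", "vision", "robot"):
    graph.add_edge(node, END)

app = graph.compile()
print(app.invoke({"query": "what do you see", "intent": "", "response": ""}))
```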
- CPU: Intel i5 or equivalent (developed and tested on Intel i5)
- RAM: 8GB minimum (16GB+ recommended)
- Storage: 10GB+ free disk space
- Microphone: Required for speech features
- Camera: Optional, for vision features
- Python: 3.8+ (3.10+ recommended)
- Operating System: Windows (some threading components are Windows-specific)
- Dependencies: See `requirements.txt`
```bash
git clone
cd tars_agent
pip install -r requirements.txt
```
Download the following models to the `models/` directory:

- Vosk Model: `vosk-model-en-us-0.22`
- Phi-4-mini-instruct: Download to `models/phi4-mini-instruct/`
- YOLOv8n: Download to `models/yolov8n.pt`
- DistilGPT: Download to `models/distilgpt/`
- Haar Cascade: Download `haarcascade_frontalface.xml`
Configure system settings in `tars/config.py` or use environment variables.
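
If you prefer environment variables over editing `tars/config.py`, a pattern like the one below works; the variable names shown here are examples, not the project's actual keys.

```python
# Example of environment-variable overrides (variable names are illustrative).
import os

MODEL_DIR = os.getenv("TARS_MODEL_DIR", "models")
PERSONALITY = os.getenv("TARS_PERSONALITY", "professional")
ENABLE_SPEECH = os.getenv("TARS_ENABLE_SPEECH", "1") == "1"
ENABLE_VISION = os.getenv("TARS_ENABLE_VISION", "1") == "1"
```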
[Screenshot Placeholder: Installation process and model download]
```bash
python main.py
python main.py --debug                  # Enable debug logging
python main.py --no-speech --no-vision  # Text-only mode
python main.py --personality witty      # Set personality mode
```
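
For reference, flags like these are typically wired up with `argparse`; the sketch below mirrors the options shown above but is not the actual `main.py`.

```python
# Sketch of the command-line interface shown above (not the actual main.py).
import argparse

parser = argparse.ArgumentParser(description="TARS offline multimodal agent")
parser.add_argument("--debug", action="store_true", help="Enable debug logging")
parser.add_argument("--no-speech", action="store_true", help="Disable speech input/output")
parser.add_argument("--no-vision", action="store_true", help="Disable camera and vision models")
parser.add_argument("--personality", default="professional",
                    choices=["professional", "witty", "technical"],
                    help="Personality mode")
args = parser.parse_args()
```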
- `help` - Display comprehensive command reference
- `status` - System diagnostics and performance metrics
- `logs` - Recent system activity and mission logs
- `metrics` - Performance analytics and efficiency data
- `clear` - Clear display
- `quit` - Initiate shutdown sequence
- Natural language queries and conversations
- `start listening` - Activate voice input systems
- `stop listening` - Deactivate voice input
- `what do you see` - Visual analysis and description
```text
what do you see and what is mars
explain about jwst and tell what you see
hello can you explain quantum physics like I am 5

what do you see
what objects do you see
describe the image

start listening
[speak your query]
stop listening
```
```text
tars_agent/
├── data/
│   └── conversations/
│       └── conversations_history.json
├── logs/
│   └── tars.log
├── models/
│   ├── vosk/
│   ├── phi4-mini-instruct/
│   ├── distilgpt/
│   ├── haarcascade_frontalface.xml
│   └── yolov8n.pt
├── tars/
│   ├── components/
│   │   ├── audio/
│   │   ├── robot/
│   │   ├── speech/
│   │   └── vision/
│   ├── graph/
│   │   ├── nodes/
│   │   ├── state.py
│   │   └── utils.py
│   ├── llm/
│   │   ├── history.py
│   │   ├── orchestrator.py
│   │   └── personality.py
│   ├── utils/
│   │   ├── config.py
│   │   ├── data_models.py
│   │   ├── resource_coordinator.py
│   │   └── thread_coordinator.py
│   └── config.py
└── main.py
```
- Query Input: User provides text, speech, or multimodal input
- Intent Classification: System determines query type and complexity
- Query Decomposition: Complex queries are split into subqueries
- Node Routing: Each subquery is routed to appropriate processing node
- Parallel Processing: Vision, speech, and language nodes process simultaneously
- Response Collection: All node responses are gathered
- Response Stitching: LLM combines responses into coherent output
- Output Generation: Final response delivered via text and/or speech
[Screenshot Placeholder: Processing flow visualization]
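
A highly simplified sketch of the decomposition-and-stitching flow described above: split a compound query, fan the subqueries out to handlers, then merge the answers. The splitting heuristic and handler names are placeholders, since the real system performs both steps with the LLM and LangGraph.

```python
# Toy decomposition/stitching sketch; TARS performs both steps with the LLM.
import re
from typing import List

def decompose(query: str) -> List[str]:
    """Split a compound query on coordinating conjunctions (naive placeholder)."""
    parts = re.split(r"\band\b|\bthen\b", query, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

def route(subquery: str) -> str:
    """Stand-in for routing a subquery to the vision or QA node."""
    if "see" in subquery or "image" in subquery:
        return f"[vision] description of the current frame for: {subquery}"
    return f"[qa] LLM answer for: {subquery}"

def stitch(answers: List[str]) -> str:
    """Combine per-subquery answers into one response (TARS uses the LLM here)."""
    return " ".join(answers)

query = "what do you see and what is mars"
print(stitch([route(sq) for sq in decompose(query)]))
```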
TARS includes comprehensive monitoring and logging:
- Real-time Metrics: Response times, success rates, system health
- Detailed Logging: UTF-8 encoded logs with full processing traces
- Health Diagnostics: Component status and performance analytics
- Error Tracking: Anomaly detection and recovery mechanisms
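
The UTF-8 file logging and timing metrics mentioned above can be reproduced with the standard library; the logger name, format, and decorator below are illustrative, not the project's exact implementation.

```python
# UTF-8 file logging plus a simple response-time metric (names are illustrative).
import logging
import time

handler = logging.FileHandler("logs/tars.log", encoding="utf-8")  # logs/ must exist
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("tars")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def timed(fn):
    """Decorator that logs how long a processing step took."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        logger.info("%s took %.2f s", fn.__name__, time.perf_counter() - start)
        return result
    return wrapper
```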
- Vosk Accuracy: Lightweight model may have transcription limitations
- Windows-Specific: Some threading components optimized for Windows
- Resource Intensive: Requires substantial RAM for optimal performance
- Robot Control: Currently in development, limited functionality
- Model Dependencies: Requires manual model downloads
- Code Organization: The codebase has been refactored extensively and remains more complex than ideal
This project represents extensive learning and experimentation in multimodal AI systems. The codebase has been refactored numerous times as we explored different approaches to:
- Multimodal AI integration
- Real-time audio processing
- Vision-language model coordination
- Thread management and system orchestration
- Performance optimization for CPU-only inference
The codebase is not for everyone - it reflects a journey of discovery and learning rather than a polished commercial product.
- Audio Issues: Check microphone permissions and audio device configuration
- Model Loading: Ensure all models are downloaded to correct directories
- Memory Issues: Increase available RAM or reduce model sizes
- Threading Errors: Verify Windows compatibility for threading components
- Live Audio Streaming: Real-time voice conversation with minimal latency
- Live Vision Processing: Continuous video analysis and commentary
- Real-time Multimodal Communication: Seamless integration of speech, vision, and text in real-time
The vision of true real-time multimodal communication remains challenging due to:
- Processing Overhead: Running Phi-4-mini (3.8B parameters), BLIP, YOLO, and Vosk simultaneously creates significant computational load
- Sequential Processing: Current architecture processes queries sequentially rather than in parallel streams
- Hardware Limitations: CPU-only inference on consumer hardware introduces unavoidable latency
- Memory Bandwidth: Large model switching and context management creates processing bottlenecks
- Model quantization and optimization for faster inference
- Parallel processing architecture redesign
- Streaming response generation
- Hardware acceleration integration (GPU support)
- Lightweight model alternatives for real-time components
This is a personal research project. The code is provided as-is for educational and research purposes.
- Inspired by TARS from Interstellar
- Built with LangGraph, Vosk, BLIP, and other open-source technologies
- Developed as a learning project in multimodal AI systems
Version: 1.0.0
Author: Akshanth Sai K
Project Type: Personal Development
"See you on the other side." - TARS