TARS - Offline Multimodal AI System

"We've always defined ourselves by the ability to overcome the impossible."

TARS is a completely offline multimodal AI agent inspired by the TARS robot from Interstellar. Built with LangGraph orchestration, Phi-4-mini language models, Vosk speech recognition, BLIP vision processing, and intelligent query decomposition.


Overview

TARS runs entirely on your local machine, ensuring zero data transmission and complete privacy. The system features advanced subquery decomposition, multi-agent routing, response stitching, and real-time multimodal processing.

Key Features

  • Complete Offline Operation: No internet required after initial setup
  • Multimodal Processing: Vision, speech, and text capabilities
  • Privacy-First: All processing happens locally
  • Intelligent Query Decomposition: Automatically splits complex multi-part queries
  • Response Stitching: Combines responses from multiple AI agents
  • Real-time Processing: Live audio transcription and vision analysis

Technical Architecture

Core Components

Language Processing

  • Phi-4-mini-instruct: Local 3.8B-parameter model for sophisticated reasoning
  • DistilGPT: Fallback language model for lightweight processing
  • Semantic Intent Classification: Deep contextual understanding beyond keyword matching
  • Conversation Memory: Maintains context across extended interactions
  • Personality Engine: Configurable personality modes (professional, witty, technical)
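
The snippet below is a minimal sketch of how a local Phi-4-mini-instruct checkpoint in models/phi4-mini-instruct/ could be loaded for CPU inference with Hugging Face Transformers and steered with a personality system prompt. The prompt wording and generation settings are assumptions; the repository's LLM orchestrator may do this differently.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "models/phi4-mini-instruct"  # path from the repository layout

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)  # CPU-only inference by default

# Personality modes map to system prompts (illustrative wording)
messages = [
    {"role": "system", "content": "You are TARS. Personality mode: witty. Keep answers concise."},
    {"role": "user", "content": "Explain quantum physics like I am 5."},
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))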

Speech Processing

  • Vosk Speech Recognition: Offline STT with multiple model support
  • Advanced VAD: Hybrid Voice Activity Detection (WebRTC + Silero)
  • Real-time Transcription: Live audio processing with managed latency
  • Multi-TTS Engines: pyttsx3, gTTS, eSpeak with voice consistency
  • Audio Enhancement: Noise suppression, gain control, format conversion
  • Buffer Management: Intelligent audio buffering prevents infinite loops

Note: The Vosk model is lightweight and may have transcription accuracy limitations
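
As a rough illustration of the offline STT path (not the repository's actual audio pipeline), the sketch below transcribes a 16 kHz mono WAV file with the Vosk model stored under models/vosk/; the real system feeds VAD-gated microphone buffers instead of a file.

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("models/vosk")                  # Vosk model directory from the repository layout
with wave.open("sample.wav", "rb") as wf:     # assumed input: 16 kHz, mono, 16-bit PCM
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        rec.AcceptWaveform(chunk)             # feed raw PCM frames to the recognizer
    print(json.loads(rec.FinalResult())["text"])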

Vision Processing

  • BLIP Image Captioning: Advanced image description and analysis
  • YOLOv8n Object Detection: Real-time object identification
  • Haar Cascade Face Detection: Face detection and counting
  • Real-time Camera Integration: Live video processing capabilities
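
For orientation, here is a hedged sketch of how these three vision components are typically invoked (BLIP via Transformers, YOLOv8n via Ultralytics, the Haar cascade via OpenCV). The BLIP model ID and input file name are assumptions; the repository's vision components may wrap these calls differently.

import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from ultralytics import YOLO

image_path = "frame.jpg"                       # illustrative input frame

# BLIP image captioning (model ID assumed; a local copy may be used instead)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(images=Image.open(image_path).convert("RGB"), return_tensors="pt")
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
print("Caption:", processor.decode(caption_ids[0], skip_special_tokens=True))

# YOLOv8n object detection
detector = YOLO("models/yolov8n.pt")
boxes = detector(image_path)[0].boxes
print("Objects:", [detector.names[int(b.cls)] for b in boxes])

# Haar cascade face counting
cascade = cv2.CascadeClassifier("models/haarcascade_frontalface.xml")
gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
print("Faces:", len(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)))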

System Orchestration

  • LangGraph Orchestration: Event-driven state management
  • Thread Coordination: Managed threading with health monitoring
  • Fault Tolerance: Auto-restart, graceful degradation, error recovery
  • Performance Monitoring: Real-time metrics and health dashboards
  • Modular Design: Clean separation of concerns, easy testing
  • Configuration Management: Environment-based settings

graph TD
    UI[User Input]
    UI --> Keyboard[Keyboard - Text]
    UI --> Audio[Microphone - Audio]
    Audio --> VAD[VAD & Audio Processor] --> STT[Speech-to-Text - STT]
    Keyboard --> TextHandler[Text Input Handler]

    STT --> LangGraph[LangGraph State Manager]

    TextHandler --> LangGraph

    LangGraph --> Router[Router Node - Intent Classification]

    Router --> QA[QA Node - LLM Orchestrator]
    Router --> RobotCtrl[Robot Node - Robot Controller]
    Router --> VisionNode[Vision Node]
    Router --> Speech[Speech Node]

    QA --> LLM[LLM Orchestrator - Local/Fallback]
    LLM --> Persona[Personality Controller]
    LLM --> TextResp[Text Response]
    TextResp --> TTS[TTS Manager]

    RobotCtrl --> RobotController[Robot Controller]
    RobotCtrl --> ConvHist[Conversation History]
    RobotController --> TTS
    TTS --> Spoken[Spoken Response]
    Spoken --> User[User]

    VisionNode --> VisionModel[Vision Model - BLIP, Face]
    VisionModel --> TTS

    Speech --> TTS
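
A minimal, self-contained sketch of the routing pattern in the diagram above, using LangGraph's StateGraph. The node names and the keyword-based router are simplified stand-ins for the repository's semantic intent classifier and full node set.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class TarsState(TypedDict, total=False):
    query: str
    intent: str
    response: str

def router(state: TarsState) -> dict:
    # naive keyword routing in place of the semantic intent classifier
    return {"intent": "vision" if "see" in state["query"].lower() else "qa"}

def qa_node(state: TarsState) -> dict:
    return {"response": f"(LLM) answer to: {state['query']}"}

def vision_node(state: TarsState) -> dict:
    return {"response": "(BLIP) description of the current camera frame"}

graph = StateGraph(TarsState)
graph.add_node("router", router)
graph.add_node("qa", qa_node)
graph.add_node("vision", vision_node)
graph.set_entry_point("router")
graph.add_conditional_edges("router", lambda s: s["intent"], {"qa": "qa", "vision": "vision"})
graph.add_edge("qa", END)
graph.add_edge("vision", END)

app = graph.compile()
print(app.invoke({"query": "what do you see"})["response"])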

System Requirements

Hardware Requirements

  • CPU: Intel i5 or equivalent (developed and tested on Intel i5)
  • RAM: 8GB minimum (16GB+ recommended)
  • Storage: 10GB+ free disk space
  • Microphone: Required for speech features
  • Camera: Optional, for vision features

Software Requirements

  • Python: 3.8+ (3.10+ recommended)
  • Operating System: Windows (some threading components are Windows-specific)
  • Dependencies: See requirements.txt

Installation

1. Clone Repository

git clone https://github.com/akshanthsaik/TARS-Offline-Multimodal-AI-System.git tars_agent
cd tars_agent

2. Install Dependencies

pip install -r requirements.txt

3. Download Models

Download the following models to the models/ directory:

  • Vosk Model: vosk-model-en-us-0.22
  • Phi-4-mini-instruct: Download to models/phi4-mini-instruct/
  • YOLOv8n: Download to models/yolov8n.pt
  • DistilGPT: Download to models/distilgpt/
  • Haar Cascade: Download haarcascade_frontalface.xml

4. Configuration

Configure system settings in tars/config.py or use environment variables.
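
As an illustration of environment-based settings, the sketch below uses hypothetical variable names; the authoritative list of options lives in tars/config.py.

import os
from dataclasses import dataclass

@dataclass
class TarsConfig:
    # Hypothetical environment variables with defaults; see tars/config.py for the real settings
    model_dir: str = os.getenv("TARS_MODEL_DIR", "models")
    personality: str = os.getenv("TARS_PERSONALITY", "professional")
    enable_speech: bool = os.getenv("TARS_ENABLE_SPEECH", "1") == "1"
    enable_vision: bool = os.getenv("TARS_ENABLE_VISION", "1") == "1"

config = TarsConfig()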

[Screenshot Placeholder: Installation process and model download]

Usage

Starting TARS

python main.py

Command Line Options

python main.py --debug                    # Enable debug logging
python main.py --no-speech --no-vision   # Text-only mode
python main.py --personality witty       # Set personality mode
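
These flags could be parsed with a few lines of argparse; the sketch below shows plausible parsing, not the exact code in main.py.

import argparse

parser = argparse.ArgumentParser(description="TARS - Offline Multimodal AI System")
parser.add_argument("--debug", action="store_true", help="enable debug logging")
parser.add_argument("--no-speech", action="store_true", help="disable speech input and TTS output")
parser.add_argument("--no-vision", action="store_true", help="disable camera and vision models")
parser.add_argument("--personality", default="professional",
                    choices=["professional", "witty", "technical"], help="personality mode")
args = parser.parse_args()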

Interactive Commands

System Commands

  • help - Display comprehensive command reference
  • status - System diagnostics and performance metrics
  • logs - Recent system activity and mission logs
  • metrics - Performance analytics and efficiency data
  • clear - Clear display
  • quit - Initiate shutdown sequence

[Screenshot Placeholder: System command outputs]

AI Interactions

  • Natural language queries and conversations
  • start listening - Activate voice input systems
  • stop listening - Deactivate voice input
  • what do you see - Visual analysis and description

Example Use Cases

Multimodal Queries

what do you see and what is mars
explain about jwst and tell what you see
hello can you explain quantum physics like I am 5

Vision Processing

what do you see
what objects do you see
describe the image

Speech Interaction

start listening
[speak your query]
stop listening

Multimodal Queries Examples

[Screenshot Placeholder: Multimodal query examples]

Speech Interaction Examples

[Screenshot Placeholder: Speech interaction examples]

File Structure

tars_agent/
├── data/
│   └── conversations/
│       └── conversations_history.json
├── logs/
│   └── tars.log
├── models/
│   ├── vosk/
│   ├── phi4-mini-instruct/
│   ├── distilgpt/
│   ├── haarcascade_frontalface.xml
│   └── yolov8n.pt
├── tars/
│   ├── components/
│   │   ├── audio/
│   │   ├── robot/
│   │   ├── speech/
│   │   └── vision/
│   ├── graph/
│   │   ├── nodes/
│   │   ├── state.py
│   │   └── utils.py
│   ├── llm/
│   │   ├── history.py
│   │   ├── orchestrator.py
│   │   └── personality.py
│   ├── utils/
│   │   ├── config.py
│   │   ├── data_models.py
│   │   ├── resource_coordinator.py
│   │   └── thread_coordinator.py
│   └── config.py
└── main.py

Processing Flow

  1. Query Input: User provides text, speech, or multimodal input
  2. Intent Classification: System determines query type and complexity
  3. Query Decomposition: Complex queries are split into subqueries
  4. Node Routing: Each subquery is routed to appropriate processing node
  5. Parallel Processing: Vision, speech, and language nodes process simultaneously
  6. Response Collection: All node responses are gathered
  7. Response Stitching: LLM combines responses into coherent output
  8. Output Generation: Final response delivered via text and/or speech

[Screenshot Placeholder: Processing flow visualization]
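
The toy functions below illustrate steps 3, 4, and 7 of this flow (decomposition, routing, stitching) in deliberately naive form; the actual system delegates decomposition and stitching to the LLM rather than splitting and joining strings.

def decompose(query: str) -> list[str]:
    # split "what do you see and what is mars" into independent subqueries
    return [part.strip() for part in query.split(" and ") if part.strip()]

def route(subquery: str) -> str:
    return "vision" if "see" in subquery.lower() else "qa"

def stitch(responses: list[str]) -> str:
    # the real system asks the LLM to merge partial answers into one coherent reply
    return " ".join(responses)

subqueries = decompose("what do you see and what is mars")
answers = [f"[{route(q)} node] answer to '{q}'" for q in subqueries]
print(stitch(answers))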

Performance Monitoring

TARS includes comprehensive monitoring and logging:

  • Real-time Metrics: Response times, success rates, system health
  • Detailed Logging: UTF-8 encoded logs with full processing traces
  • Health Diagnostics: Component status and performance analytics
  • Error Tracking: Anomaly detection and recovery mechanisms

[Screenshot Placeholder: Performance metrics and health dashboards]
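
One plausible shape for per-node metrics (not the repository's actual implementation) is a timing decorator that appends to the UTF-8 log; note that the encoding argument to logging.basicConfig requires Python 3.9+.

import functools
import logging
import time

logging.basicConfig(filename="logs/tars.log", encoding="utf-8", level=logging.INFO)

def timed(node_name: str):
    # Illustrative decorator: logs how long each node takes to respond
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                logging.info("%s completed in %.2f s", node_name, time.perf_counter() - start)
        return inner
    return wrap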

Known Limitations

  • Vosk Accuracy: Lightweight model may have transcription limitations
  • Windows-Specific: Some threading components are optimized for Windows
  • Resource Intensive: Requires substantial RAM for optimal performance
  • Robot Control: Currently in development, limited functionality
  • Model Dependencies: Requires manual model downloads
  • Code Organization: The codebase has been refactored extensively and remains more complex than ideal

Development Notes

This project represents extensive learning and experimentation in multimodal AI systems. The codebase has been refactored numerous times as we explored different approaches to:

  • Multimodal AI integration
  • Real-time audio processing
  • Vision-language model coordination
  • Thread management and system orchestration
  • Performance optimization for CPU-only inference

The codebase is not for everyone - it reflects a journey of discovery and learning rather than a polished commercial product.

Troubleshooting

Common Issues

  1. Audio Issues: Check microphone permissions and audio device configuration
  2. Model Loading: Ensure all models are downloaded to correct directories
  3. Memory Issues: Increase available RAM or reduce model sizes
  4. Threading Errors: Verify Windows compatibility for threading components

Future Development

Originally Planned Features

  • Live Audio Streaming: Real-time voice conversation with minimal latency
  • Live Vision Processing: Continuous video analysis and commentary
  • Real-time Multimodal Communication: Seamless integration of speech, vision, and text in real-time

Current Constraints

The vision of true real-time multimodal communication remains challenging due to:

  • Processing Overhead: Running Phi-4-mini (3.8B parameters), BLIP, YOLO, and Vosk simultaneously creates significant computational load
  • Sequential Processing: Current architecture processes queries sequentially rather than in parallel streams
  • Hardware Limitations: CPU-only inference on consumer hardware introduces unavoidable latency
  • Memory Bandwidth: Large-model switching and context management create processing bottlenecks

Potential Solutions

  • Model quantization and optimization for faster inference
  • Parallel processing architecture redesign
  • Streaming response generation
  • Hardware acceleration integration (GPU support)
  • Lightweight model alternatives for real-time components

Contributing

This is a personal research project. The code is provided as-is for educational and research purposes.

Acknowledgments

  • Inspired by TARS from Interstellar
  • Built with LangGraph, Vosk, BLIP, and other open-source technologies
  • Developed as a learning project in multimodal AI systems

Version: 1.0.0
Author: Akshanth Sai K
Project Type: Personal Development

"See you on the other side." - TARS
