- Remote training and inference w/ SLEAP
- Remote Authenticated CLI Training w/ SLEAP
SLEAP-RTC supports flexible configuration for different deployment environments (development, staging, production).
Configuration is loaded in the following priority order (highest to lowest):
- CLI arguments - Explicit command-line flags like
--server - Environment variables -
SLEAP_RTC_SIGNALING_WS,SLEAP_RTC_SIGNALING_HTTP - Configuration file - TOML file with environment-specific settings
- Defaults - Production signaling server
Set the environment using the SLEAP_RTC_ENV environment variable:
export SLEAP_RTC_ENV=development # Use development environment
export SLEAP_RTC_ENV=staging # Use staging environment
export SLEAP_RTC_ENV=production # Use production environment (default)Valid environments: development, staging, production
Create a configuration file at one of these locations:
sleap-rtc.tomlin your project directory~/.sleap-rtc/config.tomlin your home directory
See config.example.toml for a complete example with all environments.
Example configuration:
[default]
# Shared settings across all environments
connection_timeout = 30
chunk_size = 65536
[environments.development]
signaling_websocket = "ws://localhost:8080"
signaling_http = "http://localhost:8001"
[environments.staging]
signaling_websocket = "ws://staging-server.example.com:8080"
signaling_http = "http://staging-server.example.com:8001"
[environments.production]
signaling_websocket = "ws://ec2-54-176-92-10.us-west-1.compute.amazonaws.com:8080"
signaling_http = "http://ec2-54-176-92-10.us-west-1.compute.amazonaws.com:8001"Override specific settings using environment variables:
# Override WebSocket URL
export SLEAP_RTC_SIGNALING_WS="ws://custom-server.com:8080"
# Override HTTP API URL
export SLEAP_RTC_SIGNALING_HTTP="http://custom-server.com:8001"# Use default production environment
sleap-rtc train data.slp
# Use development environment
SLEAP_RTC_ENV=development sleap-rtc train data.slp
# Use staging environment
SLEAP_RTC_ENV=staging sleap-rtc train data.slp
# Override with environment variable
SLEAP_RTC_SIGNALING_WS=ws://custom.com:8080 sleap-rtc train data.slp
# Override with CLI argument
sleap-rtc train data.slp --server ws://custom.com:8080If no configuration is provided, SLEAP-RTC defaults to the production signaling server, maintaining backward compatibility with existing deployments.
SLEAP-RTC supports two file transfer methods with automatic selection based on your environment:
| Method | Speed (5 GB file) | Use Case | Setup Required |
|---|---|---|---|
| Shared Storage | ~10-30 seconds | Large training packages (5-9 GB) | ✅ Yes - configure shared filesystem |
| RTC Transfer | ~15-30 minutes | Any environment | ❌ No - works out of the box |
Automatic Selection: SLEAP-RTC automatically uses the fastest available method:
- If shared storage is configured → Shared Storage (60-90x faster for large files)
- Otherwise → RTC Transfer (original method, fully backward compatible)
For optimal performance with large files, configure shared storage:
# Set environment variable pointing to your shared filesystem
export SHARED_STORAGE_ROOT="/Volumes/talmo/amick" # On Client (macOS/local)
export SHARED_STORAGE_ROOT="/home/jovyan/vast/amick" # On Worker (Vast.ai)
# Or use CLI argument
sleap-rtc client --shared-storage-root /Volumes/talmo/amick
sleap-rtc worker --shared-storage-root /home/jovyan/vast/amickHow It Works:
- Shared Storage: Client writes files to shared filesystem, sends only paths over network
- RTC Transfer: Client sends files as chunked data over WebRTC data channel
Requirements:
- Both Client and Worker must have access to the same shared filesystem (NFS, local mount, etc.)
- Paths can differ (e.g.,
/Volumes/talmo/amickon macOS,/home/jovyan/vast/amickon Linux)
For detailed setup instructions, see DEVELOPMENT.md - Shared Storage Configuration.
Typical transfer times for training packages:
| File Size | Shared Storage | RTC Transfer | Speedup |
|---|---|---|---|
| 500 MB | ~5 seconds | ~2 minutes | 24x |
| 2 GB | ~10 seconds | ~8 minutes | 48x |
| 5 GB | ~20 seconds | ~20 minutes | 60x |
| 9 GB | ~30 seconds | ~30 minutes | 60x |
Benchmarks assume: Shared storage on local NFS/SSD (100+ MB/s), RTC transfer over internet (5-10 MB/s)
Use Shared Storage when:
- ✅ Training packages are large (> 1 GB)
- ✅ Client and Worker have access to shared filesystem (Vast.ai, RunAI, local NFS)
- ✅ You need fast iteration cycles during model development
- ✅ You're doing batch training on multiple datasets
Use RTC Transfer when:
- ✅ No shared storage available (different networks, cloud instances)
- ✅ Files are small (< 100 MB)
- ✅ One-time training runs
- ✅ Quick testing without infrastructure setup
Both methods coexist - SLEAP-RTC automatically selects the best option and falls back gracefully.
SLEAP-RTC provides commands for running workers and clients for remote training and inference.
Start a worker to process training or inference jobs:
# Start a worker (creates a new room)
sleap-rtc worker
# Join an existing room (for multi-worker scenarios)
sleap-rtc worker --room-id <room_id> --token <token>When a worker starts, it displays connection credentials:
================================================================================
Worker authenticated with server
================================================================================
Session string for DIRECT connection to this worker:
eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ==
Room credentials for OTHER workers/clients to join this room:
Room ID: room_abc123
Token: token_xyz789
Use session string with --session-string for direct connection
Use room credentials with --room-id and --token for worker discovery
================================================================================
Connect to a worker to run a training job:
# Option 1: Direct connection using session string
sleap-rtc client-train \
--session-string <session_string> \
--pkg-path /path/to/training_package.zip
# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-train \
--room-id <room_id> \
--token <token> \
--pkg-path /path/to/training_package.zip
# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-train \
--room-id <room_id> \
--token <token> \
--pkg-path /path/to/training_package.zip \
--auto-select
# Option 4: Connect to specific worker in room (skip discovery)
sleap-rtc client-train \
--room-id <room_id> \
--token <token> \
--worker-id <peer_id> \
--pkg-path /path/to/training_package.zipAdditional options:
--controller-port <port>: ZMQ controller port (default: 9000)--publish-port <port>: ZMQ publish port (default: 9001)--min-gpu-memory <MB>: Filter workers by minimum GPU memory
Connect to a worker to run an inference job:
# Option 1: Direct connection using session string
sleap-rtc client-track \
--session-string <session_string> \
--pkg-path /path/to/inference_package.zip
# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-track \
--room-id <room_id> \
--token <token> \
--pkg-path /path/to/inference_package.zip
# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-track \
--room-id <room_id> \
--token <token> \
--pkg-path /path/to/inference_package.zip \
--auto-selectSLEAP-RTC supports a flexible two-phase connection workflow:
- Phase 1: Join Room - Client authenticates with signaling server and joins a room
- Phase 2: Worker Discovery & Selection - Client discovers available workers and selects one
This model provides several advantages:
- Visibility: See all available workers before connecting
- Flexibility: Choose workers based on capabilities (GPU memory, status, hostname)
- Resilience: If a worker is busy, easily discover and select alternatives
- Multi-worker: Support multiple workers in a single room for load balancing
Use when you have a session string from a specific worker:
# Worker displays session string on startup
sleap-rtc worker
# Copy the session string from output
# Client connects directly to that worker
sleap-rtc client-train --session-string <session_string> --pkg-path package.zipWhen to use:
- Single worker scenarios
- Direct connection to a specific known worker
- Minimal configuration required
Limitations:
- If the worker is busy, connection will be rejected
- No worker discovery or selection capability
- Must obtain new session string if worker restarts
Use when you want to see available workers and choose interactively:
# Start multiple workers in the same room
sleap-rtc worker # Worker 1 creates room, displays credentials
sleap-rtc worker --room-id <room_id> --token <token> # Worker 2 joins
sleap-rtc worker --room-id <room_id> --token <token> # Worker 3 joins
# Client discovers and selects worker interactively
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path package.zipInteractive selection displays:
Discovering workers in room...
Found 3 available workers:
1. Worker peer_abc123
GPU: NVIDIA RTX 4090 (24576 MB)
Status: available
Hostname: gpu-server-1
2. Worker peer_def456
GPU: NVIDIA RTX 3090 (24576 MB)
Status: available
Hostname: gpu-server-2
3. Worker peer_ghi789
GPU: NVIDIA GTX 1080 Ti (11264 MB)
Status: available
Hostname: gpu-workstation
Select worker (1-3) or 'r' to refresh:
When to use:
- Multiple workers available
- Want to see worker specifications before connecting
- Need to verify worker status before job submission
- Want to manually choose based on current availability
Features:
- Real-time worker information (GPU model, memory, status, hostname)
- Refresh capability to update worker list
- Only shows workers with status "available"
Use when you want the system to automatically choose the best worker:
sleap-rtc client-train \
--room-id <room_id> \
--token <token> \
--pkg-path package.zip \
--auto-selectBehavior:
- Discovers all available workers in the room
- Automatically selects worker with highest GPU memory
- No user interaction required
- Ideal for scripts and automated workflows
When to use:
- Automated training pipelines
- Scripts that need deterministic worker selection
- Prefer best hardware without manual selection
Use when you know the specific worker peer-id you want:
sleap-rtc client-train \
--room-id <room_id> \
--token <token> \
--worker-id <peer_id> \
--pkg-path package.zipBehavior:
- Skips worker discovery
- Connects directly to specified worker by peer-id
- Still uses room credentials for authentication
When to use:
- You know the exact worker peer-id you need
- Want to target a specific worker without discovery overhead
- Scripted workflows with predetermined worker assignment
Set up multiple workers in a room for parallel job processing:
# Terminal 1: Start Worker 1 (creates room)
sleap-rtc worker
# Save room_id and token from output
# Terminal 2: Start Worker 2 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>
# Terminal 3: Start Worker 3 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>
# Terminal 4: Client 1 discovers and selects a worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job1.zip
# Terminal 5: Client 2 discovers and selects different worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job2.zipResult: Each client can independently select from available workers, enabling parallel job execution.
Workers with different GPU configurations can coexist in a room:
# High-end worker (RTX 4090)
sleap-rtc worker --room-id shared_room --token shared_token
# Mid-tier worker (RTX 3090)
sleap-rtc worker --room-id shared_room --token shared_token
# Budget worker (GTX 1080 Ti)
sleap-rtc worker --room-id shared_room --token shared_token
# Client auto-selects best worker (RTX 4090)
sleap-rtc client-train \
--room-id shared_room \
--token shared_token \
--pkg-path large_job.zip \
--auto-selectFeatures:
- Clients can filter by
--min-gpu-memoryto ensure sufficient resources - Auto-select automatically chooses worker with most GPU memory
- Interactive mode shows GPU specs for informed selection
If a worker becomes unavailable, clients can easily discover alternatives:
# Client attempts connection to Worker 1 via session string
sleap-rtc client-train --session-string <worker1_session> --pkg-path job.zip
# ERROR: Worker is currently busy
# Client falls back to room-based discovery
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job.zip
# SUCCESS: Discovers Worker 2 and Worker 3 are available, selects Worker 2Workers maintain status to coordinate connections and prevent conflicts:
| Status | Description | Accepts New Connections? |
|---|---|---|
available |
Worker is idle and ready to accept jobs | ✅ Yes |
reserved |
Worker accepted connection, negotiating job | ❌ No |
busy |
Worker is actively processing a job | ❌ No |
Status transitions:
available → reserved → busy → available
↑ ↓
└────────────────────────────┘
When a client attempts to connect to a busy or reserved worker (e.g., via session string), the worker will reject the connection:
Client output:
Connecting to worker...
ERROR: Worker is currently busy. Please use --room-id and --token to discover available workers.
Connection rejected by worker.
Worker output:
Received offer SDP
Rejecting connection from peer_xyz789 - worker is busy
Sent busy rejection to client peer_xyz789
Why this matters:
- Prevents job conflicts: Multiple clients cannot interfere with each other's jobs
- Protects data integrity: Ensures one job completes before starting another
- Clear error messages: Clients receive actionable feedback
- Room-based alternative: Rejection message suggests using room discovery to find available workers
- Use room-based discovery for production: More resilient to worker availability changes
- Session strings for development: Convenient for testing with a single known worker
- Auto-select for automation: Deterministic worker selection in scripts
- Check worker status: Room-based discovery only shows "available" workers
- Multi-worker for availability: Deploy multiple workers to handle concurrent jobs
- GPU filtering: Use
--min-gpu-memoryto ensure workers have sufficient resources