sleap-rtc

Remote training and inference w/ SLEAP
Remote Authenticated CLI Training w/ SLEAP

Configuration

SLEAP-RTC supports flexible configuration for different deployment environments (development, staging, production).

Configuration Priority

Configuration is loaded in the following priority order (highest to lowest):

1. CLI arguments - Explicit command-line flags like --server
2. Environment variables - SLEAP_RTC_SIGNALING_WS, SLEAP_RTC_SIGNALING_HTTP
3. Configuration file - TOML file with environment-specific settings
4. Defaults - Production signaling server
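
The priority order can be illustrated with a short Python sketch. It is not the actual SLEAP-RTC loader: the function name `resolve_signaling_ws` and its parameters are hypothetical, and the default URL is simply the production value from the example configuration in this README.

```python
import os

# Production default taken from the example configuration in this README.
DEFAULT_SIGNALING_WS = "ws://ec2-54-176-92-10.us-west-1.compute.amazonaws.com:8080"


def resolve_signaling_ws(cli_server=None, env=None, file_settings=None):
    """Return the WebSocket URL from the highest-priority source that sets it.

    Hypothetical helper: the real loader may be structured differently.
    """
    env = os.environ if env is None else env
    file_settings = file_settings or {}
    candidates = (
        cli_server,                                # 1. CLI argument (--server)
        env.get("SLEAP_RTC_SIGNALING_WS"),         # 2. environment variable
        file_settings.get("signaling_websocket"),  # 3. configuration file
        DEFAULT_SIGNALING_WS,                      # 4. built-in default
    )
    # The first non-empty candidate wins.
    return next(value for value in candidates if value)
```

Passing `env` and `file_settings` explicitly keeps the sketch easy to test without touching the real process environment.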

Environment Selection

Set the environment using the SLEAP_RTC_ENV environment variable:

export SLEAP_RTC_ENV=development  # Use development environment
export SLEAP_RTC_ENV=staging      # Use staging environment
export SLEAP_RTC_ENV=production   # Use production environment (default)

Valid environments: development, staging, production

Configuration File

Create a configuration file at one of these locations:

sleap-rtc.toml in your project directory
~/.sleap-rtc/config.toml in your home directory

See config.example.toml for a complete example with all environments.

Example configuration:

[default]
# Shared settings across all environments
connection_timeout = 30
chunk_size = 65536

[environments.development]
signaling_websocket = "ws://localhost:8080"
signaling_http = "http://localhost:8001"

[environments.staging]
signaling_websocket = "ws://staging-server.example.com:8080"
signaling_http = "http://staging-server.example.com:8001"

[environments.production]
signaling_websocket = "ws://ec2-54-176-92-10.us-west-1.compute.amazonaws.com:8080"
signaling_http = "http://ec2-54-176-92-10.us-west-1.compute.amazonaws.com:8001"

Environment Variable Overrides

Override specific settings using environment variables:

# Override WebSocket URL
export SLEAP_RTC_SIGNALING_WS="ws://custom-server.com:8080"

# Override HTTP API URL
export SLEAP_RTC_SIGNALING_HTTP="http://custom-server.com:8001"

Usage Examples

# Use default production environment
sleap-rtc train data.slp

# Use development environment
SLEAP_RTC_ENV=development sleap-rtc train data.slp

# Use staging environment
SLEAP_RTC_ENV=staging sleap-rtc train data.slp

# Override with environment variable
SLEAP_RTC_SIGNALING_WS=ws://custom.com:8080 sleap-rtc train data.slp

# Override with CLI argument
sleap-rtc train data.slp --server ws://custom.com:8080

Backward Compatibility

If no configuration is provided, SLEAP-RTC defaults to the production signaling server, maintaining backward compatibility with existing deployments.

Performance & File Transfer

SLEAP-RTC supports two file transfer methods with automatic selection based on your environment:

Transfer Methods

| Method | Speed (5 GB file) | Use Case | Setup Required |
|---|---|---|---|
| Shared Storage | ~10-30 seconds | Large training packages (5-9 GB) | ✅ Yes - configure shared filesystem |
| RTC Transfer | ~15-30 minutes | Any environment | ❌ No - works out of the box |

Automatic Selection: SLEAP-RTC automatically uses the fastest available method:

If shared storage is configured → Shared Storage (roughly 60x faster for multi-gigabyte files in the benchmarks below)
Otherwise → RTC Transfer (original method, fully backward compatible)
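
The selection rule can be sketched as follows. This is only an illustration under stated assumptions: `choose_transfer_method` is a hypothetical helper, and treating "configured" as "the root is set and points at an existing directory" is a guess about how the check might work.

```python
import os
from pathlib import Path


def choose_transfer_method(cli_root=None, env=None):
    """Return "shared_storage" if a usable root is configured, else "rtc"."""
    env = os.environ if env is None else env
    # Assumption: the CLI flag takes precedence over SHARED_STORAGE_ROOT.
    root = cli_root or env.get("SHARED_STORAGE_ROOT")
    if root and Path(root).is_dir():
        return "shared_storage"
    # No usable shared filesystem: fall back to the WebRTC data channel.
    return "rtc"
```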

Quick Setup: Shared Storage

For optimal performance with large files, configure shared storage:

# Set environment variable pointing to your shared filesystem
export SHARED_STORAGE_ROOT="/Volumes/talmo/amick"  # On Client (macOS/local)
export SHARED_STORAGE_ROOT="/home/jovyan/vast/amick"  # On Worker (Vast.ai)

# Or use CLI argument
sleap-rtc client --shared-storage-root /Volumes/talmo/amick
sleap-rtc worker --shared-storage-root /home/jovyan/vast/amick

How It Works:

Shared Storage: Client writes files to shared filesystem, sends only paths over network
RTC Transfer: Client sends files as chunked data over WebRTC data channel
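
To make the chunked RTC transfer concrete, here is a minimal Python sketch of splitting a payload into fixed-size pieces. The 65536-byte size mirrors `chunk_size` in the example configuration; the function name `iter_chunks` is hypothetical, and the framing SLEAP-RTC actually uses on the data channel is not described in this README.

```python
def iter_chunks(payload: bytes, chunk_size: int = 65536):
    """Yield successive chunk_size slices of payload (the last may be shorter)."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]


payload = bytes(150_000)  # stand-in for a training package
chunks = list(iter_chunks(payload))
print([len(c) for c in chunks])  # [65536, 65536, 18928]

# The receiver can rebuild the file by concatenating chunks in order.
assert b"".join(chunks) == payload
```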

Requirements:

Both Client and Worker must have access to the same shared filesystem (NFS, local mount, etc.)
Paths can differ (e.g., /Volumes/talmo/amick on macOS, /home/jovyan/vast/amick on Linux)
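
One way to handle differing mount points is to exchange paths relative to the shared root. The sketch below assumes that approach; the helper names and the `jobs/package.zip` file are hypothetical, and the real implementation may translate paths differently.

```python
from pathlib import PurePosixPath

CLIENT_ROOT = PurePosixPath("/Volumes/talmo/amick")     # mount point on the Client
WORKER_ROOT = PurePosixPath("/home/jovyan/vast/amick")  # mount point on the Worker


def to_relative(path, root):
    """Client side: strip the local mount point before sending the path."""
    return PurePosixPath(path).relative_to(root)  # ValueError if outside root


def to_local(relative, root):
    """Worker side: resolve the received relative path under its own mount."""
    relative = PurePosixPath(relative)
    # Reject paths that could escape the shared root.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"Unsafe relative path: {relative}")
    return root / relative


sent = to_relative("/Volumes/talmo/amick/jobs/package.zip", CLIENT_ROOT)
print(to_local(sent, WORKER_ROOT))  # /home/jovyan/vast/amick/jobs/package.zip
```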

For detailed setup instructions, see DEVELOPMENT.md - Shared Storage Configuration.

Performance Benchmarks

Typical transfer times for training packages:

| File Size | Shared Storage | RTC Transfer | Speedup |
|---|---|---|---|
| 500 MB | ~5 seconds | ~2 minutes | 24x |
| 2 GB | ~10 seconds | ~8 minutes | 48x |
| 5 GB | ~20 seconds | ~20 minutes | 60x |
| 9 GB | ~30 seconds | ~30 minutes | 60x |

Benchmarks assume: Shared storage on local NFS/SSD (100+ MB/s), RTC transfer over internet (5-10 MB/s)
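
The speedup column follows directly from the two time columns. The short Python check below reproduces it and derives the throughput implied by the 9 GB row, taking 1 GB = 1000 MB as a simplifying assumption.

```python
# (label, shared-storage seconds, RTC-transfer seconds) from the table above
benchmarks = [
    ("500 MB", 5, 2 * 60),
    ("2 GB", 10, 8 * 60),
    ("5 GB", 20, 20 * 60),
    ("9 GB", 30, 30 * 60),
]

speedups = {label: rtc // shared for label, shared, rtc in benchmarks}
print(speedups)  # {'500 MB': 24, '2 GB': 48, '5 GB': 60, '9 GB': 60}

# Implied throughput for the 9 GB row, in MB/s.
rtc_mb_per_s = round(9000 / (30 * 60), 1)
shared_mb_per_s = round(9000 / 30, 1)
print(rtc_mb_per_s, shared_mb_per_s)  # 5.0 300.0
```

The implied rates (about 5 MB/s over RTC, about 300 MB/s over shared storage) sit within the stated benchmark assumptions.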

When to Use Each Method

Use Shared Storage when:

✅ Training packages are large (> 1 GB)
✅ Client and Worker have access to shared filesystem (Vast.ai, RunAI, local NFS)
✅ You need fast iteration cycles during model development
✅ You're doing batch training on multiple datasets

Use RTC Transfer when:

✅ No shared storage available (different networks, cloud instances)
✅ Files are small (< 100 MB)
✅ One-time training runs
✅ Quick testing without infrastructure setup

Both methods coexist: SLEAP-RTC uses shared storage when it is configured and otherwise falls back to RTC transfer.

CLI Usage

SLEAP-RTC provides commands for running workers and clients for remote training and inference.

Worker Commands

Start a worker to process training or inference jobs:

# Start a worker (creates a new room)
sleap-rtc worker

# Join an existing room (for multi-worker scenarios)
sleap-rtc worker --room-id <room_id> --token <token>

When a worker starts, it displays connection credentials:

================================================================================
Worker authenticated with server
================================================================================

Session string for DIRECT connection to this worker:
  eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ==

Room credentials for OTHER workers/clients to join this room:
  Room ID: room_abc123
  Token:   token_xyz789

Use session string with --session-string for direct connection
Use room credentials with --room-id and --token for worker discovery
================================================================================
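
The example session string shown above happens to be base64-encoded JSON. The sketch below decodes it in Python. Real session strings may use a different layout, and reading the keys `r`, `t`, and `p` as room, token, and peer is an inference from the placeholder values, not documented behavior.

```python
import base64
import json

# Session string copied from the example worker output in this README.
EXAMPLE_SESSION_STRING = (
    "eyJyIjogInJvb21faWQiLCAidCI6ICJ0b2tlbiIsICJwIjogInBlZXJfaWQifQ=="
)


def decode_session_string(session_string):
    """Decode a base64-encoded JSON payload into a dict."""
    return json.loads(base64.b64decode(session_string))


print(decode_session_string(EXAMPLE_SESSION_STRING))
# {'r': 'room_id', 't': 'token', 'p': 'peer_id'}
```

Because the string embeds room credentials, it should be handled like a secret.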

Client Commands

Training Client

Connect to a worker to run a training job:

# Option 1: Direct connection using session string
sleap-rtc client-train \
  --session-string <session_string> \
  --pkg-path /path/to/training_package.zip

# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/training_package.zip

# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/training_package.zip \
  --auto-select

# Option 4: Connect to specific worker in room (skip discovery)
sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --worker-id <peer_id> \
  --pkg-path /path/to/training_package.zip

Additional options:

--controller-port <port>: ZMQ controller port (default: 9000)
--publish-port <port>: ZMQ publish port (default: 9001)
--min-gpu-memory <MB>: Filter workers by minimum GPU memory

Inference Client

Connect to a worker to run an inference job:

# Option 1: Direct connection using session string
sleap-rtc client-track \
  --session-string <session_string> \
  --pkg-path /path/to/inference_package.zip

# Option 2: Room-based discovery with interactive worker selection
sleap-rtc client-track \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/inference_package.zip

# Option 3: Auto-select best worker by GPU memory
sleap-rtc client-track \
  --room-id <room_id> \
  --token <token> \
  --pkg-path /path/to/inference_package.zip \
  --auto-select

Connection Workflows

Two-Phase Connection Model

SLEAP-RTC supports a flexible two-phase connection workflow:

Phase 1: Join Room - Client authenticates with signaling server and joins a room
Phase 2: Worker Discovery & Selection - Client discovers available workers and selects one

This model provides several advantages:

Visibility: See all available workers before connecting
Flexibility: Choose workers based on capabilities (GPU memory, status, hostname)
Resilience: If a worker is busy, easily discover and select alternatives
Multi-worker: Support multiple workers in a single room for load balancing

Connection Mode 1: Session String (Direct Connection)

Use when you have a session string from a specific worker:

# Worker displays session string on startup
sleap-rtc worker
# Copy the session string from output

# Client connects directly to that worker
sleap-rtc client-train --session-string <session_string> --pkg-path package.zip

When to use:

Single worker scenarios
Direct connection to a specific known worker
Minimal configuration required

Limitations:

If the worker is busy, connection will be rejected
No worker discovery or selection capability
Must obtain new session string if worker restarts

Connection Mode 2: Room-Based Discovery (Interactive Selection)

Use when you want to see available workers and choose interactively:

# Start multiple workers in the same room
sleap-rtc worker  # Worker 1 creates room, displays credentials
sleap-rtc worker --room-id <room_id> --token <token>  # Worker 2 joins
sleap-rtc worker --room-id <room_id> --token <token>  # Worker 3 joins

# Client discovers and selects worker interactively
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path package.zip

Interactive selection displays:

Discovering workers in room...
Found 3 available workers:

1. Worker peer_abc123
   GPU: NVIDIA RTX 4090 (24576 MB)
   Status: available
   Hostname: gpu-server-1

2. Worker peer_def456
   GPU: NVIDIA RTX 3090 (24576 MB)
   Status: available
   Hostname: gpu-server-2

3. Worker peer_ghi789
   GPU: NVIDIA GTX 1080 Ti (11264 MB)
   Status: available
   Hostname: gpu-workstation

Select worker (1-3) or 'r' to refresh:

When to use:

Multiple workers available
Want to see worker specifications before connecting
Need to verify worker status before job submission
Want to manually choose based on current availability

Features:

Real-time worker information (GPU model, memory, status, hostname)
Refresh capability to update worker list
Only shows workers with status "available"

Connection Mode 3: Auto-Select (Automatic Best Worker)

Use when you want the system to automatically choose the best worker:

sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --pkg-path package.zip \
  --auto-select

Behavior:

Discovers all available workers in the room
Automatically selects worker with highest GPU memory
No user interaction required
Ideal for scripts and automated workflows
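
The selection rule described above can be sketched in a few lines of Python. The `Worker` dataclass and `auto_select` function are hypothetical, the worker data is borrowed from the interactive example in this README, and the tie-breaking shown (first worker wins) is an assumption, since the README does not say how ties are resolved.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    peer_id: str
    gpu_name: str
    gpu_memory_mb: int
    status: str


def auto_select(workers, min_gpu_memory=0):
    """Pick the available worker with the most GPU memory, or None."""
    candidates = [
        w for w in workers
        if w.status == "available" and w.gpu_memory_mb >= min_gpu_memory
    ]
    # max() keeps the first worker on ties; real tie-breaking is not documented.
    return max(candidates, key=lambda w: w.gpu_memory_mb, default=None)


workers = [
    Worker("peer_abc123", "NVIDIA RTX 4090", 24576, "available"),
    Worker("peer_def456", "NVIDIA RTX 3090", 24576, "available"),
    Worker("peer_ghi789", "NVIDIA GTX 1080 Ti", 11264, "available"),
]
print(auto_select(workers).peer_id)  # peer_abc123
```

Note that the RTX 4090 and RTX 3090 both report 24576 MB, so memory alone does not distinguish them in this pool.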

When to use:

Automated training pipelines
Scripts that need deterministic worker selection
Prefer best hardware without manual selection

Connection Mode 4: Direct Worker in Room

Use when you know the specific worker peer-id you want:

sleap-rtc client-train \
  --room-id <room_id> \
  --token <token> \
  --worker-id <peer_id> \
  --pkg-path package.zip

Behavior:

Skips worker discovery
Connects directly to specified worker by peer-id
Still uses room credentials for authentication

When to use:

You know the exact worker peer-id you need
Want to target a specific worker without discovery overhead
Scripted workflows with predetermined worker assignment

Multi-Worker Scenarios

Scenario 1: Load Balancing Across Multiple Workers

Set up multiple workers in a room for parallel job processing:

# Terminal 1: Start Worker 1 (creates room)
sleap-rtc worker
# Save room_id and token from output

# Terminal 2: Start Worker 2 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>

# Terminal 3: Start Worker 3 (joins same room)
sleap-rtc worker --room-id <room_id> --token <token>

# Terminal 4: Client 1 discovers and selects a worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job1.zip

# Terminal 5: Client 2 discovers and selects different worker
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job2.zip

Result: Each client can independently select from available workers, enabling parallel job execution.

Scenario 2: Heterogeneous Worker Pool

Workers with different GPU configurations can coexist in a room:

# High-end worker (RTX 4090)
sleap-rtc worker --room-id shared_room --token shared_token

# Mid-tier worker (RTX 3090)
sleap-rtc worker --room-id shared_room --token shared_token

# Budget worker (GTX 1080 Ti)
sleap-rtc worker --room-id shared_room --token shared_token

# Client auto-selects best worker (RTX 4090)
sleap-rtc client-train \
  --room-id shared_room \
  --token shared_token \
  --pkg-path large_job.zip \
  --auto-select

Features:

Clients can pass --min-gpu-memory to exclude workers without sufficient GPU memory
Auto-select chooses the worker with the most GPU memory
Interactive mode shows GPU specs for informed selection

Scenario 3: High-Availability Setup

If a worker becomes unavailable, clients can easily discover alternatives:

# Client attempts connection to Worker 1 via session string
sleap-rtc client-train --session-string <worker1_session> --pkg-path job.zip
# ERROR: Worker is currently busy

# Client falls back to room-based discovery
sleap-rtc client-train --room-id <room_id> --token <token> --pkg-path job.zip
# SUCCESS: Discovers Worker 2 and Worker 3 are available, selects Worker 2
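
In a script, the same fallback could be automated. The Python sketch below shows only the control flow: `WorkerBusyError` and the two callables are hypothetical stand-ins, not part of any documented SLEAP-RTC API.

```python
class WorkerBusyError(Exception):
    """Hypothetical error raised when a worker rejects a connection as busy."""


def connect_with_fallback(connect_direct, connect_via_room):
    """Try the direct session-string path first, then room discovery."""
    try:
        return connect_direct()
    except WorkerBusyError:
        # The targeted worker is busy: discover another one in the room.
        return connect_via_room()


def direct():
    raise WorkerBusyError("Worker is currently busy")


def via_room():
    return "connected via room discovery"


print(connect_with_fallback(direct, via_room))  # connected via room discovery
```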

Worker Status and Safeguards

Worker Status Lifecycle

Workers maintain status to coordinate connections and prevent conflicts:

| Status | Description | Accepts New Connections? |
|---|---|---|
| `available` | Worker is idle and ready to accept jobs | ✅ Yes |
| `reserved` | Worker accepted connection, negotiating job | ❌ No |
| `busy` | Worker is actively processing a job | ❌ No |

Status transitions:

available → reserved → busy → available
    ↑                            ↓
    └────────────────────────────┘
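
The lifecycle can be modeled as a small state machine. The Python sketch below encodes only the transitions shown in the diagram; the names are hypothetical, and any additional transitions the real worker allows (for example, releasing a reservation after a failed negotiation) are not covered.

```python
from enum import Enum


class WorkerStatus(Enum):
    AVAILABLE = "available"
    RESERVED = "reserved"
    BUSY = "busy"


# Transitions taken from the lifecycle diagram above.
ALLOWED = {
    WorkerStatus.AVAILABLE: {WorkerStatus.RESERVED},
    WorkerStatus.RESERVED: {WorkerStatus.BUSY},
    WorkerStatus.BUSY: {WorkerStatus.AVAILABLE},
}


def transition(current, new):
    """Return the new status, or raise if the move is not in the diagram."""
    if new not in ALLOWED[current]:
        raise ValueError(f"Invalid transition: {current.value} -> {new.value}")
    return new


def accepts_connections(status):
    """Only an available worker accepts new connections."""
    return status is WorkerStatus.AVAILABLE


status = WorkerStatus.AVAILABLE
status = transition(status, WorkerStatus.RESERVED)
print(status.value, accepts_connections(status))  # reserved False
```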

Busy Rejection Behavior

When a client attempts to connect to a busy or reserved worker (e.g., via session string), the worker will reject the connection:

Client output:

Connecting to worker...
ERROR: Worker is currently busy. Please use --room-id and --token to discover available workers.
Connection rejected by worker.

Worker output:

Received offer SDP
Rejecting connection from peer_xyz789 - worker is busy
Sent busy rejection to client peer_xyz789

Why this matters:

Prevents job conflicts: Multiple clients cannot interfere with each other's jobs
Protects data integrity: Ensures one job completes before starting another
Clear error messages: Clients receive actionable feedback
Room-based alternative: Rejection message suggests using room discovery to find available workers

Best Practices

Use room-based discovery for production: More resilient to worker availability changes
Session strings for development: Convenient for testing with a single known worker
Auto-select for automation: Deterministic worker selection in scripts
Check worker status: Room-based discovery only shows "available" workers
Multi-worker for availability: Deploy multiple workers to handle concurrent jobs
GPU filtering: Use --min-gpu-memory to ensure workers have sufficient resources
