Feat/external eval server #23

Merged: 6 commits, Jul 18, 2025

4 changes: 4 additions & 0 deletions config/gni/devtools_grd_files.gni
@@ -645,6 +645,10 @@ grd_files_bundled_sources = [
"front_end/panels/ai_chat/common/log.js",
"front_end/panels/ai_chat/common/context.js",
"front_end/panels/ai_chat/common/page.js",
"front_end/panels/ai_chat/common/WebSocketRPCClient.js",
"front_end/panels/ai_chat/common/EvaluationConfig.js",
"front_end/panels/ai_chat/evaluation/EvaluationProtocol.js",
"front_end/panels/ai_chat/evaluation/EvaluationAgent.js",
"front_end/panels/ai_chat/tracing/TracingProvider.js",
"front_end/panels/ai_chat/tracing/LangfuseProvider.js",
"front_end/panels/ai_chat/tracing/TracingConfig.js",
16 changes: 16 additions & 0 deletions eval-server/.env.example
@@ -0,0 +1,16 @@
# WebSocket Server Configuration
PORT=8080
HOST=localhost

# LLM Judge Configuration
OPENAI_API_KEY=your-openai-api-key-here
JUDGE_MODEL=gpt-4
JUDGE_TEMPERATURE=0.1

# Logging Configuration
LOG_LEVEL=info
LOG_DIR=./logs

# RPC Configuration
RPC_TIMEOUT=30000
MAX_CONCURRENT_EVALUATIONS=10
2 changes: 2 additions & 0 deletions eval-server/.gitignore
@@ -0,0 +1,2 @@
.env
node_modules
103 changes: 103 additions & 0 deletions eval-server/CLAUDE.md
@@ -0,0 +1,103 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

bo-eval-server is a WebSocket-based evaluation server for LLM agents that implements an LLM-as-a-judge evaluation system. The server accepts connections from AI agents, sends them evaluation tasks via RPC calls, collects their responses, and uses an LLM to judge the quality of responses.

## Commands

### Development
- `npm start` - Start the WebSocket server
- `npm run dev` - Start server with file watching for development
- `npm run cli` - Start interactive CLI for server management and testing
- `npm test` - Run example agent client for testing

### Installation
- `npm install` - Install dependencies
- Copy `.env.example` to `.env` and configure environment variables

### Required Environment Variables
- `OPENAI_API_KEY` - OpenAI API key for LLM judge functionality
- `PORT` - WebSocket server port (default: 8080)

## Architecture

### Core Components

**WebSocket Server** (`src/server.js`)
- Accepts connections from LLM agents
- Manages agent lifecycle (connect, ready, disconnect)
- Orchestrates evaluation sessions
- Handles bidirectional RPC communication
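
A minimal sketch of what this might look like using the `ws` package (names such as `agents` are illustrative and not the actual `src/server.js` implementation):

```javascript
// Sketch: accept agent connections and track their lifecycle.
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: Number(process.env.PORT) || 8080 });
const agents = new Map(); // socket -> agent state

wss.on('connection', (ws) => {
  agents.set(ws, { ready: false });

  ws.on('message', (data) => {
    const message = JSON.parse(data.toString());
    if (message.type === 'ready') {
      agents.get(ws).ready = true; // agent is now eligible to receive evaluation tasks
    }
    // JSON-RPC responses are routed to the RPC client (see the next sketch)
  });

  ws.on('close', () => agents.delete(ws));
});
```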

**RPC Client** (`src/rpc-client.js`)
- Implements JSON-RPC 2.0 protocol for server-to-client calls
- Manages request/response correlation with unique IDs
- Handles timeouts and error conditions
- Calls `Evaluate(request: String) -> String` method on connected agents
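
The ID correlation and timeout handling could be sketched roughly as follows (illustrative; the `{ task }` params shape is an assumption, not the documented wire format):

```javascript
// Sketch: server-to-client JSON-RPC 2.0 calls with ID correlation and timeouts.
import { randomUUID } from 'node:crypto';

const pending = new Map(); // request id -> { resolve, reject, timer }

export function callEvaluate(ws, task, timeoutMs = 30000) {
  const id = randomUUID();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      pending.delete(id);
      reject(new Error(`RPC ${id} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    pending.set(id, { resolve, reject, timer });
    ws.send(JSON.stringify({ jsonrpc: '2.0', id, method: 'Evaluate', params: { task } }));
  });
}

// Call this for each incoming message that carries a JSON-RPC response.
export function handleResponse(message) {
  const entry = pending.get(message.id);
  if (!entry) return;
  clearTimeout(entry.timer);
  pending.delete(message.id);
  if (message.error) entry.reject(new Error(message.error.message));
  else entry.resolve(message.result);
}
```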

**LLM Evaluator** (`src/evaluator.js`)
- Integrates with OpenAI API for LLM-as-a-judge functionality
- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness)
- Returns structured JSON evaluation with scores and reasoning
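
A rough sketch of the judging call, assuming the official `openai` Node SDK; the prompt wording and response schema shown are illustrative, not the actual `src/evaluator.js`:

```javascript
// Sketch: LLM-as-a-judge call scoring a response on the listed criteria.
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function judge(task, agentResponse) {
  const completion = await openai.chat.completions.create({
    model: process.env.JUDGE_MODEL || 'gpt-4',
    temperature: Number(process.env.JUDGE_TEMPERATURE || 0.1),
    messages: [
      {
        role: 'system',
        content:
          'You are an evaluation judge. Score the response on correctness, completeness, ' +
          'clarity, relevance, and helpfulness (1-10 each) and explain your reasoning. ' +
          'Reply with JSON only.',
      },
      { role: 'user', content: `Task:\n${task}\n\nResponse:\n${agentResponse}` },
    ],
  });
  // Expected shape (assumed): { scores: { correctness, ... }, reasoning: "..." }
  return JSON.parse(completion.choices[0].message.content);
}
```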

**Logger** (`src/logger.js`)
- Structured logging using Winston
- Separate log files for different event types
- JSON format for easy parsing and analysis
- Logs all RPC calls, evaluations, and connection events
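
The Winston setup might look roughly like this (illustrative; file names follow the `logs/` layout described under Project Structure below, with `LOG_DIR` taken from `.env`):

```javascript
// Sketch: structured JSON logging to the files listed in the project layout.
import winston from 'winston';

const logDir = process.env.LOG_DIR || './logs';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.File({ filename: `${logDir}/error.log`, level: 'error' }),
    new winston.transports.File({ filename: `${logDir}/combined.log` }),
  ],
});

// Usage example: logger.info('rpc_call', { agentId, method: 'Evaluate', durationMs });
```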

### Evaluation Flow

1. Agent connects to WebSocket server
2. Agent sends "ready" signal
3. Server calls agent's `Evaluate` method with a task
4. Agent processes task and returns response
5. Server sends response to LLM judge for evaluation
6. Results are logged as JSON with scores and detailed feedback
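
Tying the sketches above together, the orchestration of steps 3 through 6 could look like this (illustrative only; reuses the hypothetical `callEvaluate`, `judge`, and `logger` names from the earlier sketches):

```javascript
// Sketch: run one evaluation end to end for a connected, ready agent.
async function runEvaluation(agentWs, task) {
  const response = await callEvaluate(agentWs, task);               // steps 3-4: RPC to the agent
  const verdict = await judge(task, response);                      // step 5: LLM-as-a-judge
  logger.info('evaluation_completed', { task, response, verdict }); // step 6: structured log entry
  return verdict;
}
```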

### Project Structure

```
src/
├── server.js # Main WebSocket server and evaluation orchestration
├── rpc-client.js # JSON-RPC client for calling agent methods
├── evaluator.js # LLM judge integration (OpenAI)
├── logger.js # Structured logging and result storage
├── config.js # Configuration management
└── cli.js # Interactive CLI for testing and management

logs/ # Log files (created automatically)
├── combined.log # All log events
├── error.log # Error events only
└── evaluations.jsonl # Evaluation results in JSON Lines format
```

### Key Features

- **Bidirectional RPC**: Server can call methods on connected clients
- **LLM-as-a-Judge**: Automated evaluation of agent responses using GPT-4
- **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
- **Structured Logging**: All interactions logged as JSON for analysis
- **Interactive CLI**: Built-in CLI for testing and server management
- **Connection Management**: Robust handling of agent connections and disconnections
- **Timeout Handling**: Configurable timeouts for RPC calls and evaluations

### Agent Protocol

Agents must implement:
- WebSocket connection to server
- JSON-RPC 2.0 protocol support
- `Evaluate(task: string) -> string` method
- "ready" message to signal availability for evaluations

### Configuration

All configuration is managed through environment variables and `src/config.js`. Key settings:
- Server port and host
- OpenAI API configuration
- RPC timeouts
- Logging levels and directories
- Maximum concurrent evaluations
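
A sketch of what such a config module could look like, derived from the keys in `.env.example` (assumes `dotenv`; the real `src/config.js` may organize this differently):

```javascript
// Sketch: centralized config built from environment variables.
import 'dotenv/config';

export const config = {
  server: {
    port: Number(process.env.PORT) || 8080,
    host: process.env.HOST || 'localhost',
  },
  judge: {
    apiKey: process.env.OPENAI_API_KEY,
    model: process.env.JUDGE_MODEL || 'gpt-4',
    temperature: Number(process.env.JUDGE_TEMPERATURE || 0.1),
  },
  logging: {
    level: process.env.LOG_LEVEL || 'info',
    dir: process.env.LOG_DIR || './logs',
  },
  rpc: {
    timeoutMs: Number(process.env.RPC_TIMEOUT) || 30000,
    maxConcurrentEvaluations: Number(process.env.MAX_CONCURRENT_EVALUATIONS) || 10,
  },
};
```
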
47 changes: 47 additions & 0 deletions eval-server/README.md
@@ -0,0 +1,47 @@
# bo-eval-server

A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodology.

## Quick Start

1. **Install dependencies**
```bash
npm install
```

2. **Configure environment**
```bash
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

3. **Start the server**
```bash
npm start
```

4. **Use interactive CLI** (alternative to step 3)
```bash
npm run cli
```

## Features

- 🔌 WebSocket server for real-time agent connections
- 🤖 Bidirectional RPC calls to connected agents
- ⚖️ LLM-as-a-judge evaluation using OpenAI GPT-4
- 📊 Structured JSON logging of all evaluations
- 🖥️ Interactive CLI for testing and management
- ⚡ Support for concurrent agent evaluations

## Agent Protocol

Your agent needs to:

1. Connect to the WebSocket server (default: `ws://localhost:8080`)
2. Send a `{"type": "ready"}` message when ready for evaluations
3. Implement the `Evaluate` RPC method that accepts a string task and returns a string response
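
For example, the messages on the wire might look like this (shapes are illustrative; the exact JSON-RPC params naming may differ):

```javascript
// Illustrative message shapes, written as JS object literals.
const ready    = { type: 'ready' };                                          // agent -> server
const request  = { jsonrpc: '2.0', id: '1', method: 'Evaluate',
                   params: { task: 'Summarize this page' } };                // server -> agent
const response = { jsonrpc: '2.0', id: '1', result: 'The page covers...' };  // agent -> server
```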

## For more details

See [CLAUDE.md](./CLAUDE.md) for comprehensive documentation of the architecture and implementation.
12 changes: 12 additions & 0 deletions eval-server/clients/1233ae25-9f9e-4f77-924d-865f7d615cef.yaml
@@ -0,0 +1,12 @@
client:
  id: 1233ae25-9f9e-4f77-924d-865f7d615cef
  name: DevTools Client 1233ae25
  secret_key: hello
  description: Auto-generated DevTools evaluation client
settings:
  max_concurrent_evaluations: 3
  default_timeout: 45000
  retry_policy:
    max_retries: 2
    backoff_multiplier: 2
    initial_delay: 1000