Feat/external eval server #23

Merged: 6 commits, Jul 18, 2025

4 changes: 4 additions & 0 deletions config/gni/devtools_grd_files.gni
@@ -645,6 +645,10 @@ grd_files_bundled_sources = [
"front_end/panels/ai_chat/common/log.js",
"front_end/panels/ai_chat/common/context.js",
"front_end/panels/ai_chat/common/page.js",
"front_end/panels/ai_chat/common/WebSocketRPCClient.js",
"front_end/panels/ai_chat/common/EvaluationConfig.js",
"front_end/panels/ai_chat/evaluation/EvaluationProtocol.js",
"front_end/panels/ai_chat/evaluation/EvaluationAgent.js",
"front_end/panels/ai_chat/tracing/TracingProvider.js",
"front_end/panels/ai_chat/tracing/LangfuseProvider.js",
"front_end/panels/ai_chat/tracing/TracingConfig.js",
16 changes: 16 additions & 0 deletions eval-server/.env.example
@@ -0,0 +1,16 @@
# WebSocket Server Configuration
PORT=8080
HOST=localhost

# LLM Judge Configuration
OPENAI_API_KEY=your-openai-api-key-here
JUDGE_MODEL=gpt-4
JUDGE_TEMPERATURE=0.1

# Logging Configuration
LOG_LEVEL=info
LOG_DIR=./logs

# RPC Configuration
RPC_TIMEOUT=30000
MAX_CONCURRENT_EVALUATIONS=10
2 changes: 2 additions & 0 deletions eval-server/.gitignore
@@ -0,0 +1,2 @@
.env
node_modules
103 changes: 103 additions & 0 deletions eval-server/CLAUDE.md
@@ -0,0 +1,103 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

bo-eval-server is a WebSocket-based evaluation server for LLM agents that implements an LLM-as-a-judge evaluation system. The server accepts connections from AI agents, sends them evaluation tasks via RPC calls, collects their responses, and uses an LLM to judge the quality of responses.

## Commands

### Development
- `npm start` - Start the WebSocket server
- `npm run dev` - Start server with file watching for development
- `npm run cli` - Start interactive CLI for server management and testing
- `npm test` - Run example agent client for testing

### Installation
- `npm install` - Install dependencies
- Copy `.env.example` to `.env` and configure environment variables

### Required Environment Variables
- `OPENAI_API_KEY` - OpenAI API key for LLM judge functionality
- `PORT` - WebSocket server port (default: 8080)

## Architecture

### Core Components

**WebSocket Server** (`src/server.js`)
- Accepts connections from LLM agents
- Manages agent lifecycle (connect, ready, disconnect)
- Orchestrates evaluation sessions
- Handles bidirectional RPC communication
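
A minimal sketch of what this might look like using the `ws` package (names such as `agents` are illustrative and not the actual `src/server.js` implementation):

```javascript
// Sketch: accept agent connections and track their lifecycle.
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: Number(process.env.PORT) || 8080 });
const agents = new Map(); // socket -> agent state

wss.on('connection', (ws) => {
  agents.set(ws, { ready: false });

  ws.on('message', (data) => {
    const message = JSON.parse(data.toString());
    if (message.type === 'ready') {
      agents.get(ws).ready = true; // agent is now eligible to receive evaluation tasks
    }
    // JSON-RPC responses are routed to the RPC client (see the next sketch)
  });

  ws.on('close', () => agents.delete(ws));
});
```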

**RPC Client** (`src/rpc-client.js`)
- Implements JSON-RPC 2.0 protocol for server-to-client calls
- Manages request/response correlation with unique IDs
- Handles timeouts and error conditions
- Calls `Evaluate(request: String) -> String` method on connected agents
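
The ID correlation and timeout handling could be sketched roughly as follows (illustrative; the `{ task }` params shape is an assumption, not the documented wire format):

```javascript
// Sketch: server-to-client JSON-RPC 2.0 calls with ID correlation and timeouts.
import { randomUUID } from 'node:crypto';

const pending = new Map(); // request id -> { resolve, reject, timer }

export function callEvaluate(ws, task, timeoutMs = 30000) {
  const id = randomUUID();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      pending.delete(id);
      reject(new Error(`RPC ${id} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    pending.set(id, { resolve, reject, timer });
    ws.send(JSON.stringify({ jsonrpc: '2.0', id, method: 'Evaluate', params: { task } }));
  });
}

// Call this for each incoming message that carries a JSON-RPC response.
export function handleResponse(message) {
  const entry = pending.get(message.id);
  if (!entry) return;
  clearTimeout(entry.timer);
  pending.delete(message.id);
  if (message.error) entry.reject(new Error(message.error.message));
  else entry.resolve(message.result);
}
```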

**LLM Evaluator** (`src/evaluator.js`)
- Integrates with OpenAI API for LLM-as-a-judge functionality
- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness)
- Returns structured JSON evaluation with scores and reasoning
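
A rough sketch of the judging call, assuming the official `openai` Node SDK; the prompt wording and response schema shown are illustrative, not the actual `src/evaluator.js`:

```javascript
// Sketch: LLM-as-a-judge call scoring a response on the listed criteria.
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function judge(task, agentResponse) {
  const completion = await openai.chat.completions.create({
    model: process.env.JUDGE_MODEL || 'gpt-4',
    temperature: Number(process.env.JUDGE_TEMPERATURE || 0.1),
    messages: [
      {
        role: 'system',
        content:
          'You are an evaluation judge. Score the response on correctness, completeness, ' +
          'clarity, relevance, and helpfulness (1-10 each) and explain your reasoning. ' +
          'Reply with JSON only.',
      },
      { role: 'user', content: `Task:\n${task}\n\nResponse:\n${agentResponse}` },
    ],
  });
  // Expected shape (assumed): { scores: { correctness, ... }, reasoning: "..." }
  return JSON.parse(completion.choices[0].message.content);
}
```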

**Logger** (`src/logger.js`)
- Structured logging using Winston
- Separate log files for different event types
- JSON format for easy parsing and analysis
- Logs all RPC calls, evaluations, and connection events
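
The Winston setup might look roughly like this (illustrative; file names follow the `logs/` layout described under Project Structure below, with `LOG_DIR` taken from `.env`):

```javascript
// Sketch: structured JSON logging to the files listed in the project layout.
import winston from 'winston';

const logDir = process.env.LOG_DIR || './logs';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.File({ filename: `${logDir}/error.log`, level: 'error' }),
    new winston.transports.File({ filename: `${logDir}/combined.log` }),
  ],
});

// Usage example: logger.info('rpc_call', { agentId, method: 'Evaluate', durationMs });
```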

### Evaluation Flow

1. Agent connects to WebSocket server
2. Agent sends "ready" signal
3. Server calls agent's `Evaluate` method with a task
4. Agent processes task and returns response
5. Server sends response to LLM judge for evaluation
6. Results are logged as JSON with scores and detailed feedback
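
Tying the sketches above together, the orchestration of steps 3 through 6 could look like this (illustrative only; reuses the hypothetical `callEvaluate`, `judge`, and `logger` names from the earlier sketches):

```javascript
// Sketch: run one evaluation end to end for a connected, ready agent.
async function runEvaluation(agentWs, task) {
  const response = await callEvaluate(agentWs, task);               // steps 3-4: RPC to the agent
  const verdict = await judge(task, response);                      // step 5: LLM-as-a-judge
  logger.info('evaluation_completed', { task, response, verdict }); // step 6: structured log entry
  return verdict;
}
```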

### Project Structure

```
src/
├── server.js # Main WebSocket server and evaluation orchestration
├── rpc-client.js # JSON-RPC client for calling agent methods
├── evaluator.js # LLM judge integration (OpenAI)
├── logger.js # Structured logging and result storage
├── config.js # Configuration management
└── cli.js # Interactive CLI for testing and management

logs/ # Log files (created automatically)
├── combined.log # All log events
├── error.log # Error events only
└── evaluations.jsonl # Evaluation results in JSON Lines format
```

### Key Features

- **Bidirectional RPC**: Server can call methods on connected clients
- **LLM-as-a-Judge**: Automated evaluation of agent responses using GPT-4
- **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
- **Structured Logging**: All interactions logged as JSON for analysis
- **Interactive CLI**: Built-in CLI for testing and server management
- **Connection Management**: Robust handling of agent connections and disconnections
- **Timeout Handling**: Configurable timeouts for RPC calls and evaluations

### Agent Protocol

Agents must implement:
- WebSocket connection to server
- JSON-RPC 2.0 protocol support
- `Evaluate(task: string) -> string` method
- "ready" message to signal availability for evaluations

### Configuration

All configuration is managed through environment variables and `src/config.js`. Key settings:
- Server port and host
- OpenAI API configuration
- RPC timeouts
- Logging levels and directories
- Maximum concurrent evaluations
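
A sketch of what such a config module could look like, derived from the keys in `.env.example` (assumes `dotenv`; the real `src/config.js` may organize this differently):

```javascript
// Sketch: centralized config built from environment variables.
import 'dotenv/config';

export const config = {
  server: {
    port: Number(process.env.PORT) || 8080,
    host: process.env.HOST || 'localhost',
  },
  judge: {
    apiKey: process.env.OPENAI_API_KEY,
    model: process.env.JUDGE_MODEL || 'gpt-4',
    temperature: Number(process.env.JUDGE_TEMPERATURE || 0.1),
  },
  logging: {
    level: process.env.LOG_LEVEL || 'info',
    dir: process.env.LOG_DIR || './logs',
  },
  rpc: {
    timeoutMs: Number(process.env.RPC_TIMEOUT) || 30000,
    maxConcurrentEvaluations: Number(process.env.MAX_CONCURRENT_EVALUATIONS) || 10,
  },
};
```
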
47 changes: 47 additions & 0 deletions eval-server/README.md
@@ -0,0 +1,47 @@
# bo-eval-server

A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodology.

## Quick Start

1. **Install dependencies**
```bash
npm install
```

2. **Configure environment**
```bash
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

3. **Start the server**
```bash
npm start
```

4. **Use interactive CLI** (alternative to step 3)
```bash
npm run cli
```

## Features

- 🔌 WebSocket server for real-time agent connections
- 🤖 Bidirectional RPC calls to connected agents
- ⚖️ LLM-as-a-judge evaluation using OpenAI GPT-4
- 📊 Structured JSON logging of all evaluations
- 🖥️ Interactive CLI for testing and management
- ⚡ Support for concurrent agent evaluations

## Agent Protocol

Your agent needs to:

1. Connect to the WebSocket server (default: `ws://localhost:8080`)
2. Send a `{"type": "ready"}` message when ready for evaluations
3. Implement the `Evaluate` RPC method that accepts a string task and returns a string response
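
For example, the messages on the wire might look like this (shapes are illustrative; the exact JSON-RPC params naming may differ):

```javascript
// Illustrative message shapes, written as JS object literals.
const ready    = { type: 'ready' };                                          // agent -> server
const request  = { jsonrpc: '2.0', id: '1', method: 'Evaluate',
                   params: { task: 'Summarize this page' } };                // server -> agent
const response = { jsonrpc: '2.0', id: '1', result: 'The page covers...' };  // agent -> server
```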

## For more details

See [CLAUDE.md](./CLAUDE.md) for comprehensive documentation of the architecture and implementation.
12 changes: 12 additions & 0 deletions eval-server/clients/1233ae25-9f9e-4f77-924d-865f7d615cef.yaml
@@ -0,0 +1,12 @@
client:
  id: 1233ae25-9f9e-4f77-924d-865f7d615cef
  name: DevTools Client 1233ae25
  secret_key: hello
  description: Auto-generated DevTools evaluation client
settings:
  max_concurrent_evaluations: 3
  default_timeout: 45000
  retry_policy:
    max_retries: 2
    backoff_multiplier: 2
    initial_delay: 1000