Commit 543b097

Feat/external eval server (#23)

1 parent f109d68

118 files changed, +11673 −22 lines


config/gni/devtools_grd_files.gni

Lines changed: 4 additions & 0 deletions

```diff
@@ -645,6 +645,10 @@ grd_files_bundled_sources = [
   "front_end/panels/ai_chat/common/log.js",
   "front_end/panels/ai_chat/common/context.js",
   "front_end/panels/ai_chat/common/page.js",
+  "front_end/panels/ai_chat/common/WebSocketRPCClient.js",
+  "front_end/panels/ai_chat/common/EvaluationConfig.js",
+  "front_end/panels/ai_chat/evaluation/EvaluationProtocol.js",
+  "front_end/panels/ai_chat/evaluation/EvaluationAgent.js",
   "front_end/panels/ai_chat/tracing/TracingProvider.js",
   "front_end/panels/ai_chat/tracing/LangfuseProvider.js",
   "front_end/panels/ai_chat/tracing/TracingConfig.js",
```

eval-server/.env.example

Lines changed: 16 additions & 0 deletions

```diff
@@ -0,0 +1,16 @@
+# WebSocket Server Configuration
+PORT=8080
+HOST=localhost
+
+# LLM Judge Configuration
+OPENAI_API_KEY=your-openai-api-key-here
+JUDGE_MODEL=gpt-4
+JUDGE_TEMPERATURE=0.1
+
+# Logging Configuration
+LOG_LEVEL=info
+LOG_DIR=./logs
+
+# RPC Configuration
+RPC_TIMEOUT=30000
+MAX_CONCURRENT_EVALUATIONS=10
```

eval-server/.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -0,0 +1,2 @@
+.env
+node_modules
```

eval-server/CLAUDE.md

Lines changed: 103 additions & 0 deletions

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

bo-eval-server is a WebSocket-based evaluation server for LLM agents that implements an LLM-as-a-judge evaluation system. The server accepts connections from AI agents, sends them evaluation tasks via RPC calls, collects their responses, and uses an LLM to judge the quality of each response.
## Commands

### Development

- `npm start` - Start the WebSocket server
- `npm run dev` - Start the server with file watching for development
- `npm run cli` - Start the interactive CLI for server management and testing
- `npm test` - Run the example agent client for testing

### Installation

- `npm install` - Install dependencies
- Copy `.env.example` to `.env` and configure the environment variables

### Required Environment Variables

- `OPENAI_API_KEY` - OpenAI API key for the LLM judge
- `PORT` - WebSocket server port (default: 8080)
## Architecture

### Core Components

**WebSocket Server** (`src/server.js`)
- Accepts connections from LLM agents
- Manages the agent lifecycle (connect, ready, disconnect)
- Orchestrates evaluation sessions
- Handles bidirectional RPC communication

**RPC Client** (`src/rpc-client.js`)
- Implements the JSON-RPC 2.0 protocol for server-to-client calls
- Manages request/response correlation with unique IDs
- Handles timeouts and error conditions
- Calls the `Evaluate(request: string) -> string` method on connected agents
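Request/response correlation is the subtle part of this component: JSON-RPC over a WebSocket is fully asynchronous, so each outgoing request carries a unique `id` that lets the matching response settle the right pending call. A minimal sketch of that pattern follows; it is illustrative only, since the internals of `src/rpc-client.js` are not shown in this excerpt.

```javascript
// Sketch of JSON-RPC 2.0 request/response correlation over a WebSocket.
// Illustrative only -- the real src/rpc-client.js may differ.
const { randomUUID } = require('crypto');

class RpcClient {
  constructor(ws, timeoutMs = 30000) {
    this.ws = ws;
    this.timeoutMs = timeoutMs;
    this.pending = new Map(); // id -> { resolve, reject, timer }
    ws.on('message', data => this.handleMessage(JSON.parse(data)));
  }

  // Send a request and return a promise that settles when the response
  // carrying the same id arrives, or the timeout fires first.
  call(method, params) {
    const id = randomUUID();
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`RPC ${method} timed out after ${this.timeoutMs}ms`));
      }, this.timeoutMs);
      this.pending.set(id, { resolve, reject, timer });
      this.ws.send(JSON.stringify({ jsonrpc: '2.0', id, method, params }));
    });
  }

  handleMessage(msg) {
    const entry = this.pending.get(msg.id);
    if (!entry) return; // not a response to one of our requests
    clearTimeout(entry.timer);
    this.pending.delete(msg.id);
    if (msg.error) {
      entry.reject(new Error(msg.error.message));
    } else {
      entry.resolve(msg.result);
    }
  }
}
```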
**LLM Evaluator** (`src/evaluator.js`)
- Integrates with the OpenAI API for LLM-as-a-judge functionality
- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness)
- Returns a structured JSON evaluation with scores and reasoning
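The commit excerpt does not show the judge's response schema; given the criteria listed above, a plausible shape for the structured verdict might be:

```json
{
  "scores": {
    "correctness": 4,
    "completeness": 5,
    "clarity": 4,
    "relevance": 5,
    "helpfulness": 4
  },
  "overall": 4.4,
  "reasoning": "Accurate and covers every part of the task; one example could be clearer."
}
```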
**Logger** (`src/logger.js`)
- Structured logging using Winston
- Separate log files for different event types
- JSON format for easy parsing and analysis
- Logs all RPC calls, evaluations, and connection events
### Evaluation Flow

1. Agent connects to the WebSocket server
2. Agent sends a "ready" signal
3. Server calls the agent's `Evaluate` method with a task
4. Agent processes the task and returns a response
5. Server sends the response to the LLM judge for evaluation
6. Results are logged as JSON with scores and detailed feedback
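On the wire, steps 2-4 might look like the exchange below. The JSON-RPC 2.0 envelope, the `ready` message, and the `Evaluate` method name come from this document; the task text, the id value, and wrapping the task in a `{"task": ...}` params object are assumptions for illustration.

```jsonc
// Step 2 -- agent signals readiness
{ "type": "ready" }

// Step 3 -- server calls the agent (JSON-RPC request)
{ "jsonrpc": "2.0", "id": "req-1", "method": "Evaluate",
  "params": { "task": "Summarize the purpose of this repository." } }

// Step 4 -- agent returns its answer (JSON-RPC response)
{ "jsonrpc": "2.0", "id": "req-1",
  "result": "bo-eval-server evaluates LLM agents over WebSocket..." }
```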
### Project Structure

```
src/
├── server.js         # Main WebSocket server and evaluation orchestration
├── rpc-client.js     # JSON-RPC client for calling agent methods
├── evaluator.js      # LLM judge integration (OpenAI)
├── logger.js         # Structured logging and result storage
├── config.js         # Configuration management
└── cli.js            # Interactive CLI for testing and management

logs/                 # Log files (created automatically)
├── combined.log      # All log events
├── error.log         # Error events only
└── evaluations.jsonl # Evaluation results in JSON Lines format
```
### Key Features

- **Bidirectional RPC**: Server can call methods on connected clients
- **LLM-as-a-Judge**: Automated evaluation of agent responses using GPT-4
- **Concurrent Evaluations**: Support for multiple agents and parallel evaluations
- **Structured Logging**: All interactions logged as JSON for analysis
- **Interactive CLI**: Built-in CLI for testing and server management
- **Connection Management**: Robust handling of agent connections and disconnections
- **Timeout Handling**: Configurable timeouts for RPC calls and evaluations
### Agent Protocol

Agents must implement:
- A WebSocket connection to the server
- JSON-RPC 2.0 protocol support
- An `Evaluate(task: string) -> string` method
- A "ready" message to signal availability for evaluations

A minimal client satisfying this contract is sketched below.
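The sketch assumes the `ws` npm package and uses echo logic as a stand-in for a real agent; the exact params shape is not pinned down by this document, so both a bare string and a `{ task }` object are handled.

```javascript
// Minimal evaluation agent: connect, signal readiness, answer Evaluate calls.
// Assumptions: the `ws` package, and the params shape noted above.
const WebSocket = require('ws');

const ws = new WebSocket('ws://localhost:8080');

ws.on('open', () => {
  // Signal availability for evaluations.
  ws.send(JSON.stringify({ type: 'ready' }));
});

ws.on('message', async data => {
  const msg = JSON.parse(data);
  if (msg.method !== 'Evaluate') return;
  const task = typeof msg.params === 'string' ? msg.params : msg.params?.task;
  const result = await answer(task);
  ws.send(JSON.stringify({ jsonrpc: '2.0', id: msg.id, result }));
});

// Placeholder: a real agent would call its model here.
async function answer(task) {
  return `Echo: ${task}`;
}
```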
### Configuration

All configuration is managed through environment variables and `src/config.js`. Key settings:
- Server port and host
- OpenAI API configuration
- RPC timeouts
- Logging levels and directories
- Maximum concurrent evaluations
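The actual `src/config.js` is not part of this excerpt; a typical shape for such a module, keyed to the variables in `.env.example`, might be:

```javascript
// Sketch of environment-driven configuration (illustrative).
require('dotenv').config();

module.exports = {
  server: {
    port: parseInt(process.env.PORT || '8080', 10),
    host: process.env.HOST || 'localhost',
  },
  judge: {
    apiKey: process.env.OPENAI_API_KEY,
    model: process.env.JUDGE_MODEL || 'gpt-4',
    temperature: parseFloat(process.env.JUDGE_TEMPERATURE || '0.1'),
  },
  logging: {
    level: process.env.LOG_LEVEL || 'info',
    dir: process.env.LOG_DIR || './logs',
  },
  rpc: {
    timeout: parseInt(process.env.RPC_TIMEOUT || '30000', 10),
    maxConcurrentEvaluations:
      parseInt(process.env.MAX_CONCURRENT_EVALUATIONS || '10', 10),
  },
};
```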

eval-server/README.md

Lines changed: 47 additions & 0 deletions

# bo-eval-server

A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodology.

## Quick Start

1. **Install dependencies**
   ```bash
   npm install
   ```

2. **Configure environment**
   ```bash
   cp .env.example .env
   # Edit .env and add your OPENAI_API_KEY
   ```

3. **Start the server**
   ```bash
   npm start
   ```

4. **Use the interactive CLI** (alternative to step 3)
   ```bash
   npm run cli
   ```

## Features

- 🔌 WebSocket server for real-time agent connections
- 🤖 Bidirectional RPC calls to connected agents
- ⚖️ LLM-as-a-judge evaluation using OpenAI GPT-4
- 📊 Structured JSON logging of all evaluations
- 🖥️ Interactive CLI for testing and management
- ⚡ Support for concurrent agent evaluations

## Agent Protocol

Your agent needs to:

1. Connect to the WebSocket server (default: `ws://localhost:8080`)
2. Send a `{"type": "ready"}` message when ready for evaluations
3. Implement the `Evaluate` RPC method that accepts a string task and returns a string response

## For more details

See [CLAUDE.md](./CLAUDE.md) for comprehensive documentation of the architecture and implementation.
Lines changed: 12 additions & 0 deletions

```diff
@@ -0,0 +1,12 @@
+client:
+  id: 1233ae25-9f9e-4f77-924d-865f7d615cef
+  name: DevTools Client 1233ae25
+  secret_key: hello
+  description: Auto-generated DevTools evaluation client
+  settings:
+    max_concurrent_evaluations: 3
+    default_timeout: 45000
+    retry_policy:
+      max_retries: 2
+      backoff_multiplier: 2
+      initial_delay: 1000
```
