Commit baace12

Authored by maciejmajek, jmatejcz, Juliaj, MagdalenaKotynia, and pawel-kotowski
chore: sync development -> main (#691)
feat: tool calling benchmark unified across types and prompts variety… (#620)
feat: basic tasks extension (#644)
feat: tool calling custom interfaces tasks extension (#636)
feat: tool calling spatial reasoning tasks extension (#637)
refactor: remove navigation tasks (#638)
refactor: o3de config (#630)
refactor(nav2_toolkit): remove unused action_client (#670)
fix: manipulaiton bench fixes (#653)
docs: rai simbench docs update (#665)
feat: planning task and megamind agent (#679)
feat: megamind context providers (#687)
feat: tool calling bench - manipulation tasks extenstion (#656)
chore: resolving conflicts (#690)

Co-authored-by: Jakub Matejczyk <58983084+jmatejcz@users.noreply.github.com>
Co-authored-by: Julia Jia <juliajster@gmail.com>
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
Co-authored-by: Pawel Kotowski <pawel.kotowski@oleahealth.ai>
Co-authored-by: Brian Tuan <btuan@users.noreply.github.com>
Co-authored-by: jmatejcz <jakub.matejczyk@robotec.ai>
1 parent cefdffd commit baace12

68 files changed: +12281 / −3732 lines


docs/simulation_and_benchmarking/rai_bench.md

Lines changed: 53 additions & 8 deletions
@@ -6,6 +6,7 @@ RAI Bench is a comprehensive package that both provides benchmarks with ready-to
- [Manipulation O3DE Benchmark](#manipulation-o3de-benchmark)
- [Tool Calling Agent Benchmark](#tool-calling-agent-benchmark)
- [VLM Benchmark](#vlm-benchmark)

## Manipulation O3DE Benchmark

@@ -73,16 +74,14 @@ score = (correctly_placed_now - correctly_placed_initially) / initially_incorrec
You can find predefined scene configs in `rai_bench/manipulation_o3de/predefined/configs/`.

Predefined scenarios can be imported, for example, choosing tasks by difficulty:

```python
from rai_bench.manipulation_o3de import get_scenarios

get_scenarios(levels=["easy", "medium"])
```

## Tool Calling Agent Benchmark

Evaluates agent performance independently from any simulation, based only on the tool calls that the agent makes. To make it independent from simulations, this benchmark introduces tool mocks which can be adjusted for different tasks. This makes the benchmark more universal and a lot faster.
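
Conceptually, a tool mock is just a tool that returns canned data instead of touching a live ROS 2 system. A minimal sketch of the idea (not the rai_bench mock implementation; the tool name and payload below are assumptions for illustration):

```python
# Conceptual sketch of a mocked tool: it returns canned data so the benchmark
# can score tool calls quickly, without a running ROS 2 system.
# Not the rai_bench implementation; names and payloads are assumptions.
from langchain_core.tools import tool


@tool
def get_ros2_topics_names_and_types() -> list:
    """List available ROS 2 topics and their message types (mocked response)."""
    # Canned data stands in for a live ROS 2 system.
    return [
        {"name": "/camera/color/image_raw", "type": "sensor_msgs/msg/Image"},
        {"name": "/cmd_vel", "type": "geometry_msgs/msg/Twist"},
    ]
```
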
@@ -106,30 +105,76 @@ The `Validator` class can combine single or multiple subtasks to create a single
- OrderedCallsValidator - requires a strict order of subtasks. The next subtask will be validated only when the previous one was completed. Validator passes when all subtasks pass.
- NotOrderedCallsValidator - doesn't enforce order of subtasks. Every subtask will be validated against every tool call. Validator passes when all subtasks pass.
- OneFromManyValidator - passes when any one of the given subtasks passes.

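To make the ordered vs. not-ordered distinction concrete, here is a small self-contained sketch of the mechanism (not the rai_bench API; subtasks are reduced to simple predicates over tool calls):

```python
# Self-contained sketch of the validation mechanism described above.
# Not the rai_bench implementation: subtasks are simplified to predicates over tool calls.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


def expects_tool(name: str) -> Callable[[ToolCall], bool]:
    """A minimal 'subtask': passes for a call to the given tool."""
    return lambda call: call.name == name


def ordered_validator(subtasks: list, calls: list) -> bool:
    """All subtasks must pass, in order (cf. OrderedCallsValidator)."""
    idx = 0
    for call in calls:
        if idx < len(subtasks) and subtasks[idx](call):
            idx += 1
    return idx == len(subtasks)


def not_ordered_validator(subtasks: list, calls: list) -> bool:
    """Every subtask must pass against some call, in any order (cf. NotOrderedCallsValidator)."""
    return all(any(subtask(call) for call in calls) for subtask in subtasks)


calls = [ToolCall("get_ros2_topics_names_and_types", {}), ToolCall("get_ros2_image", {"topic": "/camera"})]
print(ordered_validator([expects_tool("get_ros2_image")], calls))           # True
print(not_ordered_validator([expects_tool("publish_ros2_message")], calls))  # False
```
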
### Task

A Task represents a specific prompt and a set of available tools. A list of validators is assigned to validate the performance.

??? info "Task class definition"
::: rai_bench.tool_calling_agent.interfaces.Task
As you can see, the framework is very flexible. Any SubTask can be combined into any Validator that can be later assigned to any Task.

Every Task needs to define its prompt and system prompt, which tools the agent will have available, how many tool calls are required to complete it, and how many optional tool calls are possible.

Optional tool calls mean that a certain tool call is not obligatory to pass the Task, but shouldn't be counted as an error. For example, `GetROS2RGBCameraTask`, whose prompt is `Get RGB camera image.`, requires one call to the `get_ros2_image` tool; but listing topics before doing so is a valid approach, so in this case the number of optional tool calls is `1`.

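The sketch below puts those pieces side by side. It is a self-contained illustration of what a Task bundles, not the rai_bench Task API, and the tool names are assumptions:

```python
# Self-contained illustration of the pieces every Task defines (prompt, system
# prompt, available tools, required/optional call counts) - not the rai_bench Task API.
from dataclasses import dataclass, field


@dataclass
class TaskSketch:
    prompt: str
    system_prompt: str
    available_tools: list = field(default_factory=list)
    required_calls: int = 1
    optional_calls: int = 0


rgb_task = TaskSketch(
    prompt="Get RGB camera image.",
    system_prompt="You are a ROS 2 expert that want to solve tasks. ...",
    # Tool names are assumptions for illustration.
    available_tools=["get_ros2_image", "get_ros2_topics_names_and_types"],
    required_calls=1,
    optional_calls=1,  # listing topics first is allowed but not required
)
```
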
### ToolCallingAgentBenchmark
The ToolCallingAgentBenchmark class manages the execution of tasks and collects results.
### Available Tasks

There are predefined Tasks available, grouped by categories:

- Basic - require retrieving info from certain topics
- Manipulation
- Custom Interfaces - require using messages with custom interfaces

Every Task has an assigned `complexity` which reflects its difficulty.

When creating a Task, you can define a few params:

```python
from typing import Literal

from pydantic import BaseModel


class TaskArgs(BaseModel):
    """Holds the configurations specified by user"""

    extra_tool_calls: int = 0
    prompt_detail: Literal["brief", "descriptive"] = "brief"
    examples_in_system_prompt: Literal[0, 2, 5] = 0
```

- examples_in_system_prompt - how many example tool calls are included in the system prompt, for example:

    - `0`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions.`
    - `2`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions. Example of tool calls: get_ros2_message_interface, args: {'msg_type': 'geometry_msgs/msg/Twist'} publish_ros2_message, args: {'topic': '/cmd_vel', 'message_type': 'geometry_msgs/msg/Twist', 'message': {linear: {x: 0.5, y: 0.0, z: 0.0}, angular: {x: 0.0, y: 0.0, z: 1.0}}}`

- prompt_detail - how descriptive the Task prompt should be, for example:

    - `brief`: "Get all camera images"
    - `descriptive`: "Get all camera images from all available camera sources in the system. This includes both RGB color images and depth images. You can discover what camera topics are available and capture images from each."

    Descriptive prompts provide guidance and tips.

- extra_tool_calls - how many extra tool calls an agent can make and still pass the Task, for example:

    - `GetROS2RGBCameraTask` has 1 required tool call and 1 optional one. With `extra_tool_calls` set to 5, the agent can correct itself a couple of times and still pass, even with 7 tool calls. There are 2 types of invalid tool calls: first, when the tool is used incorrectly and the agent receives an error, which makes it easier for the agent to correct itself; second, when the tool is called properly but it is not the tool that should be called, or it is called with wrong params. In this case the agent won't get any error, so it will be harder to correct, but BOTH of these cases count as `extra tool calls`.
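
For instance, a configuration that gives the agent more room to recover and uses richer prompts might look like this (a usage sketch assuming `TaskArgs` as defined above is importable from rai_bench's tool calling benchmark; the exact module path is not shown here):

```python
# Assuming TaskArgs as defined above is importable from rai_bench's
# tool calling benchmark (exact module path not shown in this doc).
args = TaskArgs(
    extra_tool_calls=5,           # allow up to 5 corrective calls on top of required + optional
    prompt_detail="descriptive",  # use the longer, more descriptive task prompts
    examples_in_system_prompt=2,  # include 2 example tool calls in the system prompt
)
# For GetROS2RGBCameraTask (1 required + 1 optional call), this permits up to 7 calls in total.
```
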
If you want to know details about every task, visit `rai_bench/tool_calling_agent/tasks`

## VLM Benchmark

The VLM Benchmark is a benchmark for vision language models (VLMs). It includes a set of tasks containing questions about images and evaluates the performance of the agent, which returns the answer in a structured format.

### Running

To run the benchmark:

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama
```
