You can find predefined scene configs in `rai_bench/manipulation_o3de/predefined/configs/`.

Predefined scenarios can be imported, for example, choosing tasks by difficulty, from trivial to very hard scenarios:

```python
from rai_bench.manipulation_o3de import get_scenarios

get_scenarios(levels=["easy", "medium"])
```

## Tool Calling Agent Benchmark

Evaluates agent performance independently of any simulation, based only on the tool calls the agent makes. To keep it simulation-independent, this benchmark introduces tool mocks which can be adjusted for different tasks. This makes the benchmark more universal and a lot faster.
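
As a rough illustration, a mocked tool can simply return canned data instead of touching a running ROS 2 system. The tool name and output below are illustrative assumptions, not the actual rai_bench mock implementation:

```python
from langchain_core.tools import tool


# Sketch of a mocked tool: it returns a canned response, so no
# simulator or ROS 2 stack has to be running during the benchmark.
@tool
def get_ros2_topics_names_and_types() -> str:
    """List available ROS 2 topics and their message types."""
    return (
        "/camera/image_raw [sensor_msgs/msg/Image]\n"
        "/cmd_vel [geometry_msgs/msg/Twist]"
    )
```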

The `Validator` class can combine single or multiple subtasks to create a single validation step. The available validators are listed below; a short sketch of composing them follows the list.

- OrderedCallsValidator - requires a strict order of subtasks. The next subtask will be validated only when the previous one has been completed. The validator passes when all subtasks pass.
- NotOrderedCallsValidator - doesn't enforce an order of subtasks. Every subtask will be validated against every tool call. The validator passes when all subtasks pass.
- OneFromManyValidator - passes when any one of the given subtasks passes.
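
A minimal sketch of composing subtasks into validators. The class names come from the list above, but the import path, the `subtasks` argument, and the placeholder subtasks are assumptions, not a verified rai_bench API:

```python
# Illustrative only: the module path and constructor signature are assumptions.
from rai_bench.tool_calling_agent.validators import (
    NotOrderedCallsValidator,
    OrderedCallsValidator,
)

# Placeholder subtasks; in real code these would be SubTask instances,
# e.g. checking that a specific tool was called with specific arguments.
list_topics_subtask = ...
get_image_subtask = ...

# Passes only if topics are listed before the image is grabbed.
ordered = OrderedCallsValidator(subtasks=[list_topics_subtask, get_image_subtask])

# Passes once both subtasks match some tool call, in any order.
unordered = NotOrderedCallsValidator(subtasks=[list_topics_subtask, get_image_subtask])
```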

### Task

A Task represents a specific prompt and a set of tools available. A list of validators is assigned to validate the performance.

??? info "Task class definition"

    ::: rai_bench.tool_calling_agent.interfaces.Task

As you can see, the framework is very flexible. Any SubTask can be combined into any Validator, which can later be assigned to any Task.

Every Task needs to define its prompt and system prompt, which tools the agent will have available, how many tool calls are required to complete it, and how many optional tool calls are possible.

Optional tool calls mean that a certain tool call is not obligatory to pass the Task, but shouldn't be considered an error. For example, `GetROS2RGBCameraTask`, which has the prompt `Get RGB camera image.`, requires making one tool call with the `get_ros2_image` tool. But listing topics before doing so is a valid approach, so in this case the number of optional tool calls is `1`.
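
Putting the pieces together, a hypothetical Task definition might look roughly like this. The attribute names are assumptions based on the description above, not the actual `Task` interface:

```python
# Hypothetical sketch only: the real
# rai_bench.tool_calling_agent.interfaces.Task interface may differ.
class GetRGBImageTaskSketch:
    system_prompt = "You are a ROS 2 expert that want to solve tasks."
    prompt = "Get RGB camera image."
    required_tool_calls = 1  # one call to get_ros2_image
    optional_tool_calls = 1  # e.g. listing topics first is allowed
    validators = []          # e.g. an OrderedCallsValidator instance
```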

### ToolCallingAgentBenchmark

The `ToolCallingAgentBenchmark` class manages the execution of tasks and collects results.

### Available Tasks

There are predefined Tasks available, grouped by categories:

- Basic - require retrieving info from certain topics
- Manipulation
- Custom Interfaces - requires using messages with custom interfaces

Every Task has an assigned `complexity`, which reflects its difficulty. Tasks can also be parametrized by:

- examples_in_system_prompt - how many examples there are in the system prompt, for example:

    - `0`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions.`
    - `2`: `You are a ROS 2 expert that want to solve tasks. You have access to various tools that allow you to query the ROS 2 system. Be proactive and use the tools to answer questions. Example of tool calls: get_ros2_message_interface, args: {'msg_type': 'geometry_msgs/msg/Twist'} publish_ros2_message, args: {'topic': '/cmd_vel', 'message_type': 'geometry_msgs/msg/Twist', 'message': {linear: {x: 0.5, y: 0.0, z: 0.0}, angular: {x: 0.0, y: 0.0, z: 1.0}}}`

- prompt_detail - how descriptive the Task prompt should be, for example:

    - `brief`: "Get all camera images"
    - `descriptive`: "Get all camera images from all available camera sources in the system. This includes both RGB color images and depth images. You can discover what camera topics are available and capture images from each."

    Descriptive prompts provide guidance and tips.

- extra_tool_calls - how many extra tool calls an agent can make and still pass the Task, for example:

    - `GetROS2RGBCameraTask` has 1 required tool call and 1 optional one. With `extra_tool_calls` set to 5, the agent can correct itself a couple of times and still pass, even with 7 tool calls. There are two types of invalid tool calls: first, when a tool is used incorrectly and the agent receives an error, which makes it easier for the agent to correct itself; second, when a tool is called properly but it is not the tool that should be called, or it is called with the wrong parameters. In this case the agent won't get any error, so it will be harder to correct, but both of these cases are counted as extra tool calls (see the arithmetic sketch after this list).
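
To make the budget concrete, here is the arithmetic from the example above as a tiny sketch in plain Python (not rai_bench code):

```python
# Maximum number of tool calls that still passes GetROS2RGBCameraTask
# when extra_tool_calls is set to 5.
required_tool_calls = 1  # the get_ros2_image call
optional_tool_calls = 1  # e.g. listing topics first
extra_tool_calls = 5     # benchmark parameter

max_passing_calls = required_tool_calls + optional_tool_calls + extra_tool_calls
assert max_passing_calls == 7
```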

If you want to know the details of every task, visit `rai_bench/tool_calling_agent/tasks`.

## VLM Benchmark

The VLM Benchmark evaluates vision language models. It includes a set of tasks with questions about images and evaluates the performance of an agent that returns its answer in a structured format.
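
As a rough illustration of a structured answer, here is a hypothetical Pydantic schema for a yes/no image question; the actual rai_bench answer format may differ:

```python
from pydantic import BaseModel


# Hypothetical answer schema: the agent must fill this structure
# instead of replying with free-form text.
class BoolImageTaskAnswer(BaseModel):
    answer: bool  # e.g. "Is there a red cube in the image?" -> True / False
```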