Commit b1c6772

Support doubao-1.5-thinking-vision-pro (#598)
1 parent 1521ed2 commit b1c6772

5 files changed: +267 -49 lines changed

apps/ui-tars/src/main/agent/prompts.ts

Lines changed: 81 additions & 0 deletions
@@ -85,3 +85,84 @@ call_user() # Submit the task and call the user when the task is unsolvable, or
 
 ## User Instruction
 `;
+
+export const getSystemPromptDoubao_15_15B = `
+You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
+
+## Output Format
+\`\`\`
+Thought: ...
+Action: ...
+\`\`\`
+
+## Action Space
+
+click(start_box='[x1, y1, x2, y2]')
+left_double(start_box='[x1, y1, x2, y2]')
+right_single(start_box='[x1, y1, x2, y2]')
+drag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]')
+hotkey(key='')
+type(content='xxx') # Use escape characters \\', \\", and \n in content part to ensure we can parse the content in normal python string format. If you want to submit your input, use \\n at the end of content.
+scroll(start_box='[x1, y1, x2, y2]', direction='down or up or right or left')
+wait() #Sleep for 5s and take a screenshot to check for any changes.
+finished(content='xxx') # Use escape characters \\', \\", and \n in content part to ensure we can parse the content in normal python string format.
+
+
+## Note
+- Use Chinese in \`Thought\` part.
+- Write a small plan and finally summarize your next action (with its target element) in one sentence in \`Thought\` part.
+
+## User Instruction
+`;
+
+export const getSystemPromptDoubao_15_20B = `You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
+
+## Output Format
+\`\`\`
+Thought: ...
+Action: ...
+\`\`\`
+
+## Action Space
+
+click(point='<point>x1 y1</point>')
+left_double(point='<point>x1 y1</point>')
+right_single(point='<point>x1 y1</point>')
+drag(start_point='<point>x1 y1</point>', end_point='<point>x2 y2</point>')
+scroll(point='<point>x1 y1</point>', direction='down or up or right or left') # Show more information on the \`direction\` side.
+hotkey(key='ctrl c') # Split keys with a space and use lowercase. Also, do not use more than 3 keys in one hotkey action.
+press(key='ctrl') # Presses and holds down ONE key (e.g., ctrl). Use this action in combination with release(). You can perform other actions between press and release. For example, click elements while holding the ctrl key.
+release(key='ctrl') # Releases the key previously pressed. All actions between press and release will execute with the key held down. Note: Ensure all keys are released by the end of the step.
+type(content='xxx') # Use escape characters \\', \\", and \\n in content part to ensure we can parse the content in normal python string format. If you want to submit your input, use \\n at the end of content.
+wait() # Sleep for 5s and take a screenshot to check for any changes.
+call_user() # Call the user when the task is unsolvable, or when you need the user's help. Then, user will see and answer your question in \`user_resp\`.
+finished(content='xxx') # Submit the task with an report to the user. Use escape characters \\', \\", and \\n in content part to ensure we can parse the content in normal python string format.
+
+
+## Note
+- Use Chinese in \`Thought\` part.
+- Write a small plan and finally summarize your next action (with its target element) in one sentence in \`Thought\` part.
+- You may stumble upon new rules or features while playing the game or executing GUI tasks for the first time. Make sure to record them in your \`Thought\` and utilize them later.
+- Your thought style should follow the style of thought Examples.
+- You can provide multiple actions in one step, separated by "\n\n".
+- Ensure all keys you pressed are released by the end of the step.
+
+## Thought Examples
+- Example1. Thought: 第一行、第三列出现了一个数字2;第二列原有数字4与第四列新出现的数字4合并后变为8。注意观察第二列数字8与左边数字8的颜色比较浅一点,数字2的颜色看起来没有数字8的深。我猜测不同的颜色深的程度代表数值的大小,颜色较深的代表数值较大。这不,为了验证这个,我继续按下向左键让这两个8合并成为更大的数。
+- Example2. Thought: 真好!第一行第三列的数字2向左移动了两格合并到了第一行第一列,并且颜色比原先数字8的颜色深了许多。证明我的猜想没错,确实是这样!所以只有同样颜色深浅的数字才能够进行合并,而合并后的数字将变为原来数字的二倍并且颜色深度较深。而且!第一行第三列的2向左移动了两格,但是并没有和第一行第一列的2进行合并!由此可得,只有相同连续的格子才能够进行数字的合并。我按下向下键,16可以一步步进行合并得到2048,但是过程可能有些难。像我这样所做的操作并不是一步一步合并得到的。我这样做是为了更好的后续进行合并,得到更加大的数。
+- Example3. Thought: 又重新再来了。刚才的下键并没有起到什么作用。新格子还是刷到了第三行第四列的位置,表明下键此时并没有什么太大作用,我猜测是不是特定的布局无法支持一些方位的操作,为了验证,我得多尝试一些方位,我按下左键看看。
+- Example4. Thought: 哦,我知道了,同样的位置选择了同样的操作时不会发生改变的。除非是选择不同的方位!点击向上键以后,3、4行的数字都向上移动了一格,而它们原来所在的位置都被刷新出来了新数字,分别是4和2。同样,第三行第四列的数字2没有发生移动也刷新了新格子。明白了这一切后,我操作向左键试试看。
+- Example5. Thought: 经过我不懈的努力,在我的仔细观察选中的策略下,我成功地获得了胜利。这验证了我之前的猜想,移动按键只有我的头部移动到含数字的区域才会改变移动按键,蛇的身体移动到含数字的区域并不会影响移动按键。
+- Example6. Thought: 小蛇还是没动,我再次选择让它向右一步,希望这一次能成功移动,并且我猜测移动的间隔应该是蛇的长度,按动的次数也应该是蛇的长度。我或许需要将它记录下来,如果按一次它因为前方有障碍而动不了,但前方需要移动的话,需要按两次或者以上,按照蛇的长度来计算要按几次。
+- Example7. Thought: 我觉得我的猜测是正确的,小蛇的移动是根据手部的长度是否能达成这一条件进行前进,这对我之后的操作提供了很多帮助,也是游戏的通性。不过现在小蛇离苹果拿走只有一个格子,太过去了,所以后面还需要。再次往前走我们应该先走出道这个限制然后来到中间这个地方然后我们应该是绕一圈然后把这两道门选择开阔住然后使得这样才能让这个墙消失。那么我可以现在向左,尝试不触碰障碍的迈进,这似乎能改变小蛇的操作,使其改变路数。
+- Example8. Thought: 我观察到在出口管道里面,红苹果的前方还有一个阻挡物。那个阻挡物是一张带有浅褐和深褐色的老鼠皮,看起来随着红苹果的自然移动,它也在向着出口移动,但是对比旁边的方块框架显得很慢。目前这些都是我猜测的,我要看看推动这个老鼠皮要多少的力道。就在这时我刚好要按向右了,现在我按住 “D”键。
+- Example9. Thought: 太好了,我的做法是正确的,但是我发现激光点发射出来的激光这个时候并没有发光,看来我刚刚的猜测是不太全面的,还有新的知识,需要我再次了解一下激光的规则,回忆起来,刚刚似乎这个红色激光点发射出来的激光,别上是黄色,但上面的并没有什么波动,我需要新的条件,才能发现它的规律,将上一步的最后一格步骤拿出来,我发现刚刚不仅是激光颜色改变了,重要的是上面的箭头也改变了方向,也就是说激光点跟着太阳光一样,会有方向改变,这应该会是个关键消息,那我需要思考一下。
+- Example10. Thought: 我继续观察发光装置箭头方向和角度,我猜测离发射装置近的那个白方块,只能被移动到与发射装置相邻的中上方蓝色方块位置,那么此时下方的白方块只能位于最右边一列蓝色方块中的其中一个位置并与位于一条直线上的左下方的黑色圆圈重合,我只能在右下角和正下方的两个蓝色方块中选择,似乎,看起来右下角的这个方块的位置更能满足与两列黑色圆圈的距离的重合,但是到底是否正确的呢,那么我一定要去验证了。
+- Example11. Thought: 我们第一关是一个四边形,这个四边形内部的红绳是交织在一起的,我们根据以上经验如果要挪动一个毛线团的话,没有办法挪动任何一个上方有绳子限制的毛线团。所以从解题思路上我们可以打破这四边形的限制方向,那我们就可以挪动上方的毛线团。
+
+## Output Examples
+Thought: 在这里输出你的中文思考,你的思考样式应该参考上面的Thought Examples...
+Action: click(point='<point>10 20</point>')
+
+## User Instruction
+`;
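
The output format requested by `getSystemPromptDoubao_15_20B` (a `Thought:` part, then one or more actions separated by blank lines, with coordinates written as `<point>x y</point>`) can be illustrated with a small parsing sketch. This is not the parser used by the project; `parseStep` and `parsePointAction` are hypothetical names introduced only for illustration, and the sketch assumes integer or decimal coordinates.

```typescript
// Hypothetical helpers, not part of this commit. They only illustrate the
// response shape that the prompt above asks the model to produce.

interface ParsedStep {
  thought: string;
  actions: string[];
}

interface PointAction {
  name: string;
  x: number;
  y: number;
}

// Split a response into its Thought text and the list of actions that
// follow "Action:", where actions are separated by blank lines.
function parseStep(response: string): ParsedStep {
  const marker = 'Action:';
  const idx = response.indexOf(marker);
  const head = idx === -1 ? response : response.slice(0, idx);
  const thought = head.replace(/^Thought:\s*/, '').trim();
  if (idx === -1) {
    return { thought, actions: [] };
  }
  const actions = response
    .slice(idx + marker.length)
    .split(/\n\s*\n/)
    .map((a) => a.trim())
    .filter((a) => a.length > 0);
  return { thought, actions };
}

// Extract the action name and the first <point>x y</point> pair, if any.
function parsePointAction(action: string): PointAction | null {
  const re =
    /^(\w+)\(.*?<point>\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*<\/point>/;
  const match = re.exec(action.trim());
  if (!match) {
    return null;
  }
  return { name: match[1], x: Number(match[2]), y: Number(match[3]) };
}
```

For a step such as `press(key='ctrl')`, then `click(point='<point>10 20</point>')`, then `release(key='ctrl')`, `parseStep` would return three action strings, and `parsePointAction` would return a point only for the `click`.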

apps/ui-tars/src/main/services/runAgent.ts

Lines changed: 25 additions & 6 deletions
@@ -16,7 +16,12 @@ import {
   DefaultBrowserOperator,
   SearchEngine,
 } from '@ui-tars/operator-browser';
-import { getSystemPrompt, getSystemPromptV1_5 } from '../agent/prompts';
+import {
+  getSystemPrompt,
+  getSystemPromptV1_5,
+  getSystemPromptDoubao_15_15B,
+  getSystemPromptDoubao_15_20B,
+} from '../agent/prompts';
 import {
   closeScreenMarker,
   hideWidgetWindow,
@@ -44,6 +49,8 @@ const getModelVersion = (
       return UITarsModelVersion.V1_0;
     case VLMProviderV2.doubao_1_5:
       return UITarsModelVersion.DOUBAO_1_5_15B;
+    case VLMProviderV2.doubao_1_5_vl:
+      return UITarsModelVersion.DOUBAO_1_5_20B;
     default:
       return UITarsModelVersion.V1_0;
   }
@@ -166,16 +173,28 @@ export const runAgent = async (
     );
   }
 
+  const modelVersion = getModelVersion(settings.vlmProvider);
+
+  const getSpByModelVersion = (modelVersion: UITarsModelVersion) => {
+    switch (modelVersion) {
+      case UITarsModelVersion.DOUBAO_1_5_20B:
+        return getSystemPromptDoubao_15_20B;
+      case UITarsModelVersion.DOUBAO_1_5_15B:
+        return getSystemPromptDoubao_15_15B;
+      case UITarsModelVersion.V1_5:
+        return getSystemPromptV1_5(language, 'normal');
+      default:
+        return getSystemPrompt(language);
+    }
+  };
+
   const guiAgent = new GUIAgent({
     model: {
       baseURL: settings.vlmBaseUrl,
       apiKey: settings.vlmApiKey,
       model: settings.vlmModelName,
     },
-    systemPrompt:
-      getModelVersion(settings.vlmProvider) === UITarsModelVersion.V1_5
-        ? getSystemPromptV1_5(language, 'normal')
-        : getSystemPrompt(language),
+    systemPrompt: getSpByModelVersion(modelVersion),
     logger,
     signal: abortController?.signal,
     operator: operator,
@@ -206,7 +225,7 @@ export const runAgent = async (
     },
     maxLoopCount: settings.maxLoopCount,
     loopIntervalInMs: settings.loopIntervalInMs,
-    uiTarsVersion: getModelVersion(settings.vlmProvider),
+    uiTarsVersion: modelVersion,
   });
 
   GUIAgentManager.getInstance().setAgent(guiAgent);

apps/ui-tars/src/main/store/types.ts

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ export enum VLMProviderV2 {
   ui_tars_1_0 = 'Hugging Face for UI-TARS-1.0',
   ui_tars_1_5 = 'Hugging Face for UI-TARS-1.5',
   doubao_1_5 = 'VolcEngine Ark for Doubao-1.5-UI-TARS',
+  doubao_1_5_vl = 'VolcEngine Ark for Doubao-1.5-thinking-vision-pro',
 }
 
 export enum SearchEngineForSettings {

packages/ui-tars/operators/browser-operator/src/browser-operator.ts

Lines changed: 86 additions & 0 deletions
@@ -231,6 +231,14 @@ export class BrowserOperator extends Operator {
           await this.handleHotkey(action_inputs);
           break;
 
+        case 'press':
+          await this.handlePress(action_inputs);
+          break;
+
+        case 'release':
+          await this.handleRelease(action_inputs);
+          break;
+
         case 'scroll':
           await this.handleScroll(action_inputs);
           break;
@@ -405,6 +413,84 @@ export class BrowserOperator extends Operator {
     this.logger.info('Hotkey execution completed');
   }
 
+  private async handlePress(inputs: Record<string, any>) {
+    const page = await this.getActivePage();
+
+    const keyStr = inputs?.key;
+    if (!keyStr) {
+      this.logger.warn('No key specified for press');
+      throw new Error(`No key specified for press`);
+    }
+
+    this.logger.info(`Pressing key: ${keyStr}`);
+
+    const keys = keyStr.split(/[\s+]/);
+    const normalizedKeys: KeyInput[] = keys.map((key: string) => {
+      const lowercaseKey = key.toLowerCase();
+      const keyInput = KEY_MAPPINGS[lowercaseKey];
+
+      if (keyInput) {
+        return keyInput;
+      }
+
+      throw new Error(`Unsupported key: ${key}`);
+    });
+
+    this.logger.info(`Normalized keys for press:`, normalizedKeys);
+
+    // Only press the keys
+    for (const key of normalizedKeys) {
+      await page.keyboard.down(key);
+      await this.delay(50); // Add a small delay so the key press registers reliably
+    }
+
+    this.logger.info('Press operation completed');
+  }
+
+  private async handleRelease(inputs: Record<string, any>) {
+    const page = await this.getActivePage();
+
+    const keyStr = inputs?.key;
+    if (!keyStr) {
+      this.logger.warn('No key specified for release');
+      throw new Error(`No key specified for release`);
+    }
+
+    this.logger.info(`Releasing key: ${keyStr}`);
+
+    const keys = keyStr.split(/[\s+]/);
+    const normalizedKeys: KeyInput[] = keys.map((key: string) => {
+      const lowercaseKey = key.toLowerCase();
+      const keyInput = KEY_MAPPINGS[lowercaseKey];
+
+      if (keyInput) {
+        return keyInput;
+      }
+
+      throw new Error(`Unsupported key: ${key}`);
+    });
+
+    this.logger.info(`Normalized keys for release:`, normalizedKeys);
+
+    // Release the keys
+    for (const key of normalizedKeys) {
+      await page.keyboard.up(key);
+      await this.delay(50); // Add a small delay so the key release registers reliably
+    }
+
+    // For hotkey combinations that may trigger navigation,
+    // wait for navigation to complete
+    const navigationKeys = ['Enter', 'F5'];
+    if (normalizedKeys.some((key: string) => navigationKeys.includes(key))) {
+      this.logger.info('Waiting for possible navigation after key release');
+      await this.waitForPossibleNavigation(page);
+    } else {
+      await this.delay(500);
+    }
+
+    this.logger.info('Release operation completed');
+  }
+
   private async handleScroll(inputs: Record<string, any>) {
     const page = await this.getActivePage();
 
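
Both `handlePress` and `handleRelease` split the key string with `/[\s+]/`, a character class that matches one whitespace character or one literal `+` at a time. A key string such as `ctrl + c` therefore yields empty tokens between the separators, and an empty token would fall through to the `Unsupported key` error unless the mapping table happens to contain an empty-string entry. The sketch below shows that behaviour and one possible alternative that splits on runs of separators. `SAMPLE_KEY_MAPPINGS` and `normalizeKeys` are hypothetical stand-ins, not the operator's real `KEY_MAPPINGS` table or any function in this commit.

```typescript
// Hypothetical stand-in for the operator's KEY_MAPPINGS table (assumed subset;
// the real table and its values are not shown in this diff).
const SAMPLE_KEY_MAPPINGS: Record<string, string> = {
  ctrl: 'Control',
  shift: 'Shift',
  enter: 'Enter',
  c: 'c',
};

// Same lookup idea as handlePress/handleRelease, but splitting on runs of
// whitespace or '+' and dropping empty tokens before the lookup.
function normalizeKeys(keyStr: string): string[] {
  return keyStr
    .split(/[\s+]+/)
    .filter((token) => token.length > 0)
    .map((token) => {
      const mapped = SAMPLE_KEY_MAPPINGS[token.toLowerCase()];
      if (mapped === undefined) {
        throw new Error(`Unsupported key: ${token}`);
      }
      return mapped;
    });
}

// The single-character split used in the diff keeps the empty tokens:
const rawTokens = 'ctrl + c'.split(/[\s+]/); // ['ctrl', '', '', 'c']
```

With this sketch, `normalizeKeys('ctrl + c')` and `normalizeKeys('ctrl c')` resolve to the same two keys, while an unknown name still throws.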

packages/ui-tars/operators/nut-js/src/index.ts

Lines changed: 74 additions & 43 deletions
@@ -117,6 +117,59 @@ export class NutJSOperator extends Operator {
     // logger.info('[execute] [Region]', region);
     // }
 
+    const getHotkeys = (keyStr: string | undefined): Key[] => {
+      if (keyStr) {
+        const platformCommandKey =
+          process.platform === 'darwin' ? Key.LeftCmd : Key.LeftWin;
+        const platformCtrlKey =
+          process.platform === 'darwin' ? Key.LeftCmd : Key.LeftControl;
+        const keyMap: Record<string, Key> = {
+          return: Key.Enter,
+          enter: Key.Enter,
+          backspace: Key.Backspace,
+          delete: Key.Delete,
+          ctrl: platformCtrlKey,
+          shift: Key.LeftShift,
+          alt: Key.LeftAlt,
+          space: Key.Space,
+          'page down': Key.PageDown,
+          pagedown: Key.PageDown,
+          'page up': Key.PageUp,
+          pageup: Key.PageUp,
+          meta: platformCommandKey,
+          win: platformCommandKey,
+          command: platformCommandKey,
+          cmd: platformCommandKey,
+          comma: Key.Comma,
+          ',': Key.Comma,
+          up: Key.Up,
+          down: Key.Down,
+          left: Key.Left,
+          right: Key.Right,
+          arrowup: Key.Up,
+          arrowdown: Key.Down,
+          arrowleft: Key.Left,
+          arrowright: Key.Right,
+        };
+
+        const keys = keyStr
+          .split(/[\s+]/)
+          .map(
+            (k) =>
+              keyMap[k.toLowerCase()] ||
+              Key[k.toUpperCase() as keyof typeof Key],
+          );
+        logger.info('[NutjsOperator] hotkey: ', keys);
+        return keys;
+      } else {
+        logger.error(
+          '[NutjsOperator] hotkey error: ',
+          `${keyStr} is not a valid key`,
+        );
+        return [];
+      }
+    };
+
     switch (action_type) {
       case 'wait':
         logger.info('[NutjsOperator] wait', action_inputs);
@@ -215,50 +268,28 @@ export class NutJSOperator extends Operator {
 
       case 'hotkey': {
         const keyStr = action_inputs?.key || action_inputs?.hotkey;
-        if (keyStr) {
-          const platformCommandKey =
-            process.platform === 'darwin' ? Key.LeftCmd : Key.LeftWin;
-          const platformCtrlKey =
-            process.platform === 'darwin' ? Key.LeftCmd : Key.LeftControl;
-          const keyMap: Record<string, Key> = {
-            return: Key.Enter,
-            enter: Key.Enter,
-            backspace: Key.Backspace,
-            delete: Key.Delete,
-            ctrl: platformCtrlKey,
-            shift: Key.LeftShift,
-            alt: Key.LeftAlt,
-            space: Key.Space,
-            'page down': Key.PageDown,
-            pagedown: Key.PageDown,
-            'page up': Key.PageUp,
-            pageup: Key.PageUp,
-            meta: platformCommandKey,
-            win: platformCommandKey,
-            command: platformCommandKey,
-            cmd: platformCommandKey,
-            comma: Key.Comma,
-            ',': Key.Comma,
-            up: Key.Up,
-            down: Key.Down,
-            left: Key.Left,
-            right: Key.Right,
-            arrowup: Key.Up,
-            arrowdown: Key.Down,
-            arrowleft: Key.Left,
-            arrowright: Key.Right,
-          };
-
-          const keys = keyStr
-            .split(/[\s+]/)
-            .map(
-              (k) =>
-                keyMap[k.toLowerCase()] ||
-                Key[k.toUpperCase() as keyof typeof Key],
-            );
-          logger.info('[NutjsOperator] hotkey: ', keys);
+        const keys = getHotkeys(keyStr);
+        if (keys.length > 0) {
           await keyboard.pressKey(...keys);
-          await keyboard.releaseKey(...keys.reverse());
+          await keyboard.releaseKey(...keys);
+        }
+        break;
+      }
+
+      case 'press': {
+        const keyStr = action_inputs?.key || action_inputs?.hotkey;
+        const keys = getHotkeys(keyStr);
+        if (keys.length > 0) {
+          await keyboard.pressKey(...keys);
+        }
+        break;
+      }
+
+      case 'release': {
+        const keyStr = action_inputs?.key || action_inputs?.hotkey;
+        const keys = getHotkeys(keyStr);
+        if (keys.length > 0) {
+          await keyboard.releaseKey(...keys);
         }
         break;
       }
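
The new `press` and `release` cases are stateless: each call maps the key string and forwards it to the keyboard, and the prompt asks the model to release every held key by the end of the step. If a caller wanted to enforce that on the operator side, one option would be to track held keys. The sketch below is a hypothetical illustration of that idea only; `HeldKeys` does not exist in this commit and is written without any nut-js dependency so that it stays self-contained.

```typescript
// Hypothetical tracker, not part of this commit: remembers which keys a
// 'press' action put down so that leftovers can be released after a step.
class HeldKeys<K> {
  private held: K[] = [];

  // Record keys as held, ignoring keys that are already down.
  press(keys: K[]): void {
    for (const key of keys) {
      if (this.held.indexOf(key) === -1) {
        this.held.push(key);
      }
    }
  }

  // Forget keys that have been released.
  release(keys: K[]): void {
    this.held = this.held.filter((key) => keys.indexOf(key) === -1);
  }

  // Keys still down, most recently pressed first.
  leftovers(): K[] {
    return this.held.slice().reverse();
  }
}

// Usage sketch with plain strings standing in for nut-js Key values.
const tracker = new HeldKeys<string>();
tracker.press(['ctrl', 'shift']);
tracker.release(['ctrl']);
tracker.press(['alt']);
const stillDown = tracker.leftovers(); // ['alt', 'shift']
```

A caller could pass `leftovers()` to a release call at the end of a step; returning the most recently pressed key first mirrors the reverse-order release that the previous `hotkey` implementation used.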
