> [!IMPORTANT]
> **\[2025-03-16\]** We released a **technical preview** version of a new desktop app - [Agent TARS](./apps/omega/README.md), a multimodal AI agent that drives browser operations by visually interpreting web pages and integrates seamlessly with command lines and file systems.

<p align="center">
  <img alt="UI-TARS" width="260" src="./resources/icon.png">
</p>

# UI-TARS Desktop

UI-TARS Desktop is a GUI Agent application based on [UI-TARS (Vision-Language Model)](https://github.com/bytedance/UI-TARS) that allows you to control your computer using natural language.

<p align="center">
  📑 <a href="https://arxiv.org/abs/2501.12326">Paper</a>
  | 🤗 <a href="https://huggingface.co/bytedance-research/UI-TARS-7B-DPO">Hugging Face Models</a>
  | 🫨 <a href="https://discord.gg/pTXwYVjfcs">Discord</a>
  | 🤖 <a href="https://www.modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO">ModelScope</a>
  <br>
  🖥️ Desktop Application
  | 👓 <a href="https://github.com/web-infra-dev/midscene">Midscene (use in browser)</a>
</p>

### ⚠️ Important Announcement: GGUF Model Performance

The **GGUF model** has undergone quantization, but unfortunately its performance cannot be guaranteed. As a result, we have decided to **downgrade** it.

💡 **Alternative Solution**:
You can use **[Cloud Deployment](#cloud-deployment)** or **[Local Deployment [vLLM]](#local-deployment-vllm)** (if you have enough GPU resources) instead.

We appreciate your understanding and patience as we work to ensure the best possible experience.

## Updates

- 🚀 01.25: We updated the **[Cloud Deployment](#cloud-deployment)** section of the Chinese guide, [GUI模型部署教程](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb), with new information about the ModelScope platform. You can now use ModelScope for deployment.

## Showcases

| Instruction | Video |
| :---: | :---: |
| Get the current weather in SF using the web browser | <video src="https://github.com/user-attachments/assets/5235418c-ac61-4895-831d-68c1c749fc87" height="300" /> |
| Send a tweet with the content "hello world" | <video src="https://github.com/user-attachments/assets/737ccc11-9124-4464-b4be-3514cbced85c" height="300" /> |

## Features

- 🤖 Natural language control powered by Vision-Language Model
- 🖥️ Screenshot and visual recognition support
- 🎯 Precise mouse and keyboard control
- 💻 Cross-platform support (Windows/MacOS)
- 🔄 Real-time feedback and status display
- 🔐 Private and secure - fully local processing

## Quick Start

### Download

You can download the [latest release](https://github.com/bytedance/UI-TARS-desktop/releases/latest) of UI-TARS Desktop from our releases page.

> **Note**: If you have [Homebrew](https://brew.sh/) installed, you can install UI-TARS Desktop by running the following command:
>
> ```bash
> brew install --cask ui-tars
> ```

### Install

#### MacOS

1. Drag the **UI TARS** application into the **Applications** folder
   <img src="./images/mac_install.png" width="500px" />

2. Grant **UI TARS** the required permissions in MacOS:
   - System Settings -> Privacy & Security -> **Accessibility**
   - System Settings -> Privacy & Security -> **Screen Recording**
   <img src="./images/mac_permission.png" width="500px" />

3. Then open the **UI TARS** application; you will see the following interface:
   <img src="./images/mac_app.png" width="500px" />

#### Windows

Just run the application, and you will see the following interface:

<img src="./images/windows_install.png" width="400px" />

### Deployment

#### Cloud Deployment

We recommend using HuggingFace Inference Endpoints for fast deployment.
We provide two documents for reference:

English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)

Chinese version: [GUI模型部署教程](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)

#### Local Deployment [vLLM]

We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.

```bash
pip install -U transformers
VLLM_VERSION=0.6.6
CUDA_VERSION=cu124
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
```

##### Download the Model

We provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. To achieve the best performance, we recommend using the **7B-DPO** or **72B-DPO** model (depending on your hardware configuration):

- [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)
- [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
- [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
- [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
- [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)

##### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
```

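Once the service is running, it can be queried like any OpenAI-compatible endpoint. A minimal sketch of a request body pairing an instruction with a screenshot; the default vLLM address (`http://localhost:8000`) and the placeholder PNG bytes are assumptions for illustration, not part of this repository:

```python
import base64
import json

# Placeholder PNG header bytes; in practice, read the screenshot your client captured.
screenshot_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

# Body for POST http://localhost:8000/v1/chat/completions (vLLM's default address).
# "ui-tars" matches the --served-model-name flag used above.
payload = {
    "model": "ui-tars",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Get the current weather in SF using the web browser"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"},
                },
            ],
        }
    ],
}
print(json.dumps(payload, indent=2)[:80])
```

The screenshot is embedded as a base64 data URL, following the OpenAI vision message format.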
##### Input your API information

<img src="./images/settings_model.png" width="500px" />

<!-- If you use Ollama, you can use the following settings to start the server:

```yaml
VLM Provider: ollama
VLM Base Url: http://localhost:11434/v1
VLM API Key: api_key
VLM Model Name: ui-tars
``` -->

> **Note**: The **VLM Base Url** must be an OpenAI-compatible API endpoint (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details).

## Contributing

[CONTRIBUTING.md](./CONTRIBUTING.md)

## SDK (Experimental)

[SDK](./docs/sdk.md)

## License

UI-TARS Desktop is licensed under the Apache License 2.0.

## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil::

```BibTeX
@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
```