A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.
🔥 This project is actively maintained, and contributions are welcome. If you spot a missing paper or any incorrect information, please open an issue or submit a pull request.
🤖 Try our Awesome-Paper-Agent: provide an arXiv URL, and it automatically returns a formatted entry, like this:
User:
https://arxiv.org/abs/2312.13108
GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)
  [code](https://github.com/showlab/assistgui)
  [paper](https://arxiv.org/abs/2312.13108)
  [project page](https://showlab.github.io/assistgui/)
You can then copy the formatted entry directly into your pull request.
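If you prefer to generate such an entry locally instead, here is a minimal sketch using the public arXiv export API (`http://export.arxiv.org/api/query`). The helper names (`format_entry`, `fetch_entry`) are illustrative, not part of any released tool:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Month abbreviations matching the date style used throughout this list.
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
          "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

def format_entry(title: str, arxiv_id: str, published: str) -> str:
    """Render one awesome-list bullet from arXiv metadata.

    `published` is an ISO timestamp such as "2023-12-20T00:00:00Z".
    """
    year, month = published[:4], int(published[5:7])
    return f"+ [{title}](https://arxiv.org/abs/{arxiv_id}) ({MONTHS[month - 1]} {year})"

def fetch_entry(arxiv_id: str) -> str:
    """Query the arXiv export API (Atom feed) and format the first result."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    ns = {"a": "http://www.w3.org/2005/Atom"}
    entry = feed.find("a:entry", ns)
    # arXiv titles may contain line breaks; collapse runs of whitespace.
    title = re.sub(r"\s+", " ", entry.find("a:title", ns).text).strip()
    published = entry.find("a:published", ns).text
    return format_entry(title, arxiv_id, published)
```

For example, `fetch_entry("2312.13108")` would produce the AssistGUI bullet shown above (minus the code/paper/project links, which you add by hand).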
⭐ If you find this repository useful, please give it a star.
World of Bits: An Open-Domain Platform for Web-Based Agents (Aug. 2017, ICML 2017)
A Unified Solution for Structured Web Data Extraction (Jul. 2011, SIGIR 2011)
Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration (Feb. 2018, ICLR 2018)
Mapping Natural Language Instructions to Mobile UI Action Sequences (May. 2020, ACL 2020)
WebSRC: A Dataset for Web-Based Structural Reading Comprehension (Jan. 2021, EMNLP 2021)
AndroidEnv: A Reinforcement Learning Platform for Android (May. 2021)
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility (Feb. 2022)
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (May. 2022)
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Jul. 2022)
Language Models can Solve Computer Tasks (Mar. 2023)
Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May. 2023)
Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
WebArena: A Realistic Web Environment for Building Autonomous Agents (Jul. 2023)
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models (Nov. 2023)
AssistGUI: Task-Oriented Desktop Graphical User Interface Automation (Dec. 2023, CVPR 2024)
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024, ACL 2024)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (Feb. 2024)
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (Feb. 2024)
On the Multi-turn Instruction Following for Conversational Web Agents (Feb. 2024)
AgentStudio: A Toolkit for Building General Virtual Agents (Mar. 2024)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
Benchmarking Mobile Device Control Agents across Diverse Configurations (Apr. 2024, ICLR 2024)
MMInA: Benchmarking Multihop Multimodal Internet Agents (Apr. 2024)
Autonomous Evaluation and Refinement of Digital Agents (Apr. 2024)
LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation (Apr. 2024)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr. 2024)
GUICourse: From General Vision Language Models to Versatile GUI Agents (Jun. 2024)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (Jun. 2024)
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (Jun. 2024)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos (Jun. 2024)
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding (Jun. 2024)
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Jun. 2024)
Practical, Automated Scenario-based Mobile App Testing (Jun. 2024)
WebCanvas: Benchmarking Web Agents in Online Environments (Jun. 2024)
On the Effects of Data Scale on Computer Control Agents (Jun. 2024)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Jul. 2024)
WebVLN: Vision-and-Language Navigation on Websites (AAAI 2024)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Harnessing Webpage UIs for Text-Rich Visual Understanding (Oct. 2024)
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (Dec. 2024)
A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation (ICLR 2025)
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation (Feb. 2025)
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (Apr. 2025)
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (May. 2025)
MONDAY: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents (May. 2025, CVPR 2025)
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System (Jun. 2025)
Grounding Open-Domain Instructions to Automate Web Support Tasks (Mar. 2021)
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (Aug. 2021)
A Data-Driven Approach for Learning to Control Computers (Feb. 2022)
Augmenting Autotelic Agents with Large Language Models (May. 2023)
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control (Jun. 2023, ICLR 2024)
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Jul. 2023, ICLR 2024)
LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
CogAgent: A Visual Language Model for GUI Agents (Dec. 2023, CVPR 2024)
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
Comprehensive Cognitive LLM Agent for Smartphone GUI Automation (Feb. 2024)
Improving Language Understanding from Screenshots (Feb. 2024)
AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent (Apr. 2024, KDD 2024)
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models (May. 2023, NeurIPS 2023)
You Only Look at Screens: Multimodal Chain-of-Action Agents (Sep. 2023)
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API (Oct. 2023)
OpenAgents: An Open Platform for Language Agents in the Wild (Oct. 2023)
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant (Oct. 2024)
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (Nov. 2023)
AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024, ACL 2024)
GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024, ICML 2024)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
Dual-View Visual Contextualization for Web Navigation (Feb. 2024, CVPR 2024)
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
Visual Grounding for User Interfaces (NAACL 2024)
ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
Octopus: On-device language model for function calling of software APIs (Apr. 2024)
Octopus v2: On-device language model for super agent (Apr. 2024)
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (Apr. 2024)
Octopus v4: Graph of language models (Apr. 2024)
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (Apr. 2024)
Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking (Apr. 2024, SIGIR 2024)
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation (Dec. 2023, MobiCom 2024)
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study (Mar. 2024)
Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
GUI Action Narrator: Where and When Did That Action Take Place? (Jun. 2024)
Identifying User Goals from UI Trajectories (Jun. 2024)
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning (Jun. 2024)
Octo-planner: On-device Language Model for Planner-Action Agents (Jun. 2024)
E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (Jun. 2024)
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
MobileFlow: A Multimodal LLM For Mobile GUI Agent (Jul. 2024)
Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model (Jul. 2024)
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Jul. 2024)
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices (Jul. 2024)
AUITestAgent: Automatic Requirements Oriented GUI Function Testing (Jul. 2024)
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (Aug. 2024)
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Jul. 2024)
AppAgent v2: Advanced Agent for Flexible Mobile Interactions (Aug. 2024)
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions (Aug. 2024)
Agent Workflow Memory (Sep. 2024)
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding (Sep. 2024)
Agent S: An Open Agentic Framework that Uses Computers Like a Human (Oct. 2024)
MobA: A Two-Level Agent System for Efficient Mobile Task Automation (Oct. 2024)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents (Oct. 2024)
Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
AutoGLM: Autonomous Foundation Agents for GUIs (Nov. 2024)
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations (Nov. 2024)
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent (Nov. 2024)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Dec. 2024)
Falcon-UI: Understanding GUI Before Following User Instructions (Dec. 2024)
PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World (Dec. 2024)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (Dec. 2024)
Aria-UI: Visual Grounding for GUI Instructions (Dec. 2024)
CogAgent v2 (Dec. 2024)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (Jan. 2025)
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration (Jan. 2025)
Lightweight Neural App Control (ICLR 2025)
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents (ICLR 2025)
AppVLM: A Lightweight Vision Language Model for Online App Control (Feb. 2025)
GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection (Feb. 2025)
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials (Apr. 2025)
MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users (Feb. 2025)
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data (Mar. 2025)
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning (Jun. 2025)
EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection (May. 2025)
Test-Time Reinforcement Learning for GUI Grounding via Region Consistency (Aug. 2025)