A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.
🔥 This project is actively maintained, and contributions are welcome. If you spot a missing paper or any incorrect information, please open an issue or submit a pull request.
🤖 Try our Awesome-Paper-Agent: provide an arXiv URL, and it automatically returns a formatted entry, like this:
User:
https://arxiv.org/abs/2312.13108
GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)
  [code](https://github.com/showlab/assistgui)
  [paper](https://arxiv.org/abs/2312.13108)
  [project page](https://showlab.github.io/assistgui/)
You can then copy the formatted entry directly into your pull request.
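If you prefer to generate such an entry locally instead, here is a minimal sketch using the public arXiv export API (`http://export.arxiv.org/api/query`). The helper names (`format_entry`, `fetch_entry`) are illustrative, not part of any released tool:

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Month abbreviations matching the date style used throughout this list.
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May.", "Jun.",
          "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]

def format_entry(title: str, arxiv_id: str, published: str) -> str:
    """Render one awesome-list bullet from arXiv metadata.

    `published` is an ISO timestamp such as "2023-12-20T00:00:00Z".
    """
    year, month = published[:4], int(published[5:7])
    return f"+ [{title}](https://arxiv.org/abs/{arxiv_id}) ({MONTHS[month - 1]} {year})"

def fetch_entry(arxiv_id: str) -> str:
    """Query the arXiv export API (Atom feed) and format the first result."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    ns = {"a": "http://www.w3.org/2005/Atom"}
    entry = feed.find("a:entry", ns)
    # arXiv titles may contain line breaks; collapse runs of whitespace.
    title = re.sub(r"\s+", " ", entry.find("a:title", ns).text).strip()
    published = entry.find("a:published", ns).text
    return format_entry(title, arxiv_id, published)
```

For example, `fetch_entry("2312.13108")` would produce the AssistGUI bullet shown above (minus the code/paper/project links, which you add by hand).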
⭐ If you find this repository useful, please give it a star.
World of Bits: An Open-Domain Platform for Web-Based Agents (Aug. 2017, ICML 2017)
A Unified Solution for Structured Web Data Extraction (Jul. 2011, SIGIR 2011)
Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration (Feb. 2018, ICLR 2018)
Mapping Natural Language Instructions to Mobile UI Action Sequences (May. 2020, ACL 2020)
WebSRC: A Dataset for Web-Based Structural Reading Comprehension (Jan. 2021, EMNLP 2021)
AndroidEnv: A Reinforcement Learning Platform for Android (May. 2021)
A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility (Feb. 2022)
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (May. 2022)
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Jul. 2022)
Language Models can Solve Computer Tasks (Mar. 2023)
Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May. 2023)
Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
WebArena: A Realistic Web Environment for Building Autonomous Agents (Jul. 2023)
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models (Nov. 2023)
AssistGUI: Task-Oriented Desktop Graphical User Interface Automation (Dec. 2023, CVPR 2024)
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024, ACL 2024)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (Feb. 2024)
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (Feb. 2024)
On the Multi-turn Instruction Following for Conversational Web Agents (Feb. 2024)
AgentStudio: A Toolkit for Building General Virtual Agents (Mar. 2024)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
Benchmarking Mobile Device Control Agents across Diverse Configurations (Apr. 2024, ICLR 2024)
MMInA: Benchmarking Multihop Multimodal Internet Agents (Apr. 2024)
Autonomous Evaluation and Refinement of Digital Agents (Apr. 2024)
LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation (Apr. 2024)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr. 2024)
GUICourse: From General Vision Language Models to Versatile GUI Agents (Jun. 2024)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (Jun. 2024)
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (Jun. 2024)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos (Jun. 2024)
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding (Jun. 2024)
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Jun. 2024)
Practical, Automated Scenario-based Mobile App Testing (Jun. 2024)
WebCanvas: Benchmarking Web Agents in Online Environments (Jun. 2024)
On the Effects of Data Scale on Computer Control Agents (Jun. 2024)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Jul. 2024)
WebVLN: Vision-and-Language Navigation on Websites (AAAI 2024)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Harnessing Webpage UIs for Text-Rich Visual Understanding (Oct. 2024)
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (Dec. 2024)
A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation (ICLR 2025)
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation (Feb. 2025)
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (Apr. 2025)
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (May. 2025)
MONDAY: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents (May. 2025, CVPR 2025)
Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System (Jun. 2025)
Grounding Open-Domain Instructions to Automate Web Support Tasks (Mar. 2021)
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (Aug. 2021)
A Data-Driven Approach for Learning to Control Computers (Feb. 2022)
Augmenting Autotelic Agents with Large Language Models (May. 2023)
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control (Jun. 2023, ICLR 2024)
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Jul. 2023, ICLR 2024)
LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
CogAgent: A Visual Language Model for GUI Agents (Dec. 2023, CVPR 2024)
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
Comprehensive Cognitive LLM Agent for Smartphone GUI Automation (Feb. 2024)
Improving Language Understanding from Screenshots (Feb. 2024)
AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent (Apr. 2024, KDD 2024)
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models (May. 2023, NeurIPS 2023)
You Only Look at Screens: Multimodal Chain-of-Action Agents (Sep. 2023)
Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API (Oct. 2023)
OpenAgents: An Open Platform for Language Agents in the Wild (Oct. 2023)
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant (Oct. 2024)
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (Nov. 2023)
AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024, ACL 2024)
GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024, ICML 2024)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
Dual-View Visual Contextualization for Web Navigation (Feb. 2024, CVPR 2024)
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
Visual Grounding for User Interfaces (NAACL 2024)
ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
Octopus: On-device language model for function calling of software APIs (Apr. 2024)
Octopus v2: On-device language model for super agent (Apr. 2024)
Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (Apr. 2024)
Octopus v4: Graph of language models (Apr. 2024)
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (Apr. 2024)
Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking (Apr. 2024, SIGIR 2024)
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation (Dec. 2023, MobiCom 2024)
Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study (Mar. 2024)
Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
GUI Action Narrator: Where and When Did That Action Take Place? (Jun. 2024)
Identifying User Goals from UI Trajectories (Jun. 2024)
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning (Jun. 2024)
Octo-planner: On-device Language Model for Planner-Action Agents (Jun. 2024)
E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (Jun. 2024)
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
MobileFlow: A Multimodal LLM For Mobile GUI Agent (Jul. 2024)
Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model (Jul. 2024)
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Jul. 2024)
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices (Jul. 2024)
AUITestAgent: Automatic Requirements Oriented GUI Function Testing (Jul. 2024)
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (Aug. 2024)
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Jul. 2024)
AppAgent v2: Advanced Agent for Flexible Mobile Interactions (Aug. 2024)
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions (Aug. 2024)
Agent Workflow Memory (Sep. 2024)
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding (Sep. 2024)
Agent S: An Open Agentic Framework that Uses Computers Like a Human (Oct. 2024)
MobA: A Two-Level Agent System for Efficient Mobile Task Automation (Oct. 2024)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents (Oct. 2024)
Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
AutoGLM: Autonomous Foundation Agents for GUIs (Nov. 2024)
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations (Nov. 2024)
ShowUI: One Vision-Language-Action Model for Generalist GUI Agent (Nov. 2024)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Dec. 2024)
Falcon-UI: Understanding GUI Before Following User Instructions (Dec. 2024)
PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World (Dec. 2024)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (Dec. 2024)
Aria-UI: Visual Grounding for GUI Instructions (Dec. 2024)
CogAgent v2 (Dec. 2024)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (Jan. 2025)
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration (Jan. 2025)
Lightweight Neural App Control (ICLR 2025)
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents (ICLR 2025)
AppVLM: A Lightweight Vision Language Model for Online App Control (Feb. 2025)
GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection (Feb. 2025)
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials (Apr. 2025)
MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users (Feb. 2025)
FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data (Mar. 2025)
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning (Jun. 2025)
EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection (May. 2025)
Test-Time Reinforcement Learning for GUI Grounding via Region Consistency (Aug. 2025)