GitHub - showlab/Awesome-GUI-Agent: 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.

A curated list of papers, projects, and resources for multi-modal Graphical User Interface (GUI) agents.

🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.

🤖 Try our Awesome-Paper-Agent. Provide an arXiv URL, and it returns a formatted entry like this:

User:
https://arxiv.org/abs/2312.13108

GPT:
+ [AssistGUI: Task-Oriented Desktop Graphical User Interface Automation](https://arxiv.org/abs/2312.13108) (Dec. 2023)

  [![Star](https://img.shields.io/github/stars/showlab/assistgui.svg?style=social&label=Star)](https://github.com/showlab/assistgui)
  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.13108)
  [![Website](https://img.shields.io/badge/Website-9cf)](https://showlab.github.io/assistgui/)

You can then copy this entry directly into your pull request.
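
The entry format shown above can also be assembled by hand. The following is a minimal sketch, not the Awesome-Paper-Agent itself: the function name `format_entry` and its parameters are hypothetical, and the metadata is supplied by the caller rather than fetched from arXiv. The badge URLs simply follow the pattern in the example above.

```python
from typing import Optional


def format_entry(
    title: str,
    arxiv_url: str,
    date: str,
    repo: Optional[str] = None,      # hypothetical parameter: "owner/name" on GitHub
    website: Optional[str] = None,   # hypothetical parameter: project page URL
) -> str:
    """Build a list entry in the same layout as the example above."""
    # First line: linked title followed by the date, then a blank line.
    lines = [f"+ [{title}]({arxiv_url}) ({date})", ""]

    # Optional GitHub star badge, linking to the repository.
    if repo:
        lines.append(
            f"  [![Star](https://img.shields.io/github/stars/{repo}.svg"
            f"?style=social&label=Star)](https://github.com/{repo})"
        )

    # The arXiv badge is always present.
    lines.append(
        f"  [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)]({arxiv_url})"
    )

    # Optional project website badge.
    if website:
        lines.append(
            f"  [![Website](https://img.shields.io/badge/Website-9cf)]({website})"
        )

    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_entry(
            "AssistGUI: Task-Oriented Desktop Graphical User Interface Automation",
            "https://arxiv.org/abs/2312.13108",
            "Dec. 2023",
            repo="showlab/assistgui",
            website="https://showlab.github.io/assistgui/",
        )
    )
```

Passing `repo` and `website` only when they exist keeps entries for papers without code or a project page limited to the arXiv badge.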

⭐ If you find this repository useful, please give it a star.

  • World of Bits: An Open-Domain Platform for Web-Based Agents (Aug. 2017, ICML 2017)

    arXiv

  • A Unified Solution for Structured Web Data Extraction (Jul. 2011, SIGIR 2011)

    arXiv

  • Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)

    arXiv

  • Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration (Feb. 2018, ICLR 2018)

    Star arXiv

  • Mapping Natural Language Instructions to Mobile UI Action Sequences (May. 2020, ACL 2020)

    Star arXiv

  • WebSRC: A Dataset for Web-Based Structural Reading Comprehension (Jan. 2021, EMNLP 2021)

    arXiv Website

  • AndroidEnv: A Reinforcement Learning Platform for Android (May. 2021)

    Star arXiv Website

  • A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility (Feb. 2022)

    arXiv

  • META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (May. 2022)

    arXiv Website

  • WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (Jul. 2022)

    Star arXiv Website

  • Language Models can Solve Computer Tasks (Mar. 2023)

    Star arXiv Website

  • Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May. 2023)

    arXiv GitHub

  • Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)

    Star arXiv Website

  • Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)

    Star arXiv

  • WebArena: A Realistic Web Environment for Building Autonomous Agents (Jul. 2023)

    Star arXiv Website

  • Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models (Nov. 2023)

    Star arXiv Website

  • AssistGUI: Task-Oriented Desktop Graphical User Interface Automation (Dec. 2023, CVPR 2024)

    Star arXiv Website

  • VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024, ACL 2024)

    Star arXiv Website

  • OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (Feb. 2024)

    arXiv

  • WebLINX: Real-World Website Navigation with Multi-Turn Dialogue (Feb. 2024)

    Star arXiv Website

  • On the Multi-turn Instruction Following for Conversational Web Agents (Feb. 2024)

    Star arXiv

  • AgentStudio: A Toolkit for Building General Virtual Agents (Mar. 2024)

    Star arXiv Website

  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)

    Star arXiv Website

  • Benchmarking Mobile Device Control Agents across Diverse Configurations (Apr. 2024, ICLR 2024)

    Star arXiv

  • MMInA: Benchmarking Multihop Multimodal Internet Agents (Apr. 2024)

    Star arXiv Website

  • Autonomous Evaluation and Refinement of Digital Agents (Apr. 2024)

    Star arXiv

  • LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation (Apr. 2024)

    Star arXiv

  • VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? (Apr. 2024)

    arXiv

  • GUICourse: From General Vision Language Models to Versatile GUI Agents (Jun. 2024)

    Star arXiv

  • GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents (Jun. 2024)

    Star arXiv Website

  • GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices (Jun. 2024)

    Star arXiv

  • VideoGUI: A Benchmark for GUI Automation from Instructional Videos (Jun. 2024)

    Star arXiv Website

  • Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding (Jun. 2024)

    Star arXiv Website

  • MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)

    Star arXiv Website

  • AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Jun. 2024)

    Star arXiv

  • Practical, Automated Scenario-based Mobile App Testing (Jun. 2024)

    arXiv

  • WebCanvas: Benchmarking Web Agents in Online Environments (Jun. 2024)

    arXiv Website

  • On the Effects of Data Scale on Computer Control Agents (Jun. 2024)

    Star arXiv

  • CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Jul. 2024)

    Star arXiv

  • WebVLN: Vision-and-Language Navigation on Websites (AAAI 2024)

    Star arXiv

  • Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)

    Star arXiv Website

  • AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents

    Star arXiv Website

  • Windows Agent Arena

    Star Website PDF

  • Harnessing Webpage UIs for Text-Rich Visual Understanding (Oct. 2024)

    Star arXiv Website

  • GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent (Dec. 2024)

    Star arXiv

  • A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)

    arXiv Website

  • ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

    Star Website PDF

  • WebWalker: Benchmarking LLMs in Web Traversal

    Star Website PDF

  • SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation (ICLR 2025)

    arXiv Website

  • WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation (Feb. 2025)

    Star arXiv Website

  • LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (Apr. 2025)

    Star arXiv Website

  • ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (May. 2025)

    Star arXiv Website

  • MONDAY: Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents (May. 2025, CVPR 2025)

    Star arXiv Website

  • Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System (Jun. 2025)

    arXiv Website

  • Grounding Open-Domain Instructions to Automate Web Support Tasks (Mar. 2021)

    arXiv

  • Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning (Aug. 2021)

    arXiv

  • A Data-Driven Approach for Learning to Control Computers (Feb. 2022)

    arXiv

  • Augmenting Autotelic Agents with Large Language Models (May. 2023)

    arXiv

  • Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control (Jun. 2023, ICLR 2024)

    Star arXiv

  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (Jul. 2023, ICLR 2024)

    arXiv

  • LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)

    arXiv

  • CogAgent: A Visual Language Model for GUI Agents (Dec. 2023, CVPR 2024)

    Star arXiv

  • WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    Star arXiv

  • OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)

    Star arXiv Website

  • UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)

    Star arXiv Website

  • Comprehensive Cognitive LLM Agent for Smartphone GUI Automation (Feb. 2024)

    arXiv

  • Improving Language Understanding from Screenshots (Feb. 2024)

    arXiv

  • AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent (Apr. 2024, KDD 2024)

    Star arXiv

  • SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models (May. 2023, NeurIPS 2023)

    Star arXiv Website

  • You Only Look at Screens: Multimodal Chain-of-Action Agents (Sep. 2023)

    Star arXiv

  • Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API (Oct. 2023)

    arXiv

  • OpenAgents: An Open Platform for Language Agents in the Wild (Oct. 2023)

    Star arXiv

  • AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant (Oct. 2024)

    Star arXiv Website

  • GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation (Nov. 2023)

    Star arXiv

  • AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)

    Star arXiv Website

  • SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024, ACL 2024)

    Star arXiv

  • GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024, ICML 2024)

    Star arXiv Website

  • Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)

    arXiv

  • Dual-View Visual Contextualization for Web Navigation (Feb. 2024, CVPR 2024)

    arXiv

  • DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)

    Star arXiv Website

  • Visual Grounding for User Interfaces (NAACL 2024)

    arXiv

  • ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)

    Star arXiv Website

  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)

    arXiv

  • Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)

    Star arXiv

  • Octopus: On-device language model for function calling of software APIs (Apr. 2024)

    arXiv

  • Octopus v2: On-device language model for super agent (Apr. 2024)

    arXiv

  • Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent (Apr. 2024)

    arXiv Website

  • Octopus v4: Graph of language models (Apr. 2024)

    arXiv

  • Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (Apr. 2024)

    arXiv

  • Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking (Apr. 2024, SIGIR 2024)

    arXiv

  • AutoDroid: LLM-powered Task Automation in Android

    arXiv

  • Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation (Dec. 2023, MobiCom 2024)

    arXiv Website

  • Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study (Mar. 2024)

    Star arXiv Website

  • Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)

    Star arXiv

  • Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning (May. 2024)

    arXiv

  • GUI Action Narrator: Where and When Did That Action Take Place? (Jun. 2024)

    Star arXiv Website

  • Identifying User Goals from UI Trajectories (Jun. 2024)

    arXiv

  • VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning (Jun. 2024)

    arXiv

  • Octo-planner: On-device Language Model for Planner-Action Agents (Jun. 2024)

    arXiv Website

  • E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion (Jun. 2024)

    arXiv

  • Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)

    Star arXiv

  • MobileFlow: A Multimodal LLM For Mobile GUI Agent (Jul. 2024)

    arXiv

  • Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model (Jul. 2024)

    arXiv

  • Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Jul. 2024)

    Star arXiv

  • MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices (Jul. 2024)

    arXiv

  • AUITestAgent: Automatic Requirements Oriented GUI Function Testing (Jul. 2024)

    Star arXiv

  • Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)

    Star arXiv

  • OmniParser for Pure Vision Based GUI Agent (Aug. 2024)

    arXiv

  • VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents (Aug. 2024)

    Star arXiv Website

  • Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)

    arXiv Website

  • MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Jul. 2024)

    Star arXiv Website

  • AppAgent v2: Advanced Agent for Flexible Mobile Interactions (Aug. 2024)

    arXiv

  • Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions (Aug. 2024)

    arXiv

  • Agent Workflow Memory (Sep. 2024)

    Star arXiv

  • MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding (Sep. 2024)

    arXiv

  • Agent S: An Open Agentic Framework that Uses Computers Like a Human (Oct. 2024)

    Star arXiv

  • MobA: A Two-Level Agent System for Efficient Mobile Task Automation (Oct. 2024)

    Star arXiv Dataset

  • Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)

    Star Website arXiv

  • OS-ATLAS: A Foundation Action Model For Generalist GUI Agents (Oct. 2024)

    Star arXiv Website Dataset

  • Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)

    Star arXiv

  • AutoGLM: Autonomous Foundation Agents for GUIs (Nov. 2024)

    Star arXiv

  • AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations (Nov. 2024)

    arXiv

  • ShowUI: One Vision-Language-Action Model for Generalist GUI Agent (Nov. 2024)

    Star arXiv

  • Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Dec. 2024)

    Website Star arXiv

  • Falcon-UI: Understanding GUI Before Following User Instructions (Dec. 2024)

    arXiv

  • PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World (Dec. 2024)

    Star arXiv Website

  • Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (Dec. 2024)

    arXiv

  • Aria-UI: Visual Grounding for GUI Instructions (Dec. 2024)

    Star arXiv Website Dataset

  • CogAgent v2 (Dec. 2024)

    Star

  • OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)

    Star arXiv Website

  • InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (Jan. 2025)

    Star arXiv

  • GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration (Jan. 2025)

    arXiv Website

  • Lightweight Neural App Control (ICLR 2025)

    arXiv

  • DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents (ICLR 2025)

    arXiv Website

  • AppVLM: A Lightweight Vision Language Model for Online App Control (Feb. 2025)

    arXiv

  • VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning (Feb. 2025)

    arXiv

  • GUI-Thinker: A Basic yet Comprehensive GUI Agent Developed with Self-Reflection (Feb. 2025)

    Star arXiv Website

  • TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials (Apr. 2025)

    Star arXiv Website

  • MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users (Feb. 2025)

    arXiv

  • FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data (Mar. 2025)

    Star arXiv Website

  • AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning (Jun. 2025)

    Star arXiv Website

  • EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection (May. 2025)

    arXiv

  • Test-Time Reinforcement Learning for GUI Grounding via Region Consistency (Aug. 2025)

    Star arXiv Website