Android-MCP: Bridging AI Agents and Android Devices
We've been working on Android-MCP, a lightweight, open-source bridge designed to enable AI agents (specifically large language models) to interact with Android devices. The goal is to allow LLMs to perform real-world tasks like app navigation, UI interaction, and automated QA testing without relying on traditional computer vision pipelines or pre-programmed scripts.
The core idea is to leverage ADB and the Android Accessibility API for native interaction with UI elements. This means an LLM can launch apps, tap, swipe, input text, and read view hierarchies directly. A key feature is that it works with any language model, with vision being optional – there's no need for fine-tuned computer vision models or OCR.
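To give a concrete sense of what this looks like at the ADB level, here is a minimal sketch (not the project's actual code; the helper names are illustrative) of driving taps, swipes, text input, and view-hierarchy reads through the standard adb shell input and uiautomator commands:

```python
import subprocess
import xml.etree.ElementTree as ET


def adb(*args: str) -> str:
    """Run an adb command and return its stdout."""
    result = subprocess.run(["adb", *args], capture_output=True, text=True, check=True)
    return result.stdout


def tap(x: int, y: int) -> None:
    # Simulate a tap at screen coordinates (x, y).
    adb("shell", "input", "tap", str(x), str(y))


def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    # Simulate a swipe gesture lasting duration_ms milliseconds.
    adb("shell", "input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))


def type_text(text: str) -> None:
    # Type into the focused field; adb's `input text` expects spaces encoded as %s.
    adb("shell", "input", "text", text.replace(" ", "%s"))


def view_hierarchy() -> ET.Element:
    # Dump the current view hierarchy with uiautomator and parse the XML,
    # giving the model a structured picture of what is on screen.
    adb("shell", "uiautomator", "dump", "/sdcard/window_dump.xml")
    return ET.fromstring(adb("shell", "cat", "/sdcard/window_dump.xml"))
```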
Android-MCP operates as a Model Context Protocol (MCP) server and offers a rich toolset for mobile automation, including pre-built tools for gestures, keystrokes, capturing device state, and accessing notifications. We've observed typical latency between consecutive actions (e.g., two taps) of 2 to 5 seconds, depending on device specifications and load.
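As a rough sketch of how such tools can be exposed over MCP, assuming the official Python MCP SDK's FastMCP server and using illustrative tool names rather than the project's actual ones:

```python
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("android-mcp-sketch")


def adb(*args: str) -> str:
    """Run an adb command and return its stdout."""
    return subprocess.run(["adb", *args], capture_output=True, text=True, check=True).stdout


@mcp.tool()
def tap(x: int, y: int) -> str:
    """Tap the device screen at the given pixel coordinates."""
    adb("shell", "input", "tap", str(x), str(y))
    return f"Tapped ({x}, {y})"


@mcp.tool()
def get_state() -> str:
    """Return the current view hierarchy as XML so the model can ground its next action."""
    adb("shell", "uiautomator", "dump", "/sdcard/window_dump.xml")
    return adb("shell", "cat", "/sdcard/window_dump.xml")


if __name__ == "__main__":
    # Serve over stdio so any MCP-capable client can call these tools.
    mcp.run()
```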
It supports Android 10+ and is built with Python 3.10+. The project is licensed under the MIT License, and contributions are welcome.
You can find more details, installation instructions, and the source code here: https://github.com/CursorTouch/Android-MCP
We're interested to hear thoughts on how this kind of direct interaction could be applied in various scenarios, particularly in areas like automated testing or accessibility enhancements for LLM-driven applications.

This is actually a really cool direction; using LLMs to interact directly with Android UIs could solve the brittleness problem that's been killing traditional automation. Just telling it "navigate to settings and enable dark mode" instead of writing fragile selectors... that's the dream :D But the current implementation has some issues that make it tough for real use. The 2-5 second latency per action is brutal; a simple login flow would take forever compared to traditional automation. The bigger issue is reliability: how do you actually verify the LLM did what you asked versus what it thinks it did? With normal automation you get assertions and can inspect elements; here you're kind of flying blind. Also, "vision optional" makes me think it's not great at understanding complex UIs yet, which undercuts the main selling point. That said, this feels like where things are headed long term. As LLMs get faster and better at visual understanding, this approach could eventually beat traditional automation for maintainability. It's just not quite ready for production yet.

Yes, you're right: at the current level, this is just an MCP server. Next, we're going to build an agent on top of it, where vision will be a must-have capability, as you mention, to understand complex UIs, and we'll also run a validation step after each action.
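To make the validation idea concrete, here is a minimal sketch of one way a post-step check could work (this is not the project's implementation; the helper names and the polling strategy are assumptions): after each action, re-dump the view hierarchy over ADB and assert that an expected marker has appeared before moving on.

```python
import subprocess
import time


def adb(*args: str) -> str:
    """Run an adb command and return its stdout."""
    return subprocess.run(["adb", *args], capture_output=True, text=True, check=True).stdout


def ui_dump() -> str:
    # Re-read the view hierarchy after an action has been performed.
    adb("shell", "uiautomator", "dump", "/sdcard/window_dump.xml")
    return adb("shell", "cat", "/sdcard/window_dump.xml")


def validate_step(expected_marker: str, timeout_s: float = 5.0) -> bool:
    """Poll the UI until expected_marker appears in the hierarchy dump, or give up.

    A passing check tells the agent (or a test harness) that the action had the
    intended effect; a failure can trigger a retry or abort the run.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if expected_marker in ui_dump():
            return True
        time.sleep(0.5)
    return False


# Example: after asking the agent to open the display settings, check that a
# node labelled "Dark theme" is now visible before issuing the next action.
# assert validate_step('text="Dark theme"')
```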