webact - token-efficient browser control for AI agents

A highly token-efficient tool for controlling any Chromium-based browser via the Chrome DevTools Protocol (CDP). Ships as a Rust binary with zero runtime dependencies. Works as an MCP server with Claude Code, Claude Desktop, Cursor, Codex, Windsurf, Cline, ChatGPT Desktop, and any MCP-compatible client. Also works as a CLI skill with Claude Code, Cursor, Codex, Windsurf, Cline, Copilot, OpenCode, Goose, and any tool supporting the Agent Skills spec.

No Playwright, no browser automation frameworks. Raw CDP over WebSocket.

Install

MCP Server (recommended)

curl -fsSL https://raw.githubusercontent.com/kilospark/webact/main/install.sh | sh

Downloads the webact-mcp binary and auto-configures any detected MCP clients (Claude Desktop, Claude Code, ChatGPT Desktop, Cursor, Windsurf, Cline, Codex).

Agent Skill

npx skills add kilospark/webact

Works with Claude Code, Cursor, Codex, Windsurf, Cline, Copilot, OpenCode, Goose, and 40+ agents. Powered by Vercel's skills CLI.

Manual MCP config

{
  "mcpServers": {
    "webact": {
      "command": "webact-mcp"
    }
  }
}

For Claude Code:

claude mcp add webact webact-mcp

Usage

Just tell your agent what you want:

check the top stories on Hacker News
navigate to github.com and show my notifications
search google for "best restaurants near me"

Or describe any goal - the agent will figure out the steps.

How it works

The agent follows a perceive-act loop:

1. Plan - break the goal into steps
2. Act - navigate, click, type via CDP commands
3. Perceive - read the page to see what happened
4. Decide - adapt, continue, or report results
5. Repeat - until the goal is done
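
As a rough illustration, one act/perceive/decide pass can be sketched with the CLI commands documented below. The function name `act_perceive_decide` and the keyword-based "done" check are hypothetical stand-ins; in real use the agent plans and decides for itself rather than following a fixed script. The sketch assumes `webact launch` has already started a session.

```shell
# Hypothetical helper: one act -> perceive -> decide pass.
# Prints "done" (exit 0) or "continue" (exit 1); exit 2 means webact failed.
act_perceive_decide() {
  url=$1      # where to go (Act)
  keyword=$2  # toy stand-in for "is the goal met?" (Decide)

  # Act: navigate to the page (the auto-printed brief is discarded here).
  webact navigate "$url" >/dev/null || return 2

  # Perceive: read the main content as clean text.
  page=$(webact read) || return 2

  # Decide: done if the keyword appears, otherwise another step is needed.
  if printf '%s\n' "$page" | grep -qiF -- "$keyword"; then
    echo "done"
  else
    echo "continue"
    return 1
  fi
}
```

A caller would repeat this with a different action (scroll, click, another URL) for as long as it prints `continue`.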

Reading the page

webact provides multiple ways to read page content, each optimized for different needs:

| Need | Tool | Output |
| --- | --- | --- |
| Page content (articles, docs) | `read` | Clean text, no UI chrome |
| Full page + interaction targets | `text` | Text + numbered refs |
| Interactive elements only | `axtree -i` | Flat list of clickable/typeable elements |
| HTML structure/selectors | `dom` | Compact HTML |
| Visual layout | `screenshot` | PNG image |

`read` strips navigation, sidebars, and ads, and returns just the main content as clean text with headings, lists, and paragraphs. Best for articles, docs, search results, and information retrieval.

`text` shows the full page in reading order, interleaving static text with interactive elements (numbered refs), much like a screen-reader view. It generates a ref map, so you can immediately run `click 12` or `type 3 hello`.
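
The need-to-tool mapping above could be wrapped in a small dispatcher like the one below. This is only a sketch: the function name `perceive` and the need labels (`content`, `full`, and so on) are invented for illustration, while the `webact` subcommands are the ones listed in the CLI section.

```shell
# Hypothetical dispatcher: map an information need to the documented
# webact command that serves it. Extra arguments (e.g. a selector) are
# passed through to the commands that accept one.
perceive() {
  need=$1
  shift
  case "$need" in
    content)     webact read "$@" ;;    # clean text, no UI chrome
    full)        webact text "$@" ;;    # text + numbered refs
    interactive) webact axtree -i ;;    # flat list of clickable/typeable elements
    structure)   webact dom "$@" ;;     # compact HTML
    visual)      webact screenshot ;;   # PNG image
    *) echo "unknown need: $need" >&2; return 64 ;;
  esac
}
```

For example, `perceive content article` would run `webact read article`, scoping the reader-mode output to a selector.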

Sessions

Each agent invocation gets its own session with isolated tab tracking. On launch, a unique session ID is generated and a fresh Chrome tab is created for that session.

- Multiple agents can work side by side in the same Chrome instance
- Each session only sees and controls its own tabs

CLI

The webact CLI wraps CDP:

webact launch                  # Start browser, create session
webact navigate <url>          # Go to a URL (auto-dismisses cookie banners)
webact read [selector]         # Reader-mode text extraction (strips nav/sidebar/ads)
webact text [selector]         # Full page in reading order with interactive refs
webact dom [selector]          # Get compact DOM HTML
webact dom --tokens=N          # Truncate DOM to ~N tokens
webact axtree                  # Get accessibility tree (auto-capped at ~4k tokens)
webact axtree -i               # Interactive elements with ref numbers
webact axtree -i --diff        # Show only changes since last snapshot
webact observe                 # Interactive elements as ready-to-use commands
webact find <query>            # Find element by description
webact screenshot              # Capture screenshot
webact pdf [path]              # Save page as PDF
webact click <sel|x,y|--text>  # Click by selector, coordinates, or text match
webact doubleclick <sel>       # Double-click
webact rightclick <sel>        # Right-click (context menu)
webact hover <sel>             # Hover (tooltips/menus)
webact focus <selector>        # Focus an element without clicking
webact clear <selector>        # Clear an input field
webact type <selector> <text>  # Type into an input (focuses first)
webact keyboard <text>         # Type at current caret position (no selector)
webact paste <text>            # Paste via clipboard event (for rich editors)
webact select <sel> <value>    # Select option(s) from a dropdown
webact upload <sel> <file>     # Upload file(s) to a file input
webact humanclick <sel>        # Click with human-like mouse movement
webact humantype <sel> <text>  # Type with variable delays
webact drag <from> <to>        # Drag from one selector to another
webact dialog accept|dismiss   # Handle alert/confirm/prompt dialogs
webact waitfor <sel> [ms]      # Wait for element to appear (default 5s)
webact waitfornav [ms]         # Wait for navigation to complete (default 10s)
webact press <key>             # Press a key or combo (Enter, Ctrl+A, Meta+C)
webact scroll <target> [px]    # Scroll: up, down, top, bottom, or selector
webact eval <js>               # Run JavaScript in page context
webact cookies                 # List cookies for current page
webact cookies set <n> <v>     # Set a cookie
webact cookies delete <name>   # Delete a cookie
webact cookies clear           # Clear all cookies
webact console                 # Show recent console output
webact console errors          # Show only JS errors
webact block <pattern>         # Block requests: images, css, fonts, media, scripts, or URL
webact block --ads             # Block ads, analytics, and tracking (40+ patterns)
webact block off               # Disable request blocking
webact viewport <preset|w h>   # Set viewport (mobile, tablet, desktop, iphone, ipad)
webact frames                  # List all frames/iframes
webact frame <id|sel>          # Switch to a frame
webact frame main              # Return to main frame
webact tabs                    # List this session's tabs
webact tab <id>                # Switch to a session-owned tab
webact newtab [url]            # Open a new tab in this session
webact close                   # Close current tab
webact search <query>          # Search the web (Google, Bing, DuckDuckGo, or custom)
webact readurls <url1> <url2>  # Read multiple URLs in parallel
webact back / forward / reload # Navigation history
webact activate                # Bring browser window to front (macOS)
webact minimize                # Minimize browser window (macOS)

Ref-based targeting: After `axtree -i`, `observe`, or `text`, use the ref numbers directly as selectors, for example `click 1` or `type 3 hello`. Refs are cached per URL.
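
A minimal sketch of ref-based targeting, assuming a listing from `webact axtree -i` (or `observe` / `text`) has already been printed for the current page. The helper name `fill_and_submit` and the ref numbers in the example are hypothetical; real refs come from that listing.

```shell
# Hypothetical helper: type into one ref and click another.
# Refs are the numbers shown by a prior `webact axtree -i`, `observe`, or `text`.
fill_and_submit() {
  field_ref=$1
  value=$2
  button_ref=$3

  # Refs are plain numbers; reject anything else before touching the page.
  for ref in "$field_ref" "$button_ref"; do
    case "$ref" in
      ''|*[!0-9]*) echo "refs must be numeric: '$ref'" >&2; return 64 ;;
    esac
  done

  # Type into the field, then click the button only if typing succeeded.
  webact type "$field_ref" "$value" &&
  webact click "$button_ref"
}

# Example (ref numbers are illustrative):
#   webact axtree -i
#   fill_and_submit 3 "hello" 1
```

Because refs are tied to the page they were generated on, it is probably safest to re-list them after any navigation before reusing the numbers.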

Token Stats

Each command is designed to minimize token usage while giving the agent enough context to decide its next step.

| Command | webact output | Playwright equivalent | Savings |
| --- | --- | --- | --- |
| brief (auto) | ~200 chars | No equivalent - `page.content()` returns ~50k-500k chars | ~99% |
| read | ~1k-4k chars (clean text) | No equivalent - manual extraction needed | - |
| text | ~1k-4k chars (text + refs) | `page.accessibility.snapshot()` ~10k-50k chars | ~90% |
| dom | ~1k-4k chars (compact HTML) | `page.content()` ~50k-500k chars (full raw HTML) | ~95% |
| axtree -i | ~500-1.5k chars (flat list) | `page.accessibility.snapshot()` ~10k-50k chars | ~95% |

Recommended flow for minimal token usage:

1. State-changing commands auto-print the brief (~200 chars), which is often enough to decide the next step.
2. Need to read page content? Use `read`: it strips UI chrome and returns clean text.
3. Need to see everything and interact? Use `text`: the full page with refs.
4. Need just interactive elements? Use `axtree -i` (~500 tokens).
5. Need HTML structure? Use `dom` with a selector to scope the output.
6. Reserve `screenshot` for visual-heavy pages where text extraction is insufficient.
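
The "cheapest output first" idea in this flow can be sketched as an escalation: check the brief, then `read`, then `text`, stopping at the first view that contains what you are looking for. The helper names `find_cheaply` and `_contains` and the `[brief]` / `[read]` / `[text]` labels are hypothetical, and the substring match is a stand-in for whatever judgment the agent actually applies.

```shell
# True if the text in $1 contains the fixed string $2 (case-insensitive).
_contains() { printf '%s\n' "$1" | grep -qiF -- "$2"; }

# Hypothetical helper: look for PATTERN at URL using the smallest output
# that contains it. Prints a label line, then the output that matched.
find_cheaply() {
  url=$1
  pattern=$2

  # 1. navigate is state-changing, so its output is the auto-printed brief.
  out=$(webact navigate "$url") || return 2
  if _contains "$out" "$pattern"; then printf '[brief]\n%s\n' "$out"; return 0; fi

  # 2. Reader-mode text: main content without UI chrome.
  out=$(webact read) || return 2
  if _contains "$out" "$pattern"; then printf '[read]\n%s\n' "$out"; return 0; fi

  # 3. Full page in reading order with interactive refs.
  out=$(webact text) || return 2
  if _contains "$out" "$pattern"; then printf '[text]\n%s\n' "$out"; return 0; fi

  return 1  # not found in any text view; a screenshot may be the next resort
}
```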

vs. Playwright-based tools

Several tools give AI agents browser control on top of Playwright: agent-browser (Vercel), Playwright MCP (Microsoft), Stagehand (Browserbase), and Browser Use.

|  | webact | Playwright-based tools |
| --- | --- | --- |
| What it is | Rust binary - MCP server + CLI | CLI / MCP server / SDK wrapping Playwright |
| Architecture | Direct CDP WebSocket to your Chrome | CLI/SDK → IPC → Playwright → bundled Chromium |
| Install size | Single binary, zero deps | ~200 MB+ (node_modules + Chromium download) |
| Uses your browser | Yes - your Chrome, your cookies, your logins | No - launches bundled Chromium with clean state |
| User agent | Your real Chrome user agent | Modified Playwright/Chromium UA - detectable |
| Headed mode | Always - you see what the agent sees | Headless by default |

Token comparison (same pages, measured output)

| Scenario | webact | Playwright-based* | Savings |
| --- | --- | --- | --- |
| Navigate + see page | `navigate` = 186 chars | `open` + `snapshot -i` = 7,974 chars | 98% |
| Navigate + see page | `navigate` = 756 chars | `open` + `snapshot -i` = 8,486 chars | 91% |
| Full page read | `read` = ~3,000 chars | No equivalent (manual extraction) | - |
| Full page + refs | `text` = ~4,000 chars | `snapshot` = 104,890 chars | 96% |
| Interactive elements | `axtree -i` = 5,997 chars | `snapshot -i` = 7,901 chars | 24% |
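
The Savings column appears to be the fraction of characters avoided relative to the Playwright-based output, rounded to a whole percent. A small sketch for reproducing it (the function name `savings_pct` is hypothetical, and the ~4,000 figure for `text` is approximate in the table itself):

```shell
# Percentage of characters saved relative to the baseline, rounded to the
# nearest whole percent using integer arithmetic only.
# Usage: savings_pct <webact_chars> <baseline_chars>
savings_pct() {
  ours=$1
  baseline=$2
  echo $(( ((baseline - ours) * 100 + baseline / 2) / baseline ))
}

# Character counts taken from the table above (commas removed):
#   savings_pct 186 7974     # -> 98
#   savings_pct 756 8486     # -> 91
#   savings_pct 4000 104890  # -> 96
#   savings_pct 5997 7901    # -> 24
```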

Build from source

git clone https://github.com/kilospark/webact.git
cd webact
cargo build --release
# Binaries: target/release/webact (CLI), target/release/webact-mcp (MCP server)

Requirements

- Any Chromium-based browser: Google Chrome, Microsoft Edge, Brave, Arc, Vivaldi, Opera, or Chromium
- No runtime dependencies (single Rust binary)

The browser is auto-detected on macOS, Linux, Windows, and WSL. Set CHROME_PATH to override.
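
A minimal sketch of the override, assuming CHROME_PATH is read as an environment variable when `webact launch` runs (the README only says to "set" it). The helper name `launch_with` and the example path are hypothetical.

```shell
# Hypothetical helper: launch webact against a specific browser binary.
# Assumes CHROME_PATH is picked up from the environment by `webact launch`.
launch_with() {
  browser=$1

  # Fail early if the path does not point at an executable file.
  if [ ! -x "$browser" ]; then
    echo "not an executable: $browser" >&2
    return 1
  fi

  # Set CHROME_PATH for this one invocation only.
  CHROME_PATH=$browser webact launch
}

# Example (path is illustrative; adjust for your system):
#   launch_with "/usr/bin/brave-browser"
```

To make the override persistent instead, `export CHROME_PATH=...` in your shell profile would be the usual approach under the same assumption.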

License

MIT