
MolmoWeb

Paper  |  Blog Post  |  Demo  |  Models  |  Data


MolmoWeb is an open multimodal web agent built by Ai2. Given a natural-language task, MolmoWeb autonomously controls a web browser -- clicking, typing, scrolling, and navigating -- to complete the task. This repository contains the agent code, inference client, evaluation benchmarks, and everything needed to reproduce the results from the paper.

Models

| Model | Parameters | HuggingFace |
| --- | --- | --- |
| MolmoWeb-8B | 8B | allenai/MolmoWeb-8B |
| MolmoWeb-4B | 4B | allenai/MolmoWeb-4B |
| MolmoWeb-8B-Native | 8B | allenai/MolmoWeb-8B-Native |
| MolmoWeb-4B-Native | 4B | allenai/MolmoWeb-4B-Native |

The first two models (MolmoWeb-8B and MolmoWeb-4B) are Hugging Face Transformers-compatible checkpoints (see the example usage on their Hugging Face model pages); the last two (MolmoWeb-8B-Native and MolmoWeb-4B-Native) are molmo-native checkpoints.

Installation

Requires Python 3.10+. We use uv for dependency management.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone git@github.com:allenai/molmoweb.git
cd molmoweb
uv venv
uv sync

# Install Playwright browsers (needed for local browser control)
uv run playwright install --with-deps chromium

Environment Variables

# Browserbase (required when --env_type browserbase)
export BROWSERBASE_API_KEY="your-browserbase-api-key"
export BROWSERBASE_PROJECT_ID="your-browserbase-project-id"

# Google Gemini (required for gemini_cua, gemini_axtree, and Gemini-based judges)
export GOOGLE_API_KEY="your-google-api-key"

# OpenAI (required for gpt_axtree and GPT-based judges like webvoyager)
export OPENAI_API_KEY="your-openai-api-key"
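
Before launching, it can help to verify that the variables for your chosen backend are actually set. A minimal sketch (the backend names and their grouping below simply mirror the exports above and are not part of the repository's API):

```python
import os

# Required variables per backend, mirroring the exports above.
REQUIRED_VARS = {
    "browserbase": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"],
    "gemini": ["GOOGLE_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
}

def missing_vars(backend: str) -> list[str]:
    """Return the required variables for `backend` that are unset or empty."""
    return [v for v in REQUIRED_VARS.get(backend, []) if not os.environ.get(v)]

for backend in REQUIRED_VARS:
    missing = missing_vars(backend)
    if missing:
        print(f"{backend}: missing {', '.join(missing)}")
```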

Quick Start

Three helper scripts in scripts/ let you download weights, start the server, and test it end-to-end.

1. Download the Model

bash scripts/download_weights.sh                                  # MolmoWeb-8B (default)
bash scripts/download_weights.sh allenai/MolmoWeb-4B-Native       # MolmoWeb-4B Native

This downloads the weights to ./checkpoints/<model-name>.

2. Start the Model Server

bash scripts/start_server.sh ./checkpoints/MolmoWeb-8B              # MolmoWeb-8B, port 8001
bash scripts/start_server.sh ./checkpoints/MolmoWeb-4B-Native       # MolmoWeb-4B-Native
bash scripts/start_server.sh ./checkpoints/MolmoWeb-8B 8002         # custom port

Or configure via environment variables:

export CKPT="./checkpoints/MolmoWeb-4B-Native"   # local path to downloaded weights
export PREDICTOR_TYPE="native"             # "native" or "hf"
export NUM_PREDICTORS=1                    # number of GPU workers

bash scripts/start_server.sh

The server exposes a single endpoint:

POST http://127.0.0.1:8001/predict
{
  "prompt": "...",
  "image_base64": "..."
}

Wait for the server to print that the model is loaded, then test it.
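
If you want to script this step rather than watch the logs, one generic option is to poll the port until it accepts connections. A sketch (the host, port, and timeout values are assumptions, not repository defaults beyond what the examples above show):

```python
import socket
import time

def wait_for_server(host: str = "127.0.0.1", port: int = 8001,
                    timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll until the TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Note that a port accepting connections does not guarantee the model weights are loaded; the first `/predict` call may still block while loading finishes.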

3. Test the Model

Once the server is running, send it a screenshot of the Ai2 careers page (included in assets/test_screenshot.png) and ask it to read the job titles:

uv run python scripts/test_server.py                        # default: localhost:8001
uv run python scripts/test_server.py http://myhost:8002     # custom endpoint

The test script sends this prompt to the model:

Read the text on this page. What are the first four job titles listed under 'Open roles'?

You can also do it manually in a few lines of Python:

import base64, requests

with open("assets/test_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://127.0.0.1:8001/predict", json={
    "prompt": "What are the first four job titles listed under 'Open roles'?",
    "image_base64": image_b64,
})
print(resp.json())

Inference Client

The inference package provides a high-level Python client that manages a browser session and runs the agent end-to-end. The client communicates with a running model server endpoint.

Single Query

from inference import MolmoWeb

client = MolmoWeb(
    endpoint="SET_UP_YOUR_ENDPOINT",
    local=True,         # True = local Chromium, False = Browserbase cloud browser
    headless=True,
) 

query = "Go to arxiv.org and find out the paper about Molmo and Pixmo."
traj = client.run(query=query, max_steps=10)

output_path = traj.save_html(query=query)
print(f"Saved to {output_path}")

Follow-up Query

followup_query = "Find the full author list of the paper."
traj2 = client.continue_run(query=followup_query, max_steps=10)

Batch Queries

queries = [
    "Go to allenai.org and find the latest research papers on top of the homepage",
    "Search for 'OLMo' on Wikipedia",
    "What is the weather in Seattle today?",
]

trajectories = client.run_batch(
    queries=queries,
    max_steps=10,
    max_workers=3,
) # Trajectory .html files are saved under inference/htmls by default
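
Conceptually, `run_batch` fans the queries out across `max_workers` parallel sessions. The general pattern can be sketched with a generic runner (the function below is illustrative, not the client's internal implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch_sketch(run_one, queries, max_workers=3):
    """Apply `run_one` to each query on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, queries))
```

`pool.map` returns results in the same order as `queries`, even though execution may interleave across workers.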

Inference Backends

Supported backends: fastapi (remote HTTP endpoint), modal (serverless), native (native molmo/olmo-compatible checkpoint), hf (HuggingFace Transformers-compatible checkpoint).

vLLM support coming soon.
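
If you select backends programmatically, validating the name up front avoids a confusing failure deeper in the stack. A trivial sketch (the guard function is illustrative; only the four backend names come from the list above):

```python
# Backends listed in the README; "vllm" is not yet among them.
SUPPORTED_BACKENDS = {"fastapi", "modal", "native", "hf"}

def validate_backend(name: str) -> str:
    """Raise early on an unknown backend name instead of failing mid-run."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    return name
```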

Extract Accessibility Tree

from inference.client import MolmoWeb

client = MolmoWeb()
axtree_str = client.get_axtree("https://allenai.org/")
print(axtree_str)
client.close()
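
The returned tree is a plain string, so ordinary text processing works on it. For example, assuming one node per line (the exact serialization may differ), you can filter for a given role:

```python
def nodes_with_role(axtree_str: str, role: str) -> list[str]:
    """Return the stripped lines of the serialized tree that mention `role`."""
    return [line.strip() for line in axtree_str.splitlines() if role in line]
```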

License

Apache 2.0. See LICENSE for details.

TODO

  • Inference
  • Eval
  • Training