Paper | Blog Post | Demo | Models | Data
MolmoWeb is an open multimodal web agent built by Ai2. Given a natural-language task, MolmoWeb autonomously controls a web browser -- clicking, typing, scrolling, and navigating -- to complete the task. This repository contains the agent code, inference client, evaluation benchmarks, and everything needed to reproduce the results from the paper.
Models
| Model | Parameters | HuggingFace |
|---|---|---|
| MolmoWeb-8B | 8B | allenai/MolmoWeb-8B |
| MolmoWeb-4B | 4B | allenai/MolmoWeb-4B |
| MolmoWeb-8B-Native | 8B | allenai/MolmoWeb-8B-Native |
| MolmoWeb-4B-Native | 4B | allenai/MolmoWeb-4B-Native |
The first two models (MolmoWeb-8B and MolmoWeb-4B) are HuggingFace Transformers-compatible (see the example usage on their Hugging Face model cards); the last two (MolmoWeb-8B-Native and MolmoWeb-4B-Native) are Molmo-native checkpoints.
Installation
Requires Python 3.10+. We use uv for dependency management.
```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone git@github.com:allenai/molmoweb.git
cd molmoweb
uv venv
uv sync

# Install Playwright browsers (needed for local browser control)
uv run playwright install
uv run playwright install --with-deps chromium
```
Environment Variables
```bash
# Browserbase (required when --env_type browserbase)
export BROWSERBASE_API_KEY="your-browserbase-api-key"
export BROWSERBASE_PROJECT_ID="your-browserbase-project-id"

# Google Gemini (required for gemini_cua, gemini_axtree, and Gemini-based judges)
export GOOGLE_API_KEY="your-google-api-key"

# OpenAI (required for gpt_axtree and GPT-based judges like webvoyager)
export OPENAI_API_KEY="your-openai-api-key"
```
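Only the keys for the components you actually use need to be set. A quick sanity check before launching a run can save a failed session halfway through; the grouping below just mirrors the list above (it is not a helper shipped with the repo):

```python
import os

# Keys required by each component, per the environment-variable list above.
REQUIRED_KEYS = {
    "browserbase": ["BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID"],
    "gemini": ["GOOGLE_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
}

def missing_keys(component: str) -> list[str]:
    """Return the environment variables for `component` that are not set."""
    return [k for k in REQUIRED_KEYS[component] if not os.environ.get(k)]

for component in REQUIRED_KEYS:
    absent = missing_keys(component)
    if absent:
        print(f"{component}: missing {', '.join(absent)}")
```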
Quick Start
Three helper scripts in scripts/ let you download weights, start the server, and test it end-to-end.
1. Download the Model
```bash
bash scripts/download_weights.sh                             # MolmoWeb-8B (default)
bash scripts/download_weights.sh allenai/MolmoWeb-4B-Native  # MolmoWeb-4B Native
```
This downloads the weights to ./checkpoints/<model-name>.
2. Start the Model Server
```bash
bash scripts/start_server.sh ./checkpoints/MolmoWeb-8B         # MolmoWeb-8B, port 8001
bash scripts/start_server.sh ./checkpoints/MolmoWeb-4B-Native  # MolmoWeb-4B
bash scripts/start_server.sh ./checkpoints/MolmoWeb-8B 8002    # custom port
```
Or configure via environment variables:
```bash
export CKPT="./checkpoints/MolmoWeb-4B-Native"  # local path to downloaded weights
export PREDICTOR_TYPE="native"                  # "native" or "hf"
export NUM_PREDICTORS=1                         # number of GPU workers
bash scripts/start_server.sh
```
The server exposes a single endpoint:
```
POST http://127.0.0.1:8001/predict
{
  "prompt": "...",
  "image_base64": "..."
}
```
Wait for the server to print that the model is loaded, then test it.
3. Test the Model
Once the server is running, send it a screenshot of the Ai2 careers page (included in assets/test_screenshot.png) and ask it to read the job titles:
```bash
uv run python scripts/test_server.py                    # default: localhost:8001
uv run python scripts/test_server.py http://myhost:8002 # custom endpoint
```
The test script sends this prompt to the model:
Read the text on this page. What are the first four job titles listed under 'Open roles'?
You can also do it manually in a few lines of Python:
```python
import base64, requests

with open("assets/test_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post("http://127.0.0.1:8001/predict", json={
    "prompt": "What are the first four job titles listed under 'Open roles'?",
    "image_base64": image_b64,
})
print(resp.json())
```
Inference Client
The inference package provides a high-level Python client that manages a browser session and runs the agent end-to-end. The client communicates with a running model server endpoint.
Single Query
```python
from inference import MolmoWeb

client = MolmoWeb(
    endpoint="SET_UP_YOUR_ENDPOINT",
    local=True,     # True = local Chromium, False = Browserbase cloud browser
    headless=True,
)

query = "Go to arxiv.org and find the paper about Molmo and Pixmo."
traj = client.run(query=query, max_steps=10)
output_path = traj.save_html(query=query)
print(f"Saved to {output_path}")
```
Follow-up Query
```python
followup_query = "Find the full author list of the paper."
traj2 = client.continue_run(query=followup_query, max_steps=10)
```
Batch Queries
```python
queries = [
    "Go to allenai.org and find the latest research papers on top of the homepage",
    "Search for 'OLMo' on Wikipedia",
    "What is the weather in Seattle today?",
]
trajectories = client.run_batch(
    queries=queries,
    max_steps=10,
    max_workers=3,
)
# Trajectory .html files are saved under inference/htmls by default
```
Inference Backends
Supported backends: fastapi (remote HTTP endpoint), modal (serverless), native (native molmo/olmo-compatible checkpoint), hf (HuggingFace Transformers-compatible checkpoint).
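All four backends serve the same predict interface: a prompt plus a screenshot in, model output back. Purely as an illustration of that shape (none of the class or function names below come from the repo, and the response key is an assumption), a minimal dispatch over the backend names might look like:

```python
from typing import Protocol

class Predictor(Protocol):
    """Shared interface every backend implements: prompt + screenshot in, parsed response out."""
    def predict(self, prompt: str, image_base64: str) -> dict: ...

class FastAPIPredictor:
    """Hypothetical client for the `fastapi` backend: forwards to a remote /predict endpoint."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def predict(self, prompt: str, image_base64: str) -> dict:
        import requests  # deferred so constructing the client has no hard dependency
        resp = requests.post(
            f"{self.endpoint}/predict",
            json={"prompt": prompt, "image_base64": image_base64},
        )
        return resp.json()

def make_predictor(backend: str, **kwargs) -> Predictor:
    # Illustrative dispatch over the backend names listed above.
    if backend == "fastapi":
        return FastAPIPredictor(kwargs["endpoint"])
    raise NotImplementedError(f"backend {backend!r} not sketched here")
```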
vLLM support coming soon.
Extract Accessibility Tree
```python
from inference.client import MolmoWeb

client = MolmoWeb()
axtree_str = client.get_axtree("https://allenai.org/")
print(axtree_str)
client.close()
```
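The returned `axtree_str` is a text rendering of the page's accessibility tree. For a quick summary you can filter it line by line; note that the one-node-per-line format assumed here is how accessibility trees are commonly printed, not a documented contract of `get_axtree`:

```python
def summarize_axtree(axtree_str: str, keyword: str = "link") -> list[str]:
    """Return tree lines mentioning `keyword`, stripped of indentation.

    Assumes one node per line; adjust if the real format differs.
    """
    return [line.strip() for line in axtree_str.splitlines()
            if keyword in line.lower()]

# Toy tree for illustration (not real MolmoWeb output):
toy = "RootWebArea 'Ai2'\n  link 'Papers'\n  button 'Menu'\n  link 'Blog'"
print(summarize_axtree(toy))  # → ["link 'Papers'", "link 'Blog'"]
```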
License
Apache 2.0. See LICENSE for details.
TODO
- Inference
- Eval
- Training
