How we solved multi-modal tool-calling in MCP agents – VLM Run MCP

docs.vlm.run

14 points by fzysingularity 5 months ago · 6 comments

vlmrunadmin007 5 months ago

It's impressive how the MCP example at https://docs.vlm.run/mcp/examples/template-search retains visual context across multiple images and tool calls. Unlike most chat interfaces, it enables seamless multi-step reasoning—like finding a logo in one image and tracking it in another—without losing state. This makes it ideal for building stateful, iterative visual workflows.

fzysingularityOP 5 months ago

Hi HN,

We’ve been building agentic VLMs that operate over visual data (i.e. images, PDFs, videos), and were surprised at how underdeveloped the current infrastructure is for multi-modal tool-calling. MCP is all the rage these days, but it sidesteps a fundamental issue that no one seems to talk about - especially in multimodal contexts.

Some of the pain points we ran into when building our MCP server:

- LLMs call tools by-value. That's fine for text and JSON arguments, but completely breaks down for visual inputs.
- You can't pass images or videos as base64 - it kills context limits and latency, and leads to a poor developer experience.
- Most "multimodal" MCP servers out there are single-turn demos. They assume local files and don't support remote or persistent objects, making it impossible to build real workflows that operate on intermediate visual state - which is the core of most computer vision tasks.
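To make the by-value vs by-reference distinction concrete, here's a minimal sketch (the object store, ID format, and `blur_faces` tool are illustrative stand-ins, not VLM Run's actual API): tools exchange short object IDs that resolve to persistent blobs, so the model's context only ever carries a small handle rather than megabytes of base64.

```python
import base64
import uuid

# In-memory object store standing in for a remote, persistent blob store.
# Tools pass short object IDs ("by reference") instead of raw bytes
# ("by value"), so the LLM context only carries a compact handle.
STORE: dict[str, bytes] = {}

def put_object(data: bytes) -> str:
    """Upload visual data once; return a compact reference for tool calls."""
    obj_id = f"obj_{uuid.uuid4().hex[:8]}"
    STORE[obj_id] = data
    return obj_id

def blur_faces(obj_id: str) -> str:
    """Stand-in vision tool: reads input by reference, returns a new reference
    to the intermediate result, so later tool calls can build on it."""
    data = STORE[obj_id]
    # (real face blurring elided; we just tag the bytes)
    return put_object(b"blurred:" + data)

# A fake ~1 MB image: by value it costs ~1.3 MB of base64 in the context
# window; by reference the tool-call argument is a 12-character string.
image = b"\x89PNG" + b"\x00" * 1_000_000
ref = put_object(image)

by_value_cost = len(base64.b64encode(image))  # chars the LLM would carry
by_ref_cost = len(ref)                        # chars actually carried
print(by_value_cost, by_ref_cost)

blurred_ref = blur_faces(ref)  # intermediate visual state persists server-side
print(blurred_ref in STORE)
```

The same pattern is what makes multi-turn visual workflows possible: each tool call returns a reference to persisted intermediate state instead of re-serializing pixels through the model.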

So we built our remotely-hosted MCP server (https://docs.vlm.run/mcp/) that makes it trivial for agents to see, understand, and act on visual content using a suite of computer vision tools. We expose these tools (face detection, redaction, captioning, tracking, etc.) through a clean MCP-compatible API. Any agent that can hook into remote MCP servers - Claude, OpenAI, Cursor - can use it out of the box.
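For readers unfamiliar with wiring up remote MCP servers: clients like Claude Desktop are commonly pointed at one via a config entry along these lines (a hedged sketch - the server name and URL below are placeholders, not VLM Run's actual endpoint; check the docs linked above for the real values):

```json
{
  "mcpServers": {
    "vlm-run": {
      "command": "npx",
      "args": ["mcp-remote", "https://<your-mcp-server-url>"]
    }
  }
}
```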

Here are a few end-to-end examples (orchestrated by Claude, using our tools):

[1] Document Redaction: https://docs.vlm.run/mcp/examples/document-redaction

[2] Face Detection + Blurring: https://docs.vlm.run/mcp/examples/face-redaction

[3] Template Matching + Visual Search: https://docs.vlm.run/mcp/examples/template-search

[4] Video editing: https://docs.vlm.run/mcp/examples/video-captioning

We’d love to hear what workflows you’re building - and what visual tools you'd want your agents to build on.

EarlyOom 5 months ago

Shocking how poorly frontier models perform on simple visual tasks. Best-in-domain tool calling will become the norm.

coolsank 5 months ago

Very interesting. Document redaction is definitely a great use case. Gotta check this out

mafangchang 5 months ago

Impressive!

slake 5 months ago

Noiceee!
