Open Vision Agents by Stream
Build Real-Time Vision AI Agents
*(video: intro.mp4)*
Multi-modal AI agents that watch, listen, and understand video.
Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
Key Highlights
- Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini or OpenAI in real time.
- Low Latency: Join calls quickly (~500 ms) and keep audio/video latency under 30 ms on Stream's edge network.
- Open: Built by Stream, but works with any video edge network.
- Native APIs: Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`), so you always have access to the latest LLM capabilities.
- SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
*(video: mirage.mp4)*
See It In Action
Sports Coaching
*(video: golf_web_export.mp4)*
This example shows how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object-detection model (like YOLO) with a full realtime AI model is useful across many video AI use cases: drone fire detection, sports and video-game coaching, physical therapy, workout coaching, Just Dance-style games, and more.
```python
# Partial example; full version: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    # llm=openai.Realtime(fps=1),  # careful with FPS, it can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
```
Cluely-style Invisible Assistant (coming soon)
Apps like Cluely offer realtime coaching via an invisible overlay. This example shows how to build your own invisible assistant. It combines Gemini Realtime (to watch your screen and audio) with a text-only reply channel, so no audio is broadcast. This approach is quite versatile and can be used for sales coaching, job interview cheating, or physical-world/on-the-job coaching with glasses.
*(demo video)*
```python
agent = Agent(
    edge=StreamEdge(),  # low-latency edge; clients for React, iOS, Android, React Native, Flutter, etc.
    agent_user=agent_user,  # the user object for the agent (name, image, etc.)
    instructions="You are silently helping the user pass this interview. See @interview_coach.md",
    # Gemini Realtime: no need to set TTS or STT (though that's also supported)
    llm=gemini.Realtime(),
)
```
Quick Start
Step 1: Install via uv
```bash
uv add vision-agents
```
Step 2: (Optional) Install with extra integrations
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
Step 3: Obtain your Stream API credentials
Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
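Once installed and credentialed, creating your first agent follows the same pattern as the examples above. A minimal sketch, assuming the public names shown in those examples; the import paths below are illustrative guesses, and the runnable versions live in the examples/ directory:

```python
# Minimal sketch only: the import paths here are illustrative assumptions.
# See the examples/ directory for runnable versions with exact module names.
from vision_agents.core import Agent, User           # hypothetical import path
from vision_agents.plugins import getstream, gemini  # hypothetical import path

agent = Agent(
    edge=getstream.Edge(),              # Stream's low-latency edge network
    agent_user=User(name="AI coach"),   # hypothetical user object (name, image, etc.)
    instructions="Keep replies short and conversational.",
    llm=gemini.Realtime(),              # realtime speech-to-speech; separate TTS/STT optional
)
```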
Features
| Feature | Description |
|---|---|
| True real-time via WebRTC | Stream directly to model providers that support it for instant visual understanding. |
| Interval/processor pipeline | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| Turn detection & diarization | Keep conversations natural; know when the agent should speak or stay quiet and who's talking. |
| Voice activity detection (VAD) | Trigger actions intelligently and use resources efficiently. |
| Speech↔Text↔Speech | Enable low-latency loops for smooth, conversational voice UX. |
| Tool/function calling | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| Built-in memory via Stream Chat | Agents recall context naturally across turns and sessions. |
| Text back-channel | Message the agent silently during a call. |
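To make the tool-calling row concrete: a tool is ultimately just a function the agent's LLM can invoke mid-conversation. How tools are registered with the Agent is covered in the docs; the function itself could be as simple as this hypothetical weather lookup (the endpoint and function name are illustrative, not part of the library):

```python
# Hypothetical tool function: only the shape is illustrated here.
# Registration with the Agent is library-specific; see the docs.
import httpx


def get_weather(city: str) -> dict:
    """Fetch current weather so the agent can answer mid-conversation."""
    resp = httpx.get(f"https://wttr.in/{city}", params={"format": "j1"})
    resp.raise_for_status()
    return resp.json()
```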
Out-of-the-Box Integrations
| Plugin Name | Description | Docs Link |
|---|---|---|
| AWS Bedrock | Realtime speech-to-speech plugin using Amazon Nova models with automatic reconnection | AWS |
| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | AWS Polly |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | Cartesia |
| Decart | Real-time AI video transformation service for applying artistic styles and effects to video streams | Decart |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | Deepgram |
| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | ElevenLabs |
| Fast-Whisper | High-performance STT plugin using OpenAI's Whisper model with CTranslate2 for fast inference | Fast-Whisper |
| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | Fish Audio |
| Gemini | Realtime API for building conversational agents with support for both voice and video | Gemini |
| HeyGen | Realtime interactive avatars powered by HeyGen | HeyGen |
| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | Inworld |
| Kokoro | Local TTS engine for offline voice synthesis with low latency | Kokoro |
| Moondream | Realtime detection and VLM capabilities, via the hosted API or locally on CUDA devices; Detect, Caption, and VQA skills are supported out of the box | Moondream |
| OpenAI | Realtime API for building conversational agents, with out-of-the-box support for real-time video over WebRTC, plus LLMs and OpenAI TTS | OpenAI |
| OpenRouter | LLM plugin providing access to multiple providers (Anthropic, Google, OpenAI) through a unified API | OpenRouter |
| Qwen | Realtime audio plugin using Alibaba's Qwen3 with native audio output and built-in speech recognition | Qwen |
| Roboflow | Object detection processor using Roboflow's hosted API or local RF-DETR models | Roboflow |
| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | Smart Turn |
| Ultralytics | Real-time pose detection processor using YOLO models with skeleton overlays | Ultralytics |
| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | Vogent |
| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | Wizper |
| xAI | LLM plugin using xAI's Grok models with advanced reasoning and real-time knowledge | xAI |
Processors
Processors let your agent manage state and handle audio/video in real time.
They take care of the hard stuff, like:
- Running smaller models
- Making API calls
- Transforming media
… so you can focus on your agent logic.
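This README doesn't spell out the processor interface, but the shape of the idea is easy to sketch. The class and method names below are assumptions, not the real Vision Agents API; see the docs for the actual base class:

```python
# Hypothetical sketch: class and method names are assumptions, not the real
# Vision Agents interface. It illustrates a processor that runs a small,
# cheap computation per video frame and keeps the result as state.
import numpy as np


class BrightnessProcessor:
    """Tracks mean frame brightness, e.g. to detect lighting changes."""

    def __init__(self) -> None:
        self.state: dict = {}

    def process_frame(self, frame: np.ndarray) -> None:
        # frame: HxWx3 RGB array; store a single scalar as agent-visible state
        self.state["brightness"] = float(frame.mean())
```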
Documentation
Check out our getting started guide at VisionAgents.ai.
- Quickstart: Building a Voice AI app
- Quickstart: Building a Video AI app
- Tutorial: Building real-time sports coaching
- Tutorial: Building a real-time meeting assistant
Examples
Development
See DEVELOPMENT.md
Open Platform
Want to add your platform or provider? Reach out to nash@getstream.io.
Awesome Video AI
Our favorite people & projects to follow for vision AI
- @demishassabis: CEO @ Google DeepMind, Nobel Prize winner
- @OfficialLoganK: Product Lead @ Gemini; posts about robotics vision
- @ultralytics: Various fast vision AI models; pose, detect, segment, classify
- @skalskip92: Open Source Lead @ Roboflow; building tools for vision AI
- @moondreamai: The tiny vision model that could; lightweight, fast, efficient
- @kwindla: Pipecat / Daily; sharing AI and vision insights
- @juberti: Head of Realtime AI @ OpenAI; realtime AI systems
- @romainhuet: Head of DX @ OpenAI; developer tooling & APIs
- @thorwebdev: ElevenLabs; voice and AI experiments
- @mervenoyann: Hugging Face; posts extensively about video AI
- @stash_pomichter: Spatial memory for robots; robotics & AI navigation
- @Mentraglass: Open-source, hackable AR smart glasses with built-in AI
- @vikhyatk: AI engineer; open-source AI projects; creator of Moondream
Inspiration
- LiveKit Agents: Great syntax, but LiveKit only
- Pipecat: Flexible, but more verbose
- OpenAI Agents: Focused on OpenAI only
Roadmap
0.1 - First Release (Oct)
- Working TTS, Gemini & OpenAI

0.2 - Simplification (Nov)
- Simplified the library & improved code quality
- Deepgram Nova 3, ElevenLabs Scribe 2, Fish, Moondream, Qwen3, Smart Turn, Vogent, Inworld, HeyGen, AWS, and more
- Improved OpenAI & Gemini realtime performance
- Audio & video utilities

0.3 - Demos (Early Dec)
- Mirage, Roboflow, facial recognition; nicer demos
- Deepgram Flux & ElevenLabs Scribe improvements

0.4 - Deploys (December)
- Tips on deploying agents at scale, monitoring them, etc.
- Guides on SIP & RAG
Vision AI limitations
Video AI is the frontier of AI, and the state of the art for live-video understanding is changing daily. While building the integrations, here are the limitations we've noticed (as of Dec 2025):

- Video AI struggles with small text. If you ask the AI to read the score in a game, it will often get it wrong and hallucinate.
- Longer videos can cause the AI to lose context. For instance, watching a soccer match, it will get confused after about 30 seconds.
- Most applications require a combination of small specialized models (YOLO, Roboflow, Moondream), API calls for extra context, and larger models like Gemini or OpenAI.
- Image size and FPS need to stay relatively low due to performance constraints.
- Video doesn't trigger responses in realtime models; you always need to send audio or text to trigger a response.
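In practice these constraints translate directly into configuration: keep FPS low, let a small processor do the per-frame work, and drive responses with audio or text. A sketch reusing the names from the golf example above:

```python
# Same names as the golf example earlier in this README.
llm = gemini.Realtime(fps=1)  # keep FPS low: cost, performance, and context limits
processors = [ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")]
# Note: video frames alone won't trigger a response; the user's audio/text does.
```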
We are hiring
Join the team behind this project - we’re hiring a Staff Python Engineer to architect, build, and maintain a powerful toolkit for developers integrating voice and video AI into their products.

















