Open Vision Agents by Stream
Build Real-Time Vision AI Agents
*(video: intro.mp4)*
Multi-modal AI agents that watch, listen, and understand video.
Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
Key Highlights
- Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini or OpenAI in real time.
- Low Latency: Join calls quickly (~500 ms) and keep audio/video latency under 30 ms on Stream's edge network.
- Open: Built by Stream, but works with any video edge network.
- Native APIs: Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`), so you always have access to the latest LLM capabilities.
- SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
*(video: mirage.mp4)*
See It In Action
Sports Coaching
*(video: golf_web_export.mp4)*
This example shows how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object-detection model (like YOLO) with a full realtime AI model is useful across many video AI use cases: drone fire detection, sports and video-game coaching, physical therapy, workout coaching, Just Dance-style games, and more.
```python
# Partial example; full version: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    # llm=openai.Realtime(fps=1),  # careful with FPS, it can get expensive
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
```
Cluely-style Invisible Assistant (coming soon)
Apps like Cluely offer realtime coaching via an invisible overlay. This example shows how to build your own invisible assistant. It combines Gemini Realtime (to watch your screen and audio) with a text-only reply channel, so no audio is broadcast. This approach is quite versatile and can be used for sales coaching, job interview cheating, or physical-world/on-the-job coaching with glasses.
*(demo video)*
```python
agent = Agent(
    edge=StreamEdge(),  # low-latency edge; clients for React, iOS, Android, React Native, Flutter, etc.
    agent_user=agent_user,  # the user object for the agent (name, image, etc.)
    instructions="You are silently helping the user pass this interview. See @interview_coach.md",
    # Gemini Realtime: no need to set TTS or STT (though that's also supported)
    llm=gemini.Realtime(),
)
```
Quick Start
Step 1: Install via uv
```bash
uv add vision-agents
```
Step 2: (Optional) Install with extra integrations
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
Step 3: Obtain your Stream API credentials
Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
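Once installed and credentialed, creating your first agent follows the same pattern as the examples above. A minimal sketch, assuming the public names shown in those examples; the import paths below are illustrative guesses, and the runnable versions live in the examples/ directory:

```python
# Minimal sketch only: the import paths here are illustrative assumptions.
# See the examples/ directory for runnable versions with exact module names.
from vision_agents.core import Agent, User           # hypothetical import path
from vision_agents.plugins import getstream, gemini  # hypothetical import path

agent = Agent(
    edge=getstream.Edge(),              # Stream's low-latency edge network
    agent_user=User(name="AI coach"),   # hypothetical user object (name, image, etc.)
    instructions="Keep replies short and conversational.",
    llm=gemini.Realtime(),              # realtime speech-to-speech; separate TTS/STT optional
)
```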
Features
| Feature | Description |
|---|---|
| True real-time via WebRTC | Stream directly to model providers that support it for instant visual understanding. |
| Interval/processor pipeline | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| Turn detection & diarization | Keep conversations natural; know when the agent should speak or stay quiet and who's talking. |
| Voice activity detection (VAD) | Trigger actions intelligently and use resources efficiently. |
| Speech↔Text↔Speech | Enable low-latency loops for smooth, conversational voice UX. |
| Tool/function calling | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| Built-in memory via Stream Chat | Agents recall context naturally across turns and sessions. |
| Text back-channel | Message the agent silently during a call. |
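To make the tool-calling row concrete: a tool is ultimately just a function the agent's LLM can invoke mid-conversation. How tools are registered with the Agent is covered in the docs; the function itself could be as simple as this hypothetical weather lookup (the endpoint and function name are illustrative, not part of the library):

```python
# Hypothetical tool function: only the shape is illustrated here.
# Registration with the Agent is library-specific; see the docs.
import httpx


def get_weather(city: str) -> dict:
    """Fetch current weather so the agent can answer mid-conversation."""
    resp = httpx.get(f"https://wttr.in/{city}", params={"format": "j1"})
    resp.raise_for_status()
    return resp.json()
```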
Out-of-the-Box Integrations
| Plugin Name | Description | Docs Link |
|---|---|---|
| AWS Bedrock | Realtime speech-to-speech plugin using Amazon Nova models with automatic reconnection | AWS |
| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | AWS Polly |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | Cartesia |
| Decart | Real-time AI video transformation service for applying artistic styles and effects to video streams | Decart |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | Deepgram |
| ElevenLabs | TTS plugin with highly realistic and expressive voices for conversational agents | ElevenLabs |
| Fast-Whisper | High-performance STT plugin using OpenAI's Whisper model with CTranslate2 for fast inference | Fast-Whisper |
| Fish Audio | STT and TTS plugin with automatic language detection and voice cloning capabilities | Fish Audio |
| Gemini | Realtime API for building conversational agents with support for both voice and video | Gemini |
| HeyGen | Realtime interactive avatars powered by HeyGen | HeyGen |
| Inworld | TTS plugin with high-quality streaming voices for real-time conversational AI agents | Inworld |
| Kokoro | Local TTS engine for offline voice synthesis with low latency | Kokoro |
| Moondream | Realtime detection and VLM capabilities, via the hosted API or locally on CUDA devices; Detect, Caption, and VQA skills are supported out of the box | Moondream |
| OpenAI | Realtime API for building conversational agents, with out-of-the-box support for real-time video over WebRTC, plus LLMs and OpenAI TTS | OpenAI |
| OpenRouter | LLM plugin providing access to multiple providers (Anthropic, Google, OpenAI) through a unified API | OpenRouter |
| Qwen | Realtime audio plugin using Alibaba's Qwen3 with native audio output and built-in speech recognition | Qwen |
| Roboflow | Object detection processor using Roboflow's hosted API or local RF-DETR models | Roboflow |
| Smart Turn | Advanced turn detection system combining Silero VAD, Whisper, and neural models for natural conversation flow | Smart Turn |
| Ultralytics | Real-time pose detection processor using YOLO models with skeleton overlays | Ultralytics |
| Vogent | Neural turn detection system for intelligent turn-taking in voice conversations | Vogent |
| Wizper | STT plugin with real-time translation capabilities powered by Whisper v3 | Wizper |
| xAI | LLM plugin using xAI's Grok models with advanced reasoning and real-time knowledge | xAI |
Processors
Processors let your agent manage state and handle audio/video in real time.
They take care of the hard stuff, like:
- Running smaller models
- Making API calls
- Transforming media
… so you can focus on your agent logic.
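This README doesn't spell out the processor interface, but the shape of the idea is easy to sketch. The class and method names below are assumptions, not the real Vision Agents API; see the docs for the actual base class:

```python
# Hypothetical sketch: class and method names are assumptions, not the real
# Vision Agents interface. It illustrates a processor that runs a small,
# cheap computation per video frame and keeps the result as state.
import numpy as np


class BrightnessProcessor:
    """Tracks mean frame brightness, e.g. to detect lighting changes."""

    def __init__(self) -> None:
        self.state: dict = {}

    def process_frame(self, frame: np.ndarray) -> None:
        # frame: HxWx3 RGB array; store a single scalar as agent-visible state
        self.state["brightness"] = float(frame.mean())
```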
Documentation
Check out our getting started guide at VisionAgents.ai.
- Quickstart: Building a Voice AI app
- Quickstart: Building a Video AI app
- Tutorial: Building real-time sports coaching
- Tutorial: Building a real-time meeting assistant
Examples
Development
See DEVELOPMENT.md
Open Platform
Want to add your platform or provider? Reach out to nash@getstream.io.
Awesome Video AI
Our favorite people & projects to follow for vision AI
- @demishassabis: CEO @ Google DeepMind, Nobel Prize winner
- @OfficialLoganK: Product Lead @ Gemini; posts about robotics vision
- @ultralytics: Various fast vision AI models; pose, detect, segment, classify
- @skalskip92: Open Source Lead @ Roboflow; building tools for vision AI
- @moondreamai: The tiny vision model that could; lightweight, fast, efficient
- @kwindla: Pipecat / Daily; sharing AI and vision insights
- @juberti: Head of Realtime AI @ OpenAI; realtime AI systems
- @romainhuet: Head of DX @ OpenAI; developer tooling & APIs
- @thorwebdev: ElevenLabs; voice and AI experiments
- @mervenoyann: Hugging Face; posts extensively about video AI
- @stash_pomichter: Spatial memory for robots; robotics & AI navigation
- @Mentraglass: Open-source, hackable AR smart glasses with built-in AI
- @vikhyatk: AI engineer; open-source AI projects; creator of Moondream
Inspiration
- LiveKit Agents: Great syntax, but LiveKit only
- Pipecat: Flexible, but more verbose
- OpenAI Agents: Focused on OpenAI only
Roadmap
0.1 - First Release (Oct)
- Working TTS, Gemini & OpenAI

0.2 - Simplification (Nov)
- Simplified the library & improved code quality
- Deepgram Nova 3, ElevenLabs Scribe 2, Fish, Moondream, Qwen3, Smart Turn, Vogent, Inworld, HeyGen, AWS, and more
- Improved OpenAI & Gemini realtime performance
- Audio & video utilities

0.3 - Demos (Early Dec)
- Mirage, Roboflow, facial recognition; nicer demos
- Deepgram Flux & ElevenLabs Scribe improvements

0.4 - Deploys (December)
- Tips on deploying agents at scale, monitoring them, etc.
- Guides on SIP & RAG
Vision AI limitations
Video AI is the frontier of AI, and the state of the art for live-video understanding is changing daily. While building the integrations, here are the limitations we've noticed (as of Dec 2025):

- Video AI struggles with small text. If you ask the AI to read the score in a game, it will often get it wrong and hallucinate.
- Longer videos can cause the AI to lose context. For instance, watching a soccer match, it will get confused after about 30 seconds.
- Most applications require a combination of small specialized models (YOLO, Roboflow, Moondream), API calls for extra context, and larger models like Gemini or OpenAI.
- Image size and FPS need to stay relatively low due to performance constraints.
- Video doesn't trigger responses in realtime models; you always need to send audio or text to trigger a response.
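In practice these constraints translate directly into configuration: keep FPS low, let a small processor do the per-frame work, and drive responses with audio or text. A sketch reusing the names from the golf example above:

```python
# Same names as the golf example earlier in this README.
llm = gemini.Realtime(fps=1)  # keep FPS low: cost, performance, and context limits
processors = [ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")]
# Note: video frames alone won't trigger a response; the user's audio/text does.
```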
We are hiring
Join the team behind this project - we’re hiring a Staff Python Engineer to architect, build, and maintain a powerful toolkit for developers integrating voice and video AI into their products.

















