Multi-modal AI agents that watch, listen, and understand video.
Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
## Key Highlights
- Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini/OpenAI in real time.
- Low Latency: Join calls quickly (~500 ms) and keep audio/video latency under 30 ms using Stream's edge network.
- Open: Built by Stream, but works with any video edge network.
- Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate), and Claude (create message) — always access the latest LLM capabilities.
- SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
## Getting Started

Step 1: Install via uv

```shell
uv add vision-agents
```

Step 2: (Optional) Install with extra integrations

```shell
uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"
```
Step 3: Obtain your Stream API credentials
Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
Follow the quickstart guide to build your first agent.
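Credentials are typically wired in through environment variables before the agent starts. A minimal sketch, assuming the variable names `STREAM_API_KEY` and `STREAM_API_SECRET` purely for illustration (check the quickstart for the names the SDK actually reads):

```python
import os

# Hypothetical variable names, used here only for illustration;
# see the quickstart for the exact names the SDK expects.
os.environ.setdefault("STREAM_API_KEY", "your-api-key")
os.environ.setdefault("STREAM_API_SECRET", "your-api-secret")

api_key = os.environ["STREAM_API_KEY"]
api_secret = os.environ["STREAM_API_SECRET"]
assert api_key and api_secret, "Stream credentials must be set before joining a call"
```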
## See It In Action
(Demo video: golf_web_export.mp4)
This example shows how to build a golf coaching AI with YOLO and Gemini Live. Combining a fast object-detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases, for example: drone fire detection, sports or video-game coaching, physical therapy, workout coaching, Just Dance-style games, etc.
```python
# Partial example; full example: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
```
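Note the `fps=10` above: the realtime model sees a down-sampled frame stream while a fast processor like YOLO can run on every frame. A minimal sketch of that down-sampling idea in plain Python (the concept only, not the library's implementation):

```python
def throttle_frames(frames, source_fps=30, target_fps=10):
    """Forward roughly target_fps frames per second from a source_fps stream."""
    step = max(1, source_fps // target_fps)
    for i, frame in enumerate(frames):
        if i % step == 0:
            yield frame

# One second of 30 fps video yields 10 frames for the realtime model.
one_second = list(range(30))
forwarded = list(throttle_frames(one_second))
print(len(forwarded))  # 10
```

Sending every third frame keeps the expensive realtime model within budget while the cheap detector still sees the full stream.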
## Features
| Feature | Description |
|---|---|
| Real-time WebRTC | Stream video directly to model providers for instant visual understanding. |
| Video Processing | Pluggable processor pipeline for YOLO, Roboflow, or custom PyTorch/ONNX models before/after LLM calls. |
| Turn Detection | Natural conversation flow with VAD, diarization, and smart turn-taking. |
| Tool Calling & MCP | Execute code and APIs mid-conversation — Linear issues, weather, telephony, or any MCP server. |
| Phone Integration | Inbound and outbound voice calls via Twilio with bidirectional audio streaming. |
| RAG | Retrieval-augmented generation with TurboPuffer vector search or Gemini FileSearch. |
| Memory | Agents recall context across turns and sessions via Stream Chat. |
| Text Back-channel | Message the agent silently during a call — coaching overlays, silent instructions, etc. |
| Production Ready | Built-in HTTP server, Prometheus metrics, horizontal scaling, and Kubernetes deployment. |
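The processor pipeline described above runs model hooks before and after the LLM call. A toy sketch of the pattern in plain Python (the library's actual processor interface may differ; all names here are illustrative):

```python
from typing import Any, Callable

class Pipeline:
    """Run a frame through pre-processors, an LLM call, then post-processors."""

    def __init__(self, pre=None, post=None):
        self.pre = pre or []    # e.g. a YOLO detector annotating the frame
        self.post = post or []  # e.g. rendering an overlay on the LLM output

    def run(self, frame: Any, llm: Callable[[Any], Any]) -> Any:
        for p in self.pre:
            frame = p(frame)
        result = llm(frame)
        for p in self.post:
            result = p(result)
        return result

# Toy processors: annotate the frame, then uppercase the "LLM" response.
pipe = Pipeline(pre=[lambda f: {**f, "detections": ["person"]}],
                post=[lambda r: r.upper()])
out = pipe.run({"pixels": "..."}, llm=lambda f: f"saw {f['detections'][0]}")
print(out)  # SAW PERSON
```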
## Out-of-the-Box Integrations
LLMs: OpenAI · Gemini · xAI · OpenRouter · Hugging Face · Kimi AI
Realtime: OpenAI Realtime · Gemini Live · AWS Nova Sonic · Qwen · Inworld
STT: Deepgram · AssemblyAI · Fast-Whisper · Fish Audio · Wizper · Mistral Voxtral
TTS: ElevenLabs · Cartesia · Deepgram · AWS Polly · Pocket · Kokoro · Inworld · Fish Audio
Vision: Ultralytics · Roboflow · Moondream · NVIDIA Cosmos · Decart
Avatars: LemonSlice
Turn Detection: Vogent · Smart Turn
Other: Twilio · TurboPuffer
## Documentation
Check out the full docs at VisionAgents.ai.
Quickstart: Voice AI · Video AI
Guides: MCP & Function Calling · Video Processors · Phone Calling · RAG · Testing
Production: HTTP Server · Deployment · Kubernetes · Horizontal Scaling · Prometheus Metrics
## Examples
## Development
See DEVELOPMENT.md
Want to add your platform or provider? See Create Your Own Plugin or reach out to nash@getstream.io.
## Current Limitations
- Video AI struggles with small text — models may hallucinate scores, signs, etc.
- Context degrades on longer sessions (~30s+) for continuous video understanding
- Most use cases need a mix of specialized models (YOLO, Roboflow) with larger LLMs
- Real-time models require audio/text to trigger responses — video alone won't prompt output





