"I'm watching you, Wazowski. Always watching."
The most intelligent home security camera you've ever seen. Roz monitors a webcam, detects motion, analyzes the scene using an LLM with vision capabilities, and then audibly announces changes by comparing what it sees now vs. what it saw previously.
## Features
- Two-Stage Motion Detection: OpenCV-based frame differencing with configurable sensitivity limits how often the LLM is called.
- LLM Vision Analysis: Sends frames to a vision-capable LLM to understand what's happening.
- Multi-Frame Context: Analyzes several frames at once so only notable differences are announced.
- Text-to-Speech: Reads out what it sees using Piper TTS.
## Demo
Here's Roz set up to monitor my front door. Unmute audio for full effect.
roz_compressed.mp4
## Requirements

- A Linux system to run the software (I used a Raspberry Pi 4)
- An OpenAI-compatible LLM endpoint with vision support. This can be on the same computer but doesn't have to be. You likely want it hosted locally, because Roz sends a lot of traffic and you don't want to pay $5/hour for an API. You have been warned 😈.
- A USB webcam. I used this, but any should do.
- A USB speaker/speakerphone for TTS output (configured as an ALSA device). I used a Jabra 410.
- Python 3.13+ via uv
- A Piper TTS voice model (downloaded separately)
## FAQ

### Why did you make this?
I heard an ad for a service that claimed to analyze images from your video doorbell and describe what it sees. I thought, "That sounds cool, but I'm not paying $20/month for it. Can I make that?"
### Why is it a golden head?
The first version was just a camera and a speaker in a cardboard box. I 3D printed a case to hold them securely and thought gold paint might look fun. Also, if it was painted gold, maybe people would think of the gold statue from Indiana Jones and not steal it.
### What LLMs work?
Any LLM with vision capabilities that provides an OpenAI-compatible API endpoint. I used Qwen3.5 35B-A3B Q4 hosted on another PC in my house with an Nvidia 3090 GPU. You could use llama.cpp, vLLM, LM Studio, or similar.
### Can I use a different camera?
Yes. Any USB webcam that is supported by OpenCV on Linux should work.
### Why do you go through the trouble of detecting motion in the frame? Why not keep it simple and just send every image to the LLM?
The first-stage motion detection is computationally light and runs on the Raspberry Pi, which draws only a couple of watts. The second stage runs on a full PC with a GPU and draws about 500 watts. Sending only the frames that actually contain changes also gives a quicker response overall, because the GPU isn't constantly busy.
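The idea behind the cheap first stage is simple per-pixel arithmetic. Here is a rough sketch of frame differencing in pure NumPy; the real implementation in `src/detection/motion_detector.py` uses OpenCV and adds blurring and morphological filtering, and the function and default thresholds below are illustrative only:

```python
import numpy as np

def detect_motion(prev_gray, curr_gray, threshold_delta=25, min_motion_pixels=500):
    # Absolute per-pixel difference between consecutive grayscale frames
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    # Binarize: pixels that changed by more than the threshold count as motion
    motion_mask = diff > threshold_delta
    # Only report motion if enough pixels changed
    return int(motion_mask.sum()) >= min_motion_pixels

# Static scene: identical frames, no motion
frame_a = np.zeros((120, 160), dtype=np.uint8)
assert not detect_motion(frame_a, frame_a)

# A 40x40 bright patch appears: 1600 changed pixels >= 500, so motion
frame_b = frame_a.copy()
frame_b[10:50, 10:50] = 200
assert detect_motion(frame_a, frame_b)
```

This is why `threshold_delta` and `min_motion_pixels` appear in the settings table: the first filters out sensor noise per pixel, the second filters out tiny flickers per frame.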
### Why does it seem like it is two seconds behind what is happening in the scene?
This application is the equivalent of:
- Have a camera capture an image every second
- Take each image, combine it with a text prompt, and upload it to Claude or ChatGPT.
- Wait for the text, bring it back to your PC, and then run another program to convert the text response to audio.
- Start immediately playing the audio while capturing the next frame.
- Repeat as fast as possible.
On my local setup, the LLM response takes about one second and everything else (request, response, TTS synthesis) takes another second. If you used a GPU for TTS and ran it all on one PC, it would probably be faster.
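For reference, step 2 of that loop amounts to building a standard OpenAI-style chat request with the frame embedded as a base64 data URL. A minimal sketch, assuming the message format used by OpenAI-compatible vision endpoints; `build_vision_request` is a hypothetical helper, not the project's actual client in `src/llm/vision_analyzer.py`:

```python
import base64

def build_vision_request(jpeg_bytes, prompt, model="qwen35"):
    """Combine one captured frame with a text prompt into an
    OpenAI-compatible /v1/chat/completions request body."""
    image_b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

payload = build_vision_request(b"\xff\xd8...", "Describe any meaningful change in this scene.")
```

POSTing this body to the configured endpoint and reading `choices[0].message.content` from the response yields the text that is handed to the TTS stage.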
### Why does Roz occasionally repeat itself?
Because what constitutes a "meaningful change" in a scene is subjective. In `src/llm/prompt_config.py`, the prompt is constructed with rules that control when Roz speaks. This is an attempt to balance two extremes: announcing the full contents of every image every second, or only announcing major changes and potentially missing something important. The current configuration worked reasonably well for my setup, but you may want to adjust it for other situations. It is also model-dependent: smaller 4B models respond quickly but aren't as good at following the prompt or discerning meaningful differences, while larger models generally follow it better but the four-second response delay can be annoying. Feel free to tweak the prompt to better define what is and isn't important for your environment.
## Installation

- Clone the repository:

  ```shell
  git clone https://github.com/calz1/roz.git
  cd roz
  ```

- Install dependencies (using uv):
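Assuming the standard uv workflow for a `pyproject.toml`-managed project (invocation not verified against this repo):

```shell
# Creates the project venv and installs dependencies from pyproject.toml
uv sync
```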
- Download a Piper TTS voice model:

  ```shell
  # Example: British English voice
  wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/alba/medium/en_GB-alba-medium.onnx
  wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_GB/alba/medium/en_GB-alba-medium.onnx.json
  ```

  Browse available voices at: https://huggingface.co/rhasspy/piper-voices
- Configure environment:

  ```shell
  cp config.yaml.example config.yaml
  ```

  Edit `config.yaml` with your LLM endpoint, voice model, and other settings:

  ```yaml
  llm:
    endpoint: http://your-llm-server:8080/v1/chat/completions
    api_key: not-needed-for-local
    model: qwen35
  ```
## Usage
Run the main application:
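With uv managing the environment, the entry point can presumably be launched like this (command assumed from the project layout, where `main.py` is the entry point):

```shell
uv run main.py
```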
The system will:
- Initialize the camera and establish a baseline frame
- Continuously monitor for motion
- When motion is detected, send frames to the LLM for analysis
- If a meaningful change is detected, announce it via TTS

Press `Ctrl+C` to stop.
## config.yaml (Settings)

| Section | Key | Description |
|---|---|---|
| `llm` | `endpoint` | URL to your LLM API |
| `llm` | `api_key` | API key (use `not-needed-for-local` for local LLMs) |
| `llm` | `model` | Model name |
| `llm` | `timeout` | Request timeout in seconds |
| `llm` | `max_retries` | Maximum retry attempts |
| `motion` | `sensitivity` | Motion detection sensitivity: `high`, `medium`, `low` |
| `motion` | `frame_check_interval_ms` | Milliseconds between frame checks |
| `motion` | `min_contour_area` | Minimum pixel area to trigger motion |
| `motion` | `blur_kernel_size` | Gaussian blur kernel size |
| `motion` | `threshold_delta` | Pixel difference threshold |
| `motion` | `enable_morphology` | Enable morphological filtering |
| `motion` | `morphology_kernel_size` | Morphological kernel size |
| `motion` | `min_motion_pixels` | Minimum total motion pixels |
| `tts` | `voice_model` | Piper TTS voice model path |
| `tts` | `volume` | Volume level (0.0 to 1.0) |
| `logging` | `log_dir` | Directory for log files |
See `config.yaml.example` for all available options.
## Architecture

```
roz/
├── main.py                    # Main entry point - motion detection loop
├── src/
│   ├── config.py              # Configuration loading (config.yaml only)
│   ├── hardware/
│   │   └── camera.py          # USB camera interface (OpenCV)
│   ├── detection/
│   │   └── motion_detector.py # Frame differencing motion detection
│   ├── llm/
│   │   ├── vision_analyzer.py # LLM API client for vision analysis
│   │   └── prompt_config.py   # Prompt templates for change detection
│   └── speech/
│       └── announcer.py       # Piper TTS integration
├── config.yaml.example        # Example application settings
└── pyproject.toml             # Project dependencies
```
## Audio Device Configuration
The TTS output uses ALSA. To find your audio device:
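The standard ALSA listing command from alsa-utils shows the card and device numbers you need:

```shell
# Each entry reports "card N: ... device M: ..." -> use as plughw:N,M
aplay -l
```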
Update the device in `src/speech/announcer.py` if needed (default: `plughw:3,0`).
## Troubleshooting

### Camera Focus & Positioning
If you are running Roz headless (without a monitor) and need to focus or position the camera, use `stream_camera.py`. This script starts a lightweight web server that streams the camera feed to your browser.
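Assuming the script runs under the project's uv environment (exact invocation not verified):

```shell
uv run stream_camera.py
```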
Then, open your browser and navigate to `http://<your-device-ip>:8080`.
### Audio Issues
If you aren't hearing anything or want to verify your TTS setup, use `test_audio.py`. This script will attempt to initialize the Piper TTS engine and play a test message ("Testing audio output. Hello world.").
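Assuming the same uv environment as the main application (invocation not verified):

```shell
uv run test_audio.py
```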
If this fails, check your `config.yaml` to ensure the `tts.device` and `tts.voice_model` paths are correct.
## License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See `LICENSE` for details.
