Real-time monitoring dashboard showing ad detection metrics, system health, and inference times
The Problem with TV Ads
I hate TV ads in general. But the ads on streaming services like Hulu and recently Amazon Prime can be unbearable. Hulu’s ad-supported tier is almost unwatchable because of the frequency and repetition of the ads. Amazon, wanting to squeeze more money out of their existing customers, recently added ads to their service that you need to pay extra to remove. It’s only going to get worse.
Manual Solution (It Sucks)
I normally just mute the TV when an ad comes on and unmute it when the ad is over. But that gets tiring really quickly. On average, Amazon Prime shows ads every 15-20 minutes, which means a lot of manual muting and unmuting during a typical viewing session.
Scope for a Weekend Project
What made this project possible were two key observations:
- Streaming service ads come with a visible label on the screen
- I have a Sonos speaker system that can be controlled via Home Assistant
These two factors meant I could potentially detect and mute ads automatically using computer vision.
Choosing the Hardware
I set out to solve this problem using the smallest and cheapest hardware possible as a fun challenge. That’s when I discovered the ESP32-S3 Sense from Seeed Studio. This tiny powerhouse comes with:
- A detachable OV2640 camera module that can be swapped out for a different camera
- 8MB PSRAM (crucial for image processing and frame buffers)
- 8MB Flash storage
- All while measuring just 21mm x 18mm in size
ESP32-S3 Sense - A compact powerhouse for computer vision
The combination of the OV2640 camera’s capabilities and the generous PSRAM made this the perfect candidate for running lightweight machine learning models.
The Build Process
Step 1: Determining where to mount the ESP32 camera
After initially uploading a simple camera server sketch to the ESP32-S3, I experimented with different camera positions while monitoring the live feed on an iPad. The goal was to find the optimal angle to detect the ad label on the TV screen consistently. Through testing, I discovered that mounting the camera directly on top of the TV with a downward angle provided two key advantages:
- Clear view of the ad label region (reduced features to process)
- No obstruction from people walking in front of the TV
Step 2: Design and 3D Print the Mount
3D Printed Mount with Integrated Camera Angle
Design considerations for the mount:
- Compact housing for the ESP32-S3 Sense (21mm x 18mm)
- Curved angle for optimal camera view of the TV’s ad label region
- Unobtrusive design that doesn’t block the TV view
- Good ventilation for heat dissipation
Build details:
- 3D printed using PLA (35g of filament)
- Print time: 85 minutes
- Secured to TV using clear packaging tape
- Added thermal pads between the ESP32 and mount for better heat transfer
Pro Tip: The camera’s position and angle are crucial for reliable ad detection.
ESP32-S3 Sense mounted on top of the TV
Step 3: System Architecture
Before diving into the ML part, I set up the basic system architecture:
- ESP32-S3 running at 240MHz with PSRAM enabled
- WiFi connection with automatic reconnection handling
- WebSocket server for real-time monitoring
- Home Assistant integration for Sonos control
- Continuous frame capture in YUV422 format for efficient processing

YUV422 is a color space format that separates brightness (Y) from color information (U and V), requiring less bandwidth than RGB while maintaining good visual quality. This makes it particularly efficient for embedded systems like the ESP32.
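As a rough illustration (the exact byte packing depends on the camera driver; YUYV ordering is assumed here), extracting the grayscale Y channel from a YUV422 buffer is just a strided read:

```python
import numpy as np

def yuv422_to_gray(buf: bytes, width: int, height: int) -> np.ndarray:
    """Extract the luma (Y) plane from a YUYV-packed YUV422 buffer.

    In YUYV packing, each pair of pixels is stored as [Y0, U, Y1, V],
    so every even-indexed byte is a luma sample.
    """
    data = np.frombuffer(buf, dtype=np.uint8)
    y = data[0::2]  # every other byte is a Y sample
    return y.reshape(height, width)

# A 2x2 frame: pixels with luma 10, 20, 30, 40 (U/V set to a neutral 128).
frame = bytes([10, 128, 20, 128, 30, 128, 40, 128])
gray = yuv422_to_gray(frame, width=2, height=2)
```

This is also why YUV422 is convenient for a brightness-driven detector: the luma plane comes out for free, at half the bytes of RGB888.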
Step 4: Teaching the ESP32-S3 Sense to Detect Ads
Data Collection
The first challenge was collecting a good dataset of ad frames:
Initial Setup:
- Created a simple camera server on the ESP32
- Wrote a Python script to capture and save frames on my computer
- Set the initial camera resolution to 480x320 for data collection (later reduced to 96x96 to shrink the model and speed up inference)
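The capture script looked roughly like the sketch below. The ESP32's IP address and `/capture` endpoint are assumptions here; adjust them to whatever your camera-server sketch actually serves.

```python
import time
import urllib.request

# Hypothetical camera-server URL; replace with your ESP32's address/endpoint.
ESP32_URL = "http://192.168.1.50/capture"

def capture_frames(url: str, count: int, delay_s: float, fetch=None):
    """Fetch `count` frames from the camera server and save them to disk.

    `fetch` is injectable for testing; by default it performs an HTTP GET.
    Returns the list of saved filenames.
    """
    fetch = fetch or (lambda u: urllib.request.urlopen(u).read())
    names = []
    for i in range(count):
        raw = fetch(url)
        name = f"frame_{i:05d}_{int(time.time())}.yuv"
        with open(name, "wb") as f:
            f.write(raw)
        names.append(name)
        time.sleep(delay_s)  # pace requests so the ESP32 isn't overwhelmed
    return names
```

Sorting the resulting frames into ad/non-ad folders afterwards is quick, since ads arrive in contiguous blocks.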
Data Collection Process:
- Started playing shows on Prime Video (They gladly played a new ad block every time I opened a new show)
- Captured continuous frame sequences
- Manually separated ad vs non-ad frames (since ads appear in blocks)
- Captured frames in YUV422 format (matches deployment format)
Dataset Details:
- 11,695 ad frames
- 24,521 non-ad frames
- Each frame resized to 96x96 (lowest supported resolution for OV2640)
- Frames include various lighting conditions and TV brightness levels
RGB preview from the camera
Model Development
After experimenting with various TensorFlow Lite options for the ESP32, I discovered Edge Impulse. Their platform offered good support for ESP32 devices and direct Arduino library export for Bring Your Own Model (BYOM).

To keep the model's parameter count small, I kept the architecture simple:
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Conv2D(4, 3, strides=2, padding='same'),   # Reduce spatial dimensions
    layers.BatchNormalization(),                      # Stabilize training
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),            # Further dimension reduction
    layers.SeparableConv2D(8, 3, padding='same'),     # Efficient feature extraction
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.GlobalAveragePooling2D(),                  # Reduce parameters
    layers.Dropout(0.5),                              # Prevent overfitting
    layers.Dense(2, activation='softmax', name='output')
])
Training Process:
- Data augmentation: random brightness, contrast, and slight rotations
- Trained for 15 epochs on CPU
- Used YUV422 color space for consistency with deployment
- INT8 quantization for deployment efficiency
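The augmentation step can be sketched like this with plain NumPy (brightness and contrast jitter only; the rotation augmentation and the exact jitter ranges used in training are omitted here and the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Apply random brightness and contrast jitter to a float image in [0, 1].

    Contrast is scaled around the mid-gray point 0.5 so the overall
    exposure stays centered; brightness is a small additive shift.
    """
    brightness = rng.uniform(-0.1, 0.1)   # additive brightness shift
    contrast = rng.uniform(0.9, 1.1)      # multiplicative contrast factor
    out = (img - 0.5) * contrast + 0.5 + brightness
    return np.clip(out, 0.0, 1.0)

img = rng.uniform(size=(96, 96, 3)).astype(np.float32)
aug = augment(img)
```

Jittering brightness and contrast matters here because the TV's backlight and the room lighting both vary between viewing sessions.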
Results:
Total test images: 3934
Correctly classified: 3871
Accuracy: 98.4%
Note: While the accuracy looks impressive, the model’s performance is specific to Prime Video’s ad label style. It does not generalize well to other streaming services’ ad layouts.
Model Deployment
Edge Impulse Studio made the deployment process straightforward:
1. Model Optimization:
- Quantized to INT8 precision for faster inference
- Final model size: 293KB
- Exported as an Arduino library
2. ESP32 Implementation:
- Enabled PSRAM for frame buffer storage
- Set up camera initialization:
- 96x96 resolution
- YUV422 pixel format
- Frame flipping for correct orientation
- Implemented frame capture and pre-processing
- Added confidence thresholding logic
3. Real-time Monitoring:
- Added WebSocket server for live metrics
- Dashboard displays:
- Ad/No-ad probabilities
- Inference time
- Number of ads encountered
4. Performance Optimization:
- Use PSRAM for frame buffers
- Maintain stable WiFi connection
- Monitor system health
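The INT8 quantization from step 1 boils down to an affine mapping between floats and 8-bit integers. A simplified sketch of that arithmetic is below; the real TFLite converter chooses the scale and zero point per tensor from calibration data, so the values here are illustrative only:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine INT8 quantization: q = round(x / scale) + zero_point.

    Maps the observed float range [min, max] onto the int8 range
    [-128, 127], returning the quantized tensor plus its parameters.
    """
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate floats: x = (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, s, z = quantize_int8(x)
x_hat = dequantize(q, s, z)
```

The reconstruction error is bounded by one quantization step (the scale), which is why INT8 costs so little accuracy here while cutting both model size and inference time.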
Home Assistant Integration
The final piece was connecting the ESP32 to Home Assistant for Sonos control:
1. Setup Requirements:
- Create a new Long-Lived API Token for authentication
- Add the Sonos speaker entity in Home Assistant
- Connect the ESP32 to the same WiFi network as Home Assistant
2. Implementation:
// Example REST request body to mute the speaker.
{
  "entity_id": "media_player.family_room",
  "is_volume_muted": true
}
// Function to control the Sonos speaker mute state using Home Assistant REST API.
void controlSonosMute(bool mute) {
  if ((mute && !muteTriggerSent) || (!mute && !unmuteTriggerSent)) {
    // Latch the trigger so only one request is sent per state change
    if (mute) {
      muteTriggerSent = true;
      unmuteTriggerSent = false;
    } else {
      unmuteTriggerSent = true;
      muteTriggerSent = false;
    }
    if (WiFi.status() == WL_CONNECTED) {
      HTTPClient http;
      http.begin(HA_URL);
      http.addHeader("Content-Type", "application/json");
      http.addHeader("Authorization", String("Bearer ") + HA_TOKEN);
      String payload = "{\"entity_id\":\"" + String(SONOS_ENTITY) +
                       "\",\"is_volume_muted\":" + (mute ? "true" : "false") + "}";
      int httpCode = http.POST(payload);
      if (httpCode >= 200 && httpCode < 300) {
        Serial.printf("HA API Call: %s, Response: %d\n", mute ? "MUTE" : "UNMUTE", httpCode);
        isMuted = mute;
      } else {
        Serial.printf("HA Error: %s\n", http.errorToString(httpCode).c_str());
        // Roll back the latch so the request is retried on the next frame
        if (mute) muteTriggerSent = false;
        else unmuteTriggerSent = false;
      }
      http.end();
    } else {
      // WiFi down: roll back the latch so we retry once reconnected
      if (mute) muteTriggerSent = false;
      else unmuteTriggerSent = false;
    }
  }
}
3. Muting Logic:
- Requires 5 consecutive frames with >90% ad confidence to mute, which avoids false positives from one or two misclassified frames
- Requires 10 consecutive frames with >90% no-ad confidence to unmute, which prevents unmuting prematurely during a lull within an ad
- Added debouncing to prevent rapid mute/unmute cycles
- Logs all mute/unmute actions for monitoring
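The thresholding described above can be sketched as a small state machine. The 5/10-frame counts and 90% threshold come from the list; everything else (names, the low-confidence reset) is illustrative:

```python
from typing import Optional

MUTE_FRAMES = 5      # consecutive ad frames required to mute
UNMUTE_FRAMES = 10   # consecutive no-ad frames required to unmute
THRESHOLD = 0.90     # confidence required for a frame to count

class AdDebouncer:
    """Consecutive-frame debouncer for mute/unmute decisions."""

    def __init__(self):
        self.ad_streak = 0
        self.no_ad_streak = 0
        self.muted = False

    def update(self, ad_prob: float) -> Optional[str]:
        """Feed one frame's ad probability; return 'mute'/'unmute' or None."""
        if ad_prob > THRESHOLD:
            self.ad_streak += 1
            self.no_ad_streak = 0
        elif (1.0 - ad_prob) > THRESHOLD:
            self.no_ad_streak += 1
            self.ad_streak = 0
        else:
            # Low-confidence frame: reset both streaks
            self.ad_streak = 0
            self.no_ad_streak = 0

        if not self.muted and self.ad_streak >= MUTE_FRAMES:
            self.muted = True
            return "mute"
        if self.muted and self.no_ad_streak >= UNMUTE_FRAMES:
            self.muted = False
            return "unmute"
        return None
```

Feeding five high-confidence ad frames produces a single mute action on the fifth frame; ten high-confidence no-ad frames afterwards produce the unmute.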
Results and Limitations
Performance Metrics:
- Power consumption: 224mA at 5V (1.12W)
- Inference time: ~250ms per frame (4 FPS)
- Model library size: 293KB
- Average mute time: 1.75 seconds (7 frames)
- Average unmute time: 2.5 seconds (10 frames)
ESP32-S3 Sense in action: detecting ads and muting the speaker
System Stability:
- WiFi connection occasionally drops at start and after several hours
- Added automatic reconnection logic
- Dashboard indicates connection status
Temperature Management:
- ESP32 and camera module can get quite hot
- Added heat sinks to both components
Detection Accuracy:
- Excellent performance on Prime Video ads
- False positives on Netflix UI elements
- Struggles with new ad layouts not in training data
- More robust in consistent lighting conditions
Components Used
Here’s what you’ll need to replicate this project:
- ESP32-S3 Sense - $14
- Heat sinks - $6
- 3D printed mount (35g PLA) - $0.40
- Any compatible 5V USB power supply
- Home Assistant setup with Sonos integration
Total cost: ~$20.40
Final Thoughts
This weekend project successfully automated a daily annoyance, but there’s room for improvement:
Current Limitations:
- Only works with Prime Video ad layouts
- Requires specific camera positioning
- Needs manual reset occasionally
- Limited by single-modality detection (vision only)
Future Improvements:
- Train on multiple streaming services
- Add audio-based ad detection using the XIAO ESP32-S3’s microphone
- Create a multi-modal model (combine audio and video)
- Enhance the mute mechanism to be more universal by using an IR blaster
Power Management:
- Added a Kasa smart plug with Home Assistant integration
- Automatically powers on/off with the TV
- Prevents unnecessary runtime when TV is off
The XIAO ESP32-S3 Sense proved to be a capable platform for this type of edge ML project.
Despite its limitations, the system reliably mutes ads and provides a better viewing experience. The real-time monitoring dashboard helps track performance and catch issues early.
For future projects, I’m particularly interested in exploring the ESP32-S3’s audio capabilities and potentially creating a more sophisticated multi-modal detection system.