Teaching a $14 ESP32 to Detect and Auto-Mute TV Ads | Naveen Kulandaivelu


ESP32 TV Ad Mute Demo Real-time monitoring dashboard showing ad detection metrics, system health, and inference times

The Problem with TV Ads

I hate TV ads in general. But the ads on streaming services like Hulu and recently Amazon Prime can be unbearable. Hulu’s ad-supported tier is almost unwatchable because of the frequency and repetition of the ads. Amazon, wanting to squeeze more money out of their existing customers, recently added ads to their service that you need to pay extra to remove. It’s only going to get worse.


Manual Solution (It Sucks)

I normally just mute the TV when an ad comes on and unmute it when the ad is over. But that gets tiring really quickly. On average, Amazon Prime shows ads every 15-20 minutes, which means a lot of manual muting and unmuting during a typical viewing session.

Scope for a Weekend Project

What made this project possible were two key observations:

  • Streaming service ads come with a visible label on the screen
  • I have a Sonos speaker system that can be controlled via Home Assistant

These two factors meant I could potentially detect and mute ads automatically using computer vision.

Choosing the Hardware

I set out to solve this problem using the smallest and cheapest hardware possible as a fun challenge. That’s when I discovered the ESP32-S3 Sense from Seeed Studio. This tiny powerhouse comes with:

  • A detachable OV2640 camera module that can be swapped for other cameras
  • 8MB PSRAM (crucial for image processing and frame buffers)
  • 8MB Flash storage
  • All while measuring just 21mm x 18mm in size

xiao-esp32s3-sense ESP32-S3 Sense - A compact powerhouse for computer vision

The combination of the OV2640 camera’s capabilities and the generous PSRAM made this the perfect candidate for running lightweight machine learning models.

The Build Process

Step 1: Determining where to mount the ESP32 camera

After initially uploading a simple camera server sketch to the ESP32-S3, I experimented with different camera positions while monitoring the live feed on an iPad. The goal was to find the optimal angle to detect the ad label on the TV screen consistently. Through testing, I discovered that mounting the camera directly on top of the TV with a downward angle provided two key advantages:

  • Clear view of the ad label region (reduced features to process)
  • No obstruction from people walking in front of the TV

Step 2: Design and 3D Print the Mount

3d-print-attachment 3D Printed Mount with Integrated Camera Angle

Design considerations for the mount:

  • Compact housing for the ESP32-S3 Sense (21mm x 18mm)
  • Curved angle for optimal camera view of the TV’s ad label region
  • Unobtrusive design that doesn’t block the TV view
  • Good ventilation for heat dissipation

Build details:

  • 3D printed using PLA (35g of filament)
  • Print time: 85 minutes
  • Secured to TV using clear packaging tape
  • Added thermal pads between the ESP32 and mount for better heat transfer

Pro Tip: The camera’s position and angle are crucial for reliable ad detection.

mounted-esp32 ESP32-S3 Sense mounted on top of the TV

Step 3: System Architecture

Before diving into the ML part, I set up the basic system architecture:

  • ESP32-S3 running at 240MHz with PSRAM enabled
  • WiFi connection with automatic reconnection handling
  • WebSocket server for real-time monitoring
  • Home Assistant integration for Sonos control
  • Continuous frame capture in YUV422 format for efficient processing.
    • YUV422 is a color space format that separates brightness (Y) from color information (U and V), requiring less bandwidth than RGB while maintaining good visual quality. This format is particularly efficient for embedded systems like the ESP32.
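
As a back-of-envelope illustration of the bandwidth point: at the 96x96 resolution used later, a YUV422 frame needs a third less memory than RGB888, and converting a YUV sample back to RGB for preview is just a few multiplies. The numbers below are illustrative and use the standard BT.601 conversion, not code from the firmware:

```python
# Frame-buffer sizes at the 96x96 resolution used on the ESP32.
W, H = 96, 96

rgb888_bytes = W * H * 3   # 3 bytes per pixel
yuv422_bytes = W * H * 2   # Y for every pixel, U/V shared by pixel pairs

print(rgb888_bytes)  # 27648
print(yuv422_bytes)  # 18432

# BT.601 conversion of one YUV sample back to RGB for preview/debugging.
def yuv_to_rgb(y, u, v):
    c, d, e = y - 16, u - 128, v - 128
    clamp = lambda x: max(0, min(255, int(round(x))))
    r = clamp(1.164 * c + 1.596 * e)
    g = clamp(1.164 * c - 0.392 * d - 0.813 * e)
    b = clamp(1.164 * c + 2.017 * d)
    return r, g, b

print(yuv_to_rgb(235, 128, 128))  # near-white: (255, 255, 255)
```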

Step 4: Teaching the ESP32-S3 Sense to Detect Ads

Data Collection

The first challenge was collecting a good dataset of ad frames:

Initial Setup:

  • Created a simple camera server on the ESP32
  • Wrote a Python script to capture and save frames on my computer
  • Set initial camera resolution to 480x320 for data collection. This was later changed to 96x96 to reduce the size of the model and inference time.

Data Collection Process:

  • Started playing shows on Prime Video (They gladly played a new ad block every time I opened a new show)
  • Captured continuous frame sequences
  • Manually separated ad vs non-ad frames (since ads appear in blocks)
  • Captured frames in YUV422 format (matches deployment format)
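
Because ads appear in contiguous blocks, the manual separation step only needs the start/end frame index of each block rather than a per-frame decision. A minimal sketch of that labeling idea (the function name and interval values are hypothetical, not from the actual script):

```python
# Label every captured frame given the manually noted ad blocks.
def label_frames(num_frames, ad_blocks):
    """Return a per-frame label list: 'ad' inside any block, else 'no_ad'.

    ad_blocks is a list of (start, end) frame indices, end exclusive.
    """
    labels = ["no_ad"] * num_frames
    for start, end in ad_blocks:
        for i in range(start, min(end, num_frames)):
            labels[i] = "ad"
    return labels

labels = label_frames(10, [(2, 5), (8, 10)])
print(labels)
# ['no_ad', 'no_ad', 'ad', 'ad', 'ad', 'no_ad', 'no_ad', 'no_ad', 'ad', 'ad']
```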

Dataset Details:

  • 11,695 ad frames
  • 24,521 non-ad frames
  • Each frame resized to 96x96 (lowest supported resolution for OV2640)
  • Frames include various lighting conditions and TV brightness levels
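
Note the dataset is roughly 1:2 ad vs non-ad. The write-up doesn't say whether this imbalance was handled, but a common remedy is per-class loss weights; as a sketch using the standard `n_samples / (n_classes * n_c)` formula:

```python
# Per-class loss weights for the imbalanced dataset described above.
counts = {"ad": 11695, "no_ad": 24521}   # frame counts from the dataset
total = sum(counts.values())             # 36,216 frames overall

# Weight each class inversely to its frequency.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print({c: round(w, 3) for c, w in weights.items()})
# {'ad': 1.548, 'no_ad': 0.738}
```

With these weights each class contributes equally to the loss, so the model can't score well by simply favoring the majority "no_ad" class.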

ad-example RGB preview from the camera

Model Development

After experimenting with various TensorFlow Lite options for the ESP32, I landed on Edge Impulse. Their platform offers good ESP32 support and direct Arduino library export for Bring Your Own Model (BYOM).

To keep the parameter count low, I kept the model architecture deliberately simple:

model = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Conv2D(4, 3, strides=2, padding='same'),    # Reduce spatial dimensions
    layers.BatchNormalization(),                       # Stabilize training
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),             # Further dimension reduction
    
    layers.SeparableConv2D(8, 3, padding='same'),      # Efficient feature extraction
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    
    layers.GlobalAveragePooling2D(),                   # Reduce parameters
    layers.Dropout(0.5),                               # Prevent overfitting
    layers.Dense(2, activation='softmax', name='output')
])
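
A back-of-envelope count shows just how small this architecture is, assuming standard Keras conventions (conv biases enabled, four parameters per BatchNormalization channel):

```python
# Hand-count the parameters in the architecture above.
def conv2d(k, c_in, c_out):            # k x k kernel + bias
    return k * k * c_in * c_out + c_out

def separable_conv2d(k, c_in, c_out):  # depthwise + pointwise + bias
    return k * k * c_in + c_in * c_out + c_out

def batchnorm(c):                      # gamma, beta, moving mean/variance
    return 4 * c

def dense(n_in, n_out):                # weights + bias
    return n_in * n_out + n_out

total = (conv2d(3, 3, 4) + batchnorm(4) +
         separable_conv2d(3, 4, 8) + batchnorm(8) +
         dense(8, 2))
print(total)  # 254
```

A few hundred parameters is tiny even by microcontroller standards, which suggests most of the 293KB library size is inference runtime rather than weights.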

Training Process:

  • Data augmentation: random brightness, contrast, and slight rotations
  • Trained for 15 epochs on CPU
  • Used YUV422 color space for consistency with deployment
  • INT8 quantization for deployment efficiency
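
INT8 quantization replaces float values with 8-bit integers plus a scale and zero-point. A minimal sketch of the affine scheme TensorFlow Lite uses (the scale/zero-point values below are illustrative, not taken from the exported model):

```python
# Affine INT8 quantization: q = round(x / scale) + zero_point,
# clamped to the int8 range [-128, 127].
def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Typical parameters for an input normalized to [0, 1].
scale, zero_point = 1.0 / 255, -128

print(quantize(0.25, scale, zero_point))   # -64
print(quantize(1.0, scale, zero_point))    # 127
print(round(dequantize(-64, scale, zero_point), 4))  # 0.251
```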

Results:

Total test images: 3934
Correctly classified: 3871
Accuracy: 98.4%

Note: While the accuracy looks impressive, the model’s performance is specific to Prime Video’s ad label style. It does not generalize well to other streaming services’ ad layouts.

Model Deployment

Edge Impulse Studio made the deployment process straightforward:

1. Model Optimization:

  • Quantized to INT8 precision for faster inference
  • Final model size: 293KB
  • Exported as an Arduino library

2. ESP32 Implementation:

  • Enabled PSRAM for frame buffer storage
  • Set up camera initialization:
    • 96x96 resolution
    • YUV422 pixel format
    • Frame flipping for correct orientation
  • Implemented frame capture and pre-processing
  • Added confidence thresholding logic

3. Real-time Monitoring:

  • Added WebSocket server for live metrics

  • Dashboard displays:

    • Ad/No-ad probabilities
    • Inference time
    • Number of ads encountered

4. Performance Optimization:

  • Use PSRAM for frame buffers
  • Maintain stable WiFi connection
  • Monitor system health

Home Assistant Integration

The final piece was connecting the ESP32 to Home Assistant for Sonos control:

1. Setup Requirements:

  • A running Home Assistant instance with the Sonos integration configured
  • A long-lived access token for the REST API (HA_TOKEN)
  • The Home Assistant endpoint URL (HA_URL) and the Sonos entity ID (SONOS_ENTITY)

2. Implementation:

// Example REST request body to mute the speaker.
{
  "entity_id": "media_player.family_room",
  "is_volume_muted": true
}
// Function to control the Sonos speaker mute state using Home Assistant REST API.
void controlSonosMute(bool mute) {
    if ((mute && !muteTriggerSent) || (!mute && !unmuteTriggerSent)) {
        if (mute) {
            muteTriggerSent = true;
            unmuteTriggerSent = false;
        } else {
            unmuteTriggerSent = true;
            muteTriggerSent = false;
        }

        if (WiFi.status() == WL_CONNECTED) {
            HTTPClient http;
            http.begin(HA_URL);
            http.addHeader("Content-Type", "application/json");
            http.addHeader("Authorization", String("Bearer ") + HA_TOKEN);

            String payload = "{\"entity_id\":\"" + String(SONOS_ENTITY) +
                          "\",\"is_volume_muted\":" + (mute ? "true" : "false") + "}";

            int httpCode = http.POST(payload);
            if (httpCode >= 200 && httpCode < 300) {
                Serial.printf("HA API Call: %s, Response: %d\n", mute ? "MUTE" : "UNMUTE", httpCode);
                isMuted = mute;
            } else {
                // Negative codes are HTTPClient errors; positive ones are HTTP status codes.
                Serial.printf("HA Error: %d (%s)\n", httpCode, http.errorToString(httpCode).c_str());
                if (mute) muteTriggerSent = false;
                else unmuteTriggerSent = false;
            }
            http.end();
        } else {
            // WiFi is down: clear the trigger flag so the call is retried after reconnect.
            if (mute) muteTriggerSent = false;
            else unmuteTriggerSent = false;
        }
    }
}

3. Muting Logic:

  • Requires 5 consecutive frames with >90% ad confidence to mute.
    • To avoid false positives from 1 or 2 frames.
  • Requires 10 consecutive frames with >90% no-ad confidence to unmute.
    • To prevent triggering unmute prematurely.
  • Added debouncing to prevent rapid mute/unmute cycles
  • Logs all mute/unmute actions for monitoring
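
The thresholds above amount to a small hysteresis state machine. A Python sketch of the same logic (the actual firmware is C++; the class name and defaults below just mirror the description):

```python
# Hysteresis: 5 consecutive high-confidence ad frames mute,
# 10 consecutive high-confidence no-ad frames unmute.
class AdMuteController:
    def __init__(self, mute_after=5, unmute_after=10, threshold=0.90):
        self.mute_after = mute_after
        self.unmute_after = unmute_after
        self.threshold = threshold
        self.ad_streak = 0
        self.no_ad_streak = 0
        self.muted = False

    def update(self, ad_prob):
        """Feed one frame's ad probability; return 'mute', 'unmute', or None."""
        if ad_prob > self.threshold:
            self.ad_streak += 1
            self.no_ad_streak = 0
        elif (1 - ad_prob) > self.threshold:
            self.no_ad_streak += 1
            self.ad_streak = 0
        else:
            # Low-confidence frame: reset both streaks (debounce).
            self.ad_streak = self.no_ad_streak = 0

        if not self.muted and self.ad_streak >= self.mute_after:
            self.muted = True
            return "mute"
        if self.muted and self.no_ad_streak >= self.unmute_after:
            self.muted = False
            return "unmute"
        return None

ctl = AdMuteController()
events = [ctl.update(p) for p in [0.95] * 5 + [0.02] * 10]
print([e for e in events if e])  # ['mute', 'unmute']
```

Requiring long streaks and resetting on any ambiguous frame is what keeps a single misclassified frame from toggling the speaker.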

Results and Limitations

Performance Metrics:

  • Power consumption: 224mA at 5V (1.12W)
  • Inference time: ~250ms per frame (4 FPS)
  • Model library size: 293KB
  • Average mute time: 1.75 seconds (7 frames)
  • Average unmute time: 2.5 seconds (10 frames)
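
The latency figures follow directly from frames-to-trigger times inference time; the measured 7-frame average to mute (vs. the 5-frame threshold) suggests the ad label takes a couple of frames to appear on screen:

```python
# Mute/unmute latency is frames-to-trigger x per-frame inference time.
inference_s = 0.250                      # ~250 ms per frame (4 FPS)

print(5 * inference_s)    # 1.25 s best case to mute (5-frame threshold)
print(7 * inference_s)    # 1.75 s measured average mute time
print(10 * inference_s)   # 2.5 s to unmute (10-frame threshold)
```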

esp32-ad-mute-demo ESP32-S3 Sense in action: detecting ads and muting the speaker

System Stability:

  • WiFi connection occasionally drops at startup and after several hours of operation
  • Added automatic reconnection logic
  • Dashboard indicates connection status

Temperature Management:

  • ESP32 and camera module can get quite hot
  • Added heat sinks to both components

Detection Accuracy:

  • Excellent performance on Prime Video ads
  • False positives on Netflix UI elements
  • Struggles with new ad layouts not in training data
  • More robust in consistent lighting conditions

Components Used

Here’s what you’ll need to replicate this project:

  • ESP32-S3 Sense - $14
  • Heat sinks - $6
  • 3D printed mount (35g PLA) - $0.40
  • Any compatible 5V USB power supply
  • Home Assistant setup with Sonos integration

Total cost: ~$20.40

Final Thoughts

This weekend project successfully automated a daily annoyance, but there’s room for improvement:

Current Limitations:

  • Only works with Prime Video ad layouts
  • Requires specific camera positioning
  • Needs manual reset occasionally
  • Limited by single-modality detection (vision only)

Future Improvements:

  • Train on multiple streaming services
  • Add audio-based ad detection using the XIAO ESP32-S3’s microphone
  • Create a multi-modal model (combine audio and video)
  • Make the mute mechanism more universal with an IR blaster

Power Management:

  • Added a Kasa smart plug with Home Assistant integration
  • Automatically powers on/off with the TV
  • Prevents unnecessary runtime when TV is off

The XIAO ESP32-S3 Sense proved to be a capable platform for this type of edge ML project.

Despite its limitations, the system reliably mutes ads and provides a better viewing experience. The real-time monitoring dashboard helps track performance and catch issues early.

For future projects, I’m particularly interested in exploring the ESP32-S3’s audio capabilities and potentially creating a more sophisticated multi-modal detection system.