Teaching a $14 ESP32 to Detect and Auto-Mute TV Ads | Naveen Kulandaivelu


ESP32 TV Ad Mute Demo Real-time monitoring dashboard showing ad detection metrics, system health, and inference times

The Problem with TV Ads

I hate TV ads in general. But the ads on streaming services like Hulu and recently Amazon Prime can be unbearable. Hulu’s ad-supported tier is almost unwatchable because of the frequency and repetition of the ads. Amazon, wanting to squeeze more money out of their existing customers, recently added ads to their service that you need to pay extra to remove. It’s only going to get worse.


Manual Solution (It Sucks)

I normally just mute the TV when an ad comes on and unmute it when the ad is over. But that gets tiring really quickly. On average, Amazon Prime shows ads every 15-20 minutes, which means a lot of manual muting and unmuting during a typical viewing session.

Scope for a Weekend Project

What made this project possible were two key observations:

  • Streaming service ads come with a visible label on the screen
  • I have a Sonos speaker system that can be controlled via Home Assistant

These two factors meant I could potentially detect and mute ads automatically using computer vision.

Choosing the Hardware

I set out to solve this problem using the smallest and cheapest hardware possible as a fun challenge. That’s when I discovered the ESP32-S3 Sense from Seeed Studio. This tiny powerhouse comes with:

  • A detachable OV2640 camera module that can be swapped for other cameras
  • 8MB PSRAM (crucial for image processing and frame buffers)
  • 8MB Flash storage
  • All while measuring just 21mm x 18mm in size

xiao-esp32s3-sense ESP32-S3 Sense - A compact powerhouse for computer vision

The combination of the OV2640 camera’s capabilities and the generous PSRAM made this the perfect candidate for running lightweight machine learning models.

The Build Process

Step 1: Determining where to mount the ESP32 camera

After initially uploading a simple camera server sketch to the ESP32-S3, I experimented with different camera positions while monitoring the live feed on an iPad. The goal was to find the optimal angle to detect the ad label on the TV screen consistently. Through testing, I discovered that mounting the camera directly on top of the TV with a downward angle provided two key advantages:

  • Clear view of the ad label region (reduced features to process)
  • No obstruction from people walking in front of the TV

Step 2: Design and 3D Print the Mount

3d-print-attachment 3D Printed Mount with Integrated Camera Angle

Design considerations for the mount:

  • Compact housing for the ESP32-S3 Sense (21mm x 18mm)
  • Curved angle for optimal camera view of the TV’s ad label region
  • Unobtrusive design that doesn’t block the TV view
  • Good ventilation for heat dissipation

Build details:

  • 3D printed using PLA (35g of filament)
  • Print time: 85 minutes
  • Secured to TV using clear packaging tape
  • Added thermal pads between the ESP32 and mount for better heat transfer

Pro Tip: The camera’s position and angle are crucial for reliable ad detection.

mounted-esp32 ESP32-S3 Sense mounted on top of the TV

Step 3: System Architecture

Before diving into the ML part, I set up the basic system architecture:

  • ESP32-S3 running at 240MHz with PSRAM enabled
  • WiFi connection with automatic reconnection handling
  • WebSocket server for real-time monitoring
  • Home Assistant integration for Sonos control
  • Continuous frame capture in YUV422 format for efficient processing.
    • YUV422 is a color space format that separates brightness (Y) from color information (U and V), requiring less bandwidth than RGB while maintaining good visual quality. This format is particularly efficient for embedded systems like the ESP32.
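
As a back-of-envelope illustration of the bandwidth point: at the 96x96 resolution used later, a YUV422 frame needs a third less memory than RGB888, and converting a YUV sample back to RGB for preview is just a few multiplies. The numbers below are illustrative and use the standard BT.601 conversion, not code from the firmware:

```python
# Frame-buffer sizes at the 96x96 resolution used on the ESP32.
W, H = 96, 96

rgb888_bytes = W * H * 3   # 3 bytes per pixel
yuv422_bytes = W * H * 2   # Y for every pixel, U/V shared by pixel pairs

print(rgb888_bytes)  # 27648
print(yuv422_bytes)  # 18432

# BT.601 conversion of one YUV sample back to RGB for preview/debugging.
def yuv_to_rgb(y, u, v):
    c, d, e = y - 16, u - 128, v - 128
    clamp = lambda x: max(0, min(255, int(round(x))))
    r = clamp(1.164 * c + 1.596 * e)
    g = clamp(1.164 * c - 0.392 * d - 0.813 * e)
    b = clamp(1.164 * c + 2.017 * d)
    return r, g, b

print(yuv_to_rgb(235, 128, 128))  # near-white: (255, 255, 255)
```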

Step 4: Teaching the ESP32-S3 Sense to Detect Ads

Data Collection

The first challenge was collecting a good dataset of ad frames:

Initial Setup:

  • Created a simple camera server on the ESP32
  • Wrote a Python script to capture and save frames on my computer
  • Set initial camera resolution to 480x320 for data collection. This was later changed to 96x96 to reduce the size of the model and inference time.

Data Collection Process:

  • Started playing shows on Prime Video (They gladly played a new ad block every time I opened a new show)
  • Captured continuous frame sequences
  • Manually separated ad vs non-ad frames (since ads appear in blocks)
  • Captured frames in YUV422 format (matches deployment format)
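
Because ads appear in contiguous blocks, the manual separation step only needs the start/end frame index of each block rather than a per-frame decision. A minimal sketch of that labeling idea (the function name and interval values are hypothetical, not from the actual script):

```python
# Label every captured frame given the manually noted ad blocks.
def label_frames(num_frames, ad_blocks):
    """Return a per-frame label list: 'ad' inside any block, else 'no_ad'.

    ad_blocks is a list of (start, end) frame indices, end exclusive.
    """
    labels = ["no_ad"] * num_frames
    for start, end in ad_blocks:
        for i in range(start, min(end, num_frames)):
            labels[i] = "ad"
    return labels

labels = label_frames(10, [(2, 5), (8, 10)])
print(labels)
# ['no_ad', 'no_ad', 'ad', 'ad', 'ad', 'no_ad', 'no_ad', 'no_ad', 'ad', 'ad']
```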

Dataset Details:

  • 11,695 ad frames
  • 24,521 non-ad frames
  • Each frame resized to 96x96 (lowest supported resolution for OV2640)
  • Frames include various lighting conditions and TV brightness levels
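
Note the dataset is roughly 1:2 ad vs non-ad. The write-up doesn't say whether this imbalance was handled, but a common remedy is per-class loss weights; as a sketch using the standard `n_samples / (n_classes * n_c)` formula:

```python
# Per-class loss weights for the imbalanced dataset described above.
counts = {"ad": 11695, "no_ad": 24521}   # frame counts from the dataset
total = sum(counts.values())             # 36,216 frames overall

# Weight each class inversely to its frequency.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print({c: round(w, 3) for c, w in weights.items()})
# {'ad': 1.548, 'no_ad': 0.738}
```

With these weights each class contributes equally to the loss, so the model can't score well by simply favoring the majority "no_ad" class.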

ad-example RGB preview from the camera

Model Development

After experimenting with various TensorFlow Lite options for the ESP32, I landed on Edge Impulse. Their platform offers good ESP32 support and direct Arduino library export for Bring Your Own Model (BYOM).

To keep the parameter count low, I kept the model architecture deliberately simple:

model = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    layers.Conv2D(4, 3, strides=2, padding='same'),    # Reduce spatial dimensions
    layers.BatchNormalization(),                       # Stabilize training
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),             # Further dimension reduction
    
    layers.SeparableConv2D(8, 3, padding='same'),      # Efficient feature extraction
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    
    layers.GlobalAveragePooling2D(),                   # Reduce parameters
    layers.Dropout(0.5),                               # Prevent overfitting
    layers.Dense(2, activation='softmax', name='output')
])
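
A back-of-envelope count shows just how small this architecture is, assuming standard Keras conventions (conv biases enabled, four parameters per BatchNormalization channel):

```python
# Hand-count the parameters in the architecture above.
def conv2d(k, c_in, c_out):            # k x k kernel + bias
    return k * k * c_in * c_out + c_out

def separable_conv2d(k, c_in, c_out):  # depthwise + pointwise + bias
    return k * k * c_in + c_in * c_out + c_out

def batchnorm(c):                      # gamma, beta, moving mean/variance
    return 4 * c

def dense(n_in, n_out):                # weights + bias
    return n_in * n_out + n_out

total = (conv2d(3, 3, 4) + batchnorm(4) +
         separable_conv2d(3, 4, 8) + batchnorm(8) +
         dense(8, 2))
print(total)  # 254
```

A few hundred parameters is tiny even by microcontroller standards, which suggests most of the 293KB library size is inference runtime rather than weights.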

Training Process:

  • Data augmentation: random brightness, contrast, and slight rotations
  • Trained for 15 epochs on CPU
  • Used YUV422 color space for consistency with deployment
  • INT8 quantization for deployment efficiency
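
INT8 quantization replaces float values with 8-bit integers plus a scale and zero-point. A minimal sketch of the affine scheme TensorFlow Lite uses (the scale/zero-point values below are illustrative, not taken from the exported model):

```python
# Affine INT8 quantization: q = round(x / scale) + zero_point,
# clamped to the int8 range [-128, 127].
def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Typical parameters for an input normalized to [0, 1].
scale, zero_point = 1.0 / 255, -128

print(quantize(0.25, scale, zero_point))   # -64
print(quantize(1.0, scale, zero_point))    # 127
print(round(dequantize(-64, scale, zero_point), 4))  # 0.251
```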

Results:

Total test images: 3934
Correctly classified: 3871
Accuracy: 98.4%

Note: While the accuracy looks impressive, the model’s performance is specific to Prime Video’s ad label style. It does not generalize well to other streaming services’ ad layouts.

Model Deployment

Edge Impulse Studio made the deployment process straightforward:

1. Model Optimization:

  • Quantized to INT8 precision for faster inference
  • Final model size: 293KB
  • Exported as an Arduino library

2. ESP32 Implementation:

  • Enabled PSRAM for frame buffer storage
  • Set up camera initialization:
    • 96x96 resolution
    • YUV422 pixel format
    • Frame flipping for correct orientation
  • Implemented frame capture and pre-processing
  • Added confidence thresholding logic

3. Real-time Monitoring:

  • Added WebSocket server for live metrics

  • Dashboard displays:

    • Ad/No-ad probabilities
    • Inference time
    • Number of ads encountered

4. Performance Optimization:

  • Use PSRAM for frame buffers
  • Maintain stable WiFi connection
  • Monitor system health

Home Assistant Integration

The final piece was connecting the ESP32 to Home Assistant for Sonos control:

1. Setup Requirements:

  • A running Home Assistant instance with the Sonos integration configured
  • A long-lived access token for the REST API (HA_TOKEN)
  • The Home Assistant endpoint URL (HA_URL) and the Sonos entity ID (SONOS_ENTITY)

2. Implementation:

// Example REST request body to mute the speaker.
{
  "entity_id": "media_player.family_room",
  "is_volume_muted": true
}
// Function to control the Sonos speaker mute state using Home Assistant REST API.
void controlSonosMute(bool mute) {
    if ((mute && !muteTriggerSent) || (!mute && !unmuteTriggerSent)) {
        if (mute) {
            muteTriggerSent = true;
            unmuteTriggerSent = false;
        } else {
            unmuteTriggerSent = true;
            muteTriggerSent = false;
        }

        if (WiFi.status() == WL_CONNECTED) {
            HTTPClient http;
            http.begin(HA_URL);
            http.addHeader("Content-Type", "application/json");
            http.addHeader("Authorization", String("Bearer ") + HA_TOKEN);

            String payload = "{\"entity_id\":\"" + String(SONOS_ENTITY) +
                          "\",\"is_volume_muted\":" + (mute ? "true" : "false") + "}";

            int httpCode = http.POST(payload);
            if (httpCode >= 200 && httpCode < 300) {
                Serial.printf("HA API Call: %s, Response: %d\n", mute ? "MUTE" : "UNMUTE", httpCode);
                isMuted = mute;
            } else {
                // Negative codes are HTTPClient errors; positive ones are HTTP status codes.
                Serial.printf("HA Error: %d (%s)\n", httpCode, http.errorToString(httpCode).c_str());
                if (mute) muteTriggerSent = false;
                else unmuteTriggerSent = false;
            }
            http.end();
        } else {
            // WiFi is down: clear the trigger flag so the call is retried after reconnect.
            if (mute) muteTriggerSent = false;
            else unmuteTriggerSent = false;
        }
    }
}

3. Muting Logic:

  • Requires 5 consecutive frames with >90% ad confidence to mute.
    • To avoid false positives from 1 or 2 frames.
  • Requires 10 consecutive frames with >90% no-ad confidence to unmute.
    • To prevent triggering unmute prematurely.
  • Added debouncing to prevent rapid mute/unmute cycles
  • Logs all mute/unmute actions for monitoring
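
The thresholds above amount to a small hysteresis state machine. A Python sketch of the same logic (the actual firmware is C++; the class name and defaults below just mirror the description):

```python
# Hysteresis: 5 consecutive high-confidence ad frames mute,
# 10 consecutive high-confidence no-ad frames unmute.
class AdMuteController:
    def __init__(self, mute_after=5, unmute_after=10, threshold=0.90):
        self.mute_after = mute_after
        self.unmute_after = unmute_after
        self.threshold = threshold
        self.ad_streak = 0
        self.no_ad_streak = 0
        self.muted = False

    def update(self, ad_prob):
        """Feed one frame's ad probability; return 'mute', 'unmute', or None."""
        if ad_prob > self.threshold:
            self.ad_streak += 1
            self.no_ad_streak = 0
        elif (1 - ad_prob) > self.threshold:
            self.no_ad_streak += 1
            self.ad_streak = 0
        else:
            # Low-confidence frame: reset both streaks (debounce).
            self.ad_streak = self.no_ad_streak = 0

        if not self.muted and self.ad_streak >= self.mute_after:
            self.muted = True
            return "mute"
        if self.muted and self.no_ad_streak >= self.unmute_after:
            self.muted = False
            return "unmute"
        return None

ctl = AdMuteController()
events = [ctl.update(p) for p in [0.95] * 5 + [0.02] * 10]
print([e for e in events if e])  # ['mute', 'unmute']
```

Requiring long streaks and resetting on any ambiguous frame is what keeps a single misclassified frame from toggling the speaker.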

Results and Limitations

Performance Metrics:

  • Power consumption: 224mA at 5V (1.12W)
  • Inference time: ~250ms per frame (4 FPS)
  • Model library size: 293KB
  • Average mute time: 1.75 seconds (7 frames)
  • Average unmute time: 2.5 seconds (10 frames)
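
The latency figures follow directly from frames-to-trigger times inference time; the measured 7-frame average to mute (vs. the 5-frame threshold) suggests the ad label takes a couple of frames to appear on screen:

```python
# Mute/unmute latency is frames-to-trigger x per-frame inference time.
inference_s = 0.250                      # ~250 ms per frame (4 FPS)

print(5 * inference_s)    # 1.25 s best case to mute (5-frame threshold)
print(7 * inference_s)    # 1.75 s measured average mute time
print(10 * inference_s)   # 2.5 s to unmute (10-frame threshold)
```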

esp32-ad-mute-demo ESP32-S3 Sense in action: detecting ads and muting the speaker

System Stability:

  • WiFi connection occasionally drops at startup and after several hours of operation
  • Added automatic reconnection logic
  • Dashboard indicates connection status

Temperature Management:

  • ESP32 and camera module can get quite hot
  • Added heat sinks to both components

Detection Accuracy:

  • Excellent performance on Prime Video ads
  • False positives on Netflix UI elements
  • Struggles with new ad layouts not in training data
  • More robust in consistent lighting conditions

Components Used

Here’s what you’ll need to replicate this project:

  • ESP32-S3 Sense - $14
  • Heat sinks - $6
  • 3D printed mount (35g PLA) - $0.40
  • Any compatible 5V USB power supply
  • Home Assistant setup with Sonos integration

Total cost: ~$20.40

Final Thoughts

This weekend project successfully automated a daily annoyance, but there’s room for improvement:

Current Limitations:

  • Only works with Prime Video ad layouts
  • Requires specific camera positioning
  • Needs manual reset occasionally
  • Limited by single-modality detection (vision only)

Future Improvements:

  • Train on multiple streaming services
  • Add audio-based ad detection using the XIAO ESP32-S3’s microphone
  • Create a multi-modal model (combine audio and video)
  • Make the mute mechanism more universal with an IR blaster

Power Management:

  • Added a Kasa smart plug with Home Assistant integration
  • Automatically powers on/off with the TV
  • Prevents unnecessary runtime when TV is off

The XIAO ESP32-S3 Sense proved to be a capable platform for this type of edge ML project.

Despite its limitations, the system reliably mutes ads and provides a better viewing experience. The real-time monitoring dashboard helps track performance and catch issues early.

For future projects, I’m particularly interested in exploring the ESP32-S3’s audio capabilities and potentially creating a more sophisticated multi-modal detection system.