Detecting Podcast Ads On a Phone

Here is the constraint that shaped everything: detect podcast ads in real-time, on a phone, without sending audio anywhere.

The detection has to happen fast enough that users can skip before the ad finishes its first sentence. And it has to run entirely on the device, using only what an iPhone gives you: the CPU and Neural Engine, no server in the loop.

This is a technical overview of how we built it.

Two models, one decision

The system uses two audio processing models. The first is fast and runs on everything. The second is slower but more accurate, and only runs when the first model is uncertain.

Think of it like a radiologist and a second opinion. Most scans are obviously fine or obviously not. Those get processed quickly. The ambiguous cases get escalated. You do not send every scan to the specialist, but you send the ones that matter.

This keeps inference fast for the common case while maintaining accuracy on the hard cases. But even when both models agree, all they give you is a probability per segment. Turning those probabilities into precise ad boundaries is the actual problem.
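
A minimal sketch of the cascade, in Swift. The AdScorer protocol, the model roles, and the 0.3–0.7 uncertainty band are stand-ins; the post does not say where the real cutoffs sit.

```swift
import Foundation

// Two-model cascade: trust the fast model when it is confident,
// escalate to the slower model only in the uncertain band.
protocol AdScorer {
    /// Probability in 0...1 that this audio window is an ad.
    func adProbability(for window: [Float]) -> Double
}

struct CascadeClassifier {
    let fast: any AdScorer       // small model, runs on every window
    let accurate: any AdScorer   // larger model, consulted only when needed
    let uncertainBand: ClosedRange<Double> = 0.3...0.7

    func score(_ window: [Float]) -> Double {
        let p = fast.adProbability(for: window)
        // Confident either way: take the fast answer and stop.
        guard uncertainBand.contains(p) else { return p }
        // Ambiguous: pay for the second opinion.
        return accurate.adProbability(for: window)
    }
}
```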

Raw detections are noisy

The ad detection model runs on sliding windows of audio, outputting a probability for each segment. Ad, not ad, or somewhere in between. This is the easy part. The hard part is turning thousands of noisy probability scores into clean, usable ad regions.

If you just threshold the raw output, you get garbage. The model might flag a segment at 4:32 as 87% ad, then 4:35 as 62% ad, then 4:38 as 91% ad. Is that one ad or three? Where does it start? Where does it end? The raw signal tells you something is happening. It does not tell you what to do about it.
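
To make the fragmentation concrete, here is what a bare threshold does to windowed scores. The WindowScore type, the 0.75 cutoff, and the 3-second window length are all illustrative.

```swift
import Foundation

// Illustrative only: the shape of raw model output and why a bare
// threshold fragments it. Assumes scores are sorted by time (seconds).
struct WindowScore {
    let time: TimeInterval
    let adProbability: Double
}

// Naive thresholding: every run of windows above the cutoff becomes a segment.
// Scores like 0.87, 0.62, 0.91 around a dip yield several short segments
// where the listener hears one continuous ad.
func naiveSegments(_ scores: [WindowScore],
                   cutoff: Double = 0.75,
                   windowLength: TimeInterval = 3) -> [ClosedRange<TimeInterval>] {
    var segments: [ClosedRange<TimeInterval>] = []
    for score in scores where score.adProbability >= cutoff {
        let window = score.time...(score.time + windowLength)
        if let last = segments.last, last.upperBound >= window.lowerBound {
            segments[segments.count - 1] = last.lowerBound...window.upperBound
        } else {
            segments.append(window)
        }
    }
    return segments
}
```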

This is a signal processing problem as much as a machine learning problem. The signal processing is where most of the work went.

Coarse-to-fine boundary detection

The naive approach is to classify every segment densely and look for transitions. This is expensive and produces flickering boundaries. Small fluctuations in confidence create phantom transitions. An ad that actually runs from 4:32 to 5:47 gets detected as starting at 4:35, briefly ending at 4:41, restarting at 4:43, and so on.

We use a coarse-to-fine strategy instead.

First pass: scan at wide intervals to find approximate ad regions. This is cheap. You are looking for areas where the signal is consistently elevated, not precise boundaries. Think of it as finding the general neighborhood.
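
A sketch of what that first pass might look like: score the episode at a wide stride and keep the stretches where the probability stays elevated. The 15-second stride and 0.6 threshold are placeholders, not the values that ship.

```swift
import Foundation

// Coarse pass: cheap, approximate regions only. Boundaries get refined later.
struct CoarseRegion { var start: TimeInterval; var end: TimeInterval }

func coarseScan(duration: TimeInterval,
                stride: TimeInterval = 15,
                threshold: Double = 0.6,
                score: (TimeInterval) -> Double) -> [CoarseRegion] {
    var regions: [CoarseRegion] = []
    var current: CoarseRegion?
    var t: TimeInterval = 0
    while t < duration {
        if score(t) >= threshold {
            // Extend the open region, or open a new one.
            if current == nil { current = CoarseRegion(start: t, end: t + stride) }
            else { current?.end = t + stride }
        } else if let region = current {
            regions.append(region)
            current = nil
        }
        t += stride
    }
    if let region = current { regions.append(region) }
    return regions
}
```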

Second pass: once you have a rough region, use binary search to find the exact start point. The model says "this is an ad" at 4:30 and "this is not an ad" at 4:00. So you check 4:15. Still not an ad. Check 4:22. Ad. Check 4:18. Not an ad. Check 4:20. Ad. You converge on the boundary in logarithmic time.

Same process for the end boundary, searching forward.
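
In code, the start-boundary search is a handful of lines, matching the 4:00 / 4:30 example above. The isAd closure stands in for a model call at a timestamp, and the one-second tolerance is illustrative.

```swift
import Foundation

// Binary search for the start boundary between a point known to be content
// and a point known to be an ad.
func refineStart(knownContent: TimeInterval,   // e.g. 4:00, model says "not ad"
                 knownAd: TimeInterval,        // e.g. 4:30, model says "ad"
                 tolerance: TimeInterval = 1.0,
                 isAd: (TimeInterval) -> Bool) -> TimeInterval {
    var lo = knownContent   // last point known to be content
    var hi = knownAd        // earliest point known to be ad
    while hi - lo > tolerance {
        let mid = (lo + hi) / 2
        if isAd(mid) { hi = mid } else { lo = mid }
    }
    return hi   // earliest point still classified as ad
}
```

The end boundary runs the same loop with the roles reversed, walking forward from the last point known to be an ad.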

This gives you precision without paying for dense classification across the entire episode. Most of the audio is obviously not an ad. You do not need to run inference on it at high resolution.

Hysteresis prevents flickering

Even with binary search, you still have a flickering problem. The model might waver at boundaries. One sample says ad, the next says not ad, the next says ad again. If you react to every transition, you get rapid state changes that make no sense to users.

The solution is hysteresis. To enter an ad region, one detection is enough. To exit an ad region, you need multiple consecutive non-ad detections. The threshold is asymmetric on purpose.

Think of it like a thermostat. The heat turns on at 68 degrees but does not turn off until 72. This prevents the system from cycling rapidly when the temperature hovers near the threshold. Same principle, different domain.

The model might briefly waver at a boundary. The system state does not.
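
A sketch of the gate, with an illustrative exit count of three consecutive non-ad detections.

```swift
import Foundation

// Hysteresis: enter the ad state on a single detection, leave only after a
// sustained run of non-ad detections.
struct HysteresisGate {
    private(set) var inAd = false
    private var consecutiveNonAd = 0
    let exitCount: Int

    init(exitCount: Int = 3) { self.exitCount = exitCount }

    /// Feed one raw detection; returns the stabilized state.
    mutating func update(isAd: Bool) -> Bool {
        if isAd {
            inAd = true
            consecutiveNonAd = 0          // any ad detection resets the exit counter
        } else if inAd {
            consecutiveNonAd += 1
            if consecutiveNonAd >= exitCount {
                inAd = false              // leave only after a run of non-ad windows
                consecutiveNonAd = 0
            }
        }
        return inAd
    }
}
```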

Segment merging

Podcasters do not always read ads in one continuous block. They might pause, interject a thought, then continue. "This episode is brought to you by... oh, and speaking of productivity, I've been using this thing for months... anyway, use code PODCAST for 20% off."

Raw detection produces two ad segments with a brief content gap. But to the listener, this is one ad. Skipping the first half and playing the interstitial makes no sense.

So we merge adjacent segments when the gap between them is below a threshold. Two ad regions separated by three seconds become one ad region. The gap gets absorbed.

The merge threshold is not fixed. Short gaps get merged. Long gaps do not. There is a point where two separate ads are genuinely two separate ads, and merging them would be wrong. Finding that point required looking at a lot of real podcast data.
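
The merge step itself is small; the real work was picking the threshold. A sketch with a placeholder five-second gap:

```swift
import Foundation

// Merge adjacent ad segments separated by a short content gap.
struct AdSegment { var start: TimeInterval; var end: TimeInterval }

func mergeSegments(_ segments: [AdSegment], maxGap: TimeInterval = 5) -> [AdSegment] {
    let sorted = segments.sorted { $0.start < $1.start }
    var merged: [AdSegment] = []
    for segment in sorted {
        if var last = merged.last, segment.start - last.end <= maxGap {
            last.end = max(last.end, segment.end)   // absorb the gap
            merged[merged.count - 1] = last
        } else {
            merged.append(segment)
        }
    }
    return merged
}
```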

Snapping to natural boundaries

Even with precise boundary detection, you can still land in awkward places. An ad that starts mid-sentence sounds wrong when skipped. The audio cuts in at "...twenty percent off with code PODCAST" instead of at the beginning of the pitch.

We use audio analysis to find natural boundary points: pauses, breaths, sentence edges. The detected boundary is a starting point. The final boundary snaps to the nearest natural break within a small window.
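
A sketch of the snapping step. How the natural breaks are found (pause or silence detection) is its own topic; here they arrive as a list of timestamps, and the 1.5-second search window is illustrative.

```swift
import Foundation

// Snap a detected boundary to the nearest natural break within a small window.
// Falls back to the detected boundary if no break is close enough.
func snap(boundary: TimeInterval,
          toNearestOf breaks: [TimeInterval],
          within window: TimeInterval = 1.5) -> TimeInterval {
    let candidates = breaks.filter { abs($0 - boundary) <= window }
    return candidates.min { abs($0 - boundary) < abs($1 - boundary) } ?? boundary
}
```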

Users don't notice when boundaries are good. They absolutely notice when they're bad.

The pipeline

Put it all together and you have a four-stage pipeline:

  1. Coarse scan to find approximate ad regions.
  2. Binary search to refine boundaries.
  3. Hysteresis to stabilize state transitions.
  4. Merging and snapping to produce final segments.

Each stage can be tuned independently. Each stage has its own failure modes. The coarse scan might miss a short ad. The binary search might converge on a local minimum. The hysteresis might be too aggressive or not aggressive enough. The merging might combine things that should stay separate.
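
One way to make "tuned independently" concrete: each stage's knobs can sit side by side in one small configuration. The names and defaults below are illustrative, carried over from the sketches above.

```swift
import Foundation

// Hypothetical per-stage tunables; not the shipped values.
struct AdPipelineConfig {
    var coarseStride: TimeInterval = 15        // stage 1: scan interval
    var coarseThreshold: Double = 0.6          // stage 1: "consistently elevated"
    var boundaryTolerance: TimeInterval = 1.0  // stage 2: binary search stop condition
    var exitCount: Int = 3                     // stage 3: non-ad detections needed to exit
    var maxMergeGap: TimeInterval = 5          // stage 4: merge gaps shorter than this
    var snapWindow: TimeInterval = 1.5         // stage 4: search radius for natural breaks
}
```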

But the stages also compensate for each other. A slightly wrong coarse boundary gets corrected by binary search. A flickering binary search result gets smoothed by hysteresis. Gaps from overly conservative detection get closed by merging.

The system is robust because no single stage has to be perfect.

Deployment

First-run compilation to the Neural Engine happens once, then gets cached. The compilation step is invisible to users but critical for performance. Subsequent launches are fast.
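
Roughly, the compile-and-cache pattern with Core ML looks like the sketch below, assuming the model ships uncompiled and gets compiled on first launch. Core ML also does its own device-specific caching, so treat this as the shape of the idea rather than the exact code.

```swift
import CoreML
import Foundation

// Compile once, cache the compiled artifact, load from the cache afterwards.
// Cache location and file names are illustrative; error handling is minimal.
func loadAdModel(packagedModelURL: URL) throws -> MLModel {
    let fm = FileManager.default
    let cacheDir = try fm.url(for: .applicationSupportDirectory, in: .userDomainMask,
                              appropriateFor: nil, create: true)
    let cachedURL = cacheDir.appendingPathComponent("AdDetector.mlmodelc")

    // First run: compile for this device, then keep the result so later
    // launches skip this step entirely.
    if !fm.fileExists(atPath: cachedURL.path) {
        let compiledURL = try MLModel.compileModel(at: packagedModelURL)
        try fm.moveItem(at: compiledURL, to: cachedURL)
    }

    let config = MLModelConfiguration()
    config.computeUnits = .all   // let Core ML schedule onto the Neural Engine
    return try MLModel(contentsOf: cachedURL, configuration: config)
}
```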

We store intermediate results: every segment probability, every boundary decision, every merge operation. When the model improves, we can reprocess old episodes without re-downloading anything. The raw data is already there. Only the interpretation changes.
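
The stored intermediates can be as simple as one codable record per episode. The field names here are hypothetical; the point is that scores and decisions live separately from the audio, so the interpretation can be re-run.

```swift
import Foundation

// Illustrative shape for persisted detection data.
struct StoredDetection: Codable {
    let episodeID: String
    let modelVersion: String
    let windowScores: [Double]                        // per-window ad probability
    let windowStride: TimeInterval                    // seconds between windows
    let refinedBoundaries: [ClosedRange<TimeInterval>]   // after binary search
    let finalSegments: [ClosedRange<TimeInterval>]       // after merge and snap
}
```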

What we learned

The interesting problems were not where we expected them. Training the model was straightforward. Getting clean boundaries out of noisy detections was hard. The signal processing pipeline took more iteration than the model itself.

On-device constraints forced better engineering. When you cannot afford dense classification, you build coarse-to-fine pipelines. When you cannot afford flickering state, you add hysteresis. The limitations became features.

The system is not perfect. Some ads slip through. Some content gets flagged. But it is fast, it is private, and it runs entirely on your device.

That was the constraint. That is what we built.

For more on why we chose to build ad detection this way, rather than the approaches used by other services, see our post on earsay's approach to podcast ads.
