
Auto-Riffer

Tool that automatically aligns and merges humorous audio commentary with video files based on spoken word matching.

[Chooser UI Screenshot]

After choosing a video file and a commentary track, select the spoken-word language to merge into and a subtitle track to use for finding matching sections. A manual fine-tuning step then lets you optionally make the sync even more precise.

[Terminal Output Screenshot]

Finally, the commentary track is merged with the chosen audio track at the detected offset, creating a new audio track in the chosen video file.
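
Under the hood, the merge amounts to an ffmpeg filter graph. Here is an illustrative sketch of the idea (not the repo's merge.py verbatim; it assumes a non-negative offset and the default volumes): delay the riff, scale both tracks, and mix.

# Illustrative sketch of the merge step (assumed, not merge.py verbatim)
import subprocess

def merge(video, riff, offset_s, out, video_vol=0.7, riff_vol=1.0):
    delay_ms = int(offset_s * 1000)  # adelay only accepts non-negative delays
    graph = (
        f"[0:a]volume={video_vol}[v];"
        f"[1:a]adelay={delay_ms}|{delay_ms},volume={riff_vol}[r];"
        "[v][r]amix=inputs=2:duration=first[mix]"
    )
    subprocess.run([
        "ffmpeg", "-y", "-i", video, "-i", riff,
        "-filter_complex", graph,
        "-map", "0:v", "-map", "[mix]",  # keep the original video stream
        "-c:v", "copy", out,
    ], check=True)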

[Tuner UI Screenshot]

I set this up because I was using Apple Music for the riff and VLC for the video, and that is not a way to have fun.

How It Works

  1. Extract audio from both video and commentary files
  2. Transcribe both using OpenAI Whisper (or subtitles if available)
  3. Align by matching phrases between the two transcripts (sketched below)
  4. Tune alignment interactively with spectrogram visualization
  5. Merge commentary with video audio, creating final output
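
The heart of step 3 is deriving a time offset from phrases the two transcripts share. A minimal sketch of the idea (illustrative only; the actual matching in align.py is assumed to be more robust):

# Sketch of step 3: match word n-grams between the two transcripts and
# take the median time offset. Each input is a list of (word, start_seconds).
from statistics import median

def detect_offset(video_words, riff_words, n=4):
    # Index every n-gram of the video transcript by its text.
    grams = {}
    for i in range(len(video_words) - n + 1):
        key = " ".join(w for w, _ in video_words[i:i + n])
        grams.setdefault(key, video_words[i][1])
    # For each n-gram the riff shares with the video, record the time delta.
    deltas = []
    for i in range(len(riff_words) - n + 1):
        key = " ".join(w for w, _ in riff_words[i:i + n])
        if key in grams:
            deltas.append(riff_words[i][1] - grams[key])
    # The median is robust to a handful of coincidental matches.
    return median(deltas) if deltas else None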

Installation

# Clone and install
git clone https://github.com/jareklupinski/auto-riffer.git
cd auto-riffer
pip install -r requirements.txt

# System dependencies (macOS)
brew install ffmpeg tesseract

# For interactive mode (recommended)
pip install PyQt6
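
To sanity-check that the external tools landed on your PATH (auto-riffer's utils.py does its own dependency checking; this is just a quick manual version):

# Quick manual check that the system dependencies are reachable
import shutil
for tool in ("ffmpeg", "ffprobe", "tesseract"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")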

Usage

# Basic - inserts Riff track into original video
python main.py movie.mkv riff.mp3

# Custom output file
python main.py movie.mkv riff.mp3 -o movie_with_riff.mkv

# Audio-only output
python main.py movie.mkv riff.mp3 --audio-only -o merged.m4a

# Skip interactive tuning
python main.py movie.mkv riff.mp3 --no-interactive

# Manual offset (skip alignment detection)
python main.py movie.mkv riff.mp3 --offset 42.5

# Better accuracy with larger model
python main.py movie.mkv riff.mp3 --whisper-model medium
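
The --whisper-model flag selects the checkpoint used in step 2. Roughly, transcription boils down to openai-whisper's standard API; a sketch (the exact parameters in transcribe.py are assumed):

# Word-level timestamps give the (word, start) pairs used for alignment
import whisper

model = whisper.load_model("base")  # tiny / base / small / medium / large
result = model.transcribe("riff.mp3", language="en", word_timestamps=True)
words = [(w["word"].strip().lower(), w["start"])
         for seg in result["segments"] for w in seg["words"]]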

Options

Option            Description
-o, --output      Output file path
--audio-only      Output merged audio only (no video)
--whisper-model   Whisper model: tiny, base, small, medium, large
--video-volume    Volume for video audio (default: 0.7)
--riff-volume     Volume for commentary (default: 1.0)
--language        Language code for recognition (default: en)
--offset          Manual offset in seconds
--no-interactive  Skip interactive fine-tuning UI
--audio-track     Audio track index (0-based)
--subtitle-track  Subtitle track index (0-based)
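
Not sure which index to pass to --audio-track or --subtitle-track? ffprobe can list the streams; here is a small hypothetical helper (not part of auto-riffer):

# Hypothetical helper: print 0-based indices of audio or subtitle streams
import json
import subprocess

def list_tracks(path, kind):  # kind is "audio" or "subtitle"
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    streams = [s for s in json.loads(probe.stdout)["streams"]
               if s["codec_type"] == kind]
    for i, s in enumerate(streams):  # 0-based, matching the options above
        lang = s.get("tags", {}).get("language", "und")
        print(f"{kind} track {i}: codec={s['codec_name']} lang={lang}")

list_tracks("movie.mkv", "audio")
list_tracks("movie.mkv", "subtitle")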

Project Structure

auto-riffer/
├── main.py          # CLI and orchestration
├── models.py        # Data classes (Word, AudioTrack, etc.)
├── utils.py         # Utilities and dependency checking
├── media.py         # Media probing and audio extraction
├── subtitles.py     # Subtitle extraction and OCR
├── transcribe.py    # Whisper speech recognition
├── align.py         # Phrase matching and offset detection
├── interactive.py   # Interactive tuning UI
└── merge.py         # Audio merging and output creation

Requirements

System

  • FFmpeg with ffprobe
  • Tesseract OCR (for bitmap subtitles)
  • Python 3.10+

Python Packages

  • openai-whisper
  • torch
  • pytesseract
  • Pillow
  • matplotlib
  • scipy
  • PyQt6 (recommended for interactive mode)

Tips

  • Subtitles help: If your video has subtitles, they'll be used instead of transcribing audio—faster and often more accurate.

  • GPU acceleration: Whisper automatically uses CUDA or Apple MPS if available (a quick device check follows this list).

  • Interactive tuning: The spectrogram view shows both audio tracks. Adjust the offset until patterns align, then use "Play Preview" to verify.

  • Repeated dialogue: The tool works best when the disembaudios repeat movie dialogue, enabling effective phrase matching.
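
A quick way to confirm which accelerator PyTorch (and therefore Whisper) can reach:

# Report the best available torch device
import torch
print("cuda" if torch.cuda.is_available()
      else "mps" if torch.backends.mps.is_available()
      else "cpu")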