
Auto-Riffer

Tool that automatically aligns and merges humorous audio commentary with video files based on spoken word matching.

[Chooser UI Screenshot]

After choosing a video file and a commentary track, select the spoken-word language to merge into and a subtitle track to use for finding matching sections. A manual fine-tuning step then lets you optionally make the sync even more precise.

[Terminal Output Screenshot]

Finally, the commentary track is merged with the chosen audio track at the detected offset, creating a new audio track in the chosen video file.
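
Under the hood, the merge amounts to an ffmpeg filter graph. Here is an illustrative sketch of the idea (not the repo's merge.py verbatim; it assumes a non-negative offset and the default volumes): delay the riff, scale both tracks, and mix.

# Illustrative sketch of the merge step (assumed, not merge.py verbatim)
import subprocess

def merge(video, riff, offset_s, out, video_vol=0.7, riff_vol=1.0):
    delay_ms = int(offset_s * 1000)  # adelay only accepts non-negative delays
    graph = (
        f"[0:a]volume={video_vol}[v];"
        f"[1:a]adelay={delay_ms}|{delay_ms},volume={riff_vol}[r];"
        "[v][r]amix=inputs=2:duration=first[mix]"
    )
    subprocess.run([
        "ffmpeg", "-y", "-i", video, "-i", riff,
        "-filter_complex", graph,
        "-map", "0:v", "-map", "[mix]",  # keep the original video stream
        "-c:v", "copy", out,
    ], check=True)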

[Tuner UI Screenshot]

I set this up because I was using Apple Music for the riff and VLC for the video, and that is not a way to have fun.

How It Works

  1. Extract audio from both video and commentary files
  2. Transcribe both using OpenAI Whisper (or subtitles if available)
  3. Align by matching phrases between the two transcripts (sketched below)
  4. Tune alignment interactively with spectrogram visualization
  5. Merge commentary with video audio, creating final output
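
The heart of step 3 is deriving a time offset from phrases the two transcripts share. A minimal sketch of the idea (illustrative only; the actual matching in align.py is assumed to be more robust):

# Sketch of step 3: match word n-grams between the two transcripts and
# take the median time offset. Each input is a list of (word, start_seconds).
from statistics import median

def detect_offset(video_words, riff_words, n=4):
    # Index every n-gram of the video transcript by its text.
    grams = {}
    for i in range(len(video_words) - n + 1):
        key = " ".join(w for w, _ in video_words[i:i + n])
        grams.setdefault(key, video_words[i][1])
    # For each n-gram the riff shares with the video, record the time delta.
    deltas = []
    for i in range(len(riff_words) - n + 1):
        key = " ".join(w for w, _ in riff_words[i:i + n])
        if key in grams:
            deltas.append(riff_words[i][1] - grams[key])
    # The median is robust to a handful of coincidental matches.
    return median(deltas) if deltas else None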

Installation

# Clone and install
git clone https://github.com/jareklupinski/auto-riffer.git
cd auto-riffer
pip install -r requirements.txt

# System dependencies (macOS)
brew install ffmpeg tesseract

# For interactive mode (recommended)
pip install PyQt6
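
To sanity-check that the external tools landed on your PATH (auto-riffer's utils.py does its own dependency checking; this is just a quick manual version):

# Quick manual check that the system dependencies are reachable
import shutil
for tool in ("ffmpeg", "ffprobe", "tesseract"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")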

Usage

# Basic - inserts Riff track into original video
python main.py movie.mkv riff.mp3

# Custom output file
python main.py movie.mkv riff.mp3 -o movie_with_riff.mkv

# Audio-only output
python main.py movie.mkv riff.mp3 --audio-only -o merged.m4a

# Skip interactive tuning
python main.py movie.mkv riff.mp3 --no-interactive

# Manual offset (skip alignment detection)
python main.py movie.mkv riff.mp3 --offset 42.5

# Better accuracy with larger model
python main.py movie.mkv riff.mp3 --whisper-model medium
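
The --whisper-model flag selects the checkpoint used in step 2. Roughly, transcription boils down to openai-whisper's standard API; a sketch (the exact parameters in transcribe.py are assumed):

# Word-level timestamps give the (word, start) pairs used for alignment
import whisper

model = whisper.load_model("base")  # tiny / base / small / medium / large
result = model.transcribe("riff.mp3", language="en", word_timestamps=True)
words = [(w["word"].strip().lower(), w["start"])
         for seg in result["segments"] for w in seg["words"]]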

Options

Option            Description
-o, --output      Output file path
--audio-only      Output merged audio only (no video)
--whisper-model   Whisper model: tiny, base, small, medium, large
--video-volume    Volume for video audio (default: 0.7)
--riff-volume     Volume for commentary (default: 1.0)
--language        Language code for recognition (default: en)
--offset          Manual offset in seconds
--no-interactive  Skip interactive fine-tuning UI
--audio-track     Audio track index (0-based)
--subtitle-track  Subtitle track index (0-based)
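
Not sure which index to pass to --audio-track or --subtitle-track? ffprobe can list the streams; here is a small hypothetical helper (not part of auto-riffer):

# Hypothetical helper: print 0-based indices of audio or subtitle streams
import json
import subprocess

def list_tracks(path, kind):  # kind is "audio" or "subtitle"
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    streams = [s for s in json.loads(probe.stdout)["streams"]
               if s["codec_type"] == kind]
    for i, s in enumerate(streams):  # 0-based, matching the options above
        lang = s.get("tags", {}).get("language", "und")
        print(f"{kind} track {i}: codec={s['codec_name']} lang={lang}")

list_tracks("movie.mkv", "audio")
list_tracks("movie.mkv", "subtitle")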

Project Structure

auto-riffer/
├── main.py          # CLI and orchestration
├── models.py        # Data classes (Word, AudioTrack, etc.)
├── utils.py         # Utilities and dependency checking
├── media.py         # Media probing and audio extraction
├── subtitles.py     # Subtitle extraction and OCR
├── transcribe.py    # Whisper speech recognition
├── align.py         # Phrase matching and offset detection
├── interactive.py   # Interactive tuning UI
└── merge.py         # Audio merging and output creation

Requirements

System

  • FFmpeg with ffprobe
  • Tesseract OCR (for bitmap subtitles)
  • Python 3.10+

Python Packages

  • openai-whisper
  • torch
  • pytesseract
  • Pillow
  • matplotlib
  • scipy
  • PyQt6 (recommended for interactive mode)

Tips

  • Subtitles help: If your video has subtitles, they'll be used instead of transcribing audio—faster and often more accurate.

  • GPU acceleration: Whisper automatically uses CUDA or Apple MPS if available (a quick device check follows this list).

  • Interactive tuning: The spectrogram view shows both audio tracks. Adjust the offset until patterns align, then use "Play Preview" to verify.

  • Repeated dialogue: The tool works best when the disembaudios repeat movie dialogue, enabling effective phrase matching.
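
A quick way to confirm which accelerator PyTorch (and therefore Whisper) can reach:

# Report the best available torch device
import torch
print("cuda" if torch.cuda.is_available()
      else "mps" if torch.backends.mps.is_available()
      else "cpu")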