# Auto-Riffer

A tool that automatically aligns and merges humorous audio commentary with video files by matching spoken words.

After choosing a video file and a commentary track, select the spoken-word language to merge into and a subtitle track to use for finding matching sections; an optional manual fine-tuning step lets you make the sync even tighter. Finally, the commentary track is merged with the chosen audio track at the correct offset, creating a new audio track in the chosen video file.

I set this up because I was using Apple Music for the riff and VLC for the video, and that is not a way to have fun.
## How It Works

- Extract audio from both the video and commentary files
- Transcribe both using OpenAI Whisper (or subtitles, if available)
- Align by matching phrases between the two transcripts (sketched below)
- Tune the alignment interactively with a spectrogram visualization
- Merge the commentary with the video audio, creating the final output
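The alignment step is the core trick. Below is a minimal sketch of the phrase-matching idea in Python; the helper names are hypothetical, not the actual `align.py` API. It indexes short word n-grams from each transcript by their start times and takes the median time difference of the phrases found in both.

```python
# Hypothetical sketch of phrase-matching offset detection; not the
# actual align.py implementation.
from statistics import median

def ngrams(words, n=3):
    """Map each n-word phrase to the start time of its first word."""
    table = {}
    for i in range(len(words) - n + 1):
        phrase = " ".join(w["text"].lower() for w in words[i:i + n])
        table.setdefault(phrase, words[i]["start"])
    return table

def detect_offset(video_words, riff_words, n=3):
    """Seconds to delay the riff so matched phrases line up."""
    video_phrases = ngrams(video_words, n)
    deltas = [video_phrases[p] - t
              for p, t in ngrams(riff_words, n).items()
              if p in video_phrases]
    if not deltas:
        raise ValueError("no matching phrases found")
    return median(deltas)

# Word-timestamp dicts, e.g. from a Whisper transcription:
video = [{"text": "open", "start": 10.0}, {"text": "the", "start": 10.3},
         {"text": "pod", "start": 10.5}, {"text": "bay", "start": 10.8}]
riff = [{"text": "open", "start": 2.0}, {"text": "the", "start": 2.3},
        {"text": "pod", "start": 2.5}, {"text": "bay", "start": 2.8}]
print(detect_offset(video, riff))  # 8.0
```

Using the median rather than the mean keeps a handful of coincidental phrase matches from skewing the result.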
## Installation

```bash
# Clone and install
git clone https://github.com/jareklupinski/auto-riffer.git
cd auto-riffer
pip install -r requirements.txt

# System dependencies (macOS)
brew install ffmpeg tesseract

# For interactive mode (recommended)
pip install PyQt6
```
## Usage

```bash
# Basic - inserts the riff track into the original video
python main.py movie.mkv riff.mp3

# Custom output file
python main.py movie.mkv riff.mp3 -o movie_with_riff.mkv

# Audio-only output
python main.py movie.mkv riff.mp3 --audio-only -o merged.m4a

# Skip interactive tuning
python main.py movie.mkv riff.mp3 --no-interactive

# Manual offset (skip alignment detection)
python main.py movie.mkv riff.mp3 --offset 42.5

# Better accuracy with a larger model
python main.py movie.mkv riff.mp3 --whisper-model medium
```
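Under the hood, the merge step boils down to an ffmpeg filter graph. Here is a sketch of the kind of invocation it might perform (`adelay`, `volume`, and `amix` are real ffmpeg filters, but the exact command `merge.py` builds may differ):

```python
# Sketch: delay the riff by the detected offset, scale both volumes,
# and mix them into a new audio track alongside the originals.
import subprocess

def merge(video, riff, offset_s, out, video_vol=0.7, riff_vol=1.0):
    delay_ms = max(int(offset_s * 1000), 0)  # adelay wants milliseconds
    filt = (
        f"[1:a]adelay={delay_ms}|{delay_ms},volume={riff_vol}[riff];"
        f"[0:a]volume={video_vol}[vid];"
        f"[vid][riff]amix=inputs=2:duration=first[mix]"
    )
    subprocess.run([
        "ffmpeg", "-y", "-i", video, "-i", riff,
        "-filter_complex", filt,
        "-map", "0:v",      # keep the video stream
        "-map", "0:a",      # keep the original audio track(s)
        "-map", "[mix]",    # add the merged track
        "-c:v", "copy",     # no video re-encode
        out,
    ], check=True)

merge("movie.mkv", "riff.mp3", 42.5, "movie_with_riff.mkv")
```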
## Options

| Option | Description |
|---|---|
| -o, --output | Output file path |
| --audio-only | Output merged audio only (no video) |
| --whisper-model | Whisper model: tiny, base, small, medium, large |
| --video-volume | Volume for the video audio (default: 0.7) |
| --riff-volume | Volume for the commentary (default: 1.0) |
| --language | Language code for recognition (default: en) |
| --offset | Manual offset in seconds |
| --no-interactive | Skip the interactive fine-tuning UI |
| --audio-track | Audio track index (0-based) |
| --subtitle-track | Subtitle track index (0-based) |
## Project Structure

```
auto-riffer/
├── main.py         # CLI and orchestration
├── models.py       # Data classes (Word, AudioTrack, etc.)
├── utils.py        # Utilities and dependency checking
├── media.py        # Media probing and audio extraction
├── subtitles.py    # Subtitle extraction and OCR
├── transcribe.py   # Whisper speech recognition
├── align.py        # Phrase matching and offset detection
├── interactive.py  # Interactive tuning UI
└── merge.py        # Audio merging and output creation
```
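For orientation, the data classes named in `models.py` presumably have shapes along these lines (a hypothetical sketch based on the names above, not the actual definitions):

```python
# Hypothetical shapes for the classes named in models.py; the real
# definitions may differ.
from dataclasses import dataclass

@dataclass
class Word:
    text: str     # recognized token
    start: float  # start time in seconds
    end: float    # end time in seconds

@dataclass
class AudioTrack:
    index: int     # 0-based stream index, as used by --audio-track
    language: str  # language tag reported by ffprobe, e.g. "eng"
    codec: str     # e.g. "aac", "ac3"
```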
## Requirements

### System

- FFmpeg with ffprobe
- Tesseract OCR (for bitmap subtitles)
- Python 3.10+
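The dependency checking in `utils.py` can be as simple as probing the PATH; a minimal sketch (not the project's actual code):

```python
# Verify the required system tools are installed and on PATH.
import shutil

def check_dependencies():
    missing = [tool for tool in ("ffmpeg", "ffprobe", "tesseract")
               if shutil.which(tool) is None]
    if missing:
        raise RuntimeError(f"missing system tools: {', '.join(missing)}")

check_dependencies()
```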
### Python Packages

- openai-whisper
- torch
- pytesseract
- Pillow
- matplotlib
- scipy
- PyQt6 (recommended for interactive mode)
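pytesseract and Pillow are what make bitmap-subtitle OCR possible. A rough sketch of the idea, assuming the subtitle frames have already been rendered to image files (e.g. with ffmpeg; `subtitles.py` may work differently):

```python
# OCR a rendered bitmap-subtitle frame with pytesseract and Pillow.
from PIL import Image
import pytesseract

def ocr_subtitle_frame(path, language="eng"):
    """Return the text recognized in one rendered subtitle image."""
    image = Image.open(path)
    return pytesseract.image_to_string(image, lang=language).strip()

print(ocr_subtitle_frame("subtitle_frame_0001.png"))  # hypothetical file
```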
## Tips

- Subtitles help: If your video has subtitles, they'll be used instead of transcribing the audio, which is faster and often more accurate.
- GPU acceleration: Whisper automatically uses CUDA or Apple MPS if available (see the sketch after this list).
- Interactive tuning: The spectrogram view shows both audio tracks. Adjust the offset until the patterns align, then use "Play Preview" to verify.
- Repeated dialogue: Phrase matching works best when the disembaudios repeat movie dialogue.
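For reference, the standard PyTorch device-selection pattern Whisper relies on looks like this (a sketch; the tool picks the device automatically):

```python
# Pick the fastest available device, then load and run Whisper on it.
import torch
import whisper

if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon
else:
    device = "cpu"

model = whisper.load_model("base", device=device)
result = model.transcribe("riff.mp3", language="en", word_timestamps=True)
print(result["text"])
```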


