Denoising, speaker diarization, and transcription in a single streamlined process. It's perfect for transcribing podcasts, interviews, or any multi-speaker audio content, as long as the audio is clear. The output is a JSON file with the transcript, speaker labels, and timestamps.
## Features
- Audio Source Separation: Extract vocals from background music/noise
- Speaker Diarization: Identify and separate different speakers
- Transcription: Convert speech to text with timestamps
- Post-processing: Consolidate transcripts for readability
- Clean Display: Real-time progress updates without cluttering the console
- Step Skipping: Start the pipeline from any step
- Cross-platform: Supports Windows, macOS, and Linux
## Requirements
- Python 3.8+
- FFmpeg (for audio processing)
- CUDA-compatible GPU (recommended, but CPU mode available)
- Hugging Face token (optional, for enhanced speaker diarization accuracy)
## Installation

```bash
# Clone the repository
git clone https://github.com/nullwiz/audiopipe.git
cd audiopipe

# Install dependencies
pip install -r requirements.txt

# For macOS (with Homebrew)
brew install ffmpeg
```
## Usage

```bash
# Basic usage - runs all steps in sequence
python pipeline.py input.mp3

# Resume from a specific step:
python pipeline.py input.mp3 --start-step 2   # Skip separation, start from diarization
python pipeline.py input.mp3 --start-step 3   # Skip to transcription step

# Optional parameters:
python pipeline.py input.mp3 --num-speakers 3 --language en

# For very long audio files (>1 hour), use chopping mode:
python pipeline.py input.mp3 --chop   # Splits into 15-minute chunks for processing
```
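If you'd rather drive the pipeline from another Python script than from the shell, a thin `subprocess` wrapper is enough. A minimal sketch, assuming `pipeline.py` is in the working directory (this wrapper is not part of the project):

```python
# Hypothetical wrapper around the CLI shown above; not part of AudioPipe.
import subprocess
from typing import Optional

def run_pipeline(audio_path: str,
                 num_speakers: Optional[int] = None,
                 language: Optional[str] = None) -> None:
    """Run pipeline.py with the documented flags."""
    cmd = ["python", "pipeline.py", audio_path]
    if num_speakers is not None:
        cmd += ["--num-speakers", str(num_speakers)]
    if language is not None:
        cmd += ["--language", language]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

run_pipeline("input.mp3", num_speakers=3, language="en")
```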
## Pipeline Steps
The process consists of three main steps that can be run together or separately:
- **Separation (Step 1)**: Extracts vocals from the background using Demucs
  - Input: any audio/video file
  - Output: `output/combined_vocals.wav`
  - Note: files under 60MB are processed as a single unit; larger files are chunked automatically
- **Diarization (Step 2)**: Identifies the different speakers
  - Input: `output/combined_vocals.wav`
  - Output: `output/combined_vocals_diarized.json`
  - Tip: use `--num-speakers` for better results when the speaker count is known
- **Transcription (Step 3)**: Converts the complete audio to text, then maps speakers
  - Input: `output/combined_vocals.wav` and the diarization data
  - Output: `output/final_transcription.json`
  - Architecture: complete audio transcription → speaker mapping, with no chunking (see the sketch below)
  - Tip: specify a `--language` code for improved accuracy
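The speaker mapping in step 3 can be pictured as giving each transcribed segment the speaker whose diarized turn overlaps it most. A minimal sketch of that idea in Python (illustrative only; the pipeline's actual implementation may differ):

```python
# Sketch: attribute each transcribed segment to the diarized speaker
# with the greatest temporal overlap. Illustrative, not AudioPipe's code.
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_speakers(transcript_segments, diarized_segments):
    for seg in transcript_segments:
        best = max(
            diarized_segments,
            key=lambda d: overlap(seg["start"], seg["end"], d["start"], d["end"]),
        )
        seg["speaker"] = best["speaker"]
    return transcript_segments
```

Maximum-overlap assignment tolerates small timestamp drift between the transcriber and the diarizer, which is one reason whole-file transcription plus mapping works well.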
## Output Files Explained

The pipeline creates several files during processing, all stored in the `output/` directory:
### Audio Files

- `combined_vocals.wav`: extracted voices/speech from the input
- `combined_background.wav`: background music/noise separated from the input
- `speakers/SPEAKER_XX/*.wav`: individual audio segments for each speaker
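The per-speaker clips could also be re-cut yourself from the vocals track plus the diarization JSON described below. A hedged sketch using `pydub` (not necessarily how the pipeline does it):

```python
# Sketch: cut per-speaker clips from the vocals track using the
# diarization JSON. Assumes the pydub package is installed; the layout
# mirrors output/speakers/SPEAKER_XX/ described above.
import json
from pathlib import Path
from pydub import AudioSegment

vocals = AudioSegment.from_wav("output/combined_vocals.wav")
with open("output/combined_vocals_diarized.json") as f:
    diarization = json.load(f)

for i, seg in enumerate(diarization["segments"]):
    out_dir = Path("output/speakers") / seg["speaker"]
    out_dir.mkdir(parents=True, exist_ok=True)
    clip = vocals[int(seg["start"] * 1000):int(seg["end"] * 1000)]  # pydub slices in ms
    clip.export(str(out_dir / f"{i:04d}.wav"), format="wav")
```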
### JSON Files

- `combined_vocals_diarized.json`: speaker diarization results showing who speaks when

  ```json
  {
    "speakers": ["SPEAKER_01", "SPEAKER_02", ...],
    "segments": [
      {"speaker": "SPEAKER_01", "start": 0.0, "end": 2.5},
      {"speaker": "SPEAKER_02", "start": 2.7, "end": 5.1},
      ...
    ]
  }
  ```

- `final_transcription.json`: complete transcription with speaker attribution in chronological order

  ```json
  {
    "segments": [
      {"text": "Complete sentence or phrase", "start": 0.1, "end": 2.5, "speaker": "SPEAKER_01"},
      {"text": "Another speaker's response", "start": 2.7, "end": 5.1, "speaker": "SPEAKER_02"},
      {"text": "Continuing conversation", "start": 5.3, "end": 8.0, "speaker": "SPEAKER_01"},
      ...
    ]
  }
  ```
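Both files are plain JSON, so post-processing is straightforward. For example, printing `final_transcription.json` as a readable dialogue:

```python
# Print the final transcript as "[start-end] SPEAKER: text" lines.
import json

with open("output/final_transcription.json") as f:
    transcript = json.load(f)

for seg in transcript["segments"]:
    print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg["speaker"]}: {seg["text"]}')
```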
### Temporary Directories

- `separated/`: intermediate files from audio separation (preserved for resuming)
- `chunks/`: audio chunks when using `--chop` mode (preserved for debugging)
## Resuming from Steps
The presence of these files allows the pipeline to resume from different steps:
- If `combined_vocals.wav` exists, audio separation (step 1) can be skipped
- If `combined_vocals_diarized.json` exists, diarization (step 2) can be skipped
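In code, that resume check is nothing more than file-existence tests. A minimal sketch of the idea (the pipeline's actual logic may differ):

```python
# Sketch: choose the earliest step whose output is missing.
import os

def detect_start_step() -> int:
    if not os.path.exists("output/combined_vocals.wav"):
        return 1  # separation still needed
    if not os.path.exists("output/combined_vocals_diarized.json"):
        return 2  # diarization still needed
    return 3      # only transcription remains
```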
## Visualization Tools
AudioPipe includes tools to visualize your transcripts and generate interactive reports:
```bash
# Generate timeline visualization for a transcript
python visualize.py transcript output/final_transcription.json

# Generate interactive HTML report with audio playback
python visualize.py report output/final_transcription.json --audio output/combined_vocals.wav

# Visualize raw diarization (speaker timeline)
python visualize.py diarization output/combined_vocals_diarized.json
```
For best results:
- Use the HTML report for interactive exploration of longer content
- For very long audio (>1 hour), use `--chop` mode for processing
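If you want a custom plot instead of `visualize.py`, the diarization JSON is easy to chart yourself. A sketch assuming `matplotlib` is installed (not part of the project's tooling):

```python
# Sketch: draw a speaker timeline (one row per speaker) from the
# diarization JSON with matplotlib's broken_barh.
import json
import matplotlib.pyplot as plt

with open("output/combined_vocals_diarized.json") as f:
    data = json.load(f)

fig, ax = plt.subplots(figsize=(10, 1 + 0.5 * len(data["speakers"])))
for row, speaker in enumerate(data["speakers"]):
    spans = [(s["start"], s["end"] - s["start"])      # (offset, width) pairs
             for s in data["segments"] if s["speaker"] == speaker]
    ax.broken_barh(spans, (row - 0.4, 0.8))
ax.set_yticks(range(len(data["speakers"])))
ax.set_yticklabels(data["speakers"])
ax.set_xlabel("time (s)")
plt.tight_layout()
plt.savefig("speaker_timeline.png")
```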
## Supported File Formats

- Audio: `.mp3`, `.wav`, `.m4a`, `.flac`, `.ogg`
- Video (extracts audio): `.mp4`, `.mov`, `.avi`, `.mkv`
## Command-Line Options

### `pipeline.py`

```
python pipeline.py INPUT_AUDIO [OPTIONS]

Arguments:
  INPUT_AUDIO                  Path to input audio/video file

Options:
  --num-speakers, -n INT       Number of speakers (optional, auto-detected if not specified)
  --language, -l STRING        Language code for transcription (e.g., 'en', 'es', 'fr')
  --start-step, -s [1-3]       Start from step: 1=separation, 2=diarization, 3=transcription
  --chop, -c                   Split input audio into 15-minute chunks for processing
  --device, -d [cpu|cuda|mps]  Device to use for processing (auto-detected if not specified)
  --help                       Show this help message
```
### `dem.py` (Audio Separation: removes background noise)

```
python dem.py INPUT_FILE

Arguments:
  INPUT_FILE  Path to input audio/video file
```
### `diarize.py` (Speaker Diarization)

```
python diarize.py INPUT_AUDIO [OPTIONS]

Arguments:
  INPUT_AUDIO             Path to vocals audio file (usually output/combined_vocals.wav)

Options:
  --num-speakers, -n INT  Number of speakers (optional, auto-detected if not specified)
```
## macOS Support
For macOS users, there are two operation modes:
### CPU Mode (No CUDA)

For Macs without dedicated NVIDIA GPUs:

```bash
# Add this to your .bashrc or .zshrc
export PYTORCH_ENABLE_MPS_FALLBACK=1

# Run with the CPU-only flag
python pipeline.py input.mp3 --device cpu
```
### GPU Mode (Apple Silicon)

On M1/M2/M3 Macs, you can use Metal Performance Shaders (MPS):

```bash
# Install PyTorch with MPS support
pip install torch torchvision torchaudio

# Run with the MPS device
python pipeline.py input.mp3 --device mps
```
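For reference, the usual PyTorch pattern behind an auto-detected `--device` default looks like this (a sketch, not necessarily AudioPipe's exact logic):

```python
# Sketch: standard PyTorch device-selection order: CUDA, then Apple
# Silicon's MPS backend, then CPU.
import torch

def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():  # Apple Silicon GPUs
        return "mps"
    return "cpu"
```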
## Output Format
The final output is a JSON file with chronological segments:
```json
{
"segments": [
{
"text": "Transcript text for this segment",
"start": 0.5,
"end": 4.2,
"speaker": "SPEAKER_01"
},
{
"text": "Response from another speaker",
"start": 4.5,
"end": 7.8,
"speaker": "SPEAKER_02"
},
...
]
}
```
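The post-processing that consolidates transcripts for readability can be approximated by merging consecutive segments from the same speaker. A minimal sketch over this exact format:

```python
# Sketch: merge consecutive segments by the same speaker so the
# transcript reads as whole turns rather than fragments.
def consolidate(segments):
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            merged[-1]["end"] = seg["end"]           # extend the turn
            merged[-1]["text"] += " " + seg["text"]  # append the text
        else:
            merged.append(dict(seg))                 # start a new turn
    return merged
```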
## Troubleshooting

- **Audio Processing**:
  - Standard mode processes the complete audio file for best quality
  - For very long files (>1 hour), use `--chop` to split the audio into 15-minute chunks (see the sketch after this list)
  - If you run into memory errors, try `--device cpu`, which uses less memory
- **Transcription Accuracy**:
  - Specify the language with `--language` for better results
  - Transcribing the complete audio provides better context than chunking
  - Accuracy is best with clear audio and minimal background noise
- **Speaker Identification**:
  - If speakers are not identified correctly, try setting `--num-speakers`
  - Results are better when speakers have distinct voices and don't talk over each other
  - A Hugging Face token improves diarization accuracy but is not required
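For reference, the 15-minute chunking that `--chop` performs can be reproduced with ffmpeg's segment muxer. A hedged sketch (the pipeline's own chunker may differ):

```python
# Sketch: split an audio file into 15-minute WAV chunks using ffmpeg's
# segment muxer. Assumes ffmpeg is on PATH (already a requirement).
import os
import subprocess

def chop(audio_path: str, out_dir: str = "chunks") -> None:
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", audio_path,
         "-f", "segment", "-segment_time", "900",  # 900 s = 15 minutes
         os.path.join(out_dir, "chunk_%03d.wav")],
        check=True,
    )
```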
## Testing
The project includes a test suite for validating the pipeline functionality:
```bash
# Run basic integration tests
python -m pytest test/test_integration.py -v --integration

# Run the full pipeline test (slower)
python -m pytest test/test_integration.py::test_full_pipeline -v --integration --runslow
```
- **Full Pipeline Test**: use `--runslow` to run the complete pipeline test
- **Hugging Face Token**: for full testing, provide your token with `--hf-token` or set the `HUGGING_FACE_TOKEN` environment variable
For more details on testing, see `README.test.md`.
## Known bugs
- Transcript search stopped working at some point
- Some buttons are not working on the visualization page