GitHub - tobilg/duckdb-whisper: Use whisper.cpp within DuckDB to translate / transpile speech to text

DuckDB Whisper Extension

A DuckDB extension for speech-to-text transcription using whisper.cpp, the C/C++ port of OpenAI's Whisper model.

Transcribe audio files directly from SQL queries in DuckDB, making it easy to process and analyze audio data alongside your other data.

Features

Transcribe audio files (WAV, MP3, FLAC, OGG, and more)
Live recording and transcription from microphone
Voice-to-SQL: speak natural language questions, get query results
Support for all Whisper models (tiny, base, small, medium, large)
Detailed transcription segments with timestamps and confidence scores
Automatic language detection or specify target language
Works with file paths, BLOB data, or remote URLs

Installation

Community Extension (Recommended)

INSTALL whisper FROM community;
LOAD whisper;

Download a Model

Models must be downloaded before use. They are stored in ~/.duckdb/whisper/models/.

mkdir -p ~/.duckdb/whisper/models

# Download tiny.en model (~75MB, fastest)
curl -L -o ~/.duckdb/whisper/models/ggml-tiny.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin

Check available models and download status:

SELECT * FROM whisper_list_models();

Quick Start

Transcribe an Audio File

-- Simple transcription
SELECT whisper_transcribe('audio.wav', 'tiny.en');

-- Get detailed segments with timestamps
SELECT * FROM whisper_transcribe_segments('audio.wav', 'tiny.en');

Setup Audio Device (Required for Recording)

Before using microphone recording or voice query features:

-- List all available audio input devices
SELECT * FROM whisper_list_devices();

-- Set the device ID (use a device_id from the list above)
SET whisper_device_id = 0;

-- Verify your microphone is working
SELECT whisper_mic_level(3);

Record and Transcribe

-- Record for 5 seconds
SELECT whisper_record(5, 'tiny.en');

-- Record until silence (max 30 seconds)
SELECT whisper_record_auto(30);

Voice-to-SQL (Experimental)

Speak natural language questions and get SQL results:

-- Create test data
CREATE TABLE customers (id INT, name VARCHAR, revenue DECIMAL);
INSERT INTO customers VALUES (1, 'Acme', 100000), (2, 'Beta', 50000);

-- Speak your question (e.g., "show all customers")
FROM whisper_voice_query();

Requires text-to-sql-proxy running locally. See Voice-to-SQL Feature for details.

Example Use Cases

Transcribe Remote Audio

INSTALL httpfs;
LOAD httpfs;

SELECT whisper_transcribe(content, 'tiny.en')
FROM read_blob('https://example.com/audio.mp3');

Batch Transcribe Multiple Files

SELECT file, whisper_transcribe(file, 'tiny.en') as transcript
FROM glob('audio/*.wav');

Search Within Transcriptions

SELECT * FROM whisper_transcribe_segments('meeting.wav', 'base.en')
WHERE text ILIKE '%action item%';

Translate Foreign Audio to English

SELECT whisper_translate('german_speech.mp3', 'small');

Generate Subtitles (SRT Format)

SELECT
    segment_id + 1 as id,
    printf('%02d:%02d:%02d,%03d',
        (start_time/3600)::int, ((start_time%3600)/60)::int,
        (start_time%60)::int, ((start_time - start_time::int) * 1000)::int
    ) || ' --> ' ||
    printf('%02d:%02d:%02d,%03d',
        (end_time/3600)::int, ((end_time%3600)/60)::int,
        (end_time%60)::int, ((end_time - end_time::int) * 1000)::int
    ) as timestamp,
    trim(text) as text
FROM whisper_transcribe_segments('video.mp4', 'small.en');

Available Models

Model	Size	Description
`tiny`	~75MB	Fastest, multilingual
`tiny.en`	~75MB	Fastest, English-only
`base`	~142MB	Fast, multilingual
`base.en`	~142MB	Fast, English-only
`small`	~466MB	Good balance, multilingual
`small.en`	~466MB	Good balance, English-only
`medium`	~1.5GB	High quality, multilingual
`medium.en`	~1.5GB	High quality, English-only
`large-v1`	~2.9GB	Best quality, multilingual
`large-v2`	~2.9GB	Best quality, multilingual
`large-v3`	~2.9GB	Best quality, multilingual
`large-v3-turbo`	~1.6GB	Fast + accurate, multilingual

Tip: English-only models (.en suffix) are optimized for English and perform better for English audio.

Supported Audio Formats

The extension uses FFmpeg for audio decoding:

WAV, MP3, FLAC, OGG/Vorbis, AAC/M4A, and many more

Audio is automatically converted to 16kHz mono as required by Whisper.

Function Reference

Transcription Functions

`whisper_transcribe(audio, [model])`

Transcribes audio and returns the full text.

SELECT whisper_transcribe('audio.wav', 'tiny.en');
SELECT whisper_transcribe(audio_blob, 'base.en') FROM audio_table;

`whisper_translate(audio, [model])`

Translates audio from any language to English.

SELECT whisper_translate('german_speech.mp3', 'small');

`whisper_transcribe_segments(audio, [model], [language], [translate])`

Returns a table of transcription segments with timestamps.

Column	Type	Description
segment_id	INTEGER	0-based segment index
start_time	DOUBLE	Start time in seconds
end_time	DOUBLE	End time in seconds
text	VARCHAR	Transcribed text
confidence	DOUBLE	Confidence score (0.0-1.0)
language	VARCHAR	Detected language code

Recording Functions

`whisper_list_devices()`

Lists available audio input devices.

SELECT * FROM whisper_list_devices();

`whisper_record(duration_seconds, [model], [device_id])`

Records audio from microphone and transcribes it.

SELECT whisper_record(5, 'tiny.en');

`whisper_record_auto(max_seconds, [silence_seconds], [model], [threshold], [device_id])`

Records until silence is detected or max duration reached.

SELECT whisper_record_auto(30, 2.0, 'tiny.en');

`whisper_record_translate(duration_seconds, [model], [device_id])`

Records audio and translates to English.

SELECT whisper_record_translate(5, 'small');

`whisper_mic_level(duration_seconds, [device_id])`

Check microphone amplitude levels. Use to determine appropriate silence threshold.

SELECT whisper_mic_level(3);
-- Returns: "Peak: 0.15, RMS: 0.02 (suggested threshold: 0.01)"

Model Management Functions

`whisper_list_models()`

Lists all available Whisper models and their download status.

SELECT * FROM whisper_list_models();
SELECT * FROM whisper_list_models() WHERE is_downloaded = true;

`whisper_download_model(model_name)`

Returns download instructions for a model.

Utility Functions

`whisper_version()`

Returns extension and whisper.cpp version info.

`whisper_check_audio(file_path)`

Validates that an audio file can be read.

`whisper_audio_info(file_path)`

Returns metadata about an audio file (duration, sample rate, channels, format).

Configuration

Configure settings using standard SET statements:

-- Model settings
SET whisper_model = 'small.en';
SET whisper_model_path = '/custom/path/models';
SET whisper_language = 'en';
SET whisper_threads = 4;

-- Recording settings
SET whisper_device_id = 0;
SET whisper_max_duration = 30;
SET whisper_silence_duration = 2;
SET whisper_silence_threshold = 0.005;

-- Voice query settings (if enabled)
SET whisper_text_to_sql_url = 'http://localhost:8080/generate-sql';
SET whisper_text_to_sql_timeout = 60;
SET whisper_voice_query_show_sql = true;

-- View current value
SELECT current_setting('whisper_device_id');

-- View all whisper settings
SELECT * FROM duckdb_settings() WHERE name LIKE 'whisper_%';

-- View all settings with whisper_get_config()
SELECT whisper_get_config();

-- Reset a setting to default
RESET whisper_device_id;

Available Settings

Setting	Type	Default	Description
`whisper_model`	VARCHAR	"base.en"	Whisper model name
`whisper_model_path`	VARCHAR	~/.duckdb/whisper/models	Model storage path
`whisper_language`	VARCHAR	"auto"	Target language code
`whisper_threads`	INTEGER	0	Processing threads (0=auto)
`whisper_device_id`	INTEGER	-1	Audio device ID (-1=default)
`whisper_max_duration`	DOUBLE	15.0	Max recording duration (seconds)
`whisper_silence_duration`	DOUBLE	1.0	Silence to stop recording (seconds)
`whisper_silence_threshold`	DOUBLE	0.001	Silence detection threshold
`whisper_verbose`	BOOLEAN	false	Show status messages during operations
`whisper_use_gpu`	BOOLEAN	true	Use GPU acceleration if available (Metal on macOS)
`whisper_ffmpeg_logging`	BOOLEAN	false	Show FFmpeg log output (warnings, info)
`whisper_text_to_sql_url`	VARCHAR	"http://localhost:4000/generate-sql"	Text-to-SQL proxy URL
`whisper_text_to_sql_timeout`	INTEGER	15	Proxy request timeout (seconds)
`whisper_voice_query_show_sql`	BOOLEAN	false	Show generated SQL in output
`whisper_voice_query_timeout`	INTEGER	30	Timeout for entire voice query operation (seconds)

Transcription Performance

Transcription speed depends heavily on your hardware and model choice.

macOS (Apple Silicon)

On macOS with Apple Silicon (M1/M2/M3/M4), the extension uses Metal GPU acceleration by default, providing significant speedups:

Mode	42-min Podcast	Speed
CPU only	~12 min	3.5x realtime
Metal GPU	~1 min	40x realtime

GPU acceleration is enabled by default. To disable it (e.g., for debugging):

SET whisper_use_gpu = false;

Linux

On Linux, the community extension uses CPU with all available cores. For best performance with CPU:

Use English-only models (.en suffix) when transcribing English audio
Use smaller models (tiny.en, base.en) for faster transcription
Ensure adequate CPU cores available

GPU Acceleration on Linux

The community builds are CPU-only because GPU support requires runtime drivers that vary by system. However, you can build from source with GPU acceleration:

Backend	GPU Vendor	Build Flag	Requirements
Vulkan	Any (NVIDIA, AMD, Intel)	`GGML_VULKAN=ON`	Vulkan SDK + GPU drivers with Vulkan ICD
CUDA	NVIDIA	`GGML_CUDA=ON`	CUDA Toolkit + NVIDIA drivers
ROCm/HIP	AMD	`GGML_HIP=ON`	ROCm stack

Building with Vulkan (recommended for portability):

# Install Vulkan SDK (Ubuntu/Debian)
sudo apt-get install libvulkan-dev vulkan-tools

# Build with Vulkan support
git clone --recursive https://github.com/tobilg/duckdb-whisper.git
cd duckdb-whisper
EXT_FLAGS="-DGGML_VULKAN=ON" VCPKG_TOOLCHAIN_PATH=~/vcpkg/scripts/buildsystems/vcpkg.cmake make release

Building with CUDA (NVIDIA GPUs):

# Requires CUDA Toolkit installed (https://developer.nvidia.com/cuda-toolkit)
EXT_FLAGS="-DGGML_CUDA=ON" VCPKG_TOOLCHAIN_PATH=~/vcpkg/scripts/buildsystems/vcpkg.cmake make release

Note: GPU builds require the corresponding runtime libraries on the target system. The whisper_use_gpu setting controls whether GPU acceleration is used at runtime.

Performance Tips

Choose the right model: tiny.en is ~10x faster than large-v3 with acceptable quality for many use cases
Use English-only models: .en models are optimized and faster for English audio
Local files are faster: Avoid network latency by downloading files first
Monitor with FFmpeg logging: Enable SET whisper_ffmpeg_logging = true to see audio decoding progress

Voice-to-SQL Feature

Speak natural language questions about your data and receive SQL query results.

Prerequisites

text-to-sql-proxy running locally
Audio device configured (see Setup Audio Device)

Voice Query Functions

`whisper_voice_to_sql([model], [device_id])`

Records voice, transcribes, and returns generated SQL without executing.

SELECT whisper_voice_to_sql();

`whisper_voice_query([model], [device_id])`

Records voice, generates SQL, executes it, and returns results.

FROM whisper_voice_query();

`whisper_voice_query_with_sql([model], [device_id])`

Same as above but includes _generated_sql and _transcription columns.

FROM whisper_voice_query_with_sql();

Voice Query Configuration

-- Set text-to-sql proxy URL (default: http://localhost:4000/generate-sql)
SET whisper_text_to_sql_url = 'http://localhost:8080/generate-sql';
SELECT current_setting('whisper_text_to_sql_url');

-- Set timeout (default: 30 seconds)
SET whisper_text_to_sql_timeout = 60;
SELECT current_setting('whisper_text_to_sql_timeout');

-- Show generated SQL in output
SET whisper_voice_query_show_sql = true;
SELECT current_setting('whisper_voice_query_show_sql');

Building from Source

For developers who want to build the extension locally.

Prerequisites

CMake 3.14+
C++17 compiler
vcpkg package manager

Setup vcpkg

First, clone and bootstrap vcpkg (one-time setup):

# Clone vcpkg (outside the project directory)
git clone https://github.com/microsoft/vcpkg.git ~/vcpkg
cd ~/vcpkg

# Bootstrap vcpkg
./bootstrap-vcpkg.sh  # macOS/Linux
# or
.\bootstrap-vcpkg.bat  # Windows

Install Dependencies

# Install required dependencies
~/vcpkg/vcpkg install ffmpeg sdl2 curl

Clone and Build

# Clone the repository
git clone --recursive https://github.com/tobilg/duckdb-whisper.git
cd duckdb-whisper

# Build with vcpkg toolchain
VCPKG_TOOLCHAIN_PATH=~/vcpkg/scripts/buildsystems/vcpkg.cmake make release -j8

Platform-Specific Notes

macOS

You may need to install additional tools:

brew install cmake ninja pkg-config

Ubuntu/Debian

Install build essentials:

sudo apt-get install build-essential cmake ninja-build pkg-config \
    nasm autoconf automake libtool

Windows

Use the Developer Command Prompt or PowerShell with vcpkg:

$env:VCPKG_TOOLCHAIN_PATH="C:\path\to\vcpkg\scripts\buildsystems\vcpkg.cmake"
make release

Build Options

# Build without audio recording
EXT_FLAGS="-DWHISPER_ENABLE_RECORDING=OFF" make release

# Build without voice-to-SQL
EXT_FLAGS="-DWHISPER_ENABLE_VOICE_QUERY=OFF" make release

Running Tests

make test_whisper        # All tests (requires tiny.en model)
make test_whisper_quick  # Quick tests (no model needed)

License

This extension is licensed under the MIT License.

Acknowledgments

OpenAI Whisper - Original Whisper model
whisper.cpp - C/C++ port of Whisper
DuckDB - The database engine
FFmpeg - Audio decoding