GitHub - lyramakesmusic/latent-musicvis: music visualization via umap of stable audio latents

Latent Space Explorer

Interactive 3D visualization of audio latent spaces using Stable Audio VAE + UMAP.

Features

Encode audio to 64-dimensional latent vectors via Stable Audio VAE
UMAP projection to 3D for interactive visualization
Playback sync - click points to hear audio chunks, or play full song with animated playhead
Latent resynthesis - use one song's latents as a "codebook" to resynthesize another
K-means clustering with animated cluster regions
Rainbow line mode for temporal visualization
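
As a rough illustration of the rainbow line idea, the sketch below assigns each point a hue based on its position in time, so the trajectory shifts through the spectrum as the song progresses. This is a guess at the general approach, not the project's actual frontend code (which is JavaScript); the function name `rainbow_colors` is hypothetical.

```python
import colorsys

def rainbow_colors(n_points):
    """Return one (r, g, b) tuple per point, with hue advancing by time index."""
    if n_points <= 0:
        return []
    # Divide by n_points (not n_points - 1) so the last hue does not wrap back to red.
    return [colorsys.hsv_to_rgb(i / n_points, 1.0, 1.0) for i in range(n_points)]

# One color per latent point; the first point is pure red.
colors = rainbow_colors(4)
```

Each tuple holds floats in the range 0 to 1, which is the form most 3D libraries accept for vertex colors.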

Quick Start

# Install dependencies
pip install -r requirements.txt

# Set model paths (or place files in current directory)
export VAE_CONFIG_PATH="path/to/stable_audio_2_0_vae.json"
export VAE_CKPT_PATH="path/to/sao_vae_tune_100k_unwrapped.ckpt" # or the standard Stable Audio VAE checkpoint

# Run server
python server.py

Open http://localhost:8420 in your browser.

Environment Variables

Variable	Default	Description
`VAE_CONFIG_PATH`	`stable_audio_2_0_vae.json`	Path to VAE config JSON
`VAE_CKPT_PATH`	`sao_vae_tune_100k_unwrapped.ckpt`	Path to VAE checkpoint
`PORT`	`8420`	Server port
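
The table above amounts to a lookup with fallbacks. A minimal sketch of how such settings could be read, assuming plain `os.environ` access (the helper `read_settings` is hypothetical and may not match how `server.py` does it):

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "VAE_CONFIG_PATH": "stable_audio_2_0_vae.json",
    "VAE_CKPT_PATH": "sao_vae_tune_100k_unwrapped.ckpt",
    "PORT": "8420",
}

def read_settings(env=None):
    """Return the three settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["PORT"] = int(settings["PORT"])  # environment values are strings
    return settings

# With no overrides, the documented defaults apply.
settings = read_settings({})
```

Passing a dict instead of relying on the real environment keeps the helper easy to test.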

Getting the VAE Model

You need the Stable Audio VAE weights. Options:

Official weights from Stability AI (requires license)
Community fine-tunes from HuggingFace
Train your own using stable-audio-tools

If the VAE fails to load, the server falls back to mock mode, which substitutes feature-based pseudo-latents for real VAE latents.

Usage

Upload audio - drag & drop or click "Upload Audio"
Explore - drag to rotate, scroll to zoom, click points to hear chunks
Play - press Space or click Play to animate through the song
Resynth - click "🔀 Resynth" and upload a second audio file to resynthesize it using the first song's latents as a codebook
Adjust K - use the slider to change the number of cluster regions

Keyboard Shortcuts

Key	Action
`Space`	Play/Stop
`←`	Seek back 5s
`→`	Seek forward 5s

How Resynthesis Works

Based on Latent Resynthesis:

Source audio → encode → latents (your "codebook")
Target audio → encode → for each latent, find nearest neighbor in codebook
Replaced latents → decode → output

Result: the output keeps the target's structure but takes on the source's timbre.
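
The nearest-neighbor step can be sketched in a few lines. This is an illustration only: it assumes Euclidean distance (the README does not name the metric), uses toy 2-D vectors in place of the 64-dimensional latents, and leaves out the encode and decode stages. The function name `resynthesize` is hypothetical.

```python
import math

def resynthesize(codebook, target):
    """Replace each target latent with its nearest codebook latent (Euclidean)."""
    if not codebook:
        raise ValueError("codebook must contain at least one latent")
    return [min(codebook, key=lambda c: math.dist(c, t)) for t in target]

# Toy 2-D latents for illustration; real latents are 64-dimensional.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]   # from the source audio
target = [(0.9, 1.2), (4.0, 4.5), (-0.3, 0.1)]    # from the target audio
replaced = resynthesize(codebook, target)
# replaced == [(1.0, 1.0), (5.0, 5.0), (0.0, 0.0)]
```

The output has the same length and ordering as the target sequence, but every vector comes from the source's codebook. A brute-force search like this is quadratic in sequence length; a real implementation would more likely use a batched tensor operation.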

API Endpoints

Endpoint	Method	Description
`/encode_stream`	POST	Upload audio, stream encoding progress via SSE
`/audio_full`	GET	Get full loaded audio as WAV
`/play`	POST	Get the 2048-sample audio chunk at a given index
`/resynth`	POST	Resynthesize uploaded audio using current codebook
`/health`	GET	Server status
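
To relate a `/play` chunk index to a position in the audio, the arithmetic below may help. It assumes chunks are consecutive and non-overlapping, and it assumes a 44.1 kHz sample rate; the README states only the 2048-sample chunk size, so both are assumptions, and the helper names are hypothetical.

```python
CHUNK_SAMPLES = 2048  # chunk size stated in the endpoint table
SAMPLE_RATE = 44100   # assumption: the README does not state the sample rate

def chunk_bounds(index, chunk_samples=CHUNK_SAMPLES):
    """Return (start, end) sample offsets of a chunk, end exclusive."""
    start = index * chunk_samples
    return start, start + chunk_samples

def chunk_start_seconds(index, sample_rate=SAMPLE_RATE):
    """Return the chunk's start time in seconds at the assumed sample rate."""
    return chunk_bounds(index)[0] / sample_rate

# Chunk 3 covers samples 6144 up to (but not including) 8192.
bounds = chunk_bounds(3)
```

Under these assumptions each chunk lasts about 46 ms, so a five-second seek moves roughly 108 chunks.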

Tech Stack

Backend: FastAPI, PyTorch, torchaudio, UMAP
Frontend: Vanilla JS, Three.js (via CDN)
No build step - just run the server

License

MIT