
AI RESEARCH FROM META
Introducing Meta
Segment Anything Model Audio (SAM Audio)
With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audiovisual source.
SAM AUDIO CAPABILITIES
SAM Audio separates target and residual sounds from any audio or audiovisual source—across general sound, music, and speech.

Text prompts
SAM Audio lets you use text prompts to describe the specific target audio you want to separate.

Visual prompts
SAM Audio lets you pick out and separate sounds by clicking on the part of the video where you hear them.

Span prompts
SAM Audio is the first model to introduce span prompting: you select the time span that contains the target audio.

Multi-modal prompts
SAM Audio gives you flexibility by unifying three prompt modalities (text, visual, and time span).
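
As a rough illustration of how the three prompt modalities could be carried together, the sketch below bundles them in one structure and turns a time span into a per-frame mask. All names here (`SeparationPrompt`, `click_xy`, `span_to_frame_mask`, the frame rate) are hypothetical and chosen for illustration; this is not SAM Audio's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class SeparationPrompt:
    """Hypothetical container for the three prompt types (not SAM Audio's API)."""
    text: Optional[str] = None                            # e.g. "dog barking"
    click_xy: Optional[Tuple[int, int]] = None            # pixel clicked in a video frame
    span_seconds: Optional[Tuple[float, float]] = None    # (start, end) in seconds


def span_to_frame_mask(span_seconds, duration_seconds, frames_per_second):
    """Return a boolean mask that is True for frames inside [start, end)."""
    num_frames = int(round(duration_seconds * frames_per_second))
    start, end = span_seconds
    frame_times = np.arange(num_frames) / frames_per_second
    return (frame_times >= start) & (frame_times < end)


# A prompt that combines a text description with a time span.
prompt = SeparationPrompt(text="dog barking", span_seconds=(1.0, 2.5))

# Assumed values: a 4-second clip represented at 10 frames per second.
mask = span_to_frame_mask(prompt.span_seconds, duration_seconds=4.0, frames_per_second=10)
```

Any field left as `None` simply means that modality is not used, so the same structure covers text-only, click-only, span-only, and combined prompts.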
A NEW WAY TO EXPERIENCE SOUND
State-of-the-art model for all sound

EVERYTHING
SAM Audio is a state-of-the-art, unified multimodal model that sets a new standard for audio separation, enabling users to isolate general sounds, music, and speech from complex mixtures using intuitive prompts.


GENERAL SOUNDS
Separates everyday sounds—like traffic or barking dogs—from complex audio using multimodal prompts for fast, intuitive noise removal.


MUSIC
Isolates instruments and vocals with high accuracy, leveraging text, visual, and time-based prompts to rival top music separation models.


SPEECH
Extracts speech from background noise, enabling clear speaker isolation and voice separation through flexible, intuitive prompts.

PERFORMANCE
State-of-the-art model performance
SAM Audio outperforms previous state-of-the-art models across all of its prompting capabilities.
OUR APPROACH
Model architecture
SAM Audio is a generative separation model that extracts both target and residual stems from an audio mixture using text, visual, or temporal prompts. It is powered by a flow-matching Diffusion Transformer and operates in a DAC-VAE latent space, enabling high-quality joint generation of target and residual audio.
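
To make the flow-matching idea concrete, here is a minimal sketch of how sampling works in general: a latent starts as Gaussian noise and is moved along a velocity field in small Euler steps. Everything in it is an illustrative assumption rather than SAM Audio's implementation. `toy_velocity` stands in for the Diffusion Transformer (it cheats by knowing the clean latents, whereas a real model would predict the velocity from the mixture and the prompt), the latent shape is arbitrary, and the target and residual are simply stacked along the first axis to mirror the joint generation described above.

```python
import numpy as np


def toy_velocity(x, t, clean):
    """Hypothetical stand-in for the learned network.

    For the straight-line path x_t = (1 - t) * noise + t * clean,
    the ideal velocity at x_t is (clean - x_t) / (1 - t).
    """
    return (clean - x) / (1.0 - t)


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays below 1, so (1 - t) is never zero
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)

# Assumed latent layout: (stem, frames, channels), stem 0 = target, stem 1 = residual.
clean_latents = rng.standard_normal((2, 50, 8))
noise = rng.standard_normal((2, 50, 8))

sampled = euler_sample(lambda x, t: toy_velocity(x, t, clean_latents), noise)
target_latent, residual_latent = sampled[0], sampled[1]
```

In a full system the sampled latents would then be decoded back to waveforms by the latent codec's decoder; that step is omitted here.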

OUR APPROACH
Audiovisual Perception Encoder
PERFORMANCE
Introducing Perception Encoder Audio Video
PE-AV is a new open source model, bringing audio capabilities to Meta's Perception Encoder.
THE SAM AUDIO EVALUATION DATASET
A first-of-its-kind audio separation OSS evaluation set
Alongside SAM Audio, Meta is releasing a first-of-its-kind open source evaluation set for prompted audio separation, together with a judge model whose scores correlate highly with human subjective evaluation.
Real-world opportunities
"Artificial Intelligence has been a game changer for the disabled community and the use cases for AI-focused start-ups in our ecosystem are vast. By incorporating open source models like SAM Audio into their work, 2GI’s cohort participants can advance their missions while gaining competitive advantage, showcasing that disabled founders are on the cutting edge of technology."
- Diego Mariscal, CEO of 2gether-International
2gether-International empowers disabled founders with resources to launch high-impact startups. In partnership with Meta’s AI for Good team, 2GI leverages open AI models like SAM Audio to accelerate innovation for early-stage, founder-led AI companies.
"For years, Starkey has led the industry in applying artificial intelligence to revolutionize hearing technology. Our ground-breaking work continues to elevate what hearing aids can achieve, particularly in challenging listening situations like noisy environments and overlapping speech. With open models like SAM Audio, we see tremendous opportunity to build on our innovations and further our mission to help people hear better and live better."
- Achin Bhowmik, Chief Technology Officer and Executive Vice President of Engineering at Starkey
Starkey is the global leader in hearing technology and the only global American-owned hearing aid manufacturer. Using AI, Starkey transforms hearing aids into smart health and communication devices, delivering innovative, connected solutions that enhance lives.