SAM Audio

AI RESEARCH FROM META

Introducing Meta
Segment Anything Model Audio (SAM Audio)

With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audiovisual source.

SAM AUDIO CAPABILITIES

SAM Audio separates target and residual sounds from any audio or audiovisual source—across general sound, music, and speech.

Text prompts

SAM Audio lets you use text prompts to describe the specific target audio you want to separate.

Visual prompts

SAM Audio lets you pick out and separate sounds by clicking on the part of the video where you hear them.

Span prompts

SAM Audio is the first model to introduce span prompting, in which you select the time span that contains the target audio.

Multi-modal prompts

SAM Audio gives you flexibility by unifying three prompt modalities (text, visual, and time span) in a single model.
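One way to picture how the three prompt types could travel together is as a single request object. The sketch below is purely illustrative: `SeparationPrompt`, its fields, and `span_to_frame_mask` are hypothetical names rather than part of any released SAM Audio API, and the rate of 25 frames per second is an arbitrary assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SeparationPrompt:
    """Hypothetical container for the three prompt modalities (not a real SAM Audio API)."""
    text: Optional[str] = None  # e.g. "dog barking"
    # Visual clicks as (time in seconds, x, y) inside the video frame.
    clicks: List[Tuple[float, float, float]] = field(default_factory=list)
    # Time spans as (start in seconds, end in seconds).
    spans: List[Tuple[float, float]] = field(default_factory=list)

    def modalities(self) -> List[str]:
        """Return the names of the modalities this prompt actually uses."""
        used = []
        if self.text:
            used.append("text")
        if self.clicks:
            used.append("visual")
        if self.spans:
            used.append("span")
        return used


def span_to_frame_mask(spans, duration_s, frames_per_second=25):
    """Mark which frames fall inside any prompted span (frame rate is an assumption)."""
    n_frames = int(round(duration_s * frames_per_second))
    mask = [False] * n_frames
    for start_s, end_s in spans:
        first = max(0, int(start_s * frames_per_second))
        last = min(n_frames, int(end_s * frames_per_second))
        for i in range(first, last):
            mask[i] = True
    return mask


# A prompt that combines a text description with one time span.
prompt = SeparationPrompt(text="dog barking", spans=[(1.0, 2.0)])
mask = span_to_frame_mask(prompt.spans, duration_s=4.0)
print(prompt.modalities(), sum(mask))
```

Keeping every modality optional reflects the idea that a prompt may use one modality or several at once. How the actual model encodes and fuses these inputs is not described here.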

A NEW WAY TO EXPERIENCE SOUND

State-of-the-art model for all sound

EVERYTHING

SAM Audio is a state-of-the-art, unified multimodal model that sets a new standard for audio separation, enabling users to isolate general sounds, music, and speech from complex mixtures using intuitive prompts.

GENERAL SOUNDS

Separates everyday sounds—like traffic or barking dogs—from complex audio using multimodal prompts for fast, intuitive noise removal.

MUSIC

Isolates instruments and vocals with high accuracy, leveraging text, visual, and time-based prompts to rival top music separation models.

SPEECH

Extracts speech from background noise, enabling clear speaker isolation and voice separation through flexible, intuitive prompts.

PERFORMANCE

State-of-the-art model performance

SAM Audio achieves state-of-the-art performance across all of its prompting capabilities.

SAM Audio performance chart

OUR APPROACH

Model architecture

SAM Audio is a generative separation model that extracts both target and residual stems from an audio mixture using text, visual, or temporal prompts. It is powered by a flow-matching Diffusion Transformer and operates in a DAC-VAE latent space, enabling high-quality joint generation of target and residual audio.
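To give a rough feel for what flow-matching generation in a latent space involves, here is a minimal toy sketch. It is not the SAM Audio implementation. The velocity function below is a hand-written stand-in that already knows the answer, whereas a real model would use a learned Diffusion Transformer conditioned on the mixture and the prompts. The Euler sampler, the step count, the tiny three-value latents, and the choice to concatenate target and residual into one vector are all assumptions made for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (assumption: straight-line paths).

    For the linear path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target is (target - x_t) / (1 - t).
    """
    return [(tg - xi) / (1.0 - t) for xi, tg in zip(x, target)]


def euler_sample(target, num_steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Joint generation: target and residual latents share one vector, so a single
# sampling pass produces both stems (illustrative layout, not the real one).
target_latent = [0.5, -1.0, 2.0]
residual_latent = [0.1, 0.3, -0.7]
joint = target_latent + residual_latent

sample = euler_sample(joint)
generated_target, generated_residual = sample[:3], sample[3:]
print(generated_target, generated_residual)
```

In a full system, the generated latents would then be passed through the DAC-VAE decoder to obtain waveforms. That step is omitted here.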

OUR APPROACH

Audiovisual Perception Encoder

SAM Audio Audiovisual Perception Encoder performance chart

PERFORMANCE

Introducing Perception Encoder Audio Video

Perception Encoder Audio Video (PE-AV) is a new open source model that brings audio capabilities to Meta's Perception Encoder.

THE SAM AUDIO EVALUATION DATASET

A first-of-its-kind audio separation OSS evaluation set

The SAM Audio release includes a first-of-its-kind open source (OSS) evaluation set for prompted audio separation and a judge model that correlates highly with human subjective evaluation.

SAM Audio dataset chart

Real world opportunities

"Artificial Intelligence has been a game changer for the disabled community and the use cases for AI-focused start-ups in our ecosystem are vast. By incorporating open source models like SAM Audio into their work, 2GI’s cohort participants can advance their missions while gaining competitive advantage, showcasing that disabled founders are on the cutting edge of technology."

- Diego Mariscal, CEO of 2gether-International

2gether-International empowers disabled founders with resources to launch high-impact startups. In partnership with Meta’s AI for Good team, 2GI leverages open AI models like SAM Audio to accelerate innovation for early-stage, founder-led AI companies.

"For years, Starkey has led the industry in applying artificial intelligence to revolutionize hearing technology. Our ground-breaking work continues to elevate what hearing aids can achieve, particularly in challenging listening situations like noisy environments and overlapping speech. With open models like SAM audio, we see tremendous opportunity to build on our innovations and further our mission to help people hear better and live better."

- Achin Bhowmik, Chief Technology Officer and Executive Vice President of Engineering at Starkey

Starkey is the global leader in hearing technology and the only global American-owned hearing aid manufacturer. Using AI, Starkey transforms hearing aids into smart health and communication devices, delivering innovative, connected solutions that enhance lives.
