In 2024, Kyutai released Moshi, a full-duplex voice-to-voice model with a practical latency of about 200 ms. This note collects the diagrams we drew while studying Moshi and its audio codec, Mimi, with pseudocode and equations for readers who want the extra detail.