Show HN: Emotional photoreal AI humans at $0.06 / min
Hey HN, we're the co-founders of Keyframe Labs. We train photoreal AI human models you can FaceTime with (try it for yourself here: https://demo.keyframelabs.com). Notably, our models run at 60fps on a single consumer GPU (a 4090).
Today we're shipping our new model, persona-1.5-live, the first to achieve both photorealism and emotion at conversational latency. We see this as a significant step toward passing the "Turing test" for video agents.
Here's an unedited demo video of a conversation between Cosmo and one of us: https://www.loom.com/share/406534ea9991458cb64030df02e565de
You can also FaceTime Cosmo yourself here: https://demo.keyframelabs.com. Try asking him for a sad Shakespearean monologue. Or for a "funny" dad joke.
Voice has emerged as a primary conversational interface across industries, and we think video is the next leap. In our early pilots, face-to-face interaction drives measurably better outcomes in things like sales training and language learning.
Our constraints from day one have been:
- Make meaningful progress towards beating the uncanny valley
- Run in real time, with world-scale efficiency
In training persona-1.5-live, we didn't have access to giant clusters or hyperscaler budgets. This forced quite a bit of innovation in how we approached diffusion:
- An aggressively lightweight architecture
- Training tricks to squeeze signal out of limited data
Perhaps the most surprising finding was that, for our problem space, representation quality can be a viable substitute for scale. We spent an inordinate amount of time crafting a from-scratch latent space for persona-1.5-live to keep identity and emotion stable given our compute and data constraints.
The result: photoreal AI humans with emotion and real-time latency, priced at just $0.06 per minute.
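At that price, session costs are easy to estimate. A trivial sketch (the per-minute rate is the only number taken from the post; the helper name is ours):

```python
# Estimate session cost at the quoted $0.06 per minute.
PRICE_PER_MIN = 0.06  # USD per minute, from the post above

def session_cost(minutes: float) -> float:
    """Cost in USD for a session of the given length in minutes."""
    return minutes * PRICE_PER_MIN

# A 10-minute call comes to about $0.60; a full hour about $3.60.
print(f"10 min: ${session_cost(10):.2f}")
print(f"60 min: ${session_cost(60):.2f}")
```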
If you're interested in building with our API, see the docs here: https://docs.keyframelabs.com. It's free to get started.
Excited to see what y'all think! We'll hang out in the comments.

---

Comment: This is great and exciting. I've been doing some research into memory-efficient diffusion models myself. I haven't built a demo yet, but I've been looking at a mix of architectures from several papers (IMTalker, SageAttention, FlashVSR, and Sparse VideoGen) with the goal of getting memory down to about 8 GB. The plan was to swap FlashAttention out, and SVG could also have improved the audio driver. At 60fps, I suspect you're already doing something like this. Great work.

Comment: This looks great! The emotion snapping works pretty well. How does the avatar know when to switch between emotions?

Reply: Glad you liked it! Currently it's text-driven: the incoming audio is transcribed and mapped to one of our emotion codes, which biases the generation toward that emotion. It's not foolproof, but we've found it works pretty well in practice.
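For intuition, the text-to-emotion-code step behaves roughly like this toy keyword-vote sketch. All names here are illustrative, not the actual Keyframe Labs pipeline, and the real mapping is presumably learned rather than keyword-based; this only shows the classification step, not how the code conditions generation:

```python
# Toy sketch: map a transcript to a discrete emotion code by keyword vote.
# EMOTION_KEYWORDS and classify_emotion are hypothetical names for
# illustration only; matching is naive (whitespace split, no stemming).

EMOTION_KEYWORDS = {
    "sad": {"sad", "sorry", "grief", "alas", "farewell"},
    "happy": {"joke", "funny", "great", "laugh", "haha"},
    "neutral": set(),
}

def classify_emotion(transcript: str) -> str:
    """Return the emotion code whose keyword set best matches the transcript."""
    words = set(transcript.lower().split())
    best, best_score = "neutral", 0
    for code, keywords in EMOTION_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = code, score
    return best

print(classify_emotion("give me a sad shakespearean monologue"))
print(classify_emotion("tell me a funny dad joke"))
```

The chosen code would then be fed to the generator as a conditioning signal, biasing the avatar's expression toward that emotion.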