Show HN: Emotional photoreal AI humans at $0.06 / min
Hey HN, we're the co-founders of Keyframe Labs. We train photoreal AI human models you can FaceTime with (try it for yourself here: https://demo.keyframelabs.com). Notably, our models run at 60fps on a single consumer GPU (a 4090).
Today we're shipping our new model, persona-1.5-live, the first to achieve both photorealism and emotion at conversational latency. We see this as a significant step toward passing the "Turing test" for video agents.
Here's an unedited demo video of a conversation between Cosmo and one of us: https://www.loom.com/share/406534ea9991458cb64030df02e565de
You can also FaceTime Cosmo yourself here: https://demo.keyframelabs.com. Try asking him for a sad Shakespearean monologue. Or for a "funny" dad joke.
Voice has emerged as a primary conversational interface across industries, and we think video is the next leap. In our early pilots, face-to-face interaction drives measurably better outcomes in things like sales training and language learning.
Our constraints from day one have been:
- Make meaningful progress towards beating the uncanny valley
- Run in real time, with world-scale efficiency
In training persona-1.5-live, we didn't have access to giant clusters or hyperscaler budgets. This forced quite a bit of innovation in how we approached diffusion:
- An aggressively lightweight architecture
- Training tricks to squeeze signal out of limited data
Perhaps the most surprising finding was that, for our problem space, representation quality can be a viable substitute for scale. We spent an inordinate amount of time crafting a from-scratch latent space for persona-1.5-live to keep identity and emotion stable given our compute and data constraints.
The result: photoreal AI humans with emotion and real-time latency, priced at just $0.06 per minute.
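At that price, session costs are easy to estimate. A trivial sketch (the per-minute rate is the only number taken from the post; the helper name is ours):

```python
# Estimate session cost at the quoted $0.06 per minute.
PRICE_PER_MIN = 0.06  # USD per minute, from the post above

def session_cost(minutes: float) -> float:
    """Cost in USD for a session of the given length in minutes."""
    return minutes * PRICE_PER_MIN

# A 10-minute call comes to about $0.60; a full hour about $3.60.
print(f"10 min: ${session_cost(10):.2f}")
print(f"60 min: ${session_cost(60):.2f}")
```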
If you're interested in building with our API, see the docs here: https://docs.keyframelabs.com. It's free to get started.
Excited to see what y'all think! We'll hang out in the comments.

---

Comment: This is great and exciting. I've been doing some research into memory-efficient diffusion models myself. I haven't built a demo yet, but I've been looking at a mix of architectures from several papers (IMTalker, SageAttention, FlashVSR, and Sparse VideoGen) with the goal of getting memory down to about 8 GB. The plan was to swap FlashAttention out, and SVG could also have improved the audio driver. At 60fps, I suspect you're already doing something like this. Great work.

Comment: This looks great! The emotion snapping works pretty well. How does the avatar know when to switch between emotions?

Reply: Glad you liked it! Currently it's text-driven: the incoming audio is transcribed and mapped to one of our emotion codes, which biases the generation toward that emotion. It's not foolproof, but we've found it works pretty well in practice.
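For intuition, the text-to-emotion-code step behaves roughly like this toy keyword-vote sketch. All names here are illustrative, not the actual Keyframe Labs pipeline, and the real mapping is presumably learned rather than keyword-based; this only shows the classification step, not how the code conditions generation:

```python
# Toy sketch: map a transcript to a discrete emotion code by keyword vote.
# EMOTION_KEYWORDS and classify_emotion are hypothetical names for
# illustration only; matching is naive (whitespace split, no stemming).

EMOTION_KEYWORDS = {
    "sad": {"sad", "sorry", "grief", "alas", "farewell"},
    "happy": {"joke", "funny", "great", "laugh", "haha"},
    "neutral": set(),
}

def classify_emotion(transcript: str) -> str:
    """Return the emotion code whose keyword set best matches the transcript."""
    words = set(transcript.lower().split())
    best, best_score = "neutral", 0
    for code, keywords in EMOTION_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = code, score
    return best

print(classify_emotion("give me a sad shakespearean monologue"))
print(classify_emotion("tell me a funny dad joke"))
```

The chosen code would then be fed to the generator as a conditioning signal, biasing the avatar's expression toward that emotion.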