Show HN: Lemon Slice Live – Have a video call with a transformer model

195 points by lcolucci 15 days ago


Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We’ve trained a custom diffusion transformer (DiT) model that achieves video streaming at 25fps and wrapped it into a demo that allows anyone to turn a photo into a real-time, talking avatar. Here’s an example conversation from co-founder Andrew: https://www.youtube.com/watch?v=CeYp5xQMFZY. Try it for yourself at: https://lemonslice.com/live.

(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)

Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.

To achieve this demo, we had to do the following (among other things! but these were the hardest):

1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality, and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation. The distilled model achieves 25fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4k resolution.
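For readers curious what the distillation step looks like in practice, here is a toy PyTorch sketch, not our production code: the tiny model, step counts, and loss are placeholders. The idea is simply that a frozen many-step teacher produces targets that a few-step student learns to match.

    import torch
    import torch.nn as nn

    # Toy stand-in for a video DiT; architecture and dimensions are placeholders.
    class TinyDiT(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))

        def forward(self, x, t):
            # Fold the diffusion timestep in naively; real DiTs use timestep embeddings.
            return self.net(x + t[:, None].float() / 1000.0)

    teacher = TinyDiT().eval()   # frozen many-step model (assumed already trained)
    student = TinyDiT()          # few-step model we want to distill into
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

    def teacher_denoise(x, steps=50):
        """Run the teacher for many small denoising steps (slow but high quality)."""
        with torch.no_grad():
            for i in reversed(range(steps)):
                t = torch.full((x.shape[0],), i * (1000 // steps))
                x = x - teacher(x, t) / steps
        return x

    for _ in range(100):                               # toy loop over random latents
        noisy = torch.randn(8, 64)
        target = teacher_denoise(noisy)                # expensive teacher trajectory
        pred = noisy - student(noisy, torch.zeros(8))  # a single cheap student step
        loss = nn.functional.mse_loss(pred, target)    # student matches teacher output
        opt.zero_grad()
        loss.backward()
        opt.step()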

2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a clip by another 5 seconds by feeding the end of one chunk into the start of the next in an autoregressive manner. Unfortunately, the models experience quality degradation after multiple extensions due to accumulated generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences, significantly reducing artifact accumulation and allowing us to generate indefinitely long videos.
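For illustration, here is a toy sketch of the naive autoregressive extension loop described above (the chunk generator is stubbed out, and our consistency technique itself isn't shown). It is the compounding error in the re-fed context frames that causes the degradation we had to work around.

    import torch

    def generate_chunk(model, audio_feats, context_frames):
        """Hypothetical: generate one ~5s chunk conditioned on audio + prior frames.
        A real implementation would run the DiT's denoising loop here; stubbed out."""
        b, n, c, h, w = context_frames.shape
        return torch.randn(b, 125, c, h, w)   # 125 frames ~ 5 seconds at 25fps

    def generate_long_video(model, audio_chunks, first_frame, overlap=8):
        """Naive autoregressive extension: the last `overlap` frames of each chunk
        seed the next one. Without extra consistency tricks, small errors in those
        frames compound over successive chunks."""
        context = first_frame.unsqueeze(1).repeat(1, overlap, 1, 1, 1)
        chunks = []
        for audio in audio_chunks:
            chunk = generate_chunk(model, audio, context)
            chunks.append(chunk)
            context = chunk[:, -overlap:]      # tail of this chunk conditions the next
        return torch.cat(chunks, dim=1)

    # Example: a 1x3x32x32 "photo" extended over 4 audio chunks -> ~20s of video.
    video = generate_long_video(None, audio_chunks=range(4),
                                first_frame=torch.randn(1, 3, 32, 32))
    print(video.shape)  # torch.Size([1, 500, 3, 32, 32])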

3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar video call requires several building blocks beyond video generation: voice transcription, LLM inference, and text-to-speech. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to build a parallel processing pipeline that orchestrates everything via continuously streaming chunks. Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response; our target is <2 seconds.
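As a rough sketch of the idea, here is a plain-asyncio toy pipeline (not our actual Deepgram/Pipecat/Daily integration; stage names and delays are made up) showing how stages can process streaming chunks in parallel instead of waiting for a full response before starting the next step:

    import asyncio

    async def stage(name, inbox, outbox, delay):
        """Hypothetical pipeline stage: consume a chunk, 'process' it, pass it on."""
        while True:
            chunk = await inbox.get()
            if chunk is None:               # end-of-stream sentinel
                await outbox.put(None)
                return
            await asyncio.sleep(delay)      # stand-in for STT / LLM / TTS / video work
            await outbox.put(f"{name}({chunk})")

    async def main():
        # Queues connect the stages, so each stage can work on chunk N+1 while the
        # next stage is still handling chunk N; chunks stream through continuously.
        mic, text, speech, video = (asyncio.Queue() for _ in range(4))
        tasks = [
            asyncio.create_task(stage("transcribe", mic, text, 0.05)),
            asyncio.create_task(stage("llm", text, speech, 0.30)),
            asyncio.create_task(stage("tts+video", speech, video, 0.20)),
        ]
        for i in range(3):                  # pretend the user spoke three audio chunks
            await mic.put(f"audio{i}")
        await mic.put(None)
        while (frame := await video.get()) is not None:
            print(frame)                    # in the real system this streams to the call
        await asyncio.gather(*tasks)

    asyncio.run(main())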

More technical details here: https://lemonslice.com/live/technical-report.

Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what they see to create a more natural and engaging conversation.

We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!

We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.

djaychela - 14 days ago

Just talked with Max Headroom and Michael Scott - my wife is an Office fan so knows the references, and I know enough Max to ask the right things.

Overall, a fun experience. I think that MH was better than Scott. Max was missing the glitches and moving background but I'd imagine both of those are technically challenging to achieve.

Michael Scott's mouth seemed a bit wrong - I was thinking Michael J Fox but my wife then corrected that with Jason Bateman - which is much more like it. He knew Office references alright, but wasn't quite Steve Carell enough.

The default idle state while it was listening could do with some work, I think - that was the least convincing bit; for Max, he would have just glitched or even been completely still, I would think. Michael Scott seemed too synthetic at this point.

Don't get me wrong, this was pretty clever and I enjoyed it, just trying to say what I found lacking without trying to sound like I could do better (which I couldn't!).

zebomon - 15 days ago

This is impressive. The video chat works well. It is just a hair away from a very comfortable conversation. I'm excited to see where you have it a year from now, if it turns out to be financially viable. Good luck!

dang - 15 days ago

https://lemonslice.com/api/videos/video-XzDwIcW6QCvSIj1vX1Hu...

srameshc - 15 days ago

I am very much fascinated by this virtual avatar talking thing. I tried video-retalking https://github.com/OpenTalker/video-retalking just to see how far I could get making a talking avatar, but it is tremendously difficult. This holds tremendous possibilities, though, and I hope it eventually becomes cheaper to run such models. I know this is far superior and probably a lot different, but I hope to find open source solutions like Lemon Slice someday that I can experiment with.

lostmsu - 15 days ago

This is very impressive. Any details about model architecture and size? Input and output representation?

How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?

gitroom - 15 days ago

honestly this feels kinda huge - stuff like this is moving so fast, it's insane seeing it go real-time

mentalgear - 14 days ago

So basically the old open-source live-portrait hooked up with audio output. Was very glitchy and low res on my side. btw: Wondering if it's legal to use characters you don't have rights to. (how do you justify possible IP infringement)

bsenftner - 14 days ago

This is fantastic. I was the founder of the 3D Avatar Store, a company that was doing similar things 15 years ago with 3D reconstructions of people. Your platform is what I was trying to build back then, but at the time nobody thought such tech was possible, or they seriously wanted to make porn, and we refused. I'll try reaching out through channels to connect with your team. I come from a feature film VFX background, Academy Award quality work, so it would be interesting to discuss. Plus, I've not been idle since the 3D Avatar Store, not at all...

bigyabai - 15 days ago

> reducing delays and improving resolution (purpose-built ASICs will help)

How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.

elternal_love - 15 days ago

Hmm, plug this together with an app which collects photos and chats with a deceased loved one and you have a working Malachim. Might be worth a shot.

Impressive technology - impressive demo! Sadly, the conversation seems to be a little bit overplayed. Might be worth plugging ChatGPT or some better LLM into the logic section.

ashishact - 15 days ago

This is just brilliant. Hope you succeed, so that eventually I get an API to play with.

andrewstuart - 15 days ago

A really compelling experience.

It seems clumsy to use copyrighted characters in your demos.

Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.

Obviously games could use this.

snappyleads - 11 days ago

Hey, I tried this and enjoyed using the demo - looks really cool! Just curious how it compares to what character.ai recently put out as well.

NoScopeNinja - 15 days ago

Hey, this looks really cool! I'm wondering - what happens if you feed it something totally different like a Van Gogh painting or anime character? Have you tested any non-photo inputs?

wouterjanl - 15 days ago

Really cool stuff. It felt strangely real. Impressive!

benob - 15 days ago

Very nice. Are you planning a paper?

anishsikka - 14 days ago

this was overall fun. better than expected. i'm an office fan so tried dwight and michael scott. i hope you folks get better at this. excited to see where you get in the next 12 months or so. Godspeed!

sid-the-kid - 15 days ago

The system just crashed. Sorry! Working on getting things live again as fast as we can!

o_____________o - 14 days ago

Are you going to offer a web embeddable version of the Live offering?

inhumantsar - 14 days ago

love the demo video with Andrew. showing the potential as well as the delays and awkwardness of AI is refreshing compared to the heavily edited hype reels that are so common

aorloff - 15 days ago

Max Headroom lives !

movedx01 - 14 days ago

Watching Baron Harkonnen verbally write me code for a todo list in React was rather amusing, thanks.

tetris11 - 15 days ago

If you could lower the email signup for a few hours, that'd be nice. I'm not going to sign up for yet another service I'm unsure about.

doublerabbit - 15 days ago

"Try it now live" and then request me to enter my email.

I'll pass thanks.