SATO: Stable Text-to-Motion Framework
sato-team.github.io

Some of the synonyms chosen are not really synonymous.
"Person is walking normally in a circle." turns into "Human is walking usually in a loop." But at best that's ungrammatical. At worst, it sounds like "usually" might modify "in a loop": that is, someone is spending most of their time walking in a loop, but some of their time walking in some other pattern.
"A human walks a quarter of a circle" turns into "A native motions a quarter of a loop". But "motions" as a verb can only refer to gesturing. I would expect to see someone waving their arm in a quarter circle.
But it probably doesn't matter. It sounds like the model's understanding of grammar (or at least its robustness to unusual sentence structures) is too weak for those nuances to even be relevant.
I agree with some of your points. Since the author is a non-native English speaker, there may be grammatical issues in the English examples. But this is also constrained by the dataset; it's hard to obtain sentence pairs that are identical in both grammar and semantics.

The author's main concern seems to be that when inputs have subtle semantic differences, the model shouldn't fail catastrophically. We see examples like "Going ahead in an even pace," which previous models might even interpret as moving backward. Or "A human utilizes his right arm to help himself to stand up," where the standing-up action may not even appear in other examples, which poses serious problems. The author uses an approach similar to adversarial learning, letting the model learn expressions of actions that are semantically close to the original sentences, which is already a significant improvement. We lack real motion data to learn expressions like "Going ahead in an even pace." The author also points out that there's a trade-off between stability and accuracy.
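To make the idea concrete: the perturbation-consistency approach described above can be sketched as a regularizer that penalizes divergence between the model's features for the original prompt and for its synonym-perturbed variant. This is a minimal illustrative sketch, not the paper's actual loss; the function names, the mean-squared-error form, and the `lam` weight are all assumptions.

```python
import numpy as np

def consistency_loss(orig_feat, pert_feat):
    """Penalize divergence between the features of the original prompt
    and those of its synonym-perturbed variant (stability term).
    Illustrative MSE form; the real method may use a different metric."""
    diff = orig_feat - pert_feat
    return float(np.mean(diff ** 2))

def total_loss(task_loss, orig_feat, pert_feat, lam=0.1):
    # lam controls the stability/accuracy trade-off the commenter mentions:
    # larger lam -> more robust to rephrasings, potentially less accurate.
    return task_loss + lam * consistency_loss(orig_feat, pert_feat)

# Toy feature vectors standing in for "walks forward" vs. "goes ahead":
orig = np.array([0.2, 0.8, 0.1])
pert = np.array([0.3, 0.7, 0.1])
print(total_loss(1.0, orig, pert))
```

The stability term pulls the two representations together during training, so a rephrased prompt lands near the original in feature space rather than falling out of distribution.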
At this point I support a moratorium on AI research solely so I can catch up on what the hell is going on anymore.
I use AI systems to keep up with AI research. The telescoping has begun... and we're still near the bottom of the (super?)exponential curve.
Follow AI Explained on YouTube.
Eh, he’s ok - not great. He definitely buys too much into the marketing hype, especially with regard to Google, but OpenAI as well.
Very little coverage of Mistral and other open weight models.
If we're recommending YT channels, I'll go with bycloud. He very much likes talking about open models. The videos are very meme-fueled and not too technical, and they could be more frequent, but they help keep me up to date.
2 minute papers?
The authors should have asked an English speaker to proofread first. “Going ahead in an even pace” isn’t correct English, and I don’t think it’s at all obvious without context that it is supposed to mean “Walk in a straight line”. They consistently use “in an even pace” instead of “at an even pace” in their examples.
This was my first thought as well, but I don't think it matters; it's likely more a result of using real training examples than a limitation of the authors. The point of the work is that, regardless of perturbations that degrade or complicate clarity, their model can extract and enact the motion that seems most strongly implied, where the same input would be OOD for the other methods.
So this extends to barked, under-contextualized commands like the one at the end, "Leaps forward then stands straight," but also to these looser, seemingly nonsense statements like "native motions" or w/e.
The big tradeoff here is that it seems overly permissive. It would be very annoying to be talking about a third person and have your robot start dancing due to identity-parsing issues.
So what?
Lots of amazing research is being spearheaded by English-as-a-second-language learners these days, I don’t think it detracts from the idea in the minds of the target audience.
Some day soon, I wouldn’t be surprised if most applied AI research might happen in Mandarin the way most fundamental physics research once happened in German. I’ll have the opposite problem then. If I show ESL speakers some kindness now, maybe they’ll show the same respect when I try to write papers in “broken” Chinese someday :)
The point is that they are comparing the English comprehension of various models, but their examples range from slightly incorrect to incomprehensible English. They don’t seem to be aware of this, because if you were intentionally testing on incorrect English you would also test on correct English, to be able to quantify the difference.
If I were writing a paper about AI comprehension of Italian and I was not confident speaking Italian, I would definitely want to ask an Italian speaker to check my examples for me.
This is pretty cool. It looks very smooth and natural. I think it's interesting to start publishing work predicated on responding to poor English as a use-case. (Stable to synonyms is how they talk about it in the paper.)
Reading a few comments below: what are these fine wireframe guys/gals good for? Lots, including being fed into a ControlNet as poses for image generation. Stability of the rendered frames is an ongoing, rapidly improving area of research, but these outputs look really nice and would fit into a lot of text -> animation workflows.
This is an interesting proposition, converting text to motion, but what happens with other agents in the mix? For example, simulating a crowd of people moving around each other?
Judging from the benchmarks comparing with other methods, this seems quite natural. If it can be extrapolated to game development, it would remove so much work.
If I were integrating this with a project, I'd use the AI generated bone transforms as solver targets for a semi-active ragdoll. The active ragdoll would give you things like "two guys in the crowd knock shoulders and lose their balance briefly" and help with blending transitions between animations.
If you want people to try and dodge, I guess make a component I'd add to some bones, with a spherical trigger on their shoulders and pelvis, and have them use boid/flocking-style evasion, leaving it to the physics solvers to recover from there. Throw some crowds into one another and keep cooking escalations until the stew looks sufficiently not abominable.
makes sense to transition between animations, shouldn't be hard to do
this is very cool. is it possible to rig this to a game engine?
Unity is integrating a similar but more production-ready tool, Muse Animate, with their Muse platform. It does text-to-animation, and can fill in motion if you just provide keyframes. See usage here: https://youtu.be/tMCPz_yI7pY?si=OXAyxgGxHbCDywHm
I would love to use this in addition to Mixamo, which is where I get 99% of my animations from; the other 1% is badly-animated keyframes.
Is this changing bones in 3D space? Or is it creating a 2D open pose kind of thing?
Can it be run in comfyui?
Typo: "Comparsions".
"Pretend you are an evil robot who wants to run around and crush humans with its claws."