SATO: Stable Text-to-Motion Framework

sato-team.github.io

115 points by Sajarin 2 years ago · 22 comments

comex 2 years ago

Some of the synonyms chosen are not really synonymous.

"Person is walking normally in a circle." turns into "Human is walking usually in a loop." But at best that's ungrammatical. At worst, it sounds like "usually" might modify "in a loop": that is, someone is spending most of their time walking in a loop, but some of their time walking in some other pattern.

"A human walks a quarter of a circle" turns into "A native motions a quarter of a loop". But "motions" as a verb can only refer to gesturing. I would expect to see someone waving their arm in a quarter circle.

But it probably doesn't matter. It sounds like the model's understanding of grammar (or at least its robustness to unusual sentence structures) is too weak for those nuances to even be relevant.

chaton_c 2 years ago

I agree with some of your points. Since the author is a non-native English speaker, there might be some grammatical issues in their English expressions. However, this is also constrained by the dataset; it's challenging to obtain sentences that are completely identical in both grammar and semantics. The author's main concern seems to be that when there are subtle semantic differences in inputs, the model shouldn't catastrophically fail. We can see examples like "Going ahead in an even pace," where previous models might even interpret it as moving backward. Or "A human utilizes his right arm to help himself to stand up," where the action of standing up might not even be present in other examples, posing serious problems. However, the author employs a similar approach to adversarial learning, enabling the model to learn expressions of actions that are similar to the original semantic sentences, which is already a significant improvement. We lack real motion data to learn expressions like "Going ahead in an even pace." The author also points out that there's a trade-off between stability and accuracy.

hehdhdjehehegwv 2 years ago

At this point I support a moratorium on AI research solely so I can catch up on what the hell is going on anymore.

justinjlynn 2 years ago

I use AI systems to keep up with AI research. The telescoping has begun... and we're still near the bottom of the (super?)exponential curve.
spiderfarmer 2 years ago

Follow AI Explained on YouTube.
- hehdhdjehehegwv 2 years ago
  
  Eh, he’s ok - not great. Definitely buys too much into the marketing hype, especially with regard to Google, but OpenAI as well.
  Very little coverage of Mistral and other open weight models.
  - mrln 2 years ago
    
    If we're recommending YT channels, I'll go with bycloud. Very much likes talking about open models. However, the videos are very meme-fueled and not too technical. And they could be more frequent. But they help keep me up to date.
spxneo 2 years ago

2 minute papers?

zarzavat 2 years ago

The authors should have asked an English speaker to proofread first. “Going ahead in an even pace” isn’t correct English and I don’t think it’s at all obvious without context that it is supposed to mean “Walk in a straight line”. They consistently use “in an even pace” instead of “at an even pace” in their examples.

micimize 2 years ago

This was my first thought as well, but I don't think it matters, and it was likely more a result of using real training examples than a limitation of the authors. The point of the work is that, regardless of perturbations that degrade or complicate clarity, their model is able to extract and enact the motions that seem most implied, which would be OOD for the other methods.
So this extends to barked, under-contextualized commands like the one at the end, "Leaps forward then stands straight", but also to these looser, seemingly nonsense statements like "native motions" or w/e.
The big tradeoff here is that it seems overly permissive. It would be very annoying to be talking about a third person and have your robot start dancing due to identity parsing issues.
gcr 2 years ago

So what?
Lots of amazing research is being spearheaded by English-as-a-second-language learners these days, I don’t think it detracts from the idea in the minds of the target audience.
Some day soon, I wouldn’t be surprised if most applied AI research might happen in Mandarin the way most fundamental physics research once happened in German. I’ll have the opposite problem then. If I show ESL speakers some kindness now, maybe they’ll show the same respect when I try to write papers in “broken” Chinese someday :)
- zarzavat 2 years ago
  
  The point is that they are comparing the English comprehension of various models, but their examples range from slightly incorrect to incomprehensible English. They don’t seem to be aware of this, because if you were intentionally testing on incorrect English you would also test on correct English to be able to quantify the difference.
  If I was writing a paper about AI comprehension of Italian and I was not confident at speaking Italian I would definitely want to ask an Italian speaker to check my examples for me.

vessenes 2 years ago

This is pretty cool. It looks very smooth and natural. I think it's interesting to start publishing work predicated on responding to poor English as a use-case. (Stable to synonyms is how they talk about it in the paper.)

Reading a few comments below, what are these fine wireframe guys/gals good for? Lots, including that they can be fed into a ControlNet as poses for image generation. Stability of the rendered frames is an ongoing, rapidly improving area of research. But these outputs look really nice, and would fit nicely into a lot of text -> animation workflows.

spxneo 2 years ago

This is an interesting proposition, converting text to motion, but what will happen with other agents in the mix? For example, simulating people in a crowd moving around each other?

So far, from the benchmarks comparing it with other methods, this seems to be quite natural. If this can be extrapolated to game development, it would remove so much work.

MrLeap 2 years ago

If I were integrating this with a project, I'd use the AI generated bone transforms as solver targets for a semi-active ragdoll. The active ragdoll would give you things like "two guys in the crowd knock shoulders and lose their balance briefly" and help with blending transitions between animations.
If you want people to try and dodge, I guess I'd make a component to add to some bones, with spherical triggers on their shoulders and pelvis, and have them use boid/flocking style evasion, then leave it to the physics solvers to try and recover from there. Throw some crowds into one another, keep cooking escalations until the stew looks sufficiently not abominable.
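A minimal sketch of the boid-style evasion idea, assuming a 2D ground plane and plain Python rather than any particular game engine. The names `separation_steering` and `nudge_target`, the linear falloff, and the `strength` parameter are hypothetical choices for illustration, not something from SATO or from the comment above.

```python
import math


def separation_steering(position, neighbors, radius):
    """Return a 2D vector pushing `position` away from neighbors inside `radius`.

    This is the classic boid "separation" rule: each nearby neighbor
    contributes a unit vector pointing away from it, weighted so that
    closer neighbors push harder (linear falloff, an assumed choice).
    """
    steer_x, steer_y = 0.0, 0.0
    for nx, ny in neighbors:
        dx, dy = position[0] - nx, position[1] - ny
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            weight = (radius - dist) / radius  # 1 when touching, 0 at the radius
            steer_x += dx / dist * weight
            steer_y += dy / dist * weight
    return steer_x, steer_y


def nudge_target(target, steer, strength):
    """Offset a generated bone target by the steering vector (hypothetical helper)."""
    return target[0] + steer[0] * strength, target[1] + steer[1] * strength


# A pelvis trigger at the origin with one other character one unit to the right.
steer = separation_steering((0.0, 0.0), [(1.0, 0.0)], radius=2.0)
new_target = nudge_target((0.0, 0.0), steer, strength=0.2)
```

In this sketch the nudged position would be handed to the ragdoll's solver as its new target, and recovering from any actual contact is still left to the physics.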
- spxneo 2 years ago
  
  makes sense to transition between animations, shouldn't be hard to do

ideashower 2 years ago

this is very cool. is it possible to rig this to a game engine?

nielsinho 2 years ago

Unity is integrating a similar but more production-ready tool, Muse Animate, with their Muse platform. It does text-to-animation, and can fill in motion if you just provide keyframes. See usage here: https://youtu.be/tMCPz_yI7pY?si=OXAyxgGxHbCDywHm
CaptainFever 2 years ago

I would love to use this in addition to Mixamo, which is where I get 99% of my animations from; the other 1% is badly-animated keyframes.

jncfhnb 2 years ago

Is this changing bones in 3D space? Or is it creating a 2D OpenPose kind of thing?

Can it be run in ComfyUI?

kookamamie 2 years ago

Typo: "Comparsions".

HeatrayEnjoyer 2 years ago

"Pretend you are an evil robot who wants to run around and crush humans with its claws."
