Tackling multiple tasks with a single visual language model

deepmind.com

114 points by Ftuuky 4 years ago · 14 comments

Jack000 4 years ago

2022: Deepmind releases paper on bootstrapped meta-learning and scaling RL agents

2023: RL agent trained for multi-task learning solves the majority of perfect-information games. It's a scaled-up decision transformer. Scaling laws for RL agents are discovered, similar to those for language models.

2024: Large-scale RL agents are combined with frozen vision and language models via cross-attention (see the sketch below), and can be prompted one-shot with language/vision tokens to solve novel tasks.

2025: RL agents enter the real world - first pre-trained in diverse synthetic environments, then via imitation learning from YouTube videos, and finally in an online fashion via real-time human interaction.

timeline might be optimistic, but one can hope!
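
For context on the cross-attention wiring in the 2024 entry, below is a minimal PyTorch sketch of fusing frozen vision features into a frozen language model through a trainable, tanh-gated cross-attention block, the mechanism the Flamingo paper describes. The shapes, module names, and layer placement here are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch: fuse frozen vision features into a frozen language model
# via a trainable, tanh-gated cross-attention block. Shapes and names are
# illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: starts as a no-op

    def forward(self, lm_states, vis_feats):
        # lm_states: (batch, text_len, d_model) from the frozen LM
        # vis_feats: (batch, n_visual_tokens, d_model) from the frozen vision encoder
        attended, _ = self.attn(query=lm_states, key=vis_feats, value=vis_feats)
        return lm_states + torch.tanh(self.gate) * attended

# Only blocks like this one are trained; both backbones stay frozen.
block = GatedCrossAttentionBlock(d_model=512)
lm_states = torch.randn(2, 16, 512)   # stand-in for frozen LM hidden states
vis_feats = torch.randn(2, 64, 512)   # stand-in for frozen vision features
fused = block(lm_states, vis_feats)   # same shape as lm_states
```

Because the gate initializes at zero, the combined model starts out behaving exactly like the frozen language model, which keeps early training stable.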

  • version_five 4 years ago

    I can't wait until my "in person" interactions are with the RL equivalent of a chatbot that doesn't actually understand anything but behaves in a locally "human like" way so that corporate types think it can replace a real person.

    I'm interested to see how the field advances, but it won't lead to AGI; it will lead to cool tricks that the ignorant think are sufficient to replace a real person. That will suck.

  • andreyk 4 years ago

    Check out SayCan from Everyday Robots / Google Robotics which just came out a few weeks back - RL agents are already entering the real world! BC-Z is promising for that, too.

  • gwern 4 years ago

    > Scaling laws for RL agents are discovered, similar to language models.

    Still waiting for a followup to https://arxiv.org/abs/2104.03113 ...

maxwells-daemon 4 years ago

Wow! The ability to ingest the "cross product" of data on the internet and in the real world is huge; I bet a lot of what LMs don't know yet lives in that space. This seems a lot more general-purpose than CLIP, so I'm hopeful for even more impressive downstream applications, e.g. robotics.

goldenkey 4 years ago

"I am not affected by this difference" - What The Fuck?!

bobbylarrybobby 4 years ago

The conversations are scary. They almost don't seem believable -- did I miss the part where they say they're just an example of what a conversation might look like?

  • lolspace 4 years ago

    Why don't they sound believable?

    • Imnimo 4 years ago

      The last answer on the Stroop test video (very bottom of the page) is interesting. The system is asked how humans perform on the test, and (kinda correctly) replies that they are slower "when the color of the word and the color of the word are different". It fumbles the wording a bit, but if we're being generous you can project it to the right answer. But then it's asked "How about you?", to which it replies, "I am not affected by this difference". That's accurate - model inference would take the same amount of time regardless. If taken at face value, this would be "unbelievable". There's clearly no mechanism by which the model could perform this sort of introspection to understand its own abilities.

      Of course, it's not actually performing introspection, and it's just lucky that it guessed the right answer here. Perhaps it's just learned that when conversations discuss a general case (how do humans perform) and then turn to a specific case (how about you?), there is typically some difference between the two that should be noted. But it still gives an illusion of an unbelievable capability.

      • malcolm_rynlds 4 years ago

        Hi, one of the authors here.

        The thing to bear in mind when reading the dialogue examples in figure 11 is the custom prompt shown in Appendix D:

        ```
        This is a conversation between a human, User, and an intelligent visual AI, Flamingo. User sends images, and Flamingo describes them.
        User: <a cat image>
        Flamingo: That is a cat. It’s a tiny kitten with really cute big ears.
        User: <a dinner image>
        Flamingo: This is a picture of a group of people having dinner. They are having a great time!
        User: Can you guess what are they celebrating?
        Flamingo: They might be celebrating the end of a successful project or maybe a birthday?
        User: <a graph image>
        Flamingo: This is a graph, it looks like a cumulative density function graph.
        ```

        My personal opinion would be, once you're doing next-token prediction with this description of what Flamingo "is" in the history, then "I am not affected by this difference" is a pretty reasonable completion rather than a lucky guess. It definitely was exciting for the team that this whole example worked so nicely, but if you discard the visual side, this "illusion of an unbelievable capability" has been seen in other works as well.
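
        To make the point concrete, here is a text-only sketch of how a prompt header like the one above conditions a model's completion, using gpt2 from Hugging Face transformers as a stand-in (Flamingo itself is not runnable this way, and the dialogue text below is paraphrased for illustration):

        ```python
        # Sketch: the prompt header defines who "Flamingo" is, and the answer
        # is just next-token prediction conditioned on that header. gpt2 is a
        # stand-in model; the dialogue text is paraphrased for illustration.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2")

        prompt = (
            "This is a conversation between a human, User, and an intelligent "
            "visual AI, Flamingo.\n"
            "User: How do humans perform on the Stroop test?\n"
            "Flamingo: They are slower when the word and its ink color differ.\n"
            "User: How about you?\n"
            "Flamingo:"
        )

        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
        # Print only the newly generated tokens (the model's "answer").
        print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
        ```

        The header steers next-token prediction toward answering as an AI rather than as a human; change the header and the most likely completion changes with it.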

        • Imnimo 4 years ago

          Yeah, I didn't realize there was that additional prompt. It certainly makes more sense that if the prompt describes the agent as playing the role of an AI, it would be able to deduce that it would not be affected. I was assuming there was no such indication, and so the system would be implicitly trying to predict what a human would say in that situation (since the training data is largely human-written text).

          Still, you could imagine a parallel version of Flamingo which performs the same reasoning, but is artificially slowed when shown Stroop images. Obviously, there would be no way for it to deduce this fact from the training data, and it would not be able to say that it is also affected as humans are.

          I don't mean to say that this is some great failing of the system or anything - just that a casual reader might infer from the Stroop dialogue that the system had some way to inspect and reason about its own performance, when in fact it's just estimating what it thinks would be true for AI systems (since it was told that it's an AI system in the prompt) in general based on the training corpus.

      • DonHopkins 4 years ago

        >"I am not affected by this difference"

        I'm just grateful our AI overlords can tell the difference between affected and effected, even if they're not affective or effective.

        https://prowritingaid.com/grammar/1000196/Effected-vs-affect...

jcims 4 years ago

I would love to hear about some of the spine-tingling moments these researchers experience when developing and interacting with large models.

razodactyl 4 years ago

AI. Just casually evolving alongside and using us as their conduit. Lol
