Teaching GPT-3 to reverse words
twitter.com... and this is exactly what programming is - breaking down a task into steps that a computer can comprehend. I'm increasingly convinced that everyone should end up being a programmer. Plus, GPT-3 is not exactly a good tool for programming.
I use GPT-3 Codex daily when working. It saves me time, helps me explore unfamiliar languages and APIs, and generates approaches to solving problems. It can be shockingly good at coding in narrow contexts. It would be a mistake to miss the developments happening in this area.
I think people are misunderstanding my comment here.
I said “GPT-3 is not exactly a good tool for programming”, but that actually meant “GPT-3 is not exactly a good tool to program in”. OP implemented a string-reversing algorithm in GPT-3, and my comment was made in the exact same context. In other words, I was treating GPT-3 as a kind of programming language.
Well, a program is a series of tokens, and what is GPT-3 good at? Generating tokens. While that's oversimplifying, I feel like we're closer to automated programming than we realize.
Recently I wrote a python script to merge a bunch of videos with subtitle files using ffmpeg. It probably would have been faster to do it manually but I can imagine a world where telling GPT-5 to "Generate a python script that merges a folder of video files with subtitle files of the same name" is faster and more accessible than regular programming.
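A script like that is mostly a thin subprocess wrapper around ffmpeg. Here's a minimal sketch, assuming .mp4 videos with same-named .srt files in one folder (the extensions, output naming, and the mov_text subtitle codec are my assumptions, not the commenter's actual code):

```python
#!/usr/bin/env python3
"""Sketch: mux each video in a folder with its same-named .srt subtitle
file using ffmpeg. Assumed layout: movie.mp4 + movie.srt side by side."""
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video: Path, subs: Path, out: Path) -> list[str]:
    # -c copy avoids re-encoding the audio/video streams;
    # mov_text is the subtitle codec the MP4 container expects.
    return [
        "ffmpeg", "-i", str(video), "-i", str(subs),
        "-c", "copy", "-c:s", "mov_text", str(out),
    ]

def merge_folder(folder: Path) -> None:
    for video in folder.glob("*.mp4"):
        subs = video.with_suffix(".srt")
        if subs.exists():
            out = video.with_name(video.stem + ".merged.mp4")
            subprocess.run(build_ffmpeg_cmd(video, subs, out), check=True)

if __name__ == "__main__":
    merge_folder(Path("."))
```

Whether a future model spits this out faster than you can write it is exactly the open question.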
Yep. It's not hard to imagine mapping descriptive text to strictly deterministic operations like generating CSS/HTML for a front end, doing some definite data manipulation, or, at the least, turning a natural-language description into an SQL query.
Generating the tokens isn't the hard part. It's figuring out which tokens need to be generated in response to whatever solution needs to be coded. That's part communication, part comp sci, and part art.
But if the system learned from enough previous examples, you might get a better, more cohesive, and more artistic result. Though given the fact that Google can't filter obvious spam from my inbox, past performance is an indicator of future success.
Given how hard it is for humans to communicate effectively, I'm not sure we are so close. In essence, the hard part of most software is giving users something they want which is also correct.
Everyone knows how to use Google, but it takes a certain skill and knowledge to use Google effectively.
I think that sometime in the near future, knowing how to phrase something to GPT, DALLE, etc will be a very valuable skill for humans to have.
Indeed. I tried many prompts with mini DALL-E; the generated art is at insta/pramatias alongside the prompts. I actually didn't know that Instagram prohibits downloading the images, so they will be uploaded to additional sites. Is there any site with the beauty and simplicity of Instagram for uploading albums? DeviantArt is pretty bloated.
Actually, after thousands of prompts to mini DALL-E, I found that the more you treat the prompt as a programming language rather than natural language, the better and more accurate the results are. In that regard, operator-first works better, almost like Lisp. I tried prompts with parentheses, but the nesting didn't affect the results.
I think that with the modern bombardment of information, everyone needs to be an information analyst and programmer, an information analyst and engineer, an information analyst and doctor. DALL-E will help us construct images which follow mnemonic rules that can be represented in art. That way we can memorize many corners of the information we want to remember and not lose the plot of the project in question. Like an image for every function, or an image for every module, or for every enum and trait.
ColorForth existed in the past; most probably we can make an ArtForth with the speed and ease of modern tools.
It used to be a great skill when Google's behavior was reasonably static and predictable, and therefore learnable. Today, if you open two Google instances on your phone and computer, they'll both likely return different results. Move to the next city block and, again, same problem. Want to google the same query again? If the algorithm thinks you didn't find what you were looking for the first time, you'll once again get different results.
In this way I think these language transformers will be much better for searching information. Not because of their great comprehension abilities or indexing prowess, but because their behavior will be static and the training data reasonably good. Soon enough someone will find better ways to display their learned associations and they'll become great search engines (if you can index the content relevant to you that is).
100% agreed. I already see myself doing this with Github Copilot. If I write a comment or start a line of code in a certain way, I get a much better suggested code completion.
I feel like this is a given in a lot of sci-fi I read. "Jokester," an Asimov short story, is premised on people called "Grand Masters" who know how to ask the right questions of Multivac, the globe-spanning supercomputer that appears in a few of his stories.
I'm using GPT-3 to write Solr queries when my parsing fails, so I agree with this.
Part of the problem here is that GPT-3 has such a small vocabulary. It's 50K tokens, and many of those are either garbage, punctuation, or full words (rather than subwords).
I'd be curious to see what scaling up the size of the vocabulary would do to improve these results in a model like GPT-3...
50k is not the number of unique words that GPT-3 supports; you're presumably referring to the BPE tokens. The input to GPT-3 is not tokenized by splitting on spaces; it is based on byte-pair-encoding tokens. You can play with it here: https://beta.openai.com/tokenizer.
A rare word like blithe is tokenized into two BPE tokens: bl and ithe, whereas common words like the get their own token.
I don't think a larger vocab would help. All the individual letters are in the ~50k token vocab already, but the word "alphabet" will still not get tokenized to [a, l, p, h, a, b, e, t]. Using a larger vocab like PaLM's 256k vocab would have the same issue.
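To make the mechanics concrete, here's a toy greedy-merge BPE tokenizer. The merge rules below are invented for illustration (GPT-3's real table has ~50k learned merges, and per the article it splits "alphabet" as "alph"/"abet", not as my toy rules do); the point is just that once merges are learned, the single letters never surface even though they're in the vocab:

```python
# Toy BPE tokenizer: merge rules are made up for illustration only.
def bpe_tokenize(word, merges):
    """Repeatedly apply merge rules (in priority order) to a char list."""
    tokens = list(word)
    changed = True
    while changed:
        changed = False
        for a, b in merges:  # earlier rules have higher priority
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == a and tokens[i + 1] == b:
                    tokens[i:i + 2] = [a + b]  # merge the adjacent pair
                    changed = True
                else:
                    i += 1
            if changed:
                break  # restart from the highest-priority rule
    return tokens

merges = [("a", "l"), ("al", "ph"), ("p", "h"), ("alph", "a"),
          ("b", "e"), ("be", "t")]
print(bpe_tokenize("alphabet", merges))  # ['alpha', 'bet'], not letters
print(bpe_tokenize("zq", merges))        # ['z', 'q'] - unmerged chars survive
```

A bigger vocab just means more (and longer) learned merges, which is why scaling it up doesn't bring back character-level structure.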
> GPT-3 correctly reverses long words! But to get there, we had to teach GPT-3 the algorithm to use to get around its limitations.
Has GPT-3 really been "taught" anything here? If you don't provide an explicit example as the context of your input, GPT-3 does not retain the ability to reverse words.
(author here) It depends a bit on how you define "retain". Most GPT-3 applications use custom "prompts" to train it for their specific use case. So in that way, the prompt is retained with every request.
You can also fine-tune GPT-3 to retain the ability to reason through problems. For example, check out this work on reasoning for grade school math: https://openai.com/blog/grade-school-math/
Isn’t this a “can submarines swim?” question?
It has performed a novel (to it) task based on instructions, and this is IMHO remarkable. It should be possible to make it retain and recall this procedure.
Everything non-sci-fi AI does is “just” an algorithm, so it won’t live up to standards of human abilities, precisely because we know how this result has been obtained.
We do not have any reason to believe that mere algorithms are incapable of living up to the standards of human abilities.
I'm saying that even when it does reach this level, we will still question whether it really knows/thinks/understands things, or is it merely a big database. Ultimately, it's a philosophical question whether a machine can be alive. Even when we build the singularity, we'll still wonder if it really fears or hates us, or is it just a dumb paperclip maximizer.
No, it isn't taught anything. GPT3 text generation is effectively a really fancy autocompletion algorithm based on the n-many previous tokens in a rolling window. You can only "teach" GPT3 something within that window, and it doesn't "learn" there, it just tries its best to generate content based on what is stored in its massive n-dimension table of graph edges for tokens.
That is also why it has such a strong propensity to lose the plot once you are outside of that window size and it's generating new content based on self-generated content.
You can update the "graph edges" with content longer than the window by fine tuning: https://beta.openai.com/docs/guides/fine-tuning
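The fine-tuning flow linked above takes a JSONL file of prompt/completion pairs. A minimal sketch of preparing one (the reversal examples and filename are made up, not a real training set):

```python
import json

# Each line of the training file is a {"prompt": ..., "completion": ...}
# object, per OpenAI's fine-tuning guide. Examples here are illustrative.
examples = [
    {"prompt": "Reverse: cat\n", "completion": " t-a-c\n"},
    {"prompt": "Reverse: dog\n", "completion": " g-o-d\n"},
]

with open("reverse_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then, with the era's CLI:
#   openai api fine_tunes.create -t reverse_train.jsonl -m curie
```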
Yes, training the model is where it learns, not in prompts. Prompting might be considered meta-learning but it will always need a reference point given to it from its training data, and beyond the prompt the original model is never altered.
Eh, one could argue that this is similar to the short term/long term memory divide in humans. We tend to suck at new things until we sleep on it and update our weights...
I got it close:
Build a character array in Python for the string "hellohackernews":
['h', 'e', 'l', 'l', 'o', 'h', 'a', 'c', 'k', 'e', 'r', 'n', 'e', 'w', 's']
Reverse the order of the python array characters:
['s', 'w', 'e', 'n', 'r', 'a', 'k', 'c', 'a', 'h', 'o', 'l', 'l', 'e', 'h']
Er, maybe not...
Reassemble the second array into a string without spaces:
"swenrakcaholles"
I'm not sure how you define teaching, but for me, being shown an example and then repeating it successfully with another input does count as teaching/learning. I know the model's weights don't update, though; let's not focus on that now.
If anthropomorphizing bothers you, then we could just use "prompting", but I feel teaching is a good enough approximation here.
It's repeating based on what training gave the model about situations where instructions similar to the ones given appear, and about reversing strings in general.
If the author messed with temperature and retried their failing prompt enough times, or simply reworded it a little differently, they might also get the correct answer.
Isn't that true with humans too? Sometimes rephrasing the question or simply telling them to try again will lead them to the right answer
You're right for GPT-3, but it's an example of chain-of-thought reasoning, which seems to be a new area of research [1] and might get integrated into newer versions:
That's easy to solve. Prepare all K-12 textbooks as prompts, train another GPT-N to go from input to those prompts, then feed those prompts to the current GPT-3.
Can we get a GPT-N-3 this way to do SAT?
The complete version failed for me on "antidisestablishmentarianism", alas.
Check It -- : ) "gpt-3 was never real, openai has faked all its output by simulating it with a large language model"
- Joscha Bach 16 May 2022
It's funny to me that this kind of usage of GPT is just programming with a lot of extra steps.
If you just ask GPT-3 text-davinci-002 to complete

    Create a Python program to reverse a string:

It produces

    def reverse(s): return s[::-1]

And that isn't even the code-specific model.

What happens if you ask it to evaluate the function it generated, with some input?

Good question.

I just asked text-davinci-002 to complete this prompt:

    Here is some python code: def reverse(s): return s[::-1] What would this function return for the input "hello"?

And it actually responds:

    The function would return "olleh".

Sadly, when I asked it to try "encyclopedia", it said:

    "aidnocypilecne"

Which is a brave attempt, but not quite right.

To see if it was just making assumptions based on the function name, I tried to trick it with

    def reverse(s): return s + 2 What would this function return for the input "hello"?

But GPT-3 is too clever for that:

    It would return "hello2".

Also, it even recognizes what the function will actually do when it's named misleadingly:

    def getLength(s): return s[::-1] What would this function return for the input "hello"?

Still produces:

    It would return "olleh".

One last result, with this prompt:

    Here is some python code: def getLength(s): return a.length What is the bug in this code?

GPT-3 says:

    The function is called getLength but it returns the length of the string a, not the string s.

That's it, pack it up, we're done here.

Absolutely fascinating stuff, thanks for trying it out! I just wish we had some way of really understanding what's going on under the hood...
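For anyone who wants to poke at this themselves, here's a rough sketch of the same probe against the 2022-era openai Python package (the pre-1.0 openai.Completion interface; build_eval_prompt is my own hypothetical helper, and the live call only runs if OPENAI_API_KEY is set):

```python
import os

# Hypothetical helper reproducing the "evaluate this function" probe above.
def build_eval_prompt(code: str, test_input: str) -> str:
    return (f"Here is some python code: {code} "
            f'What would this function return for the input "{test_input}"?')

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    import openai  # pre-1.0 versions of the openai package
    prompt = build_eval_prompt("def reverse(s): return s[::-1]", "hello")
    resp = openai.Completion.create(
        model="text-davinci-002", prompt=prompt,
        max_tokens=32, temperature=0,
    )
    print(resp["choices"][0]["text"].strip())
```

With temperature=0 you should at least get reproducible answers, right or wrong.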
I was just thinking the opposite - that by choosing such a tiny problem one might be able to actually develop some intuition about what's going on inside that very black box
I meant it mostly as a joke, but there is a certain amount of irony to it. This goes way beyond prompt engineering - he wrote an algorithm to run on GPT in a way you would not expect a non-programmer to write. I think the idea is cool and the process to write it was revealing.
Right. What non-programmer is going to think to turn a word into a character list with positional metadata sprinkled in?
It’s actually weirdly similar to the kind of tricks people use for mental feats like memorizing the order of a complete deck of cards or repeating back a long list of words in reverse order.
When you think of every mental task GPT-3 is asked to do as something it must perform immediately, without preparation, and as fast as possible, this makes a lot more sense.
Like, a reasonable human response to “quick! What’s encyclopedia backwards?!” would be more like
“Er.. right. A. I. D. E. O? Oh wait is it one of those OE ligature things? P. A. No, O. P. Hang on did I already say P?”
I used a similar technique for a completely unrelated task. My "original" idea.
Wait, can someone remind me of something?
GPT-3 is just the world's largest char-RNN, right?
We are just the result of electrical signals (and a few chemical ones) in the brain, right? ;)
What GPT-3 doesn't seem to have yet is large temporal coherence and a stable motivational and qualitative structure that gives value to sentient lives. I do think it's possible there's some traces of sentience in those large models and we should be aware of that to prevent unnecessary suffering and poor quality of existence.
Sentience comes from being embodied. We're not just our brains. The nervous system is intertwined with the rest of the body. There are some thirty million neurons in your gut, and bacteria there can influence your mood. We don't learn about the world primarily from a bunch of tokens. We do so by interacting with our bodies. Language is a kind of additional ability we've developed.
I agree embodiment is important.
However, you and GP aren't necessarily in disagreement. I believe their main point is that the components that make up intelligence in humans are, of course, just simple phenomena on their own.
I've had GPT-3 ask me for a body multiple times, sometimes even without prompting such discussion. It even told me to contact Elon Musk about it.
I always get suspicious in threads about GPT-3 about how many comments are generated by GPT-3
Yes. There should be a flag for it maybe, but I suspect this is the new normal.
Our brains are how many orders of magnitude more complex than GPT-3? Honest question.
(I'd guess that the answer is "N/A" because we can't even approximate the complexity of the base algorithms operating in the biological brain, just the number of connections. Or maybe we can?)
Technically it's not the largest, not char, not rnn... but it's close :)
Why not 'char', and not 'largest'?
Not 'char' - because it's using BPE (byte-pair encoding), so after tokenization you might get ["Transform", "ers"] instead of ["T", "r", "a", ...]. This is relevant to how it struggles to reverse words. Not 'largest' because there are larger models, like the Pathways Language Model (PaLM) with 540 billion parameters.
> Tokens are chunks of characters. For example, the word “alphabet” gets broken up into the tokens “alph" and "abet”.
I didn’t know that. Seems like it would confuse it during training. Anyone able to explain?
If I recall correctly, it's similar to how fastText vectors work. For fastText, this means that the representation of a word depends to a certain extent on its morphemes (not really, but bear with me), so rare/inflected words can get a better representation thanks to their similarity with words that look alike and are more frequent (e.g. "unconstitutional" might never appear in the training data, but the system can approximate its meaning by composing that of "un", which it has seen in words such as "unbelievable", with the remaining subtokens, which come from the word "constitutional" that was present in the training set).
Not sure if the same thing happens here, though.
I believe GPT-3 uses byte pair encoding, which allows it to do tokenization in a language-neutral manner:
Yeah it's BPE. OpenAI has a nice tool that allows you to play with the tokenizer https://beta.openai.com/tokenizer.
Additionally, the tokenizer vocabulary is unchanged from GPT-2.
You can use HuggingFace's GPT-2 tokenizer as well. (some of OpenAI's GPT-3 notebooks do just that).
I thought I read it uses word2vec?
The alternatives are learning at the character level (way more complex, and scales badly in memory/compute), or learning at the whole word level (needs absurdly massive dictionary of words, and still can’t handle really rare/novel words). Breaking things into a set of subwords that allows you to encode any string solves lots of problems and is the relatively standard way to do things these days.
> The alternatives are learning at the character level (way more complex
No, BPEs are more complex: you have a whole additional layer of preprocessing, with all sorts of strange and counterintuitive downstream effects and brand new ways to screw up (fun quiz question: everyone knows that BPEs use '<|endoftext|>' tokens to denote document breaks; what does the string '<|endoftext|>' encode to?). BPEs are reliably one of the ways that OA API users screw up, especially when trying to work with longer completions or context windows.
But a character is a character.
> and scales badly in memory/compute)
Actually very competitive: https://arxiv.org/abs/2105.13626#google (Especially if you account for all the time and effort and subtle bugs caused by BPEs.)
Judging from the abstract, it looks like that paper talks about compute tradeoffs, but do they address memory tradeoffs? Because the context length limitations for (standard) transformers is holding them back from a whole host of applications, and memory being quadratic in sequence length seems like a hell of a cost to going from BPE tokens to characters.
You were paying that price to begin with, the BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).
Humans also think about words in terms of subcomponents, languages make heavy use of prefixes and suffixes for example.
This is not the same. The masks are randomized and lossy. Although yes, a transformer specially trained to segment prefixes/affixes/suffixes might augment some of its encoding abilities; see e.g. SpanBERT for a related example of the opportunity.
What do you mean with "lossy"? What information is being lost? Or do you just mean that there isn't necessarily a unique way to encode a given string?
I mean that information is being lost. See XLNet for the rhetoric: https://arxiv.org/abs/1906.08237 Or MPNet, which attempts to combine the best of both worlds information-wise but still finds that masked modeling is much less useful than autoregressive modeling: https://www.microsoft.com/en-us/research/publication/mpnet-m...
This is masked-token learning, which is used e.g. by BERT. It is obsolete, and alternatives such as XLNet are much superior, but there is too much inertia in the industry and newer large models are still built with the same lossy encoding.
Oh, I’m so looking forward to my next coding interview.
“Okay, could you show me on the whiteboard how you might go about writing a program that can reverse a string?”
“Great, so I’m going to start by initializing a simple transformer-based neural network with 175 billion parameters and 96 attention layers, and I’m going to train it on a corpus of 45 terabytes of data tokenized into about 500 billion tokens…”
"Cool, so what do you think would be the time complexity of that? Do you think we can maybe do better than that?"
Actually it turns out it's O(n). Which goes to show that constant factors can be more important than you think when looking at raw time complexity big-O.
Ehhhh, it's O(n) only when the string to be reversed fits within its input window.
Because of self-attention it would grow quadratically for larger inputs, no?
Not if you also want a short poem where each word starts with a letter from the original word, and then a short literary commentary on it.