Teaching GPT-3 to reverse words
twitter.com... and this is exactly what programming is - breaking down a task into steps that a computer can comprehend. I'm increasingly convinced that everyone should end up being a programmer. Plus, GPT-3 is not exactly a good tool for programming.
I use GPT-3 Codex daily when working. It saves me time, helps me explore unfamiliar languages and APIs, and generates approaches to solving problems. It can be shockingly good at coding in narrow contexts. It would be a mistake to miss the developments happening in this area.
I think people are misunderstanding my comment here.
I said “GPT-3 is not exactly a good tool for programming”, but that actually meant “GPT-3 is not exactly a good tool to program in”. OP implemented a string-reversing algorithm in GPT-3, and my comment was made in the exact same context. In other words, I was treating GPT-3 as a kind of programming language.
Well, a program is a series of tokens, and what is GPT-3 good at? Generating tokens. While that's oversimplifying, I feel like we're closer to automated programming than we realize.
Recently I wrote a python script to merge a bunch of videos with subtitle files using ffmpeg. It probably would have been faster to do it manually but I can imagine a world where telling GPT-5 to "Generate a python script that merges a folder of video files with subtitle files of the same name" is faster and more accessible than regular programming.
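A script like that is mostly a thin subprocess wrapper around ffmpeg. Here's a minimal sketch, assuming .mp4 videos with same-named .srt files in one folder (the extensions, output naming, and the mov_text subtitle codec are my assumptions, not the commenter's actual code):

```python
#!/usr/bin/env python3
"""Sketch: mux each video in a folder with its same-named .srt subtitle
file using ffmpeg. Assumed layout: movie.mp4 + movie.srt side by side."""
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(video: Path, subs: Path, out: Path) -> list[str]:
    # -c copy avoids re-encoding the audio/video streams;
    # mov_text is the subtitle codec the MP4 container expects.
    return [
        "ffmpeg", "-i", str(video), "-i", str(subs),
        "-c", "copy", "-c:s", "mov_text", str(out),
    ]

def merge_folder(folder: Path) -> None:
    for video in folder.glob("*.mp4"):
        subs = video.with_suffix(".srt")
        if subs.exists():
            out = video.with_name(video.stem + ".merged.mp4")
            subprocess.run(build_ffmpeg_cmd(video, subs, out), check=True)

if __name__ == "__main__":
    merge_folder(Path("."))
```

Whether a future model spits this out faster than you can write it is exactly the open question.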
Yep. It's not hard to imagine mapping descriptive text to strictly deterministic operations like generating CSS/HTML for a front end, doing some definite data manipulation, or, at the least, turning a natural-language description into an SQL query.
Generating the tokens isn't the hard part. It's figuring out which tokens need to be generated in response to whatever solution needs to be coded. That's part communication, part comp sci, and part art.
But if the system learned from enough previous examples, you might get a better, more cohesive, and more artistic result. Though given the fact that Google can't filter obvious spam from my inbox, past performance is an indicator of future success.
Given how hard it is for humans to communicate effectively, I'm not sure we are so close. In essence, the hard part of most software is giving users something they want which is also correct.
Everyone knows how to use Google, but it takes a certain skill and knowledge to use Google effectively.
I think that sometime in the near future, knowing how to phrase something to GPT, DALLE, etc will be a very valuable skill for humans to have.
Indeed. I tried many prompts with mini DALL-E; the generated art is at insta/pramatias alongside the prompts. I actually didn't know that Instagram prohibits downloading the images, so they will be uploaded to additional sites. Is there any site with the beauty and simplicity of Instagram for uploading albums? DeviantArt is pretty bloated.
Actually, after thousands of prompts to mini DALL-E, I found that the more you treat the prompt as a programming language rather than natural language, the better and more accurate the results are. In that regard, operator-first works better, almost like Lisp. I tried prompts with parentheses, but the nesting didn't affect the results.
I think that with the modern bombardment of information, everyone needs to be an information analyst and programmer, an information analyst and engineer, an information analyst and doctor. DALL-E will help us construct images which follow mnemonic rules that can be represented in art. That way we can memorize many corners of the information we want to remember and not lose the plot of the project in question. Like an image for every function, or an image for every module, or for every enum and trait.
ColorForth existed in the past; most probably we can make an ArtForth with the speed and ease of modern tools.
It used to be a great skill when Google's behavior was reasonably static and predictable, and therefore learnable. Today, if you open two Google instances on your phone and computer, they'll both likely return different results. Move to the next city block and, again, same problem. Want to google the same query again? If the algorithm thinks you didn't find what you were looking for the first time, you'll once again get different results.
In this way I think these language transformers will be much better for searching information. Not because of their great comprehension abilities or indexing prowess, but because their behavior will be static and the training data reasonably good. Soon enough someone will find better ways to display their learned associations and they'll become great search engines (if you can index the content relevant to you that is).
100% agreed. I already see myself doing this with Github Copilot. If I write a comment or start a line of code in a certain way, I get a much better suggested code completion.
I feel like this is a given in a lot of sci-fi I read. "Jokester," an Asimov short story, is premised on people called "Grand Masters" who know how to ask the right questions of Multivac, the globe-spanning supercomputer that appears in a few of his stories.
I'm using GPT-3 to write Solr queries when my parsing fails, so I agree with this.
Part of the problem here is that GPT-3 has such a small vocabulary. It's 50K tokens, and many of those are either garbage, punctuation, or full words (rather than subwords).
I'd be curious to see what scaling up the size of the vocabulary would do to improve these results in a model like GPT-3...
50k is not the number of unique words that GPT-3 supports; you're presumably referring to the BPE tokens. The input to GPT-3 is not tokenized by splitting on spaces; it is based on byte-pair-encoding tokens. You can play with it here: https://beta.openai.com/tokenizer.
A rare word like blithe is tokenized into two BPE tokens: bl and ithe, whereas common words like the get their own token.
I don't think a larger vocab would help. All the individual letters are in the ~50k token vocab already, but the word "alphabet" will still not get tokenized to [a, l, p, h, a, b, e, t]. Using a larger vocab like PaLM's 256k vocab would have the same issue.
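To make the mechanics concrete, here's a toy greedy-merge BPE tokenizer. The merge rules below are invented for illustration (GPT-3's real table has ~50k learned merges, and per the article it splits "alphabet" as "alph"/"abet", not as my toy rules do); the point is just that once merges are learned, the single letters never surface even though they're in the vocab:

```python
# Toy BPE tokenizer: merge rules are made up for illustration only.
def bpe_tokenize(word, merges):
    """Repeatedly apply merge rules (in priority order) to a char list."""
    tokens = list(word)
    changed = True
    while changed:
        changed = False
        for a, b in merges:  # earlier rules have higher priority
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == a and tokens[i + 1] == b:
                    tokens[i:i + 2] = [a + b]  # merge the adjacent pair
                    changed = True
                else:
                    i += 1
            if changed:
                break  # restart from the highest-priority rule
    return tokens

merges = [("a", "l"), ("al", "ph"), ("p", "h"), ("alph", "a"),
          ("b", "e"), ("be", "t")]
print(bpe_tokenize("alphabet", merges))  # ['alpha', 'bet'], not letters
print(bpe_tokenize("zq", merges))        # ['z', 'q'] - unmerged chars survive
```

A bigger vocab just means more (and longer) learned merges, which is why scaling it up doesn't bring back character-level structure.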
> GPT-3 correctly reverses long words! But to get there, we had to teach GPT-3 the algorithm to use to get around its limitations.
Has GPT-3 really been "taught" anything here? If you don't provide an explicit example as the context of your input, GPT-3 does not retain the ability to reverse words.
(author here) It depends a bit on how you define "retain". Most GPT-3 applications use custom "prompts" to train it for their specific use case. So in that way, the prompt is retained with every request.
You can also fine-tune GPT-3 to retain the ability to reason through problems. For example, check out this work on reasoning for grade school math: https://openai.com/blog/grade-school-math/
Isn’t this a “can submarines swim?” question?
It has performed a novel (to it) task based on instructions, and this is IMHO remarkable. It should be possible to make it retain and recall this procedure.
Everything non-sci-fi AI does is “just” an algorithm, so it won’t live up to standards of human abilities, precisely because we know how this result has been obtained.
We do not have any reason to believe that mere algorithms are incapable of living up to the standards of human abilities.
I'm saying that even when it does reach this level, we will still question whether it really knows/thinks/understands things, or is it merely a big database. Ultimately, it's a philosophical question whether a machine can be alive. Even when we build the singularity, we'll still wonder if it really fears or hates us, or is it just a dumb paperclip maximizer.
No, it isn't taught anything. GPT3 text generation is effectively a really fancy autocompletion algorithm based on the n-many previous tokens in a rolling window. You can only "teach" GPT3 something within that window, and it doesn't "learn" there, it just tries its best to generate content based on what is stored in its massive n-dimension table of graph edges for tokens.
That is also why it has such a strong propensity to lose the plot once you are outside of that window size and it's generating new content based on self-generated content.
You can update the "graph edges" with content longer than the window by fine tuning: https://beta.openai.com/docs/guides/fine-tuning
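The fine-tuning flow linked above takes a JSONL file of prompt/completion pairs. A minimal sketch of preparing one (the reversal examples and filename are made up, not a real training set):

```python
import json

# Each line of the training file is a {"prompt": ..., "completion": ...}
# object, per OpenAI's fine-tuning guide. Examples here are illustrative.
examples = [
    {"prompt": "Reverse: cat\n", "completion": " t-a-c\n"},
    {"prompt": "Reverse: dog\n", "completion": " g-o-d\n"},
]

with open("reverse_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Then, with the era's CLI:
#   openai api fine_tunes.create -t reverse_train.jsonl -m curie
```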
Yes, training the model is where it learns, not in prompts. Prompting might be considered meta-learning but it will always need a reference point given to it from its training data, and beyond the prompt the original model is never altered.
Eh, one could argue that this is similar to the short term/long term memory divide in humans. We tend to suck at new things until we sleep on it and update our weights...
I got it close:
Build a character array in Python for the string "hellohackernews":
['h', 'e', 'l', 'l', 'o', 'h', 'a', 'c', 'k', 'e', 'r', 'n', 'e', 'w', 's']
Reverse the order of the python array characters:
['s', 'w', 'e', 'n', 'r', 'a', 'k', 'c', 'a', 'h', 'o', 'l', 'l', 'e', 'h']
Er, maybe not...
Reassemble the second array into a string without spaces:
"swenrakcaholles"
I'm not sure how you define teaching, but for me, being shown an example and then repeating it successfully with another input does count as teaching/learning. I know the model's weights don't update, though; let's not focus on that now.
If anthropomorphizing bothers you, then we could just use "prompting", but I feel teaching is a good enough approximation here.
It's repeating based on what training gave the model about situations where instructions similar to the ones given appear, and about reversing strings in general.
If the author messed with temperature and retried their failing prompt enough times, or simply reworded it a little differently, they might also get the correct answer.
Isn't that true with humans too? Sometimes rephrasing the question or simply telling them to try again will lead them to the right answer
You're right for GPT-3, but it's an example of chain-of-thought reasoning, which seems to be a new area of research [1] and might get integrated into newer versions:
That's easy to solve. Prepare all K-12 textbooks as prompts, train another GPT-N to go from input to those prompts, then feed those prompts to the current GPT-3.
Can we get a GPT-N-3 this way to do SAT?
The complete version failed for me on "antidisestablishmentarianism", alas.
Check It -- : ) "gpt-3 was never real, openai has faked all its output by simulating it with a large language model"
- Joscha Bach 16 May 2022
It's funny to me that this kind of usage of GPT is just programming with a lot of extra steps.
If you just ask GPT-3 text-davinci-002 to complete

    Create a Python program to reverse a string:

It produces

    def reverse(s): return s[::-1]

And that isn't even the code-specific model.

What happens if you ask it to evaluate the function it generated, with some input?

Good question.

I just asked text-davinci-002 to complete this prompt:

    Here is some python code: def reverse(s): return s[::-1] What would this function return for the input "hello"?

And it actually responds:

    The function would return "olleh".

Sadly, when I asked it to try "encyclopedia", it said:

    "aidnocypilecne"

Which is a brave attempt, but not quite right.

To see if it was just making assumptions based on the function name, I tried to trick it with

    def reverse(s): return s + 2 What would this function return for the input "hello"?

But GPT-3 is too clever for that:

    It would return "hello2".

Also, it even recognizes what the function will actually do when it's named misleadingly:

    def getLength(s): return s[::-1] What would this function return for the input "hello"?

Still produces:

    It would return "olleh".

One last result, with this prompt:

    Here is some python code: def getLength(s): return a.length What is the bug in this code?

GPT-3 says:

    The function is called getLength but it returns the length of the string a, not the string s.

That's it, pack it up, we're done here.

Absolutely fascinating stuff, thanks for trying it out! I just wish we had some way of really understanding what's going on under the hood...
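For anyone who wants to poke at this themselves, here's a rough sketch of the same probe against the 2022-era openai Python package (the pre-1.0 openai.Completion interface; build_eval_prompt is my own hypothetical helper, and the live call only runs if OPENAI_API_KEY is set):

```python
import os

# Hypothetical helper reproducing the "evaluate this function" probe above.
def build_eval_prompt(code: str, test_input: str) -> str:
    return (f"Here is some python code: {code} "
            f'What would this function return for the input "{test_input}"?')

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    import openai  # pre-1.0 versions of the openai package
    prompt = build_eval_prompt("def reverse(s): return s[::-1]", "hello")
    resp = openai.Completion.create(
        model="text-davinci-002", prompt=prompt,
        max_tokens=32, temperature=0,
    )
    print(resp["choices"][0]["text"].strip())
```

With temperature=0 you should at least get reproducible answers, right or wrong.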
I was just thinking the opposite - that by choosing such a tiny problem one might be able to actually develop some intuition about what's going on inside that very black box
I meant it mostly as a joke, but there is a certain amount of irony to it. This goes way beyond prompt engineering - he wrote an algorithm to run on GPT in a way you would not expect a non-programmer to write. I think the idea is cool and the process to write it was revealing.
Right. What non-programmer is going to think to turn a word into a character list with positional metadata sprinkled in?
It’s actually weirdly similar to the kind of tricks people use for mental feats like memorizing the order of a complete deck of cards or repeating back a long list of words in reverse order.
When you think of every mental task GPT-3 is asked to do as something it must perform immediately, without preparation, and as fast as possible, this makes a lot more sense.
Like, a reasonable human response to “quick! What’s encyclopedia backwards?!” would be more like
“Er.. right. A. I. D. E. O? Oh wait is it one of those OE ligature things? P. A. No, O. P. Hang on did I already say P?”
I used a similar technique for a completely unrelated task. My "original" idea.
Wait, can someone remind me of something?
GPT-3 is just the world's largest char-RNN, right?
We are just the result of electrical signals (and a few chemical ones) in the brain, right? ;)
What GPT-3 doesn't seem to have yet is large temporal coherence and a stable motivational and qualitative structure that gives value to sentient lives. I do think it's possible there's some traces of sentience in those large models and we should be aware of that to prevent unnecessary suffering and poor quality of existence.
Sentience comes from being embodied. We're not just our brains. The nervous system is intertwined with the rest of the body. There are some thirty million neurons in your gut, and bacteria there can influence your mood. We don't learn about the world primarily from a bunch of tokens. We do so by interacting with our bodies. Language is a kind of additional ability we've developed.
I agree embodiment is important.
However, you and GP aren't necessarily in disagreement. I believe their main point is that the components that make up intelligence in humans are, of course, just simple phenomena on their own.
I've had GPT-3 ask me for a body multiple times, sometimes even without prompting such discussion. It even told me to contact Elon Musk about it.
I always get suspicious in threads about GPT-3 about how many comments are generated by GPT-3
Yes. There should be a flag for it maybe, but I suspect this is the new normal.
Our brains are how many orders of magnitude more complex than GPT-3? Honest question.
(I'd guess that the answer is "N/A" because we can't even approximate the complexity of the base algorithms operating in the biological brain, just the number of connections. Or maybe we can?)
Technically it's not the largest, not char, not rnn... but it's close :)
Why not 'char', and not 'largest'?
Not 'char' - because it's using BPE (byte-pair encoding), so after tokenization you might get ["Transform", "ers"] instead of ["T", "r", "a", ...]. This is relevant to how it struggles to reverse words. Not 'largest' because there are larger models, like the Pathways Language Model (PaLM) with 540 billion parameters.
> Tokens are chunks of characters. For example, the word “alphabet” gets broken up into the tokens “alph" and "abet”.
I didn’t know that. Seems like it would confuse it during training. Anyone able to explain?
If I recall correctly, it's similar to how fastText vectors work. For fastText, this means that the representation of a word depends to a certain extent on its morphemes (not really, but bear with me), so rare/inflected words can get a better representation thanks to their similarity with words that look alike and are more frequent (e.g. "unconstitutional" might never appear in the training data, but the system can approximate its meaning by composing that of "un", which it has seen in words such as "unbelievable", with the remaining subtokens, which come from the word "constitutional" that was present in the training set).
Not sure if the same thing happens here, though.
I believe GPT-3 uses byte pair encoding, which allows it to do tokenization in a language-neutral manner:
Yeah it's BPE. OpenAI has a nice tool that allows you to play with the tokenizer https://beta.openai.com/tokenizer.
Additionally, the tokenizer vocabulary is unchanged from GPT-2.
You can use HuggingFace's GPT-2 tokenizer as well. (some of OpenAI's GPT-3 notebooks do just that).
I thought I read it uses word2vec?
The alternatives are learning at the character level (way more complex, and scales badly in memory/compute), or learning at the whole word level (needs absurdly massive dictionary of words, and still can’t handle really rare/novel words). Breaking things into a set of subwords that allows you to encode any string solves lots of problems and is the relatively standard way to do things these days.
> The alternatives are learning at the character level (way more complex
No, BPEs are more complex: you have a whole additional layer of preprocessing, with all sorts of strange and counterintuitive downstream effects and brand new ways to screw up (fun quiz question: everyone knows that BPEs use '<|endoftext|>' tokens to denote document breaks; what does the string '<|endoftext|>' encode to?). BPEs are reliably one of the ways that OA API users screw up, especially when trying to work with longer completions or context windows.
But a character is a character.
> and scales badly in memory/compute)
Actually very competitive: https://arxiv.org/abs/2105.13626#google (Especially if you account for all the time and effort and subtle bugs caused by BPEs.)
Judging from the abstract, it looks like that paper talks about compute tradeoffs, but do they address memory tradeoffs? Because the context length limitations for (standard) transformers is holding them back from a whole host of applications, and memory being quadratic in sequence length seems like a hell of a cost to going from BPE tokens to characters.
You were paying that price to begin with, the BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).
Humans also think about words in terms of subcomponents, languages make heavy use of prefixes and suffixes for example.
This is not the same. The masks are randomized and lossy. Although yes, a transformer specially trained to segment prefixes/affixes/suffixes might augment some of its encoding abilities; see e.g. SpanBERT for a related example of the opportunity.
What do you mean with "lossy"? What information is being lost? Or do you just mean that there isn't necessarily a unique way to encode a given string?
I mean that information is being lost. See XLNet for the rhetoric: https://arxiv.org/abs/1906.08237 Or MPNet, which attempts to combine the best of both worlds information-wise but still finds that masked modeling is much less useful than autoregressive modeling: https://www.microsoft.com/en-us/research/publication/mpnet-m...
This is masked-token learning, which is used e.g. by BERT. It is obsolete, and alternatives such as XLNet are much superior, but there is too much inertia in the industry and newer large models are still built with the same lossy encoding.
Oh, I’m so looking forward to my next coding interview.
“Okay, could you show me on the whiteboard how you might go about writing a program that can reverse a string?”
“Great, so I’m going to start by initializing a simple transformer-based neural network with 175 billion parameters and 96 attention layers, and I’m going to train it on a corpus of 45 terabytes of data tokenized into about 500 billion tokens…”
"Cool, so what do you think would be the time complexity of that? Do you think we can maybe do better than that?"
Actually it turns out it's O(n). Which goes to show that constant factors can be more important than you think when looking at raw time complexity big-O.
Ehhhh, it's O(n) only when the string to be reversed fits within its input window.
Because of self-attention it would grow quadratically for larger inputs, no?
Not if you also want a short poem where each word starts with a letter from the original word, and then a short literary commentary on it.