OpenAI Codex (openai.com)

The "language models don't really understand anything" corner is getting smaller and smaller. In the last few months we've seen pretty definitive evidence that transformers can recombine concepts ([1], [2]) and do simple logical inference using contextual information ([3], "make the score font color visible"). I see no reason that this technology couldn't smoothly scale into human-level intelligence, yet lots of people seem to think it'll require a step change or is impossible.
That being said, robust systematic generalization is still a hard problem. But "achieve symbol grounding through tons of multimodal data" is looking more and more like the answer.
[1] https://openai.com/blog/dall-e/ [2] https://distill.pub/2021/multimodal-neurons/ [3] https://openai.com/blog/openai-codex/
> The "language models don't really understand anything" corner is getting smaller and smaller.
In my mind, understanding a thing means you can justify an answer. Like a student showing their work and being able to defend it. An answer with a proof understands the answer with respect to the proof it provides. E.g. to understand an answer with regards to first order logic, it'll have to be able to defend a logical deduction of that answer.
These models still can't justify their answers very well, so I'd say they're accurate but only understand with respect to a fairly dumb proof system (e.g. they can select relevant passages or just appeal to overall accuracy statistics). They're still far from being able to justify answers in the various ways we do, which I'd say means that by definition that they still don't understand with regards to the "proof systems" that we understand things with regards to.
Maybe the next step will require increasingly interesting justification systems.
> In my mind, understanding a thing means you can justify an answer.
Do you understand cats? If I show you a picture of either a cat or a dog do you think you can tell which one it is? I think most people could solve that challenge, and if pressed they could wax poetic about what makes them think it is a cat. Maybe they would mention the shape of an ear, or talk about feline grace or what have you. But is that really a “justification”? Let alone one they can “defend”? How would “defending” even work in this situation?
You could probably teach an AI to post-hoc rationalize their decisions, the same way people do.
You absolutely could, and it could even end up just as accurate as human post-hoc rationalization. ;)
Self-analysis and self-interpretation is pretty clearly a key part of consciousness... I do wonder - how important to the process is the actual fidelity of the interpretation? Those people you meet who think they have deep insight into their own psyche while clearly having no clue... maybe they're p-zombies. ;)
That’s basically the gist of explainable AI
The point I’m trying to make (poorly) is that i don’t think a one size fits all definition of “understanding” is useful. It’s more useful to define understanding with respect to how you can justify a thing you know.
So for the case of cats, I will understand cats at a different level from a cat biologist. I can point to features that seem catlike, and they can talk about all sorts of other scientific things that make a cat a cat.
With respect to that sciency kind of understanding, I don’t understand cats. With respect to a much looser ‘point at the features’ kind of understanding, I do understand cats.
Take entomologists, bird watchers or those who identify mushrooms. In each, there are instances that are fiendishly difficult to tell apart. If you ask an expert for advice, they'll tell you what features to look for and where, sometimes not even on the item itself, and some requiring specialist tools.
While explanations are far from sufficient to instantly close the gap to expertise, they provide a massive boost that you might not otherwise have found on your own. The justification comes from the fact that their explanations are a reliable source of increased performance in making fine-grained distinctions. It's further demonstrated by answers to questions like "If they are so difficult to tell apart, why make these distinctions?" or "How did they come to be so similar?".
> In my mind, understanding a thing means you can justify an answer.
What if the language model can generate a step-by-step explanation in the form of text? [0]
There's no guarantee that the reasoning was used to come up with the answer in the first place, and no proof that the reasoning isn't just the product of "a really fancy markov chain generator", but would you accept it?
We're really walking into Searle's Chinese Room at this point.
Umm, no, there are clear verification methods for Explainable AI techniques today. One way to check a justification would be to remove, in some sense, the things that were important in the justification and see whether the output changes significantly. Sort of like a sensitivity test for the justification.
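A minimal sketch of that kind of sensitivity check (the model function and the word-level justification here are hypothetical placeholders, just to show the shape of the test):

    def sensitivity_check(model, text, important_words):
        """Ablate the words a justification points to and see if the output moves."""
        original = model(text)
        ablated_text = " ".join(w for w in text.split() if w not in important_words)
        ablated = model(ablated_text)
        # If removing the "important" words changes nothing, the justification
        # probably wasn't load-bearing.
        return original != ablated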
Searle's Chinese Room is exactly why I started thinking of understanding this way. It convinced me that a one-size-fits-all notion of understanding isn't useful. But it also made me think that understanding "with respect to system X" is useful.
If you can challenge an answer and get justification expressed in the form of X, then it understands with respect to X. A step-by-step text explanation is one form of X.
> ... but would you accept it?
This is all to sidestep questions of whether you accept X as "real" understanding or not. :D
>There's no guarantee that the reasoning was used to come up with the answer in the first place, and no proof that the reasoning isn't just the product of....
You're holding machines to a higher standard than we hold people.
Look at the "math test" video.
Given the question: "Jane has 9 balloons. 6 are green and the rest are blue. How many balloons are blue?" The model outputs: "jane_balloons = 9; green_balloons = 6; blue_balloons = jane_balloons - green_balloons; print(blue_balloons)"
That seems like a good justification of a (very simple) step-by-step reasoning process!
I wonder what it would have output if we removed the “ and the rest are blue” part from the question.
Would not surprise me if an inattentive human student answered that with the same code. After all, school “trains” people to expect such challenges to be solvable. A more attentive human might say “we can’t know” or provide an upper limit on the number of potential blue balloons.
Related article: Teaching GPT-3 to Identify Nonsense
Chances are high that something similar was in the training set, and the model approximated it.
You are very likely right. The question is how far the approximation can generalise. One way to test that would be to quiz the model with slightly varied prompts. Any human who can “solve” this word problem should be reasonably expected to solve the same problem if we change the subject’s name (from Jane to Bob, or Sanj, or even to Xcfg), or the name of the object (from balloon to token, or even to embobler), or the attributes used to segment them (from green/blue to heavy/light, for example).
Or we can try to rewrite the challenge sentences with different wording. As long as the new sentences convey the same problem, you would expect a system that can “understand” them to generate the same or a similar solution.
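A rough sketch of that kind of variation test (ask_model is a hypothetical query function, and the surface details below are just examples, not anything the model has actually been quizzed on):

    from itertools import product

    TEMPLATE = ("{name} has {total} {thing}s. {first} are {a} and the rest are {b}. "
                "How many {thing}s are {b}?")

    names = ["Jane", "Bob", "Sanj", "Xcfg"]
    things = ["balloon", "token", "embobler"]
    attrs = [("green", "blue"), ("heavy", "light")]

    def consistent(ask_model):
        answers = set()
        for name, thing, (a, b) in product(names, things, attrs):
            prompt = TEMPLATE.format(name=name, total=9, thing=thing, first=6, a=a, b=b)
            answers.add(ask_model(prompt))
        return answers == {3}   # every surface variation should still come out to 9 - 6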
Curiously, this kind of thought experiment also shows a weakness of the Turing test as originally formulated. A machine correctly solving these word-puzzle variations could “prove” that it “understands” the sentences, but it would also reveal that it is not a human, since I would expect a real human to protest against the inanity of the challenges quite fast. ;)
This goes for humans too. Ultimately, "something similar was in the training set" is semantically indistinguishable from "having a rich generalizable conceptual toolbox".
Except I could do that with a few regex substitutions, which would not be reasoning. The “intelligence” is in the templates provided by the training data. (Extracting that is impressive, but not that impressive.)
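For example, a sketch of the regex-template approach for this particular problem shape (purely illustrative, and obviously brittle):

    import re

    PATTERN = re.compile(r"(\w+) has (\d+) (\w+)s\. (\d+) are (\w+) and the rest are (\w+)\.")

    def template_answer(question):
        # "Solve" the balloon-style problem with one regex and no reasoning at all.
        m = PATTERN.search(question)
        if not m:
            return None
        total, first = int(m.group(2)), int(m.group(4))
        return total - first   # all the "intelligence" lives in the template

    print(template_answer(
        "Jane has 9 balloons. 6 are green and the rest are blue. How many balloons are blue?"
    ))  # 3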
>In my mind, understanding a thing means you can justify an answer.
Sure, but how does that work with superhuman AI? Consider some kind of math bot that proves theorems about formal systems which are just flat out too large to fit into human working memory. Even if it could explain its answers, there would just be too many moving parts to keep in your head at once.
We already see something like this in quant funds. The stock trading robot finds a price signal and trades on it. You can look at it, but it's nonsensical: if rainfall in the Amazon basin is above this amount, and the cobalt price is below this amount, then buy municipal bonds in Topeka. The price signal is durable and causal. If you could hold the entire global economy in your head, you could see the chain of actions that produces the effect, but your brain isn't that big.
Or you just take it on faith. Why do bond prices in Topeka go up, but not in Wichita? "It just does." Okay, then what was the point of the explanation? A machine can't justify something you physically don't have enough neurons to comprehend.
It's not about us being able to interpret answer or justification, but the reasoner's ability to justify. If a superhuman AI can justify its answers in terms of first order logic, for example, it could be defined as understanding the answers with respect to FOL. Whether we as humans are able to check whether this specific bot in fact meets that definition is a separate empirical question.
If that quant algo you mentioned just says "it'll go up tomorrow" that's different than "it'll go up tomorrow" with an attached "it's positively correlated with Y, which is up today" which is different from a full causal DAG model of the world attached, which is again different from those same things expressible in english. But again, those are definitions, which are separate from our ability to check whether they're met.
Luckily, we're not in the realm of bots spitting out unfeasible to check proofs, except for a few niche areas like theorem proving (e.g. four color theorem). For language models like in the article, the best I'm aware of is finding relevant passages to an answer and classifying entailments.
> A machine can't justify something you physically don't have enough neurons to comprehend.
We can't always verify its justification, but it either can or can't justify an answer with respect to a given justification system.
Also, you should note that the memory and capabilities required to reach a conclusion might be much greater than those required to show it's true. Showing a needle may be easy; finding it in the haystack very hard. In this sense the hope for explainability is expanded. But still, the real world is really messy and "the full explanation" may be too large -- like when you explain a human intuition, the "full explanation" might have been your entire brain, your entire set of experiences up to that point; yet we can give partial explanations that should be satisfactory.
I have a hypothesis that reasoning inevitably needs to 'funnel' through explicit, logical representations (like we do with mathematics, language, etc.) to occur effectively. Or at least that (quasi-)formalization is an important element of reasoning. This formal subset can be communicated.
> Even if it could explain its answers, there would just be too many moving parts to keep in your head at once.
While this is possible in practice, consider the (universal) Turing machine principle: in principle, you can simulate any system given enough memory; we may not have it in our brains, but we have pen and paper, or simply a digital text scratchpad (both of which we use extensively in our lives).
We build another system we fully understand that can process the justification and see if it is correct/makes sense.
What about GPT-f? It's a language model that proved theorems in the metamath formal system.
I'd definitely say it understands those theorems with respect to the metamath formal system then. The next question is what it understands the proofs with respect to.
> Maybe the next step will require increasingly interesting justification systems.
You can just ask it to comment what it intends to do. It's surprising actually.
I found it on Stack Overflow!
> The "language models don't really understand anything"
This is still true. By all accounts, a human doesn't need to read 159 GB of Python code to write Python -- in fact, we simply can't.
But it doesn't necessarily indicate language models aren't useful.
Considering the sum total of data and computation that goes into creating an intelligent human mind, including the forces of natural selection in creating our innate structure and dispositions, it's not obvious that any conclusions can be drawn from the fact that so much data and compute goes into training these models.
Has this transfer of knowledge from one domain to another really been demonstrated by these models/learning processes? I know transfer learning is a thing (I have a couple books on my shelf on it). But it seems far from what you are describing.
The AlphaZero algorithm swapped between board games pretty easily. OpenAI could also have been gesturing at this when they named the GPT paper "Language Models are Few-Shot Learners".
DALL-E + CLIP models show a deep understanding of the relation between images and text.
they mention in the demo video that the inspiration for codex came from GPT-3 users training it to respond to queries with code samples. I saw some pretty impressive demos of the original model creating SQL queries from plain questions. I'm not sure if that counts as switching domains, but it's something?
The problem with this (very popular) argument is that you can't give a CS course to a baby and expect them to get at programming.
By the time we see our first line of code, most of us have seen a ridiculous amount of data. We've been trained in problem solving, logical reasoning, maths, natural language processing, ... Hell, we've been trained as pattern matchers since we've been born.
By my account, humans actually need a large amount of training data. It might be the knowledge federation and generalisation that we're good at, but I don't think we're a clear winner in data efficiency.
Taking 11Mbps [1] as the raw uncompressed incoming data, and assuming 16 hours of waking environment consumption on average (likely high for children), a 13yo has taken in less than 400 TB of information (I used 11 * 60 * 60 * 16 * 365 * 13 / 8.) That's... surprisingly low.
[1] https://www.britannica.com/science/information-theory/Physio...
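For what it's worth, spelling out that arithmetic (same assumptions as above: 11 Mbps of raw visual input, 16 waking hours a day, 13 years):

    mbps = 11                      # estimated raw visual data rate, megabits per second
    seconds_awake = 16 * 60 * 60   # 16 waking hours per day
    days = 365 * 13                # 13 years

    terabytes = mbps * seconds_awake * days / 8 / 1_000_000   # Mbit -> MB -> TB
    print(f"~{terabytes:.0f} TB")  # roughly 376 TB, i.e. under 400 TB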
Are we still limiting to visual cues and not the auditory, smell, taste, touch data which we get exposed to?
Visual input is so dense it's basically not worth tracking the other senses from a data rate pov.
I would argue humans ingest a lot more than 159GB before they can write code. Most of it isn't Python, and humans currently transfer knowledge a lot more efficiently than NNs, but I suspect that'll change as incorporating more varied data sources becomes feasible.
We generalize pretty well. One could say: "it took you 20 years to learn python!", but actually I learned python, Java, c#... Software engineering, machine learning... How to play guitar, how to cook.. .How to speak Portuguese, how to speak English... And thousands and thousands of different things which build on each other.
You can give a programmer a few kb of code in a new language and that will give him a small grasp of how it works.
I have to disagree with you here. In the Codex paper[1], they have two datasets that Codex got correct about 3% of the time. These are interview and code competition questions. From the paper:
"Indeed, a strong student who completes an introductory computer science course is expected to be able to solve a larger fraction of problems than Codex-12B."
This suggests to me that Codex really doesn't understand anything about the language beyond syntax. I have no doubt that future systems will improve on this benchmark, but they will likely take advantage of the AST and could use unit tests in an RL-like reward function.
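To sketch what I mean by using unit tests as an RL-like reward (my own illustration, not anything from the paper; the candidate code and tests are hypothetical, and exec is only safe on sandboxed/trusted code):

    import ast

    def reward(source, tests):
        try:
            ast.parse(source)            # reject anything that isn't legal Python
        except SyntaxError:
            return 0.0
        scope = {}
        try:
            exec(source, scope)          # caution: only for trusted/sandboxed code
        except Exception:
            return 0.0
        passed = sum(1 for test in tests if test(scope))
        return passed / len(tests)       # fraction of unit tests passed as the reward

    # Example: each test is a callable probing the generated namespace.
    tests = [lambda s: callable(s.get("add")) and s["add"](2, 3) == 5]
    print(reward("def add(a, b):\n    return a + b", tests))  # 1.0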
> but they will likely take advantage of the AST
In the end, a more general approach with more compute, always wins over applying domain knowledge like taking advantage of the AST. This is called “the bitter lesson”. http://www.incompleteideas.net/IncIdeas/BitterLesson.html
I don't think the bitter lesson applies to ASTs.
From the Bitter Lesson:
"Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better."
Those models are taking advantage of inductive biases. Every model has them, including the massive language models. They are not the same as engineered features (such as SIFTs) or heuristics.
Using the AST is just another way of looking at the code already in your dataset. For the model to understand what it is writing, it needs to map the text sequences to ASTs anyway. It can attempt to learn this, but the 12B model still produces illegal Python code, so it clearly hasn't.
"the bitter lesson" is a very interesting, thank you! However, I wonder if AST vs. text analysis is fully comparable to the examples given in the post. Applying human concepts for chess, go, image processing, etc. failed over statistical methods, but I don't think AST vs. text is exactly the same argument. IMO, using an AST is simply a more accurate representation of a program and doesn't necessarily imply an attempt to bring in human intuition/concepts.
I mean, the AST doesn't help at all with comments which are potentially the most valuable part of the code to an AI like this. Formatting is also ignored by the AST but may play a role in understanding, just as it can for humans.
The model can clearly already generate large amounts of code with no syntax errors in one shot. It's probably better at that than I am, I always need to fix something after typing a bunch of code without calling the compiler. I think that instead of adding a bunch of language-specific AST stuff it would be far better to simply give the model the ability to iterate on its solution the way humans do, to fix any syntax errors or logic bugs discovered by the compiler or at runtime. That could potentially work in a generic way for any language. It seems like the obvious next step, though figuring out how to train it is not obvious.
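Roughly this kind of loop, I imagine (the generate function standing in for the model is hypothetical, and real sandboxing/test execution is omitted):

    import traceback

    def iterate_on_solution(generate, prompt, max_attempts=5):
        feedback = ""
        for _ in range(max_attempts):
            source = generate(prompt + feedback)
            try:
                compile(source, "<generated>", "exec")   # syntax check, like a compiler pass
                exec(source, {})                         # runtime check (sandboxing omitted)
                return source                            # it ran without blowing up
            except Exception:
                # feed the error back into the prompt for the next attempt
                feedback = ("\n# Previous attempt failed with:\n# "
                            + traceback.format_exc().replace("\n", "\n# "))
        return None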
12B, though. What about 1.2T?
You need to scale the amount of data to take advantage of the increase in parameters. I'm not sure where we would find another 100 GitHubs worth of data.
"I see no reason that this technology couldn't smoothly scale into human-level intelligence, yet lots of people seem to think it'll require a step change or is impossible."
I am a big fan of LMs and am not in the "don't really understand" crowd, but here are a couple of reasons:
1. Large language models such as GPT or Codex still have several major architectural limitations. They lack the ability to make use of long-term memory, since they have a fairly limited amount of info they can take as input; GPT-3 is great at short stories, but can't go beyond that, and it's hard to prime it with a lot of information as you would, e.g., a new employee. There is some work on this, but afaik not very much, and it's very much unsolved.
2. Large language models have only gotten this good by ingesting massive amounts of data and scaling up compute. Yet this growth comes with diminishing returns for every order of magnitude. So simply not being able to scale either the data or the compute sufficiently (with existing hardware architectures) is a very plausible reason.
3. Large language models 'have it easy' because they only deal with one modality (text). Human intelligence, on the other hand, is multimodal - we can process visual input, sound, touch, etc. simultaneously and share concepts between these modalities. And we likewise output motor commands that result in motion and text. So far it's not too obvious how to achieve this - OpenAI took a step with DALL-E, but that was by just mining a massive amount of image-text pairs, and it's not obvious this is easy for other modalities, in particular for motor control.
4. Human-level intelligence is often framed as having system 1 (reactive output) and system 2 (longer term reasoning not in response to immediate stimuli) - this is not at all present in language models.
5. Related to the above two, at least some of human intelligence is derived from reinforcement learning (optimizing a policy that is multi-step with a delayed reward). This is much harder than the plain self-supervised learning of LMs.
And probably there are a bunch more like these. So while I do think these sorts of models represent a lot of progress, there are many reasons to be doubtful that just 'scale it up' will work to get much further.
As an add-on to this: I'd encourage anyone interested in this debate to read Rich Sutton's "The Bitter Lesson" (http://www.incompleteideas.net/IncIdeas/BitterLesson.html).
At every point in time, the best systems we can build today will be ones leveraging lots of domain-specific information. But the systems that will continue to be useful in five years will always be the ones that scale freely with increased parallel compute and data, which grow much faster than domain-specific knowledge. Learning systems with the ability to use context to develop domain-specific knowledge "on their own" are the only way to ride the wave of this computational bounty.
https://rodneybrooks.com/a-better-lesson/ is an interesting retort to the Sutton post.
> "language models don't really understand anything"
I have a sneaking suspicion that, if blinded, the crowd of people saying variations of that quote would also identify the vast majority of human speech as regurgitated ideas as well.
> I see no reason that this technology couldn't smoothly scale into human-level intelligence
Yup, the OpenAI scaling paper makes this abundantly clear. There is currently no end in sight for the size that we can scale GPT to. We can literally just throw compute at the problem and GPT will get smarter. That's never been seen before in ML. Last time I ran the calculations I estimated that, everything else being equal, we'd reach GPT-human in 20 years (GPT with similar parameter scale as a human brain). That's everything else being equal. It is more than likely that in the next twenty years innovation will make GPT and the platforms we use to train and run models like it more efficient.
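For a rough sense of where a number like that can come from (these are my assumptions, not necessarily how anyone else ran it: ~175B parameters for GPT-3, ~125T synapses in a human brain, and feasible model size doubling every couple of years):

    import math

    gpt3_params = 175e9         # GPT-3 parameter count
    brain_synapses = 125e12     # very rough synapse count for a human brain
    doubling_period_years = 2   # assumed growth rate of feasible model size

    doublings = math.log2(brain_synapses / gpt3_params)
    print(f"{doublings:.1f} doublings, ~{doublings * doubling_period_years:.0f} years")
    # ~9.5 doublings, i.e. on the order of 20 years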
And the truly terrifying thing is that, to me, GPT-3 has about the intelligence of a bug. Yet it's a bug whose whole existence is human language. It doesn't have to dedicate brain power to spatial awareness, navigation, its body, handling sensory input, etc. GPT-human will be an intelligence with the size of a human brain, but whose sole purpose is understanding human language. And it's been to every library to read every book ever written. In every language. Whatever failings GPT may have at that point, it will be more than capable of compensating for them in sheer parameter count, and leaning on the ability to combine ideas across the _entire_ human corpus.
All available through an API.
Sorry for my limited knowledge of GPT, but isn't it limited by the training data set, like all other models?
It probably can scale, but we're nowhere near the computational power we need to even recreate the brain. And don't forget, our brain took a billion years to evolve.
A typical brain has 80-90 billion neurons and 125 trillion synapses. That's a big freaking network to train.
Hopefully we can figure out how to train parts of it and then assemble something very smart.
Takes on average 2.5 decades to train it.
That's just from the most recent checkpoint :-)
If you were to build it "from scratch" you'd also need to include the millions of years of (distributed) evolution required to get that particular kid to that point.
Tony Zador has some interesting thoughts about that, including "A critique of pure learning", here: https://www.nature.com/articles/s41467-019-11786-6
The definition of "understanding" behaves just like the definition of "intelligence": The threshold to qualify gets pushed by as much as the technology progresses, so that nothing we create is ever intelligent and nothing ever understands.
I think intelligence as defined as "mapping inputs into goal states" is pretty well handled by models, and the models may be able to pick and choose states that are sufficient for achieving the goals.
However, the intelligence that's created by language models is very schizophrenic, and the human-level reflective intelligence that it displays is at best a bit of Frankenstein's monster (an agglomeration of utterances from other people that it uses to form sentences that form opinions of itself or its world).
I think that modeling will help us learn more about human intelligence, but we're going to have to do a lot better than just training models blindly on huge amounts of text.
Maybe we're also >50% Frankenstein monsters, an agglomeration of utterances from other people.
Humans (technologists) in particular are awful at extrapolating. Transformers being able to combine rudimentary, defined "concepts" and scaling that into human intelligence makes about as much sense as extrapolating a XOR gate or an if-else statement "scaling" to human intelligence.
I'm of the "human beings are much more than big linear algebra functions slapped on top of a large processor" crowd.
To play devil's advocate: it might just very well turn out that what most humans are currently engaged in is reducible to "big linear algebra functions slapped on top of a large processor".
And then the part about "being just like humans" will be the marketing gravy train that funds the operation.
I am exactly where you are. It's not a matter of if, merely a matter of when.
A warning to devs building on OpenAI APIs: We spent months developing a chatbot using GPT3 for our game and released a video showcasing it: https://www.youtube.com/watch?v=nnuSQvoroJo&t=264s
Afterwards, OpenAI added GPT3 chatbot guidelines disallowing basically anything like this. We were in communication with them beforehand, but they decided later that any sort of free-form chatbot was dangerous.
What they allow changes on a weekly basis, and is different for each customer. I don't understand how they expect companies to rely on them
OpenAI cloaks themselves in false "open" terminology to hide how proprietary and incredibly restrictive they've made their tech. That's a very cool demo; have you considered trying to make it run on GPT-J instead? It's an open source alternative you can run yourself or pay an independent api provider without supporting OpenAI.
Haven't been able to find a GPT-J service with good latency - though we haven't tried hosting ourselves
I have gotten it running on AWS in a container if you want the Dockerfile/scripts I can send it to you. Email is in my profile.
It sucks that OpenAI has no competition right now. They have every right to control their technology however they like. But it's a shame that they're being so stifling with that right, killing really fun stuff like you demonstrated.
But that monopoly won't last, and I think it's more than likely that competition will crop up within the next year. There's definitely a lot of secret sauce to getting a 175B parameter model trained and working the way OpenAI has. The people working there are geniuses. But it can still be reproduced, and will. Once competition arrives I'm hoping we'll see these shackles disappear and see the price drop as well. Meanwhile the open source alternatives will get better. We already have open source 6B models. A 60B model shouldn't be far off, and is likely to give us 90% of GPT-3.
AI21 Labs just announced their 'Jurassic' language model, free open signup
The notion of a toy like a chatbot being "dangerous" is just so ludicrous. The OpenAI folks take themselves way too seriously. Their technology is cool and scientifically interesting, but in the end it's nothing more than a clever parlor trick.
It's pretty easy to get GPT-3 to say things that are incredibly sexist and racist. I think OpenAI is more concerned about the bad press associated with that than AI-safety.
It’s also pretty easy to get `echo` to say things that are incredibly racist and sexist.
Which is even less ethically defensible.
I think different kind of dangerous, not the SkyNet stuff. The first idea that popped into my mind is below. I know, it's dark but...
8 year old to AI: "my parents won't let me watch TV, what do I do?". AI: "stab them, they'll be too busy to forbid you".
Then again the same thing can be said by a non-AI. My thinking is that you'd be talking to an actual average person. I'm not so sure that that is such a good thing.
The scary irony is that you just posted a training sample for a future AI.
You're right! And that's kinda my point. I can see other dangers of using GPT-3 that stem from assholes like me posting things on the Internet without thinking about literally everything they can be used for.
I wonder how many trolls are out there with the intent of poisoning AI training wells. When will they cause the first car crash by intentionally failing captchas?
Hope there's a human around to label that one as "ignore".
Definitely dangerous from a legal perspective if AI Dungeon is any indication.
The general public basically races to test the most controversial content. As exhibited by several other high-profile chatbot launches.
> Tay responded to a question on "Did the Holocaust happen?" with "It was made up"
That's a really interesting demo. What makes the responses so laggy? Does the model take that long to generate text? You can also experiment with things like repeating the user question or adding pauses like "hmm let's see" to make it less noticeable at least some of the time.
Too bad they asked you to pull it. What's the danger they are worried about? Annoying thing from their press releases is how seriously they take their GPT3 bot impact on humans. Despite all the hype, it's difficult to see the end of humanity by GPT3 bots any time soon. Honestly they need to rename themselves - can't see what's open about openai.
It's laggy since it needs to do speech to text, gpt3 text response, then text to speech. Not sure what adds the most latency actually.
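Easiest way to find out would be to time each stage separately; something like this (the three stage functions here are just placeholders for whatever STT/GPT-3/TTS calls are actually used):

    import time

    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        print(f"{name}: {time.perf_counter() - start:.2f}s")
        return result

    def respond(audio, speech_to_text, gpt3_complete, text_to_speech):
        text = timed("speech-to-text", speech_to_text, audio)
        reply = timed("gpt3 completion", gpt3_complete, text)
        return timed("text-to-speech", text_to_speech, reply)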
They only allow gpt3 chatbots if the chatbot is designed to speak only about a specific subject, and literally never says anything bad/negative (and we have to keep logs to make sure this is the case). Which is insane. Their reasoning to me was literally a 'what if' the chatbot "advised on who to vote for in the election". As if a chatbot in the context of a video game saying who to vote for was somehow dangerous
I understand the need to keep GPT3 private. There is a lot of possibility for deception using it. But they are so scared of their chatbot saying a bad thing and the PR around that they've removed the possibility of doing anything useful with it. They need to take context more into account - a clearly labeled chatbot in a video game is different than a Twitter bot
> But they are so scared of their chatbot saying a bad thing and the PR around that they've removed the possibility of doing anything useful with it.
It's not unreasonable to have checks-and-balances on AI content, and there should be.
However, in my testing of GPT-3's content filter when it was released (it could be improved now), it was very sensitive to the point that it had tons of false positives. Given that passing content filter checks is required for productionizing a GPT-3 app, that makes the API too risky to use, and it's part of the reason I'm researching more with train-your-own GPT models.
Why should there be checks and balances on AI content? What most people label as "AI" today is literally just fancy statistics. Should there be checks and balances on the use of linear regression analysis and other statistical techniques? Where do we draw the line?
> Should there be checks and balances on the use of linear regression analysis and other statistical techniques?
That rhetorical question actually argues against your point: even in academic contexts, statistics can be used (intentionally or otherwise) to argue incorrect/misleading points, which is why reputable institutions have peer reviews/boards as a level of validation for papers.
The point I was making was more on general content moderation in response to user-generated content, which is required for every service that does so for legal reasons at minimum, as they're the ones who will get blamed if something goes wrong.
Of course statistical techniques need checks and balances, hence peer-reviewed academic papers, meta-analyses, etc. Statistics is a major tool for science these days, and science needs checks and balances, otherwise it's a pretty idle effort. Without checks and balances, you could just imagine any theory and believe it's the truth because you want to.
But what if it wasn't clearly labeled? I did my MSc thesis on fake reviews and discussed the phenomenon known as "covert marketing" a bit, e.g. a guy you're talking to in a bar at some point steers the conversation to the excellent beer he is drinking and heavily recommends it to you. Good enough actors will be very convincing. "Influencers" are a somewhat more ethical alternative that takes advantage of humans' lemming-like nature.
I mean, quite a lot of people truly believe Hillary Clinton is the mastermind behind a DNC-run pedophile ring. Yes, she is a problem, but that theory is completely schizophrenic. An NPC masquerading as a real person who spouts positive talking points about Tucker Carlson's respect for Hungary is quite reasonable compared to that, and it will suck some people in.
So all it takes is some right wing developers for a not-entirely-just-a-game like Second Life or Minecraft to introduce a bug that allows certain NPC instances to be unlabeled... or a mod to a game that drives an NPC... and an equivalent to GPT-3 funded by the Kochs or the Mercers...
Very hypothetical, very hand waving. But it is possible. So I can see the PR and legal departments flat out stopping this idea.
Eh, I could still see a clearly labeled chatbot on a video game causing a major PR scandal if it says something offensive. Not really worth the risk.
Pretty bad that they took so long to decide on this, though, pulling out the rug from under developers' feet.
Autoregressive transformers take a while to generate text, since you need to run the whole model once for every word in the output.
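Roughly, the generation loop looks like this, which is why latency grows with output length (model here is a stand-in for one full forward pass returning the next token):

    def generate(model, prompt_tokens, max_new_tokens=100, stop_token=0):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model(tokens)    # one full forward pass per generated token
            if next_token == stop_token:
                break
            tokens.append(next_token)     # the new token becomes input for the next pass
        return tokens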
> but they decided later that any sort of free form chatbot was dangerous.
Seems like OpenAI saw this video differently. But then again, OpenAI now wants to police how GPT-3 is used and reject or approve what is acceptable for others using their service, since they can change their guidelines at any time.
They need a sense of humour, rather than policing projects like this.
> What they allow changes on a weekly basis, and is different for each customer.
Exactly. I don't know what to say to anyone building their entire business on top of OpenAI, since they can just revoke access instantly simply because they don't like what you are doing, and point to the 'guidelines'.
> I don't understand how they expect companies to rely on them
Won't be surprised to see Rockstar Games using a tweaked, self-hosted or private version for their future games for this use case, since OpenAI knows they can get a significant amount of money from large customers like them.
But not from smaller companies.
Oh man, I was looking forward to this a ton! Are you thinking to keep working on it with the open source GPT J or something similar by any chance?
I am looking at GPTJ, and also hoping OpenAI comes to their senses on how dangerous a video game chatbot can be
If GTA 6 doesn't have chatbots, I will be very disappointed. This has widened the possible level of immersion in action-adventure games and RPGs immensely.
> Afterwards OpenAI then added GPT3 chatbot guidelines disallowing basically anything like this. We were in communication with them beforehand, but they decided later that any sort of free form chatbot was dangerous.
Was this announced anywhere? We applied to deploy an application in this space, and they refused without providing any context, so I'd be really interested if they published details about restrictions in this space somewhere.
https://beta.openai.com/docs/use-case-guidelines/use-case-re... "reliably being able to limit the conversational topics to strictly X, Y, and Z topics"
This is stunning. Imagine being able to practice your foreign language lessons this way.
How many languages does GPT3 support at the moment?
I work in this domain, and you can make these things say anything with a little probing, even stuff like "Hitler was right to kill all the Jews, I wish he was still alive today."
They likely don't want to have "OpenAI GPT-3" and such stuff associated with one another in such demos; it would be really bad for their appearance.
I'm still surprised by the approach. I mean, great that it works this well -- but program synthesis is one of those rare domains where you can observe exactly what the outcome is after you generate something. You can see execution traces, variable values, what the JIT produced, etc. And all of this is relatively cheap -- often executing a code snippet should be far cheaper than an extra pass through a giant DNN right? So it's fascinating to me that they train entirely from dealing with code as text.
Imagine learning to develop recipes, not by ever cooking or eating or even seeing food, but only reading a giant library of cookbooks. Or learning to compose music but never hearing or playing anything -- only seeing scores.
FWIW execution guided code synthesis is a thing. Get a few possible outputs and ditch those that don't pass a parser as an example. At least in the SQL generation realm this is well worth the time it takes to tack onto a large language model.
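For Python output, the cheapest version of that filter is just trying to parse each candidate; a sketch:

    import ast

    def filter_candidates(candidates):
        # Keep only completions that are at least syntactically valid Python.
        keep = []
        for source in candidates:
            try:
                ast.parse(source)
                keep.append(source)
            except SyntaxError:
                pass
        return keep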
They just finished a demo on twitch. Pretty crazy!
https://www.twitch.tv/videos/1114111652
Starts at 15:45.
It is simultaneously impressive and underwhelming for me.
I mean yes this is a super impressive demo, but it didn't go beyond my expectation. I really want to see whether this model can write a correct binary search method without seeing one before.
Or, even if it correctly uses binary search, does it understand concepts like index boundaries?
> I really want to see whether this model can write a correct binary search method without seeing one before.
It has almost definitely seen a lot of coding problems so I would expect "write a function to binary search a sorted array" to output the intended result. I don't think anybody expects it to come up with algorithms it hasn't encountered.
I found the whole UI/sandbox they created the most interesting part. Now don't get me wrong, the tech is certainly great and all, but I really didn't have the feeling I watched/learned more than I already knew from what was shown with GitHub Copilot, although I was kinda impressed, if it really is as simple as they stated, at how it is able to adapt to new APIs.
It's a shame they only limited the demo to relatively simple instructions.
> I really want to see whether this model can write a correct binary search method without seeing one before.
I don't believe the model was trained on Google interview answers, sadly.
aaaand they've blocked audio until 18:17ish, timestamp url: https://www.twitch.tv/videos/1114111652?t=00h18m17s
lmao; copyright muted so you can't even hear them speaking.
They should have released this first instead of GitHub Copilot. The focus would then have been much more on "look at the cool stuff they can do" rather than "Microsoft is releasing a product that plagiarizes GPL code".
Once people had digested that and there had been a few other proof-of-concept business ideas around turning Codex into a SaaS (because some people will always queue to build their product on your API), announce the evil version. Not that I really think Copilot is evil, but the IP concerns are legitimate.
Yes, its very strange to announce the product first and then the research.
I will really be impressed when one can say: “here is this codebase, modify this function so that it produces [insert desired effect]” and the rest of the project's functionality doesn't come tumbling down…
Because writing code from scratch now is, I think, much rarer than improving existing codebases. Aka bugfixing.
Also curious what this AI would produce when provided with contradictory requests. Because often there are multiple requirements which on their own sound reasonable, but when you try to fit all of them into one system, things get nasty.
It is only able to translate small instructions into code. I think it will take a while to get to a situation where you can just give it a list of requirements and it spits out a working program.
Hell, it messed up when they gave it the instruction "make every fifth line bold" in the Word API part of the demo, where it made the first line of every paragraph (which is only 4 lines long in total) bold instead of every fifth line.
It didn't mess up the instruction "make every fifth line bold". The blank spaces between each "paragraph" are empty lines, so it counted them too. I think this is perfectly reasonable behavior, it's what I would have done absent further instructions too.
You can see it in the generated code on the bottom right during that part of the demo. It loops over the lines and bolds them when index % 5 == 0.
Edit: I guess with the 1-based indexing of natural languages, the code actually bolds lines number 1, 6, etc. So arguably it should have done index % 5 == 4 instead, to bold lines number 5, 10, etc. But funnily enough, if it had done that, it would have bolded all the empty lines, so it would have seemed like it didn't do anything.
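A toy version of the off-by-one, with 4-line paragraphs separated by blank lines:

    lines = (["text"] * 4 + [""]) * 3   # three 4-line paragraphs separated by blank lines

    bold_mod0 = [i + 1 for i in range(len(lines)) if i % 5 == 0]   # what the model did
    bold_mod4 = [i + 1 for i in range(len(lines)) if i % 5 == 4]   # the "lines 5, 10, ..." reading

    print(bold_mod0)  # [1, 6, 11] -> the first line of each paragraph
    print(bold_mod4)  # [5, 10, 15] -> exactly the blank separator lines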
I thought OpenAI was originally supposed to be some kind of for-the-good, non-profit institution studying AI and its safe use in particular with an effort to make it more accessible and available to all through more open collaboration. This is cool research, sure; but what happened to making models available for use by others instead of just through some opaque APIs?
Maybe I'm just remembering wrong or conflating OpenAI with some other entity? Or maybe I bought too much of the marketing early on.
No, they did some good; they've done a few things to personally help me. They created OpenAI Gym, which is a great help when doing reinforcement learning research and defined the standard interface for reinforcement learning libraries for a generation. But they no longer maintain OpenAI Gym.
They also created Spinning Up [0], one of the best resources I've found for learning reinforcement learning. Their teaching resources are detailed but relatively brief and are focused on implementing the algorithms, even if some of the "proofs" are neglected. But they no longer maintain Spinning Up.
So yes, originally they were for-the-good, but lately I've noticed them moving away from that in more ways than one. It seems they learned one cool trick with language sequence modelling, and they have a lot of compute, and this is all they do now.
That was the marketing message. They became for-profit in 2019 and took investment from Microsoft. Many people were skeptical before that because the main investors were mostly known for for-profit ventures.
I remember Sam Altman, when asked “How will you make money?”, reply they would ask the AI. I thought it was a fairly creative answer.
It turns out, however, that the way they plan on earning money is much less creative, and more run-of-the-mill SaaS monetization. In a way, I like to believe that a real AI would also end up with such a mundane strategy, as it’s the most likely to actually make them profitable and return money to investors.
OpenAI was founded in 2015. In 2015 Google was AI and AI was Google. There was legitimate concern that one American corporation was going to dominate AI. OpenAI was created to challenge that dominance and let "AI benefit all of humanity".
In the meantime China and Chinese companies have caught up. Turns out the fear of one company and one country dominating AI was overblown.
Maybe the OpenAI founders feel that the original goal has been fulfilled because AI is no longer dominated by the US and Google.
You're remembering correctly. OpenAI transitioned from non-profit to for-profit in 2019, took about $1 billion from Microsoft (there has been speculation that this was mostly in the form of Azure credits), and announced that Microsoft would be their preferred partner for commercializing OpenAI technologies: https://openai.com/blog/microsoft/
They very transparently transitioned to a for-profit company. It doesn't seem like they are aggressively profit-oriented though: I am a paying customer of OpenAI beta APIs and the cost to use the service is very low. It also solves several classes of tough NLP problems. I used to sell my own commercial NLP library - glad I gave up on that years ago.
I feel that OpenAI Codex could become like Webflow for coding. It might sound ironic, but what tools like Webflow do in the world of web programming is give creators the power to build something fast that can last, without the expertise of a decent web programmer.
If the same thing can happen in the world of programming, I guess evaluations like LeetCode and whiteboarding can go away, replaced by a new kind of logical-thinking evaluation, which could ultimately be a more realistic way for a programmer to rise above the chain.
I really want to just play with this tech- it’s frightening but also the future, but I’m still waiting to be accepted on the GitHub copilot waitlist. I wonder how long this will take for people who don’t know someone who knows someone…
Uhh... I'm literally no one but got the access for like a week or so. I got 134 repos and 12,060 contributions in the last year. Idk if that mattered.
I have around 10 repos and perhaps a few hundred contributions total over all my years. Got accepted within a few days. Seems to be random.
I have 353 repos and 2146 contributions in the past year and haven't been accepted (yet!). It does seem to be random.
that's not the future, these large language models have no understanding of language, they repeat the most frequently occurring patterns like parrots. They miss this whole thing called semantics.
Can this read existing code and fix one missing piece? That will be cool.
Say I have a question I can't solve by searching through stackoverflow. If the AI can solve a problem like that, it will be great.
Program Synthesis can do some rudimentary fixes. But I would love to explore this problem of program correction using AI.
I think integrations like the MS Word example they show off at the end of the live demo have the potential to be even more impactful than just generating code for programmers.
That still needs work though, it messed up the "Make every fifth line bold" pretty bad. Still, it showed it could adapt to a new API pretty well.
How did it mess up the "make every fifth line bold" prompt?
Also, to follow up on the original comment, AI demos are nice, but being a student of history there are still fundamental challenges with these systems. My skepticism is in how much prompting is really required and how can it understand higher level semantics like code refactoring, reproducible examples, large scale design patterns etc.
This synthesis of sequential symbolic processes and probabilistic neural generation is really exciting though. When the amount of human code edits and tweaking for complex programs goes down from hours to seconds then that's when I'll be impressed and scared.
Yeah, definitely. I guess my point was that converting natural language to source code can be even more valuable for people who don't know how to code, but want to perform actions more complicated than a simple button press. For example, I often find myself doing regex based find-and-replace-alls in text files, and even that feels inefficient while also being over the head of the vast majority of users. I'd imagine there are a lot of people out there spending many hours manually editing documents and spreadsheets.
In the demo video on https://openai.com/blog/openai-codex/#spacegame it seems like it goes and does an image search for a picture of an asteroid, and then embeds without attribution a direct link to an image hosted on "d.newsweek.com". Not sure I'd call that a resounding example of generating good code...
If this actually worked, wouldn’t that be amazing? If you could break down a software idea into a blue print of concepts that need to be accomplished, and then dictate what should be done…
I doubt it works, but I wonder how many decades from now we will be able to walk through a finite number of simple requests and wrap them together as working software. Then people will be able to convert their blueprint into action!
How does Codex make the connection between natural language and code? Comments? variable names? file names? docs?
I think all the above.
Do people comment more than I do? I comment only intermittently, usually if I think the code is going to be hard to read ... and it's not usually on the small details of the code I'm writing ... it does seem magic what they're doing.
I'm trying to extract some signal from this link...lots of upvotes, no comments, 30 min old, top 3 on HN...I'm worried this will be read as negative, but it's not, just learning, and enough time has passed I'm itching to jump in and ask:
- Is the significance here exactly what it says on the tin: the model behind GitHub's AI code completion will be shared with people on an invite basis? Or am I missing something?
- What is the practical import of the quote at the end of this comment?
"can now" makes me think its a new feature over Github's implementation, which would then indicate the "simple commands" could be general UI, or at least IDE UI, navigation.
If "can now" means "it is currently capable of, but will be capable of more", then I'd expect it to be the same as the current implementation on Github.
Quote: "Codex can now interpret simple commands in natural language and execute them on the user’s behalf—making it possible to build a natural language interface to existing applications."
Take a look at the video demo. It takes natural text in a box and generates code. Copilot was super-autocomplete, so the interface was writing code in an IDE that it filled out for you. Natural language interface will be a little easier for non-programmers. (Though, how would you read the code to make sure it does what you meant...)
>Take a look at the video demo. It takes natural text in a box and generates code. Copilot was super-autocomplete, so the interface was writing code in an IDE that it filled out for you.
No it wasn't; you can literally describe, in natural text, what you want in a comment and Copilot will do its best to generate a complete method based on that comment. It seemed so autocomplete-like because the demo focused on the "helping the developer" part.
I'm fairly sure CoPilot could have shown something similar if they had a demo where you could make something visual easily, like HTML + Javascript/Typescript/whatever scripting language. They're using exactly the same model (Codex) after all.
I watched their 30 minute demo on Twitch this morning, really good!
I use their OpenAI beta APIs as a paying customer, I am still waiting for access to Codex.
Very cool, will be interesting to see if this is ever added in to VisualStudio as some sort of "super" auto-complete.
"Converting Python to Ruby with OpenAI Codex"
Oof. Looking forward to maintaining some future ports done with this tool.
Can I use this to write Solidity contracts?
That's about as advisable as asking it to write firmware for a pacemaker. Smart contracts are some of the most delicate codebases - even a tiny bug can cause you to lose a lot of money. With a model like this bugs are very likely, especially in such a niche domain.
I agree that at first it seems to be a bad idea, but the more I think about it, the more I think it makes for a great test case to prove the AI is robust enough and that it can be trusted for real life scenarios. It also feels kind of inevitable.
It seems like a safe playground; maybe it will lose a few tokens, but if there are no legal consequences, who cares? The best place to move fast and break things, and with great reward potential.
If I hire some freelancer to write them, what's telling me that they are not already using something like Codex or Copilot? It's like when a factory releases its used water into the river nearby. Maybe we shouldn't drink from the river anymore, but it would be better to test the water the factory releases to make sure it's OK.
How will you know that the generated code doesn't have any bugs?
Codex in its current form is meant to be used as an assist for someone who can already code/debug, not as a replacement for a contractor.
Sure, some contracts will have bugs at the beginning, but those faulty contracts won't get reused.
In theory you can create a new token per contract to cap the maximum potential loss. Then you increase this maximum potential loss as the contract gets used in the open and therefore becomes more robust.
You could probably write some guarantee-fund smart contract to compensate for when bugs happen.
Sooner or later we will have to implement a high-level error-correction mechanism to handle the bugs generated by the weak AI: like a paradigm shift where you expect bugs to happen and handle them, instead of expecting no bugs.
That has got to be one of the worst possible use cases one could imagine. On page 33 of the appendix, the authors note that nearly 40% of RSA encryption keys created by Codex are clearly insecure.
Only if tokens have value.
If Codex is able to handle a generic API from reading the docs, it could maybe use a Python library for Solidity contracts like https://web3py.readthedocs.io/en/stable/contracts.html
As a contract user, I'd probably have more trust in a contract written by an independent AI from a short natural language specification which can't hide intent, than in a contract with a hidden backdoor or a subtle bug.
Also the AI will probably improve with usage.
You could probably generate multiple versions of your contract, and maybe a high-level bug-correction scheme, like taking the median action across those versions, could increase bug robustness and find the edge cases where the actions differ.
What does that have to do with anything?
A new way to talk to the computer I guess.
Would like to say "Fix that something of undefined error" some day.
The Writing On The Wall
I don't understand what is going on; why are people even spending time on this? I think this, Copilot, etc. are solving the non-problem of "we will remove the boring part of programming" by generating a bunch of code, so now it's even more boring to read it and check if it actually does what you want.
At the same time, zero of the developers I interviewed know how a linked list is laid out in memory, or what the pros/cons of contiguous memory layouts are, or even how a CPU actually works.
Maybe those things are not needed anymore, but I see their code... I think it will be better if they know them.
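For reference, the distinction I mean is roughly this (a toy Python illustration; memory-layout discussions are better had in C, but the idea carries over):

    import array

    class Node:
        def __init__(self, value, nxt=None):
            self.value = value
            self.next = nxt

    # Contiguous layout: the values sit next to each other in one memory block,
    # so iteration is cache-friendly and indexing is O(1).
    contiguous = array.array("i", [1, 2, 3, 4])

    # Linked layout: each node is a separate heap object holding a value plus a
    # pointer to the next node, so indexing is O(n) but mid-list insertion is O(1)
    # once you already hold a reference to the neighbouring node.
    head = Node(1, Node(2, Node(3, Node(4))))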
Think bigger. Say I'm starting a startup:
1. "Setup Django, Nginx, and Postgres deployed on a Digital Ocean Ubuntu droplet." Done.
2. "Make a shopping page like $URL." Done.
3. "Fill it with data from X and connect with Stripe." Done.
4. ???
5. Profit
Seems like even a great dev will take 20x the time to do that, assuming the model is able to correctly generate this, even with an error or a customization or two.
If you don't have someone that understands the generated code, you'll be kinda screwed. Most of my work isn't writing a function to do X. It's reading and understanding all the surrounding code and architecture and then knowing that I need a function to do X. Writing the actual function isn't usually much of a challenge. I get the feeling that this tool will just encourage write-only code that ultimately no one understands. Will all of the generated code follow a consistent style? Will it know to use the framework you built or will it just reinvent everything it needs for each problem you give it? I already see tons of code that people copy and paste without really understanding it, and a lot of the time they're just adding complexity by solving non-problems. This just automates that process. I can see it being useful in certain narrow cases, but the potential for misuse is huge.
at the point where 1/2/3 are possible, what value does the startup have when anyone else can ask it to do the same thing?
Do your competitors have access to this tool that gets you started 20x faster? If so, you want the tool.
Your copycat startup may not have incredible value, but selling shovels always pays.
ah, true... touché
Realistically, if we get to that point, the landscape of "startups" will change drastically.
Why would you mention "Django", "Nginx", "Postgres", "Digital Ocean", "Ubuntu" or "Stripe"? Surely those are implementation details that the user wouldn't care about.
but does it really matter, if 20x is 1 week instead of 2 hours?
are startups really that shallow?
1/20th of the time? That's kind of a big deal.
I think the 1/20th of the time mentioned was only at the start; I don't think you will gain a lot after that, as the spaghetti AI will come to collect.
You have a debt to pay. -- Davy Jones
That depends: https://xkcd.com/1205/
It's perfectly OK for a one-time setup to take a few days, especially if afterwards you have a documented process that allows you to modify and improve the result.
This is just nascent technology leading toward something like this:
"Computer, I want to play a game."
"Okay, what will the game be?"
"I want to be a starship captain, give me a cool space ship I can explore the galaxy with"
"Okay... like this?"
"Not quite, make the galaxy more realistic, with real stars and planets. Also make it 3d. I want to be the captain inside the ship."
"How about now?"
"Cool, and there should be space stations I can visit near planets, and I can fly my ship to stars with hyperspace. Make it so I have to trade for fuel at the space stations, maybe I need to mine asteroids or search derelict space ships for treasure. I want to play with my friends too, they can have their own ships or walk around my ship."
"Done, was there anything else?"
"Yes, add different alien races to some of the star systems, and make some of them have alliances. I want to talk to the aliens about their history and culture. Sometimes aliens are unfriendly and we'll have space battles if talking doesn't work. Make it so I can command a fleet and call for reinforcements."
"Processing... Done. Anything else?"
"Actually this is boring, can we start over?"
"Game erased. Please provide new prompt."
Oh! This will be so cool! Do you really think it could lead in that direction? To me it seems more like a metaphysical cargo cult. I think I am too pessimistic; I should shake it off, since nothing good comes out of being pessimistic (by definition).
Thanks for the inspiration!
> do you really think it could lead in that direction?
If you asked me 20 years ago, or even 10, I'd have said it was total science fiction. I wouldn't have been able to imagine how to do it. If you asked me 5 years ago, I'd have vaguely said something about AI, half jokingly. At the time I thought perhaps the models could be trained so we can do test-only development and let AI trained on formal test cases generate endless code until all tests pass, but I didn't really imagine it would be possible to get a computer to take freeform written English (even in a tightly controlled manner) and produce functioning code.
Over the past couple of years I have seen increasingly fluent demonstrations and tried a few myself, and I have fallen off the fence: given the pace at which machine learning and AI-assisted programming keep advancing, I think this outcome is all but inevitable, as far-fetched as it seems.
I was messing with the OpenAI sandbox over the weekend and it helped me generate several game design concepts from prompts similar to my post above that I could see myself being interested in building and playing. It's not difficult to imagine down the line with a few more advancements in this tech that the generated design could then instruct the code generator, fetch the assets, and stage the environment for a player or user to enter without ever touching a line of code.
I'm not close enough to the research itself to know which of those problems are hard and which are easy, so I don't know if we'll see the first totally AI-generated "proto-holodeck" tech demo in the next 5 years, or the next 20 years, but I can't see it being more than 50 years away, and something tells me with the pace of things it will be much sooner than that, assuming we're all still around at the time to enjoy it.
I wonder what it will make when you ask it to make a good bot AI for a game.
"make a game with a formidable opponent that plays good enough to win with 51% probability"
and of course the inevitable "make a better version of yourself"
From what I've seen, the technology can fuse together a remarkable range of outputs, but all of them are essentially recombinations of the training set. If there were enough examples of AI opponents, it conceivably could do it, since most game AIs are some form of state machine combined with a degree of statistical analysis and pathfinding (for mobile AI actors). It would "just" be replicating existing patterns.
As I understand it, it would take a dramatic leap from this kind of interpolation to being able to extrapolate and "self improve". So far I haven't seen anything that convinces me we're close to this, but again I'm not close to the wheel on the research side of things.
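The "game AI is mostly a state machine" point is easy to make concrete. A stripped-down guard AI might look like the sketch below; the states and distance thresholds are invented for illustration:

    # Minimal sketch of the "game AI = state machine" pattern described above.
    # States and transition rules are invented for illustration.
    from enum import Enum, auto

    class State(Enum):
        PATROL = auto()
        CHASE = auto()
        ATTACK = auto()

    class GuardAI:
        def __init__(self):
            self.state = State.PATROL

        def update(self, distance_to_player):
            # Transition rules: a handful of thresholds, nothing learned.
            if self.state == State.PATROL and distance_to_player < 10:
                self.state = State.CHASE
            elif self.state == State.CHASE:
                if distance_to_player < 2:
                    self.state = State.ATTACK
                elif distance_to_player > 15:
                    self.state = State.PATROL
            elif self.state == State.ATTACK and distance_to_player >= 2:
                self.state = State.CHASE
            return self.state

    guard = GuardAI()
    for d in (20, 9, 1.5, 3, 30):
        print(d, guard.update(d))   # PATROL -> CHASE -> ATTACK -> CHASE -> PATROL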
Also known as the holodeck from Star Trek.
It seems like they're going in totally the wrong direction. If program content is predictable based on patterns (low entropy) then that's a sign that our programming languages are too low level. If we want to improve developer productivity then the solution is the same as it always has been: create higher level languages which abstract away all the repetitive patterns.
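A tiny example of the "abstract away the repetitive pattern" move this comment is arguing for; the retry/logging boilerplate here is invented, but it is exactly the kind of low-entropy code a model happily regenerates:

    # Toy illustration of raising the abstraction level instead of generating
    # the same pattern over and over: capture the repeated retry/logging
    # boilerplate once, as a higher-level construct.
    import time
    import functools

    def retry(times=3, delay=0.5):
        """Decorator that captures the repeated retry pattern once."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                for attempt in range(1, times + 1):
                    try:
                        return fn(*args, **kwargs)
                    except Exception as exc:
                        print(f"{fn.__name__} failed (attempt {attempt}): {exc}")
                        if attempt == times:
                            raise
                        time.sleep(delay)
            return wrapper
        return decorator

    @retry(times=3)
    def fetch_user(user_id):   # placeholder for any flaky operation
        ...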
Tools are relatively low level compared to any single use case or field because they should universally support all use cases or fields. The more narrow your field or use case is, the fewer resources there are to create a higher-level language that abstracts away the details that aren't important for your area, but are important to other areas. In this manner, Codex has enormous potential.
You're interviewing programmers for a job in operating systems programming?
Just full-stack devs, React Native + Go. Is it so wrong to think they are the same? Programming is programming; most computers work in a similar way, no?
But they also don't know how garbage collection works in their language, or how to work with a million items efficiently, or why the app pauses for 100 ms because someone sorts while parsing dates inside the comparison (see the sketch after this comment).
For example, I have seen people who can't imagine the cost of a leaked database transaction, even back-of-the-napkin: how many changes happened in between, how much has to be unwound when the session disconnects, when it will even disconnect given the connection pool, etc. Because the SQL server is this magic RDS thing, as if AWS will solve everything with its pixie dust.
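The 100 ms sort pause mentioned above is usually exactly this: parsing inside the comparison instead of once per element. The timestamps below are fabricated; the point is only where the parsing happens:

    # The "sort while parsing dates within the sort" mistake, made concrete.
    import random
    import time
    from datetime import datetime
    from functools import cmp_to_key

    FMT = "%Y-%m-%d %H:%M:%S"
    rows = [f"2021-{random.randint(1, 12):02d}-{random.randint(1, 28):02d} 12:00:00"
            for _ in range(5_000)]

    def cmp_parse(a, b):
        # Re-parses both operands on every comparison: O(n log n) strptime calls.
        da, db = datetime.strptime(a, FMT), datetime.strptime(b, FMT)
        return (da > db) - (da < db)

    t0 = time.perf_counter()
    slow = sorted(rows, key=cmp_to_key(cmp_parse))
    t1 = time.perf_counter()

    # key= computes the parsed date once per element: O(n) strptime calls.
    fast = sorted(rows, key=lambda r: datetime.strptime(r, FMT))
    t2 = time.perf_counter()

    assert slow == fast
    print(f"parse-in-comparator: {t1 - t0:.2f}s, parse-once key: {t2 - t1:.2f}s")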
Well, to be fair, none of the things mentioned are specific to operating systems programming. I think that not understanding the things mentioned is one of the reasons software runs so slowly these days, even on computers that are orders of magnitude more powerful than just twenty years ago.