“Can machines think?” So asked Alan Turing in his 1950 paper, “Computing Machinery and Intelligence.” Turing quickly noted that, given the difficulty of defining thinking, the question is “too meaningless to deserve discussion.” As is often done in philosophical debates, he proposed replacing it with a different question. Turing imagined an “imitation game,” in which a human judge converses with both a computer and a human (a “foil”), each of which vies to convince the judge that they are the human. Importantly, the computer, foil, and judge do not see one another; they communicate entirely through text. After conversing with each candidate, the judge guesses which one is the real human. Turing’s new question was, “Are there imaginable digital computers which would do well in the imitation game?”
Turing proposed this game, now known as the Turing Test, to combat the widespread intuition that computers, by virtue of their mechanical nature, cannot think, even in principle. His point was that if a computer seems indistinguishable from a human (aside from its appearance and other physical characteristics), why shouldn’t we consider it to be a thinking entity? Why should we restrict “thinking” status only to humans (or, more generally, entities made of biological cells)? As the computer scientist Scott Aaronson described it, Turing’s proposal is “a plea against meat chauvinism.”
Turing offered his test as a philosophical thought experiment, not as a practical way to gauge a machine’s intelligence. However, the Turing Test has taken on iconic status in the public’s mind as the ultimate milestone of artificial intelligence (AI)—the chief metric to determine if general machine intelligence has arrived. And now, nearly 75 years later, the reporting on AI is full of pronouncements that the Turing Test has finally been passed by chatbots such as OpenAI’s ChatGPT and Anthropic’s Claude. Last year, OpenAI’s CEO Sam Altman posted, “[G]ood sign for the resilience and adaptability of people in the face of technological change: the [T]uring test went whooshing by and everyone mostly went about their lives.” Various media headlines have made similar claims, such as one newspaper’s report that “ChatGPT passes the famous ‘Turing test’—suggesting the AI bot has intelligence equivalent to a human.”
Have modern chatbots actually passed the Turing Test? And if so, should we grant them thinking status, as Turing proposed? Surprisingly, given the Turing Test’s broad cultural importance, there’s little agreement in the AI community on the criteria for passing, and much doubt about whether having conversational skills that can fool a human reveals anything about a system’s underlying intelligence or “thinking status.”
Because he was not proposing a practical test, Turing’s description of the imitation game was short on details. How long should the test last? What types of questions are allowed? What qualifications do humans need to act as the judge or the foil? Turing didn’t specify such fine points. He did make one specific prediction: “I believe that in about 50 years’ time it will be possible to programme computers...to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.” In short, after a five-minute conversation, the average judge will be fooled at least 30% of the time.
Some have taken this casual prediction as the “official” criterion for passing the Turing Test. In 2014, the Royal Society in London hosted a Turing Test competition with five computer programs, 30 human foils, and 30 judges. The human participants were a diverse group of young and old, native and non-native English speakers, and computer experts and non-experts. Each judge conducted several rounds of five-minute conversations in parallel with a pair of contestants—one human and one machine—after which the judge had to guess which was human. A chatbot named “Eugene Goostman,” which purported to be a Ukrainian teenager, won the competition by fooling 10 (33.3%) of the judges. Adopting the “30% chance of fooling after five minutes” criterion, the organizers proclaimed, “[t]he 65-year-old iconic Turing Test was passed for the very first time by computer programme Eugene Goostman....This milestone will go down in history...”
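To make the arithmetic behind that proclamation explicit, here is a minimal sketch of the pass condition the organizers applied. The figures (10 of 30 judges) are those reported above; the function itself is a hypothetical formalization of the “30% after five minutes” reading of Turing’s prediction, not any official rule.

```python
# Illustrative check of the "fooled at least 30% of judges after five
# minutes" criterion, applied to the 2014 competition result.
# The function is a hypothetical formalization, not an official rule.

def meets_turing_prediction(judges_fooled: int, total_judges: int,
                            threshold: float = 0.30) -> bool:
    """True if the chatbot's fooling rate meets or exceeds the threshold."""
    return judges_fooled / total_judges >= threshold

print(meets_turing_prediction(judges_fooled=10, total_judges=30))
# True: 10/30 is roughly 33.3%, just above the 30% bar.
```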
AI experts, reading the transcripts of Eugene Goostman’s conversations, scoffed at the claim that this unsophisticated and unhumanlike chatbot had passed the kind of test Turing had in mind. The limited conversation time and uneven expertise of the judges made the test one of human gullibility rather than machine intelligence. The results were a stark example of the “ELIZA effect,” named for the 1960s chatbot ELIZA that, in spite of its utter simplicity, managed to fool many people into thinking it was an understanding, sympathetic psychotherapist. The effect plays on our human tendency to ascribe intelligence to any entity that seems able to converse with us.
Another Turing Test competition, the Loebner Prize, allowed more conversation time, included more expert judges, and required a contestant to fool at least half of them. In nearly 30 years of annual competitions, no machine passed this version of the test.
Although Turing’s original paper lacked specifics on how a test should be carried out, it was clear that the imitation game required three participants: a computer, a human foil, and a human judge. However, the meaning of the term “Turing Test” in public discourse has evolved over the years into something considerably weaker: any interaction between a human and a computer in which the computer seems sufficiently humanlike.
For example, when the Washington Post reported in 2022 that “Google’s AI passed a famous test—and showed how the test is broken,” they were referring not to an imitation game, but rather to Google engineer Blake Lemoine’s impression that Google’s LaMDA chatbot was “sentient.” A 2024 press release from Stanford University proclaimed that a Stanford team’s research “marks one of the first times an artificial intelligence source has passed a rigorous Turing test.” But here, the so-called Turing Test consisted of statistically comparing GPT-4’s behavior on psychological surveys and interactive games with that of humans. The Stanford team’s formulation might not be recognizable to Turing: “We say an AI passes the Turing test if its responses cannot be statistically distinguished from randomly selected human responses.”
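For readers who want a concrete sense of what “cannot be statistically distinguished” might mean in practice, here is a minimal sketch. It is not the Stanford team’s actual methodology, and the data below are synthetic placeholders generated purely for illustration.

```python
# A toy version of "statistically indistinguishable": compare AI and
# human responses on some numeric measure with a two-sample test.
# NOTE: synthetic placeholder data; the Stanford study's measures,
# sample sizes, and tests may all differ.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
human_responses = rng.normal(loc=3.5, scale=1.0, size=200)  # placeholder
ai_responses = rng.normal(loc=3.6, scale=1.0, size=200)     # placeholder

result = ks_2samp(ai_responses, human_responses)
# A large p-value means the test FAILS to tell the samples apart --
# the (weak) sense in which an AI "passes" under this formulation.
print(f"KS statistic = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```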
The most recent claims of a chatbot passing the Turing Test involved a 2024 study that used a “two-player formulation” of the test: Unlike Turing’s “three-player” imitation game, in which a judge questions both a computer and a human foil, here each judge interacted only with a computer or with a human. The researchers recruited 500 human participants, each of whom was assigned to be either a judge or a human foil. Each judge played a single five-minute round of the game with either a foil, GPT-4 (which had been prompted with human-written suggestions on how to fool a judge), or a version of the ELIZA chatbot. After conversing over a web interface for five minutes, the judge guessed whether their conversation partner was human or machine. Human foils were judged to be human on 67% of their rounds; GPT-4 was judged human on 54% of its rounds; and ELIZA, on 22%. The authors defined “passing” as fooling judges more than 50% of the time—that is, more than would be achieved by random guessing. By this definition, GPT-4 passed, even though the human foils had a higher score.
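The “better than random guessing” criterion invites a statistical question that the percentages alone cannot answer: with how many rounds is 54% reliably above chance? A back-of-the-envelope sketch follows; the per-condition round count is a hypothetical placeholder, since the article reports only the rates, not the counts.

```python
# Back-of-the-envelope significance check for the "better than random
# guessing" pass criterion. The round count n is HYPOTHETICAL; only
# the 54% and 50% figures come from the study as reported above.

from scipy.stats import binomtest

n_rounds = 150                    # assumed number of GPT-4 rounds (illustrative)
k_human = round(0.54 * n_rounds)  # rounds judged "human" at a 54% rate

result = binomtest(k_human, n_rounds, p=0.5, alternative="greater")
print(f"{k_human}/{n_rounds} judged human, one-sided p = {result.pvalue:.3f}")
# With a margin as small as 54% vs. 50%, whether the result clears
# conventional significance depends heavily on n -- one reason this
# definition of "passing" drew scrutiny.
```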
It’s certainly concerning that a majority of the human judges were fooled by GPT-4 after a five-minute conversation. The use of generative AI systems to impersonate humans to propagate disinformation or carry out scams is a genuine danger that society must grapple with. But is it true that today’s chatbots have passed the Turing Test?
The answer is, of course, it depends on which version of the test you’re talking about. A three-player imitation game with expert judges and longer conversation time has still not been passed by any machine (though there are plans to hold an ultra-strict version of it in 2029).
Because its focus is on fooling humans rather than on more directly testing intelligence, many AI researchers have long dismissed the Turing Test as a distraction, a test “not for AI to pass, but for humans to fail.” But the test’s prominence in popular culture persists. Holding a conversation is a big part of how each of us assesses other humans, so it’s natural to assume that an agent that can converse fluently must possess humanlike intelligence and other mental characteristics such as beliefs, desires, and a sense of self.
However, if the history of AI has taught us anything, it’s that our intuitions about such assumptions are often wrong. Decades ago, many prominent AI experts believed that creating a machine that could beat humans at chess would require something equivalent to full human intelligence. “If one could devise a successful chess machine, one would seem to have penetrated to the core of human intellectual endeavor,” wrote AI pioneers Allen Newell and Herbert Simon in 1958, and cognitive scientist Douglas Hofstadter predicted in 1979 that in the future, “there may be programs which can beat anyone at chess, but...they will be programs of general intelligence.” Of course, within two decades IBM’s Deep Blue beat world chess champion Garry Kasparov using a brute-force approach that is far from anything we would call “general intelligence.” Similarly, progress in AI has shown that tasks once thought to require general intelligence—speech recognition, natural language translation, and even driving—can be carried out by machines that lack anything like human understanding.
It’s likely that the Turing Test will become yet another casualty of our shifting conceptions of intelligence. In 1950, Turing intuited that the ability to hold a humanlike conversation should be firm evidence of “thinking,” and all that goes with it. That intuition is still strong today. But perhaps what we have learned from ELIZA and Eugene Goostman, and what we may still learn from ChatGPT and its ilk, is that the ability to sound fluent in natural language, like the ability to play chess, is not conclusive proof of general intelligence.
Indeed, there is emerging evidence from neuroscience that language fluency is surprisingly dissociated from other aspects of cognition. MIT neuroscientist Ev Fedorenko and her collaborators have shown in a series of careful and compelling experiments that the brain networks underlying what they call “formal linguistic competence”—the abilities related to language production—are largely separate from the networks underlying common sense, reasoning, and other aspects of what we might call “thinking.” Our intuitive assumption that fluent language is a sufficient condition for general intelligence is, these researchers claim, a “fallacy.”
In his 1950 paper, Turing wrote, “I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.” We’re not there yet. It remains to be seen if Turing’s prediction is merely off by a few decades, or if the real alteration will be in our conceptions of what “thinking” is—and our realization that intelligence is more complex and subtle than Turing, and the rest of us, had appreciated.