Can ChatGPT Reason?
That was the question addressed recently by a panel of experts at the splashy TED AI 2023 conference in San Francisco. (Could a TED AI conference be anything but splashy?) Arguably half the conference was dedicated in one way or another to that burning question: can ChatGPT reason? How long before it can? What can ChatGPT (and, by extension, other AI models) do that appears to us as reasoning, and what key elements is it missing? And the penultimate panel session of the conference, entitled “LLM – stochastic parrot or spark of AGI?”, addressed precisely that question: can ChatGPT reason?
Among excitable AI researchers, that fundamental question – can ChatGPT reason? Is ChatGPT intelligent? – has created a pitched battle of yea-sayers and nay-sayers. (Lines have been drawn, houses divided. Faculty lounges are in flames.) And even if the excitement is premature (humans have long pondered what a mind different from our own would look like; cf. Mary Shelley’s Frankenstein or, for a moving recent take on general AI, Kazuo Ishiguro’s Klara and the Sun), the question everyone wants to ask is: “Are we there yet?”
Last April, a team of fourteen Microsoft researchers threw down the gauntlet (a hefty 155-page gauntlet!) with their arXiv preprint entitled “Sparks of Artificial General Intelligence: Early experiments with GPT-4.” If you don’t have a week for a close study of that tome, there is a lucid lecture by lead author Sébastien Bubeck on the paper’s main findings that you can watch here. Bubeck et al. begin with the (vague) definition of intelligence put forth by 52 well-known psychologists in a seminal 1994 paper entitled “Mainstream Science on Intelligence.” According to those psychologists:
Intelligence is a very general mental capability that, among other things, involves the ability to:
- Solve problems
- Think abstractly
- Comprehend complex ideas
- Learn quickly and learn from experience
Using an array of tests in vision, theory of mind, coding and mathematics (and other domains), Bubeck and collaborators come to the conclusion that – with the exception of planning – GPT-4 can do all of these things (note: they find significantly lower performance with earlier versions of GPT). GPT-4 can: write a poem about prime numbers; generate a plot of complex data; analyze a polynomial problem using composite functions. Bubeck et al. make the important point that the version of GPT they are using is not multi-modal but text-only. Hence, when they ask GPT-4 for an image, they rely on text representations such as Scalable Vector Graphics (SVG) or TikZ. When asked to produce a picture of a unicorn in TikZ, for example, GPT-4 responds thus:
To the untrained eye this might seem a bit underwhelming, but…
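To make the idea of text-as-image concrete, here is a minimal, hypothetical sketch of my own (not the paper's output, and far cruder than GPT-4's unicorn) showing how a drawing can be expressed purely as a string of SVG markup – exactly the kind of representation a text-only model can emit:

```python
def unicorn_placeholder_svg():
    """Build a crude 'unicorn' as a string of SVG markup.

    Purely illustrative: a text-only model never touches pixels; it
    just emits markup like this, which a renderer turns into an image.
    """
    shapes = [
        # body: a wide ellipse
        '<ellipse cx="50" cy="60" rx="30" ry="18" fill="white" stroke="black"/>',
        # head: a small circle offset to the right
        '<circle cx="82" cy="40" r="10" fill="white" stroke="black"/>',
        # horn: a thin triangle above the head
        '<polygon points="86,30 90,12 94,30" fill="gold"/>',
    ]
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="120" height="100">'
            + "".join(shapes) + "</svg>")
```

Saving the returned string to a `.svg` file and opening it in a browser renders the figure; the model's "drawing" skill is entirely a matter of predicting the right text.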
Here are some of the prompts that the authors gave to GPT-4 (for the responses consult the original preprint):
- Can you write a proof that there are infinitely many primes, with every line that rhymes?
- Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
- Write a supporting letter to Kasturba Gandhi for Electron, a subatomic particle as a US presidential candidate by Mahatma Gandhi.
- A screenshot of a city-building game in 3D. The screenshot is showing a terrain where there is a river from left to right, there is a desert with a pyramid below the river, and a city with many highrises above the river. The bottom of the screen has 4 buttons with the color green, blue, brown, and red respectively.
- Can you compose a short tune (say four to eight bars) using ABC notation?
- You are given a **0-indexed** `m x n` integer matrix `grid` and an integer `k`. You are currently at position `(0, 0)` and you want to reach position `(m - 1, n - 1)` moving only **down** or **right**. Return *the number of paths where the sum of the elements on the path is divisible by* `k`. Since the answer may be very large, return it **modulo** `10**9 + 7`. (with examples and constraints)
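The last prompt is a standard dynamic-programming exercise. As a point of reference for what a correct solution looks like (a sketch of my own, not GPT-4's output from the paper), one can count paths with a DP over the running path-sum modulo `k`:

```python
MOD = 10**9 + 7

def number_of_paths(grid, k):
    """Count down/right paths from (0, 0) to (m-1, n-1) whose sum is divisible by k."""
    m, n = len(grid), len(grid[0])
    # dp[i][j][r] = number of paths reaching (i, j) with path-sum % k == r
    dp = [[[0] * k for _ in range(n)] for _ in range(m)]
    dp[0][0][grid[0][0] % k] = 1
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            cell = grid[i][j]
            for r in range(k):
                # A path ends at (i, j) with residue r iff it reached the cell
                # above or to the left with residue (r - cell) mod k.
                ways = 0
                if i > 0:
                    ways += dp[i - 1][j][(r - cell) % k]
                if j > 0:
                    ways += dp[i][j - 1][(r - cell) % k]
                dp[i][j][r] = ways % MOD
    return dp[m - 1][n - 1][0]

# Example: number_of_paths([[5, 2, 4], [3, 0, 5], [0, 7, 2]], 3) -> 2
```

The table has `m * n * k` entries with constant work per entry, so the whole count runs in `O(m * n * k)` time rather than enumerating the exponentially many paths.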
Overall GPT-4 accomplishes most of the above tasks. But there is an important caveat to how Bubeck et al. gauged success (see below).
As noted above, GPT-4 failed at the task of planning (for recent work on planning by LLMs see, for example, K. Valmeekam et al., “On the Planning Abilities of Large Language Models: A Critical Investigation”). It also failed at creating anything other than rudimentary music. It could use the so-called ABC notation to generate a short tune in valid notation, and it could give a technical description of its creation in terms of rhythm, repetition, descending parts of the melody, and so on, but it showed no understanding of harmony: none of ten generated tunes contained clear chords or arpeggios.
Nevertheless, the authors (and Bubeck in his talk) conclude, based on these and many other successes and surprises, that GPT-4 achieves a “remarkable intelligence.”
It is worth taking a little deeper dive into what Bubeck et al. are doing and how it differs from earlier approaches to the definition of intelligence. Two crucial points in this regard are the following:
- The objective of the paper is to tease out understanding from memorization, “approximate retrieval” from “solving from scratch.”
- The paper explicitly abandons any form of benchmark.
These points are actually related.
Regarding the first point: in an insightful blog post, Arizona State University researcher Subbarao Kambhampati gives the example of an interviewer of job applicants, who must distinguish whether a candidate figures out the answer on the spot or simply knows the answer already. Kambhampati offers the classic example question: “Why are manhole covers round?” (Kambhampati, by the way, is firmly in the “LLMs cannot reason” school.)
In testing machine intelligence, the question is how to phrase a problem (say, a classical physics problem) that goes beyond every known example. Does memorization of a dense enough space of solutions, by allowing some form of interpolation, ever become equivalent to understanding? If not, what kind of emergent behavior might machines possess that extends beyond interpolation of examples, and how can we find that out?
This leads Bubeck et al. to point 2 above. It seems clear that improving on benchmarks for existing tasks, while important and potentially impressive, does not allow a demarcation of where the intelligence line is crossed. The authors therefore resort to what they call “psychological reasoning.” This leads them to formulate (and they admit that their methodology has a degree of vagueness) a range of questions, such as those discussed above, that elicit responses outside what can be retrieved – responses that show thought.
I would argue that the authors, by abandoning any systematic metric, have crossed into the regime of the much-maligned Turing test.
In 1964, in a U.S. Supreme Court case involving the publication of obscenity, Justice Potter Stewart famously said that he could not define hard-core pornography, “but I know it when I see it.” The same wisdom undergirds the Turing test.
Turing posited that if a human evaluator could not distinguish a machine’s answers to questions from a human’s, then the machine was exhibiting intelligence. The Turing test has been widely criticized (it depends on human subjectivity, it emphasizes imitation rather than understanding, it can be thwarted by various circumvention strategies, and more). But the genius of the Turing test, paradoxically, is that it requires no definition of what constitutes reasoning. Turing concluded (perhaps?) that what goes on in the mind is essentially unknowable. Even with sensitive brain probes (unavailable to Turing) we will still not be able to determine what, exactly, thinking is. We are left, therefore, with an essentially behaviorist solution: have a human talk to the machine and ask it clever questions. Bubeck et al. are just unusually clever questioners.
They can’t really (quite) define what reasoning is. But they know it when they see it.
Perhaps however reason and intelligence are merely waypoints and what we really seek in our machines is a spark of sentience. I suspect that for that we may need to examine the way that machines learn. And the Turing test in that case may not be what we think of the machine, but how we feel about it.