The AI world is moving so fast that it’s easy to get lost amid the flurry of shiny new products. OpenAI announces one, then the Chinese startup DeepSeek releases one, then OpenAI immediately puts out another one. Each is important, but focus too much on any one of them and you’ll miss the really big story of the past six months.
The big story is: AI companies now claim that their models are capable of genuine reasoning — the type of thinking you and I do when we want to solve a problem.
And the big question is: Is that true?
The stakes are high, because the answer will inform how everyone from your mom to your government should — and should not — turn to AI for help.
If you’ve played around with ChatGPT, you know that it was designed to spit out quick answers to your questions. But state-of-the-art “reasoning models” — like OpenAI’s o1 or DeepSeek’s r1 — are designed to “think” a while before responding, by breaking down big problems into smaller problems and trying to solve them step by step. The industry calls that “chain-of-thought reasoning.”
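For the curious, here is what that difference looks like in practice, sketched in Python. The ask_model helper is a hypothetical stand-in for whatever chat API you use, and the prompts are invented for illustration; reasoning models like o1 and r1 do something like the second call automatically, behind the scenes.

```python
# A minimal sketch of chain-of-thought prompting. ask_model() is a hypothetical
# stand-in for a real chat API call; here it just returns a placeholder string.

def ask_model(prompt: str) -> str:
    return "<model output>"  # swap in a real API call to experiment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Direct prompting: the model has to blurt out an answer in one shot.
direct_answer = ask_model(question)

# Chain-of-thought prompting: the model is nudged to write out intermediate
# steps before committing to a final answer. Reasoning models like o1 and r1
# are trained to do this kind of step-by-step "thinking" on their own.
cot_answer = ask_model(
    question + "\nThink through the problem step by step, then give the final answer."
)

print(direct_answer)
print(cot_answer)
```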
These models are yielding some very impressive results. They can solve tricky logic puzzles, ace math tests, and write flawless code on the first try. Yet they also fail spectacularly on really easy problems: o1, nicknamed Strawberry, was mocked for bombing the question “How many ‘r’s are there in ‘strawberry’?”
AI experts are torn over how to interpret this. Skeptics take it as evidence that “reasoning” models aren’t really reasoning at all. Believers insist that the models genuinely are doing some reasoning, and though it may not currently be as flexible as a human’s reasoning, it’s well on its way to getting there.
The best answer will be unsettling to both the hard skeptics of AI and the true believers.
What counts as reasoning?
Let’s take a step back. What exactly is reasoning, anyway?
AI companies like OpenAI are using the term reasoning to mean that their models break down a problem into smaller problems, which they tackle step by step, ultimately arriving at a better solution as a result.
But that’s a much narrower definition of reasoning than a lot of people might have in mind. Although scientists are still trying to understand how reasoning works in the human brain — never mind in AI — they agree that there are actually lots of different types of reasoning.
There’s deductive reasoning, where you start with a general statement and use it to reach a specific conclusion. There’s inductive reasoning, where you use specific observations to make a broader generalization. And there’s analogical reasoning, causal reasoning, common sense reasoning … suffice it to say, reasoning is not just one thing!
Now, if someone comes up to you with a hard math problem and gives you a chance to break it down and think about it step by step, you’ll do a lot better than if you have to blurt out the answer off the top of your head. So, being able to do deliberative “chain-of-thought reasoning” is definitely helpful, and it might be a necessary ingredient for getting anything really difficult done. Yet it’s not the whole of reasoning.
One feature of reasoning that we care a lot about in the real world is the ability to suss out “a rule or pattern from limited data or experience and to apply this rule or pattern to new, unseen situations,” writes Melanie Mitchell, a professor at the Santa Fe Institute, together with her co-authors in a paper on AI’s reasoning abilities. “Even very young children are adept at learning abstract rules from just a few examples.”
In other words, a toddler can generalize. Can an AI?
A lot of the debate turns on this question. Skeptics are very, well, skeptical of AI’s ability to generalize. They think something else is going on.
“It’s a kind of meta-mimicry,” Shannon Vallor, a philosopher of technology at the University of Edinburgh, told me when OpenAI’s o1 came out in September.
She meant that while an older model like ChatGPT mimics the human-written statements in its training data, a newer model like o1 mimics the process that humans engage in to come up with those statements. In other words, she believes, it’s not truly reasoning. It would be pretty easy for o1 to just make it sound like it’s reasoning; after all, its training data is rife with examples of that, from doctors analyzing symptoms to decide on a diagnosis to judges evaluating evidence to arrive at a verdict.
Besides, when OpenAI built the o1 model, it made some changes from the previous ChatGPT model but did not dramatically overhaul the architecture — and ChatGPT was flubbing easy questions last year, like answering a question about how to get a man and a goat across a river in a totally ridiculous way. So why, Vallor asked, would we think o1 is doing something totally new and magical — especially given that it, too, flubs easy questions? “In the cases where it fails, you see what, for me, is compelling evidence that it’s not reasoning at all,” she said.
Mitchell was surprised at how well o3 — OpenAI’s newest reasoning model, announced at the end of last year as a successor to o1 — performed on tests. But she was also surprised at just how much computation it used to solve the problems. We don’t know what it’s doing with all that computation, because OpenAI is not transparent about what’s going on under the hood.
“I’ve actually done my own experiments on people where they’re thinking out loud about these problems, and they don’t think out loud for, you know, hours of computation time,” she told me. “They just say a couple sentences and then say, ‘Yeah, I see how it works,’ because they’re using certain kinds of concepts. I don’t know if o3 is using those kinds of concepts.”
Without greater transparency from the company, Mitchell said we can’t be sure that the model is breaking down a big problem into steps and getting a better overall answer as a result of that approach, as OpenAI claims.
She pointed to a paper, “Let’s Think Dot by Dot,” where researchers did not get a model to break down a problem into intermediate steps; instead, they just told the model to generate dots. Those dots were totally meaningless — what the paper’s authors call “filler tokens.” But it turned out that just having additional tokens there allowed the model more computational capacity, and it could use that extra computation to solve problems better. That suggests that when a model generates intermediate steps — whether it’s a phrase like “let’s think about this step by step” or just “….” — those steps don’t necessarily mean it’s doing the human-like reasoning you think it’s doing.
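Roughly speaking, the contrast looks like the sketch below. This is a loose illustration, not the paper’s actual setup (there, the model is trained to emit the dots itself), and ask_model is again a hypothetical placeholder.

```python
# Three prompting styles, loosely inspired by "Let's Think Dot by Dot":
# answer immediately, think step by step, or pad with meaningless filler tokens.
# ask_model() is a hypothetical placeholder; in the actual paper, the model is
# trained to generate the filler dots itself rather than having them prepended.

def ask_model(prompt: str) -> str:
    return "<model output>"  # swap in a real model call to experiment

problem = "A train travels at 60 miles per hour for 2.5 hours. How far does it go?"

answers = {
    # No intermediate tokens at all.
    "immediate": ask_model(problem + "\nAnswer:"),
    # Human-readable intermediate steps (chain of thought).
    "step_by_step": ask_model(problem + "\nLet's think about this step by step."),
    # Meaningless filler tokens: more tokens mean more computation per answer,
    # even though the dots themselves carry no information.
    "filler": ask_model(problem + "\n" + "." * 40 + "\nAnswer:"),
}

for style, answer in answers.items():
    print(style, "->", answer)
```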
“I think a lot of what it’s doing is more like a bag of heuristics than a reasoning model,” Mitchell told me. A heuristic is a mental shortcut — something that often lets you guess the right answer to a problem, but not by actually thinking it through.
Here’s a classic example: Researchers trained an AI vision model to analyze photos for skin cancer. It seemed, at first blush, like the model was genuinely figuring out if a mole is malignant. But it turned out the photos of malignant moles in its training data often contained a ruler, so the model had just learned to use the presence of a ruler as a heuristic for deciding on malignancy.
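Here is a toy version of that failure in Python. The data is fabricated for illustration, and real image models obviously don’t consume hand-labeled booleans; the point is just that a feature which merely correlates with the label can look like competence.

```python
# A toy illustration of shortcut learning, loosely modeled on the ruler anecdote.
# The data is fabricated: each example records whether the photo contains a ruler
# and whether the mole is actually malignant.

training_data = [
    # (has_ruler, is_malignant)
    (True, True), (True, True), (True, True), (True, True),
    (False, False), (False, False), (False, False), (True, False),  # one exception
]

def shortcut_classifier(has_ruler: bool) -> bool:
    """Predict malignancy purely from the presence of a ruler in the photo."""
    return has_ruler

correct = sum(
    shortcut_classifier(has_ruler) == is_malignant
    for has_ruler, is_malignant in training_data
)
# The shortcut scores well on this data without "knowing" anything about moles.
print(f"{correct}/{len(training_data)} correct using only the ruler heuristic")
```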
Skeptical AI researchers think that state-of-the-art models may be doing something similar: They appear to be “reasoning” their way through, say, a math problem, but really they’re just drawing on a mix of memorized information and heuristics.
Other experts are more bullish on reasoning models. Ryan Greenblatt, chief scientist at Redwood Research, a nonprofit that aims to mitigate risks from advanced AI, thinks these models are pretty clearly doing some form of reasoning.
“They do it in a way that doesn’t generalize as well as the way humans do it — they’re relying more on memorization and knowledge than humans do — but they’re still doing the thing,” Greenblatt said. “It’s not like there’s no generalization at all.”
After all, these models have been able to solve hard problems beyond the examples they’ve been trained on — often very impressively. For Greenblatt, the simplest explanation as to how is that they are indeed doing some reasoning.
And the point about heuristics can cut both ways, whether we’re talking about a reasoning model or an earlier model like ChatGPT. Consider the “a man, a boat, and a goat” prompt that had many skeptics mocking OpenAI last year:
[Screenshot: ChatGPT’s needlessly convoluted answer to the man-and-goat prompt, which mentions a cabbage that was never in the question.]
What’s going on here? Greenblatt says the model messed up because this prompt is actually a classic logic puzzle that dates back centuries and would have appeared many times in the training data. In some formulations of the river-crossing puzzle, a farmer with a wolf, a goat, and a cabbage must ferry them across a river by boat. The boat can carry only the farmer and a single item at a time, and if left alone together, the wolf will eat the goat or the goat will eat the cabbage, so the challenge is to get everything across without anything getting eaten. The model would instantly “recognize” the puzzle, which explains its mention of a cabbage in its response.
“My best guess is that the models have this incredibly strong urge to be like, ‘Oh, it’s this puzzle! I know what this puzzle is! I should do this because that performed really well in the training data.’ It’s like a learned heuristic,” Greenblatt said. The implication? “It’s not that it can’t solve it. In a lot of these cases, if you say it’s a trick question, and then you give the question, the model often does totally fine.”
Humans fail in the same way all the time, he pointed out. If you’d just spent a month studying color theory — from complementary colors to the psychological effects of different hues to the historical significance of certain pigments in Renaissance paintings — and then got a quiz asking, “Why did the artist paint the sky blue in this landscape painting?”… well, you might be tricked into writing a needlessly complicated answer! Maybe you’d write about how the blue represents the divine heavens, or how the specific shade suggests the painting was done in the early morning hours which symbolizes rebirth … when really, the answer is simply: Because the sky is blue!
Ajeya Cotra, a senior analyst at Open Philanthropy who researches the risks from AI, agrees with Greenblatt on that point. And, she said of the latest models, “I think they’re genuinely getting better at this wide range of tasks that humans would call reasoning tasks.”
She doesn’t dispute that the models are doing some meta-mimicry. But when skeptics say “it’s just doing meta-mimicry,” she explained, “I think the ‘just’ part of it is the controversial part. It feels like what they’re trying to imply often is ‘and therefore it’s not going to have a big impact on the world’ or ‘and therefore artificial superintelligence is far away’ — and that’s what I dispute.”
To see why, she said, imagine you’re teaching a college physics class. You’ve got different types of students. One is an outright cheater: He just looks in the back of the book for the answers and then writes them down. Another student is such a savant that he doesn’t even need to think about the equations; he understands the physics on such a deep, intuitive, Einstein-like level that he can derive the right equations on the fly. All the other students are somewhere in the middle: They’ve memorized a list of 25 equations and are trying to figure out which equation to apply in which situation.
Like the majority of students, AI models are pairing some memorization with some reasoning, Cotra told me.
“The AI models are like a student that is not very bright but is superhumanly diligent, and so they haven’t just memorized 25 equations, they’ve memorized 500 equations, including ones for weird situations that could come up,” she said. They’re pairing a lot of memorization with a little bit of reasoning — that is, with figuring out what combination of equations to apply to a problem. “And that just takes you very far! They seem at first glance as impressive as the person with the deep intuitive understanding.”
Of course, when you look harder, you can still find holes that their 500 equations just happen not to cover. But that doesn’t mean zero reasoning has taken place.
In other words, the models are neither exclusively reasoning nor exclusively just reciting.
“It’s somewhere in between,” Cotra said. “I think people are thrown off by that because they want to put it in one camp or another. They want to say it’s just memorizing or they want to say it’s truly deeply reasoning. But the fact is, there’s just a spectrum of the depth of reasoning.”
AI systems have “jagged intelligence”
Researchers have come up with a buzzy term to describe this pattern of reasoning: “jagged intelligence.” It refers to the strange fact that, as computer scientist Andrej Karpathy explained, state-of-the-art AI models “can both perform extremely impressive tasks (e.g., solve complex math problems) while simultaneously struggling with some very dumb problems.”
[Illustration: Drew Shannon for Vox]
Picture it like this. If human intelligence looks like a cloud with softly rounded edges, artificial intelligence is like a spiky cloud with giant peaks and valleys right next to each other. In humans, a lot of problem-solving capabilities are highly correlated with each other, but AI can be great at one thing and ridiculously bad at another thing that (to us) doesn’t seem far apart.
Mind you, it’s all relative.
“Compared to what humans are good at, the models are quite jagged,” Greenblatt told me. “But I think indexing on humans is a little confusing. From the model’s perspective, it’s like, ‘Wow, those humans are so jagged! They’re so bad at next-token prediction!’ It’s not clear that there’s some objective sense in which AI is more jagged.”
The fact that reasoning models are trained to sound like humans reasoning predisposes us to compare AI intelligence to human intelligence. But the best way to think of AI is probably not as “smarter than a human” or “dumber than a human” but just as “different.”
Regardless, Cotra anticipates that sooner or later AI intelligence will be so vast that it can contain within it all of human intelligence, and then some.
“I think about, what are the risks that emerge when AI systems are truly better than human experts at everything? When they might still be jagged, but their full jagged intelligence encompasses all of human intelligence and more?” she said. “I’m always looking ahead to that point in time and preparing for that.”
For now, the practical upshot for most of us is this: Remember what AI is and isn’t smart at — and use it accordingly.
The best use case is a situation where it’s hard for you to come up with a solution, but once you get a solution from the AI you can easily check to see if it’s correct. Writing code is a perfect example. Another example would be making a website: You can see what the AI produced and, if you don’t like it, just get the AI to redo it.
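A tiny, hypothetical example of that asymmetry: even if a model drafted the function for you, a few quick checks tell you whether to trust it.

```python
# A minimal sketch of the "hard to write, easy to verify" pattern. Pretend
# model_drafted_dedupe() came back from an AI assistant; the assertions are
# the cheap verification step you do yourself.

def model_drafted_dedupe(items):
    """Remove duplicates while preserving the order of first appearance."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Cheap checks: if any of these fail, send the code back for another pass.
assert model_drafted_dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]
assert model_drafted_dedupe([]) == []
assert model_drafted_dedupe(["a", "a", "b"]) == ["a", "b"]
print("All checks passed.")
```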
In other domains — especially ones where there is no objective right answer or where the stakes are high — you’ll want to be more hesitant about using AI. You might get some initial suggestions from it, but don’t put too much stock in them, especially if what the model is saying seems off to you. Take asking for advice on how to handle a moral dilemma: you might use the model’s answer to see what thoughts it provokes in you, without trusting it to hand you the final answer.
“The more things are fuzzy and judgment-driven,” Cotra said, “the more you want to use it as a thought partner, not an oracle.”