New Research Challenges Apple’s Claims on LLM ‘Reasoning Collapse’: What You Need to Know – 9to5Mac

Apple’s recent AI research paper, “The Illusion of Thinking,” stirred up plenty of discussion. It concluded that even advanced Large Reasoning Models (LRMs) collapse on sufficiently complex tasks. But not everyone sees it that way.

Alex Lawsen from Open Philanthropy published a counterargument, claiming that Apple’s findings stem more from experimental design issues than from true reasoning failures. His paper credits Anthropic’s Claude Opus model as a co-author.

The Rebuttal: Examining the Details

Lawsen’s critique, titled “The Illusion of the Illusion of Thinking,” acknowledges that LRMs do struggle with hard problems. However, he argues that Apple mistook output constraints for genuine reasoning failures.

Here are three key points from Lawsen’s argument:

  1. Token Limits Ignored: Apple reported that models “collapse” on complex tasks like the Tower of Hanoi, but the models were running into their output token limits, with some explicitly noting, “I’ll stop here to save tokens.” The quick arithmetic after this list shows why that matters.

  2. Unsolvable Puzzles: Apple’s River Crossing benchmark included instances that are mathematically impossible to solve, yet models were scored as failures for recognizing this and declining to produce a solution.

  3. Flawed Evaluations: Apple’s automated scoring judged models only on complete, fully enumerated move lists, so responses truncated by token limits were classified as total reasoning failures rather than partial successes.
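
For context on the token-limit point: the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so 10 disks already require 1,023 moves and 15 disks require 32,767. Written out move by move, solutions of that size can exhaust a model’s output budget long before its reasoning does.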

A New Approach: Let the Model Write Code

Lawsen re-ran the Tower of Hanoi test with a different output format, asking models to write a recursive Lua function that generates the solution rather than listing every move. Models such as Claude and OpenAI’s o3 handled the harder instances without trouble, suggesting that once the output constraint is lifted, LRMs can reason about these problems better than Apple’s results imply.
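
Lawsen’s paper doesn’t reproduce the exact prompts or model outputs, but the kind of recursive solver he asked for takes only a few lines of Lua. The sketch below is illustrative, not his published code:

```lua
-- Classic recursive Tower of Hanoi: move n disks from peg 'from' to peg 'to',
-- using 'via' as the spare. The function encodes the entire solution, so the
-- model never has to spell out each move in its output.
local function hanoi(n, from, to, via, moves)
  if n == 0 then return moves end
  hanoi(n - 1, from, via, to, moves)                       -- park the top n-1 disks on the spare peg
  moves[#moves + 1] = { disk = n, from = from, to = to }   -- move the largest disk
  hanoi(n - 1, via, to, from, moves)                       -- stack the n-1 disks back on top of it
  return moves
end

-- A 15-disk solution is 2^15 - 1 = 32,767 moves: far too long to list token
-- by token, but trivial to express as a program.
local moves = hanoi(15, "A", "C", "B", {})
print(#moves)  -- 32767
```

Scoring a generated program by executing it, rather than grading a transcript of moves, is what lets this kind of test separate reasoning from output length.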

Importance of the Debate

At first glance, this might seem like an academic debate, but it matters. Apple’s paper has been widely cited as evidence that today’s LLMs fundamentally lack scalable reasoning. Lawsen’s response suggests a more nuanced picture: LRMs may falter under tight output constraints, but their reasoning capabilities aren’t as weak as the paper implies.

Yet a real challenge remains. Lawsen agrees that true algorithmic generalization is still hard, and he recommends that future studies:

  1. Design evaluations that separate reasoning ability from output limits (a sketch of one way to do this follows the list).
  2. Verify that puzzles are solvable before testing models on them.
  3. Use complexity metrics that reflect computational difficulty, not just solution length.
  4. Consider multiple solution representations (such as code versus full move lists) to distinguish algorithmic understanding from execution.
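
As a purely hypothetical illustration of the first recommendation: an evaluator can execute a model-generated solver and check the result, so a correct algorithm isn’t penalized for the length of its move list. The verify function below is an assumed example harness, and it relies on the hanoi function from the earlier sketch being in scope.

```lua
-- Hypothetical evaluation harness: simulate the pegs and verify that a
-- generated move sequence is legal and actually solves the puzzle.
local function verify(moves, n)
  -- all n disks start on peg A, largest at the bottom
  local pegs = { A = {}, B = {}, C = {} }
  for disk = n, 1, -1 do table.insert(pegs.A, disk) end

  for _, m in ipairs(moves) do
    local src, dst = pegs[m.from], pegs[m.to]
    local disk = table.remove(src)
    -- illegal if the source peg is empty or the disk lands on a smaller one
    if disk == nil or (#dst > 0 and dst[#dst] < disk) then
      return false
    end
    table.insert(dst, disk)
  end

  -- solved when every disk has ended up on peg C
  return #pegs.C == n and #pegs.A == 0 and #pegs.B == 0
end

print(verify(hanoi(15, "A", "C", "B", {}), 15))  -- true for a correct solver
```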

The core issue isn’t whether LRMs can reason, but whether our evaluations can accurately measure it.

Lawsen’s insights remind us to look closely at how we assess AI capabilities. Before concluding that reasoning isn’t there, it’s worth examining the standards we use to measure it.



