Apple researchers identify fundamental limits in advanced AI models, question industry progress


WEB DESK: Apple has published a study highlighting significant challenges facing cutting-edge artificial intelligence systems, raising concerns about the tech industry’s pursuit of increasingly powerful AI technologies. The research indicates that large reasoning models (LRMs)—a sophisticated form of AI designed to handle complex tasks—experience a “complete accuracy collapse” when confronted with highly intricate problems.

According to the paper, standard AI models outperform LRMs on simpler tasks, but both types of models struggle and ultimately fail when tackling more complex puzzles. LRMs attempt to address difficult questions by generating detailed, step-by-step reasoning processes; however, as problems grow more complex, these models tend to reduce their reasoning efforts, leading to performance breakdowns. The researchers found this reduction in effort particularly troubling.

Gary Marcus, a prominent AI academic and critic, described the findings as “pretty devastating” and questioned the industry’s race toward artificial general intelligence (AGI)—the hypothetical ability of AI systems to perform any intellectual task humans can do. In his Substack newsletter, Marcus argued that the study’s results cast doubt on claims that large language models (LLMs), such as ChatGPT, are a direct pathway to AGI capable of transforming society.

The study also revealed that reasoning models waste computational resources on simple problems, typically finding the correct solution early but continuing to explore incorrect alternatives. When faced with slightly more complex tasks, the models explore incorrect options before eventually arriving at correct solutions. Under even higher complexity, however, they often fail entirely, unable to generate any valid solution. Notably, the models exhibited counterintuitive behavior: as problem difficulty increased, they reduced their reasoning effort despite the rising challenge.

The researchers highlighted that these findings point to a “fundamental scaling limitation” in the reasoning capabilities of current AI models. To test these limits, they used controllable puzzle environments such as the Tower of Hanoi and River Crossing, though they acknowledged that the focus on puzzles limits how far their conclusions generalize.
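The Tower of Hanoi illustrates why such puzzles give researchers a controllable difficulty dial: the optimal solution for n disks takes 2^n - 1 moves, so each added disk doubles the work. Below is a minimal Python sketch of the classic recursive solution, included purely for illustration; it is not code from the Apple paper.

```python
# A minimal sketch (not from the Apple study) of the classic Tower of Hanoi
# recursion, showing why the puzzle scales cleanly with problem size:
# solving n disks takes exactly 2**n - 1 moves, doubling per added disk.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of moves that transfers n disks from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller disks out of the way
    moves.append((source, target))               # move the largest disk to the target peg
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller disks on top of it
    return moves

if __name__ == "__main__":
    for n in range(1, 11):
        assert len(hanoi(n)) == 2**n - 1  # optimal move count grows exponentially
        print(f"{n} disks -> {2**n - 1} moves")
```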

The paper examined models including OpenAI’s o3, Google’s Gemini Thinking, Anthropic’s Claude 3.7 Sonnet-Thinking, and DeepSeek-R1. Requests for comment sent to Google and DeepSeek went unanswered, while OpenAI declined to comment.

Regarding “generalizable reasoning”, an AI model’s ability to extend conclusions drawn from specific cases to broader, unfamiliar problems, the study states that these insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers.

Andrew Rogoyski from the University of Surrey’s Institute for People-Centred AI commented that the findings indicate the industry is still working out how to achieve AGI and may have reached a “cul-de-sac” with existing methods. The models’ strong performance on simpler tasks but failure on complex ones, he noted, suggests current strategies for developing truly general reasoning AI may have hit their limits.
