

November 13, 2025

Shattering the Illusion: MAKER Achieves Million-Step, Zero-Error LLM Reasoning

A new approach from the Lab shows how distributing reasoning across millions of AI agents can achieve unprecedented reliability, pointing to a practical path for scaling LLM intelligence to organizational and societal levels


Large Language Models (LLMs) have achieved remarkable breakthroughs in reasoning, insight generation, and tool use. They can plan multi-step actions, generate creative solutions, and assist in complex decision-making. Yet these strengths fade when tasks stretch over long, dependent sequences. Even small per-step error rates compound quickly, turning an impressive short-term performance into complete long-term failure.
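To make the compounding concrete, consider an illustrative back-of-the-envelope calculation: even at 99.9% per-step accuracy, the probability of completing one million dependent steps without a single mistake is

```latex
\[
0.999^{10^{6}} \;=\; e^{10^{6}\,\ln 0.999} \;\approx\; e^{-1000},
\]
```

i.e., effectively zero.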

That fragility poses a fundamental obstacle for real-world systems. Most large-scale human and organizational processes – from manufacturing and logistics to finance, healthcare, and governance – depend on millions of actions executed precisely and in order. A single mistake can cascade through an entire pipeline. For AI to become a reliable participant in such processes, it must do more than reason well. It must maintain flawless execution over time, sustaining accuracy across millions of interdependent steps.

Apple’s recent study, The Illusion of Thinking, captured this challenge vividly. Researchers tested advanced reasoning models such as Claude 3.7 Thinking and DeepSeek-R1 on structured puzzles like Towers of Hanoi, where each additional disk doubles the number of required moves. The results revealed a sharp reliability cliff: models performed perfectly on simple problems but failed completely once the task crossed about eight disks, even when token budgets were sufficient. In short, more “thinking” led to less consistent reasoning.


Figure 1. Accuracy collapse in reasoning models as task complexity increases. Frontier reasoning models such as Claude 3.7 Thinking and DeepSeek-R1 perform well at low complexity but fail completely beyond eight disks in Towers of Hanoi.

Adapted from “The Illusion of Thinking,” Apple AI Research (2025).

What if the problem isn’t how models think, but how their work is structured?

At our AI Lab, in collaboration with UT Austin, we explored that question in our new research, Solving a Million-Step LLM Task with Zero Errors. The result is MAKER (Maximal Agentic decomposition, K-threshold Error mitigation, and Red-flagging), a system that achieves reliability through extreme decomposition and local error correction. Rather than relying on a single monolithic agent to reason flawlessly across the entire process, MAKER distributes the task across millions of focused microagents, each responsible for one atomic action.

Using this structure, MAKER became the first system to complete a task requiring over one million LLM steps with zero errors, and the analysis shows it can, in principle, scale much further.

The core idea: smash the task, focus the model, and vote

MAKER is built on the idea that intelligence can scale through structure. Instead of building ever-larger models, it organizes many smaller ones into a precise, error-tolerant system. The framework combines three key mechanisms that together make long-horizon reliability possible:

Maximal Agentic Decomposition (MAD): The task is divided into the smallest possible subproblems, often a single decision per agent. Each agent receives only the minimal context needed for its assigned step. This modularity prevents context drift, isolates errors, and enables efficient error correction.

First-to-ahead-by-k voting: Several agents attempt the same step in parallel. The system accepts the first action to achieve k more votes than any other, forming a rapid local consensus (see the sketch after this list). Small gains in absolute accuracy per step compound exponentially across thousands or millions of steps, turning local agreement into global reliability.

Red-Flagging: Some model outputs show structural signs of confusion, such as excessive length or incorrect formatting. MAKER automatically discards these responses before they influence voting, then resamples. This not only improves accuracy but also reduces correlated errors, ensuring that localized failures do not propagate through the process.
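To make the mechanics concrete, here is a minimal Python sketch of the voting-plus-red-flagging loop. This is an illustration, not the authors’ implementation; `sample_step` and `red_flag` are hypothetical placeholders for a microagent call and a structural output check:

```python
import random
from collections import Counter

def first_to_ahead_by_k(sample_step, k=3, red_flag=lambda a: False, max_samples=10_000):
    """Accept the first action whose vote count leads every rival by k.

    sample_step: hypothetical callable that runs one microagent attempt.
    red_flag:    hypothetical check for structural signs of confusion
                 (e.g., excessive length or bad formatting).
    """
    votes = Counter()
    for _ in range(max_samples):
        action = sample_step()
        if red_flag(action):
            continue  # discard the flagged sample and resample
        votes[action] += 1
        ranked = votes.most_common(2)
        leader, lead_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        if lead_votes - runner_up_votes >= k:
            return leader  # local consensus reached
    raise RuntimeError("no k-lead consensus within the sample budget")

# Demo: a mock microagent that picks the right move 80% of the time.
mock_step = lambda: "correct" if random.random() < 0.8 else "wrong"
print(first_to_ahead_by_k(mock_step, k=3))  # almost always "correct"
```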

Together, these three mechanisms yield formal scaling predictions: the number of required votes increases only logarithmically with the total number of steps, while cost grows roughly linearly. By contrast, if each agent handles multiple steps, cost grows exponentially with the horizon. The conclusion is clear: smashing the problem into atomic steps allows reliability to scale where increasing model size alone cannot.
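The logarithmic claim can be seen in a simplified two-candidate model (an illustrative bound, not the paper’s full analysis). If each sampled vote is correct with probability p > 1/2, the race to a k-vote lead is a gambler’s ruin, so the wrong action wins any given step with probability at most ((1−p)/p)^k. A union bound over m steps then requires

```latex
\[
m \left(\frac{1-p}{p}\right)^{k} \le \delta
\quad\Longrightarrow\quad
k \;\ge\; \frac{\ln(m/\delta)}{\ln\bigl(p/(1-p)\bigr)},
\]
```

so the required lead k grows only logarithmically in m, while the expected number of samples per step (and hence cost) grows roughly linearly in k. For example, p = 0.8, m = 10^6, and an overall failure budget δ = 0.1 give k ≥ ln(10^7)/ln 4 ≈ 12.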


Figure 2. Orthogonal directions to scaling AI. MAKER scales reliability instead of model size: the chart compares leading LLMs by error-free step length versus cost, showing that while single models fail after hundreds of steps, MAKER completes over one million with zero errors.

A million-step evaluation: Towers of Hanoi, zero errors

To test the theory, the team applied MAKER to the Towers of Hanoi puzzle with 20 disks, a version requiring exactly 1,048,575 dependent steps. Every move had to be correct.

Each microagent was assigned a single move, using the same global strategy but local context. Before scaling up, smaller experiments measured each model’s per-step accuracy and cost to choose the most cost-effective setup. Surprisingly, smaller, non-reasoning models such as gpt-4.1-mini and gpt-oss-20B provided the best reliability-per-dollar.
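For context, the optimal move sequence the microagents must collectively reproduce is fully determined. A minimal sketch (illustrative, not the paper’s code) shows the task’s structure, with each yielded move corresponding to one microagent’s single atomic decision:

```python
from typing import Iterator, Tuple

def hanoi_moves(n: int, src="A", aux="B", dst="C") -> Iterator[Tuple[int, str, str]]:
    """Yield (disk, from_peg, to_peg) for the optimal n-disk solution;
    each yielded move is the one atomic decision a microagent makes."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)
    yield (n, src, dst)
    yield from hanoi_moves(n - 1, aux, src, dst)

total = sum(1 for _ in hanoi_moves(20))
assert total == 2**20 - 1  # 1,048,575 dependent steps, every one of which must be correct
print(total)
```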

Using first-to-ahead-by-3 voting and red-flag filtering for long or misformatted outputs, MAKER successfully executed all 2^20 − 1 = 1,048,575 steps with zero errors, the first demonstration of its kind.


Figure 3. Exponential convergence to a zero-error solution.

The number of undecided steps falls exponentially across voting rounds, confirming the efficiency of first-to-ahead-by-k voting for local reliability.
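A toy simulation conveys the dynamic (the per-round resolution probability q is an assumption for illustration, not a measured value): if every still-undecided step reaches its k-vote lead in a round with probability q, the undecided count shrinks geometrically:

```python
import random

# Toy model: q is an assumed per-round probability that an undecided
# step reaches its k-vote lead; the real value depends on the model.
undecided, q, rnd = 2**20 - 1, 0.7, 0
while undecided:
    rnd += 1
    undecided = sum(random.random() > q for _ in range(undecided))
    print(f"round {rnd}: {undecided:,} steps still undecided")
```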

These results provide direct empirical evidence for MAKER’s theoretical advantages:

  • Error-free execution across one million dependent steps

  • Exponential convergence toward perfect consensus

  • Practical scaling through extreme decomposition and localized error mitigation

This marks a clear demonstration of multi-agent advantage (akin to quantum advantage), where a coordinated multi-agent system achieves a result that no single-agent model can.

Why this matters and what’s next

MAKER reframes the problem of AI reliability as a systems challenge, not a model-size race. The findings show that massively decomposed agentic processes (MDAPs) can scale reliability to levels unreachable by single LLMs.

Beyond the immediate achievement, this architecture carries broader implications:

  • Reliability through structure: small, local gains in accuracy compound into global perfection.

  • Transparency and control: each microagent’s output is interpretable and limited in scope.

  • Safety and efficiency: smaller models can perform the majority of work, lowering risk and cost.

MAKER focuses on flawless execution of a known plan. The next step is applying the same principles to creative reasoning – decomposing not just actions, but also idea generation, planning, and verification. This could allow systems that both think and act at organizational scale, with the precision and reliability that real-world applications demand.

By smashing large problems into minimal, cooperating agents and recombining their work through principled error correction, AI can scale far beyond the limits of single models. MAKER demonstrates that dependable, large-scale intelligence can be achieved with systems that are smaller, safer, and more controllable.

This architecture opens the door to real-world deployments where reliability, cost, and data sovereignty are critical. Millions of lightweight, locally hosted agents could operate collaboratively across hospitals, factories, or government systems, maintaining precision without exposing sensitive data to massive centralized models. The future of large-scale intelligence may lie not in building ever-bigger models, but in designing smarter, distributed systems that simply do not fail.



Elliot Meyerson

Principal Research Scientist


Elliot is a research scientist whose work focuses on improving creative AI through open-endedness, evolutionary computation, neural networks, and multi-agent systems.


