Date: 2025-11-13

Figure 1. Accuracy collapse in reasoning models as task complexity increases. Frontier reasoning models such as Claude 3.7 Thinking and DeepSeek R1 perform well at low complexity but fail completely beyond eight disks in Towers of Hanoi.

Adapted from “The Illusion of Thinking,” Apple AI Research (2025).

What if the problem isn’t how models think, but how their work is structured?

At our AI Lab, in collaboration with UT Austin, we explored that question in our new paper, “Solving a Million-Step LLM Task with Zero Errors.” The result is MAKER (Maximal Agentic decomposition, K-threshold Error mitigation, and Red-flagging), a system that achieves reliability through extreme decomposition and local error correction. Rather than relying on a single monolithic agent to reason flawlessly across the entire process, MAKER distributes the task across millions of focused microagents, each responsible for one atomic action.

Using this structure, MAKER became the first system to complete a task requiring over one million LLM steps with zero errors, and the analysis shows it can, in principle, scale much further.

The core idea: smash the task, focus the model, and vote

MAKER is built on the idea that intelligence can scale through structure. Instead of building ever-larger models, it organizes many smaller ones into a precise, error-tolerant system. The framework combines three key mechanisms that together make long-horizon reliability possible:

Maximal Agentic Decomposition (MAD): The task is divided into the smallest possible subproblems, often a single decision per agent. Each agent receives only the minimal context needed for its assigned step. This modularity prevents context drift, isolates errors, and enables efficient error correction.

First-to-ahead-by-k voting: Several agents attempt the same step in parallel. The system accepts the first action to lead every other candidate by k votes, forming a rapid local consensus. Small gains in absolute accuracy per step compound exponentially across thousands or millions of steps, turning local agreement into global reliability.

Red-Flagging: Some model outputs show structural signs of confusion, such as excessive length or incorrect formatting. MAKER automatically discards these responses before they influence voting, then resamples. This not only improves accuracy but also reduces correlated errors, ensuring that localized failures do not propagate through the process.
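To make the voting and red-flagging mechanisms concrete, here is a minimal Python sketch of how one step might be decided. It is an illustration under stated assumptions, not MAKER's actual implementation: the function names, the "A->C" move format, the 20-character length limit, and the `sample` callable (a stand-in for one microagent call) are all hypothetical. For simplicity it draws samples one at a time rather than in parallel.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumed output format for a single step: a peg-to-peg move such as "A->C".
MOVE_PATTERN = re.compile(r"[ABC]->[ABC]")


def is_red_flagged(response: str, max_chars: int = 20) -> bool:
    """Hypothetical structural check: flag responses that are too long
    or that do not match the assumed move format."""
    text = response.strip()
    return len(text) > max_chars or MOVE_PATTERN.fullmatch(text) is None


def first_to_ahead_by_k(
    sample: Callable[[], str], k: int, max_samples: int = 1000
) -> Optional[str]:
    """Draw responses until one answer leads every other answer by k votes.

    Returns the winning answer, or None if no answer gets k ahead
    within max_samples draws.
    """
    votes: Counter = Counter()
    for _ in range(max_samples):
        response = sample()
        if is_red_flagged(response):
            continue  # discard before it can influence the vote, then resample
        answer = response.strip()
        votes[answer] += 1
        # Only the answer that just gained a vote can newly reach the margin.
        runner_up = max((n for a, n in votes.items() if a != answer), default=0)
        if votes[answer] - runner_up >= k:
            return answer
    return None
```

In use, `sample` would wrap a single model call for the current step, and the returned action would be applied before moving on to the next step. Capping the number of draws with `max_samples` is a design choice of this sketch, so that a step with no clear consensus fails visibly instead of looping forever.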

Together, these three mechanisms yield formal scaling predictions. The number of required votes per step increases only logarithmically with the total number of steps, so total cost, which is the number of steps times the votes per step, grows only slightly faster than linearly. By contrast, if each agent handles multiple steps, cost grows exponentially. The conclusion is clear: smashing the problem into atomic steps allows reliability to scale where increasing model size alone cannot.
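The logarithmic growth in votes can be illustrated with a toy model. The sketch below is not the paper's analysis; it assumes that every step has exactly one correct answer, chosen with probability p, and one competing wrong answer, that samples are independent, and that p is above 0.5. Under those assumptions, first-to-ahead-by-k voting is a classic gambler's-ruin race, and the correct answer wins a step with probability 1 / (1 + ((1 - p) / p)^k). The function names and the 95% target are hypothetical.

```python
import math


def step_success(p: float, k: int) -> float:
    """Probability that the correct answer gets k votes ahead first,
    in a two-candidate race with per-sample accuracy p."""
    r = (1.0 - p) / p
    return 1.0 / (1.0 + r ** k)


def min_votes_ahead(p: float, steps: int, target: float = 0.95) -> int:
    """Smallest margin k such that all `steps` steps succeed
    with probability at least `target` (toy model, requires p > 0.5)."""
    if not 0.5 < p <= 1.0:
        raise ValueError("this toy model needs per-sample accuracy p in (0.5, 1]")
    r = (1.0 - p) / p
    k = 1
    # log(step_success) == -log1p(r**k); compare in log space so that
    # probabilities very close to 1 do not lose precision.
    while steps * math.log1p(r ** k) > -math.log(target):
        k += 1
    return k
```

With p = 0.99, this toy model asks for a margin of k = 3 at a thousand steps and k = 4 at a million steps: a thousandfold increase in task length costs one extra vote of margin.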