February 9, 2026
Building Confidence in Agent Networks: Agents Testing Agents in neuro-san
An overview of neuro-san's testing framework for agent networks, combining language-aware evaluation, statistical consistency checks, and agents testing agents to support dependable deployment.
As agentic systems mature, confidence becomes just as important as capability. Even well-designed agents can produce different outputs across runs, despite receiving the same inputs. This variability is not a flaw; it is a natural consequence of how large language models operate. You need to know that the same agent network behaves as you intend when run repeatedly, continues to do so as it changes, and can be evaluated in a practical, repeatable way rather than through ad hoc manual review. This raises an important question: How do you know an agent network is behaving consistently enough to trust?
In this post, we introduce neuro-san’s testing framework and explain how it supports consistent, language-aware evaluation of agent networks.
Why testing agent networks requires a different approach
Agent networks are built on large language models, and language models are statistical systems. Given the same prompt and context, an LLM may produce multiple valid responses that differ in phrasing, structure, or reasoning path. This flexibility is often what makes agents useful, but it also means that traditional testing approaches fall short.
Exact string matching is too brittle. Keyword checks miss nuance. Manual evaluation does not scale and is difficult to repeat consistently. These methods work well for deterministic software, but agent networks are not deterministic systems.
If variability is intrinsic, then testing must focus on intent, correctness, and consistency over time rather than exact outputs.
neuro-san addresses this by extending its configuration-driven philosophy to testing. Tests can be written to explicitly check whether the claims an agent makes are grounded in its source material, whether that material comes from documents in a workflow, from tools that access a database or web service, or from facts the underlying LLM itself was trained on. This makes it possible to measure how often an agent produces hallucinated content without relying on manual review.
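As an illustration only, a groundedness check can be pictured as a declarative test case that pairs a prompt with the source material the answer must rely on. The field names and the agent network name below are hypothetical, chosen for this sketch rather than taken from neuro-san's actual test schema:

```python
# Hypothetical, illustrative test case for a groundedness check.
# Field names ("prompt", "grounding", "expectation") and the network name
# are assumptions for this sketch, not neuro-san's actual test schema.
grounding_test = {
    "agent_network": "policy_qa",             # network under test (hypothetical name)
    "prompt": "What is the refund window for annual plans?",
    "grounding": ["docs/refund_policy.md"],   # source material the answer must rely on
    "expectation": (
        "Every claim in the response is supported by the refund policy "
        "document; no invented terms, dates, or amounts."
    ),
}
```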
Tests can also be interactive. Instead of evaluating a single response, a test can engage in a controlled, multi-turn interaction with an agent network. This allows validation of behavior across reasoning steps and follow-up questions, which more closely reflects real usage.
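A multi-turn test can be sketched in the same spirit, again with hypothetical field names: a scripted sequence of user turns, each carrying its own expectation, so the check covers follow-up questions rather than a single response.

```python
# Hypothetical multi-turn test: each scripted user turn carries its own
# natural-language expectation. Field names are illustrative only.
interactive_test = {
    "agent_network": "policy_qa",
    "turns": [
        {
            "user": "Do you offer refunds?",
            "expectation": "Confirms refunds exist and asks or states which plan applies.",
        },
        {
            "user": "I'm on the annual plan.",
            "expectation": "States the refund window for annual plans, consistent with the policy document.",
        },
    ],
}
```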
Agents testing agents and statistical confidence
Even with structured tests, language remains flexible. LLMs can express the same correct idea in many different ways. To handle this, neuro-san introduces what we call the gist operator.
The gist operator allows a test to describe, in natural language, what the output should be like. For example, a test might specify that a response should reflect a generally positive sentiment, correctly summarize a document, or answer a question accurately.
This description is passed to a discriminator agent whose job is to decide whether the agent we are testing satisfied the expectation. In effect, one agent evaluates the output of another – agents testing agents.
This approach allows tests to be tolerant of paraphrasing and stylistic variation while still enforcing the intended meaning of the output. It aligns testing with how humans evaluate language, but in a way that is automated and repeatable.
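Conceptually, a gist check reduces to asking a separate discriminator agent a pass/fail question about another agent's output. A minimal sketch of that judgment step, assuming a generic call_llm helper rather than neuro-san's actual internals, might look like this:

```python
def gist_passes(response: str, expectation: str, call_llm) -> bool:
    """Ask a discriminator model whether `response` satisfies the
    natural-language `expectation`. `call_llm` is an assumed helper that
    sends a prompt to an LLM and returns its text; this is a sketch of
    the idea, not neuro-san's implementation."""
    verdict = call_llm(
        "You are a strict evaluator. Expectation:\n"
        f"{expectation}\n\n"
        f"Response under test:\n{response}\n\n"
        "Answer PASS if the response satisfies the expectation, otherwise FAIL."
    )
    return verdict.strip().upper().startswith("PASS")
```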
Because reaching perfection in agent responses is its own optimization problem, neuro-san lets you define what "good enough for now" means for your development process through the concept of a success_ratio. A test can be run multiple times, with a specified threshold for how many of those runs must pass for the result to be considered acceptable. This provides a practical way to measure consistency and accuracy statistically, rather than assuming that every run must succeed.
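In spirit, a success_ratio turns a single pass/fail check into a repeated trial. A small sketch of that loop, with run_test standing in for whatever executes one test run (the threshold semantics mirror the idea described above, not neuro-san's exact API):

```python
def meets_success_ratio(run_test, runs: int = 10, success_ratio: float = 0.7) -> bool:
    """Run the same test `runs` times and require that at least
    `success_ratio` of them pass. `run_test` is an assumed callable
    returning True or False for one run."""
    passes = sum(1 for _ in range(runs) if run_test())
    return passes / runs >= success_ratio
```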
To help prioritize improvements, neuro-san includes the Assessor application. The Assessor executes tests repeatedly, tracks pass rates, records failing outputs, and groups failure patterns using agent-based classification. This makes it easier to see which failure modes occur most often and where effort is best spent. In practice, this might look like a test passing 7 out of 10 runs. Of the three failures, two may involve the agent introducing claims not grounded in the source information, while one reflects an incomplete or overly generic response. This kind of breakdown helps prioritize which issues to address first.
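The bookkeeping behind that kind of breakdown is simple to picture: count passes, keep the failing outputs, and tally the failure categories a classifier assigns. A rough sketch in the spirit of the Assessor, where classify_failure stands in for the agent-based classification step:

```python
from collections import Counter

def assess(run_test, classify_failure, runs: int = 10):
    """Illustrative aggregation in the spirit of the Assessor: track the
    pass rate, keep failing outputs, and group them by failure category.
    `run_test` returns (passed, output); `classify_failure` labels a
    failing output (e.g. "ungrounded claim", "incomplete answer").
    Both are assumed helpers, not part of neuro-san's public API."""
    failures = []
    for _ in range(runs):
        passed, output = run_test()
        if not passed:
            failures.append(output)
    pass_rate = (runs - len(failures)) / runs
    failure_modes = Counter(classify_failure(output) for output in failures)
    return pass_rate, failure_modes
```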
From experimentation to dependable systems
Taken together, these capabilities allow agent networks to be treated as engineered systems rather than fragile prototypes. The neuro-san testing framework supports this by enabling:
Statistical evaluation of consistency and accuracy across repeated runs, replacing anecdotal success with measurable performance.
Language-aware validation of agent outputs, allowing tests to tolerate paraphrasing and stylistic variation while still enforcing intended meaning.
Detection of hallucinated or ungrounded content, especially in document-based workflows.
Validation of structured outputs or private information in sly_data.
Safe iteration as systems evolve, making it possible to change prompts, modify workflows, or switch to smaller or more cost-effective language models and verify that behavior remains acceptable.
Reduced reliance on manual evaluation, enabling agent networks to scale without proportional increases in human review effort.
By combining language-aware evaluation with statistical measurement, neuro-san provides a practical path for moving agent networks from experimentation toward dependable deployment. The goal is not perfect predictability, but a clear and measurable understanding of how an agent system behaves and how reliably it can be trusted as it evolves.
This shift, from merely hoping an agent works to being able to evaluate how well it works and how consistently, is essential as agent networks become more capable, more complex, and more deeply integrated into real systems.
Daniel Fink is an AI engineering expert with 15+ years in AI and 30+ years in software — spanning CGI, audio, consumer devices, and AI.