August 27, 2025
Best Practices for Building a Multi-Agent Evaluation System for Hackathon Events
In Part 1, we shared how Cognizant set a Guinness World Record and used the Neuro AI Multi-Agent Accelerator to evaluate over 30,000 hackathon submissions in under a day. In this post, we take a closer look at the architecture and multi-agent system that made it possible.
Introduction: The Challenge of Modern Hackathon Assessment
Picture this: thousands of coding submissions flooding in from a hackathon, each containing not just code, but video pitches, project documentation, and complex repository structures. How do you evaluate them fairly, consistently, and at scale? Traditional manual review becomes a bottleneck, prone to bias and fatigue. Enter the Vibe Coding Evaluation System (VibeCodingEval), a distributed architecture that leverages multi-agent AI orchestration (neuro-san) to transform how we assess coding projects.
This isn't just another auto-grader. The system represents a paradigm shift in technical evaluation, combining cutting-edge AI with distributed systems engineering to create an evaluation pipeline that is both intelligent and massively scalable. What makes it revolutionary? It doesn't just tally the positives and negatives of a submission; it evaluates multiple dimensions (innovation, user experience, scalability potential, and business viability) through a multi-agent AI system that processes multi-modal data (text, video, and code) simultaneously.
The Pain Points
Multi-Modal Assessment Gap: Most evaluation systems focus solely on code. This system recognizes that modern submissions include video pitches, documentation, and architectural decisions that all contribute to project quality.
Scale vs. Quality Dilemma: Manual review provides quality but doesn't scale. Automated testing scales but lacks nuance. The evaluation system achieves both through AI-powered multi-agent orchestration.
Consistency Challenge: Different human evaluators bring different biases and standards. The AI agents apply consistent rubrics across thousands of submissions.
Explainability Requirements: Black-box scoring undermines trust. VibeCodingEval provides detailed rationale for every score, making the evaluation process transparent.
If you've ever wondered how to build an evaluation system that can handle enterprise-scale events while maintaining nuanced, human-like assessment capabilities, this deep dive into VibeCodingEval's architecture will show you exactly how it's done.
Deep Dive: Architecture and Technical Excellence
System Architecture Overview
The evaluation system employs a multi-tier architecture that cleanly separates concerns: input processing, persistent storage, distributed task processing, and multi-agent AI evaluation.
Core Components Deep Dive
1. Input Processing Pipeline (process_inputs.py)
The input processing pipeline handles multi-modal data extraction:
Text Extraction: Supports multiple document formats (PDF, PPTX, TXT, XLSX, CSV, etc.) through a unified TextExtractor interface. The system intelligently handles encoding issues and format-specific quirks.
Video Transcription: Leverages ffmpeg and OpenAI's Whisper model for accurate speech-to-text conversion. Transcripts are sanitized to remove non-ASCII characters and normalized for consistent processing (a minimal sketch of this step follows the metadata example below).
Repository Analysis: The RepoMetadataExtractor performs sophisticated code analysis:
Builds directory trees with file counts
Identifies programming languages used
Locates key configuration files (Dockerfile, requirements.txt, etc.)
Calculates lines of code metrics
Detects CI/CD configurations
Handles both local and S3-hosted repositories
# Example of the sophisticated metadata extraction
metadata = {
    "total_files": 245,
    "total_loc": 15420,
    "languages": {
        "python": 8500,
        "javascript": 4200,
        "html": 1500,
        "css": 1220
    },
    "test_dirs": 3,
    "ci_cd_present": True,
    "dir_tree_with_file_count": {...}
}
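The transcription step itself is not shown above, so here is a minimal sketch of how it could look, assuming the open-source openai-whisper package (which relies on ffmpeg to decode the audio track); the function name and sanitization rule are illustrative rather than the project's actual code.
# Illustrative transcription sketch (not the project's actual code), assuming openai-whisper.
import whisper

def transcribe_pitch(video_path: str, model_size: str = "base") -> str:
    model = whisper.load_model(model_size)       # downloads the model weights on first use
    result = model.transcribe(video_path)        # ffmpeg decodes the audio track under the hood
    text = result["text"]
    # Strip non-ASCII characters and collapse whitespace for consistent downstream processing
    text = text.encode("ascii", errors="ignore").decode()
    return " ".join(text.split())

# transcript = transcribe_pitch("sub_001_pitch.mp4")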
2. Database Architecture (eval_database.py)
VibeCodingEval uses the SQLAlchemy ORM with a well-designed schema that supports both SQLite (development, non-concurrent) and PostgreSQL (production, concurrent) databases:
Submission Model: Stores the extracted content from all three input types and captures processing time
project_details: Extracted text from documents
video_transcript: Transcribed audio from video
repo_metadata: Analyzed repository structure
processing_time_in_seconds: Performance tracking
Evaluation Model: Captures multi-dimensional scoring
Seven score dimensions (innovation, UX, scalability, etc.)
Performance metrics: Token accounting, time and cost for all agent-powered evaluations
Detailed rationale for each score
Support for multiple evaluation types per submission
The database layer includes sophisticated query methods for finding incomplete evaluations, batch processing, and performance analytics.
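The actual schema lives in eval_database.py; as a rough picture of what the two models described above could look like with SQLAlchemy's declarative API, here is a minimal sketch. The column names are inferred from the description, so treat it as illustrative rather than the real schema.
# Illustrative sketch of the two models described above (not the project's actual schema).
from sqlalchemy import Column, Float, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Submission(Base):
    __tablename__ = "submissions"
    sub_id = Column(String, primary_key=True)
    project_details = Column(Text)                  # extracted document text
    video_transcript = Column(Text)                 # Whisper transcript
    repo_metadata = Column(Text)                    # serialized repository analysis
    processing_time_in_seconds = Column(Float)
    evaluations = relationship("Evaluation", back_populates="submission")

class Evaluation(Base):
    __tablename__ = "evaluations"
    id = Column(Integer, primary_key=True, autoincrement=True)
    sub_id = Column(String, ForeignKey("submissions.sub_id"))
    innovation_score = Column(Float)
    scalability_score = Column(Float)               # ...plus the other five dimensions
    brief_description = Column(Text)                # rationale behind the scores
    total_tokens = Column(Integer)                  # token accounting
    evaluation_cost = Column(Float)
    submission = relationship("Submission", back_populates="evaluations")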
3. Distributed Task Processing
The system leverages Celery with Redis (Amazon ElastiCache) as the message broker for massive parallelization:
Task Queues:
inputs_queue: Handles submission processing
eval_queue: Manages AI evaluation tasks
Retry Mechanisms:
@celery_app.task(
    bind=True,
    autoretry_for=(Exception,),
    max_retries=3,
    retry_backoff=True
)
def evaluate_submission_task(self, sub_id):
    # Illustrative task body: any exception triggers an automatic retry
    # with exponential backoff, up to 3 attempts.
    ...
Concurrency Control: Semaphores limit concurrent AI requests to prevent overwhelming the evaluation service while maximizing throughput.
4. Multi-Agent AI Evaluation System
The crown jewel is the multi-agent evaluation system powered by the Neuro AI Multi-Agent Accelerator (neuro-san):
Orchestration Pattern: A parent agent (create_eval) coordinates specialized sub-agents, each focusing on a specific evaluation dimension.
Specialized Agents:
Innovation Agent: Evaluates novelty and uniqueness
UX Agent: Assesses user experience and interface design
Scalability Agent: Analyzes architectural decisions for growth
Market Potential Agent: Estimates business opportunities
Implementation Agent: Evaluates practical deployment readiness
Financial Agent: Assesses cost-benefit ratios
Complexity Agent: Measures technical sophistication
Each agent operates autonomously but contributes to a unified evaluation result, providing both numerical scores and textual rationale. The evaluation system is extensible: more agents can be added to evaluate additional dimensions without breaking the pipeline.
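In neuro-san the agent network itself is defined declaratively in registry files, so the snippet below is only a plain-Python sketch of the coordination pattern (a parent fanning out to one agent per rubric dimension and merging the results); the client classes and method names are hypothetical.
# Coordination pattern only -- hypothetical client/method names, not the neuro-san API.
import asyncio

DIMENSION_AGENTS = [
    "innovation", "ux", "scalability", "market_potential",
    "ease_of_implementation", "financial_feasibility", "complexity",
]

async def create_eval(user_input: str, clients: dict) -> dict:
    """Parent orchestrator: one specialized agent per dimension, results merged at the end."""
    async def ask(dimension: str):
        client = clients[dimension]                    # one client per specialized sub-agent
        reply = await client.evaluate(user_input)      # expected: {"score": float, "rationale": str}
        return dimension, reply

    results = await asyncio.gather(*(ask(d) for d in DIMENSION_AGENTS))
    evaluation = {f"{dim}_score": reply["score"] for dim, reply in results}
    evaluation["brief_description"] = " ".join(reply["rationale"] for _, reply in results)
    return evaluation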
Technical Highlights
Neuro-san powered multi-agent setup
Utilizing sly_data from neuro-san for deterministic outcomes
Sly_data is a JSON-schema dictionary describing the specific information an agent needs as input arguments over the private sly_data channel when it is called. The sly_data itself is generally considered private information that does not belong in the chat stream, for example credential information. With sly_data enabled, we could use it alongside the agents to perform all mathematical operations, such as adding and storing scores and calculating averages, in deterministic code.
This approach drastically reduces the fallout rate: the agents are LLM-powered, and it is easy for a free-text response to get the arithmetic or formatting wrong.
"allow": {
"to_upstream": {
"sly_data": {
"evaluation": true,
}
}
},
# This is used further down the chain in python coded-tool to manage the evaluation
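To make that concrete, here is a minimal sketch of what such a downstream coded tool could do: accumulate each agent's sub-dimension scores in the private sly_data dictionary and compute the average in plain Python instead of asking the LLM to do arithmetic. The function and key names are assumptions for illustration, not the project's actual tool.
# Illustrative coded tool (hypothetical names): keep the arithmetic out of the LLM.
def record_scores(args: dict, sly_data: dict) -> dict:
    """Store one agent's 10 sub-dimension scores in sly_data and return their average."""
    scores = args["score"]                              # the 10 sub-dimension scores (1-100)
    dimension = args["dimension"]                       # e.g. "innovation"
    average = round(sum(scores) / len(scores), 1)       # deterministic aggregation

    evaluation = sly_data.setdefault("evaluation", {})  # travels on the private sly_data channel
    evaluation[f"{dimension}_score"] = average
    return {"average": average}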
Scoring mechanism
Agents use a JSON schema throughout: to talk to each other internally and to report their scores and the reasoning behind them. Example prompt excerpt:
Draft the following json dict:
{{
    "score": <[a list of above 10 scores, all between 1 and 100]>,
    "brief_description": "<Provide a brief description of your scoring rationale.>"
}}
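Because every agent is instructed to answer in this shape, the reply can be validated before any score is trusted. A minimal sketch with pydantic (the model and function names are ours, not the project's):
# Illustrative validation of an agent's JSON reply (hypothetical model/function names).
import json
from pydantic import BaseModel

class AgentScores(BaseModel):
    score: list[int]
    brief_description: str

def parse_agent_reply(raw_reply: str) -> AgentScores:
    parsed = AgentScores(**json.loads(raw_reply))
    if len(parsed.score) != 10 or not all(1 <= s <= 100 for s in parsed.score):
        raise ValueError("expected 10 scores, each between 1 and 100")
    return parsed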
Async/Await Architecture
from typing import Tuple

async def process_submission(self, sub: Tuple[str, str, str, str]):
    # sub carries (sub_id, project_details, video_transcript, repo_metadata) -- layout assumed here
    sub_id, *parts = sub
    user_input = "\n".join(parts)
    async with self.semaphore:                          # cap concurrent AI requests
        results = await self.evaluate_granular_scores(
            sub_id=sub_id,
            user_input=user_input,
            clients=self.clients                        # one client per evaluation agent
        )
        return results
Smart Resource Management
Connection pooling for database operations
Automatic cleanup of temporary files
Memory-efficient streaming for large files
Graceful handling of timeouts and failures
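For illustration, two of those patterns (automatic temp-file cleanup and timeouts around slow steps) can be sketched as follows; the function name and ffmpeg invocation are ours, not the project's code.
# Illustrative resource-management sketch (hypothetical function name).
import asyncio
import subprocess
import tempfile
from pathlib import Path

async def extract_audio_with_limits(video_path: str, timeout_s: float = 600.0) -> bytes:
    """Extract the audio track into a temp dir that is always cleaned up, bounded by a timeout."""
    with tempfile.TemporaryDirectory() as workdir:       # removed automatically, even on failure
        audio_path = Path(workdir) / "audio.wav"
        cmd = ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", str(audio_path)]
        try:
            await asyncio.wait_for(
                asyncio.to_thread(subprocess.run, cmd, check=True, capture_output=True),
                timeout=timeout_s,
            )
            return audio_path.read_bytes()
        except asyncio.TimeoutError:
            # Degrade gracefully; the Celery retry policy decides whether to try again.
            # (A production version would also terminate the orphaned ffmpeg process.)
            return b""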
Seamless Storage Handling
The system seamlessly handles both local and cloud storage:
from smart_open import open as smart_open

if repo_path.startswith("s3://"):
    with smart_open(repo_path, "rb") as s3_file:
        # Process the repository archive directly from S3
        ...
Batch Processing Strategies
For massive scale (10,000+ submissions):
# Implement batching with range processing
python process_inputs.py --range 1-1000
python process_inputs.py --range 1001-2000
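If you would rather not type the ranges by hand, a small driver script can dispatch them; this is just an illustrative wrapper around the same CLI.
# Illustrative batch driver: shells out to process_inputs.py once per range.
import subprocess

def enqueue_in_batches(total: int, batch_size: int = 1000) -> None:
    for start in range(1, total + 1, batch_size):
        end = min(start + batch_size - 1, total)
        subprocess.run(
            ["python", "process_inputs.py", "--range", f"{start}-{end}"],
            check=True,
        )

# enqueue_in_batches(total=30000)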
Custom Evaluation Agents
As mentioned earlier, creating specialized agents for domain-specific evaluation is a breeze with neuro-san and the database schema we created. For example, you could add automated plagiarism detection or a responsible-AI agent that reports on metrics such as power usage and bias.
# An example custom class that can be used by process_eval class
class SecurityAuditAgent(SimpleClient):
    def __init__(self):
        super().__init__(agent_name="security_auditor")

    def evaluate_security(self, repo_metadata):
        # Custom security analysis logic
        vulnerabilities = self.scan_dependencies(repo_metadata)
        security_score = self.calculate_score(vulnerabilities)
        return {
            "security_score": security_score,
            "vulnerabilities_found": vulnerabilities
        }
Walkthrough: Processing a Real Hackathon Submission
Let's trace a submission through its lifecycle:
Stage 1: Input Processing
When a submission arrives, say with id sub_001, it comes with a text/PPT/PDF proposal, an MP4 pitch video, and a GitHub repository:
Document Processing: The text is parsed, extracting the project description (average ~1000 words)
Video Transcription: Whisper transcribes the ~3-minute pitch, producing ~400 words
Repository Analysis:
125 Python files detected
8,500 lines of code counted
Key config files found
Test coverage identified
Stage 2: Database Storage
The processed data is stored:
{
    "sub_id": "sub_001",
    "project_details": "An AI-powered code review system...",
    "video_transcript": "Hello, today I'm presenting CodeMentor...",
    "repo_metadata": "languages: python: 8500 LOC, javascript: 2000 LOC...",
    "processing_time": 45.3
}
Stage 3: Multi-Agent Evaluation
The orchestrator agent receives the submission and delegates to specialists:
Innovativeness Agent analyzes:
Searches for novel patterns in the approach
Compares against known solutions
Internally assesses 10 different sub-dimensions (e.g., novelty, uniqueness, originality, mention of patented works, papers, novel algorithms) to arrive at 10 scores out of 100 each: [78, 85, 87, 76, 91, 45, 43, 66, 34, 59]
Returns: a score of 66.4/100 (the average of the 10 sub-scores) with a rationale explaining why the submission scored that way in the innovativeness dimension
Scalability Agent similarly examines 10 different sub-dimensions:
Microservices architecture detected
Kubernetes configurations present
Configurability, reusability, cloud deployment flexibility etc.
Internally assesses 10 different sub-dimensions (e.g., modularity, horizontal and vertical scalability, configuration, standards, cloud practices) to arrive at 10 scores out of 100 each, for example: [84, 75, 71, 86, 81, 65, 63, 61, 64, 89]
Returns: a score of 73.9/100 with a detailed analysis
This continues for all seven agents.
Each agent thus processes independently, then the results are aggregated:
{
    "innovation_score": 66.4,
    "ux_score": 74.5,
    "scalability_score": 73.9,
    "market_potential_score": 81.5,
    "ease_of_implementation_score": 71.0,
    "financial_feasibility_score": 68.0,
    "complexity_score": 58.5,
    "brief_description": "Innovative AI-powered code review system with strong scalability..."
}
These individual dimension scores can further be used to produce a stack ranking of all the submissions, as the sketch below illustrates.
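For example, a simple weighted ranking can be computed straight from the stored scores; the weights below are arbitrary placeholders, not the ones used in the event.
# Illustrative stack ranking over the aggregated dimension scores (example weights only).
import pandas as pd

WEIGHTS = {
    "innovation_score": 0.25,
    "ux_score": 0.15,
    "scalability_score": 0.15,
    "market_potential_score": 0.20,
    "ease_of_implementation_score": 0.10,
    "financial_feasibility_score": 0.10,
    "complexity_score": 0.05,
}

def stack_rank(evals: pd.DataFrame) -> pd.DataFrame:
    """Add a weighted overall score and return the submissions sorted best-first."""
    evals = evals.copy()
    evals["overall_score"] = sum(evals[col] * w for col, w in WEIGHTS.items())
    return evals.sort_values("overall_score", ascending=False).reset_index(drop=True)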
Stage 4: Results Dashboard
A Streamlit dashboard visualizes:
Score distributions across all submissions
Radar charts comparing multiple projects
Processing performance metrics
Detailed submission explorer
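A stripped-down version of such a dashboard page might look like the sketch below; it assumes the evaluations have been exported to a CSV (as in the FAQ later in this post) and uses Plotly for the histogram.
# Illustrative Streamlit page: score distribution plus a simple submission explorer.
import pandas as pd
import plotly.express as px
import streamlit as st

df = pd.read_csv("evaluations_export.csv")            # e.g. the export shown in the FAQ

st.title("VibeCodingEval Results")
st.metric("Submissions evaluated", len(df))

dimension = st.selectbox("Score dimension", [c for c in df.columns if c.endswith("_score")])
st.plotly_chart(px.histogram(df, x=dimension, nbins=40))

sub_id = st.selectbox("Inspect a submission", df["sub_id"])
st.json(df[df["sub_id"] == sub_id].iloc[0].to_dict())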
Performance metrics:

| Metric                  | Value         |
|-------------------------|---------------|
| Submissions Count       | 30,601        |
| Total Cost              | $6,340        |
| Total Tokens            | 1,571,228,651 |
| Total Calls to LLM      | 466,041       |
| Avg Evaluation Time (s) | 180           |
Comparative analysis in terms of human effort:
Number of evaluations: 30k+ (assume 30,000 for the calculation)
Assumed human time per evaluation: 30 minutes = 0.5 hours
Total hours = 30,000 × 0.5 = 15,000 hours
Full-Time Equivalent (FTE) effort, assuming 7 hours of daily effort and 220 workdays in a year (1,540 hours per FTE):
Required FTEs = 15,000 / 1,540 ≈ 9.7, i.e., roughly 10 FTEs working for a full year
Challenges and Limitations
Known Trade-offs
Video Processing Constraints: Large video files (>500MB) may timeout during transcription
Repository Size Limits: Repositories over 100MB or 20,000 files are truncated
Language Support: Currently optimized for English-language submissions
Conclusion: The Future of Code Evaluation
The multi-agent evaluation system represents more than just a tool – it's a glimpse into the future of technical assessment. By combining multi-modal analysis, distributed processing, and intelligent AI orchestration, it demonstrates that we can evaluate code not just for correctness, but for creativity, potential, and real-world viability.
The system's architecture offers valuable lessons for anyone building evaluation systems:
Separate concerns between processing, storage, and evaluation
Embrace asynchronous patterns for scalability
Design for explainability from the start
Build in flexibility for different evaluation criteria
Best practices and strategies that make the system work:
Keep one agent per rubric dimension for explainability.
Provide each agent with clear prompts and role definitions.
Explainability is a must-have feature.
Aggregate outputs via an orchestrator (ProcessEvaluation) that enforces consistency.
Persist results in a database (SQLite for dev, Postgres for production) for better observability.
Capture not just scores but also token usage + cost and processing time per submission.
Composable by design: Massive parallelization using Celery workers with Redis as a message broker.
As coding competitions evolve and technical hiring becomes more sophisticated, systems like VibeCodingEval will become essential infrastructure. The ability to process thousands of complex submissions while maintaining nuanced, fair evaluation isn't just nice to have – it's becoming a competitive necessity.
Next Steps for Readers
Try It Out:
Clone the repository (GitHub: cognizant-ai-lab/neuro-san-studio, a playground for neuro-san) and build your own agents
Customize: Adapt the agents for your custom use case
Contribute: The project welcomes contributions
Star the Repository: Show your support and stay updated with new features
The future of code evaluation is here, and it's intelligent, scalable, and surprisingly sophisticated. VibeCodingEval proves that we can build systems that evaluate not just code, but the complete vision behind it.
Glossary
Multi-Agent System: An AI architecture where multiple specialized agents collaborate to solve complex problems
Celery: Python distributed task queue for handling background job processing
SQLAlchemy: Python SQL toolkit and Object-Relational Mapping (ORM) library
Whisper: OpenAI's automatic speech recognition system
Redis: In-memory data structure store used as message broker
Flower: Web-based tool for monitoring and administering Celery clusters
Streamlit: Python framework for creating data applications and dashboards
Token Accounting: Tracking API usage and costs in AI systems
Semaphore: Concurrency primitive that limits the number of simultaneous operations
ORM (Object-Relational Mapping): Technique for converting data between incompatible type systems
Frequently Asked Questions (FAQ)
General Questions
Q: What makes VibeCodingEval different from traditional code evaluation platforms like HackerRank or LeetCode?
A: Unlike traditional platforms that focus on algorithmic correctness and test cases, this system evaluates complete project submissions across multiple dimensions including innovation, UX, scalability, and business viability. It processes not just code metadata, but also video presentations and documentation, providing a holistic assessment of a project's potential with explainability.
Q: How long does it take to evaluate a single submission?
A: On average, a complete evaluation takes approximately 180 seconds (3 minutes). This includes processing all three input types (text, video, code) and running all seven evaluation agents. The system can be massively parallelized, significantly reducing overall evaluation time for large batches. In the particular use case discussed above, the effective wall-clock time per submission worked out to just 1.8 seconds thanks to that parallelization.
Q: What's the cost of running evaluations at scale?
A: Based on our performance metrics, evaluating 30,000+ submissions costs approximately $6,340 in AI token usage. This translates to roughly $0.21 per submission, significantly more cost-effective than human evaluation, which would require roughly 10 full-time employees working for a year on the same volume.
Q: Can the system handle non-English submissions?
A: Currently, the system is optimized for English-language submissions. While the analysis works regardless of language, the text and video transcription components perform best with English content. Multi-language support is only as good as the underlying LLM's.
Technical Questions
Q: What are the maximum file size limits for submissions?
A: The system has the following limits:
Video files: Maximum 500MB (larger files may timeout during transcription)
Repository ZIP files: Maximum 100MB or 20,000 files
Document files: No strict limit, but very large PDFs (>50MB) may process slowly
Total submission package: Recommended under 1GB for optimal performance
Q: Can I use SQLite in production instead of PostgreSQL?
A: While technically possible, we strongly recommend PostgreSQL for production deployments. SQLite lacks the concurrent write capabilities needed for high-volume processing and doesn't scale well beyond a few thousand submissions. PostgreSQL provides better performance, reliability, and support for concurrent operations.
Q: How does the system handle repository dependencies and security scanning?
A: The current version extracts metadata and structure from repositories but doesn't execute code or install dependencies for security reasons. Security scanning can be added through custom agents (see the SecurityAuditAgent example in the blog), but it's not included in the default configuration.
Q: What happens if the Neuro SAN service is temporarily unavailable?
A: The Neuro SAN service runs locally alongside the evaluation system, which reduces the chance of failure. The system also implements retry logic with exponential backoff: failed evaluations are automatically retried up to 3 times with increasing delays. If the service remains unavailable, submissions are queued and can be reprocessed once the service is restored. The Celery task queue ensures no submissions are lost.
Operational Questions
Q: How do I monitor the evaluation pipeline in real-time?
A: The system provides multiple monitoring options:
Flower Dashboard (http://localhost:5555): Real-time Celery task monitoring
Streamlit Dashboard (http://localhost:8501): Analytics and submission explorer
Database queries: Direct SQL queries for custom metrics
Log files: Detailed logs in the logs/processing/ directory
Q: Can I re-evaluate submissions with updated rubrics or agents?
A: Yes! Use the --override flag when running evaluations:
python deploy/enqueue_eval_tasks.py --override --db-url postgresql://...
This will re-process all submissions with the current agent configurations. You can also target specific submissions using the --filter-source parameter.
Q: How do I scale the system for a 100,000+ submission event?
A: For massive scale:
Deploy multiple Celery workers across different machines
Use Redis or Amazon ElastiCache for the message broker
Implement database read replicas for the analytics dashboard
Consider using S3 for submission storage instead of local files
Increase the semaphore concurrency limit for AI requests
Deploy evaluation agents on GPU-enabled instances for faster processing
Q: What's the best way to handle partial failures in batch processing?
A: The system is designed to be resilient:
Failed tasks are automatically retried
Successfully processed submissions are marked in the database
Use --range parameter to process specific batches
The system automatically skips already-processed submissions unless --override is specified
Monitor incomplete evaluations using: python eval_database.py --inc
Customization Questions
Q: How do I add a new evaluation dimension (e.g., security or performance)?
A: Adding new evaluation dimensions involves:
Create a new agent using the Neuro-San data-driven agent setup
Add the new score field to the database schema
Update the SCORE_FIELDS list in process_eval.py
Modify the orchestration logic to include the new agent
Update the dashboard to visualize the new dimension
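Concretely, steps 2 and 3 amount to something like the following; the file names come from the blog, but the exact field list and column definition are illustrative.
# Illustrative changes for a new "security" dimension (hypothetical details).
# In eval_database.py, add a column to the Evaluation model:
#     security_score = Column(Float, nullable=True)
# In process_eval.py, extend the score field list so the orchestrator collects it:
SCORE_FIELDS = [
    "innovation_score", "ux_score", "scalability_score",
    "market_potential_score", "ease_of_implementation_score",
    "financial_feasibility_score", "complexity_score",
    "security_score",   # produced by the new security agent
]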
Q: Can I use a different multi-agent platform instead of Neuro SAN?
A: Yes, the system is designed to be modular. You can replace the SimpleClient class with your own implementation that interfaces with OpenAI, Anthropic, CrewAI or any other multi-agent service. The key is maintaining the same response format (scores + rationale).
Q: How do I customize the scoring rubric for my specific use case?
A: Rubrics are defined in the agent prompts within the agent definitions. To customize:
Access your agent configurations
Modify the prompt to include your specific criteria
Adjust the scoring scale if needed (default is 1-100)
Update the aggregation logic if you want weighted scores
Q: Can I integrate this with my existing CI/CD pipeline?
A: Absolutely! The system provides CLI interfaces that can be integrated into any pipeline:
# Example GitHub Actions integration
- name: Process Submissions
  run: python process_inputs.py --input-source ${{ github.event.inputs.csv_path }}
- name: Run Evaluations
  run: python deploy/enqueue_eval_tasks.py --filter-source new_submissions.csv
Performance and Optimization Questions
Q: How many concurrent evaluations can the system handle?
A: This depends on your infrastructure:
Default configuration: 8 concurrent AI requests (via semaphore)
Celery workers: Limited by CPU cores and memory
Database: PostgreSQL can handle hundreds of concurrent connections
Practical limit: 1000s of concurrent evaluations with proper resource allocation
Q: What's the most effective way to reduce evaluation costs?
A: Several strategies can reduce costs:
Implement caching for handling observability data
Truncate video transcripts to essential portions
Use smaller AI models for initial screening
Implement smart sampling for very large repositories
Q: How do I debug a stuck evaluation?
A: Follow these steps:
Check Flower dashboard for task status
Review logs in logs/processing/process_eval_*.log
Query the database for incomplete evaluations: python eval_database.py --inc
Check Redis queue length: redis-cli LLEN eval_queue
Examine agent thinking files in logs/ directory
Restart stuck workers if necessary
Data and Privacy Questions
Q: How is sensitive data handled in submissions?
A: The system implements several privacy measures:
Submissions are processed in isolated environments
Temporary files are automatically cleaned up
Database connections use SSL in production
S3 storage uses encryption at rest
No code is executed, only metadata is analyzed statically
Q: Can I export evaluation results for further analysis?
A: Yes, multiple export options are available:
# Export to pandas DataFrame
df = db.get_all_evaluations_as_df()
df.to_csv('evaluations_export.csv')
# Direct SQL export
python eval_database.py --query "SELECT * FROM evaluations" > results.json
Q: How long is evaluation data retained?
A: By default, all evaluation data is retained indefinitely. You can implement data retention policies by:
Adding timestamp-based cleanup jobs
Archiving old evaluations to cold storage
Implementing GDPR-compliant deletion endpoints
Deepak is a data scientist who specializes in machine learning, applied statistics, and data-driven applications such as multi-agent systems.