

August 27, 2025

Best Practices for Building a Multi-Agent Evaluation System for Hackathon Events

In Part 1, we shared how Cognizant set a Guinness World Record and used the Neuro AI Multi-Agent Accelerator to evaluate over 30,000 hackathon submissions in under a day. In this post, we take a closer look at the architecture and multi-agent system that made it possible.


Introduction: The Challenge of Modern Hackathon Assessment

Picture this: thousands of coding submissions flooding in from a hackathon, each containing not just code, but video pitches, project documentation, and complex repository structures. How do you evaluate them fairly, consistently, and at scale? Traditional manual review becomes a bottleneck, prone to bias and fatigue. Enter the Vibe Coding Evaluation System (VibeCodingEval): a distributed architecture that leverages multi-agent AI orchestration (neuro-san) to transform how we assess coding projects.

This isn't just another auto-grader. The multi-agent system represents a paradigm shift in technical evaluation, combining cutting-edge AI with distributed systems engineering to create an evaluation pipeline that is both intelligent and massively scalable. What makes it revolutionary? It doesn't just tally a submission's positives and negatives: it evaluates different dimensions (innovation, user experience, scalability potential, and business viability) through a sophisticated multi-agent AI system that processes multi-modal data (text, video, and code) simultaneously.

The Pain Points

  1. Multi-Modal Assessment Gap: Most evaluation systems focus solely on code. This system recognizes that modern submissions include video pitches, documentation, and architectural decisions that all contribute to project quality.

  2. Scale vs. Quality Dilemma: Manual review provides quality but doesn't scale. Automated testing scales but lacks nuance. The evaluation system achieves both through AI-powered multi-agent orchestration.

  3. Consistency Challenge: Different human evaluators bring different biases and standards. The AI agents apply consistent rubrics across thousands of submissions.

  4. Explainability Requirements: Black-box scoring undermines trust. VibeCodingEval provides detailed rationale for every score, making the evaluation process transparent.

If you've ever wondered how to build an evaluation system that can handle enterprise-scale events while maintaining nuanced, human-like assessment capabilities, this deep dive into VibeCodingEval's architecture will show you exactly how it's done.

Deep Dive: Architecture and Technical Excellence

System Architecture Overview

The evaluation system employs a multi-tier architecture that cleanly separates input processing, database storage, distributed task execution, and AI evaluation.

Core Components Deep Dive

1. Input Processing Pipeline (process_inputs.py)

The input processing pipeline handles multi-modal data extraction:

Text Extraction: Supports multiple document formats (PDF, PPTX, TXT, XLSX, CSV, etc.) through a unified TextExtractor interface. The system intelligently handles encoding issues and format-specific quirks.

Video Transcription: Leverages ffmpeg and OpenAI's Whisper model for accurate speech-to-text conversion. The transcripts are sanitized to remove non-ASCII characters and normalized for consistent processing.
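
For illustration, the transcription step can be sketched roughly as follows using ffmpeg and the open-source whisper package; the file path, model size, and function name here are placeholders rather than the system's actual code:

import re
import subprocess

import whisper  # pip install openai-whisper

def transcribe_pitch(video_path: str) -> str:
    """Extract the audio track with ffmpeg, then transcribe it with Whisper."""
    audio_path = "pitch_audio.wav"  # placeholder; a real pipeline would use a temp directory
    # Extract 16 kHz mono audio, which is well suited for speech recognition
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ar", "16000", "-ac", "1", audio_path],
        check=True,
    )
    result = whisper.load_model("base").transcribe(audio_path)
    # Sanitize: drop non-ASCII characters and normalize whitespace
    text = re.sub(r"[^\x00-\x7F]+", " ", result["text"])
    return " ".join(text.split())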

Repository Analysis: The RepoMetadataExtractor performs sophisticated code analysis:

  • Builds directory trees with file counts

  • Identifies programming languages used

  • Locates key configuration files (Dockerfile, requirements.txt, etc.)

  • Calculates lines of code metrics

  • Detects CI/CD configurations

  • Handles both local and S3-hosted repositories

 

# Example of the sophisticated metadata extraction
metadata = {
    "total_files": 245,
    "total_loc": 15420,
    "languages": {
        "python": 8500,
        "javascript": 4200,
        "html": 1500,
        "css": 1220
    },
    "test_dirs": 3,
    "ci_cd_present": True,
    "dir_tree_with_file_count": {...}
}
2. Database Architecture (eval_database.py) 

VibeCodingEval uses the SQLAlchemy ORM with a well-designed schema that supports both SQLite (development, non-concurrent) and PostgreSQL (production, concurrent) databases:

Submission Model: Stores the extracted content from all three input types and captures processing time

  • project_details: Extracted text from documents

  • video_transcript: Transcribed audio from video

  • repo_metadata: Analyzed repository structure

  • processing_time_in_seconds: Performance tracking

Evaluation Model: Captures multi-dimensional scoring

  • Seven score dimensions (innovation, UX, scalability, etc.)

  • Performance metrics: Token accounting, time and cost for all agent-powered evaluations

  • Detailed rationale for each score

  • Support for multiple evaluation types per submission

The database layer includes sophisticated query methods for finding incomplete evaluations, batch processing, and performance analytics.
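
A stripped-down sketch conveys the idea of the two models; the field names follow this post, but the real schema in eval_database.py is richer (all seven score columns plus token, time, and cost fields):

from sqlalchemy import Column, Float, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Submission(Base):
    __tablename__ = "submissions"
    sub_id = Column(String, primary_key=True)
    project_details = Column(Text)              # extracted document text
    video_transcript = Column(Text)             # Whisper transcript
    repo_metadata = Column(Text)                # serialized repository analysis
    processing_time_in_seconds = Column(Float)  # performance tracking

class Evaluation(Base):
    __tablename__ = "evaluations"
    id = Column(Integer, primary_key=True, autoincrement=True)
    sub_id = Column(String)                     # references the submission being scored
    innovation_score = Column(Float)            # ...plus the other six dimension columns
    brief_description = Column(Text)            # rationale behind the scores
    total_tokens = Column(Integer)              # token accounting per evaluation

# SQLite for development; swap the URL for PostgreSQL in production
engine = create_engine("sqlite:///vibe_eval_dev.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)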

3. Distributed Task Processing

The system leverages Celery with Redis (Amazon ElastiCache) as the message broker for massive parallelization:

Task Queues (see the routing sketch below):

  • inputs_queue: Handles submission processing

  • eval_queue: Manages AI evaluation tasks
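
Concretely, the two queues can be wired up roughly like this; the broker URL and task names are illustrative, not the project's actual ones:

from celery import Celery

celery_app = Celery("vibe_eval", broker="redis://localhost:6379/0")

# Route submission processing and AI evaluation to their dedicated queues
celery_app.conf.task_routes = {
    "tasks.process_submission": {"queue": "inputs_queue"},
    "tasks.run_evaluation": {"queue": "eval_queue"},
}

# Workers are then pinned to a queue, e.g.:
#   celery -A tasks worker -Q inputs_queue --concurrency=8
#   celery -A tasks worker -Q eval_queue --concurrency=8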

Retry Mechanisms:

@celery_app.task(
    bind=True,                    # the task receives `self`, enabling manual retries
    autoretry_for=(Exception,),   # retry automatically on any exception
    max_retries=3,                # give up after three attempts
    retry_backoff=True            # exponential backoff between attempts
)
def run_evaluation(self, sub_id):
    ...  # illustrative task name; the per-submission body is elided here

Concurrency Control: Semaphores limit concurrent AI requests to prevent overwhelming the evaluation service while maximizing throughput. 
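
A minimal sketch of that pattern, assuming the default limit of 8 concurrent requests mentioned in the FAQ (class and parameter names are illustrative):

import asyncio

class EvaluationRunner:
    def __init__(self, max_concurrent: int = 8):
        # At most `max_concurrent` submissions talk to the evaluation service at once
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def evaluate(self, sub_id: str, evaluate_fn) -> dict:
        # evaluate_fn is whatever async call performs the actual AI evaluation
        async with self.semaphore:
            return await evaluate_fn(sub_id)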

4. Multi-Agent AI Evaluation System

The crown jewel is the multi-agent evaluation system powered by the Neuro AI Multi-Agent Accelerator (neuro-san):

Orchestration Pattern: A parent agent (create_eval) coordinates specialized sub-agents, each focused on a specific evaluation dimension.

Specialized Agents:

  • Innovation Agent: Evaluates novelty and uniqueness

  • UX Agent: Assesses user experience and interface design

  • Scalability Agent: Analyzes architectural decisions for growth

  • Market Potential Agent: Estimates business opportunities

  • Implementation Agent: Evaluates practical deployment readiness

  • Financial Agent: Assesses cost-benefit ratios

  • Complexity Agent: Measures technical sophistication

Each agent operates autonomously but contributes to a unified evaluation result, providing both numerical scores and a textual rationale. The evaluation system is extensible: more agents can be added to evaluate additional dimensions without breaking the pipeline.
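
Conceptually, the orchestration pattern boils down to the sketch below. This is plain Python for illustration only; in the actual system the fan-out is defined as a neuro-san agent network, and the agents, evaluate, and rationale names are assumptions:

import asyncio

DIMENSIONS = [
    "innovation", "ux", "scalability", "market_potential",
    "ease_of_implementation", "financial_feasibility", "complexity",
]

async def create_eval(user_input: str, agents: dict) -> dict:
    """Parent-agent pattern: fan out to one specialist per dimension, then merge."""
    results = await asyncio.gather(
        *(agents[dim].evaluate(user_input) for dim in DIMENSIONS)
    )
    merged = {f"{dim}_score": res["score"] for dim, res in zip(DIMENSIONS, results)}
    merged["brief_description"] = " ".join(res["rationale"] for res in results)
    return merged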

Technical Highlights

Neuro-san powered multi-agent setup

 

Utilizing sly_data from neuro-san for deterministic outcomes

sly_data is a private dictionary channel in neuro-san: a JSON schema in the agent specification describes what specific information the agent needs as input arguments over this channel when it is called. The sly_data itself is generally considered private information that does not belong in the chat stream, for example credentials. With sly_data enabled, we could use it along with the agents to perform all mathematical operations, such as adding and storing scores and calculating averages, in deterministic code rather than in free-form text.

This approach drastically reduces the fallout rate: the agents are LLM-powered, and it is easy for a purely text-based response to get the numbers wrong.

"allow": {
    "to_upstream": {
        "sly_data": {
            "evaluation": true,
        }
    }
},
# This is used further down the chain in a Python coded tool to manage the evaluation
Scoring mechanism

Agents use JSON schemas throughout: to talk to each other internally and to report their scores and the reasoning behind them. Example:

Draft the following json dict:
{{
  "score": <[a list of above 10 scores, all between 1 and 100]>,
  "brief_description": "<Provide a brief description of your scoring rationale.>"
}}
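
For illustration, downstream code might parse and reduce that JSON roughly like this; in the actual system the arithmetic is handled deterministically via sly_data and coded tools, so treat this function as a simplified sketch:

import json
import statistics

def parse_dimension_response(raw: str) -> tuple[float, str]:
    """Parse the JSON an agent drafts and average its ten sub-scores."""
    payload = json.loads(raw)
    sub_scores = payload["score"]  # ten values between 1 and 100
    # e.g. mean([78, 85, 87, 76, 91, 45, 43, 66, 34, 59]) == 66.4
    return round(statistics.mean(sub_scores), 1), payload["brief_description"]
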
Async/Await Architecture 
async def process_submission(self, sub: Tuple[str, str, str, str]):
    # `sub` carries the submission id plus the three extracted inputs;
    # sub_id, user_input and clients are derived from it (derivation elided here)
    async with self.semaphore:  # cap concurrent calls to the AI evaluation service
        results = await self.evaluate_granular_scores(
            sub_id=sub_id,
            user_input=user_input,
            clients=clients
        )
Smart Resource Management 
  • Connection pooling for database operations

  • Automatic cleanup of temporary files

  • Memory-efficient streaming for large files

  • Graceful handling of timeouts and failures
Flexible Storage Handling

The system seamlessly handles both local and cloud storage:

from smart_open import open as smart_open  # streams s3:// URLs transparently

if repo_path.startswith("s3://"):
    with smart_open(repo_path, "rb") as s3_file:
        ...  # process the repository archive directly from S3
Batch Processing Strategies 

For massive scale (10,000+ submissions): 

# Implement batching with range processing
python process_inputs.py --range 1-1000
python process_inputs.py --range 1001-2000
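
The ranges can also be generated and dispatched from a small Python driver; this is an illustrative wrapper around the CLI shown above, not part of the project itself:

import subprocess

def enqueue_in_batches(total_submissions: int, batch_size: int = 1000) -> None:
    """Run process_inputs.py once per contiguous range of submissions."""
    for start in range(1, total_submissions + 1, batch_size):
        end = min(start + batch_size - 1, total_submissions)
        subprocess.run(
            ["python", "process_inputs.py", "--range", f"{start}-{end}"],
            check=True,
        )

enqueue_in_batches(30000)
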
Custom Evaluation Agents 

As mentioned earlier, creating specialized agents for domain-specific evaluation is a breeze with neuro-san and the database schema we created: for example, automated plagiarism detection, or a responsible-AI agent that reports on metrics such as power usage and bias.

# An example custom agent class that could be used by the process_eval pipeline
class SecurityAuditAgent(SimpleClient):
    def __init__(self):
        super().__init__(agent_name="security_auditor")
    
    def evaluate_security(self, repo_metadata):
        # Custom security analysis logic (scan_dependencies and calculate_score
        # are placeholders for your own implementation)
        vulnerabilities = self.scan_dependencies(repo_metadata)
        security_score = self.calculate_score(vulnerabilities)
        return {
            "security_score": security_score,
            "vulnerabilities_found": vulnerabilities
        }

Walkthrough: Processing a Real Hackathon Submission

Let's trace a submission through its lifecycle:

Stage 1: Input Processing

When a submission arrives, say with id sub_001, it comes with a text/PPT/PDF proposal, an MP4 pitch video, and a GitHub repository:

  1. Document Processing: The text is parsed, extracting the project description (average ~1000 words)

  2. Video Transcription: Whisper transcribes the ~3-minute pitch, producing ~400 words

  3. Repository Analysis:

    • 125 Python files detected

    • 8,500 lines of code counted

    • Key config files found

    • Test coverage identified

Stage 2: Database Storage

The processed data is stored:

{
    "sub_id": "sub_001",
    "project_details": "An AI-powered code review system...",
    "video_transcript": "Hello, today I'm presenting CodeMentor...",
    "repo_metadata": "languages: python: 8500 LOC, javascript: 2000 LOC...",
    "processing_time": 45.3
}
Stage 3: Multi-Agent Evaluation

The orchestrator agent receives the submission and delegates to specialists:

Innovativeness Agent analyzes:

  • Searches for novel patterns in the approach

  • Compares against known solutions

  • Internally assesses 10 different sub-dimensions (e.g., novelty, uniqueness, originality, mention of patented works, papers, novel algorithms) to arrive at 10 scores out of 100 each: [78, 85, 87, 76, 91, 45, 43, 66, 34, 59]

  • Returns: a score of 66.4/100 (the average of the sub-scores) with a rationale explaining why the submission scored that way in the innovativeness dimension

The Scalability Agent similarly examines 10 sub-dimensions:

  • Microservices architecture detected

  • Kubernetes configurations present

  • Configurability, reusability, cloud deployment flexibility etc.

  • Internally assesses 10 different sub-dimensions (e.g., modularity, horizontal and vertical scalability, configurations, standards, cloud practices) to arrive at 10 scores out of 100 each, for example [84, 75, 71, 86, 81, 65, 63, 61, 64, 89]

  • Returns: a score of 73.9/100 with a detailed analysis

This process repeats for all seven agents.

Each agent processes the submission independently, and the results are then aggregated:

{
    "innovation_score": 66.4,
    "ux_score": 74.5,
    "scalability_score": 73.9,
    "market_potential_score": 81.5,
    "ease_of_implementation_score": 71.0,
    "financial_feasibility_score": 68.0,
    "complexity_score": 58.5,
    "brief_description": "Innovative AI-powered code review system with strong scalability..."
}

These per-dimension scores can then be used to produce a stack ranking of all the submissions, for example as sketched below.
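
As a simple illustration (not the production ranking logic), an equal- or custom-weighted average over the per-dimension scores is enough to produce that ranking:

def overall_score(evaluation: dict, weights: dict | None = None) -> float:
    """Collapse the dimension scores into one number (equal weights by default)."""
    fields = [key for key in evaluation if key.endswith("_score")]
    weights = weights or {field: 1.0 for field in fields}
    total = sum(weights[field] for field in fields)
    return sum(evaluation[field] * weights[field] for field in fields) / total

def stack_rank(evaluations: list[dict]) -> list[dict]:
    """Sort aggregated evaluation records from highest to lowest overall score."""
    return sorted(evaluations, key=overall_score, reverse=True)

With equal weights, the aggregated record above works out to roughly 70.5; running stack_rank over all records yields the leaderboard.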

Stage 4: Results Dashboard

A Streamlit dashboard visualizes:

  • Score distributions across all submissions

  • Radar charts comparing multiple projects

  • Processing performance metrics

  • Detailed submission explorer

Performance metrics

| Metric                   |       Value       |
|--------------------------|-------------------|
| Submissions Count        | 30,601            |
| Total Cost               | $6,340            |
| Total Tokens             | 1,571,228,651     |
| Total Calls to LLM       | 466,041           |
| Avg Evaluation Time (s)  | 180               |

Comparative analysis in terms of human effort:

  1. Total evaluations processed

  • Number of evaluations: 30k+ (assume 30,000 for the calculation).

  • Assumed human time per evaluation: 30 minutes = 0.5 hours.

  • Total hours = 30,000 × 0.5 = 15,000 hours.

  2. Full-Time Equivalent (FTE) effort (assuming 7 hours of daily effort and 220 workdays in a year, i.e., 1,540 hours per FTE-year)

  • Required FTEs = 15,000 / 1,540 ≈ 9.7, i.e., roughly 10 FTE-years

Challenges and Limitations

Known Trade-offs

  1. Video Processing Constraints: Large video files (>500MB) may time out during transcription

  2. Repository Size Limits: Repositories over 100MB or 20,000 files are truncated

  3. Language Support: Currently optimized for English-language submissions

Conclusion: The Future of Code Evaluation

The multi-agent evaluation system represents more than just a tool – it's a glimpse into the future of technical assessment. By combining multi-modal analysis, distributed processing, and intelligent AI orchestration, it demonstrates that we can evaluate code not just for correctness, but for creativity, potential, and real-world viability.

The system's architecture offers valuable lessons for anyone building evaluation systems:

  • Separate concerns between processing, storage, and evaluation

  • Embrace asynchronous patterns for scalability

  • Design for explainability from the start

  • Build in flexibility for different evaluation criteria

  • Best Practices and strategies that make the system work:

    • Keep one agent per rubric dimension for explainability.

    • Provide each agent with clear prompts and role definitions.

    • Explainability is a must-have feature.

    • Aggregate outputs via an orchestrator (ProcessEvaluation) that enforces consistency.

    • Persist results in a database (SQLite for dev, Postgres for production) for better observability.

    • Capture not just scores but also token usage + cost and processing time per submission.

    • Composable by design: Massive parallelization using Celery workers with Redis as a message broker.

As coding competitions evolve and technical hiring becomes more sophisticated, systems like VibeCodingEval will become essential infrastructure. The ability to process thousands of complex submissions while maintaining nuanced, fair evaluation isn't just nice to have – it's becoming a competitive necessity.

Next Steps for Readers

  1. Try It Out: Clone the cognizant-ai-lab/neuro-san-studio repository on GitHub (a playground for neuro-san) and build your own agents

  2. Customize: Adapt the agents for your custom use case

  3. Contribute: The project welcomes contributions

  4. Star the Repository: Show your support and stay updated with new features

The future of code evaluation is here, and it's intelligent, scalable, and surprisingly sophisticated. VibeCodingEval proves that we can build systems that evaluate not just code, but the complete vision behind it.


Glossary

Multi-Agent System: An AI architecture where multiple specialized agents collaborate to solve complex problems

Celery: Python distributed task queue for handling background job processing

SQLAlchemy: Python SQL toolkit and Object-Relational Mapping (ORM) library

Whisper: OpenAI's automatic speech recognition system

Redis: In-memory data structure store used as message broker

Flower: Web-based tool for monitoring and administering Celery clusters

Streamlit: Python framework for creating data applications and dashboards

Token Accounting: Tracking API usage and costs in AI systems

Semaphore: Concurrency primitive that limits the number of simultaneous operations

ORM (Object-Relational Mapping): Technique for converting data between incompatible type systems


Frequently Asked Questions (FAQ)

General Questions

Q: What makes VibeCodingEval different from traditional code evaluation platforms like HackerRank or LeetCode?

A: Unlike traditional platforms that focus on algorithmic correctness and test cases, this system evaluates complete project submissions across multiple dimensions including innovation, UX, scalability, and business viability. It processes not just code metadata, but also video presentations and documentation, providing a holistic assessment of a project's potential with explainability.

Q: How long does it take to evaluate a single submission?

A: On average, a complete evaluation takes approximately 180 seconds (3 minutes). This includes processing all three input types (text, video, code) and running all seven evaluation agents. The system can be massively parallelized, significantly reducing overall evaluation time for large batches. In the use case discussed above, parallelization brought the effective average time per submission down to just 1.8 seconds.

Q: What's the cost of running evaluations at scale?

A: Based on our performance metrics, evaluating 30,000+ submissions costs approximately $6,340 in AI token usage. This translates to roughly $0.21 per submission, which is significantly more cost-effective than human evaluation: the same volume would take roughly 10 FTE-years of reviewer effort.

Q: Can the system handle non-English submissions?

A: Currently, the system is optimized for English-language submissions. While the analysis works regardless of language, the text and video transcription components perform best with English content. Multi-language support is only as good as that of the underlying LLM.

Technical Questions

Q: What are the maximum file size limits for submissions?

A: The system has the following limits:

  • Video files: Maximum 500MB (larger files may timeout during transcription)

  • Repository ZIP files: Maximum 100MB or 20,000 files

  • Document files: No strict limit, but very large PDFs (>50MB) may process slowly

  • Total submission package: Recommended under 1GB for optimal performance

Q: Can I use SQLite in production instead of PostgreSQL?

A: While technically possible, we strongly recommend PostgreSQL for production deployments. SQLite lacks the concurrent write capabilities needed for high-volume processing and doesn't scale well beyond a few thousand submissions. PostgreSQL provides better performance, reliability, and support for concurrent operations.
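
For reference, a minimal sketch of the two configurations with SQLAlchemy (the URLs and pool sizes are illustrative, not the project's defaults):

from sqlalchemy import create_engine

DEV_DB_URL = "sqlite:///vibe_eval_dev.db"                           # single-writer, local development
PROD_DB_URL = "postgresql://user:password@db-host:5432/vibe_eval"   # concurrent writes at scale

# Connection pooling matters once many Celery workers write results concurrently
engine = create_engine(PROD_DB_URL, pool_size=20, max_overflow=10)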

Q: How does the system handle repository dependencies and security scanning?

A: The current version extracts metadata and structure from repositories but doesn't execute code or install dependencies for security reasons. Security scanning can be added through custom agents (see the SecurityAuditAgent example in the blog), but it's not included in the default configuration.

Q: What happens if the Neuro SAN service is temporarily unavailable?

A: The Neuro SAN service runs locally alongside the evaluation system, which reduces the chances of failure. The system also implements retry logic with exponential backoff: failed evaluations are automatically retried up to 3 times with increasing delays. If the service remains unavailable, submissions stay queued and can be reprocessed once the service is restored. The Celery task queue ensures no submissions are lost.

Operational Questions

Q: How do I monitor the evaluation pipeline in real-time?

A: The system provides multiple monitoring options:

  • Flower Dashboard (http://localhost:5555): Real-time Celery task monitoring

  • Streamlit Dashboard (http://localhost:8501): Analytics and submission explorer

  • Database queries: Direct SQL queries for custom metrics

  • Log files: Detailed logs in the logs/processing/ directory

Q: Can I re-evaluate submissions with updated rubrics or agents?

A: Yes! Use the --override flag when running evaluations:

python deploy/enqueue_eval_tasks.py --override --db-url postgresql://...

This will re-process all submissions with the current agent configurations. You can also target specific submissions using the --filter-source parameter.

Q: How do I scale the system for a 100,000+ submission event?

A: For massive scale:

  1. Deploy multiple Celery workers across different machines

  2. Use Redis or Amazon ElastiCache for the message broker

  3. Implement database read replicas for the analytics dashboard

  4. Consider using S3 for submission storage instead of local files

  5. Increase the semaphore concurrency limit for AI requests

  6. Deploy evaluation agents on GPU-enabled instances for faster processing

Q: What's the best way to handle partial failures in batch processing?

A: The system is designed to be resilient:

  • Failed tasks are automatically retried

  • Successfully processed submissions are marked in the database

  • Use --range parameter to process specific batches

  • The system automatically skips already-processed submissions unless --override is specified

  • Monitor incomplete evaluations using: python eval_database.py --inc

Customization Questions

Q: How do I add a new evaluation dimension (e.g., security or performance)?

A: Adding new evaluation dimensions involves the following steps (a code sketch follows the list):

  1. Create a new agent using the Neuro-San data-driven agent setup

  2. Add the new score field to the database schema

  3. Update the SCORE_FIELDS list in process_eval.py

  4. Modify the orchestration logic to include the new agent

  5. Update the dashboard to visualize the new dimension
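
For illustration, step 3 might amount to something like this (the actual contents of SCORE_FIELDS in process_eval.py may differ):

# process_eval.py (sketch): register the new dimension so the orchestrator collects it
SCORE_FIELDS = [
    "innovation_score", "ux_score", "scalability_score",
    "market_potential_score", "ease_of_implementation_score",
    "financial_feasibility_score", "complexity_score",
    "security_score",   # new: stored in a matching Float column added in eval_database.py (step 2)
]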

Q: Can I use a different multi-agent platform instead of Neuro SAN?

A: Yes, the system is designed to be modular. You can replace the SimpleClient class with your own implementation that interfaces with OpenAI, Anthropic, CrewAI or any other multi-agent service. The key is maintaining the same response format (scores + rationale).

Q: How do I customize the scoring rubric for my specific use case?

A: Rubrics are defined in the agent prompts within the agent definitions. To customize:

  1. Access your agent configurations

  2. Modify the prompt to include your specific criteria

  3. Adjust the scoring scale if needed (default is 1-100)

  4. Update the aggregation logic if you want weighted scores

Q: Can I integrate this with my existing CI/CD pipeline?

A: Absolutely! The system provides CLI interfaces that can be integrated into any pipeline:

# Example GitHub Actions integration
- name: Process Submissions
  run: python process_inputs.py --input-source ${{ github.event.inputs.csv_path }}

- name: Run Evaluations
  run: python deploy/enqueue_eval_tasks.py --filter-source new_submissions.csv

Performance and Optimization Questions

Q: How many concurrent evaluations can the system handle?

A: This depends on your infrastructure:

  • Default configuration: 8 concurrent AI requests (via semaphore)

  • Celery workers: Limited by CPU cores and memory

  • Database: PostgreSQL can handle hundreds of concurrent connections

  • Practical limit: 1000s of concurrent evaluations with proper resource allocation

Q: What's the most effective way to reduce evaluation costs?

A: Several strategies can reduce costs:

  1. Implement caching for handling observability data

  2. Truncate video transcripts to essential portions

  3. Use smaller AI models for initial screening

  4. Implement smart sampling for very large repositories

Q: How do I debug a stuck evaluation?

A: Follow these steps:

  1. Check Flower dashboard for task status

  2. Review logs in logs/processing/process_eval_*.log

  3. Query the database for incomplete evaluations: python eval_database.py --inc

  4. Check Redis queue length: redis-cli LLEN eval_queue

  5. Examine agent thinking files in logs/ directory

  6. Restart stuck workers if necessary

Data and Privacy Questions

Q: How is sensitive data handled in submissions?

A: The system implements several privacy measures:

  • Submissions are processed in isolated environments

  • Temporary files are automatically cleaned up

  • Database connections use SSL in production

  • S3 storage uses encryption at rest

  • No code is executed; only metadata is analyzed statically

Q: Can I export evaluation results for further analysis?

A: Yes, multiple export options are available:

# Export to a pandas DataFrame (Python)
df = db.get_all_evaluations_as_df()
df.to_csv('evaluations_export.csv')

# Direct SQL export (CLI)
python eval_database.py --query "SELECT * FROM evaluations" > results.json

Q: How long is evaluation data retained?

A: By default, all evaluation data is retained indefinitely. You can implement data retention policies by:

  • Adding timestamp-based cleanup jobs

  • Archiving old evaluations to cold storage

  • Implementing GDPR-compliant deletion endpoints



Deepak Singh

Senior Data Scientist


Deepak is a data scientist who specializes in machine learning, applied statistics, and data-driven applications such as multi-agent systems.


