
Adding Benchmarks [Start Here] #10


How to Create New Benchmarks

Welcome, and thank you for deciding to contribute to ze-benchmarks. Below you will find guidance on how to add new benchmarks.

Overview

The benchmark creation process follows these steps:

  1. Plan your benchmark using our comprehensive guides
  2. Create a proposal using our GitHub issue template
  3. Implement your benchmark following our guidelines
  4. Submit a pull request with your implementation
  5. Get feedback and iterate through the review process

Step 1: Plan Your Benchmark

Before creating anything, familiarize yourself with our documentation:

Essential Reading

Key Concepts

  • Suites: Collections of related benchmarks (e.g., "dependency-management")
  • Scenarios: Individual test cases within a suite (e.g., "react-update")
  • Prompts: Difficulty tiers (L0-L3, Lx) for each scenario
  • Repository Fixtures: Real codebases with intentional issues
  • Oracle Answers: Expected outcomes for validation
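
As a quick sketch of how these pieces fit together on disk (this mirrors the file structure shown in Step 4, using the example names above):

suites/dependency-management/            # a suite groups related benchmarks
├── prompts/react-update/                # one prompt file per difficulty tier
│   ├── L0-minimal.md
│   ├── L1-basic.md
│   └── L2-directed.md
└── scenarios/react-update/
    ├── scenario.yaml                    # scenario configuration
    ├── oracle-answers.json              # oracle answers used for validation
    └── repo-fixture/                    # real codebase with intentional issues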

Step 2: Create a Benchmark Proposal

Use our GitHub issue template to propose your benchmark:

How to Create a Proposal

  1. Go to GitHub Issues
  2. Click "New Issue"
  3. Select "Adding Benchmarks Template"
  4. Fill out all sections with your benchmark details

What the Template Asks For

  • What is being added: Suite name, scenario names, description
  • Objective goal: What capability you're trying to evaluate
  • Validation & testing: Commands you used to test (pnpm bench <suite> <scenario> L1 echo)
  • Scenario configuration: Complete scenario.yaml template (see the sketch after this list)
  • Repository fixture structure: Directory layout and intentional issues
  • Prompt tier content: Examples for L0, L1, L2 difficulty levels
  • Oracle answers: Expected responses to agent questions
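
To help picture the scenario configuration part of a proposal, here is a minimal sketch using the framework-migration example from the next section. The field names below are assumptions, not the real schema; mirror an existing scenario.yaml from the repository when you write yours.

# Hypothetical sketch only -- field names are assumptions, not the real schema.
# Copy the structure of an existing scenario.yaml in the repository instead.
suite: framework-migration
scenario: react-to-solid
description: Migrate React components in the fixture to Solid.js
tiers: [L0, L1, L2]
fixture: repo-fixture/
validation:
  - pnpm install    # fixture must install and build
  - pnpm test       # hypothetical check that migrated components still pass tests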

Example Proposal Structure

Suite: "framework-migration"
Scenario: "react-to-solid"
Description: "Test agent's ability to migrate React components to Solid.js"
Testing: "pnpm bench framework-migration react-to-solid L1 echo"

Step 3: Get Feedback on Your Proposal

After submitting your proposal:

  1. Wait for review (1-2 business days)
  2. Address feedback from maintainers
  3. Refine your design based on suggestions
  4. Get approval before implementing

Common Feedback Areas

  • Realism: Is this a real-world scenario developers face?
  • Complexity: Is the difficulty level appropriate?
  • Completeness: Are all required files and configurations included?
  • Testing: Have you validated it works with different agents?

Step 4: Implement Your Benchmark

Once your proposal is approved:

File Structure

suites/YOUR-SUITE/
├── prompts/YOUR-SCENARIO/
│   ├── L0-minimal.md
│   ├── L1-basic.md
│   ├── L2-directed.md
│   └── L3-migration.md (optional)
└── scenarios/YOUR-SCENARIO/
    ├── scenario.yaml
    ├── oracle-answers.json
    └── repo-fixture/
        ├── package.json
        ├── [source files]
        └── [config files]
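
The oracle-answers.json file holds the expected responses to questions an agent might ask while working on the scenario. The question/answer pairs below are purely illustrative and the exact shape of the file is an assumption; mirror an existing scenario's oracle-answers.json for the real format.

{
  "_note": "Illustrative sketch only; copy the real shape from an existing scenario",
  "Which package manager should I use?": "pnpm",
  "Should devDependencies be updated as well?": "Yes, update them in the same pass"
}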

Implementation Checklist

  • Create suite directory structure
  • Write scenario.yaml with your configuration
  • Create repository fixture with intentional issues
  • Write prompts for each difficulty tier (see the example after this list)
  • Create oracle answers for common questions
  • Test with multiple agents (echo, anthropic)
  • Test with multiple tiers (L0, L1, L2)
  • Verify validation commands work
  • Update documentation
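
To give a feel for how the tiers differ, the react-to-solid prompts might progress roughly like this (one-line summaries only; the wording is illustrative, not prescribed):

# L0-minimal.md -- bare goal, no guidance
"Migrate this project from React to Solid.js."

# L1-basic.md -- goal plus a success criterion
"Migrate the React components to Solid.js and make sure the project still builds."

# L2-directed.md -- goal, success criterion, and concrete direction
"Migrate the React components to Solid.js, replacing hooks such as useState/useEffect
with Solid primitives like createSignal/createEffect, and keep the existing tests green."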

Step 5: Submit a Pull Request

Creating Your PR

  1. Fork and clone the repository
  2. Create a branch: git checkout -b feature/your-benchmark
  3. Implement your benchmark following the approved proposal
  4. Test thoroughly with multiple agents and tiers
  5. Create pull request linking to your original issue
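
As a rough command-line sketch of that flow (the fork URL and directory name are placeholders, and the bench commands are the same ones listed under Testing Strategy below):

# Fork on GitHub first, then:
git clone <your-fork-url>
cd ze-benchmarks
git checkout -b feature/your-benchmark

# ...implement your benchmark, then test with multiple agents and tiers
pnpm bench your-suite your-scenario L1 echo
pnpm bench your-suite your-scenario L1 anthropic

# Commit and push; the commit message mirrors the PR title convention below
git add suites/your-suite
git commit -m "[BENCHMARK] Add your-suite with your-scenario"
git push origin feature/your-benchmark
# Finally, open the pull request on GitHub and link it to your proposal issue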

PR Requirements

  • Link to issue: Reference your benchmark proposal issue
  • Complete implementation: All files from your proposal
  • Repository fixture builds successfully
  • Tested with at least 2 different agents
  • Tested with multiple prompt tiers
  • Validation commands work correctly
  • Documentation is clear and complete

Example PR Title

[BENCHMARK] Add framework-migration suite with react-to-solid scenario

Step 6: Review and Iteration

Review Process

  1. Initial review (1-2 business days)
  2. Feedback on implementation, testing, documentation
  3. Address feedback and make requested changes
  4. Final review and approval
  5. Merge into main branch

Common Areas for Improvement

  • Testing: Add more comprehensive test coverage
  • Documentation: Improve clarity and examples
  • Performance: Optimize benchmark execution time
  • Edge cases: Handle more error conditions
  • Code quality: Improve organization and readability

Best Practices

Benchmark Design

  • Realistic: Use scenarios developers actually face
  • Challenging: Test agent capabilities appropriately
  • Complete: Include all necessary files and configurations
  • Tested: Validate with multiple agents and tiers
  • Documented: Clear documentation and examples

Testing Strategy

# Test with echo agent first (fastest)
pnpm bench your-suite your-scenario L1 echo

# Test with anthropic agent
pnpm bench your-suite your-scenario L1 anthropic

# Test all tiers
pnpm bench your-suite your-scenario --batch echo

# Test specific combinations
pnpm bench your-suite your-scenario L0,L1,L2 anthropic

Testing Strategy (Interactive)

# Run this and follow the interactive prompts
pnpm bench

(A screen recording demonstrating this flow is attached: Screen.Recording.2025-10-28.115901.mp4)

Quality Standards

  • Repository fixture: Minimal but complete, realistic structure
  • Prompts: Clear progression from minimal to detailed
  • Validation: Commands that actually test the requirements
  • Oracle answers: Comprehensive responses to common questions

Getting Help

Resources

Support Channels

  • GitHub Issues: For specific problems or questions
  • GitHub Discussions: For general questions and community help
  • Pull Request Comments: For feedback on your implementation

Timeline Expectations

  • Proposal Review: 1-2 business days
  • Implementation: Varies (1-5 days depending on complexity)
  • PR Review: 1-2 business days
  • Total: 3-9 business days typically

Recognition

  • Contributors: Recognized in project README
  • Significant contributors: May be invited to join maintainer team
  • Release notes: Contributors acknowledged in releases

Ready to Get Started?

  1. Read the documentation linked above
  2. Create a benchmark proposal using our issue template
  3. Start the discussion in GitHub issues
  4. Link your PR to your proposal issue
  5. Follow the workflow outlined above

Thank you for contributing to ze-benchmarks! Your benchmarks help make AI agent evaluation more comprehensive and useful for the entire community.

Questions? Feel free to ask in the comments below or create a new discussion!
