How to Create New Benchmarks
Welcome, and thank you for deciding to contribute to ze-benchmarks. Below you will find guidance on how to create and submit new benchmarks.
Overview
The benchmark creation process follows these steps:
- Plan your benchmark using our comprehensive guides
- Create a proposal using our GitHub issue template
- Implement your benchmark following our guidelines
- Submit a pull request with your implementation
- Get feedback and iterate through the review process
Step 1: Plan Your Benchmark
Before creating anything, familiarize yourself with our documentation:
Essential Reading
- Contributing Guide - Complete contribution guidelines and code of conduct
- Adding Benchmarks - Comprehensive benchmark creation guide
- Benchmark Checklist - Quality validation checklist (still a work in progress)
Key Concepts
- Suites: Collections of related benchmarks (e.g., "dependency-management")
- Scenarios: Individual test cases within a suite (e.g., "react-update")
- Prompts: Difficulty tiers (L0-L3, Lx) for each scenario
- Repository Fixtures: Real codebases with intentional issues
- Oracle Answers: Expected outcomes for validation
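These concepts map directly onto a single benchmark invocation. As a hedged illustration using the example names above (the full set of testing commands appears in the Testing Strategy section below):

```bash
# suite                  scenario     tier  agent
pnpm bench dependency-management react-update L1 echo
```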
Step 2: Create a Benchmark Proposal
Use our GitHub issue template to propose your benchmark:
How to Create a Proposal
- Go to GitHub Issues
- Click "New Issue"
- Select "Adding Benchmarks Template"
- Fill out all sections with your benchmark details
What the Template Asks For
- What is being added: Suite name, scenario names, description
- Objective goal: What capability you're trying to evaluate
- Validation & testing: Commands you used to test (e.g., `pnpm bench <suite> <scenario> L1 echo`)
- Scenario configuration: Complete `scenario.yaml` template (a sketch follows this list)
- Repository fixture structure: Directory layout and intentional issues
- Prompt tier content: Examples for L0, L1, L2 difficulty levels
- Oracle answers: Expected responses to agent questions
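Since the template asks for a complete `scenario.yaml`, here is a minimal sketch of what one might contain. The field names are illustrative assumptions, not the repository's actual schema; use the template from the Adding Benchmarks guide as the source of truth.

```yaml
# Hypothetical scenario.yaml sketch; field names are assumptions,
# not the repository's actual schema.
suite: framework-migration
scenario: react-to-solid
description: Migrate React components to Solid.js
tiers: [L0, L1, L2]
validation:
  # Commands that must succeed after the agent's changes
  - pnpm install
  - pnpm test
```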
Example Proposal Structure
Suite: "framework-migration"
Scenario: "react-to-solid"
Description: "Test agent's ability to migrate React components to Solid.js"
Testing: "pnpm bench framework-migration react-to-solid L1 echo"
Step 3: Get Feedback on Your Proposal
After submitting your proposal:
- Wait for review (1-2 business days)
- Address feedback from maintainers
- Refine your design based on suggestions
- Get approval before implementing
Common Feedback Areas
- Realism: Is this a real-world scenario developers face?
- Complexity: Is the difficulty level appropriate?
- Completeness: Are all required files and configurations included?
- Testing: Have you validated it works with different agents?
Step 4: Implement Your Benchmark
Once your proposal is approved:
File Structure
```
suites/YOUR-SUITE/
├── prompts/YOUR-SCENARIO/
│   ├── L0-minimal.md
│   ├── L1-basic.md
│   ├── L2-directed.md
│   └── L3-migration.md   (optional)
└── scenarios/YOUR-SCENARIO/
    ├── scenario.yaml
    ├── oracle-answers.json
    └── repo-fixture/
        ├── package.json
        ├── [source files]
        └── [config files]
```
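One way to scaffold this layout from the repository root (the names are placeholders for your own suite and scenario; this is a convenience sketch, not a required step):

```bash
# Create the directory skeleton (placeholder names)
mkdir -p suites/YOUR-SUITE/prompts/YOUR-SCENARIO
mkdir -p suites/YOUR-SUITE/scenarios/YOUR-SCENARIO/repo-fixture

# Stub out the required files
touch suites/YOUR-SUITE/prompts/YOUR-SCENARIO/{L0-minimal.md,L1-basic.md,L2-directed.md}
touch suites/YOUR-SUITE/scenarios/YOUR-SCENARIO/{scenario.yaml,oracle-answers.json}
```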
Implementation Checklist
- Create suite directory structure
- Write `scenario.yaml` with your configuration
- Create repository fixture with intentional issues
- Write prompts for each difficulty tier
- Create oracle answers for common questions (see the sketch after this checklist)
- Test with multiple agents (`echo`, `anthropic`)
- Test with multiple tiers (L0, L1, L2)
- Verify validation commands work
- Update documentation
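To make the oracle answers item concrete, below is a hedged sketch of what an `oracle-answers.json` might contain. The structure is an illustrative assumption (the `_note` key marks it as such); the harness's actual schema may differ, so check the Adding Benchmarks guide.

```json
{
  "_note": "Illustrative sketch; not necessarily the harness's actual schema",
  "What package manager does this project use?": "pnpm",
  "Should the existing tests keep passing?": "Yes, all existing tests must pass unchanged.",
  "May I change the public API?": "No, keep the public API stable."
}
```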
Step 5: Submit a Pull Request
Creating Your PR
- Fork and clone the repository
- Create a branch: `git checkout -b feature/your-benchmark` (a full command sketch follows this list)
- Implement your benchmark following the approved proposal
- Test thoroughly with multiple agents and tiers
- Create pull request linking to your original issue
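Put together, the flow looks roughly like this. The clone URL is a placeholder for your own fork, and the commit message follows the PR title convention shown below:

```bash
# Clone your fork (placeholder URL) and create a feature branch
git clone https://github.com/<your-username>/ze-benchmarks.git
cd ze-benchmarks
git checkout -b feature/your-benchmark

# ...implement and test your benchmark...

# Commit, push, and open a PR that links your proposal issue
git add suites/your-suite
git commit -m "[BENCHMARK] Add your-suite with your-scenario"
git push -u origin feature/your-benchmark
```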
PR Requirements
- Link to issue: Reference your benchmark proposal issue
- Complete implementation: All files from your proposal
- Repository fixture builds successfully
- Tested with at least 2 different agents
- Tested with multiple prompt tiers
- Validation commands work correctly
- Documentation is clear and complete
Example PR Title
[BENCHMARK] Add framework-migration suite with react-to-solid scenario
Step 6: Review and Iteration
Review Process
- Initial review (1-2 business days)
- Feedback on implementation, testing, documentation
- Address feedback and make requested changes
- Final review and approval
- Merge into main branch
Common Areas for Improvement
- Testing: Add more comprehensive test coverage
- Documentation: Improve clarity and examples
- Performance: Optimize benchmark execution time
- Edge cases: Handle more error conditions
- Code quality: Improve organization and readability
Best Practices
Benchmark Design
- Realistic: Use scenarios developers actually face
- Challenging: Test agent capabilities appropriately
- Complete: Include all necessary files and configurations
- Tested: Validate with multiple agents and tiers
- Documented: Clear documentation and examples
Testing Strategy
```bash
# Test with echo agent first (fastest)
pnpm bench your-suite your-scenario L1 echo

# Test with anthropic agent
pnpm bench your-suite your-scenario L1 anthropic

# Test all tiers
pnpm bench your-suite your-scenario --batch echo

# Test specific combinations
pnpm bench your-suite your-scenario L0,L1,L2 anthropic
```
Testing Strategy (Interactive)
```bash
# Run this and follow the instructions
pnpm bench
```
(Screen recording attached: Screen.Recording.2025-10-28.115901.mp4, demonstrating the interactive flow.)
Quality Standards
- Repository fixture: Minimal but complete, realistic structure
- Prompts: Clear progression from minimal to detailed (illustrated after this list)
- Validation: Commands that actually test the requirements
- Oracle answers: Comprehensive responses to common questions
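To illustrate that progression, here is a hedged sketch of how the same task might escalate in detail across prompt tiers. The wording is invented for illustration; your actual prompt files live under your suite's prompts directory.

```markdown
<!-- L0-minimal.md: bare task, no guidance (illustrative wording only) -->
Migrate this project from React to Solid.js.

<!-- L1-basic.md: adds the goal and a success criterion -->
Migrate this project from React to Solid.js.
All existing tests must continue to pass.

<!-- L2-directed.md: adds concrete direction -->
Migrate this project from React to Solid.js. Replace React hooks with
Solid signals, update package.json, and make sure `pnpm test` passes.
```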
Getting Help
Resources
- Contributing Guide - Complete contribution guidelines
- Adding Benchmarks - Detailed benchmark guide
- Benchmark Checklist - Quality validation
- GitHub Discussions - Community help
Support Channels
- GitHub Issues: For specific problems or questions
- GitHub Discussions: For general questions and community help
- Pull Request Comments: For feedback on your implementation
Timeline Expectations
- Proposal Review: 1-2 business days
- Implementation: Varies (1-5 days depending on complexity)
- PR Review: 1-2 business days
- Total: typically 3-9 business days
Recognition
- Contributors: Recognized in project README
- Significant contributors: May be invited to join maintainer team
- Release notes: Contributors acknowledged in releases
Ready to Get Started?
- Read the documentation linked above
- Create a benchmark proposal using our issue template
- Start the discussion in GitHub issues
- Link your PR to your proposal issue
- Follow the workflow outlined above
Thank you for contributing to ze-benchmarks! Your benchmarks help make AI agent evaluation more comprehensive and useful for the entire community.
Questions? Feel free to ask in the comments below or create a new discussion!