How to Create New Benchmarks
Welcome, and thank you for deciding to contribute to ze-benchmarks. Below you will find guidance on how to create and submit new benchmarks.
Overview
The benchmark creation process follows these steps:
- Plan your benchmark using our comprehensive guides
- Create a proposal using our GitHub issue template
- Implement your benchmark following our guidelines
- Submit a pull request with your implementation
- Get feedback and iterate through the review process
Step 1: Plan Your Benchmark
Before creating anything, familiarize yourself with our documentation:
Essential Reading
- Contributing Guide - Complete contribution guidelines and code of conduct
- Adding Benchmarks - Comprehensive benchmark creation guide
- Benchmark Checklist - Quality validation checklist (still a work in progress)
Key Concepts
- Suites: Collections of related benchmarks (e.g., "dependency-management")
- Scenarios: Individual test cases within a suite (e.g., "react-update")
- Prompts: Difficulty tiers (L0-L3, Lx) for each scenario
- Repository Fixtures: Real codebases with intentional issues
- Oracle Answers: Expected outcomes for validation
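These concepts map directly onto a single benchmark invocation. As a hedged illustration using the example names above (the full set of testing commands appears in the Testing Strategy section below):

```bash
# suite                  scenario     tier  agent
pnpm bench dependency-management react-update L1 echo
```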
Step 2: Create a Benchmark Proposal
Use our GitHub issue template to propose your benchmark:
How to Create a Proposal
- Go to GitHub Issues
- Click "New Issue"
- Select "Adding Benchmarks Template"
- Fill out all sections with your benchmark details
What the Template Asks For
- What is being added: Suite name, scenario names, description
- Objective goal: What capability you're trying to evaluate
- Validation & testing: Commands you used to test (e.g., `pnpm bench <suite> <scenario> L1 echo`)
- Scenario configuration: Complete `scenario.yaml` template (a sketch follows this list)
- Repository fixture structure: Directory layout and intentional issues
- Prompt tier content: Examples for L0, L1, L2 difficulty levels
- Oracle answers: Expected responses to agent questions
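Since the template asks for a complete `scenario.yaml`, here is a minimal sketch of what one might contain. The field names are illustrative assumptions, not the repository's actual schema; use the template from the Adding Benchmarks guide as the source of truth.

```yaml
# Hypothetical scenario.yaml sketch; field names are assumptions,
# not the repository's actual schema.
suite: framework-migration
scenario: react-to-solid
description: Migrate React components to Solid.js
tiers: [L0, L1, L2]
validation:
  # Commands that must succeed after the agent's changes
  - pnpm install
  - pnpm test
```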
Example Proposal Structure
Suite: "framework-migration"
Scenario: "react-to-solid"
Description: "Test agent's ability to migrate React components to Solid.js"
Testing: "pnpm bench framework-migration react-to-solid L1 echo"
Step 3: Get Feedback on Your Proposal
After submitting your proposal:
- Wait for review (1-2 business days)
- Address feedback from maintainers
- Refine your design based on suggestions
- Get approval before implementing
Common Feedback Areas
- Realism: Is this a real-world scenario developers face?
- Complexity: Is the difficulty level appropriate?
- Completeness: Are all required files and configurations included?
- Testing: Have you validated it works with different agents?
Step 4: Implement Your Benchmark
Once your proposal is approved:
File Structure
```
suites/YOUR-SUITE/
├── prompts/YOUR-SCENARIO/
│   ├── L0-minimal.md
│   ├── L1-basic.md
│   ├── L2-directed.md
│   └── L3-migration.md   (optional)
└── scenarios/YOUR-SCENARIO/
    ├── scenario.yaml
    ├── oracle-answers.json
    └── repo-fixture/
        ├── package.json
        ├── [source files]
        └── [config files]
```
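One way to scaffold this layout from the repository root (the names are placeholders for your own suite and scenario; this is a convenience sketch, not a required step):

```bash
# Create the directory skeleton (placeholder names)
mkdir -p suites/YOUR-SUITE/prompts/YOUR-SCENARIO
mkdir -p suites/YOUR-SUITE/scenarios/YOUR-SCENARIO/repo-fixture

# Stub out the required files
touch suites/YOUR-SUITE/prompts/YOUR-SCENARIO/{L0-minimal.md,L1-basic.md,L2-directed.md}
touch suites/YOUR-SUITE/scenarios/YOUR-SCENARIO/{scenario.yaml,oracle-answers.json}
```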
Implementation Checklist
- Create suite directory structure
- Write `scenario.yaml` with your configuration
- Create repository fixture with intentional issues
- Write prompts for each difficulty tier
- Create oracle answers for common questions (see the sketch after this checklist)
- Test with multiple agents (`echo`, `anthropic`)
- Test with multiple tiers (L0, L1, L2)
- Verify validation commands work
- Update documentation
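To make the oracle answers item concrete, below is a hedged sketch of what an `oracle-answers.json` might contain. The structure is an illustrative assumption (the `_note` key marks it as such); the harness's actual schema may differ, so check the Adding Benchmarks guide.

```json
{
  "_note": "Illustrative sketch; not necessarily the harness's actual schema",
  "What package manager does this project use?": "pnpm",
  "Should the existing tests keep passing?": "Yes, all existing tests must pass unchanged.",
  "May I change the public API?": "No, keep the public API stable."
}
```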
Step 5: Submit a Pull Request
Creating Your PR
- Fork and clone the repository
- Create a branch: `git checkout -b feature/your-benchmark` (a full command sketch follows this list)
- Implement your benchmark following the approved proposal
- Test thoroughly with multiple agents and tiers
- Create pull request linking to your original issue
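Put together, the flow looks roughly like this. The clone URL is a placeholder for your own fork, and the commit message follows the PR title convention shown below:

```bash
# Clone your fork (placeholder URL) and create a feature branch
git clone https://github.com/<your-username>/ze-benchmarks.git
cd ze-benchmarks
git checkout -b feature/your-benchmark

# ...implement and test your benchmark...

# Commit, push, and open a PR that links your proposal issue
git add suites/your-suite
git commit -m "[BENCHMARK] Add your-suite with your-scenario"
git push -u origin feature/your-benchmark
```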
PR Requirements
- Link to issue: Reference your benchmark proposal issue
- Complete implementation: All files from your proposal
- Repository fixture builds successfully
- Tested with at least 2 different agents
- Tested with multiple prompt tiers
- Validation commands work correctly
- Documentation is clear and complete
Example PR Title
[BENCHMARK] Add framework-migration suite with react-to-solid scenario
Step 6: Review and Iteration
Review Process
- Initial review (1-2 business days)
- Feedback on implementation, testing, documentation
- Address feedback and make requested changes
- Final review and approval
- Merge into main branch
Common Areas for Improvement
- Testing: Add more comprehensive test coverage
- Documentation: Improve clarity and examples
- Performance: Optimize benchmark execution time
- Edge cases: Handle more error conditions
- Code quality: Improve organization and readability
Best Practices
Benchmark Design
- Realistic: Use scenarios developers actually face
- Challenging: Test agent capabilities appropriately
- Complete: Include all necessary files and configurations
- Tested: Validate with multiple agents and tiers
- Documented: Clear documentation and examples
Testing Strategy
```bash
# Test with echo agent first (fastest)
pnpm bench your-suite your-scenario L1 echo

# Test with anthropic agent
pnpm bench your-suite your-scenario L1 anthropic

# Test all tiers
pnpm bench your-suite your-scenario --batch echo

# Test specific combinations
pnpm bench your-suite your-scenario L0,L1,L2 anthropic
```
Testing Strategy (Interactive)
```bash
# Run this and follow the instructions
pnpm bench
```
(Screen recording attached: Screen.Recording.2025-10-28.115901.mp4, demonstrating the interactive flow.)
Quality Standards
- Repository fixture: Minimal but complete, realistic structure
- Prompts: Clear progression from minimal to detailed (illustrated after this list)
- Validation: Commands that actually test the requirements
- Oracle answers: Comprehensive responses to common questions
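To illustrate that progression, here is a hedged sketch of how the same task might escalate in detail across prompt tiers. The wording is invented for illustration; your actual prompt files live under your suite's prompts directory.

```markdown
<!-- L0-minimal.md: bare task, no guidance (illustrative wording only) -->
Migrate this project from React to Solid.js.

<!-- L1-basic.md: adds the goal and a success criterion -->
Migrate this project from React to Solid.js.
All existing tests must continue to pass.

<!-- L2-directed.md: adds concrete direction -->
Migrate this project from React to Solid.js. Replace React hooks with
Solid signals, update package.json, and make sure `pnpm test` passes.
```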
Getting Help
Resources
- Contributing Guide - Complete contribution guidelines
- Adding Benchmarks - Detailed benchmark guide
- Benchmark Checklist - Quality validation
- GitHub Discussions - Community help
Support Channels
- GitHub Issues: For specific problems or questions
- GitHub Discussions: For general questions and community help
- Pull Request Comments: For feedback on your implementation
Timeline Expectations
- Proposal Review: 1-2 business days
- Implementation: Varies (1-5 days depending on complexity)
- PR Review: 1-2 business days
- Total: typically 3-9 business days
Recognition
- Contributors: Recognized in project README
- Significant contributors: May be invited to join maintainer team
- Release notes: Contributors acknowledged in releases
Ready to Get Started?
- Read the documentation linked above
- Create a benchmark proposal using our issue template
- Start the discussion in GitHub issues
- Link your PR to your proposal issue
- Follow the workflow outlined above
Thank you for contributing to ze-benchmarks! Your benchmarks help make AI agent evaluation more comprehensive and useful for the entire community.
Questions? Feel free to ask in the comments below or create a new discussion!