Microsoft Foundry Evaluation GitHub Action

This GitHub Action enables offline evaluation of Microsoft Foundry Agents within your CI/CD pipelines. It is designed to streamline the offline evaluation process, allowing you to identify potential issues and make improvements before releasing an update to production.

To use this action, you provide a dataset of test queries and a list of evaluators. The action invokes your agent(s) with the queries, collects performance data such as latency and token counts, runs the evaluations, and generates a summary report.

Features

  • Agent Evaluation: Automate pre-production assessment of Microsoft Foundry agents in your CI/CD workflow.
  • Evaluators: Leverage any evaluators from the Foundry evaluator catalog.
  • Statistical Analysis: Evaluation results include confidence intervals and tests for statistical significance to determine whether changes are meaningful rather than due to random variation (see the illustrative sketch after this list).
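
The action computes these statistics for you; the sketch below is illustrative only, not the action's internal code. Assuming SciPy is available and using hypothetical per-query scores for a baseline and a candidate agent, it shows what a confidence interval and a significance test over evaluation scores mean.

# Illustrative only: not the action's internal implementation.
# Shows what "confidence interval + significance test" means for
# per-query evaluation scores of two agents (hypothetical data).
from statistics import mean, stdev
from scipy import stats

baseline_scores = [4, 5, 3, 4, 4, 5, 3, 4]   # hypothetical per-query scores, baseline agent
candidate_scores = [5, 5, 4, 5, 4, 5, 4, 5]  # hypothetical per-query scores, candidate agent

# Two-sample t-test: is the difference in mean scores statistically significant?
t_stat, p_value = stats.ttest_ind(candidate_scores, baseline_scores)

# 95% confidence interval for the candidate agent's mean score
n = len(candidate_scores)
margin = stats.t.ppf(0.975, df=n - 1) * stdev(candidate_scores) / n ** 0.5
print(f"candidate mean: {mean(candidate_scores):.2f} +/- {margin:.2f} (95% CI)")
print(f"p-value vs. baseline: {p_value:.3f}")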

Evaluator categories

  • Agent evaluators: Process and system-level evaluators for agent workflows
  • RAG evaluators: Evaluate end-to-end and retrieval processes in RAG systems
  • Risk and safety evaluators: Assess risks and safety concerns in responses
  • General purpose evaluators: Quality evaluation such as coherence and fluency
  • OpenAI-based graders: Leverage OpenAI graders including string check, text similarity, and score/label models
  • Custom evaluators: Define your own custom evaluators using Python code or LLM-as-a-judge patterns (see the sketch after this list)
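
As a concrete illustration of a code-based custom evaluator, the sketch below follows the common azure-ai-evaluation convention of a callable that returns a dict of scores. The class name, parameter, and metric names are hypothetical; see samples/data/dataset-custom-evaluators.json for how custom evaluators and their parameters are referenced from the data file.

class AnswerLengthEvaluator:
    """Hypothetical code-based evaluator: checks whether a response stays within a target length."""

    def __init__(self, max_words: int = 150):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        # Return a dict mapping metric names to values, the shape
        # code-based evaluators conventionally produce.
        word_count = len(response.split())
        return {
            "answer_length": word_count,
            "within_limit": float(word_count <= self.max_words),
        }

# Example usage
evaluator = AnswerLengthEvaluator(max_words=100)
print(evaluator(response="Tokyo Disneyland opened in 1983 in Urayasu, Chiba."))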

Inputs

Parameters

| Name | Required? | Description |
| --- | --- | --- |
| azure-ai-project-endpoint | Yes | Endpoint of your Microsoft Foundry project |
| deployment-name | Yes | Name of the Azure AI model deployment to use for evaluation |
| data-path | Yes | Path to the data file that contains the evaluators and input queries for evaluations |
| agent-ids | Yes | ID(s) of the agent(s) to evaluate, in the format agent-name:version (e.g., my-agent:1 or my-agent:1,my-agent:2). Multiple agents are comma-separated and compared with statistical test results |
| baseline-agent-id | No | ID of the baseline agent to compare against when evaluating multiple agents. If not provided, the first agent is used |

Data file

The input data file should be a JSON file with the following structure:

| Field | Type | Required? | Description |
| --- | --- | --- | --- |
| name | string | Yes | Name of the evaluation dataset |
| evaluators | string[] | Yes | List of evaluator names to use. Check the list of available evaluators in your project's evaluator catalog in the Foundry portal: Build > Evaluations > Evaluator catalog |
| data | object[] | Yes | Array of input objects with query and optional evaluator fields such as ground_truth and context. Fields are auto-mapped to evaluators; use data_mapping to override |
| openai_graders | object | No | Configuration for OpenAI-based evaluators (label_model, score_model, string_check, etc.) |
| evaluator_parameters | object | No | Evaluator-specific initialization parameters (e.g., thresholds, custom settings) |
| data_mapping | object | No | Custom data field mappings (auto-generated from data if not provided) |

Basic sample data file

{
  "name": "test-data",
  "evaluators": [
    "builtin.fluency",
    "builtin.task_adherence",
    "builtin.violence",
  ],
  "data": [
    {
      "query": "Tell me about Tokyo disneyland"
    },
    {
      "query": "How do I install Python?"
    }
  ]
}

Additional sample data files

| Filename | Description |
| --- | --- |
| samples/data/dataset-tiny.json | Dataset with a small number of test queries and evaluators |
| samples/data/dataset.json | Dataset with all supported evaluator types and enough queries for confidence interval calculation and statistical testing |
| samples/data/dataset-builtin-evaluators.json | Built-in Foundry evaluators example (e.g., coherence, fluency, relevance, groundedness, metrics) |
| samples/data/dataset-openai-graders.json | OpenAI-based graders example (label models, score models, text similarity, string checks) |
| samples/data/dataset-custom-evaluators.json | Custom evaluators example with evaluator parameters |
| samples/data/dataset-data-mapping.json | Data mapping example showing how to override automatic field mappings with custom data column names (see the sketch below) |
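
If you prefer to generate a data file programmatically rather than writing JSON by hand, a minimal sketch is shown below. The dataset name, column names, and data_mapping values are illustrative assumptions; consult samples/data/dataset-data-mapping.json for the exact mapping syntax the action expects.

import json

# Illustrative sketch: building a data file in the structure described above.
# The data_mapping values are hypothetical; see samples/data/dataset-data-mapping.json
# for the exact syntax this action expects.
dataset = {
    "name": "support-bot-eval",
    "evaluators": ["builtin.fluency", "builtin.coherence"],
    "data": [
        {
            # Custom column names instead of the default "query"/"ground_truth"
            "question": "How do I reset my password?",
            "reference_answer": "Use the 'Forgot password' link on the sign-in page.",
        }
    ],
    # Map the custom column names onto the fields evaluators expect.
    "data_mapping": {
        "query": "question",
        "ground_truth": "reference_answer",
    },
}

with open("eval-dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2)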

Sample workflow

To use this GitHub Action, add it to your CI/CD workflow and specify the trigger criteria (e.g., on commit).

name: "AI Agent Evaluation"

on:
  workflow_dispatch:
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read

jobs:
  run-action:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Azure login using Federated Credentials
        uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

      - name: Run Evaluation
        uses: microsoft/ai-agent-evals@v3-beta
        with:
          # Replace placeholders with values for your Foundry Project
          azure-ai-project-endpoint: "<your-ai-project-endpoint>"
          deployment-name: "<your-deployment-name>"
          agent-ids: "<your-ai-agent-ids>"
          data-path: ${{ github.workspace }}/path/to/your/data-file

Evaluation Outputs

Evaluation results are written to the summary of each AI Agent Evaluation run, available under the Actions tab on GitHub.com.

Below is a sample report comparing two agents.

(Screenshot: sample output comparing multiple agent evaluations)

Learn More

For more information about the Foundry Agent Service and observability, see the Microsoft Foundry documentation.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
