
Running Scorers in CI

Running scorers in your CI pipeline provides quantifiable metrics for tracking agent quality over time. The runEvals function runs multiple test cases through your agent or workflow and returns aggregate scores.

Basic Setup

You can use any testing framework that supports ESM modules, such as Vitest, Jest, or Mocha.

Creating Test Cases

Use runEvals to evaluate your agent against multiple test cases. The function accepts an array of data items, each containing an input and an optional groundTruth that scorers can use to validate the output.

src/mastra/agents/weather-agent.test.ts
import { describe, it, expect } from 'vitest';
import { createScorer, runEvals } from "@mastra/core/evals";
import { weatherAgent } from "./weather-agent";
import { locationScorer } from "../scorers/location-scorer";

describe('Weather Agent Tests', () => {
  it('should correctly extract locations from queries', async () => {
    const result = await runEvals({
      data: [
        {
          input: 'weather in Berlin',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' }
        },
        {
          input: 'weather in Berlin, Maryland',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'US' }
        },
        {
          input: 'weather in Berlin, Russia',
          groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'RU' }
        },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    // Assert aggregate score meets threshold
    expect(result.scores['location-accuracy']).toBe(1);
    expect(result.summary.totalItems).toBe(3);
  });
});

Understanding Results

The runEvals function returns an object with:

  • scores: the average score for each scorer across all test cases
  • summary.totalItems: the total number of test cases processed

{
  scores: {
    'location-accuracy': 1.0, // Average score across all items
    'another-scorer': 0.85
  },
  summary: {
    totalItems: 3
  }
}
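
Because scores are averages, an exact-equality assertion fails as soon as a single item scores below 1. The sketch below shows one way to check the scores object against per-scorer minimums instead. The helper name findScoresBelowThreshold and the 0.9 thresholds are hypothetical and not part of Mastra; the only assumption about runEvals is the result shape shown above.

```typescript
// Hypothetical helper (not a Mastra API): lists scorers whose aggregate
// score is missing or falls below the required minimum.
type Scores = Record<string, number>;

function findScoresBelowThreshold(
  scores: Scores,
  thresholds: Record<string, number>,
): string[] {
  const failures: string[] = [];
  for (const [scorerName, minimum] of Object.entries(thresholds)) {
    const actual = scores[scorerName];
    if (actual === undefined) {
      // The scorer was expected but produced no aggregate score.
      failures.push(`${scorerName}: no score reported`);
    } else if (actual < minimum) {
      failures.push(`${scorerName}: ${actual} is below ${minimum}`);
    }
  }
  return failures;
}

// Example using the scores from the result shape above.
const exampleScores: Scores = {
  'location-accuracy': 1.0,
  'another-scorer': 0.85,
};

const failures = findScoresBelowThreshold(exampleScores, {
  'location-accuracy': 0.9, // assumed threshold, for illustration only
  'another-scorer': 0.9, // assumed threshold, for illustration only
});

console.log(failures); // ["another-scorer: 0.85 is below 0.9"]
```

In a test, asserting that the returned array is empty makes a CI failure message name every scorer that regressed, rather than stopping at the first failed expectation.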

Multiple Test Scenarios

Create separate test cases for different evaluation scenarios:

src/mastra/agents/weather-agent.test.ts
import { describe, it, expect } from 'vitest';
import { createScorer, runEvals } from "@mastra/core/evals";
import { weatherAgent } from "./weather-agent";

describe('Weather Agent Tests', () => {
  // Scorer shared by every scenario in this suite
  const locationScorer = createScorer({ /* ... */ });

  it('should handle location disambiguation', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berlin', groundTruth: { /* ... */ } },
        { input: 'weather in Berlin, Maryland', groundTruth: { /* ... */ } },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });

  it('should handle typos and misspellings', async () => {
    const result = await runEvals({
      data: [
        { input: 'weather in Berln', groundTruth: { expectedLocation: 'Berlin', expectedCountry: 'DE' } },
        { input: 'weather in Parris', groundTruth: { expectedLocation: 'Paris', expectedCountry: 'FR' } },
      ],
      target: weatherAgent,
      scorers: [locationScorer]
    });

    expect(result.scores['location-accuracy']).toBe(1);
  });
});

Next Steps