runEvals
The runEvals function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. It is intended for systematic testing, performance analysis, and validation of AI systems.
Usage Example
import { runEvals } from "@mastra/core/evals";
import { myAgent } from "./agents/my-agent";
import { myScorer1, myScorer2 } from "./scorers";

const result = await runEvals({
  target: myAgent,
  data: [
    { input: "What is machine learning?" },
    { input: "Explain neural networks" },
    { input: "How does AI work?" },
  ],
  scorers: [myScorer1, myScorer2],
  concurrency: 2,
  // Called once per test case, after the target and all scorers have run.
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`);
    console.log(`Scores:`, scorerResults);
  },
});

console.log(`Average scores:`, result.scores);
console.log(`Processed ${result.summary.totalItems} items`);
Parameters
target:
Agent | Workflow
The agent or workflow to evaluate.
data:
RunEvalsDataItem[]
Array of test cases with input data and optional ground truth.
scorers:
MastraScorer[] | WorkflowScorerConfig
Array of scorers for agents, or configuration object for workflows specifying scorers for the workflow and individual steps.
concurrency?:
number
= 1
Number of test cases to run concurrently.
onItemComplete?:
function
Callback function called after each test case completes. Receives item, target result, and scorer results.
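To make the concurrency parameter concrete, the following is a minimal sketch of a concurrency-limited runner. It is not the library's implementation; mapWithConcurrency and its signature are hypothetical names introduced here only to illustrate what "number of test cases to run concurrently" can mean in practice.

```typescript
// Hypothetical helper, not part of @mastra/core: runs `worker` over `items`
// with at most `limit` calls in flight, preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next item until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start at most `limit` runners (and at least one).
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

With a limit of 1 (the documented default for concurrency) the items are processed strictly one after another; a higher limit trades ordering of side effects such as log output for throughput.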
Data Item Structure
input:
string | string[] | CoreMessage[] | any
Input data for the target. For agents: messages or strings. For workflows: workflow input data.
groundTruth?:
any
Expected or reference output for comparison during scoring.
requestContext?:
RequestContext
Request context to pass to the target during execution.
tracingContext?:
TracingContext
Tracing context for observability and debugging.
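As a rough illustration of the data item shape, the sketch below builds a dataset from question/answer pairs. EvalItemSketch and toEvalItems are hypothetical local names that only mirror the input and groundTruth fields described above; the actual RunEvalsDataItem type exported by the library may differ.

```typescript
// Local, hypothetical shape mirroring the documented fields.
interface EvalItemSketch {
  input: unknown;
  groundTruth?: unknown;
}

// Build test cases from [question, expectedAnswer?] pairs.
// groundTruth is omitted entirely when no expected answer is given.
function toEvalItems(pairs: Array<[string, string?]>): EvalItemSketch[] {
  return pairs.map(([input, groundTruth]) =>
    groundTruth === undefined ? { input } : { input, groundTruth },
  );
}

const items = toEvalItems([
  ["What is AI?", "AI is a field of computer science."],
  ["Explain neural networks"],
]);
```

An array built this way has the same layout as the data arrays shown in the examples on this page.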
Workflow Scorer Configuration
For workflows, you can specify scorers at different levels using WorkflowScorerConfig:
workflow?:
MastraScorer[]
Array of scorers to evaluate the entire workflow output.
steps?:
Record<string, MastraScorer[]>
Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
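The two accepted forms of scorers (a plain array for agents, or a configuration object for workflows) can be told apart at runtime. The sketch below assumes simplified stand-in types; ScorerSketch, WorkflowScorerConfigSketch, and describeScorers are hypothetical and are not exported by the library.

```typescript
// Hypothetical minimal scorer shape, for illustration only.
interface ScorerSketch {
  id: string;
}

// Stand-in mirroring the documented workflow/steps fields.
interface WorkflowScorerConfigSketch {
  workflow?: ScorerSketch[];
  steps?: Record<string, ScorerSketch[]>;
}

type ScorersArg = ScorerSketch[] | WorkflowScorerConfigSketch;

// List every scorer id together with the level it applies to.
function describeScorers(scorers: ScorersArg): string[] {
  if (Array.isArray(scorers)) {
    // Array form: scorers applied to an agent's output.
    return scorers.map((s) => `agent:${s.id}`);
  }
  // Object form: workflow-level scorers first, then per-step scorers.
  const lines = (scorers.workflow ?? []).map((s) => `workflow:${s.id}`);
  const steps = scorers.steps ?? {};
  for (const stepId of Object.keys(steps)) {
    for (const s of steps[stepId]) {
      lines.push(`step(${stepId}):${s.id}`);
    }
  }
  return lines;
}
```

Both workflow and steps are optional, so a configuration may score only the final output, only selected steps, or both.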
Returns
scores:
Record<string, any>
Average scores across all test cases, organized by scorer name.
summary:
object
Summary information about the experiment execution.
summary.totalItems:
number
Total number of test cases processed.
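The idea of "average scores across all test cases, organized by scorer name" can be sketched as follows. This assumes each test case yields a flat map from scorer name to a numeric score; ItemScores and averageByScorer are hypothetical, and since scores is typed Record<string, any>, the library's actual result (especially for workflows) may be structured differently.

```typescript
// Hypothetical per-item result shape: scorer name -> numeric score.
type ItemScores = Record<string, number>;

// Average each scorer's values over the items where that scorer reported a score.
function averageByScorer(perItem: ItemScores[]): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};

  for (const scores of perItem) {
    for (const name of Object.keys(scores)) {
      sums[name] = (sums[name] ?? 0) + scores[name];
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }

  const averages: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    averages[name] = sums[name] / counts[name];
  }
  return averages;
}
```

In this sketch, a scorer that produced no value for a given item is simply left out of that scorer's average rather than counted as zero; that is a design choice of the example, not a documented behavior.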
Examples
Agent Evaluation
import { createScorer, runEvals } from "@mastra/core/evals";

// Scores 1 when the agent's first output message contains the ground truth, else 0.
const myScorer = createScorer({
  id: "my-scorer",
  description: "Check if Agent's response contains ground truth",
  type: "agent",
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || "";
  const expectedResponse = run.groundTruth;
  return response.includes(expectedResponse) ? 1 : 0;
});

// chatAgent is assumed to be an agent defined elsewhere in your project.
const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: "What is AI?",
      groundTruth:
        "AI is a field of computer science that creates intelligent machines.",
    },
    {
      input: "How does machine learning work?",
      groundTruth:
        "Machine learning uses algorithms to learn patterns from data.",
    },
  ],
  scorers: [myScorer],
  concurrency: 3,
});
Workflow Evaluation
// myWorkflow and the scorers below are assumed to be defined elsewhere in your project.
const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    { input: { query: "Process this data", priority: "high" } },
    { input: { query: "Another task", priority: "low" } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      "validation-step": [validationScorer],
      "processing-step": [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    // Each data item exposes its payload under `input`.
    console.log(`Workflow completed for: ${item.input.query}`);
    if (scorerResults.workflow) {
      console.log("Workflow scores:", scorerResults.workflow);
    }
    if (scorerResults.steps) {
      console.log("Step scores:", scorerResults.steps);
    }
  },
});
Related
- createScorer() - Create custom scorers for experiments
- MastraScorer - Learn about scorer structure and methods
- Custom Scorers - Guide to building evaluation logic
- Scorers Overview - Understand scorer concepts