runEvals

The runEvals function enables batch evaluation of agents and workflows by running multiple test cases against one or more scorers, with configurable concurrency. This is useful for systematic testing, performance analysis, and validation of AI systems.

Usage Example

import { runEvals } from "@mastra/core/evals";
import { myAgent } from "./agents/my-agent";
import { myScorer1, myScorer2 } from "./scorers";

const result = await runEvals({
  target: myAgent,
  data: [
    { input: "What is machine learning?" },
    { input: "Explain neural networks" },
    { input: "How does AI work?" },
  ],
  scorers: [myScorer1, myScorer2],
  concurrency: 2,
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Completed: ${item.input}`);
    console.log(`Scores:`, scorerResults);
  },
});

console.log(`Average scores:`, result.scores);
console.log(`Processed ${result.summary.totalItems} items`);

Parameters

target:

Agent | Workflow

The agent or workflow to evaluate.

data:

RunEvalsDataItem[]

Array of test cases with input data and optional ground truth.

scorers:

MastraScorer[] | WorkflowScorerConfig

Array of scorers for agents, or configuration object for workflows specifying scorers for the workflow and individual steps.

concurrency?:

number

= 1

Number of test cases to run concurrently.

onItemComplete?:

function

Callback function called after each test case completes. Receives item, target result, and scorer results.
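
To make the concurrency parameter concrete, here is a minimal sketch of a concurrency-limited batch runner. It is not Mastra's implementation: mapWithConcurrency and the uppercase worker are hypothetical names used only for illustration, and the sketch assumes that "concurrency" means the maximum number of test cases in flight at once.

```typescript
// Hypothetical helper, NOT part of @mastra/core: shows what a concurrency
// limit means when processing a batch of test cases.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner claims one item at a time until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Never start more runners than there are items, and always at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results; // results keep the order of the input items
}

// Demo: four fake test cases with a limit of 2, tracking the peak parallelism.
let inFlight = 0;
let maxInFlight = 0;
const demo = mapWithConcurrency(["a", "b", "c", "d"], 2, async (input) => {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulated work
  inFlight--;
  return input.toUpperCase();
});
```

With the default of 1, test cases in this sketch would run strictly one after another. Raising the limit trades throughput against load on the model provider.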

Data Item Structure

input:

string | string[] | CoreMessage[] | any

Input data for the target. For agents: messages or strings. For workflows: workflow input data.

groundTruth?:

any

Expected or reference output for comparison during scoring.

requestContext?:

RequestContext

Request context to pass to the target during execution.

tracingContext?:

TracingContext

Tracing context for observability and debugging.
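
The input forms listed above can be sketched with a simplified type. DataItemSketch and ChatMessage below are assumptions for illustration, not the types exported by Mastra, and the optional requestContext and tracingContext fields are left out because their types come from the library.

```typescript
// Simplified, assumed shapes for illustration only; the real
// RunEvalsDataItem type may differ.
type ChatMessage = { role: "user" | "assistant" | "system"; content: string };

interface DataItemSketch {
  input: string | string[] | ChatMessage[] | unknown;
  groundTruth?: unknown;
}

const data: DataItemSketch[] = [
  // Plain string input with a reference answer for the scorer to compare against.
  {
    input: "What is AI?",
    groundTruth: "AI is a field of computer science that creates intelligent machines.",
  },
  // Several strings.
  { input: ["What is machine learning?", "Keep the answer short."] },
  // Message-style input.
  { input: [{ role: "user", content: "Explain neural networks" }] },
  // Structured input, as a workflow might expect.
  { input: { query: "Process this data", priority: "high" } },
];

// Only items that carry a groundTruth can be checked by reference-based scorers.
const withGroundTruth = data.filter((item) => item.groundTruth !== undefined);
```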

Workflow Scorer Configuration

For workflows, you can specify scorers at different levels using WorkflowScorerConfig:

workflow?:

MastraScorer[]

Array of scorers to evaluate the entire workflow output.

steps?:

Record<string, MastraScorer[]>

Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
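
Since scorers accepts either a flat array or a configuration object, the two shapes can be told apart roughly as follows. This is a sketch under assumptions: ScorerSketch, WorkflowScorerConfigSketch, and listScorerIds are hypothetical stand-ins, not Mastra types or functions.

```typescript
// Hypothetical stand-ins for MastraScorer and WorkflowScorerConfig.
type ScorerSketch = { id: string };

interface WorkflowScorerConfigSketch {
  workflow?: ScorerSketch[];
  steps?: Record<string, ScorerSketch[]>;
}

type ScorersOption = ScorerSketch[] | WorkflowScorerConfigSketch;

// Flattens either shape into labels describing where each scorer applies.
function listScorerIds(scorers: ScorersOption): string[] {
  // Agent-style: a plain array of scorers.
  if (Array.isArray(scorers)) {
    return scorers.map((scorer) => scorer.id);
  }

  // Workflow-style: scorers for the whole run, then scorers per step ID.
  const labels: string[] = [];
  for (const scorer of scorers.workflow ?? []) {
    labels.push(`workflow:${scorer.id}`);
  }
  const steps = scorers.steps ?? {};
  for (const stepId of Object.keys(steps)) {
    for (const scorer of steps[stepId]) {
      labels.push(`${stepId}:${scorer.id}`);
    }
  }
  return labels;
}

const agentStyle = listScorerIds([{ id: "relevancy" }]);
const workflowStyle = listScorerIds({
  workflow: [{ id: "output-quality" }],
  steps: { "validation-step": [{ id: "validation" }] },
});
```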

Returns

scores:

Record<string, any>

Average scores across all test cases, organized by scorer name.

summary:

object

Summary information about the experiment execution.

summary.totalItems:

number

Total number of test cases processed.
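
The page states that scores holds averages organized by scorer name, but not how they are computed. One plausible reading is a per-scorer arithmetic mean, sketched below. averageByScorer is a hypothetical helper that assumes every score is a plain number; it is not the library's aggregation code.

```typescript
// One record per test case: scorer name -> numeric score (assumed shape).
type ItemScores = Record<string, number>;

// Hypothetical helper: arithmetic mean of each scorer's results across items.
function averageByScorer(perItem: ItemScores[]): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};

  for (const scores of perItem) {
    for (const name of Object.keys(scores)) {
      sums[name] = (sums[name] ?? 0) + scores[name];
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }

  const averages: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    averages[name] = sums[name] / counts[name];
  }
  return averages;
}

const averages = averageByScorer([
  { myScorer1: 1, myScorer2: 0 },
  { myScorer1: 0, myScorer2: 0 },
]);
```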

Examples

Agent Evaluation

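import { createScorer, runEvals } from "@mastra/core/evals";
// Assumed location of the agent under test; adjust the path to your project.
import { chatAgent } from "./agents/chat-agent";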

const myScorer = createScorer({
  id: "my-scorer",
  description: "Check if Agent's response contains ground truth",
  type: "agent",
}).generateScore(({ run }) => {
  const response = run.output[0]?.content || "";
  const expectedResponse = run.groundTruth;
  return response.includes(expectedResponse) ? 1 : 0;
});

const result = await runEvals({
  target: chatAgent,
  data: [
    {
      input: "What is AI?",
      groundTruth:
        "AI is a field of computer science that creates intelligent machines.",
    },
    {
      input: "How does machine learning work?",
      groundTruth:
        "Machine learning uses algorithms to learn patterns from data.",
    },
  ],
  scorers: [myScorer], // use the scorer defined above
  concurrency: 3,
});

Workflow Evaluation

const workflowResult = await runEvals({
  target: myWorkflow,
  data: [
    { input: { query: "Process this data", priority: "high" } },
    { input: { query: "Another task", priority: "low" } },
  ],
  scorers: {
    workflow: [outputQualityScorer],
    steps: {
      "validation-step": [validationScorer],
      "processing-step": [processingScorer],
    },
  },
  onItemComplete: ({ item, targetResult, scorerResults }) => {
    console.log(`Workflow completed for: ${item.input.query}`);
    if (scorerResults.workflow) {
      console.log("Workflow scores:", scorerResults.workflow);
    }
    if (scorerResults.steps) {
      console.log("Step scores:", scorerResults.steps);
    }
  },
});

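Related

createScorer() - Create custom scorers for experiments
MastraScorer - Learn about scorer structure and methods
Custom Scorers - Guide to building evaluation logic
Scorers Overview - Understand scorer concepts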
