Scorers overview
While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary even with the same input. Scorers bridge this gap by providing quantifiable metrics for measuring agent quality.
Scorers are automated tests that evaluate agent outputs using model-graded, rule-based, and statistical methods. Scorers return scores: numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores let you objectively track performance, compare different approaches, and identify areas for improvement in your AI systems. Scorers can be customized with your own prompts and scoring functions.
Scorers can run in the cloud, capturing real-time results. They can also run as part of your CI/CD pipeline, letting you test and monitor your agents over time.
Types of Scorers
There are different kinds of scorers, each serving a specific purpose. Here are some common types:
- Textual scorers: evaluate the accuracy, reliability, and contextual understanding of agent responses
- Classification scorers: measure the accuracy of data classification against predefined categories
- Prompt engineering scorers: explore the impact of different instructions and input formats
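As an illustration of the rule-based category (this is a standalone sketch, not part of the Mastra API), a simple scorer can be just a function that maps an output to a value between 0 and 1 — here, the fraction of required keywords found in the response:

```typescript
// Hypothetical rule-based scorer: returns the fraction of required
// keywords present in the output, as a score in the 0-1 range.
function keywordCoverageScore(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1; // nothing required, full score
  const text = output.toLowerCase();
  const hits = keywords.filter((k) => text.includes(k.toLowerCase())).length;
  return hits / keywords.length;
}
```

Model-graded scorers follow the same contract — take an output, return a 0-1 score — but delegate the judgment to an LLM instead of a rule.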
Installation
To use Mastra's scorers feature, install the @mastra/evals package.
- npm: npm install @mastra/evals@latest
- pnpm: pnpm add @mastra/evals@latest
- Yarn: yarn add @mastra/evals@latest
- Bun: bun add @mastra/evals@latest
Live evaluations
Live evaluations let you automatically score AI outputs in real time as agents and workflows run. Rather than executing evaluations manually or in batches, scorers run asynchronously alongside your AI systems, providing continuous quality monitoring.
Adding scorers to agents
You can add built-in scorers to your agents to automatically evaluate their outputs. See the full list of built-in scorers for all available options.
import { Agent } from "@mastra/core/agent";
import {
  createAnswerRelevancyScorer,
  createToxicityScorer,
} from "@mastra/evals/scorers/prebuilt";

export const evaluatedAgent = new Agent({
  name: "evaluated-agent",
  instructions: "...",
  model: "openai/gpt-4.1-nano",
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: "openai/gpt-4.1-nano" }),
      sampling: { type: "ratio", rate: 0.5 }, // score 50% of responses
    },
    safety: {
      scorer: createToxicityScorer({ model: "openai/gpt-4.1-nano" }),
      sampling: { type: "ratio", rate: 1 }, // score every response
    },
  },
});
Adding scorers to workflow steps
You can also add scorers to individual workflow steps to evaluate outputs at specific points in your process:
import { createWorkflow, createStep } from "@mastra/core/workflows";
import { z } from "zod";
import { customStepScorer } from "../scorers/custom-step-scorer";

const contentStep = createStep({
  id: "content-step",
  inputSchema: z.object({ topic: z.string() }),
  outputSchema: z.object({ content: z.string() }),
  execute: async ({ inputData }) => {
    // ... produce content from inputData
    return { content: "" };
  },
  scorers: {
    customStepScorer: {
      scorer: customStepScorer(),
      sampling: {
        type: "ratio",
        rate: 1, // Score every step execution
      },
    },
  },
});

export const contentWorkflow = createWorkflow({ ... })
  .then(contentStep)
  .commit();
How live evaluations work
Asynchronous execution: live evaluations run in the background without blocking agent responses or workflow execution. This ensures your AI systems keep their performance while being monitored.
Sampling control: the sampling.rate parameter (0-1) controls how many outputs get scored:

- 1.0: scores every response (100%)
- 0.5: scores half of all responses (50%)
- 0.1: scores 10% of responses
- 0.0: disables scoring
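The ratio sampling described above amounts to a single comparison against a uniform random draw; `shouldScore` here is a hypothetical helper for illustration, not Mastra's internal implementation:

```typescript
// Sketch of ratio sampling: score a response when a uniform draw in
// [0, 1) falls below the configured rate. A rate of 0 never scores,
// a rate of 1 always scores.
function shouldScore(rate: number, draw: number = Math.random()): boolean {
  return draw < rate;
}
```

Passing the draw as a parameter keeps the logic deterministic and easy to test; in production the default Math.random() draw makes scoring probabilistic per response.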
Automatic storage: all scoring results are automatically stored in the mastra_scorers table of your configured database, letting you analyze performance trends over time.
Trace evaluations
In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.
Observability required
To score traces, you must first configure observability in your Mastra instance to collect trace data. See the Tracing documentation for setup instructions.
Scoring traces with Studio
To score traces, you first need to register your scorers with your Mastra instance:
import { Mastra } from "@mastra/core";

const mastra = new Mastra({
  scorers: {
    answerRelevancy: myAnswerRelevancyScorer,
    responseQuality: myResponseQualityScorer,
  },
});
Once registered, you can score traces interactively within Studio under the Observability section. This provides a user-friendly interface for running scorers against historical traces.
Testing scorers locally
Mastra provides the mastra dev CLI command for testing your scorers. Studio includes a scorers section where you can run individual scorers against test inputs and view detailed results.
For more details, see the Studio docs.
Next steps