
Answer Similarity Scorer

The createAnswerSimilarityScorer() function creates a scorer that evaluates how similar an agent's output is to a ground truth answer. This scorer is specifically designed for CI/CD testing scenarios where you have expected answers and want to ensure consistency over time.

Parameters

- `model` (`LanguageModel`): The language model used to evaluate semantic similarity between outputs and ground truth.
- `options` (`AnswerSimilarityOptions`): Configuration options for the scorer.

AnswerSimilarityOptions

- `requireGroundTruth` (`boolean`, default: `true`): Whether to require ground truth for evaluation. If `false`, missing ground truth returns a score of 0.
- `semanticThreshold` (`number`, default: `0.8`): Weight for semantic matches vs. exact matches (0-1).
- `exactMatchBonus` (`number`, default: `0.2`): Additional score bonus for exact matches (0-1).
- `missingPenalty` (`number`, default: `0.15`): Penalty per missing key concept from the ground truth.
- `contradictionPenalty` (`number`, default: `1.0`): Penalty for contradictory information. The high default ensures wrong answers score near 0.
- `extraInfoPenalty` (`number`, default: `0.05`): Mild penalty for extra information not present in the ground truth (capped at 0.2).
- `scale` (`number`, default: `1`): Score scaling factor.

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but requires ground truth to be provided in the run object.
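A minimal sketch of a direct `.run()` call. The exact run-object shape is defined in the MastraScorer reference; the `input`/`output` values below are illustrative assumptions, while `groundTruth` is required for this scorer:

```typescript
import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";

const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o" });

// Illustrative sketch: the run-object shape follows the MastraScorer
// reference; groundTruth must be present for this scorer to score.
const runResult = await scorer.run({
  input: [{ role: "user", content: "What is the capital of France?" }],
  output: { text: "Paris is the capital of France." },
  groundTruth: "The capital of France is Paris",
});

console.log(runResult.score);  // 0-1 (or 0-scale with a custom scale)
console.log(runResult.reason); // human-readable explanation
```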

.run() Returns

- `runId` (`string`, optional): The id of the run.
- `score` (`number`): Similarity score between 0 and 1 (or 0 to `scale` if a custom scale is used). Higher scores indicate closer similarity to the ground truth.
- `reason` (`string`): Human-readable explanation of the score with actionable feedback.
- `preprocessStepResult` (`object`): Extracted semantic units from the output and the ground truth.
- `analyzeStepResult` (`object`): Detailed analysis of matches, contradictions, and extra information.
- `preprocessPrompt` (`string`): The prompt used for semantic unit extraction.
- `analyzePrompt` (`string`): The prompt used for similarity analysis.
- `generateReasonPrompt` (`string`): The prompt used for generating the explanation.

Scoring Details

The scorer uses a multi-step process:

  1. Extract: break the output and the ground truth down into semantic units
  2. Analyze: compare the units and identify matches, contradictions, and gaps
  3. Score: compute a weighted similarity with contradiction penalties applied
  4. Reason: generate a human-readable explanation

Score calculation: `max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale`
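To make the formula concrete, a hypothetical worked example with the default options (`scale = 1`): a response with a base score of 0.9, no contradictions, one missing key concept, and a little extra information would score 0.7.

```typescript
// Hypothetical numbers plugged into the documented formula:
// base 0.9, no contradiction, one missing concept (0.15), extra info (0.05).
const score = Math.max(0, 0.9 - 0 - 0.15 - 0.05) * 1; // 0.7
```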

Example

Evaluate agent responses for similarity to ground truth across different scenarios:

src/example-answer-similarity.ts

```typescript
import { runEvals } from "@mastra/core/evals";
import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o" });

const result = await runEvals({
  data: [
    {
      input: "What is 2+2?",
      groundTruth: "4",
    },
    {
      input: "What is the capital of France?",
      groundTruth: "The capital of France is Paris",
    },
    {
      input: "What are the primary colors?",
      groundTruth: "The primary colors are red, blue, and yellow",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);
```

For more details on runEvals, see the runEvals reference.

To add this scorer to an agent, see the Scorers overview guide.