Faithfulness Scorer

The createFaithfulnessScorer() function evaluates how factually accurate an LLM's output is compared to the provided context. It extracts claims from the output and verifies them against the context, making it essential for measuring the reliability of RAG pipeline responses.

Parameters

The createFaithfulnessScorer() function accepts a single options object with the following properties:

model: LanguageModel
Configuration for the model used to evaluate faithfulness.

context: string[]
Array of context chunks against which the output's claims will be verified.

scale: number = 1
The maximum score value. The final score will be normalized to this scale.
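
As a minimal sketch of these options (the model id and context strings below are illustrative placeholders, not values defined by this reference):

import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";

// Minimal sketch of the options described above; model id and context chunks are placeholders.
const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
  // Claims extracted from the output will be verified against these chunks.
  context: [
    "The Tesla Model 3 is an electric sedan.",
    "It was first delivered to customers in 2017.",
  ],
  // Keep the default 0–1 scale for the final score.
  scale: 1,
});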

This function returns an instance of the MastraScorer class. The .run() method accepts the same input as other scorers (see the MastraScorer reference), but the return value includes LLM-specific fields as documented below.

.run() Returns

runId: string
The id of the run (optional).

preprocessStepResult: string[]
Array of extracted claims from the output.

preprocessPrompt: string
The prompt sent to the LLM for the preprocess step (optional).

analyzeStepResult: object
Object with verdicts: { verdicts: Array<{ verdict: 'yes' | 'no' | 'unsure', reason: string }> }

analyzePrompt: string
The prompt sent to the LLM for the analyze step (optional).

score: number
A score between 0 and the configured scale, representing the proportion of claims that are supported by the context.

reason: string
A detailed explanation of the score, including which claims were supported, contradicted, or marked as unsure.

generateReasonPrompt: string
The prompt sent to the LLM for the generateReason step (optional).
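
Taken together, the fields above describe a result record along these lines (an illustrative type sketch assembled from this list, not a type exported by the package):

// Illustrative sketch of the .run() result fields documented above.
// Not an exported type; optionality follows the "(optional)" notes in this list.
interface FaithfulnessRunResult {
  runId?: string;
  preprocessStepResult: string[];
  preprocessPrompt?: string;
  analyzeStepResult: {
    verdicts: Array<{ verdict: "yes" | "no" | "unsure"; reason: string }>;
  };
  analyzePrompt?: string;
  score: number;
  reason: string;
  generateReasonPrompt?: string;
}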

Scoring Details

The scorer evaluates faithfulness through claim verification against the provided context.

Scoring Process

  1. Analyze claims and context:
    • Extract all claims (factual and speculative)
    • Verify each claim against the context
    • Assign one of three verdicts:
      • "yes" – the claim is supported by the context
      • "no" – the claim contradicts the context
      • "unsure" – the claim cannot be verified
  2. Calculate the faithfulness score:
    • Count the supported claims
    • Divide by the total number of claims
    • Scale to the configured range

Final score: (supported_claims / total_claims) * scale
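
As a concrete illustration, the calculation over the analyze-step verdicts reduces to the following sketch (the zero-claims behavior shown here is an assumption, not something documented above):

type Verdict = { verdict: "yes" | "no" | "unsure"; reason: string };

// Count supported claims, divide by total claims, and scale the result.
function faithfulnessScore(verdicts: Verdict[], scale = 1): number {
  if (verdicts.length === 0) return 0; // assumption: no claims yields 0
  const supported = verdicts.filter((v) => v.verdict === "yes").length;
  return (supported / verdicts.length) * scale;
}

// Example: 3 of 4 claims supported -> (3 / 4) * 1 = 0.75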

Score interpretation

A faithfulness score between 0 and 1:

  • 1.0: All claims are accurate and directly supported by the context.
  • 0.7–0.9: Most claims are accurate, with only minor additions or omissions.
  • 0.4–0.6: Some claims are supported, while others cannot be verified.
  • 0.1–0.3: Most of the content is inaccurate or unsupported.
  • 0.0: All claims are false or contradict the context.

Example

Evaluate agent responses for faithfulness to the provided context:

src/example-faithfulness.ts
import { runEvals } from "@mastra/core/evals";
import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

// Context is typically populated from agent tool calls or RAG retrieval
const scorer = createFaithfulnessScorer({
  model: "openai/gpt-4o",
});

const result = await runEvals({
  data: [
    {
      input: "Tell me about the Tesla Model 3.",
    },
    {
      input: "What are the key features of this electric vehicle?",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);

For more details on runEvals, see the runEvals reference.

To add this scorer to an agent, see the Scorers overview guide.
