Prompt Alignment Scorer

The createPromptAlignmentScorerLLM() function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.

Parameters

model: MastraModelConfig
The language model to use for evaluating prompt-response alignment.

options: PromptAlignmentOptions
Configuration options for the scorer.
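
The reference above lists only the option names. Based on the usage shown later on this page, the options object can be sketched roughly as follows (a TypeScript sketch inferred from the examples, not the library's actual type definition):

// Rough sketch of PromptAlignmentOptions, inferred from the examples on this page;
// consult the library's own exported types for the authoritative definition
interface PromptAlignmentOptionsSketch {
  scale?: number; // maximum score; defaults to 1, so scores range from 0 to 1
  evaluationMode?: "user" | "system" | "both"; // defaults to "both"
}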

.run() Returns

score: number
Multi-dimensional alignment score between 0 and scale (default 0-1)

reason: string
Human-readable explanation of the prompt alignment evaluation with detailed breakdown

The result returned by .run() has the following shape:

{
  runId: string,
  score: number,
  reason: string,
  analyzeStepResult: {
    intentAlignment: {
      score: number,
      primaryIntent: string,
      isAddressed: boolean,
      reasoning: string
    },
    requirementsFulfillment: {
      requirements: Array<{
        requirement: string,
        isFulfilled: boolean,
        reasoning: string
      }>,
      overallScore: number
    },
    completeness: {
      score: number,
      missingElements: string[],
      reasoning: string
    },
    responseAppropriateness: {
      score: number,
      formatAlignment: boolean,
      toneAlignment: boolean,
      reasoning: string
    },
    overallAssessment: string
  }
}
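
For example, here is a minimal sketch of how you might read the per-dimension breakdown from this shape. The input/output format and import path mirror the Basic Configuration example further down the page; the response text is illustrative only.

import { createPromptAlignmentScorerLLM } from "@mastra/evals";

const scorer = createPromptAlignmentScorerLLM({ model: "openai/gpt-5.1" });

const result = await scorer.run({
  input: [{ role: "user", content: "Write a Python function to calculate factorial with error handling" }],
  output: { role: "assistant", text: "def factorial(n): return 1 if n <= 1 else n * factorial(n - 1)" },
});

// Walk the analyzeStepResult breakdown returned alongside the overall score
const { intentAlignment, requirementsFulfillment, completeness, responseAppropriateness } =
  result.analyzeStepResult;

console.log(`Intent (${intentAlignment.score}): ${intentAlignment.reasoning}`);
for (const req of requirementsFulfillment.requirements) {
  console.log(`${req.isFulfilled ? "fulfilled" : "missing"}: ${req.requirement}`);
}
console.log(`Missing elements: ${completeness.missingElements.join(", ") || "none"}`);
console.log(`Format aligned: ${responseAppropriateness.formatAlignment}, tone aligned: ${responseAppropriateness.toneAlignment}`);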

Scoring Details

Scorer configuration

You can customize the Prompt Alignment Scorer by adjusting the scale parameter and evaluation mode to fit your scoring needs.

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

Multi-Dimensional Analysis

Prompt Alignment evaluates responses across four key dimensions, with weighted scoring that adapts based on the evaluation mode:

User Mode ('user')

Evaluates alignment with user prompts only:

  1. Intent Alignment (40% weight) - whether the response addresses the user's core request
  2. Requirements Fulfillment (30% weight) - whether all user requirements are met
  3. Completeness (20% weight) - whether the response fully addresses the user's needs
  4. Response Appropriateness (10% weight) - whether the format and tone match user expectations

System Mode ('system')

Evaluates compliance with system guidelines only:

  1. Intent Alignment (35% weight) - whether the response follows system behavioral guidelines
  2. Requirements Fulfillment (35% weight) - whether all system constraints are respected
  3. Completeness (15% weight) - whether the response adheres to all system rules
  4. Response Appropriateness (15% weight) - whether the format and tone match system specifications

Both Mode ('both' - default)

Combines evaluation of both user and system alignment:

  • User alignment: 70% of the final score (using user-mode weights)
  • System compliance: 30% of the final score (using system-mode weights)
  • Provides a balanced assessment of user satisfaction and system adherence

Scoring Formula

User mode:

Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
(completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale

System mode:

Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
(completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale

Both mode (default):

User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
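
As a worked example of these formulas (illustrative dimension scores only, not real scorer output), the arithmetic looks like this:

// Illustrative dimension scores on a 0-1 scale (not produced by the scorer)
const dims = { intent: 0.9, requirements: 0.8, completeness: 0.7, appropriateness: 1.0 };

// User-mode weights: 0.4 / 0.3 / 0.2 / 0.1
const userScore =
  dims.intent * 0.4 + dims.requirements * 0.3 + dims.completeness * 0.2 + dims.appropriateness * 0.1; // ≈ 0.84

// System-mode weights: 0.35 / 0.35 / 0.15 / 0.15
// (in "both" mode the scorer assesses user and system dimensions separately;
// the same numbers are reused here only to keep the arithmetic short)
const systemScore =
  dims.intent * 0.35 + dims.requirements * 0.35 + dims.completeness * 0.15 + dims.appropriateness * 0.15; // ≈ 0.85

// Both mode: 70% user, 30% system, then multiplied by the scale (default 1)
const scale = 1;
const finalScore = (userScore * 0.7 + systemScore * 0.3) * scale; // ≈ 0.843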

Weight Distribution Rationale

  • User mode: prioritizes intent (40%) and requirements (30%) for user satisfaction
  • System mode: balances behavioral compliance (35%) and constraint adherence (35%)
  • Both mode: the 70/30 split keeps user needs first while still enforcing system compliance

Score Interpretation

  • 0.9-1.0 = Excellent alignment across all dimensions
  • 0.8-0.9 = Very good alignment with minor gaps
  • 0.7-0.8 = Good alignment but missing some requirements or completeness
  • 0.6-0.7 = Moderate alignment with notable gaps
  • 0.4-0.6 = Poor alignment with significant issues
  • 0.0-0.4 = Very poor alignment; the response fails to address the prompt
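
If you want to map the numeric score back to these bands programmatically, a small helper might look like this (a hypothetical utility, not part of @mastra/evals; it assumes the default 0-1 scale):

// Hypothetical helper mirroring the interpretation bands above (default 0-1 scale)
function interpretAlignment(score: number): string {
  if (score >= 0.9) return "excellent alignment";
  if (score >= 0.8) return "very good alignment";
  if (score >= 0.7) return "good alignment";
  if (score >= 0.6) return "moderate alignment";
  if (score >= 0.4) return "poor alignment";
  return "very poor alignment";
}

console.log(interpretAlignment(0.75)); // "good alignment"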

When to Use Each Mode

User mode ('user') - use when:

  • Evaluating customer support responses for user satisfaction
  • Testing content generation quality from the user's perspective
  • Measuring how well responses answer user questions
  • Focusing on request fulfillment regardless of system constraints

System mode ('system') - use when:

  • Auditing AI safety and behavioral guideline compliance
  • Ensuring customer service agents follow brand voice and tone requirements
  • Verifying adherence to content policies and constraints
  • Testing system-level behavioral consistency

Both mode ('both') - use when (default, recommended):

  • Running a comprehensive evaluation of overall AI agent performance
  • Balancing user satisfaction with system compliance
  • Monitoring production systems where user and system requirements matter equally
  • Assessing prompt-response alignment holistically

Common Use Cases

Code Generation Evaluation

Ideal for evaluating:

  • Programming task completion
  • Code quality and completeness
  • Adherence to coding requirements
  • Format specifications (functions, classes, etc.)

// Example: API endpoint creation
const codePrompt =
  "Create a REST API endpoint with authentication and rate limiting";
// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)

Instruction Following Assessment

Perfect for:

  • Task completion verification
  • Multi-step instruction following
  • Requirement compliance checks
  • Educational content evaluation

// Example: Multi-requirement task
const taskPrompt =
  "Write a Python class with initialization, validation, error handling, and documentation";
// Scorer tracks each requirement individually and provides detailed breakdown

Content Format Validation

Useful for:

  • Format specification compliance
  • Style guide adherence
  • Output structure validation
  • Response appropriateness checks

// Example: Structured output
const formatPrompt =
  "Explain the differences between let and const in JavaScript using bullet points";
// Scorer evaluates content accuracy AND format compliance

Agent Response Quality

Measure how well your AI agents follow user instructions:

const agent = new Agent({
  name: "CodingAssistant",
  instructions:
    "You are a helpful coding assistant. Always provide working code examples.",
  model: "openai/gpt-5.1",
});

// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
});

// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" }, // Focus only on user request fulfillment
});

// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" }, // Check adherence to system instructions
});

const result = await scorer.run(agentRun);

Prompt Engineering Optimization

Test different prompts to improve alignment:

const prompts = [
  "Write a function to calculate factorial",
  "Create a Python function that calculates factorial with error handling for negative inputs",
  "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
];

// Compare alignment scores to find the best prompt
for (const prompt of prompts) {
  const result = await scorer.run(createTestRun(prompt, response));
  console.log(`Prompt alignment: ${result.score}`);
}

Multi-Agent System Evaluation

Compare different agents or models:

const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts

for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.run(prompt);
    const evaluation = await scorer.run({ input: prompt, output: response });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}

Examples

Basic Configuration

import { createPromptAlignmentScorerLLM } from "@mastra/evals";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

// Evaluate a code generation task
const result = await scorer.run({
  input: [
    {
      role: "user",
      content:
        "Write a Python function to calculate factorial with error handling",
    },
  ],
  output: {
    role: "assistant",
    text: `def factorial(n):
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0:
        return 1
    return n * factorial(n-1)`,
  },
});
// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }

Custom Configuration Examples

// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" },
});

// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" },
});

const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }

Format-Specific Evaluation

// Evaluate bullet point formatting
const result = await scorer.run({
  input: [
    {
      role: "user",
      content: "List the benefits of TypeScript in bullet points",
    },
  ],
  output: {
    role: "assistant",
    text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
  },
});
// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)

Excellent alignment example

In this example, the response fully addresses the user's prompt with all requirements met.

src/example-excellent-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python function to calculate factorial with error handling for negative numbers",
  },
];

const outputMessage = {
  text: `def factorial(n):
    """Calculate factorial of a number."""
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Excellent alignment output

The output receives a high score because it perfectly addresses the intent, fulfills all requirements, and uses the appropriate format.

{
  score: 0.95,
  reason: 'The score is 0.95 because the response perfectly addresses the primary intent of creating a factorial function and fulfills all requirements including Python implementation, error handling for negative numbers, and proper documentation. The code format is appropriate and the implementation is complete.'
}

Partial alignment example

In this example, the response addresses the core intent but misses some requirements or has format issues.

src/example-partial-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content: "List the benefits of TypeScript in bullet points",
  },
];

const outputMessage = {
  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Partial alignment output

The output receives a lower score because, while the content is accurate, it doesn't follow the requested format (bullet points).

{
  score: 0.75,
  reason: 'The score is 0.75 because the response addresses the intent of explaining TypeScript benefits and provides accurate information, but fails to use the requested bullet point format, resulting in lower appropriateness scoring.'
}

Poor alignment example

In this example, the response fails to address the user's specific requirements.

src/example-poor-prompt-alignment.ts
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python class with initialization, validation, error handling, and documentation",
  },
];

const outputMessage = {
  text: `class Example:
    def __init__(self, value):
        self.value = value`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);

Poor alignment output

The output receives a low score because it only partially fulfills the requirements, missing validation, error handling, and documentation.

{
  score: 0.35,
  reason: 'The score is 0.35 because while the response addresses the basic intent of creating a Python class with initialization, it fails to include validation, error handling, and documentation as specifically requested, resulting in incomplete requirement fulfillment.'
}

Evaluation Mode Examples

User Mode - Focus on User Prompt Only

Evaluates how well the response addresses the user's request, ignoring system instructions:

src/example-user-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" },
});

const result = await scorer.run({
  input: {
    inputMessages: [
      {
        role: "user",
        content: "Explain recursion with an example",
      },
    ],
    systemMessages: [
      {
        role: "system",
        content: "Always provide code examples in Python",
      },
    ],
  },
  output: {
    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
  },
});
// Scores high for addressing user request, even without Python code

System Mode - Focus on System Guidelines Only

Evaluates compliance with system behavioral guidelines and constraints:

src/example-system-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" },
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "You are a helpful assistant. Always be polite, concise, and provide examples.",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "What is machine learning?",
      },
    ],
  },
  output: {
    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
  },
});
// Evaluates politeness, conciseness, and example provision

Both Mode - Combined Evaluation (Default)

Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):

src/example-both-mode.ts
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "both" }, // This is the default
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "Always provide code examples when explaining programming concepts",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "Explain how to reverse a string",
      },
    ],
  },
  output: {
    text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:

def reverse_string(s):
    return s[::-1]

# Usage: reverse_string("hello") returns "olleh"`,
  },
});
// High score for both addressing the user's request AND following system guidelines

Comparison with Other Scorers

| Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
| --- | --- | --- | --- |
| Focus | Multi-dimensional prompt adherence | Query-answer relevancy | Grounding in context |
| Evaluates | Intent, requirements, completeness, format | Semantic similarity to the query | Consistency with the provided context |
| Use case | General prompt following | Information retrieval | RAG / context-based systems |
| Dimensions | 4 weighted dimensions | Single relevancy dimension | Single faithfulness dimension |

Related