Prompt Alignment Scorer
The createPromptAlignmentScorerLLM() function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.
Parameters
model: The model used to judge alignment. The examples on this page pass a model identifier such as "openai/gpt-5.1".
options: Optional settings: scale (maximum score; defaults to 1, so scores range 0-1) and evaluationMode ('user', 'system', or 'both'; defaults to 'both').
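The shapes below are a rough sketch inferred from the configuration examples on this page, not the library's published type definitions:

// Inferred (illustrative) parameter shape
type PromptAlignmentScorerConfig = {
  model: string; // e.g. "openai/gpt-5.1"
  options?: {
    scale?: number; // maximum score; defaults to 1 (scores range 0-1)
    evaluationMode?: "user" | "system" | "both"; // defaults to "both"
  };
};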
.run() Returns
score: A number between 0 and the configured scale (0-1 by default) representing overall prompt alignment.
reason: A human-readable explanation of the score.
The result returned by .run() has the following shape:
{
  runId: string,
  score: number,
  reason: string,
  analyzeStepResult: {
    intentAlignment: {
      score: number,
      primaryIntent: string,
      isAddressed: boolean,
      reasoning: string
    },
    requirementsFulfillment: {
      requirements: Array<{
        requirement: string,
        isFulfilled: boolean,
        reasoning: string
      }>,
      overallScore: number
    },
    completeness: {
      score: number,
      missingElements: string[],
      reasoning: string
    },
    responseAppropriateness: {
      score: number,
      formatAlignment: boolean,
      toneAlignment: boolean,
      reasoning: string
    },
    overallAssessment: string
  }
}
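For example, once a run completes you can read the overall score alongside the per-dimension analysis. The field names follow the shape above; testRun stands in for whatever payload you pass to .run():

const result = await scorer.run(testRun);

console.log(result.score); // overall weighted score
console.log(result.reason); // natural-language explanation

// Per-dimension details from the analysis step
const { intentAlignment, requirementsFulfillment, completeness } = result.analyzeStepResult;
console.log(intentAlignment.primaryIntent, intentAlignment.isAddressed);
for (const req of requirementsFulfillment.requirements) {
  console.log(`${req.requirement}: ${req.isFulfilled ? "fulfilled" : "missing"}`);
}
console.log(completeness.missingElements);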
Scoring Details
Scorer configuration
You can customize the Prompt Alignment Scorer by adjusting the scale parameter and evaluation mode to fit your scoring needs.
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});
Multi-Dimensional Analysis
Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:
User Mode ('user')
Evaluates alignment with user prompts only:
- Intent Alignment (40% weight) - whether the response addresses the user's core request
- Requirements Fulfillment (30% weight) - whether all user requirements are met
- Completeness (20% weight) - whether the response fully addresses the user's needs
- Response Appropriateness (10% weight) - whether format and tone match user expectations
System Mode ('system')
Evaluates compliance with system guidelines only:
- Intent Alignment (35% weight) - whether the response follows system behavioral guidelines
- Requirements Fulfillment (35% weight) - whether all system constraints are respected
- Completeness (15% weight) - whether the response adheres to all system rules
- Response Appropriateness (15% weight) - whether format and tone match system specifications
Both Mode ('both' - default)
Combines evaluation of both user and system alignment:
- User alignment: 70% of the final score (using user-mode weights)
- System compliance: 30% of the final score (using system-mode weights)
- Provides a balanced assessment of user satisfaction and system adherence
Scoring Formula
User mode:
Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
                 (completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale

System mode:
Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
                 (completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale

Both mode (default):
User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
Weight rationale:
- User mode: prioritizes intent (40%) and requirements (30%) to reflect user satisfaction
- System mode: balances behavioral compliance (35%) and constraints (35%)
- Both mode: the 70/30 split keeps user needs primary while maintaining system compliance
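As a quick sanity check on these weights, the arithmetic can be reproduced directly. This is only an illustrative sketch of the formulas above, not library code, and the per-dimension scores are made-up values:

// Hypothetical per-dimension scores from the analysis step (0-1 range)
const user = { intent: 0.9, requirements: 0.8, completeness: 0.7, appropriateness: 1.0 };
const system = { intent: 0.95, requirements: 0.9, completeness: 0.85, appropriateness: 0.9 };
const scale = 1;

// User-mode weighting: 40/30/20/10
const userScore =
  user.intent * 0.4 + user.requirements * 0.3 + user.completeness * 0.2 + user.appropriateness * 0.1;

// System-mode weighting: 35/35/15/15
const systemScore =
  system.intent * 0.35 + system.requirements * 0.35 + system.completeness * 0.15 + system.appropriateness * 0.15;

// Both mode (default): 70% user, 30% system
const finalScore = (userScore * 0.7 + systemScore * 0.3) * scale;
// userScore = 0.84, systemScore = 0.91, finalScore ≈ 0.861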
Score Interpretation
- 0.9-1.0 = Excellent alignment across all dimensions
- 0.8-0.9 = Very good alignment with only minor gaps
- 0.7-0.8 = Good alignment but missing some requirements or completeness
- 0.6-0.7 = Moderate alignment with noticeable gaps
- 0.4-0.6 = Poor alignment with significant issues
- 0.0-0.4 = Very poor alignment; the response fails to address the prompt effectively
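If you want to bucket scores programmatically (for dashboards or regression thresholds, say), a simple mapping of these bands might look like the following; the cut-offs mirror the list above and assume the default 0-1 scale:

// Maps a 0-1 alignment score to the interpretation bands listed above.
function interpretAlignment(score: number): string {
  if (score >= 0.9) return "excellent alignment";
  if (score >= 0.8) return "very good alignment";
  if (score >= 0.7) return "good alignment";
  if (score >= 0.6) return "moderate alignment";
  if (score >= 0.4) return "poor alignment";
  return "very poor alignment";
}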
When to Use Each Mode
User Mode ('user') - use when:
- Evaluating customer-facing responses for user satisfaction
- Testing content generation quality from the user's perspective
- Measuring how well responses answer user questions
- Focusing on request fulfillment regardless of system constraints
System Mode ('system') - use when:
- Auditing AI safety and behavioral guideline compliance
- Ensuring customer-facing agents follow brand voice and tone requirements
- Validating adherence to content policies and constraints
- Testing system-level behavioral consistency
Both Mode ('both') - use when (default, recommended):
- Running comprehensive evaluation of overall AI agent performance
- Balancing user satisfaction with system compliance
- Monitoring production systems where user and system requirements matter equally
- Performing holistic assessment of prompt-response alignment
Common Use Cases
Code Generation Evaluation
Ideal for evaluating:
- Coding task completion
- Code quality and completeness
- Adherence to coding requirements
- Format compliance (functions, classes, etc.)
// Example: API endpoint creation
const codePrompt =
  "Create a REST API endpoint with authentication and rate limiting";

// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)
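To make that concrete, a minimal hypothetical run might look like the following; scorer is assumed to be configured as shown earlier, and the assistant text is a shortened stand-in response:

const result = await scorer.run({
  input: [{ role: "user", content: codePrompt }],
  output: {
    role: "assistant",
    text: "Here's an Express endpoint using JWT authentication and a rate-limiting middleware: ...",
  },
});

// Overall weighted score plus the per-requirement breakdown (auth, rate limiting)
console.log(result.score);
console.log(result.analyzeStepResult.requirementsFulfillment.requirements);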
Instruction Following Assessment
Perfect for:
- Task completion verification
- Multi-step instruction adherence
- Requirement compliance checking
- Educational content evaluation
// Example: Multi-requirement task
const taskPrompt =
  "Write a Python class with initialization, validation, error handling, and documentation";

// Scorer tracks each requirement individually and provides a detailed breakdown
Content Format Validation
Useful for:
- Format specification compliance
- Style guide adherence
- Output structure validation
- Response appropriateness checking
// Example: Structured output
const formatPrompt =
  "Explain the differences between let and const in JavaScript using bullet points";

// Scorer evaluates content accuracy AND format compliance
Agent Response Quality
Measure how well your AI agents follow user instructions:
const agent = new Agent({
  name: "CodingAssistant",
  instructions:
    "You are a helpful coding assistant. Always provide working code examples.",
  model: "openai/gpt-5.1",
});

// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
});

// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" }, // Focus only on user request fulfillment
});

// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" }, // Check adherence to system instructions
});

const result = await scorer.run(agentRun);
Prompt Engineering Optimization
Test different prompts to improve alignment:
const prompts = [
  "Write a function to calculate factorial",
  "Create a Python function that calculates factorial with error handling for negative inputs",
  "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
];

// Compare alignment scores to find the best prompt
// (createTestRun and response are placeholders for however you build the scorer input)
for (const prompt of prompts) {
  const result = await scorer.run(createTestRun(prompt, response));
  console.log(`Prompt alignment: ${result.score}`);
}
Multi-Agent System Evaluation
Compare different agents or models:
const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts

for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.run(prompt);
    const evaluation = await scorer.run({ input: prompt, output: response });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}
Examples
Basic Configuration
import { createPromptAlignmentScorerLLM } from "@mastra/evals";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

// Evaluate a code generation task
const result = await scorer.run({
  input: [
    {
      role: "user",
      content:
        "Write a Python function to calculate factorial with error handling",
    },
  ],
  output: {
    role: "assistant",
    text: `def factorial(n):
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0:
        return 1
    return n * factorial(n-1)`,
  },
});

// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
Custom Configuration Examples
// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: "both", // 'user', 'system', or 'both' (default)
  },
});

// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" },
});

// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" },
});

const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }
Format-Specific Evaluation
// Evaluate bullet point formatting
const result = await scorer.run({
  input: [
    {
      role: "user",
      content: "List the benefits of TypeScript in bullet points",
    },
  ],
  output: {
    role: "assistant",
    text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
  },
});

// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
Excellent alignment example
In this example, the response fully addresses the user's prompt with all requirements met.
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python function to calculate factorial with error handling for negative numbers",
  },
];

const outputMessage = {
  text: `def factorial(n):
    """Calculate factorial of a number."""
    if n < 0:
        raise ValueError("Factorial not defined for negative numbers")
    if n == 0 or n == 1:
        return 1
    return n * factorial(n - 1)`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
Excellent alignment output
The output receives a high score because it perfectly addresses the intent, fulfills all requirements, and uses appropriate format.
{
  score: 0.95,
  reason: 'The score is 0.95 because the response perfectly addresses the primary intent of creating a factorial function and fulfills all requirements including Python implementation, error handling for negative numbers, and proper documentation. The code format is appropriate and the implementation is complete.'
}
Partial alignment example
In this example, the response addresses the core intent but misses some requirements or has format issues.
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content: "List the benefits of TypeScript in bullet points",
  },
];

const outputMessage = {
  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
Partial alignment output
The output receives a lower score because while the content is accurate, it doesn't follow the requested format (bullet points).
{
  score: 0.75,
  reason: 'The score is 0.75 because the response addresses the intent of explaining TypeScript benefits and provides accurate information, but fails to use the requested bullet point format, resulting in lower appropriateness scoring.'
}
Poor alignment example
In this example, the response fails to address the user's specific requirements.
import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";

const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
});

const inputMessages = [
  {
    role: "user",
    content:
      "Write a Python class with initialization, validation, error handling, and documentation",
  },
];

const outputMessage = {
  text: `class Example:
    def __init__(self, value):
        self.value = value`,
};

const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
});

console.log(result);
Poor alignment output
The output receives a low score because it only partially fulfills the requirements, missing validation, error handling, and documentation.
{
  score: 0.35,
  reason: 'The score is 0.35 because while the response addresses the basic intent of creating a Python class with initialization, it fails to include validation, error handling, and documentation as specifically requested, resulting in incomplete requirement fulfillment.'
}
Evaluation Mode Examples
User Mode - Focus on User Prompt Only
Evaluates how well the response addresses the user's request, ignoring system instructions:
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "user" },
});

const result = await scorer.run({
  input: {
    inputMessages: [
      {
        role: "user",
        content: "Explain recursion with an example",
      },
    ],
    systemMessages: [
      {
        role: "system",
        content: "Always provide code examples in Python",
      },
    ],
  },
  output: {
    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
  },
});

// Scores high for addressing user request, even without Python code
System Mode - Focus on System Guidelines Only
Evaluates compliance with system behavioral guidelines and constraints:
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "system" },
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "You are a helpful assistant. Always be polite, concise, and provide examples.",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "What is machine learning?",
      },
    ],
  },
  output: {
    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
  },
});

// Evaluates politeness, conciseness, and example provision
Both Mode - Combined Evaluation (Default)
Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):
const scorer = createPromptAlignmentScorerLLM({
  model: "openai/gpt-5.1",
  options: { evaluationMode: "both" }, // This is the default
});

const result = await scorer.run({
  input: {
    systemMessages: [
      {
        role: "system",
        content:
          "Always provide code examples when explaining programming concepts",
      },
    ],
    inputMessages: [
      {
        role: "user",
        content: "Explain how to reverse a string",
      },
    ],
  },
  output: {
    text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:

def reverse_string(s):
    return s[::-1]

# Usage: reverse_string("hello") returns "olleh"`,
  },
});

// High score for both addressing the user's request AND following system guidelines
Comparison with Other Scorers
| Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
|---|---|---|---|
| Focus | Multi-dimensional prompt adherence | Query-response relevance | Context grounding |
| Evaluates | Intent, requirements, completeness, format | Semantic similarity to the query | Consistency with provided context |
| Use case | General prompt following | Information retrieval | RAG/context-based systems |
| Dimensions | 4 weighted dimensions | Single relevance dimension | Single faithfulness dimension |
Related
- Answer Relevancy Scorer - evaluates query-response relevance
- Faithfulness Scorer - measures context groundedness
- Tool Call Accuracy Scorer - evaluates tool selection
- Custom Scorers - create your own evaluation metrics