Tool Call Accuracy Scorers
Mastra provides two tool call accuracy scorers for evaluating whether an LLM selects the correct tools from available options:
- Code-based scorer - uses exact tool matching for deterministic evaluation
- LLM-based scorer - uses AI for semantic evaluation of appropriateness
Choosing Between Scorers
Use the Code-Based Scorer When:
- You need deterministic, repeatable results
- You want to test exact tool matching
- You need to verify specific tool sequences
- Speed and cost are priorities (no LLM calls)
- You are running automated tests
Use the LLM-Based Scorer When:
- You need semantic understanding of appropriateness
- Tool selection depends on context and intent
- You want to handle edge cases like clarification requests
- You need explanations for scoring decisions
- You are evaluating production agent behavior
Code-Based Tool Call Accuracy Scorer
The createToolCallAccuracyScorerCode() function from @mastra/evals/scorers/prebuilt provides deterministic binary scoring based on exact tool matching and supports both strict and lenient evaluation modes, as well as tool calling order validation.
Parameters
- expectedTool: name of the tool expected to be called
- strictMode: (optional) when true, applies exact-match rules (exactly one tool in single tool mode; no extra tools in order checking mode)
- expectedToolOrder: (optional) array of tool names that must be called in sequence
This function returns an instance of the MastraScorer class. See the MastraScorer reference for details on the .run() method and its input/output.
Evaluation Modes
The code-based scorer operates in two distinct modes:
Single Tool Mode
When expectedToolOrder is not provided, the scorer evaluates single tool selection:
- Standard mode (strictMode: false): returns 1 if the expected tool is called, regardless of whether other tools are also used
- Strict mode (strictMode: true): returns 1 only when exactly one tool is called and it matches the expected tool
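The two single-tool decisions can be sketched as follows. This is an illustrative reimplementation of the rules, not Mastra's actual code; `scoreSingleTool` is a hypothetical helper:

```typescript
// Illustrative sketch of single tool mode scoring, given the list of tool
// names the agent actually called.
function scoreSingleTool(
  actualTools: string[],
  expectedTool: string,
  strictMode: boolean,
): number {
  if (strictMode) {
    // Strict: exactly one call, and it must be the expected tool.
    return actualTools.length === 1 && actualTools[0] === expectedTool ? 1 : 0;
  }
  // Standard: the expected tool appears anywhere among the calls.
  return actualTools.includes(expectedTool) ? 1 : 0;
}

// Standard mode tolerates extra tools; strict mode does not.
scoreSingleTool(["search-tool", "weather-tool"], "weather-tool", false); // 1
scoreSingleTool(["search-tool", "weather-tool"], "weather-tool", true); // 0
```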
Order Checking Mode
When expectedToolOrder is provided, the scorer validates the tool calling sequence:
- Strict order (strictMode: true): tools must be called in exactly the specified order, with no extra tools
- Flexible order (strictMode: false): expected tools must appear in the correct relative order (extra tools are allowed)
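The order check amounts to an exact-sequence match in strict mode and a subsequence match in flexible mode. A minimal sketch of that logic (an illustration, not Mastra's implementation; `scoreToolOrder` is a hypothetical helper):

```typescript
// Illustrative sketch of order checking mode scoring.
function scoreToolOrder(
  actualTools: string[],
  expectedOrder: string[],
  strictMode: boolean,
): number {
  if (strictMode) {
    // Strict order: the call sequence must match exactly, no extra tools.
    return actualTools.length === expectedOrder.length &&
        expectedOrder.every((tool, i) => actualTools[i] === tool)
      ? 1
      : 0;
  }
  // Flexible order: the expected tools must appear as a subsequence of the
  // actual calls (extra tools in between are allowed).
  let next = 0;
  for (const tool of actualTools) {
    if (tool === expectedOrder[next]) next++;
  }
  return next === expectedOrder.length ? 1 : 0;
}

const calls = ["auth-tool", "log-tool", "fetch-tool"];
scoreToolOrder(calls, ["auth-tool", "fetch-tool"], false); // 1 - relative order kept
scoreToolOrder(calls, ["auth-tool", "fetch-tool"], true); // 0 - extra log-tool
```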
Code-Based Scoring Details
- Binary score: always returns 0 or 1
- Deterministic: the same input always produces the same output
- Fast: no external API calls
Code-Based Scorer Options
import { createToolCallAccuracyScorerCode as createCodeScorer } from "@mastra/evals/scorers/prebuilt";

// Standard mode - passes if expected tool is called
const lenientScorer = createCodeScorer({
  expectedTool: "search-tool",
  strictMode: false,
});

// Strict mode - only passes if exactly one tool is called
const strictScorer = createCodeScorer({
  expectedTool: "search-tool",
  strictMode: true,
});

// Order checking with strict mode
const strictOrderScorer = createCodeScorer({
  expectedTool: "step1-tool",
  expectedToolOrder: ["step1-tool", "step2-tool", "step3-tool"],
  strictMode: true, // no extra tools allowed
});
Code-Based Scorer Results
{
  runId: string,
  preprocessStepResult: {
    expectedTool: string,
    actualTools: string[],
    strictMode: boolean,
    expectedToolOrder?: string[],
    hasToolCalls: boolean,
    correctToolCalled: boolean,
    correctOrderCalled: boolean | null,
    toolCallInfos: ToolCallInfo[]
  },
  score: number // Always 0 or 1
}
Code-Based Scorer Examples
The code-based scorer provides deterministic, binary scoring (0 or 1) based on exact tool matching.
Correct tool selection
const scorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
});

// Simulate LLM input and output with tool call
const inputMessages = [
  createTestMessage({
    content: "What is the weather like in New York today?",
    role: "user",
    id: "input-1",
  }),
];

const output = [
  createTestMessage({
    content: "Let me check the weather for you.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-123",
        toolName: "weather-tool",
        args: { location: "New York" },
        result: { temperature: "72°F", condition: "sunny" },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output });
const result = await scorer.run(run);

console.log(result.score); // 1
console.log(result.preprocessStepResult?.correctToolCalled); // true
Strict mode evaluation
Only passes if exactly one tool is called:
const strictScorer = createToolCallAccuracyScorerCode({
  expectedTool: "weather-tool",
  strictMode: true,
});

// Multiple tools called - fails in strict mode
const output = [
  createTestMessage({
    content: "Let me help you with that.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "search-tool",
        args: {},
        result: {},
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "weather-tool",
        args: { location: "New York" },
        result: { temperature: "20°C" },
        state: "result",
      }),
    ],
  }),
];

// inputMessages as in the previous example
const run = createAgentTestRun({ inputMessages, output });
const result = await strictScorer.run(run);

console.log(result.score); // 0 - fails because multiple tools were called
Tool order validation
Validates that tools are called in a specific sequence:
const orderScorer = createToolCallAccuracyScorerCode({
  expectedTool: "auth-tool", // ignored when order is specified
  expectedToolOrder: ["auth-tool", "fetch-tool"],
  strictMode: true, // no extra tools allowed
});

const output = [
  createTestMessage({
    content: "I will authenticate and fetch the data.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "auth-tool",
        args: { token: "abc123" },
        result: { authenticated: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "fetch-tool",
        args: { endpoint: "/data" },
        result: { data: ["item1"] },
        state: "result",
      }),
    ],
  }),
];

// inputMessages as in the earlier examples
const run = createAgentTestRun({ inputMessages, output });
const result = await orderScorer.run(run);

console.log(result.score); // 1 - correct order
Flexible order mode
Allows extra tools as long as expected tools maintain relative order:
const flexibleOrderScorer = createToolCallAccuracyScorerCode({
  expectedTool: "auth-tool",
  expectedToolOrder: ["auth-tool", "fetch-tool"],
  strictMode: false, // allows extra tools
});

const output = [
  createTestMessage({
    content: "Performing comprehensive operation.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-1",
        toolName: "auth-tool",
        args: { token: "abc123" },
        result: { authenticated: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-2",
        toolName: "log-tool", // Extra tool - OK in flexible mode
        args: { message: "Starting fetch" },
        result: { logged: true },
        state: "result",
      }),
      createToolInvocation({
        toolCallId: "call-3",
        toolName: "fetch-tool",
        args: { endpoint: "/data" },
        result: { data: ["item1"] },
        state: "result",
      }),
    ],
  }),
];

// inputMessages as in the earlier examples
const run = createAgentTestRun({ inputMessages, output });
const result = await flexibleOrderScorer.run(run);

console.log(result.score); // 1 - auth-tool comes before fetch-tool
LLM-Based Tool Call Accuracy Scorer
The createToolCallAccuracyScorerLLM() function from @mastra/evals/scorers/prebuilt uses an LLM to evaluate whether the tools called by an agent are appropriate for the given user request, providing semantic evaluation rather than exact matching.
Parameters
- model: the model used to perform the evaluation
- availableTools: array of { name, description } entries describing the tools available to the agent
Features
The LLM-based scorer provides:
- Semantic evaluation: understands context and user intent
- Appropriateness assessment: distinguishes between "helpful" and "appropriate" tools
- Clarification handling: recognizes when agents appropriately ask for clarification
- Missing tool detection: identifies tools that should have been called
- Reasoning generation: provides explanations for scoring decisions
Evaluation Process
- Extract tool calls: identifies the tools invoked in the agent's output
- Analyze appropriateness: evaluates each tool against the user's request
- Generate score: computes the score from appropriate versus total tool calls
- Generate reasoning: provides a human-readable explanation
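The scoring step can be pictured as a simple ratio. A minimal sketch under that assumption (the exact rule is internal to Mastra; `computeScore` is a hypothetical helper, and the `evaluations` shape mirrors the analyzeStepResult documented in the results section):

```typescript
// Per-tool judgment produced by the evaluating LLM.
interface ToolEvaluation {
  toolCalled: string;
  wasAppropriate: boolean;
  reasoning: string;
}

// Assumed scoring rule: fraction of tool calls judged appropriate.
function computeScore(evaluations: ToolEvaluation[]): number {
  // No tool calls at all can still be correct, e.g. an appropriate
  // clarification request.
  if (evaluations.length === 0) return 1;
  const appropriate = evaluations.filter((e) => e.wasAppropriate).length;
  return appropriate / evaluations.length;
}

computeScore([
  { toolCalled: "search-tool", wasAppropriate: false, reasoning: "weather-tool fits better" },
  { toolCalled: "weather-tool", wasAppropriate: true, reasoning: "matches the request" },
]); // 0.5
```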
LLM-Based Scoring Details
- Score: returns a value between 0.0 and 1.0
- Context-aware: considers user intent and appropriateness
- Explanations: provides reasoning for the score
LLM-Based Scorer Options
import { createToolCallAccuracyScorerLLM as createLLMScorer } from "@mastra/evals/scorers/prebuilt";

// Basic configuration
const basicLLMScorer = createLLMScorer({
  model: "openai/gpt-5.1",
  availableTools: [
    { name: "tool1", description: "Description 1" },
    { name: "tool2", description: "Description 2" },
  ],
});

// With different model
const customModelScorer = createLLMScorer({
  model: "openai/gpt-5", // More powerful model for complex evaluations
  availableTools: [...],
});
LLM-Based Scorer Results
{
  runId: string,
  score: number, // 0.0 to 1.0
  reason: string, // Human-readable explanation
  analyzeStepResult: {
    evaluations: Array<{
      toolCalled: string,
      wasAppropriate: boolean,
      reasoning: string
    }>,
    missingTools?: string[]
  }
}
LLM-Based Scorer Examples
The LLM-based scorer uses AI to evaluate whether tool selections are appropriate for the user's request.
Basic LLM evaluation
const llmScorer = createToolCallAccuracyScorerLLM({
  model: "openai/gpt-5.1",
  availableTools: [
    {
      name: "weather-tool",
      description: "Get current weather information for any location",
    },
    {
      name: "calendar-tool",
      description: "Check calendar events and scheduling",
    },
    {
      name: "search-tool",
      description: "Search the web for general information",
    },
  ],
});

const inputMessages = [
  createTestMessage({
    content: "What is the weather like in San Francisco today?",
    role: "user",
    id: "input-1",
  }),
];

const output = [
  createTestMessage({
    content: "Let me check the current weather for you.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-123",
        toolName: "weather-tool",
        args: { location: "San Francisco", date: "today" },
        result: { temperature: "68°F", condition: "foggy" },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output });
const result = await llmScorer.run(run);

console.log(result.score); // 1.0 - appropriate tool usage
console.log(result.reason); // "The agent correctly used the weather-tool to address the user's request for weather information."
Handling inappropriate tool usage
const inputMessages = [
  createTestMessage({
    content: "What is the weather in Tokyo?",
    role: "user",
    id: "input-1",
  }),
];

const inappropriateOutput = [
  createTestMessage({
    content: "Let me search for that information.",
    role: "assistant",
    id: "output-1",
    toolInvocations: [
      createToolInvocation({
        toolCallId: "call-456",
        toolName: "search-tool", // Less appropriate than weather-tool
        args: { query: "Tokyo weather" },
        result: { results: ["Tokyo weather data..."] },
        state: "result",
      }),
    ],
  }),
];

const run = createAgentTestRun({ inputMessages, output: inappropriateOutput });
const result = await llmScorer.run(run);

console.log(result.score); // 0.5 - partially appropriate
console.log(result.reason); // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."
Evaluating clarification requests
The LLM scorer recognizes when agents appropriately ask for clarification:
const vagueInput = [
  createTestMessage({
    content: "I need help with something",
    role: "user",
    id: "input-1",
  }),
];

const clarificationOutput = [
  createTestMessage({
    content: "I'd be happy to help! Could you please provide more details about what you need assistance with?",
    role: "assistant",
    id: "output-1",
    // No tools called - asking for clarification instead
  }),
];

const run = createAgentTestRun({
  inputMessages: vagueInput,
  output: clarificationOutput,
});

const result = await llmScorer.run(run);

console.log(result.score); // 1.0 - appropriate to ask for clarification
console.log(result.reason); // "The agent appropriately asked for clarification rather than calling tools with insufficient information."
Comparing Both Scorers
Here's an example using both scorers on the same data:
import {
  createToolCallAccuracyScorerCode as createCodeScorer,
  createToolCallAccuracyScorerLLM as createLLMScorer,
} from "@mastra/evals/scorers/prebuilt";

// Setup both scorers
const codeScorer = createCodeScorer({
  expectedTool: "weather-tool",
  strictMode: false,
});

const llmScorer = createLLMScorer({
  model: "openai/gpt-5.1",
  availableTools: [
    { name: "weather-tool", description: "Get weather information" },
    { name: "search-tool", description: "Search the web" },
  ],
});

// Test data
const run = createAgentTestRun({
  inputMessages: [
    createTestMessage({
      content: "What is the weather?",
      role: "user",
      id: "input-1",
    }),
  ],
  output: [
    createTestMessage({
      content: "Let me find that information.",
      role: "assistant",
      id: "output-1",
      toolInvocations: [
        createToolInvocation({
          toolCallId: "call-1",
          toolName: "search-tool",
          args: { query: "weather" },
          result: { results: ["weather data"] },
          state: "result",
        }),
      ],
    }),
  ],
});

// Run both scorers
const codeResult = await codeScorer.run(run);
const llmResult = await llmScorer.run(run);

console.log("Code Scorer:", codeResult.score); // 0 - wrong tool
console.log("LLM Scorer:", llmResult.score); // 0.3 - partially appropriate
console.log("LLM Reason:", llmResult.reason); // Explains why search-tool is less appropriate
Related