Reference: .chunk()
The .chunk() function splits documents into smaller segments using various strategies and options.
Example
import { MDocument } from "@mastra/rag";
const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.
## Section 1
Here is the first section with some content.
## Section 2
Here is another section with different content.
`);
// Basic chunking with defaults
const chunks = await doc.chunk();
// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});
Parameters
The following parameters are available for all chunking strategies. Important: Each strategy uses only the subset of these parameters relevant to its use case.
strategy?:
maxSize?:
overlap?:
lengthFunction?:
separatorPosition?:
addStartIndex?:
stripWhitespace?:
extract?:
See the ExtractParams reference for details on the extract parameter.
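To make the roles of maxSize and overlap concrete, here is a minimal sketch of a size-based sliding-window splitter. It is not Mastra's implementation: `slidingWindowChunks` is a hypothetical helper, and it assumes sizes are measured in characters, which is only one possible length function.

```typescript
// Hypothetical helper, NOT Mastra's implementation. It illustrates what
// `maxSize` and `overlap` usually mean for a size-based splitter,
// assuming length is measured in characters.
function slidingWindowChunks(
  text: string,
  maxSize: number,
  overlap: number,
): string[] {
  if (maxSize <= 0) throw new Error("maxSize must be positive");
  if (overlap < 0 || overlap >= maxSize) {
    throw new Error("overlap must satisfy 0 <= overlap < maxSize");
  }

  const step = maxSize - overlap; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxSize));
    // Stop once the current window has reached the end of the text.
    if (start + maxSize >= text.length) break;
  }
  return chunks;
}

const windows = slidingWindowChunks("abcdefghij", 4, 1);
console.log(windows); // ["abcd", "defg", "ghij"]
```

Each chunk repeats the last character of the previous one, which is the effect an overlap of 1 is meant to have: context at chunk boundaries is not lost.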
Strategy-Specific Options
Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:
// Character strategy example
const chunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const chunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const chunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
});

// HTML strategy example
const chunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const chunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const chunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const chunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});
The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
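One way to picture this flat layout is as a union of configuration types keyed on strategy. The sketch below is an illustration only: `GeneralOptions`, `CharacterConfig`, `TokenConfig`, `ChunkConfig`, and `strategyOptionKeys` are hypothetical names, and the types actually exported by @mastra/rag may be named and shaped differently.

```typescript
// Hypothetical types for illustration only; the types actually exported
// by @mastra/rag may be named and shaped differently.
type GeneralOptions = { maxSize?: number; overlap?: number };

type CharacterConfig = GeneralOptions & {
  strategy: "character";
  separator?: string;
  isSeparatorRegex?: boolean;
};

type TokenConfig = GeneralOptions & {
  strategy: "token";
  encodingName?: string;
  modelName?: string;
};

// General and strategy-specific options sit side by side in one object.
type ChunkConfig = CharacterConfig | TokenConfig;

// Lists the strategy-specific keys of a flat config, i.e. every key
// that is not one of the general options modeled above.
function strategyOptionKeys(config: ChunkConfig): string[] {
  const general = new Set(["strategy", "maxSize", "overlap"]);
  return Object.keys(config).filter((key) => !general.has(key));
}

const flatConfig: ChunkConfig = {
  strategy: "character",
  separator: ".",
  isSeparatorRegex: false,
  maxSize: 300,
};

console.log(strategyOptionKeys(flatConfig)); // ["separator", "isSeparatorRegex"]
```

With a union like this, the compiler rejects a token-only option such as encodingName on a character config, which is the practical benefit of keeping options flat but keyed on strategy.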
Character
separators?:
isSeparatorRegex?:
Recursive
separators?:
isSeparatorRegex?:
language?:
Sentence
maxSize:
minSize?:
targetSize?:
sentenceEnders?:
fallbackToWords?:
fallbackToCharacters?:
HTML
headers:
sections:
returnEachLine?:
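Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If both are provided, sections is ignored.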
Markdown
headers?:
stripHeaders?:
returnEachLine?:
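Important: When the headers option is used, the markdown strategy ignores all general options and splits content according to the Markdown header structure. To use size-based chunking with Markdown, omit the headers parameter.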
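Header-based splitting can be sketched roughly as follows. This is a simplified illustration, not Mastra's implementation: `splitByHeaders` and `HeaderChunk` are hypothetical names, and the sketch assumes header lines are removed from the chunk text (similar in spirit to stripHeaders: true) and recorded as metadata instead.

```typescript
// Hypothetical sketch, NOT Mastra's implementation. Each [marker, key]
// pair maps a Markdown header marker (e.g. "##") to a metadata key.
interface HeaderChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(
  markdown: string,
  headers: [string, string][],
): HeaderChunk[] {
  const chunks: HeaderChunk[] = [];
  const current: Record<string, string> = {}; // headers in effect so far
  let buffer: string[] = [];

  // Emit the buffered body lines as one chunk, if there is any text.
  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ text, metadata: { ...current } });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    // "## Title" matches "##" but not "#", because the marker must be
    // followed directly by a space.
    const match = headers.find(([marker]) => line.startsWith(marker + " "));
    if (!match) {
      buffer.push(line);
      continue;
    }
    flush();
    const [marker, key] = match;
    // A new header invalidates metadata from deeper header levels.
    for (const [m, k] of headers) {
      if (m.length > marker.length) delete current[k];
    }
    current[key] = line.slice(marker.length + 1).trim();
  }
  flush();
  return chunks;
}

const headerChunks = splitByHeaders(
  [
    "# Introduction",
    "This is a sample document.",
    "## Section 1",
    "Here is the first section.",
    "## Section 2",
    "Here is another section.",
  ].join("\n"),
  [
    ["#", "title"],
    ["##", "section"],
  ],
);

console.log(headerChunks.length); // 3
console.log(headerChunks[1].metadata); // { title: "Introduction", section: "Section 1" }
```

Note that no size limit appears anywhere in the sketch: chunk boundaries come entirely from the header structure, which is why size-based options have no effect in this mode.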
Semantic Markdown
joinThreshold?:
modelName?:
encodingName?:
allowedSpecial?:
disallowedSpecial?:
Token
encodingName?:
modelName?:
allowedSpecial?:
disallowedSpecial?:
JSON
maxSize:
minSize?:
ensureAscii?:
convertLists?:
Latex
The Latex strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.
Return Value
Returns an MDocument instance containing the chunked documents. Each chunk includes:
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
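As a small illustration of working with this shape, the sketch below redeclares the interface so that it is self-contained and operates on hand-written sample nodes. The helpers `needsEmbedding` and `textsBySection`, and the `section` metadata key, are assumptions made for the example and are not part of the library.

```typescript
// Redeclared here so the sketch is self-contained.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hand-written sample nodes; real chunks would come from a chunked document.
const sampleNodes: DocumentNode[] = [
  { text: "Here is the first section.", metadata: { section: "Section 1" } },
  {
    text: "Here is another section.",
    metadata: { section: "Section 2" },
    embedding: [0.1, 0.2],
  },
];

// Hypothetical helper: select the chunks that still lack an embedding.
function needsEmbedding(nodes: DocumentNode[]): DocumentNode[] {
  return nodes.filter((node) => node.embedding === undefined);
}

// Hypothetical helper: collect chunk texts under their `section` metadata.
function textsBySection(nodes: DocumentNode[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const node of nodes) {
    const section = String(node.metadata.section ?? "(none)");
    grouped.set(section, [...(grouped.get(section) ?? []), node.text]);
  }
  return grouped;
}

console.log(needsEmbedding(sampleNodes).length); // 1
console.log(textsBySection(sampleNodes).get("Section 2")); // ["Here is another section."]
```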