Reference: .chunk()

The .chunk() function splits documents into smaller segments using various strategies and options.

Example

import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1

Here is the first section with some content.

## Section 2

Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});

Parameters

The following parameters are available for all chunking strategies. Important: Each strategy will only utilize a subset of these parameters relevant to its specific use case.

strategy?:

'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'
The chunking strategy to use. If not specified, the default is chosen based on document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy also accepts additional strategy-specific options (see Strategy-Specific Options below).

maxSize?:

number
= 4000
Maximum size of each chunk. **Note:** Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.

overlap?:

number
= 50
Number of characters/tokens that overlap between chunks.

lengthFunction?:

(text: string) => number
Function to calculate text length. Defaults to character count.

separatorPosition?:

'start' | 'end'
Where to position the separator in chunks. 'start' attaches to beginning of next chunk, 'end' attaches to end of current chunk. If not specified, separators are discarded.

addStartIndex?:

boolean
= false
Whether to add start index metadata to chunks.

stripWhitespace?:

boolean
= true
Whether to strip whitespace from chunks.

extract?:

ExtractParams
Metadata extraction configuration.

See ExtractParams reference for details on the extract parameter.
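
The general options can be combined as needed on strategies that honor them. Below is a minimal sketch that uses only parameters documented above; the values are illustrative:

const chunks = await doc.chunk({
  strategy: "recursive",
  maxSize: 1000,
  overlap: 100,
  lengthFunction: (text) => text.length, // character count, the default behavior
  separatorPosition: "end", // keep each separator at the end of its chunk
  addStartIndex: true, // record each chunk's start offset in its metadata
  stripWhitespace: true,
});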

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

// Character strategy example
const chunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const chunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const chunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
});

// HTML strategy example
const chunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const chunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const chunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const chunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

separator?:

string
The character or string to split on, as shown in the character strategy example above.

isSeparatorRegex?:

boolean
= false
Whether the separator is a regex pattern

Recursive

separators?:

string[]
Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean
= false
Whether the separators are regex patterns

language?:

Language
Programming or markup language for language-specific splitting behavior. See Language enum for supported values.

Sentence

maxSize:

number
Maximum size of each chunk (required for sentence strategy)

minSize?:

number
= 50
Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.

targetSize?:

number
Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

sentenceEnders?:

string[]
= ['.', '!', '?']
Array of characters that mark sentence endings for splitting boundaries.

fallbackToWords?:

boolean
= true
Whether to fall back to word-level splitting for sentences that exceed maxSize.

fallbackToCharacters?:

boolean
= true
Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.

HTML

headers:

Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If both are provided, sections is ignored.
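
For section-based splitting, pass sections instead of headers. A minimal sketch, assuming an HTML document created with MDocument.fromHTML; the selectors and metadata keys are illustrative:

const htmlDoc = MDocument.fromHTML(`
<h1>Guide</h1>
<section><h2>Setup</h2><p>Install the package.</p></section>
<section><h2>Usage</h2><p>Call the chunker.</p></section>
`);

const sectionChunks = await htmlDoc.chunk({
  strategy: "html",
  sections: [
    ["h1", "title"],
    ["h2", "section"],
  ], // section-based splitting; general options are ignored
});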

Markdown

headers?:

Array<[string, string]>
Array of [header level, metadata key] pairs

stripHeaders?:

boolean
Whether to remove headers from the output

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When the headers option is used, the Markdown strategy ignores all general options and splits content according to the Markdown header structure. To use size-based chunking with Markdown, omit the headers parameter.
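
Conversely, when headers is omitted the Markdown strategy falls back to size-based chunking and the general options apply. A minimal sketch:

const sizedChunks = await doc.chunk({
  strategy: "markdown",
  maxSize: 500, // honored because no headers option is provided
  overlap: 50,
});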

Semantic Markdown

joinThreshold?:

number
= 500
Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.

modelName?:

string
Name of the model for tokenization. If provided, the model's underlying tokenization `encodingName` will be used.

encodingName?:

string
= cl100k_base
Name of the token encoding to use. Derived from `modelName` if available.

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
= all
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

Token

encodingName?:

string
Name of the token encoding to use

modelName?:

string
Name of the model for tokenization

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens
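
A brief sketch of the special-token options, matching the Set<string> | 'all' types documented above; the encoding name and token shown are illustrative:

const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "cl100k_base",
  maxSize: 512, // general option
  allowedSpecial: new Set(["<|endoftext|>"]), // treat only this special token as special
  disallowedSpecial: "all", // disallow all other special tokens
});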

JSON

maxSize:

number
Maximum size of each chunk

minSize?:

number
Minimum size of each chunk

ensureAscii?:

boolean
Whether to ensure ASCII encoding

convertLists?:

boolean
Whether to convert lists in the JSON
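
A minimal sketch of the JSON strategy, assuming MDocument.fromJSON accepts a JSON string; the sizes are illustrative:

const jsonDoc = MDocument.fromJSON(
  JSON.stringify({ users: [{ name: "Ada" }, { name: "Grace" }] }),
);

const jsonChunks = await jsonDoc.chunk({
  strategy: "json",
  maxSize: 300,
  minSize: 50,
  ensureAscii: false, // keep non-ASCII characters as-is
  convertLists: true, // convert lists in the JSON before splitting
});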

LaTeX

The LaTeX strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.
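
A minimal sketch, assuming a LaTeX source loaded with MDocument.fromText; only the general options are used, as noted above:

const latexDoc = MDocument.fromText(String.raw`
\section{Introduction}
The quadratic formula is $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.
`);

const latexChunks = await latexDoc.chunk({
  strategy: "latex",
  maxSize: 600,
  overlap: 50,
});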

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:

interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
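
A brief sketch of inspecting the result, assuming the awaited value can be indexed as an array of DocumentNode objects, as the examples above suggest:

const result = await doc.chunk({ strategy: "recursive", maxSize: 500 });

console.log(result[0].text); // the chunk's text content
console.log(result[0].metadata); // header or extracted metadata, if any
console.log(result[0].embedding); // undefined until embeddings are generated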