Reference: .chunk()

The .chunk() function splits documents into smaller segments using various strategies and options.

Example

import { MDocument } from "@mastra/rag";

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1

Here is the first section with some content.

## Section 2

Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true, // Extract keywords with default settings
  },
});

Parameters

The following parameters are available for all chunking strategies. Important: Each strategy will only utilize a subset of these parameters relevant to its specific use case.

strategy?:

'recursive' | 'character' | 'token' | 'markdown' | 'semantic-markdown' | 'html' | 'json' | 'latex' | 'sentence'
The chunking strategy to use. If not specified, the default is chosen based on document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Each strategy also accepts additional strategy-specific options (see Strategy-Specific Options below).

maxSize?:

number
= 4000
Maximum size of each chunk. **Note:** Some strategy configurations (markdown with headers, HTML with headers) ignore this parameter.

overlap?:

number
= 50
Number of characters/tokens that overlap between chunks.

lengthFunction?:

(text: string) => number
Function to calculate text length. Defaults to character count.

separatorPosition?:

'start' | 'end'
Where to position the separator in chunks. 'start' attaches to beginning of next chunk, 'end' attaches to end of current chunk. If not specified, separators are discarded.

addStartIndex?:

boolean
= false
Whether to add start index metadata to chunks.

stripWhitespace?:

boolean
= true
Whether to strip whitespace from chunks.

extract?:

ExtractParams
Metadata extraction configuration.

See ExtractParams reference for details on the extract parameter.
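
The general options can be combined as needed on strategies that honor them. Below is a minimal sketch that uses only parameters documented above; the values are illustrative:

const chunks = await doc.chunk({
  strategy: "recursive",
  maxSize: 1000,
  overlap: 100,
  lengthFunction: (text) => text.length, // character count, the default behavior
  separatorPosition: "end", // keep each separator at the end of its chunk
  addStartIndex: true, // record each chunk's start offset in its metadata
  stripWhitespace: true,
});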

Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

// Character strategy example
const chunks = await doc.chunk({
  strategy: "character",
  separator: ".", // Character-specific option
  isSeparatorRegex: false, // Character-specific option
  maxSize: 300, // general option
});

// Recursive strategy example
const chunks = await doc.chunk({
  strategy: "recursive",
  separators: ["\n\n", "\n", " "], // Recursive-specific option
  language: "markdown", // Recursive-specific option
  maxSize: 500, // general option
});

// Sentence strategy example
const chunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450, // Required for sentence strategy
  minSize: 50, // Sentence-specific option
  sentenceEnders: ["."], // Sentence-specific option
  fallbackToCharacters: false, // Sentence-specific option
});

// HTML strategy example
const chunks = await doc.chunk({
  strategy: "html",
  headers: [
    ["h1", "title"],
    ["h2", "subtitle"],
  ], // HTML-specific option
});

// Markdown strategy example
const chunks = await doc.chunk({
  strategy: "markdown",
  headers: [
    ["#", "title"],
    ["##", "section"],
  ], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
});

// Semantic Markdown strategy example
const chunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500, // Semantic Markdown-specific option
  modelName: "gpt-3.5-turbo", // Semantic Markdown-specific option
});

// Token strategy example
const chunks = await doc.chunk({
  strategy: "token",
  encodingName: "gpt2", // Token-specific option
  modelName: "gpt-3.5-turbo", // Token-specific option
  maxSize: 1000, // general option
});

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.

Character

separator?:

string
The character or string to split on, as shown in the character strategy example above.

isSeparatorRegex?:

boolean
= false
Whether the separator is a regex pattern

Recursive

separators?:

string[]
Array of separators to try in order of preference. The strategy will attempt to split on the first separator, then fall back to subsequent ones.

isSeparatorRegex?:

boolean
= false
Whether the separators are regex patterns

language?:

Language
Programming or markup language for language-specific splitting behavior. See Language enum for supported values.

Sentence

maxSize:

number
Maximum size of each chunk (required for sentence strategy)

minSize?:

number
= 50
Minimum size of each chunk. Chunks smaller than this will be merged with adjacent chunks when possible.

targetSize?:

number
Preferred target size for chunks. Defaults to 80% of maxSize. The strategy will try to create chunks close to this size.

sentenceEnders?:

string[]
= ['.', '!', '?']
Array of characters that mark sentence endings for splitting boundaries.

fallbackToWords?:

boolean
= true
Whether to fall back to word-level splitting for sentences that exceed maxSize.

fallbackToCharacters?:

boolean
= true
Whether to fall back to character-level splitting for words that exceed maxSize. Only applies if fallbackToWords is enabled.

HTML

headers:

Array<[string, string]>
Array of [selector, metadata key] pairs for header-based splitting

sections:

Array<[string, string]>
Array of [selector, metadata key] pairs for section-based splitting

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When using the HTML strategy, all general options are ignored. Use headers for header-based splitting or sections for section-based splitting. If both are provided, sections is ignored.
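
For section-based splitting, pass sections instead of headers. A minimal sketch, assuming an HTML document created with MDocument.fromHTML; the selectors and metadata keys are illustrative:

const htmlDoc = MDocument.fromHTML(`
<h1>Guide</h1>
<section><h2>Setup</h2><p>Install the package.</p></section>
<section><h2>Usage</h2><p>Call the chunker.</p></section>
`);

const sectionChunks = await htmlDoc.chunk({
  strategy: "html",
  sections: [
    ["h1", "title"],
    ["h2", "section"],
  ], // section-based splitting; general options are ignored
});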

Markdown

headers?:

Array<[string, string]>
Array of [header level, metadata key] pairs

stripHeaders?:

boolean
Whether to remove headers from the output

returnEachLine?:

boolean
Whether to return each line as a separate chunk

Important: When the headers option is used, the Markdown strategy ignores all general options and splits content according to the Markdown header structure. To use size-based chunking with Markdown, omit the headers parameter.
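
Conversely, when headers is omitted the Markdown strategy falls back to size-based chunking and the general options apply. A minimal sketch:

const sizedChunks = await doc.chunk({
  strategy: "markdown",
  maxSize: 500, // honored because no headers option is provided
  overlap: 50,
});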

Semantic Markdown

joinThreshold?:

number
= 500
Maximum token count for merging related sections. Sections exceeding this limit individually are left intact, but smaller sections are merged with siblings or parents if the combined size stays under this threshold.

modelName?:

string
Name of the model for tokenization. If provided, the model's underlying tokenization `encodingName` will be used.

encodingName?:

string
= cl100k_base
Name of the token encoding to use. Derived from `modelName` if available.

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
= all
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens

Token

encodingName?:

string
Name of the token encoding to use

modelName?:

string
Name of the model for tokenization

allowedSpecial?:

Set<string> | 'all'
Set of special tokens allowed during tokenization, or 'all' to allow all special tokens

disallowedSpecial?:

Set<string> | 'all'
Set of special tokens to disallow during tokenization, or 'all' to disallow all special tokens
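
A brief sketch of the special-token options, matching the Set<string> | 'all' types documented above; the encoding name and token shown are illustrative:

const tokenChunks = await doc.chunk({
  strategy: "token",
  encodingName: "cl100k_base",
  maxSize: 512, // general option
  allowedSpecial: new Set(["<|endoftext|>"]), // treat only this special token as special
  disallowedSpecial: "all", // disallow all other special tokens
});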

JSON

maxSize:

number
Maximum size of each chunk

minSize?:

number
Minimum size of each chunk

ensureAscii?:

boolean
Whether to ensure ASCII encoding

convertLists?:

boolean
Whether to convert lists in the JSON
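
A minimal sketch of the JSON strategy, assuming MDocument.fromJSON accepts a JSON string; the sizes are illustrative:

const jsonDoc = MDocument.fromJSON(
  JSON.stringify({ users: [{ name: "Ada" }, { name: "Grace" }] }),
);

const jsonChunks = await jsonDoc.chunk({
  strategy: "json",
  maxSize: 300,
  minSize: 50,
  ensureAscii: false, // keep non-ASCII characters as-is
  convertLists: true, // convert lists in the JSON before splitting
});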

LaTeX

The LaTeX strategy uses only the general chunking options listed above. It provides LaTeX-aware splitting optimized for mathematical and academic documents.
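
A minimal sketch, assuming a LaTeX source loaded with MDocument.fromText; only the general options are used, as noted above:

const latexDoc = MDocument.fromText(String.raw`
\section{Introduction}
The quadratic formula is $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.
`);

const latexChunks = await latexDoc.chunk({
  strategy: "latex",
  maxSize: 600,
  overlap: 50,
});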

Return Value

Returns an MDocument instance containing the chunked documents. Each chunk includes:

interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
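
A brief sketch of inspecting the result, assuming the awaited value can be indexed as an array of DocumentNode objects, as the examples above suggest:

const result = await doc.chunk({ strategy: "recursive", maxSize: 500 });

console.log(result[0].text); // the chunk's text content
console.log(result[0].metadata); // header or extracted metadata, if any
console.log(result[0].embedding); // undefined until embeddings are generated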