
Chunking and Embedding Documents

Before processing, create an MDocument instance from your content. You can initialize it from various formats:

const docFromText = MDocument.fromText("Your plain text content...");
const docFromHTML = MDocument.fromHTML("<html>Your HTML content...</html>");
const docFromMarkdown = MDocument.fromMarkdown("# Your Markdown content...");
const docFromJSON = MDocument.fromJSON(`{ "key": "value" }`);

Step 1: Document Processing

Use chunk to split documents into manageable pieces. Mastra supports multiple chunking strategies optimized for different document types:

  • recursive: Intelligent splitting based on content structure
  • character: Simple character-based splitting
  • token: Token-based splitting
  • markdown: Markdown-aware splitting
  • semantic-markdown: Markdown splitting based on related header families
  • html: HTML structure-aware splitting
  • json: JSON structure-aware splitting
  • latex: LaTeX structure-aware splitting
  • sentence: Sentence-aware splitting
note

Each strategy accepts different parameters optimized for its chunking approach.

Here's an example of how to use the recursive strategy:

const chunks = await doc.chunk({
  strategy: "recursive",
  maxSize: 512,
  overlap: 50,
  separators: ["\n"],
  extract: {
    metadata: true, // Optionally extract metadata
  },
});
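To build intuition for how the maxSize and overlap parameters interact, here is a deliberately simplified character-based sketch. This is not Mastra's actual implementation (the recursive strategy also respects content structure and separators); it only illustrates how overlapping windows preserve context across chunk boundaries:

```typescript
// Simplified sketch of size/overlap chunking -- not Mastra's implementation.
// Each chunk holds at most `maxSize` characters, and consecutive chunks
// share `overlap` characters so context isn't lost at the boundaries.
function simpleChunk(text: string, maxSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + maxSize));
    if (start + maxSize >= text.length) break;
    start += maxSize - overlap; // step forward, keeping `overlap` chars shared
  }
  return chunks;
}

simpleChunk("abcdefghij", 4, 2);
// → ["abcd", "cdef", "efgh", "ghij"]
```

Note how each chunk repeats the last two characters of the previous one; that repetition is what lets a retrieved chunk carry enough surrounding context to be useful on its own.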

For text where preserving sentence structure is important, here's an example of how to use the sentence strategy:

const chunks = await doc.chunk({
  strategy: "sentence",
  maxSize: 450,
  minSize: 50,
  overlap: 0,
  sentenceEnders: ["."],
});
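The core idea behind sentence-aware splitting can be sketched in a few lines: split on the sentence enders first, then pack whole sentences greedily into chunks no larger than maxSize, so no sentence is ever cut in half. Again, this is a simplified illustration, not Mastra's implementation:

```typescript
// Simplified sketch of sentence-aware chunking -- not Mastra's implementation.
// Split on the enders (keeping them attached), then pack whole sentences
// into chunks of at most `maxSize` characters.
function sentenceChunk(text: string, maxSize: number, enders: string[]): string[] {
  const enderClass = enders.join("");
  const pattern = new RegExp(`[^${enderClass}]+[${enderClass}]?`, "g");
  const sentences = (text.match(pattern) ?? []).map((s) => s.trim()).filter(Boolean);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxSize && current) {
      chunks.push(current); // current chunk is full; start a new one
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

sentenceChunk("One. Two. Three.", 10, ["."]);
// → ["One. Two.", "Three."]
```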

For markdown documents where preserving the semantic relationships between sections is important, here's an example of how to use the semantic-markdown strategy:

const chunks = await doc.chunk({
  strategy: "semantic-markdown",
  joinThreshold: 500,
  modelName: "gpt-3.5-turbo",
});
note

Metadata extraction may use large language model calls, so make sure your API key is set.

We go deeper into chunking strategies in our chunk() reference documentation.

Step 2: Embedding Generation

Transform chunks into embeddings using your preferred provider. Mastra supports embedding models through the model router.

Using the Model Router

The simplest way is to use Mastra's model router with provider/model strings:

import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { embedMany } from "ai";

const { embeddings } = await embedMany({
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small"),
  values: chunks.map((chunk) => chunk.text),
});

Mastra supports OpenAI and Google embedding models. For a complete list of supported embedding models, see the embeddings reference.

The model router automatically handles API key detection from environment variables.

The embedding functions return vectors, arrays of numbers representing the semantic meaning of your text, ready for similarity searches in your vector database.
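Similarity between two such vectors is typically measured with cosine similarity: 1 means the vectors point in the same direction (very similar meaning), 0 means they are unrelated. A minimal implementation, useful for understanding what a vector database computes under the hood:

```typescript
// Cosine similarity between two embedding vectors.
// Returns 1 for identical direction, 0 for orthogonal (unrelated) vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // → 1
cosineSimilarity([1, 0], [0, 1]); // → 0
```

In practice you never compute this by hand over a whole corpus; the vector database indexes the embeddings so that nearest-neighbor queries stay fast at scale.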

Configuring Embedding Dimensions

Embedding models typically output vectors with a fixed number of dimensions (e.g., 1536 for OpenAI's text-embedding-3-small). Some models support reducing this dimensionality, which can help:

  • Reduce storage requirements in your vector database
  • Lower the computational cost of similarity searches
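The storage impact is easy to estimate: assuming embeddings are stored as 32-bit floats (4 bytes per dimension, a common default, though actual overhead varies by database), reducing dimensionality shrinks storage proportionally:

```typescript
// Rough storage estimate for float32 embeddings: 4 bytes per dimension.
// Actual databases add per-row and index overhead on top of this.
function embeddingStorageBytes(count: number, dimensions: number): number {
  return count * dimensions * 4;
}

// One million chunks at full vs. reduced dimensionality:
embeddingStorageBytes(1_000_000, 1536); // → 6_144_000_000 (~6.1 GB)
embeddingStorageBytes(1_000_000, 256); // → 1_024_000_000 (~1.0 GB)
```

A 1536 → 256 reduction cuts raw vector storage by a factor of six, at the cost of some retrieval accuracy; whether that trade-off is acceptable depends on your corpus and should be validated with your own queries.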

Here are some supported models:

OpenAI (text-embedding-3 models):

import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { embedMany } from "ai";

const { embeddings } = await embedMany({
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small"),
  options: {
    dimensions: 256, // Only supported in text-embedding-3 and later
  },
  values: chunks.map((chunk) => chunk.text),
});

Google (gemini-embedding-001):

const { embeddings } = await embedMany({
  model: new ModelRouterEmbeddingModel("google/gemini-embedding-001", {
    outputDimensionality: 256, // Truncates excessive values from the end
  }),
  values: chunks.map((chunk) => chunk.text),
});
Vector Database Compatibility

When storing embeddings, the vector database index must be configured to match the output size of your embedding model. If the dimensions do not match, you may get errors or data corruption.
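One way to catch such a mismatch early is to validate vector lengths before upserting. This is a defensive sketch, not part of Mastra's API; real vector stores will typically raise their own errors on mismatch, but failing fast in your own code gives a clearer message:

```typescript
// Guard against dimension mismatch before upserting into a vector index.
// A defensive sketch -- not part of Mastra's API.
function assertDimensionsMatch(embeddings: number[][], indexDimension: number): void {
  for (const [i, vector] of embeddings.entries()) {
    if (vector.length !== indexDimension) {
      throw new Error(
        `Embedding ${i} has ${vector.length} dimensions, but the index expects ${indexDimension}`,
      );
    }
  }
}
```

For example, an index created for text-embedding-3-small at its default size would use indexDimension 1536; if you later switch to a 256-dimensional configuration, this check fails immediately instead of corrupting the index.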

Example: Complete Pipeline

Here's an example showing document processing and embedding generation with both providers:

import { embedMany } from "ai";
import { ModelRouterEmbeddingModel } from "@mastra/core/llm";
import { MDocument } from "@mastra/rag";

// Initialize document
const doc = MDocument.fromText(`
Climate change poses significant challenges to global agriculture.
Rising temperatures and changing precipitation patterns affect crop yields.
`);

// Create chunks
const chunks = await doc.chunk({
  strategy: "recursive",
  maxSize: 256,
  overlap: 50,
});

// Generate embeddings with OpenAI
const { embeddings } = await embedMany({
  model: new ModelRouterEmbeddingModel("openai/text-embedding-3-small"),
  values: chunks.map((chunk) => chunk.text),
});

// OR

// Generate embeddings with Cohere
const { embeddings } = await embedMany({
  model: "cohere/embed-english-v3.0",
  values: chunks.map((chunk) => chunk.text),
});

// Store embeddings in your vector database
await vectorStore.upsert({
  indexName: "embeddings",
  vectors: embeddings,
});

For more examples of different chunking strategies and embedding configurations, see:

For more details on vector databases and embeddings, see: