Storing Embeddings in a Vector Database

After generating embeddings, you need to store them in a database that supports vector similarity search. Mastra provides a consistent interface for storing and querying embeddings across various vector databases.

Supported Databases

vector-store.ts
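import { MongoDBVector } from "@mastra/mongodb";

// Connection settings are read from environment variables
const store = new MongoDBVector({
  id: "mongodb-vector",
  uri: process.env.MONGODB_URI,
  dbName: process.env.MONGODB_DATABASE,
});

// Create the index first; the dimension must match the embedding model
await store.createIndex({
  indexName: "myCollection",
  dimension: 1536,
});

// `embeddings` and `chunks` are the outputs of the earlier embedding step
await store.upsert({
  indexName: "myCollection",
  vectors: embeddings,
  metadata: chunks.map((chunk) => ({ text: chunk.text })),
});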

Using MongoDB Atlas Vector Search

For detailed setup instructions and best practices, see the official MongoDB Atlas Vector Search documentation.

Using Vector Storage

Once initialized, all vector stores share the same interface for creating indexes, upserting embeddings, and querying.

Creating Indexes

Before storing embeddings, you need to create an index with the appropriate dimension size for your embedding model:

store-embeddings.ts
// Create an index with dimension 1536 (for text-embedding-3-small)
await store.createIndex({
  indexName: "myCollection",
  dimension: 1536,
});

The dimension size must match the output dimension of your chosen embedding model. Common dimension sizes are:

  • OpenAI text-embedding-3-small: 1536 dimensions (or custom, e.g., 256)
  • Cohere embed-multilingual-v3: 1024 dimensions
  • Google gemini-embedding-001: 768 dimensions (or custom)
warning

Index dimensions cannot be changed after creation. To use a different model, delete the index and recreate it with the new dimension size.
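
Because a dimension mismatch only surfaces once vectors reach the store, it can help to check vectors before upserting. The following is a minimal sketch; `assertDimension` is a hypothetical helper written for this example and is not part of Mastra.

```typescript
// Hypothetical helper (not a Mastra API): verify that every vector has the
// dimension the index was created with before calling store.upsert().
function assertDimension(vectors: number[][], expected: number): void {
  vectors.forEach((vector, position) => {
    if (vector.length !== expected) {
      throw new Error(
        `Vector at position ${position} has ${vector.length} dimensions, expected ${expected}`,
      );
    }
  });
}

// A 3-dimensional index accepts only 3-dimensional vectors.
assertDimension([[0.1, 0.2, 0.3]], 3); // passes silently
```

In practice the expected value would be the same number passed as `dimension` to `createIndex`, for example 1536 for text-embedding-3-small.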

Naming Rules for Databases

Each vector database enforces specific naming conventions for indexes and collections to ensure compatibility and prevent conflicts.

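Collection (index) names must:

  • Start with a letter or underscore
  • Be at most 120 bytes long
  • Contain only letters, numbers, underscores, or dots
  • Not contain $ or the null character

Examples:

  • my_collection.123 is valid
  • my-index is invalid (contains a hyphen)
  • My$Collection is invalid (contains $)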
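
The rules above can be checked before calling `createIndex`. This is a sketch only: `isValidCollectionName` is a hypothetical helper, and it assumes that "letters" means ASCII letters, which the rules do not state explicitly.

```typescript
// Hypothetical helper (not a Mastra API) that encodes the naming rules above.
// Assumption: "letters" means ASCII letters A-Z and a-z.
function isValidCollectionName(name: string): boolean {
  // First character: letter or underscore. Remaining characters: letters,
  // digits, underscores, or dots. This also excludes "$" and the null character.
  const pattern = /^[A-Za-z_][A-Za-z0-9_.]*$/;
  // The pattern admits only ASCII, so string length equals byte length here.
  return name.length <= 120 && pattern.test(name);
}

console.log(isValidCollectionName("my_collection.123")); // true
console.log(isValidCollectionName("my-index")); // false (hyphen)
console.log(isValidCollectionName("My$Collection")); // false ($)
```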

Upserting Embeddings

After creating an index, you can store embeddings along with their basic metadata:

store-embeddings.ts
// Store embeddings with their corresponding metadata
await store.upsert({
  indexName: "myCollection", // index name
  vectors: embeddings, // array of embedding vectors
  metadata: chunks.map((chunk) => ({
    text: chunk.text, // The original text content
    id: chunk.id, // Optional unique identifier
  })),
});

The upsert operation:

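  • Takes an array of embedding vectors and their corresponding metadata
  • Updates existing vectors if they share the same ID
  • Creates new vectors if they don't exist
  • Automatically handles batching for large datasets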
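
To make the pairing and batching ideas concrete, here is a small standalone sketch. It is not Mastra's internal implementation: `toBatches` and `VectorRecord` are hypothetical names, and the positional pairing of each vector with the metadata entry at the same position is an assumption drawn from the examples in this document.

```typescript
// Illustrative only: pair each vector with the metadata entry at the same
// position, then split the pairs into batches of at most `batchSize` records.
interface VectorRecord {
  vector: number[];
  metadata: Record<string, unknown>;
}

function toBatches(
  vectors: number[][],
  metadata: Record<string, unknown>[],
  batchSize: number,
): VectorRecord[][] {
  if (vectors.length !== metadata.length) {
    throw new Error("vectors and metadata must have the same length");
  }
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const records = vectors.map((vector, i) => ({ vector, metadata: metadata[i] }));
  const batches: VectorRecord[][] = [];
  for (let start = 0; start < records.length; start += batchSize) {
    batches.push(records.slice(start, start + batchSize));
  }
  return batches;
}
```

Since `upsert` already batches large datasets, a helper like this is only useful for reasoning about the behavior or for stores accessed outside Mastra.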

Adding Metadata

Vector stores support rich metadata (any JSON-serializable fields) for filtering and organization. Since metadata is stored with no fixed schema, use consistent field naming to avoid unexpected query results.

warning

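Metadata is crucial for vector storage. Without it, you would have only numerical embeddings, with no way to return the original text or filter results. Always store at least the source text as metadata.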

// Store embeddings with rich metadata for better organization and filtering
await store.upsert({
  indexName: "myCollection",
  vectors: embeddings,
  metadata: chunks.map((chunk) => ({
    // Basic content
    text: chunk.text,
    id: chunk.id,

    // Document organization
    source: chunk.source,
    category: chunk.category,

    // Temporal metadata
    createdAt: new Date().toISOString(),
    version: "1.0",

    // Custom fields
    language: chunk.language,
    author: chunk.author,
    confidenceScore: chunk.score,
  })),
});

Key metadata considerations:

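  • Be strict with field naming: inconsistencies like 'category' vs 'Category' will affect queries
  • Only include fields you plan to filter or sort by; extra fields add overhead
  • Add timestamps (e.g., 'createdAt', 'lastUpdated') to track content freshness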
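
One way to keep field names consistent is to route all metadata through a single typed builder, so a misspelled or differently cased key fails at compile time. The sketch below assumes a hypothetical `ChunkMetadata` shape and `buildMetadata` helper; the field set is illustrative, and your own schema will differ.

```typescript
// Hypothetical metadata shape for this example; adjust the fields to your data.
interface ChunkMetadata {
  text: string;
  docId: string;
  category: string;
  createdAt: string; // ISO 8601 timestamp
}

// Build metadata in one place so every record uses identical field names.
function buildMetadata(
  chunk: { text: string; docId: string; category: string },
  now: Date = new Date(),
): ChunkMetadata {
  return {
    text: chunk.text,
    docId: chunk.docId,
    category: chunk.category,
    createdAt: now.toISOString(),
  };
}
```

A call such as `chunks.map((chunk) => buildMetadata(chunk))` could then supply the `metadata` argument of `upsert`, assuming each chunk carries those three fields.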

Deleting Vectors

When building RAG applications, you often need to clean up stale vectors when documents are deleted or updated. Mastra provides the deleteVectors method, which supports deleting vectors by metadata filter, making it easy to remove all embeddings associated with a specific document.

Delete by Metadata Filter

The most common use case is deleting all vectors for a specific document when a user deletes it:

delete-vectors.ts
// Delete all vectors for a specific document
// (the filter only matches if docId was stored as metadata at upsert time)
await store.deleteVectors({
  indexName: "myCollection",
  filter: { docId: "document-123" },
});

This is particularly useful when:

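  • A user deletes a document and you need to remove all of its chunks
  • You're re-indexing a document and want to remove the old vectors first
  • You need to clean up vectors for a specific user or tenant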
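
The re-indexing case can be sketched as a delete followed by an upsert. The `VectorStoreLike` interface and `reindexDocument` function below are hypothetical: the interface covers only the two calls shown in this document, with argument shapes assumed from the examples here rather than taken from Mastra's type definitions.

```typescript
// Minimal structural type for this sketch (an assumption, not Mastra's types).
interface VectorStoreLike {
  deleteVectors(args: {
    indexName: string;
    filter: Record<string, unknown>;
  }): Promise<void>;
  upsert(args: {
    indexName: string;
    vectors: number[][];
    metadata: Record<string, unknown>[];
  }): Promise<unknown>;
}

// Replace all vectors of one document: remove the stale ones, then store the
// fresh ones tagged with the same docId so they can be deleted the same way later.
async function reindexDocument(
  store: VectorStoreLike,
  indexName: string,
  docId: string,
  vectors: number[][],
  texts: string[],
): Promise<void> {
  await store.deleteVectors({ indexName, filter: { docId } });
  await store.upsert({
    indexName,
    vectors,
    metadata: texts.map((text) => ({ text, docId })),
  });
}
```

Deleting first avoids leftover chunks when the new version of a document produces fewer chunks than the old one.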

Delete Multiple Documents

You can also use complex filters to delete vectors matching multiple conditions:

delete-vectors-advanced.ts
// Delete all vectors for multiple documents
await store.deleteVectors({
  indexName: "myCollection",
  filter: {
    docId: { $in: ["doc-1", "doc-2", "doc-3"] },
  },
});

// Delete vectors for a specific user's archived documents
await store.deleteVectors({
  indexName: "myCollection",
  filter: {
    $and: [
      { userId: "user-123" },
      { status: "archived" },
    ],
  },
});
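
To illustrate what these filters select, the following in-memory sketch evaluates plain equality, `$in`, and `$and` against a metadata object. It is an illustration of the intended semantics only; `matches` is a hypothetical function, and real vector stores evaluate filters inside the database and support more operators.

```typescript
type Metadata = Record<string, unknown>;
type Filter = Record<string, unknown>;

// Illustrative matcher: returns true when `metadata` satisfies every condition.
function matches(metadata: Metadata, filter: Filter): boolean {
  return Object.entries(filter).every(([key, condition]) => {
    if (key === "$and") {
      // All sub-filters must match.
      return (condition as Filter[]).every((sub) => matches(metadata, sub));
    }
    if (typeof condition === "object" && condition !== null && "$in" in condition) {
      // The field value must be one of the listed values.
      return (condition as { $in: unknown[] }).$in.includes(metadata[key]);
    }
    // Plain value: strict equality on the field.
    return metadata[key] === condition;
  });
}

const archived = { docId: "doc-2", userId: "user-123", status: "archived" };
console.log(matches(archived, { docId: { $in: ["doc-1", "doc-2", "doc-3"] } })); // true
console.log(matches(archived, { $and: [{ userId: "user-123" }, { status: "active" }] })); // false
```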

Delete by Vector IDs

If you have specific vector IDs to delete, you can pass them directly:

delete-by-ids.ts
// Delete specific vectors by their IDs
await store.deleteVectors({
  indexName: "myCollection",
  ids: ["vec-1", "vec-2", "vec-3"],
});

Best Practices

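  • Create indexes before bulk insertions
  • Use batch operations for large insertions (the upsert method handles batching automatically)
  • Only store metadata you'll query against
  • Match embedding dimensions to your model (e.g., 1536 for text-embedding-3-small)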