Storing Embeddings in a Vector Database
After generating embeddings, you need to store them in a database that supports vector similarity search. Mastra provides a consistent interface for storing and querying embeddings across various vector databases.
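The examples below assume `embeddings` and `chunks` are already in scope. A minimal sketch of producing them, assuming Mastra's MDocument chunker and the AI SDK's embedMany with OpenAI's text-embedding-3-small:
import { MDocument } from "@mastra/rag";
import { openai } from "@ai-sdk/openai";
import { embedMany } from "ai";
// Split the source document into chunks, then embed each chunk's text
const doc = MDocument.fromText("Your document text goes here...");
const chunks = await doc.chunk();
const { embeddings } = await embedMany({
  model: openai.embedding("text-embedding-3-small"),
  values: chunks.map((chunk) => chunk.text),
});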
Supported Databases
- MongoDB
- PgVector
- Pinecone
- Qdrant
- Chroma
- Astra
- libSQL
- Upstash
- Cloudflare
- OpenSearch
- ElasticSearch
- Couchbase
- Lance
- S3 Vectors
import { MongoDBVector } from "@mastra/mongodb";
const store = new MongoDBVector({
id: 'mongodb-vector',
uri: process.env.MONGODB_URI,
dbName: process.env.MONGODB_DATABASE,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
Using MongoDB Atlas Vector Search
For detailed setup instructions and best practices, see the official MongoDB Atlas Vector Search documentation.
import { PgVector } from "@mastra/pg";
const store = new PgVector({
id: 'pg-vector',
connectionString: process.env.POSTGRES_CONNECTION_STRING,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
Using pgvector with PostgreSQL
PostgreSQL with the pgvector extension is a good solution for teams already using PostgreSQL who want to minimize infrastructure complexity. For detailed setup instructions and best practices, see the official pgvector repository.
import { PineconeVector } from "@mastra/pinecone";
const store = new PineconeVector({
id: 'pinecone-vector',
apiKey: process.env.PINECONE_API_KEY,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { QdrantVector } from "@mastra/qdrant";
const store = new QdrantVector({
id: 'qdrant-vector',
url: process.env.QDRANT_URL,
apiKey: process.env.QDRANT_API_KEY,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { ChromaVector } from "@mastra/chroma";
// Running Chroma locally
// const store = new ChromaVector()
// Running on Chroma Cloud
const store = new ChromaVector({
id: 'chroma-vector',
apiKey: process.env.CHROMA_API_KEY,
tenant: process.env.CHROMA_TENANT,
database: process.env.CHROMA_DATABASE,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { AstraVector } from "@mastra/astra";
const store = new AstraVector({
id: 'astra-vector',
token: process.env.ASTRA_DB_TOKEN,
endpoint: process.env.ASTRA_DB_ENDPOINT,
keyspace: process.env.ASTRA_DB_KEYSPACE,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { LibSQLVector } from "@mastra/core/vector/libsql";
const store = new LibSQLVector({
id: 'libsql-vector',
url: process.env.DATABASE_URL,
authToken: process.env.DATABASE_AUTH_TOKEN, // Optional: for Turso cloud databases
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { UpstashVector } from "@mastra/upstash";
// In upstash they refer to the store as an index
const store = new UpstashVector({
id: 'upstash-vector',
url: process.env.UPSTASH_URL,
token: process.env.UPSTASH_TOKEN,
});
// There is no store.createIndex call here, Upstash creates indexes (known as namespaces in Upstash) automatically
// when you upsert if that namespace does not exist yet.
await store.upsert({
indexName: "myCollection", // the namespace name in Upstash
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { CloudflareVector } from "@mastra/vectorize";
const store = new CloudflareVector({
id: 'cloudflare-vector',
accountId: process.env.CF_ACCOUNT_ID,
apiToken: process.env.CF_API_TOKEN,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { OpenSearchVector } from "@mastra/opensearch";
const store = new OpenSearchVector({ id: "opensearch", node: process.env.OPENSEARCH_URL });
await store.createIndex({
indexName: "my-collection",
dimension: 1536,
});
await store.upsert({
indexName: "my-collection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { ElasticSearchVector } from "@mastra/elasticsearch";
const store = new ElasticSearchVector({
id: 'elasticsearch-vector',
url: process.env.ELASTICSEARCH_URL,
auth: {
apiKey: process.env.ELASTICSEARCH_API_KEY
}
});
await store.createIndex({
indexName: "my-collection",
dimension: 1536,
});
await store.upsert({
indexName: "my-collection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
Using Elasticsearch
For detailed setup instructions and best practices, see the official Elasticsearch documentation.
import { CouchbaseVector } from "@mastra/couchbase";
const store = new CouchbaseVector({
id: 'couchbase-vector',
connectionString: process.env.COUCHBASE_CONNECTION_STRING,
username: process.env.COUCHBASE_USERNAME,
password: process.env.COUCHBASE_PASSWORD,
bucketName: process.env.COUCHBASE_BUCKET,
scopeName: process.env.COUCHBASE_SCOPE,
collectionName: process.env.COUCHBASE_COLLECTION,
});
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
import { LanceVectorStore } from "@mastra/lance";
const store = await LanceVectorStore.create("/path/to/db");
await store.createIndex({
tableName: "myVectors",
indexName: "myCollection",
dimension: 1536,
});
await store.upsert({
tableName: "myVectors",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
Using LanceDB
LanceDB is an embedded vector database built on the Lance columnar format, suitable for local development or cloud deployment. For detailed setup instructions and best practices, see the official LanceDB documentation.
import { S3Vectors } from "@mastra/s3vectors";
const store = new S3Vectors({
id: 's3-vectors',
vectorBucketName: "my-vector-bucket",
clientConfig: {
region: "us-east-1",
},
nonFilterableMetadataKeys: ["content"],
});
await store.createIndex({
indexName: "my-index",
dimension: 1536,
});
await store.upsert({
indexName: "my-index",
vectors: embeddings,
metadata: chunks.map((chunk) => ({ text: chunk.text })),
});
Using Vector Storage
Once initialized, all vector stores share the same interface for creating indexes, upserting embeddings, and querying.
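Querying is part of that shared interface too. A minimal sketch, assuming queryEmbedding is an embedding of the user's question produced with the same model used at indexing time:
// Retrieve the most similar chunks for a query embedding
const results = await store.query({
  indexName: "myCollection",
  queryVector: queryEmbedding,
  topK: 5,
});
// Results typically carry an id, a similarity score, and the stored metadata
console.log(results.map((result) => result.metadata?.text));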
Creating Indexes
Before storing embeddings, you need to create an index with the appropriate dimension size for your embedding model:
// Create an index with dimension 1536 (for text-embedding-3-small)
await store.createIndex({
indexName: "myCollection",
dimension: 1536,
});
The dimension size must match the output dimension of your chosen embedding model. Common dimension sizes:
- OpenAI text-embedding-3-small: 1536 dimensions (or custom, e.g. 256)
- Cohere embed-multilingual-v3: 1024 dimensions
- Google gemini-embedding-001: 768 dimensions (or custom)
The index dimension cannot be changed after creation. To use a different model, delete the index and recreate it with the new dimension size.
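A sketch of that migration, assuming your store exposes a deleteIndex method (check your store's reference docs for the exact signature):
// Drop the old index, then recreate it with the new model's dimension
await store.deleteIndex({ indexName: "myCollection" });
await store.createIndex({
  indexName: "myCollection",
  dimension: 768, // e.g. if switching to a 768-dimensional model
});
// Re-generate and re-upsert all embeddings with the new model afterwards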
Naming Rules for Databases
Each vector database enforces specific naming conventions for indexes and collections to ensure compatibility and prevent conflicts.
- MongoDB
- PgVector
- Pinecone
- Qdrant
- Chroma
- Astra
- libSQL
- Upstash
- Cloudflare
- OpenSearch
- ElasticSearch
- S3 Vectors
MongoDB: collection (index) names must:
- Start with a letter or underscore
- Be at most 120 bytes long
- Contain only letters, numbers, underscores, or dots
- Not contain $ or the null character
- Example: my_collection.123 is valid
- Example: my-index is not valid (contains a hyphen)
- Example: My$Collection is not valid (contains $)
PgVector: index names must:
- Start with a letter or underscore
- Contain only letters, numbers, and underscores
- Example: my_index_123 is valid
- Example: my-index is not valid (contains a hyphen)
Pinecone: index names must:
- Use only lowercase letters, numbers, and dashes
- Not contain dots (used for DNS routing)
- Not use non-Latin characters or emoji
- Have a combined length (including the project ID) of at most 52 characters
- Example: my-index-123 is valid
- Example: my.index is not valid (contains a dot)
Qdrant: collection names must:
- Be 1 to 255 characters long
- Not contain any of these special characters: < > : " / \ | ? *
- Not contain the null character (\0) or the unit separator (\u{1F})
- Example: my_collection_123 is valid
- Example: my/collection is not valid (contains a slash)
Chroma: collection names must:
- Be 3-63 characters long
- Start and end with a letter or number
- Contain only letters, numbers, underscores, or hyphens
- Not contain consecutive periods (..)
- Not be a valid IPv4 address
- Example: my-collection-123 is valid
- Example: my..collection is not valid (consecutive periods)
Astra: collection names must:
- Not be empty
- Be 48 characters or fewer
- Contain only letters, numbers, and underscores
- Example: my_collection_123 is valid
- Example: my-collection is not valid (contains a hyphen)
libSQL: index names must:
- Start with a letter or underscore
- Contain only letters, numbers, and underscores
- Example: my_index_123 is valid
- Example: my-index is not valid (contains a hyphen)
Upstash: namespace names must:
- Be 2 to 100 characters long
- Contain only:
  - Alphanumeric characters (a-z, A-Z, 0-9)
  - Underscores, hyphens, and dots
- Not start or end with a special character (_, -, .)
- Can be case-sensitive
- Example: MyNamespace123 is valid
- Example: _namespace is not valid (starts with an underscore)
Cloudflare: index names must:
- Start with a letter
- Be shorter than 32 characters
- Contain only lowercase ASCII letters, numbers, and dashes
- Use dashes instead of spaces
- Example: my-index-123 is valid
- Example: My_Index is not valid (uppercase and underscore)
OpenSearch: index names must:
- Use only lowercase letters
- Not start with an underscore or hyphen
- Not contain spaces or commas
- Not contain special characters (e.g. :, ", *, +, /, \, |, ?, #, >, <)
- Example: my-index-123 is valid
- Example: My_Index is not valid (contains uppercase letters)
- Example: _myindex is not valid (starts with an underscore)
ElasticSearch: index names must:
- Use only lowercase letters
- Not exceed 255 bytes (counting multi-byte characters)
- Not start with an underscore, hyphen, or plus sign
- Not contain spaces or commas
- Not contain special characters (e.g. :, ", *, +, /, \, |, ?, #, >, <)
- Not be "." or ".."
- Not start with "." (deprecated, except for hidden and system indices)
- Example: my-index-123 is valid
- Example: My_Index is not valid (contains uppercase letters)
- Example: _myindex is not valid (starts with an underscore)
- Example: .myindex is not valid (starts with a dot, deprecated)
S3 Vectors: index names must:
- Be unique within the same vector bucket
- Be 3 to 63 characters long
- Use only lowercase letters (a-z), numbers (0-9), hyphens (-), and dots (.)
- Start and end with a letter or number
- Example: my-index.123 is valid
- Example: my_index is not valid (contains an underscore)
- Example: -myindex is not valid (starts with a hyphen)
- Example: myindex- is not valid (ends with a hyphen)
- Example: MyIndex is not valid (contains uppercase letters)
Upserting Embeddings
After creating an index, you can store embeddings along with their basic metadata:
// Store embeddings with their corresponding metadata
await store.upsert({
indexName: "myCollection", // index name
vectors: embeddings, // array of embedding vectors
metadata: chunks.map((chunk) => ({
text: chunk.text, // The original text content
id: chunk.id, // Optional unique identifier
})),
});
The upsert operation:
- Takes an array of embedding vectors and their corresponding metadata
- Updates existing vectors if they share the same ID (see the sketch below)
- Creates new vectors if they don't already exist
- Automatically handles batching for large datasets
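Passing explicit IDs is what makes the update path deterministic. A sketch, assuming the store accepts an optional ids array aligned with vectors (the doc-123-chunk-${i} naming scheme is just an illustrative convention):
// Re-running this upsert with the same IDs overwrites the earlier vectors instead of adding duplicates
await store.upsert({
  indexName: "myCollection",
  vectors: embeddings,
  ids: chunks.map((chunk, i) => `doc-123-chunk-${i}`),
  metadata: chunks.map((chunk) => ({ text: chunk.text, docId: "doc-123" })),
});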
Adding Metadata
Vector stores support rich metadata (any JSON-serializable fields) for filtering and organization. Since metadata is stored without a fixed schema, use consistent field naming to avoid unexpected query results.
Metadata is critical for vector storage: without it, you have only numerical embeddings and no way to return the original text or filter results. Always store at least the source text as metadata.
// Store embeddings with rich metadata for better organization and filtering
await store.upsert({
indexName: "myCollection",
vectors: embeddings,
metadata: chunks.map((chunk) => ({
// Basic content
text: chunk.text,
id: chunk.id,
// Document organization
source: chunk.source,
category: chunk.category,
// Temporal metadata
createdAt: new Date().toISOString(),
version: "1.0",
// Custom fields
language: chunk.language,
author: chunk.author,
confidenceScore: chunk.score,
})),
});
Key metadata considerations:
- Be strict with field naming: inconsistencies such as 'category' vs. 'Category' will affect queries (see the sketch after this list)
- Only include fields you plan to filter or sort on; extra fields add overhead
- Add timestamps (e.g. 'createdAt', 'lastUpdated') to track content freshness
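A sketch of filtering on those same field names at query time (queryEmbedding is a placeholder for an embedding of the user's question):
// Filter keys must match the metadata keys used at upsert time exactly
const filteredResults = await store.query({
  indexName: "myCollection",
  queryVector: queryEmbedding,
  topK: 10,
  filter: { category: "reference", language: "en" },
});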
Deleting Vectors
When building RAG applications, you often need to clean up stale vectors as documents are deleted or updated. Mastra provides the deleteVectors method, which supports deleting vectors by metadata filter, making it easy to remove all embeddings associated with a specific document.
Delete by Metadata Filter
The most common use case is deleting all vectors for a document when a user deletes it:
// Delete all vectors for a specific document
await store.deleteVectors({
indexName: "myCollection",
filter: { docId: "document-123" },
});
This is particularly useful when:
- A user deletes a document and you need to remove all of its chunks
- You are re-indexing a document and want to remove the old vectors first (see the sketch after this list)
- You need to clean up vectors for a specific user or tenant
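For the re-indexing case, a minimal sketch (newChunks and newEmbeddings are hypothetical placeholders for the re-processed document):
// Remove the stale vectors for the document, then upsert the fresh ones
await store.deleteVectors({
  indexName: "myCollection",
  filter: { docId: "document-123" },
});
await store.upsert({
  indexName: "myCollection",
  vectors: newEmbeddings,
  metadata: newChunks.map((chunk) => ({ text: chunk.text, docId: "document-123" })),
});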
Delete Multiple Documents
You can also use complex filters to delete vectors matching multiple conditions:
// Delete all vectors for multiple documents
await store.deleteVectors({
indexName: "myCollection",
filter: {
docId: { $in: ["doc-1", "doc-2", "doc-3"] },
},
});
// Delete vectors for a specific user's documents
await store.deleteVectors({
indexName: "myCollection",
filter: {
$and: [
{ userId: "user-123" },
{ status: "archived" },
],
},
});
Delete by Vector IDs
If you have specific vector IDs to delete, you can pass them directly:
// Delete specific vectors by their IDs
await store.deleteVectors({
indexName: "myCollection",
ids: ["vec-1", "vec-2", "vec-3"],
});
Best Practices
- Create indexes before bulk insertion
- Use batch operations for large insertions (the upsert method handles batching automatically)
- Only store metadata you will query against
- Match the index dimension to your embedding model (e.g. 1536 for text-embedding-3-small); see the sketch below
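One way to keep the last point honest is to derive the dimension from the embeddings you actually generated rather than hard-coding it; a small sketch:
// Derive the index dimension from the generated embeddings to avoid a mismatch
const dimension = embeddings[0]?.length ?? 1536; // fall back to 1536 if nothing was embedded
await store.createIndex({ indexName: "myCollection", dimension });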