
Google Gemini Live Voice

The GeminiLiveVoice class provides real-time voice interaction capabilities using Google's Gemini Live API. It supports bidirectional audio streaming, tool calling, session management, and both standard Google API and Vertex AI authentication methods.

Usage Example

import { GeminiLiveVoice } from "@mastra/voice-google-gemini-live";
import { playAudio, getMicrophoneStream } from "@mastra/node-audio";

// Initialize with Gemini API (using API key)
const voice = new GeminiLiveVoice({
  apiKey: process.env.GOOGLE_API_KEY, // Required for Gemini API
  model: "gemini-2.0-flash-exp",
  speaker: "Puck", // Default voice
  debug: true,
});

// Or initialize with Vertex AI (using OAuth)
const voiceWithVertexAI = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
  model: "gemini-2.0-flash-exp",
  speaker: "Puck",
});

// Or use the VoiceConfig pattern (recommended for consistency with other providers)
const voiceWithConfig = new GeminiLiveVoice({
  speechModel: {
    name: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
  },
  speaker: "Puck",
  realtimeConfig: {
    model: "gemini-2.0-flash-exp",
    apiKey: process.env.GOOGLE_API_KEY,
    options: {
      debug: true,
      sessionConfig: {
        interrupts: { enabled: true },
      },
    },
  },
});

// Establish connection (required before using other methods)
await voice.connect();

// Set up event listeners
voice.on("speaker", (audioStream) => {
  // Handle audio stream (NodeJS.ReadableStream)
  playAudio(audioStream);
});

voice.on("writing", ({ text, role }) => {
  // Handle transcribed text
  console.log(`${role}: ${text}`);
});

voice.on("turnComplete", ({ timestamp }) => {
  // Handle turn completion
  console.log("Turn completed at:", timestamp);
});

// Convert text to speech
await voice.speak("Hello, how can I help you today?", {
  speaker: "Charon", // Override default voice
  responseModalities: ["AUDIO", "TEXT"],
});

// Process audio input
const microphoneStream = getMicrophoneStream();
await voice.send(microphoneStream);

// Update session configuration
await voice.updateSessionConfig({
  speaker: "Kore",
  instructions: "Be more concise in your responses",
});

// When done, disconnect
await voice.disconnect();
// Or use the synchronous wrapper
voice.close();

Configuration

Constructor Options

apiKey?: string
Google API key for Gemini API authentication. Required unless using Vertex AI.

model?: GeminiVoiceModel = 'gemini-2.0-flash-exp'
The model ID to use for real-time voice interactions.

speaker?: GeminiVoiceName = 'Puck'
Default voice ID for speech synthesis.

vertexAI?: boolean = false
Use Vertex AI instead of the Gemini API for authentication.

project?: string
Google Cloud project ID (required for Vertex AI).

location?: string = 'us-central1'
Google Cloud region for Vertex AI.

serviceAccountKeyFile?: string
Path to a service account JSON key file for Vertex AI authentication.

serviceAccountEmail?: string
Service account email for impersonation (alternative to a key file).

instructions?: string
System instructions for the model.

sessionConfig?: GeminiSessionConfig
Session configuration, including interrupt and context settings.

debug?: boolean = false
Enable debug logging for troubleshooting.

Session Configuration

interrupts?: object
Interrupt handling configuration.

interrupts.enabled?: boolean = true
Enable interrupt handling.

interrupts.allowUserInterruption?: boolean = true
Allow the user to interrupt model responses.

contextCompression?: boolean = false
Enable automatic context compression.
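The defaults above can be resolved with a small helper that fills in any unset fields. The following is an illustrative sketch, not library code: the `SessionConfig` interface and `withSessionDefaults` function are our own minimal stand-ins mirroring the documented option names.

```typescript
// Minimal stand-in for the session configuration described above.
// Field names mirror the documented options; the real library type may differ.
interface SessionConfig {
  interrupts?: {
    enabled?: boolean;
    allowUserInterruption?: boolean;
  };
  contextCompression?: boolean;
}

// Fill in the documented defaults for any fields the caller left unset.
function withSessionDefaults(config: SessionConfig = {}) {
  return {
    interrupts: {
      enabled: config.interrupts?.enabled ?? true,
      allowUserInterruption: config.interrupts?.allowUserInterruption ?? true,
    },
    contextCompression: config.contextCompression ?? false,
  };
}

const resolved = withSessionDefaults({ interrupts: { enabled: false } });
console.log(resolved);
// { interrupts: { enabled: false, allowUserInterruption: true }, contextCompression: false }
```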

Methods

connect()

Establishes a connection to the Gemini Live API. Must be called before using the speak, listen, or send methods.

requestContext?: object
Optional request context for the connection.

Returns: Promise<void> - Resolves when the connection is established.

speak()

Converts text to speech and sends it to the model. Accepts either a string or a readable stream as input.

input: string | NodeJS.ReadableStream
Text or text stream to convert to speech.

options?: GeminiLiveVoiceOptions
Optional speech configuration.

options.speaker?: GeminiVoiceName = constructor's speaker value
Voice ID to use for this specific speech request.

options.languageCode?: string
Language code for the response.

options.responseModalities?: ('AUDIO' | 'TEXT')[] = ['AUDIO', 'TEXT']
Response modalities to receive from the model.

Returns: Promise<void> (responses are emitted via the speaker and writing events)

listen()

Processes audio input for speech recognition. Takes a readable stream of audio data and returns the transcribed text.

audioStream: NodeJS.ReadableStream
Audio stream to transcribe.

options?: GeminiLiveVoiceOptions
Optional listening configuration.

Returns: Promise<string> - The transcribed text

send()

Streams audio data in real time to the Gemini service, for continuous audio streaming scenarios such as live microphone input.

audioData: NodeJS.ReadableStream | Int16Array
Audio stream or buffer to send to the service.

Returns: Promise<void>

updateSessionConfig()

Updates the session configuration dynamically. This can be used to modify voice settings, speaker selection, and other runtime configurations.

config: Partial<GeminiLiveVoiceConfig>
Configuration updates to apply.

Returns: Promise<void>

addTools()

Adds a set of tools to the voice instance. Tools allow the model to perform additional actions during conversations. When GeminiLiveVoice is added to an Agent, any tools configured for the Agent are automatically available to the voice interface.

tools: ToolsInput
Tools configuration to equip.

Returns: void

addInstructions()

Adds or updates system instructions for the model.

instructions?: string
System instructions to set.

Returns: void

answer()

Triggers a response from the model. This method is primarily used internally when integrated with an Agent.

options?: Record<string, unknown>
Optional parameters for the answer request.

Returns: Promise<void>

getSpeakers()

Returns a list of available voice speakers for the Gemini Live API.

Returns: Promise<Array<{ voiceId: string; description?: string }>>

disconnect()

Disconnects from the Gemini Live session and cleans up resources. This is the async method that properly handles cleanup.

Returns: Promise<void>

close()

Synchronous wrapper for disconnect(). Calls disconnect() internally without awaiting it.

Returns: void

on()

Registers an event listener for voice events.

event: string
Name of the event to listen for.

callback: Function
Function to call when the event occurs.

Returns: void

off()

Removes a previously registered event listener.

event: string
Name of the event to stop listening for.

callback: Function
The specific callback function to remove.

Returns: void

Events

The GeminiLiveVoice class emits the following events:

speaker: event
Emitted when audio data is received from the model. Callback receives a NodeJS.ReadableStream.

speaking: event
Emitted with audio metadata. Callback receives { audioData?: Int16Array, sampleRate?: number }.

writing: event
Emitted when transcribed text is available. Callback receives { text: string, role: 'assistant' | 'user' }.

session: event
Emitted on session state changes. Callback receives { state: 'connecting' | 'connected' | 'disconnected' | 'disconnecting' | 'updated', config?: object }.

turnComplete: event
Emitted when a conversation turn is completed. Callback receives { timestamp: number }.

toolCall: event
Emitted when the model requests a tool call. Callback receives { name: string, args: object, id: string }.

usage: event
Emitted with token usage information. Callback receives { inputTokens: number, outputTokens: number, totalTokens: number, modality: string }.

error: event
Emitted when an error occurs. Callback receives { message: string, code?: string, details?: unknown }.

interrupt: event
Emitted on interruption events. Callback receives { type: 'user' | 'model', timestamp: number }.
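The event names and payload shapes above can be captured in a typed event map so that listener callbacks are checked at compile time. The following is an illustrative sketch: `GeminiLiveEventMap` and the minimal emitter are our own names, not exports of the library, and only a subset of the events is modeled.

```typescript
// Payload types for a subset of the documented events (assumed shapes, mirroring the list above).
type GeminiLiveEventMap = {
  writing: { text: string; role: "assistant" | "user" };
  turnComplete: { timestamp: number };
  toolCall: { name: string; args: object; id: string };
  error: { message: string; code?: string; details?: unknown };
  interrupt: { type: "user" | "model"; timestamp: number };
};

// Minimal typed emitter showing how listeners line up with payloads.
class TypedEmitter {
  private listeners = new Map<string, Array<(payload: any) => void>>();

  on<K extends keyof GeminiLiveEventMap & string>(
    event: K,
    cb: (payload: GeminiLiveEventMap[K]) => void,
  ): void {
    const list = this.listeners.get(event) ?? [];
    list.push(cb);
    this.listeners.set(event, list);
  }

  emit<K extends keyof GeminiLiveEventMap & string>(
    event: K,
    payload: GeminiLiveEventMap[K],
  ): void {
    for (const cb of this.listeners.get(event) ?? []) cb(payload);
  }
}

const emitter = new TypedEmitter();
const transcript: string[] = [];
emitter.on("writing", ({ text, role }) => transcript.push(`${role}: ${text}`));
emitter.emit("writing", { text: "Hello", role: "assistant" });
console.log(transcript); // ["assistant: Hello"]
```

With this pattern, passing a wrong payload shape to emit, or destructuring a nonexistent field in a listener, fails type checking instead of surfacing at runtime.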

Available Models

The following Gemini Live models are available:

  • gemini-2.0-flash-exp (default)
  • gemini-2.0-flash-exp-image-generation
  • gemini-2.0-flash-live-001
  • gemini-live-2.5-flash-preview-native-audio
  • gemini-2.5-flash-exp-native-audio-thinking-dialog
  • gemini-live-2.5-flash-preview
  • gemini-2.5-flash-preview-tts

Available Voices

The following voice options are available:

  • Puck (default): Conversational, friendly
  • Charon: Deep, authoritative
  • Kore: Neutral, professional
  • Fenrir: Warm, approachable

Authentication Methods

Gemini API (Development)

The simplest method, using an API key from Google AI Studio:

const voice = new GeminiLiveVoice({
  apiKey: "your-api-key", // Required for Gemini API
  model: "gemini-2.0-flash-exp",
});

Vertex AI (Production)

For production use with OAuth authentication and Google Cloud Platform:

// Using a service account key file
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountKeyFile: "/path/to/service-account.json",
});

// Using Application Default Credentials
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
});

// Using service account impersonation
const voice = new GeminiLiveVoice({
  vertexAI: true,
  project: "your-gcp-project",
  location: "us-central1",
  serviceAccountEmail: "service-account@project.iam.gserviceaccount.com",
});

Advanced Features

Session Management

The Gemini Live API supports session resumption for handling network interruptions:

voice.on("sessionHandle", ({ handle, expiresAt }) => {
  // Store session handle for resumption
  saveSessionHandle(handle, expiresAt);
});

// Resume a previous session
const voice = new GeminiLiveVoice({
  sessionConfig: {
    enableResumption: true,
    maxDuration: "2h",
  },
});
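A stored handle is only useful while it has not expired, so `saveSessionHandle` (a placeholder in the snippet above) can be as simple as keeping the pair and checking `expiresAt` before attempting to resume. A minimal in-memory sketch, under the assumption that `expiresAt` is an epoch-millisecond timestamp:

```typescript
// In-memory store for a session handle and its expiry (epoch milliseconds).
let stored: { handle: string; expiresAt: number } | null = null;

function saveSessionHandle(handle: string, expiresAt: number): void {
  stored = { handle, expiresAt };
}

// Return the stored handle only if it is still valid at time `now`; otherwise null.
function resumableHandle(now: number): string | null {
  if (stored && stored.expiresAt > now) return stored.handle;
  return null;
}

saveSessionHandle("abc123", Date.now() + 60_000);
console.log(resumableHandle(Date.now())); // "abc123"
console.log(resumableHandle(Date.now() + 120_000)); // null (expired)
```

A production version would persist the pair outside the process (e.g. a database or cache) so a restarted service can still resume.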

Tool Calling

Enable the model to call functions during conversations:

import { z } from "zod";

voice.addTools({
  weather: {
    description: "Get weather information",
    parameters: z.object({
      location: z.string(),
    }),
    execute: async ({ location }) => {
      const weather = await getWeather(location);
      return weather;
    },
  },
});

voice.on("toolCall", ({ name, args, id }) => {
  console.log(`Tool called: ${name} with args:`, args);
});
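Conceptually, when a toolCall event arrives its name and args are routed to the matching tool's execute function, and the result is associated with the call's id. The library handles this dispatch internally when tools are registered via addTools; the following dependency-free sketch only illustrates the idea (tool shapes are assumed, zod validation omitted, and execute is synchronous here for brevity):

```typescript
// Minimal tool shape mirroring the example above (parameters/zod omitted for brevity).
type Tool = {
  description: string;
  execute: (args: Record<string, unknown>) => unknown;
};

const tools: Record<string, Tool> = {
  weather: {
    description: "Get weather information",
    execute: ({ location }) => `Sunny in ${location}`,
  },
};

// Route a toolCall payload ({ name, args, id }) to the matching tool
// and pair the result with the call id.
function dispatchToolCall(call: { name: string; args: Record<string, unknown>; id: string }) {
  const tool = tools[call.name];
  if (!tool) throw new Error(`Unknown tool: ${call.name}`);
  return { id: call.id, result: tool.execute(call.args) };
}

const reply = dispatchToolCall({ name: "weather", args: { location: "Paris" }, id: "1" });
console.log(reply.result); // "Sunny in Paris"
```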

Notes

  • The Gemini Live API uses WebSockets for real-time communication
  • Audio is processed as 16kHz PCM16 on input and 24kHz PCM16 on output
  • The voice instance must be connected with connect() before other methods can be used
  • Always call close() when finished to properly clean up resources
  • Vertex AI authentication requires the appropriate IAM permissions (the aiplatform.user role)
  • Session resumption allows recovery from network interruptions
  • The API supports real-time interaction with both text and audio
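The input format noted above (16kHz PCM16) means microphone audio captured as floating-point samples must be converted to an Int16Array before being passed to send(). A minimal conversion sketch, assuming the capture rate is already 16kHz (resampling omitted); `floatTo16BitPCM` is our own helper name, not a library export:

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit signed PCM, clamping out-of-range values.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

const pcm = floatTo16BitPCM(new Float32Array([0, 0.5, -1, 2]));
console.log(Array.from(pcm)); // [0, 16383, -32768, 32767]
```

The resulting Int16Array can then be handed to send() directly, per the audioData signature documented above.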