Build More Powerful Voice Agents with the Gemini Live API

We've significantly improved function calling and enhanced proactive audio capabilities in the Live API to handle interruptions, pauses, and side conversations gracefully.

Authors: Ivan Solovyev, Product Manager, Google DeepMind; Valeria Wu, Product Manager, Google DeepMind; and Mingqiu Wang, Research Engineer, Google DeepMind

Today, we’re excited to announce a significant update to the Live API in the Gemini API, featuring a new native audio model, now available in preview. This update is designed to help you build more reliable, responsive, and natural-sounding voice agents.

For this model release, we've focused on two key areas:

  • More robust function calling: Making it more reliable for your agent to connect to external data and services.
  • More natural conversations: Ensuring interactions feel more intuitive with a better understanding of the context and natural resumption of the conversation in case of interruptions or pauses.

A Major Boost for Reliability

The most powerful and interesting voice agent experiences are unlocked when they can reliably connect to external data and services—allowing users to access real-time information, book appointments, or complete transactions. This is where function calling comes in. Given the real-time nature of voice interactions, there’s no time to retry a failed request, making the reliability of function calling absolutely critical.

To see what this improved reliability looks like in practice, here’s a quick demo of it in action:

More reliable function calling

The new model is significantly better at identifying the correct function to call, knowing when not to call a function, and adhering consistently to the provided tool schema. Our internal benchmarks show a dramatic improvement in function calling accuracy (e.g., the model correctly identifying and calling a function, including in complex scenarios with 10 or more active functions). Compared to the previous version, function calling success increased by 2x in single-call tests and 1.5x in tests involving 5 to 10 calls. This boost in reliability is a big step forward for voice applications, and we are continuing to improve reliability, especially for multi-turn scenarios, based on developer feedback.
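To make the setup concrete, here is a minimal sketch of wiring a single tool into a Live API session with the google-genai Python SDK; the get_weather declaration and its stubbed response are illustrative assumptions, not from the post:

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Hypothetical tool declaration for illustration.
get_weather = {
    "name": "get_weather",
    "description": "Returns the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

CONFIG = {
    "response_modalities": ["AUDIO"],
    "tools": [{"function_declarations": [get_weather]}],
}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview-09-2025", config=CONFIG
    ) as session:
        await session.send_client_content(
            turns={"role": "user",
                   "parts": [{"text": "What's the weather in Lisbon?"}]})

        async for response in session.receive():
            if response.tool_call:
                # The model asks us to run the function; send back a
                # (stubbed) result so it can answer the user.
                await session.send_tool_response(function_responses=[
                    types.FunctionResponse(
                        id=fc.id, name=fc.name,
                        response={"result": "sunny, 24 degrees C"})
                    for fc in response.tool_call.function_calls
                ])

asyncio.run(main())
```

In a real voice agent the text turn would be replaced by streamed microphone audio, but the tool_call round trip works the same way.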

Test the model’s function calling improvements with this app in Google AI Studio.

Triggering rate increase vs. the 05-20 preview: 2x with a single tool, 1.6x with 5 tools, and 1.4x with 10 tools
Results based on tests run on Google AI Studio and Vertex AI

Even More Natural Conversations

We've also added more proactive audio capabilities to make interactions feel more natural. The model now ignores chatter not relevant to the active conversation and is also significantly better at understanding natural pauses and interruptions by the user.

Imagine you're talking to a voice agent and someone walks into the room to ask you a quick question. The model can now gracefully pause the conversation, while ignoring the side chatter, and seamlessly resume when you're ready to continue.

Better recognition of background conversations
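The pause and interruption handling needs no configuration, but for ignoring off-topic audio the Live API documentation describes a proactivity option on native audio models. A minimal config sketch, assuming that documented proactive_audio flag:

```python
from google.genai import types

# With proactive audio enabled, the model may decide not to respond at
# all when the incoming audio is not addressed to it (e.g. side chatter).
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    proactivity={"proactive_audio": True},
)
```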

Similarly, the model is now better at understanding conversational rhythms, recognizing the context of your speech and adapting to your pauses, whether you're taking a moment to articulate a complex thought or just speaking casually. In our internal evaluations, the number of times the model incorrectly interrupted users during natural pauses in their speech dropped significantly compared to the previous model. These improvements happen automatically, with no extra setup required, making the conversation much more fluid.

Graceful handling of natural pauses in the conversation

This update also brings significant improvements in interruption detection accuracy, noticeably reducing the number of times the model fails to recognize when a user interrupts it.

Significantly improved detection of interruptions

Smarter Responses with "Thinking" Capabilities

As a follow-up to this launch, next week we are rolling out support for "thinking" capabilities, similar to those in Gemini 2.5 Flash and Pro. We recognize that not all questions can or should be answered instantly. For complex queries that require deeper reasoning, you will be able to set a "thinking budget," allowing the model to take a few moments to process the request more thoroughly. As part of the thinking process, the model will send back a text summary of its thinking.
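The Live API surface for this was not yet available at the time of writing, so as a grounded reference, here is how a thinking budget is set today for Gemini 2.5 Flash through the standard API; the Live API equivalent may differ:

```python
from google import genai
from google.genai import types

client = genai.Client()

# Thinking budget as it works today in Gemini 2.5 Flash: cap the tokens
# the model may spend reasoning, and ask for a summary of its thoughts.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Plan a three-city rail itinerary that minimizes transfers.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,
            include_thoughts=True,
        )
    ),
)
print(response.text)
```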

Live API in the real world

We’ve been working closely with our early access partners to test and improve the API’s capabilities, and almost all have reported positive results from testing the latest model.

For example, Ava, an AI-powered family operating system, uses the Live API to act as a "household COO." Ava processes messy, real-world inputs like school emails, PDFs, and voice notes, turning them into actions like calendar events.

"The ability to have natural, bi-directional voice chat was a hard requirement," says Joe Alicata, Cofounder and CTO of Ava. "The latest model's improvements to function calling accuracy were a game-changer. We're seeing higher first-pass accuracy on noisy inputs and fewer brittle prompt hacks, which allowed our small team to ship a reliable, agentic, multimodal product much faster." 

Get Started Today

You can start building with the Live API right now:

Python quickstart: an asynchronous client that streams microphone audio to the gemini-2.5-flash-native-audio-preview-09-2025 model, sets a system instruction ("a helpful and friendly AI assistant" with "optimistic wit"), requests audio response modalities, and plays back the model's audio replies.
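A minimal sketch of that flow, assuming the google-genai Python SDK and PyAudio; the chunk size, sample rates, and persona wording here are illustrative rather than the post's exact snippet:

```python
import asyncio

import pyaudio
from google import genai
from google.genai import types

# Assumed PCM formats: the Live API takes 16 kHz, 16-bit mono audio in
# and returns 24 kHz audio out.
FORMAT = pyaudio.paInt16
CHANNELS = 1
SEND_RATE = 16000
RECV_RATE = 24000
CHUNK = 1024

client = genai.Client()  # reads GEMINI_API_KEY from the environment

MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"
CONFIG = {
    "response_modalities": ["AUDIO"],
    "system_instruction": (
        "You are a helpful and friendly AI assistant with an optimistic wit."
    ),
}

async def main():
    pya = pyaudio.PyAudio()
    mic = pya.open(format=FORMAT, channels=CHANNELS, rate=SEND_RATE,
                   input=True, frames_per_buffer=CHUNK)
    speaker = pya.open(format=FORMAT, channels=CHANNELS, rate=RECV_RATE,
                       output=True)

    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:

        async def send_audio():
            # Stream raw PCM chunks from the microphone to the model.
            while True:
                data = await asyncio.to_thread(
                    mic.read, CHUNK, exception_on_overflow=False)
                await session.send_realtime_input(
                    audio=types.Blob(data=data,
                                     mime_type="audio/pcm;rate=16000"))

        async def play_audio():
            # Play each audio chunk as the model streams it back.
            while True:
                async for response in session.receive():
                    if response.data:
                        await asyncio.to_thread(speaker.write, response.data)

        await asyncio.gather(send_audio(), play_audio())

if __name__ == "__main__":
    asyncio.run(main())
```

Set GEMINI_API_KEY, run the script, and talk; keeping upstream audio and playback in separate coroutines is what lets the model pause and resume naturally mid-conversation.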

Head over to the Live API documentation to learn more, and find end-to-end code samples in the cookbooks.

We believe these updates will unlock new possibilities for creating powerful and intuitive voice experiences, and we will have even more to share about the Live API soon. Happy building!
