Build More Powerful Voice Agents with the Gemini Live API
We've significantly improved function calling and enhanced proactive audio capabilities in the Live API to handle interruptions, pauses, and side conversations gracefully.
Authors: Ivan Solovyev, Product Manager, Google DeepMind; Valeria Wu, Product Manager, Google DeepMind; and Mingqiu Wang, Research Engineer, Google DeepMind
Today, we’re excited to announce a significant update to the Live API in the Gemini API, featuring a new native audio model, now available in preview. This update is designed to help you build more reliable, responsive, and natural-sounding voice agents.
For this model release, we've focused on two key areas:
A Major Boost for Reliability
The most powerful and interesting voice agent experiences are unlocked when they can reliably connect to external data and services—allowing users to access real-time information, book appointments, or complete transactions. This is where function calling comes in. Given the real-time nature of voice interactions, there’s no time to retry a failed request, making the reliability of function calling absolutely critical.
To see what this improved reliability looks like in practice, here’s a quick demo of it in action:
The new model is significantly better at identifying the correct function to call, knowing when not to call a function, and adhering consistently to the provided tool schema. Our internal benchmarks show a dramatic improvement in function calling accuracy (e.g., the model correctly identifying and calling a function, including in complex scenarios with 10 or more active functions). Compared to the previous version, function calling success increased by 2x in single-call tests and 1.5x in tests involving 5 to 10 calls. This boost is a big step forward for voice applications, and we are continuing to improve reliability, especially for multi-turn scenarios, based on developer feedback.
Test the model’s function calling improvements with this app in Google AI Studio.
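To get a feel for the API shape, here is a minimal sketch of declaring a tool and answering the model's tool calls over a live session, assuming the google-genai Python SDK. The model name is a placeholder, and book_appointment is a hypothetical function for illustration only.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Placeholder: substitute the current preview native audio model from the docs.
MODEL = "gemini-2.5-flash-native-audio-preview"

# A hypothetical tool the model can choose to call (or not call) mid-conversation.
book_appointment = {
    "name": "book_appointment",
    "description": "Book an appointment for the user on a given date and time.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO 8601 date, e.g. 2025-07-01"},
            "time": {"type": "string", "description": "24-hour time, e.g. 14:30"},
        },
        "required": ["date", "time"],
    },
}

config = {
    "response_modalities": ["AUDIO"],
    "tools": [{"function_declarations": [book_appointment]}],
}

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="Book me a haircut next Tuesday at 3pm.")],
            )
        )
        async for response in session.receive():
            # The model streams a tool call when it decides one of the
            # declared functions should run; you execute it and send back
            # the result so the conversation can continue.
            if response.tool_call:
                replies = [
                    types.FunctionResponse(
                        id=fc.id,
                        name=fc.name,
                        response={"status": "confirmed"},  # your real backend result
                    )
                    for fc in response.tool_call.function_calls
                ]
                await session.send_tool_response(function_responses=replies)

asyncio.run(main())
```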
Even More Natural Conversations
We've also added more proactive audio capabilities to make interactions feel more natural. The model now ignores chatter not relevant to the active conversation and is also significantly better at understanding natural pauses and interruptions by the user.
Imagine you're talking to a voice agent and someone walks into the room to ask you a quick question. The model can now gracefully pause the conversation, while ignoring the side chatter, and seamlessly resume when you're ready to continue.
Similarly, the model is now better at understanding conversational rhythms, recognizing the context of your speech and adapting to your pauses, whether you're taking a moment to articulate a complex thought or just speaking casually. In our internal evaluations, the number of times the model incorrectly cut in during a user's pause dropped significantly compared to the previous model. These improvements happen automatically, with no extra setup required, making the conversation much more fluid.
This update also brings significant improvements in interruption detection accuracy, noticeably reducing the number of times the model fails to recognize that a user has interrupted it.
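These detection improvements need no configuration, but your client still has to react when the server flags an interruption. A minimal sketch of a receive loop, assuming the google-genai Python SDK with an already-open session like the one above; playback_queue is a stand-in for whatever audio output buffer your app actually uses.

```python
async def stream_audio(session, playback_queue):
    """Receive model audio while honoring the server's interruption signal.

    `session` is an open Live API session (see the earlier sketch);
    `playback_queue` is a stand-in for your app's real audio buffer.
    """
    async for response in session.receive():
        server_content = response.server_content
        # When the model detects the user talking over it, it stops
        # generating and marks the turn as interrupted. Drop buffered,
        # unplayed audio so the agent goes quiet immediately instead of
        # finishing a stale response.
        if server_content and server_content.interrupted:
            playback_queue.clear()
            continue
        if response.data:  # raw PCM audio bytes from the model
            playback_queue.append(response.data)
```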
Smarter Responses with "Thinking" Capabilities
As a follow-up to this launch, next week we are rolling out support for "thinking" capabilities, similar to those in Gemini 2.5 Flash and Pro. We recognize that not all questions can or should be answered instantly. For complex queries that require deeper reasoning, you will be able to set a "thinking budget," allowing the model to take a few moments to process the request more thoroughly. As part of the thinking process, the model will send back a text summary of its thinking.
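The Live API surface for this is not published yet, but the "thinking budget" knob it mirrors already exists for Gemini 2.5 Flash and Pro in the standard API. A minimal sketch of that existing pattern, again assuming the google-genai Python SDK:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Thinking budgets on the non-live API today: cap how many tokens the
# model may spend reasoning before it answers (0 disables thinking).
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Plan a three-city rail itinerary under a $500 budget.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,   # max reasoning tokens
            include_thoughts=True,  # also return a summary of the thinking
        )
    ),
)
print(response.text)
```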
Live API in the Real World
We’ve been working closely with our early access partners to test and improve the API’s capabilities, and almost all have reported positive results with the latest model.
For example, Ava, an AI-powered family operating system, uses the Live API to act as a "household COO." Ava processes messy, real-world inputs like school emails, PDFs, and voice notes, turning them into actions like calendar events.
"The ability to have natural, bi-directional voice chat was a hard requirement," says Joe Alicata, Cofounder and CTO of Ava. "The latest model's improvements to function calling accuracy were a game-changer. We're seeing higher first-pass accuracy on noisy inputs and fewer brittle prompt hacks, which allowed our small team to ship a reliable, agentic, multimodal product much faster."
Get Started Today
You can start building with the Live API right now:
Head over to the Live API documentation to learn more and find end-to-end code samples in the cookbooks.
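If you just want to verify your setup before wiring up audio capture and playback, the smallest possible session looks roughly like this (assuming the google-genai Python SDK; the model name is a placeholder, and the cookbooks have full audio streaming examples):

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

async def main():
    # TEXT responses keep this smoke test free of audio plumbing.
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-preview",  # placeholder; check the docs for the current name
        config={"response_modalities": ["TEXT"]},
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text="Hello!")])
        )
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```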
We believe these updates will unlock new possibilities for creating powerful and intuitive voice experiences, and we will have even more to share about the Live API soon. Happy building!