
Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

TL;DR

  • Google DeepMind is launching the Interactions API, a new unified interface for its Gemini models and agents designed to simplify LLM application development.
  • This API features server-side state management, eliminating the need for client-side context handling and enabling more efficient, cost-effective interactions through implicit caching.
  • It offers a standardized approach to handling multi-modal inputs and outputs, supports advanced tool-use, and is designed for easier integration into developer workflows.

Takeaways

  • The Interactions API is the new unified API from Google DeepMind, currently in beta, intended to succeed the generateContent API for Gemini models and agents.
  • It introduces server-side state management, which means the API server maintains conversation history, removing the burden of managing context on the client side.
  • Server-side state management significantly improves implicit caching, potentially making input tokens up to 90% cheaper by avoiding re-encoding previously sent context.
  • The API uses standardized "content blocks" with a type field (e.g., text, audio, video, image, function_call) for consistent input and output handling.
  • It supports asynchronous execution for long-running agent operations (like DeepResearch) via polling or webhooks, improving performance and connection management.
  • The API streamlines tool use by supporting built-in tools, Remote-MCP, and allowing the combination of Google Search with custom functions.
  • Developers can use agent skills (e.g., the Gemini Interactions API skill) within IDEs like Cursor or Antigravity to guide agents on available tools, API specifics, and model preferences (e.g., defaulting to the latest Gemini 3 Flash model).
  • Effective system instructions are crucial for defining an agent's role (e.g., "you are a coding agent") and capabilities, helping it make informed decisions about when to invoke tools.

Vocabulary

  • Interactions API — The new, unified Google DeepMind API for interacting with Gemini models and agents, featuring server-side state management and a standardized content block structure.
  • generateContent API — The previous Google DeepMind API for Gemini models, which the Interactions API is designed to replace.
  • Server-side state management — A feature where the API server maintains the conversation history and context, removing the need for client applications to manage it themselves.
  • Implicit caching — Automatic caching of encoded input tokens on the server, which can significantly reduce costs for subsequent API calls by avoiding redundant token re-encoding.
  • DeepResearch — An agent capability that allows the model to autonomously perform extensive research across many websites over an extended period.
  • Content blocks — A standardized data structure used in the Interactions API to represent various types of input and output, such as text, images, or function calls.
  • SSE (Server-Sent Events) — A web streaming technology used by the Interactions API for efficient, real-time updates and streaming responses.
  • Tools — Functions or external capabilities that an agent can invoke to interact with its environment, such as reading/writing files, searching the web, or generating images.
  • Function calling (or Tool use) — The ability of a language model to generate structured data (often JSON) that an application can use to execute predefined external functions or tools.
  • Agent skills — Pre-packaged sets of capabilities, documentation, or contextual information provided to an agent, enabling it to understand and perform specific tasks more effectively.

Transcript

It's just that Philipp and I are both Germans, so we thought it would be funny if maybe we could do it in German. Actually, it looks like there's a German crew there, which is nice. No worries, we'll do it in English. We'll do it in a couple of different languages, maybe; we'll find out. Do we have other languages in the room? Other nationalities? What do we have? Shout it out. Spanish. Any Icelandic? No. Ok, close. Romanian. Nice. Dutch. [inaudible]. Ok. Which? Hindi. Ok. Canada. No. Ok, Bangalore. No. Farsi. Farsi. Alright, nice. Check. Yeah. Brilliant. Ok, this is fantastic. Well, thanks, everyone, for making your way over here with all your languages, really appreciate it. We can really put the model to the test today, which is great. Yeah, hi. I'm Thor, or Thorsten for the German speakers amongst you. Hi, I'm Philipp. Only Philipp. For German and English speakers. Yes, it's nice. So we work on the developer experience at Google DeepMind, broadly covering the Gemini API and also working on Google AI Studio as a tool for developers to try out the models quickly, and then also the API interfaces. To use the API, you do need an API key. Actually, who has an API key already? Gemini API? Ok, a couple of folks. Who has used AI Studio before? Google AI Studio? Ok, a couple of folks. Great. Maybe you can quickly... Sorry? Antigravity. Yeah, that's good. You actually don't need an API key for Antigravity, but... So, if you don't have an API key yet, if you have your machine on you, and I hope you do, it's a hands-on workshop. I do apologize that they took away your tables. Yes. So it's a very laptop situation, like literally lap-top. I hope you didn't bring your Mac mini or whatever. But so yes, just go to ai.dev. And I mean, you can also go to ai.studio. You can go to AIstudio.com, but we paid a lot of money for ai.dev, so please use that. It just redirects to AI Studio. So ai.dev slash API key? Yes, or on the corner left side.
There's a Get API key, and then api-keys is where you can... Yeah, it's my personal account and we cannot change the language. You only need a Google account, so no credit card, no nothing. Everything we are going to do is part of the free tier. So if you go to AIstudio.dev and there's some sign-in form, you can just use your Gmail, and no worries, we will not charge you. And then on api-keys, in the top right corner there is normally something called API key, Create, or Create API key. And when you do it for the first time, you might need... So similarly for me, I can import my projects or I can create a project. I can just create one and give it any name. I mean, we can call it AIE workshop. And then Create project. We can... translate it to English. It takes a few seconds and then you should be able to create a key. And this key will be used for the demos and the hands-on things we are going to do later. It shows the API key. Yeah, I will delete it, so don't copy it. Or copy it very fast. You can use that key. You can put it into your .bashrc or .zshrc file, or later you can directly inline it, whatever you prefer. If you already have one, you can use that one. And since we have a few more minutes, we really want to make sure everyone who wants to follow along has time to create an API key. If an error appears or something isn't working, feel free to raise your hand; Thor will come to you and help you. And... Yeah, just a reminder, it is a secret API key. Don't do what Philipp is doing there. That's why I delete my key again. Yes. So we do get that a lot, that people leak their API keys. Mostly it's Claude Code that is pushing the API keys to GitHub. So I recommend you don't do that. Just kidding. Just remember, it is a secret API key, so treat it like a secret. Don't share it with your neighbor. No, that's...
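A minimal sketch of the "treat it like a secret" advice above: read the key from the environment instead of inlining it in code that might get pushed to GitHub. The variable name `GEMINI_API_KEY` and the helper names `load_api_key` and `mask` are assumptions made for this illustration; they are not taken from the talk.

```python
import os


def load_api_key(env=os.environ, name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    The variable name is an assumption; export it from .bashrc or .zshrc,
    for example: export GEMINI_API_KEY="..."
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it in your shell profile first")
    return key


def mask(key):
    """Hide all but the last four characters so the key can appear in logs."""
    return "*" * max(len(key) - 4, 0) + key[-4:]


if __name__ == "__main__":
    try:
        print("Using key", mask(load_api_key()))
    except RuntimeError as err:
        print(err)
```

Keeping the key out of the source file means a coding agent that commits the workspace has nothing secret to push.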
But yeah, do create that API key now, and we'll give you a couple of minutes to do that. And while you do that, or once you've finished creating your API key, I'd love for you to briefly introduce yourself, if you want to, and let us know what you would like to get out of the workshop. Maybe there's a specific use case you're working on. Yeah, we'd love to get to know you a little bit as well. So while you create your API keys, if you want to, feel free to just shout out your name or nickname, what you're working on, and what you would like to get out of the workshop. Yeah. [inaudible] from the ThursdAI podcast, working for CoreWeave, for Weights & Biases, and a big fan of Gemma, and Gemma 4. And Gemini Live, I'm one of the few who are using it for [inaudible] stuff as well, [inaudible], and I'm also running it on the glasses, actually. Oh, nice. And I want to see more of this, you know. Okay. And I want Google glasses. Okay, yeah, we'll see if we can get to the glasses by the end of the workshop. Sorry. If you have a seat to your right, could you move in, so we can fill up from the side here. That makes it easier for people coming in to find one. And while we are waiting: we launched something cool in Chrome, which I haven't activated, but when you go to your tab bar and right-click, you can now move tabs to the side, and we have vertical tabs now. Yeah. So more screen, that's good. Nice. Okay, cool. So, glasses. Check. What are you running on the glasses? Well, technically on the phone, right? So, yeah. Yeah. With OpenClaw? Nice. Nice. And then that's just using Gemini Live, kind of through the WebSocket on the phone, right? Yeah. Nice. Yeah, that's cool. Yeah, VisionClaw, if you haven't heard of that, is a pretty fun open-source project. And I think you can run it, yeah, sort of on the [inaudible].
Now that Meta has opened up the SDK for the Meta Ray-Ban, you can actually hook something like the Gemini Live API into the glasses. Well, the glasses connect to your phone, and then the phone actually connects to Gemini Live, which is cool. But it shows what will be possible. Yeah, nice. Any problems creating API keys? Everyone has an API key? Should we check? Okay. Anyone else? Anything specific they, yeah? Michael, [inaudible] in Zurich. [inaudible]. Cool. Okay. Cool. Nice. Last chance. Anyone else? Okay. All right. So, we can start. Before we go into the hands-on session, I have like 10 to 15 minutes of slides to give a bit of background on what we are going to use to build. The first session, which I'm going to do, is more on building an agent without any live audio input. That's where Thor is going to take over later, and then we are going to build some very nice conversational agents. So, who of you has used the Gemini API to make an API call to Gemini before? Okay. That's a few. Has any one of you used the Interactions API? One, okay. Two, okay. At least some people. So, the Interactions API is a new API we launched in December in beta, which hopefully will succeed generateContent soon. It's a unified API to use with models and with agents, and it's much more aligned with, I would say, the industry. So, much closer to what you are familiar with from open models with the Chat Completions API, or from Anthropic or from OpenAI. And what are we going to do? The slides, then we build a coding agent, like a small little Claude Code with reading files, writing files, and running bash commands, and then we have some time at the end if you have questions around it, and then we do a short break, toilet, drinks, and then Thor will continue. We all have an API key. So, as I said, we want to build an API which works for both models and agents, and when we launched the Interactions API, we also launched Deep Research.
So, maybe you have used Deep Research in ChatGPT or in the Gemini app, where you basically start off a query, and then you get back a plan, and then the model goes on and does deep research for 10 to 15 minutes, visiting hundreds of sites. And the API supports both models and agents, and it's very simple to switch between those. You basically either define a model, which could be Gemini 3 Flash, or you define your agent, which could be Deep Research. And we are working on letting you bring your own agent, or define your own agent, so that you can customize all of these behaviors, and it's the same surface. So, we'll see later, but when you send a request to Nano Banana to generate an image, you can basically chain those interactions to a Flash model to do something else, to Lyria to generate audio, or even, hopefully soon, to generate a video. And the interface is very similar to what you see from OpenAI. So, we now have those content blocks, basically, so every input you provide and every output you get back has the same shape. It has a type field, which could be function call, thought signature, text, audio, video, or image, which hopefully makes it easier for you to build with the Gemini API, and it is less, I would say, Google-branded, less proto-specific, less gRPC, to make it easier for developers to build. As for the core primitives of the Interactions API, in addition to making it easier, we also introduced state on the server, which we will use for building our agents, so we don't need to manage our loop and always send back the whole history. Very similar to the Responses API, you now have a previous interaction ID you can provide, which basically attaches to the existing history, so you can just send a new input. As mentioned, we have an agent with Deep Research and background set to true, so you can start your research and poll it, or soon use webhooks to get notified when your research is done, so you don't need to keep the connection open.
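The polling pattern mentioned for background runs can be sketched generically. This is a minimal illustration, not the API's own client code: `fetch_status` is a hypothetical callable you would implement by retrieving the interaction, and the status strings `completed`, `failed`, and `cancelled` are assumptions.

```python
import time


def poll_until_done(fetch_status, interval_s=5.0, timeout_s=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports a terminal state or time runs out.

    fetch_status is a hypothetical callable returning a status string.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    terminal_states = ("completed", "failed", "cancelled")  # assumed names
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_s} seconds")
        # No connection is held open between checks.
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated long-running job: two pending checks, then done.
    states = iter(["in_progress", "in_progress", "completed"])
    print(poll_until_done(lambda: next(states), interval_s=0.01))
```

Each check is a short request, which is the point made in the talk: a research run of 10 to 15 minutes never needs a long-lived HTTP connection.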
We have the typed blocks, and then we also have the same streaming pattern, very typical for web development, using SSE, which also hopefully makes it easier for you to build. We also support the built-in tools, but we also now have support for remote MCP, and I think two weeks ago we launched tool combination, so you can now combine Google Search with your own custom function, which was one of the big features people were asking for for years, I think. And now you can do this. And to summarize again the difference: generateContent is what we have today. The Interactions API is what we will have tomorrow. We will have server-side state management, but you don't need to use it. So if you say, hey, I want to manage my turns, I want to manage my context, I don't trust you, or I need to do context engineering, I want to remove certain parts, you can do this. It's just way easier to send new input, basically. We have the built-in agents, and we also have background support, so you get asynchronous execution. We all see with agents that when you send the prompt, it might take one, two, three, four minutes to complete, and keeping HTTP requests or connections open for, I would say, more than 10 seconds is not a very good practice, so you want to use asynchronous calls to either get notified or do polling until it is done. And it is less proto-oriented, which I really prefer. It's much closer to what people know from the developer ecosystem. Also, a side effect of the state management is that the implicit caching for the API is much better. So, for those of you who don't know what implicit caching is: when you send a request to the model, the model needs to encode all of your input tokens.
And you can cache those encodings for follow-up requests to save cost, and cached requests, I think, are 90% cheaper for the input tokens. When you manage your state or context yourself, you maybe strip out line breaks or remove certain parts, and this breaks the cache. Using the server-side state, the server keeps the context, so the chance of a cache hit is much higher, and we see like two to three times better cache hit rates from the startups using the Interactions API today. Quick code example on what we mean by making it simpler: on the left side you have the very proto-specific oneof input, parts with inline data or text, where the field basically describes the type. The right side almost looks similar to other APIs, I would say, so it should be much easier if you decide after the workshop to give it a try. Then I guess all of you know roughly what an agent is. We have a brain, our model, which decides what it wants to do: calling a tool, generating text, doing something else. We have tools, which basically give our brain hands and eyes to interact with the environment it is in. We have the context, which is basically everything the model knows: what it has to do, what it can do, whether there are certain preferences, whether there are certain constraints. And then we have the loop, which basically combines our model with hands and tools and runs it until the model no longer calls tools and generates text. So, some quick examples on how to use the API, and then we go into the nice hands-on part. Basic chat usage with server-side state becomes very easy because we have our interactions create call.
So we define our model, we define our input, what's the capital of France, we get back the output, and then we can just continue by providing the previous ID. The server behind the scenes basically has our user input and our model output, and then appends the new user input, so you don't need to have a client-side history object where you append the user turns, the model turns, and then the user turns again. That becomes very helpful when you build agents, where you have the loop and always need to append new user input. And as mentioned before, that also works for agents and models, so we really want to build this unified interface where you can continue your conversations no matter what model you use. In this example, we basically run a Deep Research request to research AI agents in 2026, and then we take the research output and just continue with the Nano Banana model to generate a visual for it. It's like four lines of code, basically, without you having to think about what the context is or how to provide the input, and hopefully that makes it a lot easier. But as said, you don't have to do it, so the input field also accepts the same array with role user, role model, and all of the inputs. For tool use, it's also hopefully now much easier. We have a type function, a name for the function we want to use, a description, and then the parameters the model needs to generate. And then here's roughly what we are going to build. We make our API call with the tools, and then we check what the output of the interaction is. Requires action basically means you as a client need to do something: the output contains some function call or some object which you need to react to. We iterate over our outputs, check for the function call, execute the function, append the result, and send a new interaction, until the model no longer decides it wants to call a function and generates text or something else. Okay.
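The tool-calling loop described above can be sketched offline. This is an illustration under stated assumptions, not the SDK's API: interactions are plain dicts, the field names (`function_call`, `function_result`, `call_id`, `arguments`) follow the talk's description but are not verified, and `make_scripted_model` is a hypothetical stand-in for the real model so the loop runs without a network call.

```python
def read_file(file_path):
    """Return the text content of a file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


def write_file(file_path, content):
    """Write text to a file and report what was written."""
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} characters to {file_path}"


# Map from tool name to the Python function the client executes.
TOOLS = {"read_file": read_file, "write_file": write_file}

# Declarations for the model: type, name, description, JSON-schema parameters.
TOOL_SCHEMAS = [
    {"type": "function", "name": "read_file",
     "description": "Reads a file and returns its content.",
     "parameters": {"type": "object",
                    "properties": {"file_path": {"type": "string"}},
                    "required": ["file_path"]}},
    {"type": "function", "name": "write_file",
     "description": "Writes content to a file.",
     "parameters": {"type": "object",
                    "properties": {"file_path": {"type": "string"},
                                   "content": {"type": "string"}},
                    "required": ["file_path", "content"]}},
]


def run_agent(create, user_input, max_steps=10):
    """Call the model, execute requested tools, send results, repeat."""
    previous_id, next_input = None, user_input
    for _ in range(max_steps):
        interaction = create(input=next_input, tools=TOOL_SCHEMAS,
                             previous_interaction_id=previous_id)
        previous_id = interaction["id"]
        calls = [o for o in interaction["outputs"] if o["type"] == "function_call"]
        if not calls:
            # No tool requested: the text output is the final answer.
            return "".join(o["text"] for o in interaction["outputs"]
                           if o["type"] == "text")
        results = []
        for call in calls:
            tool = TOOLS.get(call["name"])
            try:
                result = (tool(**call["arguments"]) if tool
                          else f"unknown tool: {call['name']}")
            except Exception as exc:  # report the failure back to the model
                result = f"error: {exc}"
            results.append({"type": "function_result", "call_id": call["id"],
                            "name": call["name"], "result": result})
        next_input = results
    raise RuntimeError("agent did not finish within max_steps")


def make_scripted_model(path):
    """Hypothetical stand-in for the API: write, then read, then answer."""
    script = [
        [{"type": "function_call", "id": "c1", "name": "write_file",
          "arguments": {"file_path": path, "content": "Hello from the agent"}}],
        [{"type": "function_call", "id": "c2", "name": "read_file",
          "arguments": {"file_path": path}}],
        [{"type": "text", "text": "The file was created."}],
    ]
    seen = []

    def create(input, tools, previous_interaction_id):
        seen.append(input)
        return {"id": f"i{len(seen)}", "outputs": script[len(seen) - 1]}

    create.seen = seen
    return create
```

Because only the new tool results are sent on each step, the sketch mirrors the server-side-state design: the client never rebuilds the full history.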
The last time I did a workshop, we were all still coding manually. So this is my first time doing a hands-on workshop where at least I don't code much manually anymore. I'm not sure how you do it. So what we are going to do today is not code manually. We are going to use your preferred IDE or the agent CLI of your choice. I'm not sure how many of you are following, but to make it easier we created agent skills for our agents to use. So you can search for Gemini API docs coding agents. Can you see it? Yeah. I mean, I can. Yes, this one should be it, set up your coding agent with Gemini MCP and skills. I think that should bring us here. Really? Yes. Or if not, you can go to the documentation, and then on the left side, in the getting started section, there's a coding agent setup. Are we successful? Okay. You don't need to install it globally. And that's a good question: how do you install? So we have multiple skills. We have the Gemini API dev skill. We are not going to use this one. That's for the generateContent API. If you want to use that API, or if you are familiar with it, go for it. Then we have the Gemini Live API skill. That's what Thor is going to use later, so we also don't install this. But then we have the Gemini Interactions API skill. And here you can pick either the first command or the second command, depending on what you want. And then just copy it and open the workspace you are working in. I already did it, but you can paste the command and run the install. And then you should get a wizard asking you to install it. There are many agents pre-selected. Don't get confused by it. It just means that all of those agents are compatible with the .agents folder. So you might have a .agents/skills/gemini-interactions-api folder, and in there we have our skill. Yeah. Let me also wait. So I tested with Cursor and Antigravity and also Gemini CLI. So if you are using one of those three, you are good.
If you are using Claude Code, I think it should work as well. I'm not sure if they follow the .agents practice; if not, I mean, I'm sure you can... They invented it, right? Yeah, the skills. But I think they only look at .claude. Okay. So. Okay. Yeah. Question? You showed two different commands. One says skills.sh and the other one Context7? Yeah, they both work the same. But what is Context7? So Context7 is, I think, a product out of Upstash. Context7 has an MCP server and now also a skills CLI, which you can use to get access to skills. It's like a public repository. And skills, so that's skills.sh, from Vercel. And both work on GitHub. So google-gemini/gemini-skills is a GitHub repository, and the GitHub repository includes all of the skills we documented. So you can also go there and find them there. There are also the install commands, and that makes it much easier than cloning it and making sure you have it in the right directory. And then, to make sure our skill works, we can just ask our agent what skills it can use. And then, if you have installed it correctly, you should see, yes, we have our specialized Interactions API skill. So Antigravity here picked up our skill in the .agents/skills folder. Okay. How are we doing? Zoom more? Okay. Better? Yeah. I was using Telegram with my assistant. [inaudible] So would I have to have it run? No, you should be able to do it on, like, your OpenClaw kind of setup if you tell it to install the Interactions API skill and then [inaudible]. Will it run? Yeah, it should run. It's a Python script. We are going to execute it later. Yeah, no, it should work. Okay. Any difficulties installing the skill? Any questions? What a skill is? Everyone familiar with skills? Okay.
What we can do, maybe, is give you some insight into what skill we created and what it contains, right? The important thing when creating skills is that it should be either something the model cannot do reliably, or something where you have personal preferences on how to do a certain workflow, or, I don't know, you always need to run tests using Bun or something like this. And what we did with our skill is we made sure that the agent is aware of which Gemini models are available. A common issue we saw before was that the agent always used Gemini 1.5, which is no longer the latest model. We also included the agents here. We have some very high-level information on how it works, but we did not include all of the documentation. What we did instead is, as you should see, link to our documentation, which is available as Markdown. Otherwise, when, for example, we add a new feature to the Interactions API to combine tools, we would need to update our skill, and then every one of you would also need to update your skill to be able to use it, and then we are not making a lot of progress in terms of knowledge cutoff. Instead, we provide the links as part of the skill. All of the agents now have a fetch tool, so they can query the information based on the skill, and then we only need to maintain the documentation, which is mostly up to date. Sorry? I mean, normally you would provide it as a reference in a local file, right, so the agent needs to do either a read file call or a web fetch call, which is the same, I would say, in terms of cost, and it works very well. So what we are going to do as a first example: since we are not vibe coding, we want to build something more substantial. We don't just say build an agent. We want to be more specific, so we say: create an agent class with a constructor and a run method.
The constructor creates the GenAI client, so the GenAI client is what we are going to use to call our model. We also need to define the model, and we also need a global previous interaction ID, and then add a main method to run an example. Do all of that in workshop. So as we can see, the model in this case, or the agent, as a first step read our skill, analyzed our main file, and is implementing it. It should probably fail because I am using uv, so I stop and tell it to use uv from the workspace, and then we let it generate. Okay, it checks if we have installed the library. We have, so maybe in your case the agent still tries to install google-genai; if not, you can do it yourself with pip install or uv pip install google-genai. Okay, so we have our starting agent class, we have our GenAI client, we have our model ID. It defaults to Gemini 3 Flash. It would never have defaulted to Gemini 3 Flash if it hadn't read the skill, right, because then we would be stuck with Gemini 1.5. We have our run method, which makes our interactions.create call. We have the input text, we have our previous interaction ID, we set our new previous interaction ID, and then we return the text. And then, nicely, as a main example, which we will run in a bit, we create our agent, we have turn one, so my name is Phil, and the agent runs it, and then what is my name as turn two. Now we can check if our Gemini-created Gemini agent works with multi-turn and our Interactions API.
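A sketch of the agent class described above, written so it runs offline. The `Agent` class follows the talk's description (constructor, model ID, stored previous interaction ID, `run` method calling `interactions.create`). The model ID string `gemini-3-flash-preview` and the `FakeClient` stand-in are assumptions for illustration; whether the real client accepts `previous_interaction_id=None` on the first turn is also assumed.

```python
from types import SimpleNamespace


class Agent:
    """Minimal multi-turn agent that relies on server-side conversation state."""

    def __init__(self, client, model="gemini-3-flash-preview"):
        # client must expose interactions.create(...); the model ID is assumed.
        self.client = client
        self.model = model
        self.previous_interaction_id = None

    def run(self, text):
        interaction = self.client.interactions.create(
            model=self.model,
            input=text,
            previous_interaction_id=self.previous_interaction_id,
        )
        # Remember the ID so the next turn attaches to this history.
        self.previous_interaction_id = interaction.id
        return interaction.outputs[-1].text


class FakeClient:
    """Offline stand-in that mimics server-side state (not the real SDK)."""

    def __init__(self):
        self.histories = {}  # interaction ID -> user inputs seen so far
        self.interactions = SimpleNamespace(create=self._create)

    def _create(self, model, input, previous_interaction_id=None):
        # The "server" looks up earlier turns by ID and appends the new input.
        history = self.histories.get(previous_interaction_id, []) + [input]
        new_id = f"int-{len(self.histories) + 1}"
        self.histories[new_id] = history
        reply = f"turn {len(history)}: I have seen {history}"
        return SimpleNamespace(
            id=new_id,
            outputs=[SimpleNamespace(type="text", text=reply)],
        )


if __name__ == "__main__":
    agent = Agent(FakeClient())
    print(agent.run("My name is Phil."))
    print(agent.run("What is my name?"))
```

In the workshop the client comes from the google-genai package instead of `FakeClient`; the class keeps no history list of its own, only the last interaction ID.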
You might see a warning similar to the one I got here, interactions usage is experimental. That's because we are still in beta; we are working really hard to get the API out of beta to make sure you can use it in production. And our call was successful: hi Phil, nice to meet you. Okay, maybe we weren't successful: I don't actually have a name. Also, what is my name? Maybe we should check, did we do it correctly? I mean, my name is... maybe... okay, that's still the call, sorry. It asked, what's your name? I'm a large language model trained by Google. It makes sense. And what is my name? Your name is Philip, how can I help you? So yes, I mean, it works really well if you are providing good instructions with good skills and good context, and don't expect it to, I don't know, cure cancer. If you have a very good understanding of what you are trying to build, Gemini 3 Flash is very fast. I mean, it didn't take much time, it didn't consume credits, right? Every one of us is somewhat token constrained at the moment, and it works really well. Okay. Yeah, no, it works. I mean, it also asks if it wants to run the script, so we are closing the loop; even with Gemini 3 Flash, it tries to run our script to make sure it works. And then we will continue in a bit. How are we doing? Anyone with successful calls? Yes? Okay, perfect. Any issues, any errors, any questions? Yeah? [inaudible] Okay, it works. Yeah, nice, great, awesome. Even coding on a phone. So, normally, the next step for our agent, right: we now have our very basic run, we can chat with the model. Now we need to add tools to it, and we want to build some kind of coding agent. So the first tools we are going to add are a read file and a write file tool, and we just continue in our main agent thread: okay, next we need to add a read file and a write file tool, create a basic Python implementation and also the JSON schema definition. So when you use function calling, or tool use, right, we need to create a JSON schema, which
we provide to the model so the model understands what it needs to generate. Once it generates output matching that schema, we also need to have some kind of code implementation which we can then run on the client. So we ask it to create the Python implementation and also the JSON schema, and a map from the key to the function. Let's see what it will come up with. Okay, it's cheating a little bit. I have a solution folder, and it looked up the implementation and implemented it. Yeah, I mean, that's why it's still important to check what your AI does. It found the solution, and then it was like, oh, I've got this solid Python example in the solution folder to guide me, the task is to implement read file and write file. Yes. So what we got back is two new, very, very basic file implementations. We have a read file tool with a file path, which uses Python to open the file and read it, and we have a write file tool with a file path and the content, which writes it. And then we have our read file schema, reads a file and returns the content, and our write file schema, writes the file and returns the content. And it made some updates to our agent, so what did it change? Okay, our input is now a text string or a list, which makes sense since we now need to send back function results. What do we do? Okay, we create a tool definition for our model, which is the schema, so that's our tools map schema. And then we have our loop, which you might be familiar with from the slide I showed. After we run our request, we check interaction.outputs. interaction.outputs includes all of the events generated by the model, and since Gemini is a reasoning model, it also includes, for example, the thoughts and the thought signatures, which we would need to return. Since we are using the previous interaction ID and the server-side state, that's done for us, and we only need to check: okay, do we have a function call? We have a nice debug print so we can check it. And for our outputs, for a function call, we check our tools: do we have our tool
or not? I mean, maybe the model wants to edit a file but we don't have an edit file tool; we would catch it here. It calls our tool and then it creates the tool results. So for a function call we have a function result. That's also part of the change; we really want to make it easy. We will see later, when we use Google Search, there will be a Google Search call and a Google Search result, which have very much the same schema. And then we use recursion: if we have tool results, we basically call our own run method again; if we don't have tool results, we return the interaction text. And it also updated our example: write hello from the agent to a file named hello.txt and return it back. So, I mean, the changes roughly look good to me. We can try and run it. So, uv run python workshop main. What do we get? We get our tool call with write file, we get our tool result, we get a read file call, and then the final agent response: the file hello agent text was successfully created with the content. Okay. Agents are not only single-turn, right? Maybe we want to respond. So the next step is basically very ambitious, telling it: I want to have a continuous, I don't know, stdin implementation to test. Let's see where we'll continue now. I hope at least it should update our main function, where we have an input, a while loop basically waiting for the input, using our agent. So yeah, we have our user input, and then we always continue with our agent run method, which then, inside the agent, runs in a loop until there are no tool calls anymore, and if there's a result, we basically get back our response. Okay, quickly, wait, is it too fast, or are we still on track, roughly? It's fast? Okay. I mean, we will share the code later, and we are also very happy to answer questions. We have many blog posts and examples of that online, so you can, if you're interested, rebuild it later at a slower pace. I'll try to be a little bit slower, but I also want to give Thor enough time to speak, so
What we can do now, since our agent implemented our while loop to provide input: we can say something like "hello", right? Normally we should now not do any tool calls, because "hello" is... our agent, or Gemini in this case, correctly understood: hey, "hello" is nothing I need to solve with a read file or write file tool, so I can say "Hello, how can I help you?" "Can you create an SVG with a thumbs up?" It might take a little bit, so it does the thinking and it does the function call and... okay: "Certainly, here's an SVG file, blah blah blah." Okay, "Can you write it to disk?" "Yes, I can, if you..." Wait, maybe our model is not having our tools. Let's check. There's the tools map, it does. Why is it not using our tools? "What tools can you use?" It has read file, yes. Maybe we are not explicit enough. And what we can do to improve this, in one second, is add a system instruction to tell the model: hey, you are a coding agent, you can use tools to write and interact with the... okay, there we got our tool call, write file, with our SVG. Cool.

Since we saw a mistake, we can now tell the model: hey, add system instructions for the Interactions API call and add an example prompt for a coding agent. Okay. And what's really nice now: since we loaded the Interactions API skill in the beginning, the model still has the awareness of how to add the system instructions to the Interactions API. And what I can guarantee you is that Gemini 3 Flash has never seen any code of the Interactions API, because the model was trained before we even released the API. So all of the work we were doing so far is based on the skills and the coding infrastructure, and hasn't been part of the training.

Okay, so what did we get? Okay, we can provide it on the run command, or we have one when we created... that's good. And okay, coding persona: "You are an expert software engineer and helpful coding assistant. You have access to the local file system." Okay, let's accept it. Let's start our agent again. Let's
say "hello". "Hello, how can I help you with your software engineering and coding tasks?" I mean, definitely better than what we had so far. And what did we send before? We said "Can you create an SVG with a thumbs up?" So: "Can you create an SVG with a thumbs up?" And now let's see if it calls... yes. And now this time we got our write file tool call, and then also we got a "Hey, I have created a thumbs-up SVG file with a simple outline thumbs icon." So can I... yes, there we go, we got a thumbs-up icon. Cool.

And of course, what's missing for a coding agent, right? We need to get some bash tools. That's now not part of the solutions folder, so let's see how we will get our bash tool now: "Add a similar run command tool that allows the model to execute bash commands." Okay, it creates an implementation plan, yeah, yeah, yeah. Okay, so we have our run command, which uses a subprocess, which in this case I guess is okay; we don't care too much about security for this example. And then our output is the stdout. Works. We have our run command tool. Yeah, it even updated our system prompt. And now let's stop that, let's clear that, run it again. Any suggestion on what we should test? I'm not sure that will work, because we don't have any skills or any information for it. "Time. Get the time." Tool call: run command, date. "Wednesday, April 8th." Looks good. Yeah, cool, that's our small little coding agent. No, I mean, let's not do this.

Any more questions? Any more ideas? We have roughly five to seven minutes. Yes. Yes, yes. So what you always can do... what we are doing here is, right, we always use the previous interaction ID from the previous turn, so we basically stack it. And you can always go back to any index in the stack and branch from there. So if you keep the interaction IDs on your client side, you can always use those to, I don't know, branch out: have a first prompt like "do a basic web search" and then use this as a base for five parallel requests doing some other work. And you can always get the
context. So we have an interactions.get method, which you can use to retrieve the interaction and then also get the previous interaction ID, so you can basically go back until the beginning and get all of your state if you want to save it for later. And the default for those interactions being stored on the server: for the free tier it's one day, for paid usage it's 55 days at the moment.

Yeah, you had a question? Sorry. Okay, more questions? Yeah. No, so, Gemini models have a million-token context. What would happen now if you reach that: you will get an error. We are working on context compaction techniques, but it's easier said than done, and still something you currently need to maintain on your client side.

No, no. So when you send the request, you get an ID, the interaction ID, and the interaction ID stores your input and the output of the model. In the free tier, the input and output and the ID are stored for one day. Meaning, if you send the request now and you continue eight hours later from that point, the state, or the context, is still available. If you send a request tomorrow, it would basically say "cannot find request with the old interaction ID", because it basically is pruned after a day. But if you use a paid API key, it's stored for 55 days. And the Interactions API is also coming to Vertex, and I think there might be a little bit more flexibility in terms of how long you want to store it, or customize it.

Yes, we don't have one yet, but hopefully soon. Yes? An idea: you could have the API joke being spoken out with the eSpeak tool. Sorry? It's the eSpeak tool: you use the eSpeak tool to actually speak the joke about the API. And we could also use Gemini, which has a TTS model which can speak the... but speaking and listening, I mean, Thor will show many, many cool things. Any questions regarding the Interactions API and the small little agent? No? Okay, then you get eight minutes... okay, five minutes, to make your agent talk. Yes. Cool, that worked really well. Yes. Yeah, it's been a
while. Yeah, how much better? Much. We didn't even pay him for it. Big upgrades, yes. Yeah, maybe... I know there's a couple more minutes, but caching? Yes, yes. Yeah, that's a good question. We probably need to find Philipp to answer that. I actually don't know. Philipp, caching question: the input tokens, what was it?

For example, when you provide an input, like no interaction ID, first input a PDF with 4,000 tokens and the text input with 10 tokens, and you do a follow-up interaction call, maybe only the PDF will be cached and not the other, the short text. And then if you do another one, maybe the PDF and the follow-up turns will be cached. So it's more like an object level. But, I mean, it's very easy to make a mistake in caching: even the slightest change in your prompt, removing whitespace or line breaks, will break it. So relying on the server to keep it, it's more guaranteed that it's secure. And it could be as easy as: hey, my user says there's an empty line break at the end, okay, I remove it, and then I use that history again, and then it falls apart. Yes, optimally. But I mean, it depends on where your request hits and how fast you follow up, but the cache rate should be pretty high.

Cool. Maybe to kick things off we can look at one of the examples that we can build with the Live API and the new model. So this is now Gemini 3.1 Flash Live, which came out two weeks ago, I think. Very exciting. It's been a while; I think the previous 2.5 native audio model was December. The reason being, we did a kind of major underlying architecture rework to, you know, ideally lower latency and get better scalability in the future as this ramps up. So that's very exciting. And then in parallel we're working on bringing the Live API to the Interactions API as well, so that's something that Philipp and I are working on at the moment. Not there yet, so we still have to use the Live API for now. But let's do a little demo,
and so this demo uses the Live API in combination with Lyria 3. So, on the Gemini API, have people played with Lyria? The new Lyria model is actually a music generation model, and Lyria 3 can now generate songs, including songs with lyrics. So this is the clip model, which is a 30-second clip generation, but then we also have a full-song model now where you can generate full songs. So the idea is that we have our conversational agent, which is the DJ that you're interacting with, and then that DJ has a tool call to generate music, right? And so, I'm originally German, but growing up I learned English through Harry Potter and BBC Radio 1, and if you remember, back in the day you could phone in and wish for a song. And so that's the idea of the experience here.

"Right then, you're on Live Jukebox! What absolute banger are we cooking up today? Tell me, what genre are we leaning into, and what sort of vibe are you going for?" "Let's make... can we get some sort of high-energy German techno Schlager about the AI scene in the UK?" "German techno Schlager about UK AI, that's absolutely brilliant, mate, proper Radio 1 style. Any specific lyrics or themes you want crammed in there, or surprise me?" "Let's go." "Right, you beauty, let's get those speakers rattling. Prepare for a burst of machine mayhem. Here comes German techno Schlager about the UK AI scene. Let's go!"

Yeah, the TPUs enjoy it when you clap for them, so appreciate that. Anyone have any idea for another one? We can do one more. Shout it out: any languages, any genres? A nursing song in Swahili. "Right then, welcome to the Live Jukebox, you're on air with me." "Can we get a nursing song in Swahili, please?" A nursing song in Swahili, brilliant, that's something we don't hear every day. To make it an absolute banger, what sort of vibe are we going for: upbeat and celebratory, or something more relaxing and gentle? And what specific bits about nursing should we squeeze in? Let me know and I'll get
this track buzzing. "Yeah, techno, hardcore." "Techno hardcore in Swahili about nursing, wicked. Prepare to blast those speakers, I'm dialing that in now. Get ready for a total banger, mate." well go see go for yet to you know guest the audio house

Okay, that is the Live Jukebox DJ. If you want to, it's on... So this was built fully with Google AI Studio, so it's kind of a little vibe-coded demo. I've published it in AI Studio as well, so you can try it out. You will need a paid API key for this one; the music generation actually requires a billing account. So if you have an account with credits, you can use that, for example.

Cool. Yeah, Gemini 3.1 Flash Live. You know, benchmarks... obviously benchmarks don't really tell you the whole truth. They are great for benchmarking things, but the real world, especially in live audio, often looks a bit different. So ideally we just try it out ourselves. Gemini 3.1 Flash Live is the model that is now in Gemini Live on your phone, so if you're using the Gemini app on your phone, you're talking to that model. I think Search Live has it now as well, so if you're talking to Google Search, I think that's the same model. And then you can build applications using this model on the Live API.

So the Live API is a stateful WebSocket API. You are able to send real-time text, audio, and video feeds to the model. Audio you're sending in buffer chunks, so you're streaming the real-time audio in. Video you can stream in at a maximum frame rate of one frame per second. This can be a camera feed, this can be a canvas, it could be your screen share, right? So you could share the screen with the model. For example, Shopify is using this for Shopify Sidekick, where it's actually like tech support walking you through: if you're like "oh, how do I set up a custom domain for my Shopify store?", it would basically talk you through how to do that, and it can
see where you are on the screen by ingesting the frames of the screen. And then in return, the WebSocket gives you real-time events back. These are basically streaming back audio buffers, and you can also get the audio transcriptions, so that's the text of it. And then we have tool calling and built-ins, so Google Search grounding is built in by default; if you need real-time weather information, you can access that as well.

And yeah, some key features. What's really cool about this model, again: it's a native audio model. What that means is we're not going through text. It's not a cascading pipeline where you're transcribing the text, running the text through an LLM, and then generating speech; rather, the model itself is going sound token to sound token, and the intelligence is baked into this audio model. It's based on Gemini 3.1, so decently intelligent. You have different thinking levels that you can enable. And the great thing with that is the multilingual support: I think 97 languages are supported in preview at the moment. And because it is a native audio model, it has the audio understanding of Gemini built into it, so it can understand a mix of different languages, like Denglish, for example, which is a mix of German (Deutsch) and English, right? So it would be able to naturally switch between different languages as well, which is really great.

Yeah, barge-in: obviously there's automatic voice activity detection built into the model, so you can interrupt it. You saw it earlier: it's trying to have a conversation, but we're trying to get it to use the tool. Tool use is the other big thing, so major improvements in tool use and instruction following with that model. And so you can build some
really cool things with that. So obviously we currently only give you a WebSocket API, so that is kind of a downside if you've used something like GPT Realtime before, where you get a direct WebRTC infrastructure, which can be helpful. So we have partnered with a lot of integration partners, like LiveKit, Pipecat, Software Mansion in Poland (they build a great service called Fishjam), Vision Agents, Voximplant. These partners have integrated the Live API directly and then give you easy WebRTC integrations if you want that or need that for your system.

Yeah, let's try it out. You can try it out yourself, and it's going to be interesting if we all try it out in this room at the same time, so we'll see how that works. But again, ai.studio or ai.dev, and then /live, and you can try out the model. You can ingest your webcam here, so we can give it our webcam feed ("allow this time"), and then we can send text as well, so we can send something like "How is my outfit?" So in this case, you know... "You're wearing a green jacket over a blue t-shirt paired with a black cap. The combination looks casual and comfortable. Is there a specific occasion you have in mind?"
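The one-frame-per-second cap on video input mentioned earlier means a client capturing a webcam or screen share at a higher rate has to drop frames before streaming them. A minimal sketch of that throttling step, assuming frames arrive with millisecond timestamps (the function name and the timestamp representation are illustrative choices, not part of any SDK):

```python
def select_frames(timestamps_ms, max_fps=1.0):
    """Return the indices of frames to send so that consecutive sent
    frames are at least 1/max_fps seconds apart."""
    min_gap_ms = 1000.0 / max_fps
    keep = []
    last_sent = None
    for index, stamp in enumerate(timestamps_ms):
        if last_sent is None or stamp - last_sent >= min_gap_ms:
            keep.append(index)
            last_sent = stamp
    return keep


if __name__ == "__main__":
    # A 10 fps capture over three seconds: one frame every 100 ms.
    capture = [i * 100 for i in range(30)]
    print(select_frames(capture))
```

In a streaming client the same check would run per frame as it arrives, rather than over a complete list.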
Yeah, okay. So that obviously is a bit further away from our upbeat, sort of British-Australian Live DJ. So what we can do is adjust our voice through the system instructions. Now, in terms of the base voices, we don't have that many; that's 30 different base voices ("Got a project in mind?") and they are fairly generic. But because Gemini has very deep audio understanding, what we can do is modify the voice through prompting, through system instructions. So we can give it a system instruction here, for example, and we can just say "Speak in a friendly Irish accent", right? And so now we have that base voice, Puck, speaking in a friendly Irish accent.

"Hey, can you hear me?" "Well hello there, loud and clear, so I can. What can I do for you on this fine day?" "What do you think of my outfit?" "Well now, you're looking very smart, so yeah. That green jacket suits you well, I must say. A grand casual look. Were you thinking of heading out somewhere?" "Ah, no, I'm just here at the AI Engineer..."

Okay, anyway. So I realized I didn't put on... what I wanted to show you as well is this grounding with Google Search. Obviously the model itself is trained to a certain knowledge cutoff date, and so if we need real-time information, like this gorgeous weather that no one would actually believe us... the model wouldn't believe us if it didn't have Google Search grounding. But now we can have Google Search grounding. We can have different thinking levels as well. So "no thinking" is actually very minimal thinking, but it means the model will respond faster; it's lower latency, because the more thinking we have, obviously the more thinking steps are involved, and it'll increase the latency. You can see in the benchmarks as well there's thinking high, thinking low, different thinking levels.
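The knobs discussed here (a base voice such as Puck, a system instruction that steers the accent, and Google Search grounding) all end up in the first message a client sends over the Live API WebSocket. The sketch below only builds that JSON payload and does not open a connection. The field names follow my understanding of the Live API setup message and should be checked against the current reference; the model string is a placeholder, not a real model ID, and thinking levels are left out because I am not certain of the field name.

```python
import json


def build_setup_message(model, system_instruction=None, voice_name=None,
                        enable_google_search=False):
    """Build the JSON text of a Live API setup message (assumed schema)."""
    generation_config = {"responseModalities": ["AUDIO"]}
    if voice_name:
        # Base voice; the accent itself is steered via the system instruction.
        generation_config["speechConfig"] = {
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
        }
    setup = {"model": model, "generationConfig": generation_config}
    if system_instruction:
        setup["systemInstruction"] = {"parts": [{"text": system_instruction}]}
    if enable_google_search:
        # Built-in grounding tool; an empty object enables it.
        setup["tools"] = [{"googleSearch": {}}]
    return json.dumps({"setup": setup})


if __name__ == "__main__":
    print(build_setup_message(
        model="models/YOUR_LIVE_MODEL",  # placeholder, replace with a real ID
        system_instruction="Speak in a friendly Irish accent.",
        voice_name="Puck",
        enable_google_search=True,
    ))
```

Building the payload in one function makes it easy to log exactly what was sent, which helps when a tool such as search grounding does not seem to take effect.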
So now, if we turn on the Google Search here... "Oh, hello there! What shenanigans can I help you with today? Anything at all, just give a shout." "Yeah, what's the weather right now in London?" "Well now, it looks like it's about nine degrees and mostly cloudy in London right now. Might be a bit chilly, so you'd be wanting a jacket for sure. Anything else you need to know about the goings-on over there?" "What were you talking about? Can you check again, what date is today?" "Oh, apologies there, I seem to have gotten ahead of myself. Today is Wednesday, April 8th, 2026. Is there anything specific you needed to know about today?" "Yeah, now look up today's actual weather." "Right y'are. For Wednesday, April 8th, 2026 in London, you're looking at temperatures between five and 13 degrees with a decent chance of rain, so don't forget that umbrella. Anything else on your mind?"

All right, here we are. It looks like I didn't pray to the demo gods. Okay, there seems to be something going on. I wonder if we messed up the UI there. It should work a lot better than that. I think the Google Search grounding for some reason isn't working. But what we can do now is try it out ourselves in an application. The easiest way we could do that is, like the Live Jukebox DJ, we could use Google AI Studio to vibe code our integration. So we have this pill here, which is called "Add voice conversation", real-time voice with the Gemini Live API, and then we could say "Build a multilingual interview assistant that allows me to train for interviews in different languages like German and English and Spanish", and what have you. And so we can now fire this off. This uses Gemini 3 Flash preview. It is limited to JavaScript full-stack environments at the moment, so I think you can choose between Next.js and Angular. There's XR building blocks as well, if you're building for glasses,
sort of WebVR experiences. So feel free to fire one of these off right now, or you can also clone the Live Jukebox DJ that I shared with you earlier and try that out. It'll take a little time, and you'll hear a little chime once it is ready.

In the meantime, if you go to the Gemini Live API docs... Gemini Live API, there we are. So we've done this: we tried out the Live API in Google AI Studio. We can also use the coding agent skills. Phil showed you this earlier; we have dedicated coding agent skills for the Gemini Live API too, so you can install that and it'll help your coding agent integrate the Live API more easily, more quickly. But then we also have good old example apps on GitHub, which can be very helpful. So what you can do is clone these example apps, like you do in GitHub, right? So we'll open a new terminal, and yes, feel free to follow along. Let me make that bigger. I will make a new directory, we'll call it AI Eng Europe. I like that they call it Europe, right? But I mean, I guess, yeah, the UK is part of Europe, just not the EU. Fair enough. And so we'll go in there and then we'll just do a git clone of our app, and so now we have our app in here.

There's actually a couple of different examples that we can use in here. If you're using Antigravity, there's a handy agy command to open your examples in Antigravity. And so we can look at our different examples here and how we need to set that up. We have two different scenarios. The Gemini Live GenAI Python example uses the Gemini Live API on the server, so it creates a WebSocket connection from your server to the Gemini Live API, and then on your front end you basically set up a proxy to proxy the WebSocket connection to your client side, right? Because your browser window is your client, and so that is what is capturing your audio
feed and your video feed. In this example, which is using FastAPI, we're basically just setting up a WebSocket here that our client can connect to, and then we're receiving our audio queues and our video queues from the client side, or our text input queue as well. So we're receiving that, and we're then setting up a live session. We're using our Gemini client, so we abstracted all the Live API stuff into this Gemini Live file here. And you can see, starting the session, we're basically setting up our live connect config. Maybe I'll close this for now. You can see that we're setting up some system instructions. Earlier that said "a helpful assistant"; we also said "speak in a friendly Irish accent", for example. This is where we put our system instructions, our guardrails. We can make that pretty long in terms of covering what we want. And then we can define our tools here, and then we're basically setting up our session, and our session is the WebSocket session. And then we're just receiving our audio and video queues from the client side and proxying that through. So that is one approach; that's the server-to-server approach.

And we're just using uv here. If we're setting this up for the first time, we can go into... so this is our Gemini Live GenAI Python example here. We can set up our virtual environment, we can then activate it with source here, and we can install our dependencies. Okay, I might have... this is the fun part of a Google laptop, security. Come on. All right, just look away, don't look. Just look at that. And then, yeah, install our requirements. And so we'll need an API key. You can see here how the configuration is: we basically just need our Gemini API key, and we need to set up an environment variable for that. You can see we have an example file
here. Not a lot in there, because it's basically just that. So we can copy our .env.example into our .env file here. And then, you remember how to get your API key? Where do you get your API key? Yes, ai.dev, there we go. Fantastic, love it. Okay, ai.dev, so that is where you get your API key. I think I have a couple of API keys, so it takes a little while to load them here. Which one is... maybe we'll use this one. Once you've created your API key, you can copy the API key from here. Well, actually, maybe I should create a new one, because later I'll need to delete this. So we'll see, AI Eng Europe... we have a couple of projects here, we'll just use... which one do we use? Too many projects. Okay, we'll just use this one. And so now... I really don't like us actually returning the clear API key here. I think we learned something today. Okay, we saved that, and so now we have our API key set up.

Now what we can do is run our demo, and that was just the main.py. And so now our demo will be up and running on localhost:8000 here, and we can see it's just a basic demo. And when we connect... "Top of the morning to ya! I'm Gemini... for a little demo of what this API can do. Why not try out some fun features, like hearing me speak in different accents?" "I can see you all right: you're sitting there with your hands on your face, looking straight at me, but I'm not sure..." "See, I'd like to speak German with you." "How can I help you?" don't thank me one t salio da shaman a new to find a more washing AI so you have good sure to find me. Now, that was smart, but this, I was asking... sorry. Yeah, yeah, but it didn't transcribe it. All right, we're finding out a lot of things here to improve, which is nice.

But yeah, what we can see is... you could actually notice that the latency is a bit worse, because we're having that jump from our client to our server. I would love to blame the Wi-Fi, but with the server-to-server setup you just have that additional
latency of going through your client. So what we can do as well is go directly from our client to the server, and this is the other example that we have, which is this one here, using ephemeral tokens. For ephemeral tokens we can just look into the setup here. Very similar. We'll just go back... actually, let me get a new terminal up here. So, ephemeral tokens. Ephemeral tokens are basically short-lived tokens that we generate with our API key on the server side, and then we send that ephemeral token to our client, so our phone or our browser, to then initiate the WebSocket connection directly from the client to the Live API. Again, similar setup here: we want a virtual environment, we then activate our virtual environment, and we install our dependencies. We also need our API key again, so I think what we can do is just use the same one. We'll just copy this one here and then paste that in. Is it big enough? I don't know, can people see that? Maybe we'll zoom in a little bit more. So now we have our key here as well.

And now that we have our dependencies installed, we can run our server. We can look at the server real quick. This server is basically just a very slim back end that just has our Gemini API key, and then it generates an ephemeral token for us. Ephemeral tokens are currently on the v1alpha API, so you'll need to use a different API version at the moment for this. And then you would pass in this expiration time, because ideally the token should be short-lived, so should the token ever leak, it shouldn't be too costly, because it'll expire pretty soon. So there we are, that's our token, and then we return our token back to our client. On our front end we then have our Gemini Live integration here, and this is an example of just a pure WebSocket integration without any SDK. So you can use any sort of WebSocket
framework here, and you can see how all the different raw WebSocket events are handled. You can see here, this is our WebSocket API, so here we need to use the v1alpha, and then this is the bidi, so bidirectional, we're streaming in both directions. And here we pass our access token, which is our short-lived token. Great.

So now our server is up and running, and we can see this beautiful interface here, which was handcrafted, no agent involved in the creation of this one, back in the day. And we can see all the different knobs here that we have. We'll try this Google grounding as well, and hopefully it was just a thing we need to fix in the UI. And here's our Flash Live preview model, so that's 3.1 Flash Live, and then we can connect here. And so we see the server events... actually, let me maybe do that again. So we say "enable Google Search grounding", connect. What we can see is our WebSocket setup. If we look into the network, we now see we have this WebSocket connection here. We see we have our access token, so our token request to the back end... actually, let me see that. Okay, so you can see we're just getting our token. The response was just our auth token and when it expires. What we sent in terms of the payload... where's our payload? Okay, actually there was no payload. Yeah, we're just getting a token back. Once we have our token, we then set up our WebSocket, and we can see our messages here. Basically, the first message in our WebSocket is our setup configuration. You see we're sending our model as Gemini 3.1 Flash preview, we have some real-time input there, we have the tools, so hopefully now the Google Search should work. And so now what we can do is start our audio.

"Hey, can you hear me?" "Yep, I can hear you loud and clear. How can I help you today?" Um, I'm in London right now and I'm wondering, do
I need an umbrella? What's the weather today? "I'm not getting the current weather for London right now, sorry about that. Your best bet is probably to quickly check a weather app on your phone or search online; that'll give you the latest on rain." "Oh, you're killing me. Just use Google Search." "My apologies, I seem to be having some trouble pulling that up right now. I know that's frustrating. You can get the forecast easily by just asking your phone or searching online. I'll try to get this sorted on my end." "I understand your frustration. Sometimes technology gives us hiccups, and I'm sorry I couldn't get that weather for you right now. I'm always learning, so hopefully next time will be smoother. Was there anything else I could help you with?"

All right, that was painful. I do apologize for that. Okay, so something is definitely going on that is not good, not nice. Let's see if we have more luck with the custom tools. So we're also passing in... do we pass in some custom tools, or is there actually some issue with our... "Hey, can you show me a hello world alert?" "There you go, a hello world alert has been displayed. Anything else I can help you with?" "Yeah, can you change the background of the page to green, please?" "You got it, the background color should now be green. Let me know if you need anything else." "Yeah, can you tell me what's the weather right now in London?" "I understand you're asking about the weather in London, but unfortunately I'm unable to..." Google grounding, but for some reason it's... yeah, so with the Google grounding it should work, so I wonder if I messed up something. "...the weather information isn't something I can grab right now, but if you have any other questions or need help with different functions, feel free to ask."

Let's try this again. "Hey, can you hear me?" Yeah, we need some audio. "Hey, can you change the background back to white, please?" My, I wish I could, but I don't actually have control over your display settings. You might need to look in your device's settings menu to change
that. Anything else I can help with instead? "Oh, sorry, of the page: the background of the page back to white." "Oh, I see. Sadly, I still can't change that for you. Those appearance settings are part of the website or app you're using." Oh yeah, you're right. No, but... okay, well, "disable custom tools", that shouldn't be the case anymore. All right, lots of work to do. But yeah, this is how you can get up and running. Did anyone manage to get it running on their machine? Yeah? Is Google Search grounding working? Okay, that's a shame. All right, what else do we have? So, we need to fix... lots of takeaways from this.

Yeah, if you didn't get the link, this is the link to the Live API examples. They're also linked from the docs, so if you go to the Gemini Live API docs, you can find them there. Is that the end of my slides? I do think we have, yeah, the agent skills as well. So I mean, if you just go to the Gemini Live API docs, on the main docs we've actually linked it from the top there: try the Live API in Google AI Studio, the GitHub examples, or use the coding skills. So that is how you can get started. Cool. That was not how I was hoping this would go. Yeah, we'll figure out what went wrong there, but appreciate y'all joining, and we'll take questions.

Yeah, so you can look at the session management. The Live API has this context window compression. Without compression, audio-only sessions are limited to 15 minutes, and audio-video sessions are limited to two minutes. What that means is that the session will terminate; it will give you the go-away ping. And what you can do is enable context window compression. Context window compression is like a sliding window where you basically say "I want to keep that much context in my window", and
as the conversation progresses it will then actually forget the previous context before that window. So yeah, context window compression is something that you can enable to have that sliding window and make the sessions longer. But then there's only so much context, depending on the frame rate that you're feeding in in terms of images. Yeah, mostly it's images. With audio-only sessions there's more context you can keep in the window, but then as you're adding video frames to it, it does, yeah, compress it.

Yeah? Oh yeah, yeah.

Are there any real-life use cases that this is being used for that are very creative from a business context? What you've done today is a lot of fun, but applying it to business, are there any businesses that are actually doing this really well right now?

Yeah. I mean, so Shopify has it in production with Shopify Sidekick. There is a bunch of... so actually one that I really enjoy, if you look at the Gemini Live API blog, we had a case study, which is this startup... Yeah, so I mean, Stitch is using it as well. You can vibe code, vibe design with your voice in Stitch, but then, you know, Stitch is built by Google, so, you know, probably got a discount. Waymo is also integrating it. So first we got rid of the drivers in the cars, and then, you know, you do want to talk to someone in the car, so you can talk to Gemini in the future. So they're working on that.

I like this one. So this is hey ado, it's a great startup from Argentina. They are building these voice companions for the elderly, in combination with an app for the caretakers or the children of the elderly, where they can get notifications, should the elderly mumble about something that... no, but it's a really nice interaction where the multilinguality really shines, you know, because in
Argentina a lot of it would be Spanish. But there is a nice example you can look at, which is really sweet.

I came to visit my grandma Maria. What do you say about it?

You can look at it in your own time. And then I think there was another one. So yes, it is somewhat more of a future-music use case, just looking at some of the rough edges and limitations. And I think for now, if you're in a real business use case, you might be better off with the cascading pipeline, because it gives you observability at each step of the pipeline. With the real-time native audio you don't really have that fine-grained control and observability in terms of plugging into, say, rewriting the response before it's being said. Obviously there are certain benefits with that in terms of the natural flow of the conversation, but for certain business use cases it might just not be there yet. So it is somewhat a bet on the future of what real-time conversational interactions will look like. But yeah, depending on your business use case today, it might not be the best fit just yet.

No, so in the session you can't retrieve them, so you would have to store them on your end. Again, that is where the integration partners come in. So LiveKit, Pipecat, they all have really good offerings to store the entire audio as well as the entire transcript and give you additional observability tooling on top of that as well. So it's not something that is all available from just the Google side; that's something where currently we're relying on partner integrations to give you that additional functionality.

Yeah? Oh sorry, behind you. Yeah.

Okay. So you said that this is not, you know, super
business-ready for more complex use cases, but what are your thoughts, for example, on replacing interviews for recruitment using this?

Yeah, I mean, I'm not sure I would replace your entire interview pipeline with that, but I think it is a great scenario for certain steps in the interview process, or being able to screen more candidates, for example.

Yeah, do you think that's ready? Because there's a lot of context that you also want to give it, right? Especially even if it's a first screen or the quick screens that you're mentioning, you still want to put in some criteria that would guide it, and guardrails, and how to guide the conversation. And then the second part to this is, does it also work, for example, if my company is in a Google environment, with Google Meet? Right now we have auto-transcripts, so can you put these tools together?

Yeah, so it is a preview, so it is something you can use in production. Depending on your use case, I think you need to evaluate, like, do you need SOC 2? There's potentially a bunch of things that you need to build around it for it to actually fit your business use case. So yeah, it is ready for experimentation. Does that help?

Yeah, yeah.

Oh, yes.

An issue I always have is when I'm demonstrating voice agents: a lot of people are talking and it can't differentiate the speakers from each other. So is there a solution for this already, based on a voice sample or something?

In terms of identifying the different speakers?

Yeah, and so it only listens to you if it's your agent, for instance.

Oh, interesting. It only listens to you.

Imagine you have a coding agent and somebody makes a prank and says "delete the file" or stuff like that. You don't want that.

Yeah, no, that's interesting. No, I don't think there's any specific ability to say "just listen to me." So there is this proactive audio where you can tell it to only
respond to certain contexts, so like ignore things that aren't relevant to the conversation. So to some extent that works, but I don't think it's super reliable at the moment where you'd say, only listen to me, ignore anyone else.

I've done that with Parakeet from Nvidia, where you can train on a little, 10 seconds, and it can differentiate the speakers that way. But it would be nice to have something like, you talk to it and then it recognizes your voice and ignores the other voices for the rest of the session.

That's cool, that's a great idea. Yeah, thanks.

Yeah, I think in the... yeah?

What does thinking look like? Is it thinking in text or is it also thinking in speech?

Yeah, so you get the thinking only in text. So there's text events on the WebSocket channel, so you can opt into getting the thinking as text. Yeah, it wouldn't speak out the thinking.

Thanks for the demo, and for being brave to go up against the demo gods. I had a bit of a question around the multimodality side of things. One of the areas where I really want good text-to... no, speech-to-text models is having them be grounded in what I'm looking at. Like when I'm coding, I want to just, you know, word-vomit into Cursor, sorry, Antigravity, and have it understand the context, and when I say something that is a specific class name, it should just actually use that. How does that work inside of this framework? Would this be a way to actually do that grounding, or would you recommend some other API for that?

And this is for... so, general Gemini models are really good at audio understanding. So if your use case doesn't require fully real-time transcription, I would actually recommend using Gemini 3 Flash to basically transcribe but also, you know, ingest context and basically get contextually aware transcription. Yeah, so I mean, depending on your use case, if you needed it to be fully real-time conversational,
then yeah, you can use text to ingest additional input, or imagery if it makes sense, but then again that reduces the context window size. Or if you don't need the fully real-time, then using Gemini Flash to transcribe is actually pretty good.

...first input, then speak for five seconds and not send an additional image, then you are not using so much context. I think one image is around 1,200 tokens, so not too much. So if you are in your editor and you want real time, you basically can use the API: the first input is to send a real-time image, and then send real-time audio, and then you stop, basically, and the model has the image input and the audio input and can respond to it. So there's no need to stream the image constantly if it doesn't change or if you don't need it to react to it.

Yeah, cool.

Yeah, I think we have a bit more time. There in the back, yeah. How do we get you the microphone, or do you just want to shout it out? Oh yeah, will do.

Thank you for your interesting presentation. I have a question about personalization or adaptation. Can it recognize the speaker's level or knowledge during the interaction and then, based on the speaker's knowledge, produce the result, or not?

Sorry, can you repeat?

Can it recognize the speaker's knowledge or background during the interaction, to produce the response based on the speaker's knowledge or something like that, to personalize itself?

So you want to ingest kind of initial context, is that what you're saying?
Not context. For example, suppose I'm talking to it about civil engineering. Can it recognize I'm a civil engineer and, based on my knowledge, produce the result, use the advanced keywords in civil engineering, or not?

So if it's specialized knowledge, you would have to give it a way to access that.

Oh, that means it doesn't have any memory to recognize the human's background or to find the main context of the interaction and, based on that, produce the next result?

So you have to somehow feed in that knowledge. You could either do that as you set up the session, where you can ingest the knowledge as initial context, for example, and then it has that in context to talk about, or you would give it function calls to access knowledge during the session as you converse.

Yes, thank you. You know, I wanted to find that out. Look at GPT, for example: during some turns it can find the speaker's or the user's knowledge or the main concept of the interaction, and based on that GPT can produce some results. That means step by step, gradually, it's personalized to the main context of the conversation. So my question is, can it step by step personalize itself to the main context of the interaction and then produce a result to the point?

Yeah, so as long as the context stays within the context window during that conversation, yeah, it can. It can identify different speakers and remember what was said in the conversation, and if they introduce themselves with their name as well, it can remember that person's name. So yeah, that's the audio understanding within Gemini.

Thank you.

Okay, cool. Yeah, do you just want to pass it over there?

Yeah, thanks for the nice presentation. Could you share maybe some of your experience on how to evaluate these live voice apps? Because I can imagine that this
becomes a lot more complicated than typical apps.

Yeah, so it definitely depends on your use case, in terms of what your requirements are: do you have HIPAA, do you have SOC 2, what is the amount of function calls, the amount of guardrails. So there's definitely a lot. These demos are nice and fun, but to bring that into a business context there are definitely a lot more steps involved. And so that is where the partner integrations come in. LiveKit has built their entire business around giving you all the batteries around voice agents, and so I would recommend looking at the partner integrations for the real business use cases, potentially.

Thanks.

Sure.

Yeah, I hope you don't mind if I just ask a simple question about the Interactions API, going back to the previous talk.

Yes, please.

When is it going to be available on Vertex?

Hopefully soon. I mean, if you speak to some Google Cloud person, some Vertex person, the more you tell them "I need it on Google Cloud", the easier it gets.

Will do.

Yeah, it's at least not in our control.

Okay.

Yeah, that's bad, that's good. I mean, the API will be the same, so you can start today on the Gemini API and start testing. If you need higher rate limits or anything else, you can always reach out. If you need Vertex enterprise-specific features, then you might need to wait a little bit.

Can I just ask, in terms of PII in any conversation history, how that's stored in terms of data sovereignty: can you specify your own data sources, or is that all handled on the back end within the API?

So you can always disable storing anything. We have a store=false flag, so we would not store anything. But not storing means no server-side state, so if you would like to use this, it's a bit difficult. For other Vertex features, in terms of data sovereignty, where you call the model, I would expect them to be similar to
generateContent, so if they have it today, they will have it there in the future as well.

Okay, cool, that's great, thanks. Appreciate it.

Cool, one last one. Oh, okay. Oh, sorry. Please.

Thank you. So I have a question about hallucinations. You have shown some examples with the weather that didn't work so well, but how do your clients actually deal with that stuff in production? Because I can imagine that for some examples that we've seen here this is fine, but in real life this is a different story. So can you give some best practices on how to deal with that?

Yeah, definitely. So I mean, for the demos there's definitely a lack of best practices in terms of system instructions, and there's a lot that you can do with better system instructions to have the agent actually follow the system instructions and not go off and hallucinate the weather, for example. And so yeah, I think we have some best practices docs, so I'd recommend going through those. We have an example as well of how to structure your system prompt and put your guardrails in there, your guidelines, and the tool definitions as well. And so once you've built that up, the agent gets a lot better at following the system instructions and staying within those parameters.

Cool, cool. Yes, thanks so much, everyone. I do apologize for the hiccups, but we learned something and we'll improve upon it. But yeah, we'd love for y'all to test it out, and let me know what you find; I'll be around over the next couple of days. Let me know your feedback. Thank you, cheers.
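The last answer recommends structuring the system prompt around a role, guardrails, and tool definitions. A minimal sketch of that structure in plain Python follows. It is not taken from the talk or from the Gemini docs: `ToolSpec`, `build_system_instruction`, the tool names, and the guardrail wording are all hypothetical, and the resulting string would still have to be passed to whichever API you use as its system instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool the agent may call."""
    name: str
    description: str
    when_to_use: str


def build_system_instruction(role: str, guardrails: list[str], tools: list[ToolSpec]) -> str:
    """Assemble role, guardrails, and tool guidance into one prompt string."""
    lines = [f"Role: {role}", "", "Guardrails:"]
    lines += [f"- {rule}" for rule in guardrails]
    lines += ["", "Tools:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.description} Use when: {tool.when_to_use}")
    return "\n".join(lines)


# Illustrative tool names only; they mirror the demo's actions, not real API identifiers.
tools = [
    ToolSpec("google_search", "Looks up current information on the web.",
             "the user asks about live data such as today's weather."),
    ToolSpec("set_background_color", "Changes the page background color.",
             "the user asks to change how the page looks."),
]

prompt = build_system_instruction(
    role="You are a voice assistant embedded in a demo web page.",
    guardrails=[
        "Never state live facts (weather, prices) without calling a tool first.",
        "If a tool call fails, say so plainly instead of guessing.",
    ],
    tools=tools,
)
print(prompt)
```

Keeping the guardrails and the "use when" hints next to each tool is one way to make it explicit when the agent should call a tool rather than answer from memory, which is the failure mode seen in the weather demo.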
