- Generic third-party agentic tools, when used out-of-the-box, can lead to unpredictable agent behavior, suboptimal performance, and significant security risks.
- A five-pronged framework is proposed to optimize the integration of these tools, ensuring agents act reliably and efficiently within specific application contexts.
- The framework empowers developers to tailor tool descriptions, enforce deterministic controls, and strategically compose or directly call tools, effectively "bending" the tools to their will without "breaking" the application.
Bending a Public MCP Server Without Breaking It — Nimrod Hauser, Baz
- Generic tools cause issues: Third-party agentic tools, designed for broad use cases, often come with shallow descriptions and can lead to agents behaving unexpectedly, performing poorly, or creating security vulnerabilities (e.g., data leaks in multi-tenant systems).
- Curate available tools: Filter out irrelevant or potentially harmful tools from the agent's context. Providing fewer, more focused tools reduces cognitive load on the agent and the likelihood of incorrect actions.
- Wrap tools with custom descriptions: Enhance generic tool descriptions with detailed, use-case specific instructions and behavioral guidance. This helps agents make better decisions on when and how to use a tool, improving relevance and performance.
- Implement deterministic guardrails: For mission-critical or sensitive operations, embed non-agentic, deterministic logic around tool invocations. This ensures actions like saving files or accessing specific resources adhere to strict rules, preventing agent "hallucinations" or security breaches.
- Compose new tools from existing ones: Combine or extend existing third-party tools to create specialized tools tailored for specific sub-tasks or contexts. This allows for distinct descriptions and additional pre/post-invocation logic.
- Treat tools as simple functions: For repetitive, non-negotiable parts of a workflow, bypass the agentic decision-making process by deterministically calling third-party tool functions directly. This is useful for complex, mandatory steps like logging in.
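The curating, wrapping, and composing practices above can be sketched in a few lines. This is a minimal illustration, not the speaker's code: `Tool` is a stand-in dataclass rather than a real agent-framework class, and the tool names, exclusion set, and description text are assumptions chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Tool:
    """Stand-in for an agent tool: a callable plus the description the agent reads."""
    name: str
    description: str
    func: Callable[..., str]


# Hypothetical exclusion list and description overrides for one use case.
EXCLUDED = {"browser_resize", "browser_drag"}
ENHANCED = {
    "browser_click": "Click an element. Call browser_snapshot first to see what is on the page.",
}


def curate(tools: list[Tool]) -> list[Tool]:
    """Drop tools the agent should never see."""
    return [t for t in tools if t.name not in EXCLUDED]


def wrap(tools: list[Tool]) -> list[Tool]:
    """Keep each tool's behavior; swap in a tailored description when one exists."""
    return [replace(t, description=ENHANCED.get(t.name, t.description)) for t in tools]


def add_evidence_tool(tools: list[Tool]) -> list[Tool]:
    """Compose a second screenshot tool that reuses the original function."""
    base = next((t for t in tools if t.name == "browser_take_screenshot"), None)
    if base is None:  # screenshot tool was filtered out, nothing to build on
        return tools
    evidence = Tool(
        name="take_evidence_screenshot",
        description="Use only for final evidence. Include the ticket number in the file name.",
        func=base.func,
    )
    return tools + [evidence]
```

Each step takes and returns a plain list, so the steps can be applied in any combination, for example `add_evidence_tool(wrap(curate(vendor_tools)))`.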
Agentic tools — Functions or capabilities provided to an AI agent, often wrapped with descriptions to guide their usage.
MCP server — A server that provides a collection of tools and functionalities, frequently serving as a source for agentic tools. (e.g., Playwright's MCP server).
Non-deterministic — Describes an agent's behavior where the same input does not always produce the same output, leading to unpredictability.
Deterministic guardrails — Explicit, pre-defined logic that restricts or controls an agent's actions in sensitive or critical scenarios, overriding potential agentic misinterpretations.
Context window — The limited amount of information (tokens) an LLM can process at one time, including prompts, instructions, and available tool descriptions.
Hallucination — When an AI agent generates incorrect, fabricated, or nonsensical information, such as a non-existent URL or file path.
Tool description — A natural language explanation associated with an agentic tool, which the agent uses to understand the tool's purpose and decide when and how to invoke it.
Wrapping tools — The practice of enclosing a third-party tool's original functionality with custom logic or an enhanced description to better align its use with a specific application's requirements.
Curating tools — The process of carefully selecting, filtering, and organizing which third-party tools are made available to an agent, often to remove irrelevant or potentially problematic options.
Hi everyone, and welcome to our talk today about bending a public MCP server without breaking it. But today we may have just broken it, because our MCP server seems to have caught on fire. We're glad you're here; we need all the help we can get. Let's go through our talk and see how we can improve whatever is going on right here.
I'm Nimrod Hauser, and I work at Baz. We've been building AI code reviewers for the past few years now, as well as a bunch of other features: anything that can make the lives of people in R&D easier and better, whether they're devs, PMs, or anything else. If it's agentic, we're probably tinkering with it. But let's jump right in and start looking at what's going on with our MCP server. I suspect it's the tools. We're going to talk about third-party tools and why they might blow up our applications.
First of all, it's me, as I said: Nimrod Hauser, a founding engineer at Baz. I've been with the company since it was founded in 2023. I've been in backend and data for the past 20 years or so. In my career I had a brief stint at Salesforce, and ever since it's been mostly startups: cyber, crypto, and now developer tools.
Today I mostly want to talk to you about agentic tools, and specifically third-party tools. They can be a great force, a great addition to our application, but they don't always work out of the box.
We expect them to make our application better. Sometimes we'll see degradation, and we'll try to understand why that happens. After that we'll explore a framework of five best practices that we can follow in order to turn this around and make our application kick ass. Along the way, hopefully we'll fix the MCP server that we just saw, put out that fire, make it work, and make our agents behave the way we want them to. Yeah, this is looking kind of bad. I think we should dive in.
So we're going to talk about agentic tools. When we use MCP servers, we get tools coming from the MCP server. As long as we're talking about third-party tools, I don't care if they're coming from an MCP server or from a library, or maybe we copied and pasted them from somewhere else. If they're agentic tools written by a different team, they're relevant for this discussion.
So what are these tools? Essentially, tools are just callable functions wrapped with a nice description. The description is important because it lets the agent know when to use the code and how to use the code, and we'll dive deep into these aspects of the description. But again, it's kind of like glorified integration code written by a third party. In today's talk we're going to take Playwright's MCP server as an example. So essentially we're looking at integration code written by the good people on the Playwright team, wrapped with their descriptions, and we'll see how we can make these tools work better and tailor them for our use case.
Third-party tools have their challenges. First and foremost, they might cause our agents to behave unexpectedly. Agents are already non-deterministic, unpredictable things. You give them tools and you get unpredictability at scale. But they can also just degrade performance. You might want the agent to do a certain thing and you get subpar results, wrong results, or maybe it does it but in a way that's not optimal. And through today's best practices, hopefully we can see how we
can make the implementation of third-party tools in our agentic workflows much, much better.
Last but not least, that bad performance and unexpected behavior can also mean full-blown security issues. Just imagine a pretty classic scenario: a multi-tenant architecture, where your agent might not know all there is to know about your architecture and its division into folders, or databases and schemas. It just doesn't have the proper guardrails, and it might leak one client's data to another client. You really want to guardrail your agents, and this becomes even more important when dealing with third-party tools that are not aware of your architecture. So we'll cover that as well.
All right, I think we're about ready to look at a use case, and at some code, actually. But we'll need a use case, and with your permission we'll choose one of ours. So today our use case will be Baz's spec reviewer. So what is the spec reviewer?
It's one of our products, which is essentially an agentic reviewer that knows how to compare requirements with implementation. As a first step it needs to collect requirements. It will go to your ticketing system, like Jira or Linear or anything of that nature, and read a ticket. It can also go to Figma and look at visual designs; in a kind of multimodal way of operation, it will actually see the design that is intended. That part is the requirements. Once it understands what a developer was tasked with, it will spin up Playwright's MCP server to actually open up a browser, go into your system, check the branch, see the implementation, and assess whether the implementation meets the requirement. It will give us a verdict and take a snapshot as evidence of whether this was fulfilled or wasn't fulfilled. It does all this automatically and can save people, mostly PMs, a lot of time doing manual validation work.
So we've built a toy example of our spec reviewer, and we're going to see how we handle the tools to get the most out of it. I hope this makes sense at a high level. I think it's time to look at some code, and hopefully everything will be much, much clearer.
All right, so we have a toy example of our spec reviewer. We'll go through it kind of quickly. We don't need to dive into every aspect of it.
It's a pretty small project, and we'll see what's going on and focus on the parts that we care about. We start here with our main function, and we have a directory where we want to save snapshots; we'll get to that later. But right off the bat we have our MCP server configuration. We have just the one: we're using only Playwright's MCP server. This is pretty standard.
As we go into our main function, you can see that we're defining our MCP client, and we're going to use it in just a little bit; we'll put it in our agent. But I want to focus on this, because this is where the magic of this talk happens. We have built a base class for getting the tools, and all it has is one function called get tools. As we go through the talk, we will increase in complexity and improve the way we handle the tools that are coming from our third-party MCP server. So here it is. We're starting with the baseline; we'll look at it in just a second. As we start our session, this is where the inheritance is going to take place: every time we run this, get tools will do something a little bit more advanced.
So we want to start our flow. We have this function called login to Baz, because for this talk our example is going to be logging into our system, and we will talk about why we need this towards the end. There's actually an interesting point here. We will define an LLM and create an agent. We will give it a system message and a human message to kick it off. These are the messages it's going to get, and we will invoke it.
You're probably wondering; maybe you want to see a little bit more under the hood, maybe look at the prompts. This should be relatively straightforward. First, the system prompt.
This is mostly AI generated, saying things like: you are a meticulous QA agent; you need to review requirements from the ticket as well as do visual verification. Everything we talked about at a high level is right here, with some guidelines: first read the ticket, understand it, navigate through the system, and then at the end, like we said, it needs to give us a pass or fail verdict and specific observations, and reference everything with a screenshot for evidence. The human prompt is very similar. It does have a multimodal aspect to it, where we take images and embed them in the human prompt. But these days that's very straightforward, and any coding agent can just whip it out for you if you need it.
Speaking of images, we have two images here. We have a ticket that we took a snapshot of. Our real product doesn't take tickets as snapshots; we were just lazy. But the agent can definitely read this and understand the requirement. There is an accompanying design, which is this one. So the ticket states that we want to have a configuration drawer for our spec reviewer in our system, in Baz. It explains how it should look, and a design is given. So the agent should understand that it's looking for a drawer inside our agents tab for the spec reviewer, and it should look roughly like this. Amazing.
I think it's about time we just fire this up, and hopefully it will make everything so much clearer. We have a breakpoint here right after we get the tools. Almost forgot: our first run is going to be with this v0, the benchmark. What is our benchmark? If we go to our get tools, we see that what we do for v0 is classic out-of-the-box: we just use LangChain's load_mcp_tools method. That is it for the first round. We're not tinkering with tools at all. Let's see how it behaves vanilla.
All right. Okay, so this is starting up and we have our tools. Let's see what we have here.
So right off the bat, the good people at Playwright have given us 21 tools, everything that has to do with manipulating the browser: browser close, browser resize, console messages, handle dialog, file upload, fill form, install, press key. And then we can look at the descriptions. What is the description for a tool called press key? "Press a key on the keyboard." What is the description for something like resize? "Resize the browser window." Browser close? "Close the page." These seem very shallow and very generic, but we don't blame them. The people at Playwright don't know what our specific use case is. This MCP server needs to cater to I don't know how many different use cases; it has to be generic. But for us, using this, and we'll see this going forward, we might want to put in our own descriptions that really are tailored to our use case. But we're not there yet. We're still at the baseline.
So let's just continue, and we will see that this is running. Okay, Playwright is running. It's spinning up a browser, and now it's going to log in. Once it's logged in, the agent is going to take over and start running according to the prompt. And there, it's off to the races. It's logged in. This is our homepage, which is the changes screen, and now it's going to need to find the relevant page, which is the agents tab. So it's going to need to explore the system a little bit. It might work, it might not work; remember, the tools are not optimized at this point. And it's done.
Let's see how it did. Looking at the results, it tells me that the requirement is not implemented; it's a failed verdict. It gives me an observation, and it tells me that the requirement is not met because it navigated to a seemingly made-up page called baz.co slash spec reviewer.
This might be a hallucination, a lapse in judgment on the agent's part, or a bunch of other things, and it gives as evidence a 404 screenshot, which it probably took and which we can probably check out in the screenshots folder. It didn't even manage to take the screenshot properly. So a lot of things went wrong, and this is actually a great outcome for the beginning of a talk whose whole concept is optimizing our use of agentic tools. So let's see what we can do to improve our tools, and we'll run this again and see if we can turn this around.
All right, cool. Our MCP server is already starting to look a little bit better. The fire's put out; it's just this spark now. This is probably because we've gone through some code and we're starting to understand the problem, but we still need to actually implement our improvements and see what can be done to really make the system better.
So it's time to introduce the five concepts that we're going to go over. We're going to look at how we can curate third-party tools; wrap third-party tools with our own descriptions, and perhaps some additional things; add deterministic guardrails whenever we feel it's necessary, and we'll give an example; create new tools out of the existing tools, actually using the existing tools as building blocks; and lastly, there's always the option to treat tools as simple functions, just calling them and using them as that integration code we spoke about, written for us by the good people on the Playwright team. That means taking some parts of the workflow outside of the agentic flow whenever we feel it's necessary.
We'll talk about this towards the end; it's also a tool in our arsenal. I did kind of split these into two buckets: one is more in the realm of context engineering, the other deterministic guardrails. It doesn't really matter at the end of the day; whatever gets our application to work the way we want is what we need to use. So now we'll go over them one by one, looking at code, to see how we can improve the toy example that we just saw, starting with our first point: curating third-party tools. All right, let's see how this one looks.
We're back here at our familiar project, and through the magic of video editing we have now imported v1. It used to be v0 original; it's now v1 curated. The only difference is that now we have this as v1, and this, as we said, is the class that inherits from the base class. It used to have just the vanilla get tools using LangChain's function. Now we can see what we have implemented here. We used to just return this, right? But now we have this big list of all the tool names that we get from our Playwright MCP, and this small list. This is pretty standard stuff in Python, a list comprehension. We just created this list of tools that we want to exclude. We went over them; we know the tools, since we've been using this MCP server for a while. And we decided that for our use case we might not need resizing the browser. We don't want our agent to drag things. We don't want it to run code inside the browser on its own. These are just not things that our spec reviewer needs to do as part of its operations. Maybe for your use case this is needed, but for ours, not so much.
So all we do is get all the tools, and instead of just returning them, we simply exclude the ones that we don't want. There are a handful here that we're simply not going to use. We fire this up, we have our breakpoint, and instead of the 21 tools we used to have, I expect to see fewer. And so we have 16. Amazing. This means our context window
already has fewer tools in it. Our agent has less to choose from, so everything might become simpler. We'll see that not all the guidelines we're going to go through will necessarily remove things from our context window; some will actually add to it. But this is all part of the tradeoff, the juggling act, that we're going to talk about.
Moving on to our next point: the practice of wrapping third-party tools. We talked about how the descriptions, specifically those coming from the Playwright MCP, are super shallow and very, very generic, and that's totally understandable because they need to cater to every possible use case in the world that might want to use the browser. But if you really want to optimize, you might want to start tailoring things for your own use case. Let's see how this happens.
Okay, this is becoming familiar territory by now. As always, through the magic of video editing, we have v2 wrapped imported, so we're wrapping tools this time. Going down, we see that we're calling the v2 class, which will implement get tools, and we'll see what's going on here. If I go to v2 wrapped, I see that, as before, we get all the tools. But now we have this new class called tool wrapper, which has a method that we're calling, wrap playwright tools. Let's see what's going on here. As before, we still have this list of all the tool names. We'll do the filtering a little bit further down. But instead of just the tool names, we also have all these descriptions, so for every tool we can specify what needs to happen. From experience, we have our own points of emphasis that we want to give our agent. We might tell it: before calling this browser tool, first call this other tool. This tool we found to be especially helpful. It has kind of a misleading name: it's called the snapshot tool. It's actually not a visual snapshot.
It's the accessibility snapshot, which shows you all the different buttons and all the different menu items in text. We feel that the agent really gets a good understanding of what is in a page when it calls that tool. So for a bunch of tools we tell it: before calling hover, before calling click, please use this tool first. So we can really affect its behavior. We can make it more eager to choose one tool over another; we can do a bunch of things. For instance, this is the tool I just talked about, the accessibility snapshot. We tell it to always prefer this over taking an actual visual snapshot, which is this tool. So you can really give a lot of guidance from your own experience for your own particular use case, and this is very, very powerful.
Here we have this dictionary, which just maps tool names to their new enhanced descriptions. We still have our tools to filter. At the end we have the function that we called, wrap playwright tools, and it just goes through all the tools that we get from Playwright out of the box. We filter what needs to be filtered, and for the other tools we get our enhanced description based on the tool name, create the tool, and append it to the list of wrapped tools. So we get enhanced tools. What is this method that creates an enhanced tool? Well, it's a method that gets the original tool and the enhanced description, creates a new tool, and returns it. And so what does this amazing new tool do?
Exactly what the old tool did: it just invokes the original tool. It just has an enhanced description. So if we go back to main and run this, and we still have our breakpoint, we can see that we still have fewer tools, like we wanted from before; even fewer, since we filtered a bunch more. But when we look at the descriptions, you see that they're much longer, and they are the ones that we wanted. For example, here is the tool we spoke about, browser snapshot: "Capture an accessibility snapshot of the current page," yada yada yada, all the things we said. If we look at another one, browser click, here's our guideline: first call the other tool, and then call this one. Now our agent knows how we want it to behave. All right, on to the next one.
First of all, our MCP server: I don't know if you can notice, but things are looking even better. Some of the interfaces seem to work, lights are blinking, things are firing, but we're still far from the home stretch. We'll move on to point number three and keep making this better.
Now we're moving into the realm of deterministic guardrails. This is about taking control of sensitive or mission-critical aspects of our tasks with deterministic logic that is not up to agentic decision making. Sometimes there are aspects of your tasks that are just too sensitive to leave in the hands of the agent. We talked before about scenarios like multi-tenant architecture, and scenarios where the agent might not be fully aware of your architecture, things of that nature. Of course you need to specify everything you can in the tool descriptions and the prompts, but sometimes you really want to enforce that it is not doing anything funky. Agents are non-deterministic things, and sometimes they will ignore you. We know of phenomena such as needle in a haystack and lost in the middle, and a lot of instances where agents will just not work as you intended them to. This is where you want to put some
deterministic enforcement. We did this around the tool that takes actual visual snapshots; not the accessibility snapshot we talked about before, but the actual visual snapshots. We had a folder that we defined, and we said: this is the output folder, this is where we want you to put images. But there is a possibility that the agent will go rogue and just store images in other places. So that's where we want to draw the line and make sure this never happens.
Okay, so as always, we have v3 now, which is the one we want to look at. Going back to main, we see we have this here, v3 guardrailed. We dive in and we see that we have again our wrap playwright tools; obviously this time it's going to do something a little bit different, as we increment every time. We still have the names, we still have the descriptions. Going down, by the way, we can already see that apart from the v3 class that we always have, which looks like this, and the tool wrapper, which we had before, we now have another class called path validation. Let's see where we use it. We're going down, past the dictionary, past the tools to filter, and we're in the method that we always use, wrap playwright tools. As before, it goes over all the original tools we got from our Playwright MCP, filters what needs to be filtered, and gives the enhanced description for each of the tools if we find it. And then, as before, we have the same helper function, create playwright tool wrapper, the function that takes a tool, gives it an enhanced description, and creates a new tool out of it: same functionality, new description. Let's see what's changed now. So, when we want to create the new tool, right?
So this is the tool we're creating. As we said, a tool is just a callable function with some description, and we give it the new description. Before, we had this part, because we said: what does the new tool do? Exactly what the old one did. But we added this part. We're saying: if the tool that is now being activated is the take screenshot tool (and we've researched the tool and we know that under the hood it uses either path or file name as keywords, at least the relevant keywords for us), so if you're trying to invoke the tool and it is this tool, then find these keywords and validate them. We have some helper functions in this path validation class. We don't need to go too much into them; it's just deterministic logic where we take the path that the tool chose and we take our path where we want to enforce things being stored, which we call the screenshots root. And we just use this method: we want to know if the chosen path is relative to the screenshots root. So it is a deterministic way: once we understand where the tool intends to store the image, we know if it is inside the right folder.
So going back to our helper function: every time the tool is invoked, before it actually gets invoked, we extract the path it intends to save to. We try to validate it, and if it is not a valid path, it will not reach invocation; it will raise an error. So this is a deterministic way to stop the tool right in its tracks, a deterministic guardrail.
Another nice thing to note is that although we raise an error, we don't let an exception kill the run: we don't want the whole agentic process to fail. Instead we handle this nicely by creating an agent-facing explanation, and what the agent gets back is a message saying that access is denied: you can't save it there, you need to save it here.
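A minimal sketch of the path check described here, not the code shown in the talk. It assumes Python 3.9+ for `Path.is_relative_to`; the `screenshots` folder, the function names, and the `path`/`filename` argument keys are illustrative stand-ins for whatever the real screenshot tool accepts.

```python
from pathlib import Path
from typing import Callable

# Hypothetical output folder; the talk calls its equivalent the screenshots root.
SCREENSHOTS_ROOT = Path("screenshots").resolve()


def validate_screenshot_path(requested: str, root: Path = SCREENSHOTS_ROOT) -> Path:
    """Resolve the requested path against root and reject anything that escapes it."""
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"{requested!r} is outside {root}.")
    return candidate


def guarded_screenshot(take_screenshot: Callable[..., str], **kwargs) -> str:
    """Validate the path argument before the original tool ever runs."""
    key = next((k for k in ("path", "filename") if k in kwargs), None)
    if key is not None:
        try:
            kwargs[key] = str(validate_screenshot_path(kwargs[key]))
        except ValueError as err:
            # Return an agent-facing message instead of crashing the run,
            # so the agent can retry with a corrected path.
            return f"Access denied: {err} Save under {SCREENSHOTS_ROOT} and try again."
    return take_screenshot(**kwargs)
```

Resolving the path before comparing means `..` segments and absolute paths are judged by where they actually point, not by how they are spelled.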
It also asks for a proper file name and a proper path. An agent that gets this message is most likely to just try again, but aligned. This is our way of aligning it: it will try again, give a correct path, manage to get through this validation, get here, and save the image. That's exactly what we want to happen in these mission-sensitive, security-related scenarios. So that is amazing. Personally, I love this point. I think it is so common, and it's a tool everybody must have in their arsenal. But it is time to go forward. We have two more.
This one might be a little bit more niche, more on the advanced side, but it is worth checking out and knowing that it is a possibility. This is about composing new tools from existing tools, and the place where we did that was interesting: it also had to do with the same take screenshot tool. We felt that sometimes we want to take screenshots in general; it's kind of a generic thing. But when we take screenshots in the context of evidence at the end of the flow, we felt that maybe having a separate tool for that was in order. Why? We can create a new tool whose functionality is essentially the same as the regular take screenshot tool, but because it will have a separate description, the agent can choose either this or that, and we can tell it to behave slightly differently in each case. We can also give it some additional actions to do, maybe even deterministic actions, just before invoking the original tool. So we created a new tool called the evidence tool. Let's take a look at what we did there.
All right, this time around we have v4 imported, as we would expect by this point. So we can take a look at v4 and see how it implements get tools. We go in, and this looks very similar to before; actually, everything looks similar. We have tool wrapper, we have path validation, very similar. What's the difference?
Well, we inject a new tool. Inside tool wrapper we have what we have come to expect by now: the tool names, the tool descriptions, the dictionary, our tools to filter. Everything's the same. And then we have the same old wrap playwright tools, which iterates over all the tools, filters what needs to be filtered, and creates any tools that we need to create with enhanced descriptions. So this is what we used to have. We finish the loop, and then we have this part, injecting a brand new tool. We chose, and you don't have to go the same route, to say that because this builds on the screenshot tool, we only create this new tool if the screenshot tool has not been filtered out. Totally optional; it's just what we chose to do in this particular use case. And so we have our new description for this new tool, and we create a new tool.
Let's take a look at the description. It's what you might expect from a description of this kind of tool: take a screenshot specifically for the purposes of evidence. And because the prompt, when it describes the flow, says that it ends with taking snapshots for evidence, the agent will probably know to choose this tool over the regular screenshot tool. We tell it to use this only when capturing things for evidence, and we also specify how to go about capturing evidence. For example, when it creates an image and wants to store it, we want it to identify the relevant ticket and include the ticket number in the file name being saved. So you can see a nice example of how we're going to have two tools. The agent will know to differentiate when to use this one and when to use the other one, specifically in the context of evidence taking. And when it chooses that tool in the context of evidence taking, it will have different considerations when choosing the image name, for example. And you can do a bunch of other things; you can add
guardrails or deterministic actions that are specific to this tool, for example. So really, the sky's the limit, the world is your oyster, and knock yourselves out. So that was this one. Moving on to our fifth and final point.
Look at this, our server: all the fire's out, the system's booting, it looks well. And still we have one final point, which we actually touched on in the beginning of the talk. This is treating tools as deterministic functions, or callable functions. Sometimes we get all these wonderful functions, all this code, from the people on these third-party teams, in our case the Playwright team. They gave us all this really nice code, and we can just call this code outside of the agentic flow: just disregard the description and use the function they gave us. Where do we use that? I don't know if you remember, but when we first looked at the code we had this function called login to Baz, and we said we'd get back to that. This is us closing that loop. Let's take a look.
Okay, back to the top of our main.py file. We go down, down, down. Here is the imported v4, but now we're interested in something else, which is this part. When we just start out, we define the MCP client, we define the class that implements get tools, we get our tools, and the first thing we do is call this deterministic function called login to Baz. Only after we log in to Baz do we actually create an agent, and then we give it all the messages it needs and we invoke it. So why is the login deterministic here? Well, it seems that login is kind of tricky. This is a toy example again, but in a real product we need to log into our system.
We need to log into client systems, and each client might have a different login mechanism. They can be tricky, they can have secrets, and a lot of things can make this complicated. So on the one hand it's complicated, and on the other hand it is an action that we will always want to take: there is no agentic flow for spec reviewer that does not begin with a login. So we do use tools we got from the MCP server to achieve this; we're just not letting the agent try to do it, because we saw it gave us subpar results. For these very specific, niche use cases, sometimes you might want to take matters into your own hands. If we go into the function, there's not much going on. Again, this is a toy example, and the real product behaves very differently, but here I just hid some JWT tokens as environment variables. This function accepts all the tools from the MCP server, and we pluck the ones we want. The gist of it is that we inject these JWT tokens into the browser's local storage, and then when we click the login button we just log in, like magic. So we do this deterministically, and once we have logged in we hand the reins to the agent. We say: you're logged in, you're off to the races. That's how we unburden the agent from this somewhat clunky action, which we can take off its hands without bothering it or its context. Cool, so that simplifies things, at least for us. Oh, I almost forgot, I still owe you one last thing. I think we should fire this thing up now with all the improvements and see where we land. We failed on the previous run, didn't we? Okay, let's see how it works. Going back to the familiar code base, our main loop, we run this. We no longer need the breakpoint. I think I took it out.
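The deterministic login described above could look roughly like the following. This is only a sketch under stated assumptions, not the code shown in the talk: the tools are modeled as plain synchronous callables keyed by name, and the tool names, their keyword arguments (`url`, `function`, `element`), the `DEMO_JWT` environment variable, the `token` storage key, and the example URL are all hypothetical placeholders.

```python
import json
import os
from typing import Any, Callable, Mapping


def login_directly(
    tools: Mapping[str, Callable[..., Any]],
    token_env: str = "DEMO_JWT",  # hypothetical environment variable name
) -> None:
    """Log in without involving the agent by calling third-party tools directly."""
    # Pluck only the tools this fixed, mandatory step needs (assumed names).
    navigate = tools["browser_navigate"]
    evaluate = tools["browser_evaluate"]
    click = tools["browser_click"]

    token = os.environ[token_env]

    # Fixed sequence: open the app, place the token in local storage, click login.
    navigate(url="https://app.example.test/login")  # placeholder URL
    evaluate(function=f"() => localStorage.setItem('token', {json.dumps(token)})")
    click(element="Login button")
```

Only after a function like this returns would the agent be created and invoked, so the login never consumes the agent's context or depends on its judgment.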
Yeah, and this is just running. Okay, so this is going to start by doing all the JWT magic in the background, which we now know how it works. Let's see how it goes. I'll take the time to remind you that we gave it a ticket and a design, and it needs to see this drawer. That's what it's actually doing now. Okay, so it kind of reloaded. I think maybe the JWT tokens have kicked in. It's going into our home page, called changes, and it's probably going to do a little bit of reasoning, and it's going to need to find a way to navigate through the system to the agents tab. It doesn't always show it; it has a lot of latency and lag, at least in our experience. But it does take a screenshot as evidence. It finished. Let's see what it did. It says that this passed. It claims that the configuration is present and includes the sections that need to be there. It has a screenshot, and the screenshot has a file name that includes the ticket, like we said. It has reasoning. Let's take a look at the screenshot. You see, the image we gave it was in dark mode and it was navigating in light mode, and it took the screenshot and it looks great. It looks like it's done and it's correct. Now, I will say that it did see a configuration drawer, and it did see everything that's supposed to be in it. It might not be pixel perfect, but pixel-perfect verification is something that we worked very hard on in our real product. So outside of this toy example, pixel perfect does work, because it is super important for front-end validation: making sure the padding and the margins and all the styling are just as they need to be. Amazing. So with that done, it's sad, but it's almost goodbye time. I think this is where we start to get ready to wrap up. Look at our server, look at it. It's amazing, it's beautiful, it's working: green lights are blinking, engines roaring, GPUs churning, millions of spec reviewers being executed, and a great acceptance rate.
Thank you for all your help. Let's do one final summary before we go our separate ways. As we said, we looked at agentic tools today, whether they come from libraries, MCP servers, or any other place. We saw that sometimes they will fail out of the box. They might be very generic, they might not be tailored enough for our use case, and tailoring them is what's going to make our agents pop. Sometimes we wanted to curate the tools and ease the load on the context window. Sometimes we wanted long descriptions and more verbose tools. So there's no one size fits all; it's mainly a question of how to mold the tools to best fit my use case. Sometimes they'll be deterministic, sometimes they'll be flexible. It depends, and you're going to need to tinker with it, make it your own, and strive for the best possible setup you can to achieve your goals. I hope this has given you a few pointers, things that you can try out. This has been amazing and great fun. I'm going to move up here now because I have a shameless plug, after I thank you for listening. Always feel free to reach out. I will see you guys in the next one. Cheers.
TL;DR
- Generic third-party agentic tools, when used out-of-the-box, can lead to unpredictable agent behavior, suboptimal performance, and significant security risks.
- A five-pronged framework is proposed to optimize the integration of these tools, ensuring agents act reliably and efficiently within specific application contexts.
- The framework empowers developers to tailor tool descriptions, enforce deterministic controls, and strategically compose or directly call tools, effectively "bending" the tools to their will without "breaking" the application.
Takeaways
- Generic tools cause issues: Third-party agentic tools, designed for broad use cases, often come with shallow descriptions and can lead to agents behaving unexpectedly, performing poorly, or creating security vulnerabilities (e.g., data leaks in multi-tenant systems).
- Curate available tools: Filter out irrelevant or potentially harmful tools from the agent's context. Providing fewer, more focused tools reduces cognitive load on the agent and the likelihood of incorrect actions.
- Wrap tools with custom descriptions: Enhance generic tool descriptions with detailed, use-case specific instructions and behavioral guidance. This helps agents make better decisions on when and how to use a tool, improving relevance and performance.
- Implement deterministic guardrails: For mission-critical or sensitive operations, embed non-agentic, deterministic logic around tool invocations. This ensures actions like saving files or accessing specific resources adhere to strict rules, preventing agent "hallucinations" or security breaches.
- Compose new tools from existing ones: Combine or extend existing third-party tools to create specialized tools tailored for specific sub-tasks or contexts. This allows for distinct descriptions and additional pre/post-invocation logic.
- Treat tools as simple functions: For repetitive, non-negotiable parts of a workflow, bypass the agentic decision-making process by deterministically calling third-party tool functions directly. This is useful for complex, mandatory steps like logging in.
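As one illustration of the "compose new tools" takeaway, the sketch below builds an evidence-screenshot tool on top of an existing screenshot tool. It is a minimal sketch, not the speaker's code: `Tool` is a hypothetical stand-in for whatever tool type an agent framework provides, and the tool names and description text are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for an agent tool: a callable plus a description."""
    name: str
    description: str
    func: Callable[..., Any]


# Assumed wording; the real description would be tuned to the application.
EVIDENCE_DESCRIPTION = (
    "Take a screenshot specifically as evidence for the final verdict. "
    "Use this only when capturing evidence. Identify the relevant ticket and "
    "include the ticket number in the file name."
)


def add_evidence_tool(
    tools: list[Tool], base_name: str = "browser_take_screenshot"
) -> list[Tool]:
    """Append an evidence tool, but only if its base tool was not filtered out."""
    base = next((t for t in tools if t.name == base_name), None)
    if base is None:
        return list(tools)

    def take_evidence(**kwargs: Any) -> Any:
        # Same behavior as the base tool; pre- or post-invocation logic
        # specific to evidence capture could be added here.
        return base.func(**kwargs)

    evidence = Tool("take_evidence_screenshot", EVIDENCE_DESCRIPTION, take_evidence)
    return [*tools, evidence]
```

The agent then sees two tools with distinct descriptions, so the choice between a generic screenshot and an evidence screenshot is guided by wording rather than left to chance.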
Vocabulary
Agentic tools — Functions or capabilities provided to an AI agent, often wrapped with descriptions to guide their usage.
MCP server — A server implementing the Model Context Protocol that exposes a collection of tools and functionalities to agents, frequently serving as a source for agentic tools (e.g., Playwright's MCP server).
Non-deterministic — Describes an agent's behavior where the same input does not always produce the same output, leading to unpredictability.
Deterministic guardrails — Explicit, pre-defined logic that restricts or controls an agent's actions in sensitive or critical scenarios, overriding potential agentic misinterpretations.
Context window — The limited amount of information (tokens) an LLM can process at one time, including prompts, instructions, and available tool descriptions.
Hallucination — When an AI agent generates incorrect, fabricated, or nonsensical information, such as a non-existent URL or file path.
Tool description — A natural language explanation associated with an agentic tool, which the agent uses to understand the tool's purpose and decide when and how to invoke it.
Wrapping tools — The practice of enclosing a third-party tool's original functionality with custom logic or an enhanced description to better align its use with a specific application's requirements.
Curating tools — The process of carefully selecting, filtering, and organizing which third-party tools are made available to an agent, often to remove irrelevant or potentially problematic options.
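The "curating tools" and "wrapping tools" entries above can be sketched in a few lines. This is a minimal sketch under assumptions rather than the code from the talk: `Tool` is a hypothetical stand-in for a framework's tool type, and the excluded tool names and the enhanced description are illustrative guesses.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for an agent tool: a callable plus a description."""
    name: str
    description: str
    func: Callable[..., Any]


# Assumed names for tools this use case does not need
# (resizing, dragging, running code in the browser).
TOOLS_TO_FILTER = {"browser_resize", "browser_drag", "browser_evaluate"}

# Use-case specific guidance keyed by tool name (illustrative wording).
ENHANCED_DESCRIPTIONS = {
    "browser_click": (
        "Click an element on the page. First call browser_snapshot to get "
        "the accessibility snapshot, then click."
    ),
}


def curate(tools: list[Tool]) -> list[Tool]:
    """Keep only the tools that are not on the exclusion list."""
    return [t for t in tools if t.name not in TOOLS_TO_FILTER]


def wrap(tools: list[Tool]) -> list[Tool]:
    """Swap in an enhanced description where one exists; behavior is unchanged."""
    return [
        replace(t, description=ENHANCED_DESCRIPTIONS.get(t.name, t.description))
        for t in tools
    ]
```

Curation shrinks what the agent has to choose from, while wrapping lengthens the descriptions that remain, so the two pull the context budget in opposite directions.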
Transcript
Hi everyone, and welcome to our talk today about bending a public MCP server without breaking it. But today we may have just broken it, because our MCP server seems to have caught on fire. We're glad you're here; we need all the help we can get. Let's go through our talk and see how we can improve whatever is going on right here. I'm Hauser, and I work at Baz. We've been building AI code reviewers for the past few years now, as well as a bunch of other features: anything that can make the lives of people in R&D easier and better, whether they're devs, PMs, or anything else. If it's agentic, we're probably tinkering with it. But let's jump right in and start looking at what's going on with our MCP server. I suspect it's the tools. We're going to talk about third-party tools and why they might blow up our applications. First of all, it's me, as I said: Nimrod Hauser, a founding engineer at Baz. I've been with the company since it was founded in 2023. I've been in backend and data for the past 20 years or so. In my career I had a brief stint at Salesforce, and ever since it's been mostly startups: cyber, crypto, and now developer tools. Now, mostly I want to talk to you guys about agentic tools, and specifically third-party tools. They can be a great force, a great addition to our application, but they don't always work out of the box.
We expect them to make our application better. Sometimes we'll see degradation, and we'll try to understand why that happens. After that we'll explore a framework of five best practices that we can follow in order to turn this around and make our application kick-ass. Along the way, hopefully we'll fix the MCP server that we just saw, put out that fire, make it work, and make our agents behave the way we want them to. Yeah, this is looking kind of bad. I think we should dive in. So we're going to talk about agentic tools. When we use MCP servers, we get tools coming from the MCP server. But as long as we're talking about third-party tools, I don't care if they're coming from an MCP server, from a library, or maybe we copied and pasted them from somewhere else. If they're agentic tools written by a different team, they're relevant for this discussion. So what are these tools? Essentially, tools are just callable functions wrapped with a nice description. The description is important because it lets the agent know when to use the code and how to use the code, and we'll dive deep into these aspects of the description. But again, it's kind of like glorified integration code written by a third party. In today's talk we're going to take Playwright's MCP server as an example. So essentially we're looking at integration code written by the good people on the Playwright team, wrapped with their descriptions, and we'll see how we can make these tools work better and tailor them for our use case. So, third-party tools have their challenges. First and foremost, they might cause our agents to behave unexpectedly. Agents are already non-deterministic, unpredictable things. You give them tools and you get unpredictability at scale. But they can also just degrade performance. You might want the agent to do a certain thing and you get subpar results, wrong results, or maybe it does it but in a way that's not optimal. And through today's best practices, hopefully we can see how we
can make the implementation of third-party tools in our agentic workflows much, much better. Last but not least, that bad performance and unexpected behavior can also mean full-blown security issues. Just imagine a pretty classic scenario: a multi-tenant architecture, where your agent might not know all there is to know about your architecture and its division into folders, databases, and schemas. It just doesn't have the proper guardrails, and it might leak one client's data to another client. Things of that nature. You really want to guardrail your agents, and this becomes even more important when dealing with third-party tools, which are not aware of your architecture. So we'll cover that as well. All right, I think we're about ready to look at a use case, and at some code, actually. But we'll need a use case, and with your permission we'll choose one of ours. So today our use case will be Baz's spec reviewer. So what is the spec reviewer?
It's one of our products, which is essentially an agentic reviewer that knows how to compare requirements with implementation. As a first step it needs to collect requirements. It will go to your ticketing system, like Jira or Linear or anything of that nature, and read a ticket. It can also go to Figma and look at visual designs. In a multimodal way of operating, it will actually see the design that is intended. That part is the requirements. Once it understands what a developer was tasked with, it will spin up Playwright's MCP server to actually open up a browser, go into your system, check out the branch, and see the implementation, and it will need to assess whether the implementation meets the requirement. It will give us a verdict and take a snapshot as evidence of whether this was fulfilled or not. It does all this automatically and can save people, mostly PMs, a lot of time doing manual validation work. So we've built a toy example of our spec reviewer, and we're going to see how we handle the tools to get the most out of it. I hope this makes sense at a high level. I think it's time to look at some code, and hopefully everything will be much clearer. All right, so we have a toy example of our spec reviewer. We'll go through it kind of quickly. We don't need to dive into every aspect of it.
It's a pretty small project, and we'll see what's going on and focus on the parts that we care about. We start here with our main function, and we have a directory where we want to save snapshots; we'll get to that later. But right off the bat we have our MCP server configuration. We have just the one: we're using only Playwright's MCP server. This is pretty standard. As we go into our main function, you can see that we're defining our MCP client, and we're going to use it in just a little bit when we put it in our agent. But I want to focus on this; this is where the magic of this talk happens. We have built a base class for getting the tools, and it has one function, called get tools. As we go through the talk, we will increase in complexity and improve the way we handle the tools that are coming from our third-party MCP server. So here it is. We're starting with the baseline; we'll look at it in just a second. As we start our session, this is where the inheritance is going to take place: every time we run this, get tools will do something a little bit more advanced. To start our flow, we have this function called login to Baz, because for this talk our example is going to be logging into our system. We will talk about why we need this towards the end; there's actually an interesting point here. We will define an LLM and create an agent. We will give it a system message and a human message to kick it off. These are the messages it's going to get, and then we will invoke it. You're probably wondering; maybe you want to see a little bit more under the hood, maybe look at the prompts. So this should be relatively straightforward, you know: system prompt.
This is mostly AI generated, saying things like: you are a meticulous QA agent; you need to review requirements from the ticket as well as perform visual verification. Everything we talked about at a high level is right here. Some guidelines: first read the ticket, understand it, navigate through the system, and then at the end, like we said, give us a pass or fail verdict and specific observations, and reference everything with a screenshot for evidence. The human prompt is very similar. It does have a multimodal aspect to it, where we take images and embed them in the human prompt. But these days that's very straightforward, and any coding agent can whip that out for you if you need it. Speaking of images, we have two images here. We have a ticket that we took a snapshot of. Our real product doesn't take tickets as snapshots; we were just lazy. But the agent can definitely read this and understand the requirement. There is an accompanying design, which is this one. The ticket states that we want to have a configuration drawer for our spec reviewer in our system, in Baz. It explains how it should look, and a design is given. So the agent should understand that it's looking for a drawer inside our agents tab, for spec reviewer, and it should look roughly like this. Amazing. I think it's about time we fire this up, and hopefully it will make everything much clearer. We have a breakpoint here, right after we get the tools. I almost forgot: our first run is going to be with this v0, the benchmark. What is our benchmark? If we go to our get tools, we see that what we do for v0 is classic out of the box. We just use LangChain's load_mcp_tools method. That is it for the first round; we're not tinkering with tools at all. Let's see how it behaves, vanilla. All right. Okay, so this is starting up and we have our tools. Let's see what we have here.
Right off the bat, the good people at Playwright have given us 21 tools, covering everything that has to do with manipulating the browser: browser close, browser resize, console messages, handle dialog, file upload, fill form, install, press key. And then we can look at the descriptions. What is the description for a tool called press key? Press a key on the keyboard. What is the description for something like resize? Resize the browser window. Browser close? Close the page. These seem very shallow and very generic, but we don't blame them. The people at Playwright don't know what our specific use case is. This MCP server needs to cater to I don't know how many different use cases, so it has to be generic. But for us, and we'll see this going forward, we might want to put in our own descriptions that are really tailored to our use case. But we're not there yet; we're still at the baseline. So let's just continue, and we will see that this is running. Okay. So Playwright is running, it's spinning up a browser, and now it's going to log in. Once it's logged in, the agent is going to take over and start running according to the prompt. And there, it's off to the races. It's logged in, and this is our homepage, which is the changes screen, and now it's going to need to find the relevant page, which is the agents tab. So it's going to need to explore the system a little bit. It might work, it might not work; remember, the tools are not optimized at this point. And it's done. Let's see how it did. Looking at the results, it tells me that the requirement is not implemented; it's a failed verdict. It gives me an observation, and it tells me that the requirement is not met because it tried to navigate to a seemingly made-up page called baz.co slash spec reviewer.
This might be a hallucination, a lapse in judgment on the agent's part, or a bunch of other things. And it gives as evidence a screenshot of a 404, which it probably took and which we can probably check out in a screenshots folder. It didn't even manage to take the screenshot properly. So a lot of things went wrong, and this is actually a great outcome for the beginning of a talk whose whole concept is optimizing our use of agentic tools. So let's see what we can do to improve our tools, and we'll run this again and see if we can turn this around. All right, cool. Our MCP server is already starting to look a little bit better. The fire's put out; it's just this spark now. This is probably because we've gone through some code and we're starting to understand the problem, but we still need to start actually implementing our improvements and see what can be done to really make the system better. So it's time to introduce the five concepts that we're going to go over. We're going to look at how we can curate third-party tools; wrap third-party tools with our own descriptions, and perhaps some additional things; add deterministic guardrails whenever we feel it's necessary, and we'll give an example; create new tools out of the existing tools, actually using the existing tools as building blocks; and lastly, there's always the option to treat tools as simple functions, just calling them, using them as that integration code we spoke about, written for us by the good people on the Playwright team. You know, taking some parts of the workflow outside of the agentic flow whenever we feel it's necessary.
We'll talk about this towards the end, so it's also a tool in our arsenal. I did kind of split these into two buckets: one is more in the realm of context engineering, the other deterministic guardrails. It doesn't really matter at the end of the day; whatever gets our application to work the way we want is what we need to use. So now we'll go over them one by one, looking at code to see how we can improve the toy example that we just saw, starting with our first point: curating third-party tools. All right, let's see how this one looks. We're back at our familiar project, and through the magic of video editing we have now imported v1. It used to be v0, original; now it's v1, curated. The only difference, like we've seen, is that now we have this as v1, and this, as we said, is the class that inherits from the base class. It used to have just get tools, vanilla, using LangChain's function. Now we can see what we have implemented here. We go in, and we used to just return this, right? But now we have this big list of all the tool names that we get from our Playwright MCP, and this smaller list. This is pretty standard stuff in Python, a list comprehension. We just created this list of tools that we want to exclude. We went over them, and we know the tools; we've been using this MCP server for a while. We decided that for our use case we might not need resizing the browser. We don't want our agent to drag things. We don't want it to run code inside the browser on its own. These are just not things that our spec reviewer needs to do as part of its operations. Maybe for your use case this is needed, but for ours, not so much. So all we do is get all the tools, and instead of just returning them, we simply exclude the ones that we don't want. There are a bunch here that we're simply not going to use. We fire this up, we have our breakpoint, and instead of the 21 tools we used to have, I expect to see fewer. And so we have 16. Amazing. So this means our context window
already has fewer tools in it. Our agent has less to choose from, so everything might become simpler. We'll see that not all the guidelines that we're going to go through will necessarily remove stuff from our context window; some will actually add to it. But this is all part of the trade-off, the juggling act, that we're going to talk about. Moving on to our next point: the practice of wrapping third-party tools. This is amazing. We talked about how the descriptions, specifically those coming from the Playwright MCP, are super shallow and very, very generic, and that it's totally understandable, because they need to cater to every possible use case in the world that might want to use the browser. But if you really want to optimize, you might want to start tailoring stuff for your own use case. Let's see how this happens. Okay, this is becoming familiar territory by now. As always, through the magic of video editing, we have v2 imported: wrapped. So we're wrapping tools this time. Going down, we see that we're calling the v2 class, which will implement get tools, and we'll see what's going on here. If I go to v2 wrapped, I see that, as before, we get all the tools. But now we have this new class called tool wrapper, which has a method that we're calling, wrap playwright tools. Let's see what's going on here. As before, we still have this list of all the tool names. We'll do the filtering a little bit further down. But instead of just the tool names, we also have all these descriptions, and so for every tool we want to specify what needs to happen. From experience, we have our own little emphases that we want to give our agent. We might tell it, you know, before calling this browser tool, first call this other tool. This tool we found to be especially helpful. It has kind of a misleading name: it's called the snapshot tool. It's actually not a visual snapshot.
It's the accessibility snapshot, which shows you all the different buttons and all the different menu items as text. We feel that the agent really gets a good understanding of what is in a page when it calls that tool. So for a bunch of tools we tell it, you know, before calling hover, before calling click, please use this tool first. So we can really affect its behavior. We can make it more eager to choose one tool over the other; we can do a bunch of things. For instance, this is the tool I just talked about, the accessibility snapshot. We tell it to always prefer this over taking an actual screenshot, which is this tool. So you can really give a lot of guidance from your own experience, for your own particular use case, and this is very, very powerful. In here we have this dictionary, which just maps tool names to their new enhanced descriptions. We still have our tools to filter. At the end we have the function that we called, wrap playwright tools, and it just goes through all the tools that we get from Playwright out of the box. We filter what needs to be filtered, and for the other tools we get our enhanced description based on the tool name, create this tool, and append it to the list of wrapped tools. So we get enhanced tools. What is this method that creates an enhanced tool? Well, it's a method that gets the original tool and the enhanced description, creates a new tool, and returns it. And so what does this amazing new tool do?
Exactly what the old tool did. It just invokes the original tool; it just has an enhanced description. So if we go back to main and run this, and we still have our breakpoint, we can see that we still have fewer tools, like we wanted from before, even fewer, since we filtered a bunch more. But when we look at the descriptions, you see that they're much longer, and they are the ones that we wanted. For example, here is the tool we spoke about, browser snapshot: capture an accessibility snapshot of the current page, yada yada yada, all the things we said. If we look at another one, browser click, here's our guideline: first call the other tool and then call this one. Now our agent knows how we want it to behave. All right, on to the next one. First of all, our MCP server. I don't know if you can notice, but things are looking even better. Some of the interfaces seem to work, lights are blinking, things are firing, but we're still far from the home stretch. We'll move on to point number three and keep making this better. Now we're moving into the realm of deterministic guardrails: taking control of sensitive or mission-critical aspects of our tasks with deterministic logic that is not up to agentic decision making. Sometimes there are aspects of your tasks that are just too sensitive to leave in the hands of the agent. We talked before about scenarios like multi-tenant architecture, and scenarios where the agent might not be fully aware of your architecture, things of that nature. And of course you need to specify everything you can in the tool descriptions and the prompts, but sometimes you really want to enforce that it is not doing anything funky. You know, agents are non-deterministic things and sometimes they will ignore you. We know of all these phenomena, such as needle in a haystack and lost in the middle, and a lot of instances where agents will just not work as you intended them to. This is where you want to put some
deterministic enforcement. We did this around the tool that takes actual visual screenshots; not the accessibility snapshot we talked about before, but the actual visual screenshots. We had a folder that we defined, and we said this is the output folder, this is where we want you to put images. But there is a possibility that the agent will go rogue and store images in other places. So that's where we want to draw the line and make sure this never happens. Okay. As always, we have v3 now, which is the one we want to look at. Going back to main, we see we have this here: v3, guardrailed. We dive in and we see that we have again our wrap playwright tools; obviously this time it's going to do something a little bit different, as we increment every time. So we still have the names, we still have the descriptions, and going down, by the way, we can already see that apart from that v3 class that we always have, which looks like this, and the tool wrapper, which we had before, we now have another class called path validation. Let's see where we use it. So we're going down, past the dictionary, past the tools to filter, and we're in the method we always end up in, wrap playwright tools. As before, wrap playwright tools goes over all the original tools we got from our Playwright MCP, filters what needs to be filtered, and gives the enhanced description for each of the tools if we find it. And then, as before, we have the same helper function, create playwright tool wrapper, the function that takes a tool, gives it an enhanced description, and creates a new tool out of it: same functionality, new description. Let's see what's changed now. So when we want to create the new tool, right?
So this is the tool we're creating. As we said, a tool is just a callable function with some description. We give the new description, and before, we had this part, because we said: what does the new tool do? Exactly what the old one did. But we added this part. We're saying: if the tool that is now being activated is the take screenshot tool, and we've researched the tool and we know that under the hood it will use either path or file name as keywords, at least the relevant keywords for us, so if you're trying to invoke the tool and it is this tool, then find these keywords and validate them. We have some helper functions in this path validation class. These are just helper functions; we don't need to go too much into them. But it's just deterministic logic, where we take the path that the tool chose and we take our path where we want to enforce things being stored; we call it the screenshots root. And we just use this method: we want to know if the chosen path is relative to the screenshots root. So it is a deterministic way, once we understand where the tool intends to store the image, to know whether it is inside the right folder. Going back to our helper function: every time the tool is invoked, before it actually gets invoked, we extract the path it intends to be invoked with. We try to validate it, and if it is not a valid path, it will not reach invocation; it will raise an error. So this is a deterministic way to stop the tool right in its tracks, a deterministic guardrail. Now, another nice thing to note is that although the validation may raise an exception, we don't let that exception propagate; we don't want the whole agentic process to fail. Instead we handle this nicely by creating a very nice agent-facing explanation, and what the agent will get back is this nice message saying: listen, access is denied. You can't save it there. You need to save it here. Please provide a proper file name and proper path.
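The guardrail just described can be sketched as a small wrapper around the screenshot function. This is a hedged sketch, not the code from the demo: the `screenshots` folder, the `path` and `filename` keyword names, and the wording of the message are assumptions, and the check relies on `pathlib.Path.is_relative_to`, which requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Any, Callable

# Assumed output folder; every screenshot must end up inside it.
SCREENSHOTS_ROOT = Path("screenshots").resolve()


def is_inside_root(candidate: str, root: Path = SCREENSHOTS_ROOT) -> bool:
    """Return True if the path, once resolved, stays inside the root folder."""
    # Joining with an absolute candidate yields the candidate itself, and
    # resolve() collapses any ".." segments before the containment check.
    resolved = (root / candidate).resolve()
    return resolved.is_relative_to(root)


def guard_screenshot(original: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a screenshot function with a deterministic path check."""

    def guarded(**kwargs: Any) -> Any:
        for key in ("path", "filename"):  # assumed keyword names
            value = kwargs.get(key)
            if value is not None and not is_inside_root(str(value)):
                # Return an agent-facing message instead of raising, so the
                # agent can retry rather than the whole run failing.
                return (
                    f"Access denied: '{value}' is outside {SCREENSHOTS_ROOT}. "
                    "Please retry with a file name inside that folder."
                )
        return original(**kwargs)

    return guarded
```

Returning a string keeps the failure inside the conversation: the original tool is never invoked with a bad path, and the agent receives concrete guidance for its next attempt.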
The message also asks it to please provide a proper file name and a proper path. An agent that gets this message is most likely to just try again, but aligned. This is our way of aligning it: it will try again, give a correct path, manage to get past this check, get here, and save the image. So that's exactly what we want to happen in these very mission-sensitive, security-related scenarios. So that is amazing.

OK. Personally, I love this point. I think it is so common, and it's a tool everybody must have in their arsenal. But it is time to go forward. We have two more.

This one might be a little bit more niche, more on the advanced side, but it is worth checking out and knowing that it is a possibility. This talks about composing new tools from existing tools. The place where we did that was interesting: it also had to do with the same take-screenshot tool. We felt that sometimes we want to take screenshots in general; it's kind of a generic thing. But when we take screenshots in the context of evidence, at the end of the flow, we felt that maybe having a separate tool for that was in order. Why? We can create a new tool whose functionality is essentially the same as the regular take-screenshot tool, but because it will have a separate description, the agent can choose either this or that, and we can also tell it to behave slightly differently in each case. And we can also give it some additional actions to do, maybe even deterministic actions, just before invoking the original tool. So we created a new tool called the evidence tool, and let's take a look at what we did there.

All right. This time around we have v4 imported, as we would expect by this point. So we can take a look at v4 and see how it implements get tools. We go in, this opens up, and this looks very similar to before. Actually, everything looks similar: we have the tool wrapper, we have path validation. Very, very similar. What's the difference?
Well, we inject a new tool. Inside the tool wrapper we have what we've come to expect by now: the tool names, the tool descriptions, the dictionary, our tools to filter. Everything's the same. And then we have the same old wrap Playwright tools, which iterates over all the tools, filters what needs to be filtered, and creates any tools that we need to create with enhanced descriptions.

So this is what we used to have. We finish the loop, and then we have this part, injecting a brand new tool. We chose, and you don't have to go the same route, to say: because this is building on the screenshot tool, only create this new tool if the screenshot tool has not been filtered out. Totally optional; it's just what we chose to do in this particular use case. And so we have our new description for this new tool, and we create a new tool.

Let's take a look at the screenshot, I'm sorry, the description. It's a description as you might expect for this kind of tool: take a screenshot specifically for the purposes of evidence. And because the prompt, when it describes the flow, tells the agent that the flow ends with taking screenshots for evidence, the agent will probably know to choose this tool over the regular screenshot-taking tool. We tell it to use this only when capturing things for evidence, and we also specify how to go about doing evidence. For example, when it creates an image and wants to store it, we want it to identify the relevant ticket and include the ticket number in the file name being saved.

So you can see a nice example of how we're going to have two tools. The agent will know to differentiate when to use this one and when to use the other one, specifically in the context of evidence taking. And when it chooses that tool in the context of evidence taking, this will cause it to have different considerations when choosing the image name, for example. And you can do a bunch of other things: you can add
guardrails or deterministic actions that are specific to this tool, for example. So really, the sky's the limit, the world is your oyster, and knock yourselves out. So that was this one.

Moving on to our fifth and final point. Look at this, our server: all the fires are out, the systems are booting, it looks well. And still we have one final point, which we actually touched on in the beginning of the talk: treating tools as deterministic functions, or callable functions. Sometimes we get all these wonderful functions, all this code, from the people at these third-party teams, in our case the Playwright team. They gave us all this really nice code, and we can just call this code outside of the agentic flow. Just call it: disregard the description and just use the function they gave us.

Where do we use that? I don't know if you remember, but when we first looked at the code, we had this function called login to Baz, and we said we'll get back to that. This is us closing that loop. Let's take a look.

OK. Back at the top of our main.py file, we go down, and here is the imported v4. But now we're interested in something else, which is this part. When we just start out, we define the MCP client, we define this class that implements get tools, we get our tools, and the first thing we do is call this deterministic function called login to Baz. Only after we log in to Baz do we actually create an agent, and then we give it all the messages it needs and we invoke it.

So why is the login deterministic here? Well, it seems that login is kind of tricky. This is a toy example again, but in a real product we need to log into our system.
We need to log into client systems, and each client might have a different login mechanism. They can be tricky, they can have secrets, and a lot of things can make this kind of complicated. So on the one hand it's complicated, and on the other hand it is an action that we will always want to take: there is no agentic flow for spec reviewer that does not begin with a login. And so we do use tools we got from the MCP server in order to achieve this; we're just not letting the agent try to do it, because we saw it gave us subpar results. So for these very specific, niche use cases, sometimes you might want to take matters into your own hands.

Here, if we go into the function, there's not much going on. Again, this is a toy example; the real product behaves very differently. But here I just hid some JWT tokens as environment variables. This function accepts all the tools from the MCP server, we pluck the ones we want, and the gist of it is that we are going to inject these JWT tokens into the browser's local storage. Then, when we click the login button, we will just log in, like magic. So we just do this deterministically. Once we have logged in, we take the reins and give them to the agent. We say: you're logged in, you're off to the races. And that's how we unburden the agent from this somewhat clunky action it needs to take, which we can just take off its hands and not bother it and its context with.

Cool, so that simplifies things, at least for us.

Oh, I almost forgot, I still owe you one last thing. I think we should fire this thing up now with all the improvements and see where we land. We failed on the previous run, didn't we? OK, let's see how it works. So, going back to the familiar code base, our main loop, and we run this. We no longer need the breakpoint. I think I took it out.
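For reference, the deterministic login step described a moment ago could be sketched roughly like this. Everything specific here is an assumption for illustration: the tool names (`navigate`, `evaluate`, `click`), their keyword arguments, the `APP_JWT_TOKEN` environment variable, and the `auth_token` storage key are hypothetical, and the real function in the demo may look quite different.

```python
import json
import os
from typing import Any, Callable, Mapping

ToolFn = Callable[..., Any]


def login_deterministically(
    tools: Mapping[str, ToolFn],
    app_url: str,
    env: Mapping[str, str] = os.environ,
) -> None:
    """Log in by calling tool functions directly, with no agent involved."""
    # Pluck only the tools we need from everything the server gave us.
    navigate = tools["navigate"]    # hypothetical tool name
    evaluate = tools["evaluate"]    # hypothetical tool name
    click = tools["click"]          # hypothetical tool name

    token = env["APP_JWT_TOKEN"]    # hypothetical environment variable
    navigate(url=app_url)
    # json.dumps quotes and escapes the token so it is a valid JS string literal.
    script = f"() => localStorage.setItem('auth_token', {json.dumps(token)})"
    evaluate(function=script)
    click(element="Login button")


def start_session(
    tools: Mapping[str, ToolFn],
    app_url: str,
    make_agent: Callable[[list[ToolFn]], Any],
    env: Mapping[str, str] = os.environ,
) -> Any:
    """Log in first; only then create the agent and hand over the tools."""
    login_deterministically(tools, app_url, env)
    return make_agent(list(tools.values()))
```

The ordering is the point of the sketch: the agent is created only after the login has completed, so its context never has to carry the login steps or the secrets.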
Yeah, and this is just running. OK, so this is going to start by doing all the JWT magic in the background, and we now know how that works. And let's see how it goes. I'll take the time to remind you that we gave it a ticket and a design, and it needs to see this drawer. That's what it's actually doing now.

OK, so it kind of reloaded. I think maybe the JWT tokens have kicked in. It's going into our home page, called Changes, and it's probably going to do a little bit of reasoning, and it's going to need to find the way to navigate itself through the system to the agents tab. It doesn't always show it; it has a lot of latency and lag, at least from our experience. But it does take a screenshot as evidence. It finished. Let's see what it did.

So it says that this passed. It claims that the configuration is present and includes the sections that need to be there. It has a screenshot, and the screenshot has a file name that includes the ticket, like we said. It has reasoning. Let's take a look at the screenshot. So you see, the image we gave it was in dark mode and it was navigating in light mode, and it took it, and it looks great. It looks like it's done and it's correct.

Now, I will say that it did see a configuration drawer. It did see everything that's supposed to be in it. It might not be pixel perfect, but pixel-perfect verification is something that we worked very hard on in our real product. So outside of this toy example, pixel perfect does work, because it is super important for front-end validation: making sure the padding and the margins and all the styling are just as they need to be.

Amazing. So with that done, it's sad, but it's almost goodbye time. So I think this is where we start to get ready to wrap up. Look at our server, look at it. It's amazing. It's beautiful. It's working: green lights are blinking, engines roaring, GPUs churning, millions of spec reviewer runs being executed, and a great acceptance rate.
Thank you for all your help. Let's just do one final summary before we go our separate ways.

As we said, we looked at agentic tools today, whether they come from libraries, MCP servers, or any other place. We saw that sometimes they will fail out of the box. They might be very generic, they might not be tailored enough for our use case, and tailoring them is what's going to make our agents pop. Now, sometimes we wanted to curate the tools and ease the load on the context window. Sometimes we wanted long descriptions and more verbose tools. So there's no one size fits all; it's mainly a question of how do I mold the tools to best fit my use case. Sometimes they'll be deterministic, sometimes they'll be flexible. It depends, and you're going to need to tinker with it and make it your own, and strive for the best possible setup that you can to achieve your goals.

I hope this has given you a few pointers, things that you can try out. This has been amazing and great fun. I'm going to move up here now, because I have a shameless plug after I thank you for listening. And as always, feel free to reach out. I will see you guys in the next one. Cheers.