Shipping complex AI applications — Braintrust & Trainline

TL;DR

  • Many generative AI proofs-of-concept (POCs) fail to reach production due to a lack of operational rigor, as traditional software engineering's deterministic quality practices don't fully apply to non-deterministic AI systems.
  • Delivering quality AI at scale requires a new approach that blends quality checks for both deterministic software components and non-deterministic machine learning models.
  • Platforms like Braintrust provide AI observability and evaluation capabilities to industrialize AI applications, enabling teams to move quickly without compromising quality.

Takeaways

  • The primary hurdle for AI applications moving from POC to production is often not model sophistication but insufficient operational rigor and quality management for non-deterministic behavior.
  • AI systems, particularly agentic ones, combine deterministic software logic with non-deterministic model outputs, demanding a hybrid quality assurance framework.
  • Break down complex AI systems into modular, multi-stage processes (similar to microservices) to manage responsibilities and facilitate systematic improvements.
  • Establish a continuous "flywheel" development process: instrument and collect data, identify failure modes, remediate, ship, and monitor, iteratively improving the system.
  • Utilize "golden test sets" to systematically identify and evaluate failure modes, supplementing them with real-world production data to detect edge cases and complete the evaluation loop.
  • AI observability platforms are crucial for understanding system behavior at a granular level (e.g., tool call, token level) and debugging complex, multi-agent workflows in production.
  • Implementing robust evaluation allows for cost optimization (e.g., safely switching to cheaper or more efficient LLMs) and accelerates the shipping of new features with confidence in their performance.
  • Effective AI quality tools facilitate cross-functional collaboration by providing non-technical stakeholders with self-service insights into AI system performance and impact.

Vocabulary

  • LLM: Large Language Model; a type of AI model trained on vast amounts of text data to understand and generate human-like text.
  • POC: Proof of Concept; a small project to demonstrate the feasibility and potential of an idea or technology.
  • Generative AI: Artificial intelligence capable of generating new content, such as text, images, or code, rather than just analyzing existing data.
  • AI Observability: The practice of gaining deep visibility into the internal states and behaviors of AI systems in production to understand their performance, identify issues, and debug them.
  • Agentic system: An AI system designed to act autonomously, often by combining multiple AI models and tools to achieve complex goals.
  • Tool calling: The ability of an LLM or AI agent to interact with external tools, APIs, or functions to retrieve information or perform actions.
  • Golden test set: A curated, high-quality dataset used as a benchmark to evaluate the performance and identify failure modes of an AI system, especially after changes.
  • Flywheel: In AI development, a continuous iterative process of building, evaluating, shipping, and monitoring AI systems to drive continuous improvement and value.
  • Offline evaluation: Testing and evaluating an AI model or system using historical data before deploying it to production.
  • Online evaluation: Continuously monitoring and evaluating an AI model or system using real-time data from production, often through A/B tests or user feedback.

Transcript

Good afternoon, everybody. Welcome to sunny London. Is this everyone's first time here? I think it is, because this is the first conference. Amazing, amazing. Well, thank you very much for joining today's session. Hopefully you are in the right session, but for those who need to double check, this is a hands-on workshop on delivering quality AI applications with Braintrust. We'll also be partnering with our colleagues at Trainline, who I'll introduce shortly. So you're probably wondering, who's this guy, he doesn't even go here, what's his name? A quick introduction to myself: you can call me Durand, like the band Duran Duran but with a G, so hopefully folks are Simon Le Bon fans. I've played a part in helping organizations and enterprises scale adoption of mission-critical systems, and I'm now moving into the age of AI. I have a background in mathematics, so I know it's all the rage going from machine learning to data science and now AI engineering, and this comes at a very topical time. Feel free to connect with me on LinkedIn if you want to stalk me or just have a general chat around this area. I'm also joined by my two friends over from Trainline, if you want to come over and introduce yourselves.
Of course, if the mic is working. Oh, good. Yes. Hello, everyone. Thanks for coming to this workshop, especially after that lunch break and the sunny weather outside. Thank you very much for coming here. So my name is Osama. I'm a senior AI engineer at Trainline. For those who know me, yeah, one person. I was a staff platform engineer before, I have a background in computer vision, and I was also doing mobile apps on the side. So nothing to do with AI at some point, but now we are doing AI together with Mike at Trainline.
Hi, everyone. My name is Mike. I'm a senior AI engineer at Trainline. I have a research background in LLMs, but my research was in the era before the big LLMs, so if you've heard of BERT or the older LLMs.
So we're continuing to build state-of-the-art agentic products at Trainline. If you've come here from abroad, I'm sure you must have used Trainline; if not, please do, because it's state of the art for buying tickets and other things. So yeah, welcome, we're very excited to host you at this workshop, and we'll guide you through the hands-on experience.
Fantastic. I'm also joined by some of my colleagues from Braintrust who have come over the pond. So Phil, Eric and Rose, if you could just shoot your hand up. So don't worry folks, you're in safe hands if you're stuck. Just holler and they can help. Just a little bit of housekeeping as well. If everybody could join the AI Engineer Slack: there's an AI Engineer organization, but there's also a specific Slack channel which we'll be using today to help progress the workshop. So if you are stuck, we can use the Slack channel to help each other out. We're already on there, and as we progress to the hands-on element, this will really help if you get stuck. We're also providing cheat sheets. So if you encounter a particular hurdle, there are step-by-step instructions which help you get to where we need to be in the workshop without feeling like you're falling behind. All the assets which we share today are publicly available, and we can do follow-ups if needed. We'll give you a few minutes to get onto Slack and join the channel AI Engineer Europe 2026 Braintrust-Workshop. It's a public-facing channel. All right, team. I guess we can start. Yep, just for the people joining at the back, you'll need to join the Slack channel. With that said, thank you everybody. Let's proceed. Okay, so to help orient today's workshop, we'll be breaking it into three main sections.
We'll spend a little time setting the scene, establishing the background on why we're here and why this workshop is relevant. Hopefully that sets the context for when we go into the workshop, building the system and talking about how we ship quality AI, and then we'll wrap up with the key takeaways and be around to answer any open questions from the floor. Okay, so hopefully today there are people from very different backgrounds. We intended this workshop to cater for that. Probably everyone here knows what an LLM is, I don't want to insult anybody, but hopefully you're starting to explore your journey in terms of maturing your operations for building AI systems. So whether you're an AI product engineer, perhaps on an applied team, or come from a traditional machine learning background, or maybe you're a platform operator in infrastructure, this is really appropriate for you. Just a show of hands here: who comes from, let's say, a traditional data science background? Interesting, interesting. Who here comes from software engineering and is then pivoting into AI? Okay, so this is definitely the right room for you. Hopefully we'll be able to accelerate things as we progress. Okay, just to set the context here, I think this is not an uncommon expression, which we are seeing more broadly across the industry. By show of hands, who here has done a machine learning or AI POC, let's say more specifically a generative AI POC, but then failed to take it into production? Okay, there are a few hands, yeah. I'd be worried if there weren't. But this is a key thing that comes up when I speak to my customers: executives, top-level folks, it's all the rage, they're thinking about this new technology.
This isn't very new, but it's new for a lot of folks, especially in more enterprise and regulated industries, and they're trying to use it to deliver value to their customers. But unfortunately, there's a big hurdle between taking what you might develop locally on your machine and industrializing it and making sure it works in anger. The key thing we've seen from all of the research out there is that it's not that the models aren't smart enough. We've got very sophisticated models, whether you're building something in-house or using an off-the-shelf model from the top commercial providers out there. What we do see, more broadly speaking, is that the operational rigor needed to deliver these systems at scale has not kept up, because traditional software engineering is very deterministic: one plus one equals two, great. And with AI systems, as my ten-year-old, or rather a five-year-old, would say, "two plus three equals ten, daddy." So we're having to adjust and make sure that we're delivering that at scale. And once again, what we see when shipping these things is that people think the demo state is suitable. But clearly, doing two, three, five demos is great; put it in production and everything goes awry. Another is treating your logs as observability, and this is really critical to how we look at it at Braintrust: logs will tell you what has happened, but sometimes you need to go deep into the system and understand its behavior. This is really where observability comes into play. Something else is: it works on my machine, it fails in production, I patch the prompt, and then it's operational until the next issue happens and the next failure mode. But how do we keep track of that? Especially if you don't have a system in place, irrespective of what tooling you use, this can have a catastrophic effect.
So again, a lot of what we see is not to do with the tooling or technology; it's down to operational workflows. And this is really what we're aiming to help you with in today's workshop. Okay? So as I mentioned, it's not the prototype, it's getting to a state where we know exactly what's changed in the system, how we interact with that, and how we systematically put rigor in place so that we can get better and better. Remember, our target is not 100% coverage; it's getting as close as possible while fixing the gaps that might exist. Something that we see time and time again is that one prompt might work, but as you move to industrialize it, you probably want to break it down into individual sectors of responsibility. If you come from a software engineering background, you know about this: you break the monolith down into microservices. We'll outline a very similar approach here when we talk about building these systems. Making sure that we understand these changes and putting a set of systems in place is really what we want to do in today's workshop. As I mentioned, this is a hands-on workshop, so we're going to be going into the terminal, going into the UI, and going through it step by step with guidance along the way. We'll be building a staged AI system with multi-stage tool calling, which allows us to see a more, let's say, agentic flow. We'll use Braintrust to instrument it and see how the application is performing. We then want to look at identifying the failure modes using a golden set, so we'll push that through. We also want to talk about how we industrialize it.
So that's moving from "hey, it just works on my machine" to something that you can use in production and have managed going forward with a system in place. And a key thing is identifying those edge cases, because you can create a test dataset, but ultimately there's no substitute for real-world data. So we'll show you how we take those real signals in and evaluate them to complete the loop. Just to finish the introduction here today: who here has heard of Braintrust or played around with Braintrust? Can I get a show of hands? Okay, great, fantastic. Love to see the hands. So just a bit of an introduction to Braintrust. We're a company that's, I think, just shy of three years old. We're a Series B company, which we announced a few months ago: we raised $80 million at an $800 million valuation. We have investors such as ICONIQ, a16z, and Greylock, and we're all about helping organizations ship quality AI at scale. So we're the platform for AI observability. We've got a heavy user base globally, and we're also expanding our presence heavily in Europe. I'm one of the first engineers to join to help build out our go-to-market function here. We're very excited to do that with our customers and our friends over at Trainline. Some of our local customers, like Lovable and Doctolib, are really pushing the forefront of these AI systems. I want to get hands-on with Braintrust shortly, but where we really distinguish ourselves is in being able to do this at scale. Our founder, Ankur Goyal, this is actually not his first time building Braintrust. He's an expert when it comes to database systems. He built a company called Impira, acquired by Figma, that focused on document extraction, and he used to lead the machine learning team there.
And he realized that building these evaluations is hard and understanding production traces is hard. So: "we're having this issue, and I'm sure there are other organizations out there which are having the same." And so he founded Braintrust to really help do this at scale. And as these traces come in, being highly semi-structured data which changes, he realized that traditional analytical systems weren't fit for purpose. So we've created a new category of database called Brainstore, which really helps identify and accelerate this at scale. We also aim to be platform agnostic. So irrespective of the agent framework you're using or the LLM providers out there, we intend to help you deliver value. One thing I will talk about as we progress in this workshop is the concept of a flywheel. If you come from agile development, you know that perfection is the enemy of good. We want to start somewhere. So even if it's a new application and you don't know how it's going to behave in production, we can start off with an evaluation set. If you have an existing application that you're instrumenting, that's great: we can pull that information in and identify the failure modes. So the key thing is to get information into the system, identify those failure modes, remediate, ship it out, and then monitor and complete the flywheel again and again. All right then. You've heard a lot from me. So I'm going to invite my colleagues over at Trainline to share their experiences prior to Braintrust and how it's helping out. So Osama, I'll give it to you.
Thank you very much. And hello again to the people who are joining us just now. As Mike introduced, Trainline is a company that helps people get on trains. Trains are different from planes, if you don't know.
There is a worldwide central system for all planes around the world. That's not the case for trains. And in Europe and the UK it's very hard, I would say: if you wanted to install an app for each carrier, it would take up the whole space on your mobile phone for sure. So Trainline is that platform to help you book tickets: platform agnostic, carrier agnostic. You do it in one app, and you can book a train from Paris to London, from Lyon to Milano, whenever you like, with every carrier in the EU. We sell almost 6.3 billion in tickets on the trains, with 27 million active customers and counting. And the other interesting number, probably, for this conference is how many AI conversations we have with our travel assistant. So we do have a travel assistant that is exposed to people. And it's not just a chatbot. It's actually a multi-agent system that can handle refunds for you and can handle changing trains for you. So it is a very, I would say, proactive agentic system and not just a chatbot. Probably Mike would like to add more about that.
So what are the benefits of having 27 million active customers? You've got a huge space for how you can serve agentic applications live to customers. One example of that, as Osama talked about, is the travel assistant, which you can get to from the ticket window in the application. It's an agentic system, which is something we want to talk about a little bit later in the slides.
Awesome. Which brings us to the next point. I will keep it short. Selling train tickets and being a train ticket company, how come we are doing machine learning? Of course we can. There are so many things to do to help people with their journeys: getting their tickets, getting their trains, getting back home. We do two things. There's the classic ML part, which is actually building models. We do that.
We build ML models inside Trainline from scratch, from data to model. This is something we do. And we also do the multi-agent, generative AI systems that we are now familiar with, on top of the LLMs that we love and cherish, with the tooling and context engineering and all of that. So we do both sides of the story at Trainline. These are two examples of what users are actually using on top of those systems. On the left side, you can think of it as your weather application, but for train disruptions. So basically you have a ticket for a train. You get there. We know whether this train will be disrupted or not, whether it will probably be late or not. We know that based on the huge data that we have; on top of it, a machine learning model was trained that can actually predict train disruptions, lateness and all of that. So this is the classic ML part. The other one is the travel assistant that I told you about. As I said, it's a very, I would say, advanced multi-agent system. It can show you alternative trains if your train is canceled or something is wrong with it. Good luck doing that yourself, even with ChatGPT. And the other one is handling refunds. So you can actually ask for a refund on your ticket if your train is late, and it can give you all of that without a handover. It can also do the handover to actual human customer support in our lovely customer support team. Which means that if you are doing this at production level and at Trainline scale, you can ask the question: are we breaking things at Trainline? Of course, we don't want to do that. And we are moving fast, because the technology is definitely moving fast. And this is why we are here. We are here to show you what we are doing to move fast without breaking things in terms of AI. And we can do that on top of handling the complex software systems that we love.
And cherish, from APIs at scale to serving millions of users. So how we like to think about it is this scale. We know for sure that whatever we have as software systems, this is the deterministic side of the story. And on the other side, building the ML models, this is the non-deterministic side of the story. And we know for sure that the agentic systems are in between. There are parts of them that are deterministic. There are parts of them that are definitely not deterministic. And this is the framework of thinking that we have. In terms of quality, how are we handling that? That's the question. For the ML models, we do care about the quality of the data that we use for training the models. But on top of that we also have, of course, machine learning evaluations, whether offline or online. Offline means that before going to production, you need to do your evaluations. And online means you get the data from production and evaluate whether your model is doing well or not. In our case, for instance, for that forecast: is it predicting the state of the train, whether it's disrupted or not? Is the model correct in its prediction? That's one example. On the other hand, for people who are familiar with software engineering, we do all the quality checks, the diagrams, and the tooling that we have for handling quality at scale, a very large scale at Trainline. And for those agentic systems, I think you already guessed it, it's basically a combination of both. It's not one without the other, for sure. We do everything that we do quality-wise for deterministic systems, but we also use the evaluation side from the non-deterministic systems. So yeah, that's the framework of thinking that we have at Trainline about these systems. And definitely, Braintrust is helping with that. And by the way, a disclaimer: we are here just because we are convinced that Braintrust works for us. We are not paid. So that's 100%.
We are happy customers. We have been with Braintrust for a long time, and we use some of Braintrust's features. So, for instance, for the AI evaluations, we do that. We follow the scoring of the travel assistant on many levels, from tone of voice to actual helpfulness when it comes to tickets. And tickets are really, I would say, complex in terms of reasoning about what you should get and what you did not get, depending on whether the train is late or not, and whether the type of ticket is a return or an advance. There are so many complex cases. But yeah, we do follow that evaluation side. But Mike, would you like to add more about that?
Yeah. So can I get a raise of hands from people who have struggled with LLM costs, number of tokens, switching models, problems like this? So yeah, it's nice to see. It was a problem at Trainline as well, because we do this at scale, and the OpenAI and Anthropic bills that we pay reflect that. So we have to keep switching models: finding the best model for a use case, cheaper models, models that are more token-efficient. Now, anytime you want to switch models, you want to make sure that the new one is performing at least at the same level as your current model, right? Before Braintrust, we had no specific way of doing that, because we didn't have scorers set up, we didn't have evaluations. What Braintrust has enabled us to do is simulate what the performance of the cheaper model would look like. We've used Braintrust extensively to run offline evaluation to see what the effect is going to be, and also online evaluation to check that the effect we observed offline is as expected. So that's, I think, one of the use cases.
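The model-switching check described above can be sketched in a few lines. This is a minimal illustration, not Trainline's or Braintrust's code: the types, the `exactMatch` scorer, the `safeToSwitch` helper, and the two stub "models" are all hypothetical, and real LLM calls are replaced by plain functions so the example runs offline.

```typescript
// Hypothetical sketch: every name below is an assumption made for illustration.
type GoldenCase = { input: string; expected: string };
type Model = (input: string) => string;

// Exact-match scorer: 1 when the output equals the expected label, else 0.
function exactMatch(output: string, expected: string): number {
  return output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Mean score of a model over the golden set.
function meanScore(model: Model, cases: GoldenCase[]): number {
  if (cases.length === 0) throw new Error("golden set is empty");
  const total = cases.reduce((sum, c) => sum + exactMatch(model(c.input), c.expected), 0);
  return total / cases.length;
}

// Accept the candidate only if it scores no worse than the baseline minus a tolerance.
function safeToSwitch(baseline: Model, candidate: Model, cases: GoldenCase[], tolerance = 0): boolean {
  return meanScore(candidate, cases) >= meanScore(baseline, cases) - tolerance;
}

// A tiny golden set of intent labels (invented examples).
const goldenSet: GoldenCase[] = [
  { input: "My train was cancelled, can I get my money back?", expected: "refund" },
  { input: "I need to travel on a later train", expected: "change" },
  { input: "Where is my e-ticket?", expected: "ticket_help" },
];

// Stub models standing in for the current model and a cheaper candidate.
const baselineModel: Model = (input) =>
  /money back/.test(input) ? "refund" : /later train/.test(input) ? "change" : "ticket_help";
const cheaperModel: Model = (input) => (/money back/.test(input) ? "refund" : "ticket_help");
```

With these stubs the baseline scores 1 and the cheaper candidate scores 2/3, so `safeToSwitch(baselineModel, cheaperModel, goldenSet)` returns `false`. In practice the scorer would be richer than exact match (tone of voice and helpfulness are mentioned above), and the same comparison would be repeated online against production traffic.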
The other one is more generic: it would have taken us a lot longer to evaluate a new feature shipping into the travel assistant, but with Braintrust we have been able to assure ourselves that the new experience for users is going to be good. So Braintrust has helped us a lot in shipping fast.
Cool. And the other part is observability, for sure. We do use Braintrust for observability, if this one is working. Yes. So yeah, this is a true example from Braintrust. We are basically spoiling the workshop and whatever you are going to see there. But yeah, we can track everything we are doing, tool calling, the other agents, in terms of numbers and in terms of quality. Those kinds of insights are what you need to get from a proof of concept to an actual system in production in your company, something that's actually being used out there and that users find to be, I would say, a truly working product and not just any AI-generated slop. We definitely need that for those cases. Would you like to add something here?
I was just going to say that Braintrust enables you to look inside complex agentic workflows down to the tool call level and token level, which is very insightful and helps you debug a lot of things in production and before you deploy.
And one last thing, probably, is the cross-functional-friendly point. That's something that we discovered along the way, because we are building those systems, from travel assistants to models. We are a big company, and we have people from product and people with non-technical backgrounds; we need to communicate and share many things, and they also need to self-serve some of those things. We cannot babysit people and say, "hey, you should do this, let me get you the logs, download them and send them." That does not work.
At scale, we need a way to work cross-functionally and let people be, I would say, free to self-serve with data and insights. So this is what we have also discovered along the way, and Braintrust has also helped with many requests that we've had, which we really appreciate for sure. And of course there's more; we are using just a part of that system. We have our own things too, and Braintrust, I think you are building even more things. So definitely more to come, and I think the workshop will help you discover all those things. And of course, have fun building during the workshop, laptops out, and if you are interested, Trainline is hiring in AI engineering, of course. And I'll give it back to you, yeah?
Perfect. Well, thank you so much for that, team. And just to say, a personal thank you from me and, by extension, my wife: thank you so much for the SplitSave functionality. Kudos for that. All right then, team, let's proceed with the setup. So again, this is going to be a hands-on workshop. Better dust off your bash skills, team. Just joking, I've got a nice wrapper command to help out with that. So hopefully everybody, or at least most folks, have joined the Slack organization for AI Engineer and then also joined the channel. I've put a link there, and there's also a QR code to the repository which we'll be using. So I'll give it a few minutes. The really key thing is signing up for a free Braintrust account. If you are an existing Braintrust user and you use Gmail, you can use the little plus-sign trick to create a separate account, so hopefully it doesn't pollute your existing one. You will also need access to an OpenAI API key, which should be easy for you to generate. If for some reason you're unable to generate a key, then please let my colleagues know and we will DM you one on Slack to use for this specific session today.
And obviously, as this is an engineering conference, and especially in AI, if you are using AI coding assistants, feel free to use them in the IDE or the terminal to guide you along the way, ask questions about the codebase and so forth. I've tried to simplify a lot of the scaffolding. I'm using mise to manage Node and pnpm on the machine, but alternatively you can download Node v22 at the specific version listed there. It should be fine, but I just want to caveat that, especially for this workshop, I've pinned that specific, let's say, runtime. I'm using make as syntactic sugar to wrap the commands. If for some reason you don't have make installed, for example if you're on a Windows machine, you can just run the pnpm commands, which are just wrappers for the scripts in the package.json file. So yeah, we'll give it a few minutes for folks to scan and clone. It's just worth pointing out as well that I have a cheat sheet today. I can see lots of folks have already gone through it, but yeah, we provide a step-by-step guide. So as mentioned, as we proceed with the workshop today, if you are stuck, please ask questions on the Slack channel, but you can also refer to this. As mentioned, this will stay up. It's a public-facing asset, so even if you're stuck outside of this workshop, you can go through it at your own leisure. Maybe just a show of hands: who here is not able to get an OpenAI API key? Hopefully folks are able to provision that. It's really critical to this. Yeah.
Some people who joined late don't have the QR code for the Slack.
I can do that. Okay folks, I think we did have a few latecomers, so I don't want to spend too long on this, but if you did join late, please scan that QR code, which will give you access to the AI Engineer Slack organization, and then join the channel, AI Engineer Europe 2026 Braintrust Workshop. It's a public-facing channel.
Through that channel, you'll be able to get access to the pinned content: the repository as well as a cheat sheet, which we'll be following. Just a bit of a sense check, folks. Hopefully most of you have been able to join the Slack channel, clone the repository and get access to that. All right then. As I mentioned at the start of the agenda, what we'll be doing is creating an example, let's say, support triage agent, so hopefully this exemplifies a lot of systems that folks have already built or are trying to build. It is a fictitious application designed specifically for the workshop. Please do not use it in production. It's really just there to teach us how we build these AI patterns for scale. So the idea is that, given a ticket that might be raised in a particular system, we have a set of agents that goes through, I won't say a pipeline, but a staged process with tool calling that produces a set of information that can be emitted into downstream systems. Just to help visualize what the system does: as mentioned, we take the ticket as input. We have a first step which is collecting the context. It's quite a deterministic way of extracting information. We then proceed with the portion that we've broken down into three stages. We have an LLM with tool calls to triage that particular issue. We then run a policy reviewer to make sure the output is correct. We then draft, let's say, a customer-facing reply with a reply writer. And then we package this together. Depending on how severe this particular ticket is, we can invoke another tool call to decide whether we need to escalate it to, let's say, a human in the loop, and then draft the final result. So again, not too complex, but it gives you an idea of the system that we'll be trying to build and operationalize in this session.
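To make the staged flow above concrete, here is a rough sketch of its shape. It is not the workshop repository's code: every type and function name is hypothetical, and the stages that would call an LLM with tools are replaced by deterministic stubs so the control flow (collect context, triage, policy review, reply, package and escalate) is easy to follow and runs offline.

```typescript
// Hypothetical sketch: names and rules are assumptions, and model-backed stages are stubbed.
type Ticket = { id: string; subject: string; body: string };
type Context = { ticket: Ticket; mentionsOutage: boolean };
type Triage = { category: "billing" | "outage" | "general"; severity: "low" | "high" };
type Result = { ticketId: string; triage: Triage; policyOk: boolean; reply: string; escalated: boolean };

// Stage 1: deterministic context collection (no model involved).
function collectContext(ticket: Ticket): Context {
  const text = `${ticket.subject} ${ticket.body}`.toLowerCase();
  return { ticket, mentionsOutage: /outage|down/.test(text) };
}

// Stage 2: triage. In the talk this is an LLM with tool calls; here it is a stub.
function triage(ctx: Context): Triage {
  if (ctx.mentionsOutage) return { category: "outage", severity: "high" };
  const isBilling = /invoice/.test(ctx.ticket.body.toLowerCase());
  return { category: isBilling ? "billing" : "general", severity: "low" };
}

// Stage 3: policy review checks the triage output against a simple rule.
function policyReview(t: Triage): boolean {
  return !(t.category === "outage" && t.severity === "low");
}

// Stage 4: draft a customer-facing reply.
function writeReply(ctx: Context, t: Triage): string {
  return `Hi, thanks for contacting us about "${ctx.ticket.subject}". We have logged this as a ${t.category} issue.`;
}

// Stage 5: package the result and escalate to a human for severe or policy-failing tickets.
function runPipeline(ticket: Ticket): Result {
  const ctx = collectContext(ticket);
  const t = triage(ctx);
  const policyOk = policyReview(t);
  const reply = writeReply(ctx, t);
  const escalated = t.severity === "high" || !policyOk;
  return { ticketId: ticket.id, triage: t, policyOk, reply, escalated };
}

const outageTicket: Ticket = { id: "T-1", subject: "Service outage", body: "Everything is down since 9am" };
const billingTicket: Ticket = { id: "T-2", subject: "Invoice question", body: "My invoice shows the wrong amount" };
```

Keeping each stage as its own function mirrors the monolith-to-microservices point made earlier: each stage can be traced, scored, and improved independently of the others.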
One thing to point out as we begin to use Braintrust is where it fits into helping you deploy and manage this at scale. I've isolated that here. Everything we're going to be doing, especially towards the end as we progress, will be traced end to end, as my colleagues talked about. We will use managed mode for prompts, so offloading what you do locally into a secure environment. Tool calls will be managed as well, which is where we talk about getting into external systems. And then there are the evaluations and scores, which will be run through Braintrust infrastructure. It's worth pointing out as well that if you go into the GitHub repository, there's a GitHub Pages version of this, so the slides are readily available for you to use and consume.
A key thing as well is that I've tried to help by checkpointing each phase. If you are stuck, the idea is that you just git checkout a specific tag or branch, and each individual tag or branch is fully runnable at that stage. So if you're ever stuck, git checkout that branch, run make setup, run the commands, and you should be able to get near-identical output every time. This is all documented in the readme, and it's also included in the cheat sheet.
Okay, to give you the sequence of events: we'll be doing the scaffolding and setup, and then I'll talk about building a basic agent. This workshop isn't really designed to be about building agents. It's about how you operationalize an agent once you have one. But for completeness I want to do it step by step, and then we'll get into the core bit where we add the tracing and talk about the evaluations, that golden set, and identifying where we can improve.
And then we tie it all together, using managed infrastructure within Braintrust to help operationalize this, and again talk about the collaboration that my colleagues from Trainline talked about. The other key thing is that once you do identify a production failure, it's about how we apply a fix, complete that flywheel, and give you a finalized asset.
Okay, so step one: we're going to be talking about building the agent. In this checkpoint we need to start somewhere, right? So what do we do? We make an initial call to an LLM. We set up a prompt, one shot, one in, one out, and get an output. If you're doing a proof of concept, this might be great as part of the initial spike, but we know there's work to be done. As I mentioned, just because it works in the demo doesn't mean it's going to work in production. I've put some pseudocode here just to articulate this. As with all these language models, we provide a system prompt, we provide the user message with the ticket text, and we pass the output back into our application. So: one function, one model call, and the structured output that we want to receive.
When we do the scaffolding and check out the first branch, we get a set of sample tickets that are already defined in code, or you can use the command line. I'll demonstrate shortly what that looks like, and you'll get output similar to this in JSON format. Hopefully everyone can see this at the moment. I have the application here, so I'm just going to check out the basic tag. If I take a look here, I'm now checked out at that first tag, building the basic agent.
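To make the "one function, one model call, structured output" idea concrete, here is a minimal TypeScript sketch. It is not the workshop's code: the category list, the `TriageResult` shape, the prompt text, and the `triageTicket` helper are all assumptions for illustration, and the model is an injected stub rather than a real provider SDK call.

```typescript
// Hypothetical names throughout; none of these come from the workshop repository.
type Message = { role: "system" | "user"; content: string };
type ModelCall = (messages: Message[]) => string;

const CATEGORIES = ["billing", "account", "technical", "other"] as const;
type Category = (typeof CATEGORIES)[number];

interface TriageResult {
  category: Category;
  severity: "low" | "medium" | "high";
  reply: string;
}

const SYSTEM_PROMPT =
  "You are a support triage assistant. Respond with JSON containing " +
  "category, severity and reply.";

// One function, one model call: build the messages, call the model once,
// then validate the structured output before handing it to the application.
function triageTicket(ticketText: string, callModel: ModelCall): TriageResult {
  const messages: Message[] = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: ticketText },
  ];
  const parsed = JSON.parse(callModel(messages)) as Partial<TriageResult>;

  if (!parsed.category || !CATEGORIES.includes(parsed.category)) {
    throw new Error(`Unexpected category: ${String(parsed.category)}`);
  }
  if (!parsed.severity || !["low", "medium", "high"].includes(parsed.severity)) {
    throw new Error(`Unexpected severity: ${String(parsed.severity)}`);
  }
  if (typeof parsed.reply !== "string") {
    throw new Error("Missing reply");
  }
  return { category: parsed.category, severity: parsed.severity, reply: parsed.reply };
}

// Stub standing in for a real provider call, so the sketch runs offline.
const stubModel: ModelCall = () =>
  JSON.stringify({
    category: "account",
    severity: "low",
    reply: "We have sent you a password reset link.",
  });

console.log(triageTicket("My password needs to be reset.", stubModel));
```

Validating the parsed output at the boundary is one possible design choice: it turns a malformed model response into an explicit error instead of letting it leak into downstream systems.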
I've got the application here, which is very similar to the pseudocode. We're creating a client and calling the OpenAI SDK. For brevity I didn't use any agent SDKs, but we fully support those as well. This is readily available, and you'll see from the Makefile that if you don't have make installed, you can just execute the pnpm scripts. If you use npm or Yarn, that's also possible. I haven't tested it, but the expectation is that it will work, because it's just using package.json. So in this case, let's say you want to do something like make ticket. I give it a message, say, "my password needs to be reset". I've provided some defaults, so enter, enter, enter. And now it's making a call to OpenAI. You'll see from the environment variable file that I'm using GPT-5 Mini. You can switch it if you want, but for the purpose of this we'll keep it simple.
You can see here I've got the ticket, and it's provided some output as well. But that's a single shot. Based on this output, it looks fairly plausible, but it's not going to account for a lot of the edge cases that we want, especially if there's a lot of nuance to the organization which we're trying to build into the logic. Then there's make demo. If you want to script it, it's the same thing: I'm calling the same function, but I've got the tickets codified as JSON, so the JSON fields are available to see. That's fairly straightforward, so I won't dwell on it too much.
The next thing I want to do is talk about adding local tools. We probably want to say, look, let's try to make this a bit more deterministic. Even though a prompt might be very well structured, I may want to bring in different ways of informing how it operates.
So in this case, I'm calling three different tools. One looks up relevant help desk articles, both internal and external. I may also want to look at certain things that have happened to the account. Say I'm a customer and I've done a certain migration, and that might have had an impact on backend systems. And there's probably a reason where I want to create an escalation. So what I've done is make it a bit more deterministic. For the purposes of this workshop, the tools are just treated as code, but in reality you are probably going to be interfacing with external systems such as a vector search, MCP, a CLI, and all sorts of other interfaces to build out this capability. The key thing here is that the more things you add, the more ways you can fail. This is why tracing will become more important as we go.
So, back to the readme. The next thing we'll do is go to local tools. The tools that I created are available here, but again they're just checked in as code for simplicity's sake. And I can do the same thing where I run make ticket and say "password needs reset, account locked". Okay, so it's provided a little bit more information. You can see it's a little bit more verbose because we've introduced tool calls and given more context to the LLM.
Also a point here: if you are feeling stuck, folks, the tags of the workshop branches are built sequentially. So if you go to, let's say, git tag number six, it's going to include everything up to that point. Don't feel like you have to go through each one. If you are feeling stuck and you want to skip, you can do that as well. Okay, so that's tools. I've already shown the code anyway, but this is the idea in pseudocode of what it would look like. Next, stages.
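The three tools described above could be sketched as plain local functions, in the same "checked in as code" spirit as the workshop. The function names, data shapes, sample records, and the `support-tier-2` queue are made up for illustration; in a real system these would more likely wrap a vector search, an MCP server, or an internal API.

```typescript
// Hypothetical tool names and sample data; the workshop repository may differ.
interface Article {
  id: string;
  title: string;
  audience: "internal" | "external";
  keywords: string[];
}
interface AccountEvent { accountId: string; kind: string; at: string }
interface Escalation { ticketId: string; reason: string; queue: string }

const ARTICLES: Article[] = [
  { id: "kb-1", title: "Resetting your password", audience: "external", keywords: ["password", "reset"] },
  { id: "kb-2", title: "Unlocking accounts after failed logins", audience: "internal", keywords: ["locked", "login"] },
];

const EVENTS: AccountEvent[] = [
  { accountId: "acct-1", kind: "migration", at: "2026-01-10" },
];

// Tool 1: keyword lookup over help desk articles (internal and external).
function searchHelpArticles(query: string): Article[] {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  return ARTICLES.filter((a) => a.keywords.some((k) => words.includes(k)));
}

// Tool 2: recent events on the account, such as a migration.
function getAccountEvents(accountId: string): AccountEvent[] {
  return EVENTS.filter((e) => e.accountId === accountId);
}

// Tool 3: create an escalation record for a human to pick up.
function createEscalation(ticketId: string, reason: string): Escalation {
  return { ticketId, reason, queue: "support-tier-2" };
}

// Gather extra context for the model before it triages the ticket.
function gatherContext(ticketText: string, accountId: string) {
  return {
    articles: searchHelpArticles(ticketText),
    events: getAccountEvents(accountId),
  };
}

const toolContext = gatherContext("Password needs reset, account locked", "acct-1");
console.log(JSON.stringify(toolContext, null, 2));
console.log(createEscalation("t-1", "Account locked after migration"));
```

Each tool is one more place where a lookup can return nothing or the wrong thing, which is the failure surface the talk says tracing needs to cover.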
So this is the next thing: we're breaking down that monolithic call to the LLM. We've introduced tools, and the next step is drilling down even further into specialist stages for how the LLM should behave. You will have seen from that sequence diagram that I've set up effectively five stages: collecting the context, triaging, determining whether the output meets brand policies, providing a customer-friendly reply along with something internal for our systems, and then finalizing the result for downstream systems. This comes straight from traditional software engineering. You break down your problem, you can see exactly where something is going wrong in the stack, and you're able to remediate it. So where possible, try to be more explicit and break things down into challenges that you can work on. Here's a bit of pseudocode for what that would look like. You'd probably put a debugger on it in the IDE and take a look.
Okay, so I've been doing this in part already. We started with the starting point, and now I'm going to talk about the specialist stages. Let's take a look at the next part of the readme, specialist stages. If I take a look now, the source folder has the individual functions pieced out, along with the prompts associated with them. Let's say here's the triage prompt that we're using. If I go down to the application, you can see where it starts. I'm using asynchronous functions to execute.
And similarly, if I run make ticket, maybe I'll do something different. What's another classic problem? Something like: "I need to upgrade my plan from Pro to Enterprise, but the website is not working and I'm getting an error." In this case I pick customer tier two, we're talking about billing here, and for the account I pick the third option.
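The five-stage breakdown described above might be sketched as follows. This is an illustration under assumptions, not the workshop's implementation: the stage names, types, and escalation rule are invented, the model is an injected stub, and the functions are synchronous for brevity where the talk's application uses asynchronous ones.

```typescript
// Hypothetical stage names and shapes, loosely following the five stages
// described in the talk. The model is an injected stub, not a real SDK call.
type Llm = (stage: string, input: string) => string;

interface Ticket { id: string; text: string; customerTier: number }
interface Context { ticket: Ticket; mentionsBilling: boolean }
interface Triage { category: string; severity: "low" | "high" }
interface Review { approved: boolean; notes: string }
interface FinalResult {
  ticketId: string;
  category: string;
  severity: string;
  policyApproved: boolean;
  customerReply: string;
  escalate: boolean;
}

// Stage 1: deterministic context collection, no model involved.
function collectContext(ticket: Ticket): Context {
  return { ticket, mentionsBilling: /invoice|billing|plan|upgrade/i.test(ticket.text) };
}

// Stage 2: triage the issue.
function triage(ctx: Context, llm: Llm): Triage {
  return JSON.parse(llm("triage", JSON.stringify(ctx))) as Triage;
}

// Stage 3: check the triage decision against policy.
function policyReview(decision: Triage, llm: Llm): Review {
  return JSON.parse(llm("policy_review", JSON.stringify(decision))) as Review;
}

// Stage 4: draft the customer-facing reply.
function writeReply(ctx: Context, decision: Triage, llm: Llm): string {
  return llm("reply_writer", JSON.stringify({ text: ctx.ticket.text, decision }));
}

// Stage 5: package the result and decide whether a human needs to step in.
function finalize(ctx: Context, decision: Triage, review: Review, reply: string): FinalResult {
  return {
    ticketId: ctx.ticket.id,
    category: decision.category,
    severity: decision.severity,
    policyApproved: review.approved,
    customerReply: reply,
    escalate: decision.severity === "high" || !review.approved,
  };
}

function runPipeline(ticket: Ticket, llm: Llm): FinalResult {
  const ctx = collectContext(ticket);
  const decision = triage(ctx, llm);
  const review = policyReview(decision, llm);
  const reply = writeReply(ctx, decision, llm);
  return finalize(ctx, decision, review, reply);
}

// Stub model with canned responses per stage.
const stubLlm: Llm = (stage) => {
  if (stage === "triage") return JSON.stringify({ category: "billing", severity: "high" });
  if (stage === "policy_review") return JSON.stringify({ approved: true, notes: "ok" });
  return "Thanks for reaching out. We are looking into your upgrade now.";
};

const pipelineResult = runPipeline(
  { id: "t-1", text: "I need to upgrade my plan from Pro to Enterprise", customerTier: 2 },
  stubLlm,
);
console.log(pipelineResult);
```

Because each stage has its own input and output type, a wrong result can be pinned to a single stage, which is the debugging benefit the talk borrows from traditional software engineering.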
Just to show you, this is live. It's not something that's hard-coded in the system. Worth pointing out as well: this stage is slightly slow because we've broken down the individual one-shot LLM call into sequential calls. It is expected to take a little bit longer, but that's part of building up the agentic flow. And now you can see it's a bit more verbose, but there's a lot more thought going into this agent. We talked about a billing issue. You can see that it believes the severity is quite high, because the customer wants to upgrade to a different tier and there's obviously an impact from a revenue perspective. So in this case we should escalate to the appropriate people on our side. You can see here what the escalation should be, along with both the internal and the customer-facing reply. We've also included a confidence score to indicate whether this is a true issue, and as mentioned, tool calls are able to pull in information from across different systems to provide a greater level of confidence.
Perfect. All right, with that in mind, we've now hopefully shown how we can take an agent, break it down, and build something that's multi-stage, introducing tool calling to give us what we need. The next thing is to take that information and start tracing it, so we can actually see what's happening down to the individual details. This is where the observability piece comes into play. What we're going to do in this section is break down the full execution path. We know this is very nested in structure: there are tool calls, and there are additional function calls behind them. We want to track some of the key things.
As was pointed out earlier when they talked about their early struggles, that means latency, cost, and token counts. Time to first token especially is a very important metric that we see many of our customers trying to track. We also want to know what the inputs were, what the outputs were, and the metadata associated with them. And then we include additional fields so that when we talk about monitoring and observability, we can query on the fly within the UI and set up alerting if needed. So again, having an output is not enough. We need to understand the full execution path, and that's what tracing allows us to do.
Just to give you some context, and I know language choice is a very contentious topic for some, I come from a background in full-stack development, where it was Python and Go as well as TypeScript. Our SDKs are multilingual: Ruby, Go, I think even .NET, and we've got some folks who are using them. That's all good. I've just kept to TypeScript for simplicity here today, but the SDKs do cover a range of different languages that you can start tracing your application with.
A real key thing, and one of the challenges that I see with our customers more broadly, is that they will log each individual interaction as its own parent span, and that may not work for, let's say, multi-turn conversational agents. What we want to be able to do is trace that into a nested structure so you can see everything for one interaction, let's say one conversation, under a single parent span. So it's really critical, as we start instrumenting our application, to make sure that we're getting the right structures in place. Otherwise, you're not going to be able to see the full effect of where things might go wrong in your application.
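The nesting point can be illustrated with a toy in-memory tracer. To be clear, this is not the Braintrust SDK or its API: the `Tracer` class, the span fields, and the span names are assumptions made up for the sketch. It only shows how every turn, stage, and tool call of one conversation can hang off a single parent span.

```typescript
// A toy tracer for illustration only; not the Braintrust SDK.
interface Span {
  id: number;
  parentId: number | null;
  name: string;
  metadata: Record<string, unknown>;
  durationMs: number;
}

class Tracer {
  readonly spans: Span[] = [];
  private stack: number[] = [];
  private nextId = 1;

  // Run fn inside a span. Any span opened while fn is running becomes a child.
  span<T>(name: string, metadata: Record<string, unknown>, fn: () => T): T {
    const id = this.nextId++;
    const parentId = this.stack.length > 0 ? this.stack[this.stack.length - 1] : null;
    const record: Span = { id, parentId, name, metadata, durationMs: 0 };
    this.spans.push(record);
    this.stack.push(id);
    const started = Date.now();
    try {
      return fn();
    } finally {
      record.durationMs = Date.now() - started;
      this.stack.pop();
    }
  }
}

const tracer = new Tracer();

// One conversation is one parent span; each turn and its calls nest below it.
tracer.span("conversation", { conversationId: "c-1", customerTier: 2 }, () => {
  tracer.span("turn-1", { role: "user" }, () => {
    tracer.span("triage", { model: "stub" }, () => "billing");
    tracer.span("tool:searchHelpArticles", { query: "upgrade plan" }, () => []);
  });
  tracer.span("turn-2", { role: "user" }, () => {
    tracer.span("reply_writer", { model: "stub" }, () => "drafted reply");
  });
});

for (const s of tracer.spans) {
  console.log(`${s.id} parent=${s.parentId} ${s.name}`);
}
```

With this structure, filtering on the root span's metadata (for example the conversation ID) retrieves the whole tree, rather than a set of unrelated single-interaction traces.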
This then comes down to reading the trace. I know we've got software engineers in the room who are using the traditional observability tools out there, and this is not too dissimilar. But for folks who are maybe new to this scene, a trace allows us to find out not only what has happened but also what's currently happening with your application in real time. So we're tracking every single call and bringing in that metadata, and depending on where the failure mode is, we can then identify the course of remediation that we need to take. Okay?
So in the next stage, what we're going to do is add tracing to this. Then we're going to run it, go into Braintrust, and hopefully see it happening. Let's go to our readme and check out the tracing step. Okay, now we're good. I've introduced a helper script here called tracing. What we do quite well within our SDK is that we try not to introduce more complexity than is needed. If you are using a standard LLM provider's SDK, you can simply wrap that client. We provide that out of the box. When it comes down to individual calls, we can set up some helper scripts to wrap them as needed. If you are using something like Python, you can also use a decorator function, which helps as well. In this case, I've got a nice little function which helps with parent and child spans. And when it comes down to the actual application tracing, you can see here that I've got a child span which is executed throughout. I don't want to live-code this at this point, because the code is expanding pretty heavily, but I do want to walk through the key concepts.
Okay, before I run the application, I do want to go into the Braintrust UI.
If you already signed up for a free trial account, which hopefully you did earlier today, it will sometimes create a test project. That's totally fine. It will create a new project as we execute this. A really key thing as well, if you haven't done it already: in the top left-hand corner, go to your profile. Then where it says API keys, enter a name for the key, generate it, and then use that within your environment, in your .env file, or secure it in a key vault if you have one already. There's also the OpenAI key, which we're hoping you generated. That should also be added under AI providers. I've set this at the organization level. You can also do this per project if you want to segregate it by a particular team or environment; that's fully supported. In this case, we'll need this key later when it comes to managed mode and online scoring. So, just a tidbit: make sure you put the OpenAI key that you generated into your Braintrust organization here so it can be used later.
Right, with that in mind, I'm going to run the demo. As a point of reference, if you are ever stuck with setup, just run make setup to make sure everything's in place. In this case we then want to run make demo. It's just going to run through those tickets. I can also do it using the make ticket command. While it's running in the background, if you go back to your organization, you'll see there's a project called helper workshop. That's the one that will be created as part of the workshop today. If you navigate to the Logs tab, you'll start to see this coming through in real time.
Worth pointing out that, thanks to the capabilities of Brainstore, we have near-instantaneous writes into our system, with reads available shortly after. It's also a non-blocking function. A lot of our customers, especially the more sophisticated ones, are really using Braintrust at scale and pushing the envelope. It's not a case of "hey, I've got a thousand traces". They're pushing tens of millions of traces across a very short period, and they want to be able to aggregate them. As you build more sophistication into your application and roll it out to more users, that's going to grow pretty quickly, and you need a system that can handle that scale.
Once you start to see the logs coming in, they appear in reverse order. I'm going to take a look at the first one here, and we'll start to see the instrumentation that's been traced. There's a button here that allows you to view this in full screen, which I think is quite helpful. As I mentioned, because we've got the nested structure in place where we go through each node in the tree, the interaction at the top is the demo ticket. At the very top level, you can see that an LLM took action on that invocation, along with the prompt tokens in and out, and the cost and latency associated with it. I've also included some metadata, because I want to be able to extract and filter on that as needed, and I'll show you how that works. Everything's available here to view. The metadata is there. If I want to take a different look at this, we can look at individual steps. Let's say I want to look at what happened with the triage specialist, down to the actual invocation of the LLM. The SDK provides a lot of flexibility around this, so you can see what was put in, the reasoning behind it, and the information that was sent out.
And then, coming down to the last step, whether it should be escalated or not. Another of the views that we'll see is the timeline, which gives you a waterfall view to see whether a particular step is taking longer. Do we need to remediate? It's all possible to do. All right. If I take a look at the logs page and hit refresh, there should be something like four tickets that were pushed. Yep, this should come through right now. That's two tickets at the moment, and the same information that we displayed is also available in the columns.
So we've covered tracing. Let's talk about the evaluation portion. It's quite interesting, depending on where you are in your journey of building this application. A lot of customers have already built an application and have data they can pull in. But what happens when you effectively have a cold start problem, where you don't yet know what you're building? What does good look like? And in fact, what does good mean to me? In the case of the support application that we're developing today, the non-negotiables are things like: how we categorize the support case; how we make sure there's no low-severity outcome for issues which are blocking; does the escalation stay in policy with the SLAs; does the structure look sound; and if we make a particular change, does it actually improve things without breaking something or causing a regression in the application? We can check all of this using evaluation. Hopefully many folks know what evaluations are, but for those in the room who don't, think of it this way: you have your dataset, which is your input; you have a task; and then you have an expected outcome which you evaluate against with a scoring function.
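The dataset, task, and scoring function triple described above can be sketched as a tiny evaluation loop. This is a generic illustration, not Braintrust's evaluation API: `runEval`, the `EvalCase` and `Scorer` shapes, the keyword classifier standing in for the agent, and the four sample tickets are all assumptions.

```typescript
// Generic sketch of an evaluation: data + task + scorers. All names are hypothetical.
interface EvalCase<I, E> { input: I; expected: E }
interface Scorer<O, E> { name: string; score: (output: O, expected: E) => number }

// Run the task on every case and average each scorer across the dataset.
function runEval<I, O, E>(
  data: EvalCase<I, E>[],
  task: (input: I) => O,
  scorers: Scorer<O, E>[],
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const scorer of scorers) totals[scorer.name] = 0;

  for (const testCase of data) {
    const output = task(testCase.input);
    for (const scorer of scorers) {
      totals[scorer.name] += scorer.score(output, testCase.expected);
    }
  }

  const averages: Record<string, number> = {};
  for (const scorer of scorers) {
    averages[scorer.name] = data.length === 0 ? 0 : totals[scorer.name] / data.length;
  }
  return averages;
}

// A deliberately naive keyword classifier standing in for the agent.
function classify(ticket: string): string {
  if (/invoice|charge|plan/i.test(ticket)) return "billing";
  if (/password|log in|locked/i.test(ticket)) return "account";
  return "other";
}

const goldenSet: EvalCase<string, string>[] = [
  { input: "I was charged twice on my invoice", expected: "billing" },
  { input: "I cannot log in, my account is locked", expected: "account" },
  { input: "Please reset my password", expected: "account" },
  // Edge case: the export feature is broken, so the expected category is technical.
  { input: "Not urgent, but I can't export invoices before the board meeting", expected: "technical" },
];

const exactMatch: Scorer<string, string> = {
  name: "category_match",
  score: (output, expected) => (output === expected ? 1 : 0),
};

console.log(runEval(goldenSet, classify, [exactMatch]));
```

In this run the naive classifier misses the edge case, so the averaged score falls below 1. Surfacing that kind of gap is what a golden set is for.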
This is a little bit different when comparing traditional software development with AI systems, because of their non-deterministic nature. In this portion, what I'm going to be talking about is creating what we call a golden dataset. In this support application I've been testing anecdotally, but I want to create a set of edge cases which will give the business at least an initial level of confidence that what we're releasing into production is fit for purpose. There's always room for improvement, but it's not just me releasing this application based on vibes. I have a concrete way of saying, okay, this is how it has performed over time.
To do this, we use two main types of scoring functions. The first is deterministic. I think a lot of folks have already started using these. Coming from traditional software engineering, I wouldn't say they're analogous to unit tests, but they are quite similar in nature: they're very easy to run and cost-effective, and you're not actually using a model at this point. The second type, which is a little bit more sophisticated, uses an LLM as a judge, so another AI system. This is really helpful when there's nuance that can't be captured by deterministic checks alone, things like brand style or whether the reply meets customer satisfaction. So the most important thing is: if you cannot write it in a deterministic way, you want to be using an LLM as a judge. And the reason this matters is that it makes sure any change we make is safe to make as we progress.
Okay, so I'm going to pivot into the IDE again. Let's take a look at the readme file, and let me find the right directory.
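The two scorer types could look something like the sketch below. The output shape, scorer names, grade scale, and judge prompt are assumptions for illustration, and the judge is an injected function so the example runs without a model. A real LLM-as-a-judge scorer would call a model with a rubric instead.

```typescript
// Hypothetical output shape and scorer names, for illustration only.
interface TriageOutput {
  category: string;
  escalate: boolean;
  escalationReason?: string;
  customerReply: string;
}
interface Expected { category: string }

// Deterministic scorer: cheap, no model involved, close in spirit to a unit test.
function categoryMatches(output: TriageOutput, expected: Expected): number {
  return output.category === expected.category ? 1 : 0;
}

// Deterministic scorer: an escalation must always come with a reason.
function escalationHasReason(output: TriageOutput): number {
  if (!output.escalate) return 1;
  return (output.escalationReason ?? "").trim().length > 0 ? 1 : 0;
}

// LLM as a judge: the judge is injected so a stub can stand in for a model call.
type Judge = (prompt: string) => string;
const GRADES: Record<string, number> = { A: 1, B: 0.5, C: 0 };

function replyToneScore(output: TriageOutput, judge: Judge): number | null {
  const prompt =
    "Grade this support reply for empathy and brand tone. " +
    "Answer with a single letter: A, B or C.\n\n" +
    output.customerReply;
  const grade = judge(prompt).trim().toUpperCase();
  // An unparseable verdict yields null rather than a misleading numeric score.
  return Object.prototype.hasOwnProperty.call(GRADES, grade) ? GRADES[grade] : null;
}

const sampleOutput: TriageOutput = {
  category: "billing",
  escalate: true,
  escalationReason: "Revenue-impacting upgrade is blocked",
  customerReply: "Sorry for the trouble. We have escalated your upgrade request.",
};
const stubJudge: Judge = () => "b";

console.log({
  category: categoryMatches(sampleOutput, { category: "billing" }),
  escalation: escalationHasReason(sampleOutput),
  tone: replyToneScore(sampleOutput, stubJudge),
});
```

Keeping the deterministic checks separate from the judged ones means the cheap scorers can run on everything while the judged scorer is reserved for the nuanced questions.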
Then make setup, and we should be fine. Okay. One thing I'd like you to do as well is run the seed dataset command, so we do make seed dataset. What this is going to do is upload our evaluation test cases into the Braintrust UI. If I pivot into my datasets now, you'll see it's called helper seed dataset. For simplicity I've created 10 inputs, and I've also categorized them. For each input, we're expecting some output, with metadata associated with it. That's the core structure for creating an evaluation.
As part of this as well, covering both the deterministic and non-deterministic side, I've created some scoring functions which we'll be using. They're available here under the scoring functions. There are checks that the category matches, that a schema is in place, and that there is an escalation reason when needed. Those are very easy to run and codify. Then we have more sophistication with the customer rubric, which is the LLM-as-a-judge use case.
All right. Then what we can do is go back to the readme. We don't need to do the demo, because we already pushed that through. Let's go ahead and run make eval. This is going to run an evaluation. The seed data doesn't necessarily have to be in code. It could come from a database. In this case I've just used a flat JSON file, to give you some context. With this evaluation running, if we go to the UI, there's a place for experiments, so I'll start checking there. Yes, that experiment ran, with the JSON associated with it. You should get output like this in the terminal, and if we go back to the UI, we're starting to track our application against the inputs and outputs for this particular dataset. It's quite similar to the tracing we saw from the online traces: you get a very similar view here, and you can see how it works across your experimentation as well.
Worth pointing out that, as we progress with the workshop, we'll start to see how we then improve it using the different functions and track that across the UI. If you need a bit more screen real estate, you can collapse the menu, which does help. Okay, time check. All right, let's proceed with the next checkpoint, which is about deploying and managing this.
As I mentioned, what I've seen with a lot of customers is that things work well on my machine, and now I'm checking that into code. I want to take this into a place where I can start to collaborate a little bit better. There's a point of reference. I've got version history. You can start to identify who's made what change. And you need a way to bring these users together. The customer talked about collaboration. What's interesting as well is the case where you're changing the prompt on your machine and then trying to ship that code to a repository, and then maybe somebody who's a non-technical SME, a product manager perhaps, wants to update the prompts. They can't do it. They have to tap you on the shoulder. Maybe that's happened to some of the folks in the room here today. I know it's happened to me a few times before, and it can start to get really frustrating. So we want a way to pull that together.
A really key thing, especially for those who work in very regulated industries, is reproducibility. I've worked for the better part of a decade in both banking and capital markets. With the regulation out there, especially things like the right to be forgotten, understanding who's made a change matters, especially in a stressed exit scenario: how can we move this into a new system? This is really key to helping unlock that. It's also important when it comes to identifying the impact of changes before you make them.
We don't want to just be making changes, pushing them out to production, and asking afterwards what happened. We need to get that in place first. So we're introducing some capabilities now. In what you've been running so far, you've been running the tools and the prompts, everything, on your local machine. What we want to do is offload that capability into Braintrust. When your application is running in the secure environment, it can refer to Braintrust to pull that information and then help with the path of execution, which follows the tracing mechanisms that we've set up. By default, when you're running the make commands, the runtime mode is set to local. If you want to use managed mode, just use the managed prefix, and the remaining make commands work there as well.
What I want to do then is pivot into the IDE. Going back to my readme, I'll take a look here. We'll do make setup. A key thing I do want to emphasize at this point, because we're managing this in Braintrust: use the setup Braintrust command here. What that's going to do is package up the scoring functions, tools, and prompts and push them onto the secured infrastructure. You should get output like this, and let's see what that looks like in the UI. Let's go back to the overview. On the left-hand side, if we take a look at prompts, you'll start to see the three prompts that we created as part of that workflow. To give you an idea, let's go to the triage specialist. You can define a slug, which is an ID that you can refer to in the code. This can also be generated as needed. The prompt is, again, treated as code. We also use some interpolation there if you want to parameterize it, and I'll show what that looks like shortly. Then there are the scoring functions, under scorers. I'll come back to those in a second.
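The local versus managed switch, together with slug lookup and template interpolation, can be illustrated with a small resolver. This is a sketch of the idea only, not how the Braintrust SDK loads prompts: `resolvePrompt`, the `PromptDef` shape, the in-memory `managed` map standing in for a remote store, and the double-brace template syntax are all assumptions.

```typescript
// Sketch of runtime-mode switching; the "managed" store is an in-memory stand-in.
type RuntimeMode = "local" | "managed";
interface PromptDef { slug: string; version: string; template: string }

// Prompts checked into the repository, keyed by slug.
const LOCAL_PROMPTS: Record<string, PromptDef> = {
  "triage-specialist": {
    slug: "triage-specialist",
    version: "local",
    template: "Triage this tier {{tier}} customer ticket: {{ticket}}",
  },
};

// The same slug resolves to different sources depending on the runtime mode,
// so application code does not change when a prompt is edited remotely.
function resolvePrompt(slug: string, mode: RuntimeMode, managed: Map<string, PromptDef>): PromptDef {
  if (mode === "managed") {
    const remote = managed.get(slug);
    if (!remote) throw new Error(`No managed prompt for slug "${slug}"`);
    return remote;
  }
  const local = LOCAL_PROMPTS[slug];
  if (!local) throw new Error(`No local prompt for slug "${slug}"`);
  return local;
}

// Minimal double-brace interpolation; fails loudly on a missing variable.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`Missing template variable: ${name}`);
    }
    return vars[name];
  });
}

// Pretend a product manager edited the prompt in a shared UI.
const managedStore = new Map<string, PromptDef>([
  ["triage-specialist", {
    slug: "triage-specialist",
    version: "v2",
    template: "You are a careful triage specialist. Tier {{tier}} ticket: {{ticket}}",
  }],
]);

const activePrompt = resolvePrompt("triage-specialist", "managed", managedStore);
console.log(activePrompt.version, render(activePrompt.template, { tier: "2", ticket: "Plan upgrade fails" }));
```

Recording the resolved version alongside each trace is one way to keep runs reproducible when prompts can change outside the codebase.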
Let me take a look at parameters. I've created parameters specifically to simplify things like changing the baseline model. Maybe a show of hands: these models get released so quickly. Has anybody had a PM or a non-engineering SME say, "hey, by the way, can I change the model or the prompt and see what that would look like?" Has that happened to you folks? I think I've got three hands here. Great. The good thing now, using the managed parameters, is that those non-technical SMEs can come into Braintrust and change the prompt here, and write a comment to say, look, let's use a different model. I'm going to use something like a 4-series mini model and just say "testing a new model". I'm going to save this version here with the comment.
What I'm going to do as well, just to give you the end-to-end picture, is run make setup Braintrust. Every time you change something, you can run this command. I'm just doing it for brevity here, and it's mostly to keep things in sync. If I take a look at the parameters in place, the change should be there. Then I run the command again in managed mode. By running in managed mode, I'm changing the course of the execution: not to use the model configured locally, but to follow what Braintrust has set for the model. And if everything goes well, you can see that it's picked up the changed model here. So I didn't have to make any code changes. All I had to do was go into the UI and change the model to the one I wanted. Once you have a parameter, you can run that and use it to establish a baseline for the evaluations. If you want to, you can also run the demo script to push in the demo tickets. I'm just skipping that for the workshop today.
Something I do see with many of my customers is that this can raise another set of concerns: "hey, by the way, Braintrust now has control over these prompts and parameters." What I would say is that Braintrust is not really intended to replace that rigor. You probably still want to use things like version control systems to track changes. What we're saying is that when it comes to operationalizing this and making sure that other users are able to work on the shared system, this is the recommended path that we would take. You would probably still keep your prompts, your tooling, and your parameters in a centralized place, but then put automation in place to synchronize them. That's the best way we see customers take advantage of this.
In this portion we're going to talk about online scoring. Now that we have those evaluations in place, what we're going to do is apply that scoring to actual live production logs that are coming through the application. It's great that we've done our test cases and we've got some level of confidence that it's working, but there's no substitute for production data. We all know this. So what we're going to be doing is moving the logic into Braintrust and setting up automations that track and evaluate logs as they come in, in real time.
Worth pointing out: when you start your journey, especially if you're using an LLM as a judge, you probably want to start with a higher sampling rate. As logs are coming in, you want to make sure you establish a baseline where possible. But there's a trade-off, because these calls can be quite expensive, especially if you're using more sophisticated models and you need a high level of reasoning. Once you're happy with the output, you may want to reduce that sampling rate down to five to ten percent.
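The sampling trade-off can be sketched as a deterministic, hash-based decision per trace. This is an illustration of the idea, not how Braintrust implements sampling: the hash choice (an FNV-1a style hash over UTF-16 code units), `shouldScore`, `planScoring`, and the scorer names are all assumptions.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units; any stable hash would do.
function hash32(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Deterministic sampling: the same trace ID always gets the same decision,
// so re-processing a log does not change which traces were judged.
function shouldScore(traceId: string, samplingRate: number): boolean {
  if (samplingRate <= 0) return false;
  if (samplingRate >= 1) return true;
  return hash32(traceId) / 0x100000000 < samplingRate;
}

interface ScoringPlan { deterministic: string[]; judged: string[] }

// Cheap deterministic scorers run on everything; the expensive judge is sampled.
function planScoring(traceId: string, judgeRate: number): ScoringPlan {
  return {
    deterministic: ["category_match", "schema_valid", "escalation_reason"],
    judged: shouldScore(traceId, judgeRate) ? ["customer_rubric"] : [],
  };
}

// Start high to establish a baseline, then lower the rate once it is stable.
const traceIds = Array.from({ length: 1000 }, (_, i) => `trace-${i}`);
const judgedAtFull = traceIds.filter((id) => shouldScore(id, 1)).length;
const judgedAtTenPercent = traceIds.filter((id) => shouldScore(id, 0.1)).length;
console.log({ judgedAtFull, judgedAtTenPercent });
console.log(planScoring("trace-42", 0.1));
```

Hashing the trace ID instead of calling a random number generator is a design choice: it keeps the sampled subset stable across retries, at the cost of depending on how well the hash spreads the IDs.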
So again, you're managing your cost effectively. Deterministic scorers, again, are cheap; I recommend running them all the time. Sorry, what was that question? Yes, I'll bring that up in the UI and show you. If I go back into the IDE, I'll go into the README here. Oh, so that's why: I missed out the managed tools step. So let's just do a git checkout. Okay, there. As I mentioned, the steps all build upon each other, so skipping this one is totally fine. So, git checkout. All right, now if I run make setup, everything should be fine; it sets up Braintrust. In this case, I'm actually pushing up the tools as well for production. Again, this is just to help accelerate things where possible. Now if I hit refresh and go to the scorers, the scoring functions that you saw earlier in code are now managed in Braintrust, so you can see they're available here. The ones I want to call out are these three. I've got an LLM-as-a-judge scorer, which has been applied, and there's an automation rule in place for online quality scoring. What I've done here, again, is automate the setup as code to bootstrap the project. What we're able to do here is say that, depending on the case, we can run the scorer against an individual span or against the entire trace. And this is why metadata is so important, because we may only want to score specific failures that happen within the code. Again, it depends on the use case. My sampling rate, as I mentioned, is set to 100, but for more expensive calls we would want to bring that down as well. So this is how we would set up the automation in Braintrust today. No, the automation here is the execution of scorers against incoming logs. Separately, I've applied automation to setting up this environment and scaffolding, so I just want to delineate that here. Sorry, is there a question? Could you expand on that, please? Yeah. So, correct. Yes, correct. Okay.
Yep, we'd probably want to take that as an edge case, push it into a dataset, identify it, and then feed that back. So that would be the approach I would take for this. And if you don't have any ground truth already? This is why I said that, depending on where you start, it's better to have some kind of data and then begin your flywheel from that. Oh, in that case, we can put that into a dataset and then replay it through the playground. That's going to be the way we do that. Yeah. All right. So we've checked out online scoring, and we're coming up on time.
Okay, now to the remediation portion. Hopefully you'll see the delta, which again is why we're here today, and hopefully this helps with that particular question. So here's something that might come in as a plausible input to our agentic system. A customer might say, "Hey, no, this isn't urgent, but our CFO will want to export the invoices before the board meeting." The model says, look, this looks okay to me; someone said not urgent. Comme ci, comme ça. But the business reality is very different, right? This probably does need immediate attention: your CFO has an end-of-quarter report that needs to be done. And this is the point of what we're doing here today: trying to identify what is a proper failure mode and then remediate it where possible. So in this case I can run this particular mode here. I've got this in a dataset, so it's going to be a replay. We replay the failure, we run the evaluation against it, we tighten the prompt, then we run it again and see what that looks like. And we run it not just against one particular test case but across our entire set of test cases as well, to check that it works as intended and that we haven't regressed on something else. Okay, so let's go ahead and have two separate branches for this.
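The replay-then-remediate loop described above can be sketched without any LLM at all. In the minimal Python example below, everything is a hypothetical stand-in: the three cases, the labels `"high"` and `"low"`, and the two keyword-based functions, which merely play the role of the original prompt and the tightened prompt. What it shows is the shape of the check: run the whole regression set, not only the failing ticket, before and after the change.

```python
from typing import Callable

# Hypothetical regression set; the first case mirrors the ticket from the talk.
CASES = [
    {"input": "This isn't urgent, but our CFO needs the invoices exported "
              "before the board meeting.", "expected": "high"},
    {"input": "Not urgent: could you update the logo on our profile page?",
     "expected": "low"},
    {"input": "URGENT: nobody on the team can log in.", "expected": "high"},
]


def naive_triage(text: str) -> str:
    """Stand-in for the original prompt: trusts the customer's own wording."""
    lowered = text.lower()
    if "not urgent" in lowered or "isn't urgent" in lowered:
        return "low"
    return "high" if "urgent" in lowered else "low"


def tightened_triage(text: str) -> str:
    """Stand-in for the tightened prompt: business deadlines outrank wording."""
    deadline_signals = ("board meeting", "deadline", "end of quarter")
    if any(signal in text.lower() for signal in deadline_signals):
        return "high"
    return naive_triage(text)


def run_eval(classify: Callable[[str], str], cases: list[dict]) -> dict:
    """Replay every case and report the pass rate plus the failing inputs."""
    failures = [c["input"] for c in cases if classify(c["input"]) != c["expected"]]
    return {"pass_rate": (len(cases) - len(failures)) / len(cases),
            "failures": failures}
```

Comparing `run_eval(naive_triage, CASES)` with `run_eval(tightened_triage, CASES)` on the full set is what guards against fixing the CFO ticket while quietly breaking the genuinely low-priority one.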
So we split into two branches. The first one has the failure in mind: we're going to go to the UI and view that, and then we're going to talk about the remediation path as well. So I can run a make target here, which will replay the failure. Okay, so I've got the set of five cases, which are the regression failure modes. The first one is the one you saw in the example ticket. Yeah, that's in a JSON file here. Let's take a look. Okay, so take a look at the failure mode here. You can drill into this. You'll notice as well that, now that we've set up the managed tools as well as the online scoring, the trace becomes even more sophisticated, in that we're executing this against the secure Braintrust environment. So again, we're moving from local to managed. I'm just going to go to the Makefile. Sorry, thanks, not the JSON. Let's go to the manual run. So I'm just doing an evaluation against a specific scenario. Okay, so as we progress with the application, the experiment view just allows us to see the progress of our changes. You can see we've run the latest set. We notice some degradation here, and this allows us to track it. So with the managed dataset that I had in mind, the real failures are being captured. And you can see I'm able to compare against existing experiments to track the progress and remediate where possible. So the next thing is I'm going to go to the README and then proceed with the remediation. In this remediation, I've changed the prompt that I use. One way to view that is to take a look at the prompt and the change, so we'll see any differences there. Coming down to the README file here. I think we need to give it a specific flag. I don't know if you want to just double-check that. Yeah, that's it.
So if you want to just put that in the chat: if exists, replace. Yeah. So yeah, folks, as part of the remediation script, if you set the Braintrust "if exists" environment variable to replace, it's going to push the updated changes into the environment. So, if I take, oops, it's not that one. And then you'll see as well that, now that we've updated the prompt in the code and pushed it up, it's available here too. So in the UI, again as part of the operationalization, we can see not just who has changed what, but actually what has been changed in the particular prompt in this case. Okay. So, coming back to here, let's run this evaluation again. We made the change to the prompt, and now we're running the remediated version to see how that performs. Okay, so that has run the experiment with the new changes, hopefully. And pivoting to the UI, I'm going to go to experiments here. It's barely there, but you can see it tick back up; any movement upward is an improvement. But yeah, just to give you an idea, now that we've done the evaluation, what I can actually do is a diff, so we can do a comparison and see the delta. Yeah. So the scores come up and improve over time, which is the intended outcome.
I think we're approaching the end of the content. I know it's been a number of steps, so I really want to thank you for your time and attention in walking through that. As mentioned, the artifacts are public. We've got the cheat sheet there. We've got the Slack channel to help if you have any questions.
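As a rough illustration of the experiment comparison mentioned above, the sketch below diffs per-case scores from a baseline run and a remediated run. This is plain Python rather than the Braintrust experiment UI or SDK; `diff_experiments` and the shape of the score dictionaries (case id mapped to a score between 0 and 1) are assumptions made for the example.

```python
def diff_experiments(baseline: dict[str, float],
                     candidate: dict[str, float]) -> dict:
    """Compare per-case scores for the cases present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {case_id: candidate[case_id] - baseline[case_id] for case_id in shared}
    return {
        # Cases where the remediated run scored higher than the baseline.
        "improved": [c for c in shared if deltas[c] > 0],
        # Cases where the change made things worse; these need a look before shipping.
        "regressed": [c for c in shared if deltas[c] < 0],
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
    }
```

A positive `mean_delta` alone is not enough to ship: a non-empty `regressed` list is the signal that the prompt change fixed one failure mode at the expense of another.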
Just to give you a summary of what you've accomplished today, in this order: you went from taking a single-shot prompt to building a five-stage agent workflow using tool calls. We were then able to inspect how this works, and dive into it by adding Braintrust tracing, making sure everything is recorded. We also talked about how we evaluate the system when it's not online, creating what is effectively a golden set of test cases to execute against. We then deployed those managed prompts, tools and parameters into the secure Braintrust architecture, and we also added online scoring to evaluate the system as it unfolds. And then we picked a particular production failure, we looked at the trace, we modified the prompt in our case, and we saw the delta there: running the evaluation again, we saw it get back up to where it needed to be. And in this case, that completes the full loop of building it, observing it, evaluating it, deploying it and taking that forward.
So yeah, just to close it out and bring it home. Hopefully this is not uncommon to hear, but what works in a prototype is not necessarily going to work in production. We really need to break this down, identify the failure modes and move forward. And that's where, again, explicit stages become really important, right? This does introduce more areas where things could go wrong, but it's easier to debug when that's the case. There's no substitute for diving into the code and tracking everything. So I would say observability is table stakes at this point.
If you've got a production-level application and you're not tracing it, you need to go back to the drawing board and get that operationalized. And hopefully we've shown you how to do that using Braintrust. Again, there's no substitute for production logs, but it's better to start somewhere along the way. If you have an idea of what issues might happen, these are the perfect way to supplement evaluation: use those failure modes as your test cases. And as I mentioned, this is a continuous process, right? Nothing's ever done. If you've worked in agile development, you know constant feedback is important. And again, we're bringing this operating model, but with a new surface to operate on. So hopefully you can bring this home to your teams.
My encouragement to you, if this is something of interest, is to pick something that's already operational today, and it doesn't have to be the entire suite. Maybe start with something that's a bit more critical, where you really want to improve the operational modes. Add the tracing, collect your edge cases from that, build those deterministic scorers, and then feed everything back where possible. The faster the feedback loop you have, the more insight you have, and the more you can improve the overall delivery and operations of your system.
And yeah, just a call to action. I know we've thrown a lot of content at you. We're obviously trying to get a bit of feedback, and we tried to put some structure in place, so I appreciate everyone juggling everything to do this. But as I mentioned, you have to start somewhere. Let us help accelerate you.
We have a set of documentation that's provided, and you can also use our AI agent in the docs to search if you have any questions. We also have a cookbook available. I tend to throw the cookbook directly into Cursor, Codex or whatever and say, basically, take the SDK and start tracing my application. It does a pretty effective job. We even have a CLI now for Braintrust, which allows you to do things like auto-instrumentation. I'm just going to plug my colleague Eric, who's doing some fantastic work; please check out his booth because it's amazing. And if this is something that interests you and you want to explore more, please reach out to your contact at Braintrust. We're happy to support where possible. And if you are on Discord, feel free to join; we're happy to answer questions there. And yeah, again, I really want to thank you for your time and attention. It's a really sunny day and I don't want to keep people in here; I want you to get some fresh air. But on behalf of Braintrust, Trainline and myself, thank you so much for your time and attention. It's been an honor, and I look forward to seeing you out there tracing and getting value from delivering AI. I applaud you. Thank you.
