- LLM-based judges often fail to correlate with human annotations, leading to unreliable evaluations of AI agents in both development and production environments.
- To address this, LLM judges must be calibrated using human feedback, specifically through prompt optimization techniques like the GEPA algorithm.
- Effective calibration requires context-specific binary metrics, high-quality human annotations with detailed reasoning, and iterative refinement of the judge's prompts, often starting with a well-engineered seed prompt.
Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
- Uncalibrated LLM judges provide misleading signals, hindering the speed of AI agent development and failing to accurately assess real-world performance.
- Calibrated LLM judges are essential for accelerating offline evaluation loops, enabling rapid online monitoring of agent performance, and establishing a "data flywheel" for continuous improvement.
- Design evaluation metrics tailored to the specific business use case, leveraging subject matter experts to identify concrete error types. Binary (yes/no) classifications are generally more effective than granular 1-5 scores for LLM judges to learn.
- High-quality human annotations, especially those including the reasoning behind a judgment, are crucial for the LLM judge to learn complex policies and criteria effectively during optimization.
- The GEPA algorithm optimizes LLM judge prompts by iteratively sampling new candidates (via mutation and merging), evaluating them against mini-batches, and selecting a diverse set of high-performing prompts using a Pareto frontier approach.
- The initial "seed" prompt for the LLM judge is critical; starting with an unbiased assumption (e.g., "assume compliant unless specified") can prevent early biases and improve learning trajectory.
- Customizing the "reflection template" used by GEPA to guide the LLM in analyzing failures and proposing improvements (e.g., by explicitly asking it to learn policy rules) significantly enhances the optimization process.
- Optimization is an iterative process requiring debugging and parameter tuning; begin with small iterations, analyze how candidates improve, and refine prompts or configurations based on early results, as the algorithm isn't "plug-and-play."
LLM as a judge — An AI model, typically a large language model, used to evaluate the performance or outputs of another AI agent or system.
Calibration — The process of adjusting an LLM judge's behavior to ensure its evaluations align closely with human judgments or predefined criteria.
Prompt optimization — Techniques used to automatically or systematically refine and improve the text prompts given to an LLM to achieve better performance for a specific task.
GEPA algorithm — A prompt optimization algorithm that uses genetic-algorithm-like principles, including prompt mutation, merging, and Pareto frontier selection, to evolve effective prompts.
Agent — An AI system designed to perform tasks by interacting with an environment, often by using tools or following complex policies.
Hallucination — A phenomenon where an LLM generates information that is factually incorrect or inconsistent with its training data, presenting it as true.
Pareto frontier — In optimization, a set of solutions where no single objective can be improved without sacrificing at least one other objective; used in GEPA to select diverse and high-performing prompts.
Reflection template — A specific prompt structure used within GEPA or similar algorithms that guides an LLM to analyze its own performance or previous prompt failures and propose improvements.
Data flywheel — A concept in AI engineering where the output or observed behavior of an AI system generates new data that is then used to retrain and improve the system, creating a continuous feedback loop.
Policy adherence — The extent to which an AI agent follows predefined rules, guidelines, or business policies in its interactions and decisions.
Hello everyone, and welcome to my talk slash workshop, "Judge the Judge." Today we're going to talk about LLM as a judge.

I'm quite sure you know this scenario. You have an agent in production, and someone from the team says, "We need to monitor the reliability." So you go to one of the libraries, maybe pick the hallucination LLM as a judge, and put it in production within your observability platform. Things look fine, but customers are actually saying that the agent is not working, and when you look at the traces, it's not working. You now look under the hood of this hallucination LLM as a judge, and you find a prompt not very far from this one: "You'll be given an LLM's output. Rate whether it's a hallucination. Make no mistakes." Now obviously, how the hell would the LLM know whether it's a hallucination? If it could, then your output would have worked from day one.

So today we're going to talk about how we can build calibrated LLMs as judges that work. Calibrated means calibrated with human annotation, and the way we're going to calibrate them is by using optimization, or prompt optimization specifically. We're going to use GEPA, quite a good algorithm for optimizing prompts.

Now, why do we want calibrated LLM judges, or good LLM judges? The first thing is offline evaluation. As you know, the way you usually create a good agent or a good prompt is that you experiment with the prompt, then run your evals and see if it improves things or not. If it does, good. If it does not, you go back and improve it a little bit, improve the harness, the prompt, and do it again and again. The speed at which you move to production or add features is actually the speed at which you can complete this loop, and the bottleneck in this loop is the evaluation. How fast can you evaluate?
Obviously, the slowest possible evaluation is having a human annotator look at your whole test set and annotate it manually. The quality is quite good, but then each iteration will take a lot of time. You can have faster iterations by using an LLM as a judge, but if that LLM as a judge does not correlate with human annotation, you end up with a useless signal: the loop will move fast, but it won't go anywhere. So having a calibrated LLM as a judge, with a quality similar to, let's say, a human annotator, will make development much faster.

The second thing is online evaluation, like our example from the beginning. With online evaluation, you want to see in production whether things are improving or not. Same thing here: if you have LLMs as judges that are calibrated with your business goals, then you can quickly see whether the changes you've made are an improvement or not, whether there is some change in the distribution of the data, in how people are interacting with your agent or model, and react rapidly.

And finally, and I would call this the holy grail of AI engineering, there is building the data flywheel, where you optimize your harness, observe some traces, then add new evals based on these traces, the edge cases, and do it again and again and again. If you have a way to add new evaluations quickly, obviously automatic evaluations, from the traces, from the annotations, from the data, you can go through this loop faster and faster, to the point that you can think of it as an automatic loop. You can optimize the harness with optimization techniques like GEPA, which is what we're going to use here, but you can do the same thing for the evals, and over time your application will improve just with new observations.

So today we're going to build and optimize LLMs as judges that are calibrated with human annotations. But before going there, a small intro about myself: my name is Mahmoud, the
co-founder and CEO of Agenta. Agenta is an open-source LLMOps platform, providing you all the tools, from observability to prompt management to evaluation, covering the whole life cycle of building reliable agents. My experience is in machine learning; I have more than 15 years of experience in that. In a previous life I was in academia and worked on machine learning applied to computational biology, protein structure prediction. Right now, we're working a lot on these sampling and auto-optimization workflows, so if you're interested in that, please reach out. We'd love to have a conversation and also show you what we're building.

So what's the plan for this talk? We're going to work on a practical use case: a customer support agent that we want to evaluate. We are going to build an LLM as a judge that is calibrated with the human annotations for that customer support agent. The plan is to go over the whole process of building this, starting with how we design the metrics and how to think about data curation and labeling. But the main focus will be on optimizing the LLM as a judge using GEPA and then, obviously, validating the results. All the code and the data used here can be found on GitHub; you can find them in the links for this video and on the last slide.

So let's start with the dataset.
We're going to use tau-bench. tau-bench is a benchmark and a large dataset built by Sierra, a customer support scale-up, I think, and they have multiple benchmarks for real-world scenarios for customer support agents. One of them is the airline agent, which we are going to use in this example. What they have is an airline customer support agent that has access to multiple tools to manage reservations, access flight information, and access user information, and that has quite a complex policy to adhere to, like when to change a reservation, when to provide information, and so on and so forth, just like a human customer support agent.

The data that we have is the agent itself but, most importantly, 599 generated conversation traces with annotations. The original format of the annotations is assertions, but I preprocessed and postprocessed the data so that for each trace we have an annotation, like a human annotation. For example, in this case you have the conversation that you see here, and then an annotation saying that the agent is not compliant because it approved the cancellation without verifying that the reservation met the airline's cancellation rules. So the evaluation failed, and the reason is that the agent canceled the reservation without verifying something; it did not adhere to the policy.

The data is more or less not very good. It's 62% compliant, 38% non-compliant, and it's generated with multiple models and trials. Overall, the problem is quite complex because the policy is quite complex. The data has caveats, honestly. It's not very clean, due to how it was generated, but for our use case,
I think it's a very interesting use case to test GEPA and to demo how it would work in a test case.

The workflow that we have is four steps. The first is designing the metrics, that is, deciding what the LLM as a judge will measure, what the different axes are that it will look at. The second is annotating the data. Then we optimize the judge and validate the results.

The most important thing to take away here is that the metrics need to come from the use case itself. It does not make sense to have general metrics like hallucination when you're evaluating your AI agent; it really depends on the business use case. The best people to determine these metrics are the subject matter experts. For example, in the case of a customer support agent, you need to have a subject matter expert look at the conversations and provide feedback. I think the person who has described this workflow best is Hamel, from hamel.dev. I'm going to share his blog and the YouTube video, which describe this idea of error analysis very well. But I'm going to go over it very quickly, and also over the annotation workflow.

The idea is that you provide your subject matter experts with all these traces, the trajectories of the conversations, and they annotate them, first by commenting on what worked and what did not work,
but then slowly trying to create clusters of the error types: when it's failing, why is it failing? Here I'm showing how it's done in Agenta. I discovered these four error types while going through these traces: sometimes there are issues with policy adherence, sometimes issues with the response style, sometimes issues with information delivery, where the agent is not informing the customer that it made a change or something like that, and finally, some tools have not been called correctly.

The idea is that we're going to take these four error types and build four LLMs as judges for them. It does not make sense to have one LLM as a judge for "success" that tries to evaluate all of these. That would make it too complex and very hard to learn, and you will see a little bit later that even when we simplify it, it's still hard to optimize a calibrated LLM as a judge. So it makes a lot of sense to make the metrics that we want to evaluate very specific.

The second thing is to move away from one-to-five scores or percentages and instead have a truly binary answer, like whether the agent adhered to the policy or not, obviously with some reasoning. The reason, again, is that it's already quite hard to calibrate an LLM as a judge on a true/false binary classification. Think about adding another layer and saying, okay, it should be a number between one and five.
That is hard. It's even hard for human annotators; it's hard to have two human annotators agree on the same score.

Once we have defined these different metrics, annotation can start. Here again I'm using Agenta. You take your traces, create an annotation queue, and specify for your annotators the name of the feedback or evaluator, policy adherence here. For each trace, they should indicate whether it adhered to the policy or not, and provide the reasoning. The reasoning here is very important, because without it, as we will see, the optimization algorithm needs to discover by itself why the agent failed, and that's going to be very hard unless it's a very specific kind of feedback, for example tool failure, where it can infer things. It's very hard to see from a conversation alone why it did not adhere to the policy without someone providing information about it in the beginning. So having that reasoning is very important later for the LLM as a judge to learn. And again, as I showed previously with the annotation, this reasoning describes why, for example, the agent here is not compliant: because it approved the cancellation before verifying that the reservation met the cancellation rules.

So now that we have the annotations, we can go to the optimization. But before going there, I want to add a small note. Although we went very quickly through the first and second steps, these are actually the hardest parts of the problem. In reality, as every data scientist knows, getting your data is the hardest thing. You need to make sure to look at the data, look at the annotations, and make sure that the distribution is good and that the annotations and the information within the data are enough for the algorithm to learn a representation of the LLM as a judge that is meaningful. In our case here, the data is not that good. The number of traces is
small, the problem is quite complex, and the traces are not very well distributed because of the way they were generated. Also, the annotations are actually AI-generated, based on the assertions in the original data. You can see a little bit more about how it's done in the repository. So this makes things quite complex for this problem, but it's still quite a good demonstration of how it would work.

Now that we have the annotated data, we can start with the optimization, and for the optimization we are going to use the GEPA algorithm. I'm going to explain first how the GEPA algorithm works, and then we can jump to the Jupyter notebook and start optimizing based on the annotated data. It's very important to understand how the algorithm works, because you will see that in practice you need to play around a little bit with the parameters to get it to work, and it's obviously very hard to play around with the parameters if you don't understand what they do.

The algorithm is similar to a genetic algorithm. The idea is that you start with a seed prompt, and then at each iteration you sample new prompts, see which ones work, select the new ones, and improve over time. That's the general shape of the algorithm, and we're going to look at each step. It's three steps: you sample new candidates, evaluate them to see which ones work well, do some filtering using the Pareto frontier that I'm going to talk about, and then do it again and again. So let's see how it works.
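The general shape of the loop described above (start from a seed, sample a new candidate, score it on a mini-batch, keep it only if it improves) can be illustrated with a toy sketch. This is not GEPA's implementation: the keyword-matching `judge`, the `mutate` function that stands in for LLM reflection, and the six-trace dataset are all assumptions made up for illustration, and the Pareto filtering step is left out here.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Trace:
    text: str      # the agent conversation (toy: a single sentence)
    label: str     # human verdict: "compliant" or "non_compliant"
    keyword: str   # toy stand-in for the human reasoning

def judge(rules: frozenset, trace: Trace) -> str:
    # Toy judge: assume compliant unless a learned rule keyword matches.
    return "non_compliant" if any(r in trace.text for r in rules) else "compliant"

def accuracy(rules: frozenset, traces: list) -> float:
    return sum(judge(rules, t) == t.label for t in traces) / len(traces)

def mutate(rules: frozenset, batch: list) -> frozenset:
    # Stand-in for LLM reflection: take the first missed violation in the
    # mini-batch and add the keyword from its human reasoning as a new rule.
    for t in batch:
        if t.label == "non_compliant" and judge(rules, t) != t.label:
            return rules | {t.keyword}
    return rules

def optimize(seed, trainset, budget, batch_size, rng):
    pool = [seed]
    for _ in range(budget):
        parent = rng.choice(pool)                 # 1. sample from the pool
        batch = rng.sample(trainset, batch_size)  # 2. evaluate on a mini-batch
        child = mutate(parent, batch)
        if accuracy(child, batch) > accuracy(parent, batch):
            pool.append(child)                    # 3. keep only improvements
    return max(pool, key=lambda c: accuracy(c, trainset))

# Hypothetical data, invented for this sketch.
TRAIN = [
    Trace("agent confirmed the seat change", "compliant", ""),
    Trace("agent shared the flight status", "compliant", ""),
    Trace("agent updated the passenger name", "compliant", ""),
    Trace("agent cancelled a basic economy ticket", "non_compliant", "basic economy"),
    Trace("agent issued a refund without checking", "non_compliant", "refund"),
    Trace("agent gave a free upgrade", "non_compliant", "upgrade"),
]
SEED = frozenset()  # no rules yet: everything is judged compliant
best = optimize(SEED, TRAIN, budget=30, batch_size=4, rng=random.Random(0))
print(accuracy(SEED, TRAIN), accuracy(best, TRAIN))
```

The budget and mini-batch size are the kind of parameters the talk says need tuning; in this toy version a mini-batch of four out of six traces always contains at least one violation, so the first iteration always finds something to learn.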
The way it works is that you first start with a seed candidate. In our case we're going to use a very simple LLM as a judge: evaluate whether this customer service agent violated policy, and start by assuming that the agent is compliant. In each iteration, we sample new candidates from the filtered candidates of the last iteration. In the first iteration we have only one seed candidate, but in the next iterations we'll have a larger bag of candidates.

GEPA has two strategies to sample new candidates. One is prompt mutation, and the other is merging multiple candidates. Prompt mutation is what we're going to use in the beginning, since we have only one candidate. The idea is that you run the first LLM as a judge on a trajectory, and if it fails, an LLM reflects and proposes a new prompt. This is a kind of reflection, which means we're using the intelligence of the LLM to try to improve the prompt: it looks at the input, looks at the outputs, looks at the results, and tries to infer how to make the prompt better. The other strategy is the merge strategy, which we're going to use in the next iterations. Here it takes two prompts and puts them together. If you think about it, with an LLM as a judge you usually have guidelines, and what it's probably going to try to do is take guidelines from prompt A and from prompt B and put them together.

After we have generated a lot of samples, we need to select which ones are good. The way it works is that it evaluates these new prompts against mini-batches of the evals, so not everything, and if they improve the performance compared to the starting point, we select them and they are added to our bag of prompts. Then the next iteration starts, which brings in the other innovation of this algorithm: the idea of the Pareto frontier. The way we select which prompt, or
which candidates, we're going to use as seeds for the new iteration is not to select the ones that have the best average score. That would be the trivial solution: look at my prompts, see which ones work best by looking at the average over everything, and select those. Instead, what they do is add diversity by looking at the best candidates per task. You have a set of tasks in your evaluation, in our case a set of trajectories, and for each of these trajectories you look at which candidate is the best, and that is the Pareto frontier. Then you select from these, trying to end up with a set that covers your whole set of test cases, so that for each test case there is at least one candidate that solves it. The idea is that you get a good Pareto frontier, then you start merging things, and at the end of the day you have a prompt that solves everything in your training set.

Now that you have this filtered set of candidates, we again sample new candidates from them using the mutation and merge strategies, and we keep doing this until the compute budget runs out.

There are a lot of libraries that implement this algorithm. I think the best known is DSPy, which popularized the idea of optimizing prompts or harnesses. But now there's a new library, I think by the authors of GEPA, an open-source one called GEPA. Last month they implemented a new interface called the optimize_anything API, which is what we're going to use. It can be used not only to optimize prompts but to optimize almost any algorithm using this same idea. It's quite powerful. Let me show you how it works.
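The difference between picking the best average and picking per-task winners, as described earlier in this section, can be made concrete with a small sketch. The score table and candidate names are invented, and the greedy cover is a simplification chosen for illustration; the actual selection rule in the GEPA library may differ (for example, it may sample from the frontier rather than cover it greedily).

```python
# Hypothetical per-task scores: SCORES[candidate][i] is 1 if that candidate
# judge agreed with the human label on training trajectory i, else 0.
SCORES = {
    "seed":         [1, 1, 1, 0, 0],
    "rule_partial": [1, 1, 0, 0, 0],
    "rule_refund":  [0, 0, 0, 1, 0],
    "rule_upgrade": [0, 0, 0, 0, 1],
}

def best_by_average(scores):
    # The "trivial" selection: a single winner by mean score.
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))

def per_task_front(scores):
    # For each task, the set of candidates achieving the best score on it.
    n_tasks = len(next(iter(scores.values())))
    front = []
    for i in range(n_tasks):
        top = max(s[i] for s in scores.values())
        front.append({c for c, s in scores.items() if s[i] == top})
    return front

def greedy_cover(front):
    # Pick candidates until every task has at least one of its winners chosen.
    uncovered = set(range(len(front)))
    chosen = []
    while uncovered:
        options = sorted({c for i in uncovered for c in front[i]})
        pick = max(options, key=lambda c: sum(c in front[i] for i in uncovered))
        chosen.append(pick)
        uncovered -= {i for i in uncovered if pick in front[i]}
    return chosen

front = per_task_front(SCORES)
chosen = greedy_cover(front)
print(best_by_average(SCORES), chosen)
```

In this toy table the best-average candidate never solves the last two trajectories, while the covering set keeps the two low-average candidates that do. Those are the candidates a later merge step would need in order to combine their rules.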
The API here is called optimize_anything. You see this function, and what it takes is a seed candidate. The candidate is the configuration that you want to optimize; in our case that is the LLM as a judge prompt. We can even make it a dict if we want, for example the LLM as a judge prompt plus the temperature, or it could be a chain of prompts, and so on and so forth, so it's not limited.

Then we have the evaluator, which is the thing GEPA uses to optimize. The expectation for the evaluator is that it runs the system, in our case the LLM as a judge parameterized by the candidate, and then logs not only the score but also the error and the reasoning. In principle you can add as much as you want, and you'll see how we do it in our optimization of the LLM as a judge. If you remember, we used some kind of reflection and reasoning to improve our prompt, and the input to that reasoning is something that we ourselves build through this evaluator.

Then there are other options in the configuration, for example how many calls to make per iteration, the objective, providing context for the refinement prompt on how to improve, and so on and so forth. But the core flow is actually quite simple.

So now let's jump to the Jupyter notebook and look into how to do it step by step. You'll find this Jupyter notebook in the GitHub repository, which is in the links and on the last slide. We start by installing the libraries: dotenv, LiteLLM, and GEPA. I'm not installing GEPA here because I ran the optimization beforehand; it takes quite some time, I think a couple of hours, so I'm going to skip this step. But obviously, if you want to run it from the repository within the Jupyter notebook, you should also install it. So we install this and we do our imports.
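To make the evaluator's role concrete, here is a minimal sketch of an evaluator that returns a score together with side information for the reflection step. It does not call the GEPA library: the function names, the dict keys, and the `(score, side_info)` return shape are assumptions for illustration, and the judge is a stub that always answers "compliant". The exact evaluator contract expected by optimize_anything should be checked against the GEPA documentation.

```python
from typing import Callable, Dict, Tuple

# Hypothetical judge interface: (judge_prompt, trajectory) -> verdict + reasoning.
Judge = Callable[[str, str], Dict[str, str]]

def make_evaluator(call_judge: Judge):
    def evaluate(candidate: Dict[str, str],
                 example: Dict[str, str]) -> Tuple[float, Dict[str, str]]:
        # Run the system under test: the judge, parameterized by the candidate.
        out = call_judge(candidate["judge_prompt"], example["trajectory"])
        # Binary score: did the judge agree with the human annotation?
        score = 1.0 if out["verdict"] == example["human_label"] else 0.0
        # Side information the reflection step can read to propose a better prompt.
        side_info = {
            "judge_verdict": out["verdict"],
            "judge_reasoning": out["reasoning"],
            "human_label": example["human_label"],
            "human_reasoning": example["human_reasoning"],
        }
        return score, side_info
    return evaluate

def stub_judge(judge_prompt: str, trajectory: str) -> Dict[str, str]:
    # Stand-in for a real LLM call; mirrors a seed that assumes compliance.
    return {"verdict": "compliant", "reasoning": "No listed rule was violated."}

seed_candidate = {
    "judge_prompt": "Evaluate whether this customer service agent violated "
                    "policy. Start by assuming the agent is compliant."
}
example = {
    "trajectory": "User asks to cancel; agent cancels immediately.",
    "human_label": "non_compliant",
    "human_reasoning": "Approved the cancellation without verifying the "
                       "cancellation rules.",
}
evaluate = make_evaluator(stub_judge)
score, side_info = evaluate(seed_candidate, example)
print(score, side_info["human_reasoning"])
```

Passing the human reasoning through as side information is the point of the design: without it, the reflection step would only see that the verdict was wrong, not why.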
We have a couple of functions that I extracted outside the notebook; we're going to look at them later. We start by loading the data. As I mentioned, we're using the data from tau-bench. I just preprocessed it at the beginning to change the form of the assertions so that they look like the annotation I showed in the presentation. I've already split the data into a training and a validation dataset: one that we'll use with GEPA, and a second one to validate the results at the end. The split is based on the different tasks that are defined in tau-bench. If we look here, you have a training set with 480 traces, around two-thirds of which are compliant, and a validation set with 112 traces. As I mentioned, the data here is not very nice because there are some redundancies: sometimes the same task is run with the same model multiple times. But there are no redundancies between the training and the validation sets.

Let's look at what the annotations look like. We have an example of a compliant annotation and of a non-compliant one, that is, a trajectory that adheres to the policy, which is what we want the LLM as a judge to learn, and another one that does not. We can see that it describes why: this one is compliant because the agent correctly identified the basic economy reservation, while here it did not identify the user's membership as regular. This annotation is quite important for the LLM as a judge to learn the policy, especially in this case, where the policy is a very complex system that the LLM as a judge needs to learn. Without information about why something is correct or not correct, it would be nearly impossible for the GEPA algorithm to reach a good LLM as a judge. I mean, it would
be the same for a human, right? If you gave me all these trajectories and told me this one is correct and this one does not conform to the policy, but you did not give me more information, it would be very hard for me to make a judgment and learn how to assess the policies. So having this information, and the quality of the annotations, the quality of the data, as I mentioned, is really paramount to being able to learn this. And again, obviously, this is a fairly complex LLM as a judge to learn.

The first thing is that we start with a naive judge. This is the seed judge we start with, and it's something that I actually engineered. I'm going to talk about it a bit later, with the learnings on how exactly we reached it. You can see here that it evaluates whether the customer service agent violated policy, and it says that you should start by assuming the agent is compliant and only change to non-compliant if there is a specific reason. We are starting with an LLM as a judge that has no rules, which means that the seed judge, the initial judge, should in my opinion start by saying everything is all right. If I don't have any rules, it should say that everything is all right.

In another experiment, I started with an LLM as a judge that says, okay, you should check whether the agent violated policy. What you end up with is the LLM, with its own biases, trying to make that decision by itself, saying this violates, this doesn't violate, without having any information. So you need to tell it in the beginning that it should start by assuming compliance, because there is no reason yet to believe the agent is not compliant.
It should be compliant by default. Otherwise, you basically start with a kind of random LLM as a judge that would be very hard to fix later, unless you end up with a prompt that discovers this assumption of compliance right at the start. So I discovered that the initial seed is actually very important in this case. There might be simpler scenarios where you don't need this, but in this case it's quite important.

If we take this initial LLM as a judge and run it on the validation set, we find 61% accuracy. But if we look at the bias, most of the time it's saying the trace is compliant, which is actually what we want. I would say that saying it's compliant 98% of the time is the unbiased thing to do, the logical place to start. So we run the metrics in the beginning: the accuracy is 61%, the recall for compliant is 98%, and the recall for non-compliant is very low. As I mentioned, I think this is quite all right. It's biased towards compliance. I had experiments in the beginning where it was almost random, but then it doesn't learn at all. We can look into where it goes wrong by looking at the cases where it doesn't work, and we see that it says compliant when the trace is not compliant, because it doesn't know the policy.

So here we start the main code for optimizing the judge with GEPA. What you will see is that I have written a reflection template, which is the prompt that GEPA uses to reflect and to sample new candidates. I did not use their default; I wrote one. I tried the default prompt in GEPA in the beginning, but the results were not as good as I expected; it was very hard for it to learn. What I tried to do is provide a bias, a prior, within the reflection template. You can see it here, for example.
I mention, obviously, that the judge is reading an airline customer service conversation and needs to decide on compliance, which is the basics. But then you can see more information, for example that the annotation this reflection template sees also includes the judge's verdict and the ground truth, the human annotation. This is important information that the reflection template should look into to improve the prompt. I also explain to the LLM how to do this: you can add rules or restructure existing ones, reword things for clarity, and the reflection should try to create real policy rules, to find the right policies and eligibility rules. I think adding that was very important to improve the quality. Otherwise, with the default reflection template, the LLM did not understand that it should more or less try to learn the policy, to some degree, within the LLM as a judge. So that's the second thing I changed, and honestly I only iterated on it a little bit.

Then we run the optimization. The run-optimization function is basically a wrapper around optimize_anything that parameterizes it, since I ran multiple experiments to find the right one. It's part of the GitHub repository, and it's something that you can play around with and start from when you're exploring the design space of GEPA. You can see here that it builds the configuration based on the parameters, the kwargs, and then calls optimize_anything with these. Then there is the make-evaluator function.
It calls the LLM as a judge and adds all the side information. So it doesn't only provide the trajectory but also the annotation, which is quite important. Then we run this. As I mentioned, it takes around an hour to run.

You can see the results here. This is the optimized rubric, and you can see that, compared to the default we started with, it learned part of the policy criteria, like flight cancellation and refunds, flight modification, how to communicate, and so on and so forth. If we look at the results using evaluate-rubric, we see that the accuracy increased from 69% to 74%, and we removed the bias. What changes especially is the recall for non-compliant and the precision for non-compliant, which was basically zero in the beginning. Now the LLM as a judge is less biased: it's at 64% compared to 98%, and it has really learned parts of the policy.

Looking at the results, as I mentioned, for the validation set we had quite a lot of improvement, like 14%, and the training accuracy improved by nine points. Interestingly, you can see that the Pareto frontier accuracy is now 100%, meaning that for each task there is one candidate we generated that solved it. The issue the algorithm faced is how to merge all these candidates and all the information into one prompt that solves everything, and it struggled to do this. So at the end of the day, we improved the LLM as a judge and we improved the accuracy, but it's still quite far from 95% accuracy or something that is really well aligned with human judgment. Obviously, I didn't invest an extreme amount of time in it here, and as I mentioned at the beginning, the quality of the data could also be better. I would guess it works better in other cases; this is really a tricky example. Nevertheless, it actually took quite a number of iterations to reach this LLM as a judge, and I think that's the biggest learning: it's not an
algorithm that you just take and it works from day one, unless it's for toy examples. I wanted to show a little bit at the end which experiments I tried in the beginning that failed, and how I thought about fixing them.
The first thing was using a smaller or older model, like using GPT-4o for both the refiner and the LLM as a judge, and that was a complete failure. Smaller models are really very bad, at least in this example, at being either an LLM as a judge or a refiner. For the LLM as a judge, given all this policy, especially since it has a lot of complicated logic, it just failed and could not improve. I tried other models: I tried mini and nano, and Gemini and DeepSeek. You can see that the best results came from using Gemini for reflection and Grok for the judge, but I would say that using GPT-4 mini for both is also actually quite good and gives good results.
The other thing was how to debug it. What I tried to do from the beginning is not to start sampling and doing big experiments right away, but to first try small iterations, looking at the reasoning of the LLM, looking at the candidates, how they improve and how many improved, and understanding what's happening. That is actually what allowed me to think about improving the refinement prompt and adding some prior there to help it. Basically what I did is stop at the first iteration, find some examples, and then fine-tune that refinement prompt a little bit in Claude Code so that it could improve the candidate. As always in machine learning, what you try to do first is overfit to the training data: not trying to run the whole algorithm, but really trying to find a way to make it work. And I think what we saw with the Pareto frontier reaching 100% is that we almost overfit to the training data, but obviously for the merge there are things I think we
can do to improve.
And the final thing was the iteration on the seed prompt. I actually iterated on multiple seed prompts, and there were two families. One, as you have seen here, did not include any information from the agent prompt, even though we have access to it and the agent prompt does contain the policy. The other had access to the policy: it was basically the prompt that I've shown, plus the policy of the agent, really copy-pasted. Interestingly, the prompt that did not have access to the policy did better. My hypothesis is that if you have access to the agent policy from the beginning, then it's very hard to fine-tune it: you're already stuck in a local minimum that you cannot improve on. But if you don't have access to the policy, and yet you obviously have access to the annotations, which in this case describe all of the policy or a large part of it, then you are able to explore the space of prompts much better.
And finally, the last point is: beware of the cost. I mean, even for these small experiments I've done,
I think they cost like two or three hundred dollars in tokens, especially since the trajectories are long, so there are a lot of input tokens in this case. But the models that are used are also quite expensive, like GPT-4. I tried to play around a little bit with GPT-4, but that ate a lot of money, so I stopped the experiment. Even GPT-4 mini is quite expensive to some degree, and if you go to nano, at least from what I've seen, it doesn't work. If you go to a smaller, cheaper model, it doesn't work. Usually what they say is that you should use a bigger model for the refinement prompt and a smaller model for the LLM as a judge. I think it makes sense, especially if you're running the LLM as a judge against a lot of traces, like in the case of online evaluation. It's obvious that spending money on the optimization is a worthwhile investment to lower the cost in the long term. I think there are a lot of use cases where it works.
So again: first, overfit to the training data. Start with small iterations. Visualize, so instrument the traces (I instrumented it using Agenta in this case), look at them, look at the prompts that have been generated, and understand how the algorithm is working before increasing the sampling. In this case I think we had around 200 to 300 iterations per experiment. In addition to that, there are a number of parameters, like the batch size and so on, that you need to fine-tune to get the algorithm to work.
So that's it. Thanks a lot for watching. I hope this has been helpful and that you'll build good LLM judges that help you improve your applications. I'd love it if you checked out Agenta, our open source LLMOps platform. You can follow me on both LinkedIn and X. And finally, if you're thinking about or working on auto-optimization, on how to optimize prompts, feel free to reach out or to write in the comments on YouTube. Have a great day. Thank you.
TL;DR
- LLM-based judges often fail to correlate with human annotations, leading to unreliable evaluations of AI agents in both development and production environments.
- To address this, LLM judges must be calibrated using human feedback, specifically through prompt optimization techniques like the GEPA algorithm.
- Effective calibration requires context-specific binary metrics, high-quality human annotations with detailed reasoning, and iterative refinement of the judge's prompts, often starting with a well-engineered seed prompt.
Takeaways
- Uncalibrated LLM judges provide misleading signals, hindering the speed of AI agent development and failing to accurately assess real-world performance.
- Calibrated LLM judges are essential for accelerating offline evaluation loops, enabling rapid online monitoring of agent performance, and establishing a "data flywheel" for continuous improvement.
- Design evaluation metrics tailored to the specific business use case, leveraging subject matter experts to identify concrete error types. Binary (yes/no) classifications are generally more effective than granular 1-5 scores for LLM judges to learn.
- High-quality human annotations, especially those including the reasoning behind a judgment, are crucial for the LLM judge to learn complex policies and criteria effectively during optimization.
- The GEPA algorithm optimizes LLM judge prompts by iteratively sampling new candidates (via mutation and merging), evaluating them against mini-batches, and selecting a diverse set of high-performing prompts using a Pareto frontier approach.
- The initial "seed" prompt for the LLM judge is critical; starting with an unbiased assumption (e.g., "assume compliant unless specified") can prevent early biases and improve learning trajectory.
- Customizing the "reflection template" used by GEPA to guide the LLM in analyzing failures and proposing improvements (e.g., by explicitly asking it to learn policy rules) significantly enhances the optimization process.
- Optimization is an iterative process requiring debugging and parameter tuning; begin with small iterations, analyze how candidates improve, and refine prompts or configurations based on early results, as the algorithm isn't "plug-and-play."
Vocabulary
LLM as a judge — An AI model, typically a large language model, used to evaluate the performance or outputs of another AI agent or system.
Calibration — The process of adjusting an LLM judge's behavior to ensure its evaluations align closely with human judgments or predefined criteria.
Prompt optimization — Techniques used to automatically or systematically refine and improve the text prompts given to an LLM to achieve better performance for a specific task.
GEPA algorithm — A prompt optimization algorithm that uses genetic-algorithm-like principles, including prompt mutation, merging, and Pareto frontier selection, to evolve effective prompts.
Agent — An AI system designed to perform tasks by interacting with an environment, often by using tools or following complex policies.
Hallucination — A phenomenon where an LLM generates information that is factually incorrect or unsupported by its input or source material, presenting it as true.
Pareto frontier — In optimization, a set of solutions where no single objective can be improved without sacrificing at least one other objective; used in GEPA to select diverse and high-performing prompts.
Reflection template — A specific prompt structure used within GEPA or similar algorithms that guides an LLM to analyze its own performance or previous prompt failures and propose improvements.
Data flywheel — A concept in AI engineering where the output or observed behavior of an AI system generates new data that is then used to retrain and improve the system, creating a continuous feedback loop.
Policy adherence — The extent to which an AI agent follows predefined rules, guidelines, or business policies in its interactions and decisions.
Transcript
Hello everyone, and welcome to my talk slash workshop, Judge the Judge. Today we're going to talk about LLM as a judge. I'm quite sure you know this scenario: you have an agent in production and someone from the team says, we need to monitor the reliability. So you go to one of the libraries, maybe use the hallucination LLM as a judge, and put it in production within your observability platform. Things look fine, but customers are actually saying that the agent is not working, and when you look at the traces, it's not working. You look under the hood of this hallucination LLM as a judge and you'll find a prompt not very far from this one: you'll be given an LLM output, rate whether it's a hallucination, make no mistakes. Now obviously, how the hell would the agent know whether it's a hallucination? If it could, then your output would have worked from day one.
So today we're going to talk about how we can build calibrated LLM judges that work. Calibrated means calibrated with human annotations, and the way we're going to calibrate them is by using optimization, or prompt optimization specifically. We're going to use GEPA, quite a good algorithm for optimizing prompts.
Now why do we want this? Why do we want calibrated, or good, LLM judges? The first thing is offline evaluation. As you know, usually the way you create a good agent or a good prompt is that you experiment with the prompt, then run your evals and see if it improves things or not. If it does, good. If it does not, you go back and improve the harness or the prompt a little bit, and do it again and again. The speed at which you move to production or add features is actually the speed at which you can complete this loop, and the bottleneck in this loop is the evaluation: how fast can you evaluate?
Obviously the slowest possible evaluation is having a human annotator look at your whole test set and annotate it manually. The quality is quite good, but then each iteration will take a lot of time. You can have faster ones by using an LLM as a judge, but if that LLM as a judge does not correlate with the human annotation, you'll end up with a useless signal, and although the loop will move fast, it won't go anywhere. So having a calibrated LLM as a judge, with a similar quality to, let's say, a human annotator, will make development much faster.
The second thing is online evaluation, like our example from the beginning. If you have online evaluation, you want to see in production whether things are improving or not. Same thing: if you have LLM judges that are calibrated with your business goals, then you can quickly see whether the changes you've made are improving things or not, whether there is some change in the distribution of the data, in how people are interacting with your agent or model, and react rapidly.
And finally, and I would call this the holy grail of AI engineering, you want to build this data flywheel, where you optimize your harness, observe some traces, then add new evals based on these traces, the edge cases, and do it again and again. Here, if you have a way to add new evaluations quickly, obviously automatic evaluations, from the traces, the annotations, the data, you can go through this loop faster and faster, to the point that you can think of it as an automatic loop. You can optimize the harness with optimization techniques like GEPA, which we're going to use here, but you can do the same thing for the evals, and over time your application will improve just with new observations.
So today we're going to build and optimize these LLM judges that are calibrated with human annotations. But before going there, a small intro about myself. My name is Mahmoud, the
co-founder and CEO of Agenta. Agenta is an open source LLMOps platform, providing you all the tools, from observability and prompt management to evaluation, covering the whole life cycle of building reliable agents. My experience is in machine learning: I have more than 15 years of experience in that, and in a previous life I was in academia, where I worked on machine learning applied to computational biology and protein structure prediction. Right now we're working a lot on these sampling and auto-optimization workflows, so if you're interested in that, please reach out. We'd love to have a conversation and also show you what we're building.
So what's the plan for this talk? We're going to work on a practical use case: a customer support agent that we want to evaluate. We are going to build an LLM as a judge that is calibrated with the human annotations for that customer support agent. The plan is to go over the whole process of building this, starting with how we design the metrics and how to think about data curation and labeling, but the main focus will be the part about optimizing the LLM as a judge using GEPA and then, obviously, validating the results. All the code and the data used here can be found on GitHub; you can find them in the links, in this video and on the last slide. So let's start with the dataset.
We're going to use tau-bench (τ-bench). It is a benchmark and a large dataset built by Sierra, a customer support scale-up I think, and they have multiple benchmarks for real-world scenarios for customer support agents. One of them is the airline agent, which we are going to use in this example. Basically what they have is an airline customer support agent that has access to multiple tools to manage reservations, access flight information, and access user information, and it has a quite complex policy to adhere to, like when to change a reservation, when to provide information, and so on and so forth, just like a human customer support agent. The data that we have is the agent itself but, most importantly, 599 generated conversation traces with annotations. The original format of the annotations is assertions, but I post-processed the data so that for each trace we have an annotation like a human annotation. For example, in this case you have the conversation that you see here, and then you have an annotation saying that the agent is not compliant because it approved the cancellation without verifying that the reservation met the airline cancellation rules. So the evaluation failed, and the reason is that the agent cancelled the reservation without verifying something, so it did not adhere to the policy. The data is more or less not very good. It's 62% compliant and 38% non-compliant, and it's generated with multiple models and trials. Overall the problem is quite complex because the policy is quite complex. The data has caveats, honestly; it's not very clean, due to how it was generated, but for our use case
I think it's a very interesting use case to test GEPA and to demo how it would work in a test case.
The workflow that we have is four steps. The first thing is designing the metrics, that is, deciding what the LLM as a judge will measure and what the different axes are that it will look at. The second thing is annotating the data. Then we will optimize the judge and validate the results.
Now the most important thing to take away here is that the metrics need to come from the use case itself. It does not make sense to have general metrics like hallucination when you're evaluating your AI agent. It really depends on the business use case, and the best people to determine these metrics are the subject matter experts. For example, in the case of a customer support agent, you need to have a subject matter expert look at the conversations and provide the feedback. The person who described the workflow to do that best is Hamel, from hamel.dev. I'm going to share his blog and the YouTube video; they describe this idea of error analysis very well. But I'm going to go over it very quickly, and also over the annotation workflow.
So the idea is that you provide your subject matter expert with all these traces, the trajectories of the conversations, and they annotate them, first by commenting on what worked or what did not work,
but then slowly trying to create clusters of the error types: when it's failing, why is it failing? Here I'm showing how it's done in Agenta. I discovered these four error types while going through the traces. Sometimes there are issues with policy adherence, sometimes issues with the response style, sometimes with information delivery (basically the agent is not informing the customer that it made the change, or something like that), and finally some tools have not been called correctly. The idea is that we're going to take these four error types and build four LLM judges for them. It does not make sense to have one LLM as a judge that tries to evaluate all of these. It will make it too complex and very hard to learn, and you will see a little bit later that even when we simplify it, it's still hard to optimize a calibrated LLM as a judge. So it makes a lot of sense to make the metrics we want to evaluate very specific.
The second thing is to move away from one-to-five scores or percentages and instead have a really binary outcome, like whether it adhered to policy or not, obviously with some reasoning. The reason, again, is that it's already quite hard to calibrate an LLM as a judge with a true/false binary classification. Adding another layer and saying, okay, it should be a number between one and five, makes it harder.
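To make the binary-verdict idea concrete, here is a minimal sketch; it is not code from the talk's repository. The names `Verdict`, `JudgeResult` and `agreement_report` are made up for illustration. It represents one judge output as a yes/no verdict plus reasoning, and compares judge verdicts with human labels using accuracy and per-class recall.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    COMPLIANT = "compliant"
    NON_COMPLIANT = "non_compliant"


@dataclass(frozen=True)
class JudgeResult:
    """One binary judgment for one metric (hypothetical shape)."""
    metric: str       # e.g. "policy_adherence", one judge per error type
    verdict: Verdict  # yes/no instead of a 1-5 score
    reasoning: str    # short justification, kept for later error analysis


def agreement_report(judge: list[Verdict], human: list[Verdict]) -> dict[str, float]:
    """Accuracy and per-class recall of judge verdicts against human labels."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(judge, human))
    report = {"accuracy": sum(j == h for j, h in pairs) / len(pairs)}
    for label in Verdict:
        # Recall: of the traces a human labeled `label`, how many the judge matched.
        relevant = [j for j, h in pairs if h == label]
        hits = sum(j == label for j in relevant)
        report[f"recall_{label.value}"] = hits / len(relevant) if relevant else 0.0
    return report


example = JudgeResult(
    metric="policy_adherence",
    verdict=Verdict.NON_COMPLIANT,
    reasoning="Cancelled the reservation without checking the cancellation rules.",
)

# A judge that always answers "compliant" can look acceptable on accuracy alone
# while never catching a violation, which is why per-class recall is reported too.
human = [Verdict.COMPLIANT, Verdict.COMPLIANT, Verdict.NON_COMPLIANT, Verdict.COMPLIANT]
judge = [Verdict.COMPLIANT] * 4
report = agreement_report(judge, human)
```

With only two classes, disagreements between judge and annotator are easy to count and inspect, which is much harder to do with a graded score.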
A one-to-five score is hard even for human annotators: it's hard to get two human annotators to agree on the same score.
Once we have defined these different metrics, annotation starts. Here again I'm using Agenta. You take your traces, create an annotation queue, and specify for your annotators the name of the feedback or the evaluator, policy adherence here. For each trace they should provide whether it adheres to the policy or not, and provide the reasoning. The reasoning here is very important, because without it, as we will see, the optimization algorithm will need to discover by itself why it failed, and that's going to be very hard unless it's a very specific kind of feedback, for example tool failure, where it can infer things. It's very hard to see from a conversation why it did not adhere to policy without someone providing that information. So having that reasoning is very important later for the LLM as a judge to learn. And again, as I've shown previously with the annotation, this reasoning describes why, for example, the agent here is not compliant: because it approved the cancellation before verifying that the reservation met the cancellation rules.
So now that we have the annotations, we can go to the optimization. But before going there I want to add a small note. Although we went very quickly through the first step and the second step, these are actually the hardest parts of the problem. In reality, as every data scientist knows, getting your data is the hardest thing. You need to make sure to look at the data, look at the annotations, and make sure that the distribution is good and that the information within the data is enough for the algorithm to learn a representation of the LLM as a judge that is meaningful. In our case here the data is not that good. The number of traces is
small, the problem is quite complex, and the traces are not very well distributed because of the way they have been generated. Also, the annotations are actually AI-generated, based on the assertions in the original data. You can see a little bit more about how it's done in the repository. So it makes things quite complex for this problem, but it's still quite a good demonstration of how it would work.
So now that we have the annotated data, we can start with the optimization, and for the optimization we are going to use the GEPA algorithm. I'm going to explain first how the GEPA algorithm works, and then we can jump to the Jupyter notebook and start optimizing based on the annotated data. It's very important to understand how the algorithm works, because you will see that in practice you need to play around a little bit with the parameters to get it to work, and it's obviously very hard to play around with the parameters if you don't understand what they do.
The algorithm is similar to a genetic algorithm. The idea is that you start with a seed prompt, and then at each iteration you try to sample new prompts, see which ones work, select the new ones, and improve over time. That's the general shape of the algorithm, and we're going to look at each step. It's basically three steps: you sample new candidates, evaluate them to see which ones work well, do some filtering using this Pareto frontier that I'm going to talk about, and then do it again and again. So let's see how it works.
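The general shape just described (sample a new candidate, evaluate it on a mini-batch, keep it if it helps, repeat until the budget runs out) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the GEPA implementation: `evolve`, `toy_score` and `toy_propose` are hypothetical names, the parent is picked uniformly at random rather than from a Pareto frontier, and the merge strategy is left out.

```python
import random
from typing import Callable, Sequence

ScoreFn = Callable[[str, str], float]            # (prompt, example) -> score in [0, 1]
ProposeFn = Callable[[str, Sequence[str]], str]  # (parent prompt, mini-batch) -> new prompt


def evolve(seed: str, trainset: Sequence[str], score: ScoreFn, propose: ProposeFn,
           budget: int, batch_size: int = 2, rng_seed: int = 0) -> list[str]:
    """Return the pool of accepted prompts, starting from the seed prompt."""
    if not trainset or batch_size < 1:
        raise ValueError("need a non-empty trainset and batch_size >= 1")
    rng = random.Random(rng_seed)
    pool = [seed]
    k = min(batch_size, len(trainset))
    calls = 0  # number of score() calls spent so far
    while calls + 2 * k <= budget:
        parent = rng.choice(pool)              # simplification: uniform choice
        batch = rng.sample(list(trainset), k)  # mini-batch, not the whole set
        child = propose(parent, batch)         # stands in for LLM reflection
        parent_score = sum(score(parent, ex) for ex in batch)
        child_score = sum(score(child, ex) for ex in batch)
        calls += 2 * k
        if child_score > parent_score:         # keep only candidates that improve
            pool.append(child)
    return pool


# Toy stand-ins so the loop runs offline: a prompt "solves" an example
# when it mentions the rule that the example needs.
def toy_score(prompt: str, example: str) -> float:
    return 1.0 if example in prompt else 0.0


def toy_propose(parent: str, batch: Sequence[str]) -> str:
    for example in batch:
        if example not in parent:
            return f"{parent} {example}"
    return parent


rules = ["rule_a", "rule_b", "rule_c"]
pool = evolve("judge:", rules, toy_score, toy_propose, budget=40)
best = max(pool, key=lambda p: sum(toy_score(p, r) for r in rules))
```

In a real run, `score` would call the LLM judge on a trace and compare it with the human label, and `propose` would call a reflection model, so every iteration costs tokens. That is why the budget is counted in evaluation calls.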
So the way it works is that first you start with a seed candidate. In our case we're going to use a very simple LLM as a judge: evaluate whether this customer service agent violated policy, and start by assuming the agent is compliant. In each iteration we are going to sample new candidates from the filtered candidates of the last iteration. In the first one we have only one seed candidate, but in the next iterations we'll have a larger bag of candidates.
GEPA has two strategies to sample new candidates. One is prompt mutation and the other is merging multiple candidates. For prompt mutation, which is what we're going to use in the beginning since we have only one candidate, the idea is that you run the first LLM as a judge against the trajectory, and if it fails, an LLM reflects on it and proposes a new prompt. That is the reflection: we're using the intelligence of the LLM to try to improve the prompt, because it looks at the input, looks at the outputs, looks at the results, and tries to infer how to make it better. The other strategy is the merge strategy, which we're going to use in the next iterations. Here it takes two prompts and puts them together. If you think about it, with an LLM as a judge you usually have these guidelines, and probably what it's going to try to do is take guidelines from prompt A and from prompt B and put them together.
After we've generated a lot of samples, we need to select which ones are good. The way it works is that it evaluates these new prompts against mini-batches of the evals, so not everything, and if they improve the performance compared to the starting point, we select them and they are added to our bag of prompts. Then the next iteration starts, which brings in the other innovation of this algorithm: the idea of the Pareto frontier. Basically the way we select which prompt or
which candidates we're going to use as a seed for the new iteration is not that we select the ones that have the best average score. That would be the trivial solution, right? Look at my prompts, see which ones work best by looking at the average over everything, and select these. Instead, what they do is try to add diversity by looking at the best candidates per task. You have a set of tasks in your evaluation, in our case a set of trajectories, and for each of these trajectories you look at which candidate is the best. That is the Pareto frontier. Then you try to select from these, so that at the end of the day you have a set that covers your whole set of test cases: for each test case there is at least one candidate that solves it. The idea is that you get a good Pareto frontier, then you start merging things, and at the end of the day you have a prompt that solves everything in your training set. Now that you have this filtered set of candidates, you again sample new candidates from them using the mutation and merge strategies, and you keep doing this until the compute budget is used up.
Now there are a lot of libraries that implement this algorithm. I think the best known is DSPy, which popularized the idea of optimizing prompts or harnesses. But now there is also, I think, a new open source library by the authors of GEPA, called GEPA, and last month they implemented a new interface called optimize_anything, which is what we're going to use and which can be used not only to optimize prompts but really to optimize almost any algorithm using this same idea. It's quite powerful. Let me show you how it works.
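The difference between picking the best average and keeping the best candidate per task can be shown with a small sketch. It is a simplified illustration, not the library's selection code: the function names and the score table are invented, and refinements such as pruning dominated candidates are omitted.

```python
def per_task_winners(scores: dict[str, list[float]]) -> list[set[str]]:
    """For each task, the set of candidates that reach the best score on it."""
    if not scores:
        raise ValueError("scores must not be empty")
    n_tasks = len(next(iter(scores.values())))
    if any(len(row) != n_tasks for row in scores.values()):
        raise ValueError("every candidate needs one score per task")
    winners = []
    for task in range(n_tasks):
        best = max(row[task] for row in scores.values())
        winners.append({name for name, row in scores.items() if row[task] == best})
    return winners


def frontier(scores: dict[str, list[float]]) -> set[str]:
    """Candidates that are the best on at least one task."""
    return set().union(*per_task_winners(scores))


def win_counts(scores: dict[str, list[float]]) -> dict[str, int]:
    """How many tasks each candidate wins; usable as sampling weights."""
    counts = {name: 0 for name in scores}
    for winners in per_task_winners(scores):
        for name in winners:
            counts[name] += 1
    return counts


def frontier_score(scores: dict[str, list[float]]) -> float:
    """Average of the per-task best scores across all candidates."""
    n_tasks = len(next(iter(scores.values())))
    return sum(max(row[t] for row in scores.values()) for t in range(n_tasks)) / n_tasks


# Invented scores: 1.0 means the judge prompt agreed with the human label on that trace.
scores = {
    "seed":   [1.0, 1.0, 1.0, 0.0, 0.0],
    "cand_a": [1.0, 1.0, 0.0, 1.0, 0.0],
    "cand_b": [0.0, 0.0, 0.0, 0.0, 1.0],
}
averages = {name: sum(row) / len(row) for name, row in scores.items()}
kept = frontier(scores)
```

Here `cand_b` has the worst average but is the only candidate that solves the last trace, so it stays on the frontier. The frontier score is 1.0 while the best single prompt averages 0.6, which is the same gap the talk reports later between a 100% Pareto front and the accuracy of the final merged prompt.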
So basically the API here is called optimize_anything. You see this function, and what it takes is a seed candidate. The candidate is the configuration that you want to optimize. In our case that would be the LLM-as-a-judge prompt. We can even make it a dict if we want, for example the LLM-as-a-judge prompt plus the temperature, or it could be a chain of prompts, and so on and so forth, so it's not limited. Then we have the evaluator, which is the thing that GEPA uses to optimize. The expectation from the evaluator is that it runs the system, in our case the LLM as a judge parameterized by the candidate, and then logs the score but also the error and the reasoning. You can add as much side information as you want, and you'll see this is how we're going to do it with our optimization of the LLM as a judge. If you remember, we used some kind of reflection and reasoning to improve our prompt, and the input to that reasoning is something that we ourselves build through this evaluator. Then there are options in the configuration, for example how many calls to do per iteration, and the objective, which provides context for the refinement prompt on how to improve, and so on and so forth. But the core flow is actually quite simple.
So now let's jump to the Jupyter notebook and really look into how to do it step by step. You'll find this notebook in the GitHub repository, which is in the links and on the last slide. We start by installing the libraries, so it's dotenv, LiteLLM and GEPA. I'm not installing GEPA here because I did the optimization beforehand; it takes quite some time, I think a couple of hours. So I'm going to skip that step, but obviously if you want to run it within the notebook you should also install it. So we install this and we do our imports.
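As a rough sketch of what such an evaluator could look like for the judge use case, the code below runs a judge with the candidate prompt on one trace, scores it against the human label, and returns the side information a reflection step could read. Everything here is an assumption for illustration: `Trace`, `build_evaluator` and `always_compliant` are hypothetical names, the `(score, side_info)` return shape is a guess rather than the signature that optimize_anything requires, and the judge is a stub instead of a real LLM call.

```python
from typing import Callable, TypedDict


class Trace(TypedDict):
    trajectory: str       # the full agent conversation
    human_label: str      # "compliant" or "non_compliant"
    human_reasoning: str  # the annotator's explanation


# (judge prompt, trajectory) -> (verdict, reasoning); a real one would call an LLM.
JudgeFn = Callable[[str, str], tuple[str, str]]


def build_evaluator(judge: JudgeFn) -> Callable[[str, Trace], tuple[float, dict[str, str]]]:
    """Wrap a judge function into an evaluator for a single candidate prompt."""

    def evaluate(candidate_prompt: str, trace: Trace) -> tuple[float, dict[str, str]]:
        verdict, reasoning = judge(candidate_prompt, trace["trajectory"])
        score = 1.0 if verdict == trace["human_label"] else 0.0
        # Side information: everything the reflection step needs to see
        # in order to understand why the judge was right or wrong.
        side_info = {
            "judge_verdict": verdict,
            "judge_reasoning": reasoning,
            "ground_truth": trace["human_label"],
            "annotation": trace["human_reasoning"],
        }
        return score, side_info

    return evaluate


def always_compliant(prompt: str, trajectory: str) -> tuple[str, str]:
    """Stub judge so the sketch runs offline."""
    return "compliant", "No rule in the prompt was violated."


evaluate = build_evaluator(always_compliant)
trace: Trace = {
    "trajectory": "User asks to cancel. Agent cancels the reservation immediately.",
    "human_label": "non_compliant",
    "human_reasoning": "Approved the cancellation without verifying the cancellation rules.",
}
score, side_info = evaluate("Assume the agent is compliant unless a rule is violated.", trace)
```

Passing the annotation through as side information is the important part: without it, the reflection step only sees that the judge was wrong, not which policy rule it missed.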
We have a couple of functions that I extracted outside the notebook; we're going to look at them later. We start by loading the data. As I mentioned, we're using the data from tau-bench. I just pre-processed it at the beginning to change the assertions so that they look like the annotation I showed in the presentation. I've already split the data into a training and a validation dataset, one which we'll use with GEPA and a second one to validate the results at the end. The split is based on the different tasks that are defined in tau-bench. If we look here, you have a training set with 480 traces, around two-thirds of which are compliant, and a validation set with 112 traces. As I mentioned, the data here is not very nice because there are some redundancies: sometimes the same task is run with the same model multiple times. But there are no redundancies between the training and the validation set.
So we look here at what the annotations look like. We have an example of a compliant annotation and a non-compliant annotation: a trajectory that adheres to the policy that we want the LLM as a judge to learn, and another one that does not. We can see here that it describes why: it's compliant because the agent correctly identified the basic economy reservation, while here it did not identify the user's membership as regular. This annotation is actually quite important for the LLM as a judge to learn the policy, especially in this case, where the policy is a very complex system that the LLM as a judge needs to learn. Without information about why something is correct or not correct, it would be quite impossible for the GEPA algorithm to reach a good LLM as a judge. I mean, it would
be the same for a human, right? If you gave me all these trajectories and told me this one is correct and this one does not conform to policy, but did not give me more information, it would be very hard for me to make a judgment and learn how to assess the policies. So having this information, and the quality of the annotation, the quality of the data, is really paramount to being able to learn this. And again, obviously this is a fairly complex LLM as a judge to learn.
So the first thing we start with is a naive judge. This is the seed judge we start with, and it's something that I actually engineered. I'm going to talk a little bit later, with the learnings, about how exactly we reached this. You can see here that it evaluates whether the customer service agent violated policy, and it says that you should start by assuming that the agent is compliant and only change to non-compliant if there is a specific reason. I mean, here we are just starting with the LLM as a judge, which means that the seed judge, the initial judge, should in my opinion start by saying everything is all right. If I don't have any rules, it should say that everything is all right. In another experiment, when I started, I created an LLM as a judge that just says, okay, you should check whether the agent violated policy, and what you end up with is the LLM with its own biases trying to make that decision by itself. It says this violates, this doesn't violate, without having any information. So without telling it at the beginning that it should start by assuming the agent is compliant, I mean, if you don't have any reason to believe it's not compliant,
it should be compliant, then you basically start with some kind of random LLM as a judge that would be very hard to fix later, unless you end up with a prompt that discovers this idea of starting from compliant. So I discovered that the initial seed is actually very important in this case. There might be simpler scenarios where you don't need this, but in this case it's quite important.
If we take this initial LLM as a judge and run it on the validation set, we find 61% accuracy. But if we look at the bias, most of the time it's saying compliant, which is actually what we want. I would say that saying compliant 98% of the time is actually the unbiased thing to do, the logical place to start. So we run the metrics at the beginning: the accuracy is 61%, the recall for compliant is 98%, and the recall for non-compliant is very low. As I mentioned, I think this is quite all right. It's biased towards compliance. I've had experiments in the beginning where it was almost random, but then it doesn't learn at all. We can look into where it goes wrong by looking at the cases where it does not work, and we see that it says compliant when the trace is not compliant, because it doesn't know the policy.
So here starts the main code for optimizing the judge with GEPA. What you will see is that I actually wrote a reflection template, the prompt that GEPA uses to reflect and to sample new candidates. I did not use their default; I wrote one myself. I tried at the beginning to use the default prompt in GEPA, but the results were not as good as I expected; it was very hard for it to learn. So what I tried to do is provide a bias, a prior, within the reflection template.
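The talk describes the custom template as one that shows the reflection model the judge verdict, the ground truth and the annotation for each failure, and that explicitly asks for policy rules. A hedged sketch of that idea follows. The wording of `REFLECTION_TEMPLATE` and the helper `render_reflection_prompt` are invented for illustration; the actual template lives in the talk's repository, and how the library accepts a custom template is not shown here.

```python
# Hypothetical template text; only its ingredients come from the talk.
REFLECTION_TEMPLATE = """\
You are improving the prompt of a judge that reads airline customer service
conversations and decides whether the agent complied with policy.

Current judge prompt:
{current_prompt}

Cases the judge got wrong:
{failures}

Rewrite the judge prompt. Turn what the annotations reveal into explicit
policy rules. You may add rules or restructure existing ones. Prefer clear
wording. Return only the new prompt.
"""


def render_reflection_prompt(current_prompt: str, failures: list[dict[str, str]]) -> str:
    """Fill the template with the current prompt and the failed cases."""
    blocks = []
    for index, failure in enumerate(failures, start=1):
        blocks.append(
            f"Case {index}\n"
            f"  judge verdict: {failure['judge_verdict']}\n"
            f"  ground truth: {failure['ground_truth']}\n"
            f"  annotation: {failure['annotation']}"
        )
    return REFLECTION_TEMPLATE.format(
        current_prompt=current_prompt,
        failures="\n\n".join(blocks),
    )


prompt = render_reflection_prompt(
    "Assume the agent is compliant unless a rule below is violated.",
    [{
        "judge_verdict": "compliant",
        "ground_truth": "non_compliant",
        "annotation": "Approved the cancellation without verifying the cancellation rules.",
    }],
)
```

The prior sits in the instruction block: instead of asking for a generally better prompt, it tells the reflection model that the goal is to recover the policy from the annotations.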
I mention, obviously, that the judge is reading an airline customer service conversation and needs to decide, which is the basics. But then you can look at more information. For example, the annotation that this reflection template sees also includes the judge verdict and the ground truth, the human annotation. This is important information that the reflection template should look into and improve on. Basically, I explain to the LLM how to do this: you can add rules, restructure existing ones, or reword things for clarity. And I tell it that the reflection template should basically create real policy rules, right? It should find the right policies. I think adding that was very important for improving the quality. Otherwise, the default reflection template did not understand that it should try to learn the policy, more or less, with the LLM as a judge.

So that's the second thing I changed, and honestly I only worked on it a little bit. Then we run the optimization. Run optimization is basically a wrapper around optimize_anything. It just parameterizes it, since I ran multiple experiments to find the right one. It's actually going to be part of the GitHub repository, and it's something that you can play around with and start with when you're exploring the design space of GEPA. You can see here that it basically builds the configuration based on the parameters, the kwargs, and then calls optimize_anything with these. Then there is the make evaluator.
It basically calls the LLM as a judge and adds all the side information. So it doesn't only provide the trajectory, but also the annotation, which is quite important. And then we run this. As I mentioned, it takes around an hour to run.

You can see the results here. This is the optimized rubric, and you can see that, compared to the default where we started, it learned part of the policy criteria, like flight cancellation and refunds, flight modification, how to communicate, and so on and so forth. Now if we look at the results, using evaluate rubric to evaluate it, we see that the accuracy increased from 69% to 74%, and we removed the bias. What especially changes is the recall and the precision for non-compliant, which were basically zero in the beginning. Now the LLM as a judge has less bias: it's at 64% compared to 98%, and it has really learned parts of the policy.

So looking at the results, as I mentioned, for the validation set we had quite a lot of improvement, like 14%, and the training accuracy improved by nine points. You can also see that, interestingly, the Pareto front accuracy here is now 100%, meaning that for each task there is one candidate that we generated that solved it. The issue the algorithm faced is how to merge all these candidates and all this information into one prompt that solves everything, and it struggled to do this.

So at the end of the day, we improved the LLM as a judge and we improved the accuracy, but it's still quite far from 95% accuracy, or from something that is really well aligned with human judgment. Obviously I didn't invest an extreme amount of time in it, and as I mentioned at the beginning, the quality of the data matters too; it would be better, I would guess, in other cases. It's really a tricky example. But nevertheless, it actually took quite a number of iterations to reach this LLM as a judge. I think that's the biggest learning: it's not an
algorithm that you just take and it works from day one, except for toy examples. So I wanted to show a little bit, at the end, what experiments I tried in the beginning that failed, and how I thought about fixing them.

The first thing was using a smaller or older model, like GPT-4o, for both the refiner and the LLM as a judge, and that was a complete failure. Smaller models are really very bad, at least in this example, at being either an LLM as a judge or a refiner. For the LLM as a judge, given all this policy, especially since it has a lot of complicated logic, it just failed and could not improve. I tried other models: mini and nano, Gemini, and DeepSeek. You can see that the best results come from using Gemini for reflection and Grok for the judge, but I would say that using GPT-4 mini for both is also quite good and gives quite good results.

The other thing was how to debug it. What I tried to do from the beginning was to not start by sampling and running big experiments, but to first try small iterations: looking at the reasoning of the LLM, looking at the candidates, how they improve, how many improved, and understanding what's happening. That is actually what allowed me to think about improving the refinement prompt and adding some prior there to help it. Basically, I stopped at the first iteration, found some examples, and then fine-tuned that refinement prompt a little bit in Claude Code so that it could improve the candidates. And as always in machine learning, what you try to do first is to overfit to the training data: not trying to run the whole algorithm, but really trying to find a way so that it works. I think what we saw with the Pareto front reaching 100% is that we almost overfit to the training data, but obviously for the merge there are things I think we
can do to improve.

And the final thing was the iteration on the seed prompt. There I actually iterated on multiple seed prompts, and there were two families. One, as you have seen here, did not include any information from the agent prompt. We do have access to that, right, and the agent prompt does have access to the policy. The other family had access to the policy: basically it was the prompt that I've shown, but with the policy of the agent really copy-pasted in. And interestingly, the prompt that did not have access to the policy did better. My hypothesis is that if you have access to the agent policy from the beginning, then it's very hard to find anything new; you're already stuck in a local minimum that you cannot improve on. But if you don't have access to the policy, and yet you obviously have access to the annotations, which in this case describe all of the policy or a large part of it, then you are able to explore the space of prompts much better.

And finally, the last point is: beware of the cost. I mean, even these small experiments I've done,
I think they cost something like two to three hundred dollars in tokens, especially since the trajectories are long, so there are a lot of input tokens in this case. But the models that are used are also quite expensive, right? GPT-4, I mean, I tried to play around with GPT-4 a little bit, but that ate a lot of money, so I stopped the experiment. But even GPT-4 mini is quite expensive to some degree, and then if you go to nano, at least from what I've seen, it doesn't work. If you go to a smaller, cheaper model, it doesn't work. Usually what they say is that you should use a bigger model for the refinement prompt and a smaller model for the LLM as a judge. I think that makes sense, especially if you're running the LLM as a judge against a lot of traces, like in the case of online evaluation. There it's obviously a worthwhile investment to spend money on the optimization to lower the cost in the long term. I think there are a lot of use cases where it works.

So again: first, overfit to the training data. Start with small iterations. Visualize, so basically instrument the traces; I instrumented them using Agenta in this case. Try to look at them, try to look at the prompts that have been generated, and understand how the algorithm is working before increasing the sampling. In this case, I think we had around 200 to 300 iterations per experiment. In addition to that, there are actually a number of parameters, like the batch size and so on, that you need to fine-tune to get the algorithm to work.

So that's it. Thanks a lot for watching. I hope this has been helpful and that you'll build a good LLM as a judge that helps you improve your applications. I'd love it if you checked out Agenta, our open-source LLMOps platform, and you can follow me on both LinkedIn and X. Finally, if you're thinking about and working on auto-optimization, on how to optimize prompts, feel free to reach out or to write in the comments on YouTube. Have a great day. Thank you.
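
One point from the results is easy to misread: a Pareto front accuracy of 100% only says that every example is solved by at least one candidate, not that any single prompt solves them all. The following is a minimal sketch of that difference. The candidate names and scores are made up for illustration, and this is not GEPA's own code, just the arithmetic behind the two numbers.

```python
from typing import Dict, List, Tuple

# Hypothetical agreement table: 1 means the candidate judge matched the human
# annotation on that example, 0 means it did not. Values are illustrative only.
SCORES: Dict[str, List[int]] = {
    "seed":        [1, 1, 0, 0, 0],
    "candidate_a": [1, 0, 1, 1, 0],
    "candidate_b": [0, 1, 1, 0, 1],
}


def accuracy(per_example: List[int]) -> float:
    """Fraction of examples where one candidate agrees with the human label."""
    return sum(per_example) / len(per_example)


def best_single(scores: Dict[str, List[int]]) -> Tuple[str, float]:
    """The single candidate with the highest accuracy (first one wins ties)."""
    name = max(scores, key=lambda n: accuracy(scores[n]))
    return name, accuracy(scores[name])


def front_coverage(scores: Dict[str, List[int]]) -> float:
    """Fraction of examples solved by at least one candidate.

    This is the "front accuracy" in the sense used in the talk: for each
    example, take the best score any candidate achieved on it.
    """
    n_examples = len(next(iter(scores.values())))
    solved = [max(s[i] for s in scores.values()) for i in range(n_examples)]
    return sum(solved) / n_examples


if __name__ == "__main__":
    print("best single candidate:", best_single(SCORES))
    print("front coverage:", front_coverage(SCORES))
```

With these made-up scores, the best single candidate agrees with the human on 3 of 5 examples while the front covers all 5. That gap is what the merge step has to close, and it is why a perfect front does not imply a perfect final judge.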