
Build a Prompt Learning Loop - SallyAnn DeLucia & Fuad Ali, Arize

TL;DR

  • Agents frequently fail not because the models are weak, but because of weak instructions, a weak environment, static or missing planning, missing tools, or poor context engineering.
  • Prompt learning, especially when leveraging detailed human explanations and evaluative reasoning, is a highly effective way to optimize agent performance.
  • It is crucial not only to optimize agent prompts but also to invest in "eval engineering" so that your evaluation prompts are reliable and high quality.

Takeaways

  • Identify common agent failure points: These often include weak system instructions, static or absent planning, insufficient tools/tool guides, and inadequate context engineering.
  • Prioritize prompt optimization as a "lowest-lift" approach: Significant performance gains (e.g., 18% in coding agents) can be achieved by refining system prompts with clear rules and instructions, often without changing models, tools, or architecture.
  • Incorporate detailed feedback for prompt learning: Move beyond simple score-based feedback by using human explanations and LLM-as-a-judge reasoning that specifies why an agent failed and how to improve.
  • Embrace "expertise" over "overfitting" for agents: Unlike general models, agents benefit from specializing and becoming highly proficient on their specific domain or codebase, which should be viewed as expertise rather than a flaw.
  • Implement co-evolving optimization loops: Continuously improve agent prompts by collecting and annotating failures, and simultaneously refine evaluation prompts by assessing evaluator confidence and accuracy.
  • Design evaluation prompts (evals) with the same rigor as agent prompts: Ensure your evals are reliable and high-quality, as their accuracy directly impacts the effectiveness of your prompt optimization.
  • Define success criteria with stakeholders: When creating evals for less quantifiable tasks, involve domain experts, security, and leadership to define what success looks like, then convert those criteria into concrete evaluations.

Vocabulary

  • Agent: An autonomous entity, typically powered by an LLM, designed to perform tasks by interacting with an environment or tools.
  • Prompt Learning: A method for optimizing an agent's performance by refining its prompts using detailed feedback, often including human explanations and evaluative reasoning.
  • System Prompt: The initial set of instructions or context given to an LLM to define its role, behavior, and constraints for a specific task.
  • Context Engineering: The practice of designing and providing relevant background information or data to an LLM to improve its understanding and performance on a task.
  • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by taking actions in an environment to maximize a cumulative reward.
  • Meta Prompting: An approach where an LLM is asked to improve a prompt, often based on numerical scores or outcomes, but typically without detailed textual explanations.
  • Evals: Short for "evaluations" or "evaluators"; mechanisms, often LLM-based, used to assess the quality, correctness, or adherence of an agent's output to specific criteria.
  • LLM as a Judge: The technique of using a large language model to evaluate the output of another LLM or agent, often providing explanations for its judgments in addition to a score.
  • Static Planning: A predefined and unchanging sequence of actions or steps for an agent, which lacks adaptability to dynamic or unexpected situations.
  • Prompt Optimization: The process of systematically improving the effectiveness of prompts to achieve better performance from an LLM or agent.

Transcript

I'm going to get started here. Thanks so much for joining us today. I'm SallyAnn, Director of Product at Arize. I'm going to be walking you through prompt learning, and we're actually going to be building a prompt optimization loop. For some background, I started off in data science before I made my way over to product. I do like to still be touching code today. One of my favorite projects that I work on is building our own agent into our platform, so I'm very familiar with all of the pain points and how important it is to optimize your prompts. I'm going to spend a little time on slides, just to set the scene so everybody here has context on what we're going to be doing, and then we'll jump into the code. Fuad, I'll let you do a little bit of an intro.

Yeah. Thank you so much, SallyAnn. Great to meet all of you. Excited to be walking through prompt learning with you all. I know some of you got a chance to see [inaudible] yesterday, and hopefully that gave you some good background on how powerful prompting and prompt learning can be. My name is Fuad. I'm a product manager here at Arize as well. Like SallyAnn said, we like to stay in code, so we'll do a few slides, then we'll walk through the code, and we'll be floating around to help with debugging and things like that. My background is also technical: I was a backend distributed systems engineer for a long time, so I'm no stranger to how important observability infrastructure really is, and I think AWS is an appropriate setting for that. So yeah, excited to dive deep into prompt learning with you all. Thank you.

All right, here's the agenda, just a few of the things I'm going to be covering. We'll talk about why agents fail today and what prompt learning even is. I want to go through a case study and show you all why this actually works. We'll talk about how prompt learning compares with GEPA.
I had a few people come up to me over the conference asking, what about GEPA? We have some benchmarks and opinions on that, and then we'll hop into our workshop. But first, a question to start with: how many people are building agents today? Okay, that's about what I expected. And how many people actually feel like the agents they're building are reliable? Yeah, that's what I thought. So let's talk a little bit about why agents fail today.

So why do they fail? There's a common set of reasons we're seeing with a lot of our customers, and even internally as we build Alyx, for why agents are breaking. A lot of the time it's not because the models are weak; it's that the environment and the instructions are weak. So: having no instructions learned from the environment, and no planning or very static planning. I feel like a lot of agents right now don't have planning. We do have some good examples of planning, like Claude Code and Cursor. Those are really good examples, but I'm not seeing it make its way into every agent that I come across. Missing tools is a big one. Sometimes you just don't have the tool set you need, or you're missing tool guidance on which of the tools the agent should be picking. And context engineering continues to be a big struggle for folks.

If I were to distill this, I think it comes down to three core issues. First, adaptability and self-learning: no system instructions learned from the environment. Second, the balance of determinism versus non-determinism: having no planning versus doing very static planning. You want to have some flexibility there. And third, context engineering. I think it's a term that just emerged in the last six to eight months, but it's really, really important. We're finding missing tools, missing tool guidance, not having context on your data, and not giving the LLM enough context.
So these are the core issues, distilled, but I think there's one other pretty important thing, and that is the distribution of who's responsible for what. There are the technical users: your AI engineers, your data scientists, your developers. They're really responsible for the code, automation, and pipelines, and for actually managing [inaudible] and costs. Then we have our domain experts, subject matter experts, and product managers. These are the ones who actually know what the user experience should be. They're probably super familiar with the principles that we're actually trying to build into our applications, they're tracking our evals, and they're really trying to ensure that the product is a success. So there's a split in responsibilities. Everybody is contributing, but there's this difference, and sometimes it's a difference in technical ability. So with prompt learning, it's going to be a combination of all these things, where everybody really needs to be involved, and we can talk about that a little bit more.

So what even is prompt learning? I'll first go through some of the approaches that we borrowed from when coming up with prompt learning. This is something that Arize has been really dedicated to doing research on. One of the first things we borrowed from is reinforcement learning. How many folks here are familiar with how reinforcement learning works? All right, cool. So, to use a really silly analogy for reinforcement learning: think of the model as a student's brain that we're trying to boost up. The student is going to take an action, which might be something like taking a test or an exam, and there's going to be a score. A teacher is going to come through and actually score the exam, and that's going to produce a scalar reward.
And, you know, pretend the student has an algorithm in their brain that can just take those scores and update the weights in their brain. That's the learning behavior, and then we repeat the process. So in reinforcement learning, we're updating weights based on those scalars. But it's actually really difficult to update the weights, especially in the LLM world. So reinforcement learning isn't going to work that well when we're doing things like prompting.

Then there's meta prompting, which is very close to what we do with prompt learning, but still not quite right. With meta prompting, we're asking an LLM to improve the prompt. Using that same student example: we have an agent, which is our student, and it's going to produce some kind of output, like a user asking a question and getting an answer. That's our test in this example. Then we're going to score it. An eval is pretty much what you can think of there; it's going to output a score. From there, we have the meta prompt. So now the teacher is the meta prompt, which takes the results from our scorer and updates the prompt based on them. But that's still not quite what we want to do.

That's where we introduce this idea of prompt learning. Prompt learning is going to take the data and produce an output, and we're going to have our LLM evals on there, but there's also a really important piece, which is the English feedback. Which answers were wrong? Why were the answers wrong? Where does the student actually need to study? It's really pinpointing those issues. We're still using that meta prompt; we're still asking an LLM to improve the prompt. It's just that the information we're giving that LLM is quite different. So we're going to update the prompt there with all of this feedback.
That feedback comes from our evals and from a subject matter expert going in and labeling, and we use it to boost our prompt with better instructions and sometimes examples.

So this is the traditional prompt optimization picture, where we treat it like an ML problem: we have our data and we have the prompt, and we're saying, optimize this prompt and maximize our prediction of the labels. But that doesn't quite work for LLMs; we're missing a lot of context. What we've really found is that the human instructions on why something failed matter. Imagine you have your application data, your traces, a dataset, whatever it is, and your subject matter expert goes in. They're not only annotating correct or incorrect. They're saying why it's wrong: it failed to adhere to this key instruction, it didn't adhere to the context, it's missing something, whatever it is. You can also have eval explanations from an LLM as a judge, which is the same principle: instead of just the label, it provides the reasoning behind the label. And then we're pointing at the exact instructions to change. We're changing the system prompt to help it improve, so that we get prediction labels, but we also get those evals and explanations along with them. So we're optimizing with more than just our output here.

I think the really key learning we've had is that explanations, whether in human instructions or from your LLM as a judge, that text is really, really valuable. I think that's what we see not being utilized in a lot of other prompt optimization approaches. They're optimizing for a score, or they're just paying attention to the output. But think of it this way: these LLMs are operating in the text domain. We have all this rich text that tells us exactly what the agent needs to do to improve. Why wouldn't we use that to actually improve our prompt? So that's the basics of prompt learning.
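To make the difference from score-only meta prompting concrete, here is a minimal sketch of how the text handed to the optimizing LLM might be assembled. The `Example` fields, the `build_optimizer_prompt` helper, and the wording of the instructions are illustrative assumptions, not the API of the prompt learning repo used later in the workshop.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One annotated agent run (all field names are hypothetical)."""
    user_input: str
    agent_output: str
    label: str        # e.g. "correct" or "incorrect"
    explanation: str  # free-text reason from a human or an LLM judge


def build_optimizer_prompt(current_prompt: str, examples: list[Example]) -> str:
    """Assemble the text sent to the optimizing LLM.

    Score-only meta prompting would pass just the labels; here the
    explanations are included so the optimizer can see why runs failed.
    """
    blocks = []
    for i, ex in enumerate(examples, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Input: {ex.user_input}\n"
            f"Output: {ex.agent_output}\n"
            f"Label: {ex.label}\n"
            f"Explanation: {ex.explanation}"
        )
    feedback = "\n\n".join(blocks)
    return (
        "You are improving a system prompt.\n\n"
        f"CURRENT PROMPT:\n{current_prompt}\n\n"
        f"ANNOTATED EXAMPLES:\n{feedback}\n\n"
        "Rewrite the prompt, adding or editing rules so that the failures "
        "described in the explanations would not recur. "
        "Return only the new prompt."
    )


examples = [
    Example(
        user_input="Generate a JSON description of a landing page.",
        agent_output='{"title": "Home"}',
        label="incorrect",
        explanation="Missing the required 'sections' key.",
    )
]
optimizer_prompt = build_optimizer_prompt("You are a web page generator.", examples)
print(optimizer_prompt)
```

The resulting string would then be sent to whichever LLM performs the optimization; that call is left out so the sketch stays independent of any provider.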
But everybody comes up to me and says, that sounds great, but does it actually work in practice? We have some examples. We did a little bit of a case study. Coding agents: everybody is pretty much using them at this point. There are quite a few that have been really successful. I think Claude Code is a great example, and Cursor. But there's also Cline, which is more of an open-source version of this. So we decided to take a look and compare, to see if we could do anything to improve it. This is the baseline of where we started. You can see the difference between the different models. Claude Code is obviously using [inaudible], the state of the art there. But we also saw an opportunity: with [inaudible], it was at around 30% versus 40%. [inaudible] So this is where we started.

Then we took a pass at optimizing the system prompt. You can see what the old one looked like. It had no rules section. It was just: you are an agent, you're built on this model, you're here to do coding. But there were no rules. So we took a pass at updating the system prompt, and it added all of these different rules: when dealing with errors or exceptions, handle them in a specific way; make sure that changes align with the system's design; changes need to be accompanied by tests. So really just building in the rules that a good engineer would have, which were completely missing before.

So these are the rules in our updated system prompt. Pretty simple; that's the whole concept here. You can see these different problems, and we're seeing things that were incorrect now being done correctly, just by adding more instructions. It demonstrates pretty well how much those system prompts can improve. And we benchmarked again with SWE-bench Lite.
So again, that's a coding benchmark for these coding agents, and we were able to improve by 18% just through the addition of rules. I think that's pretty powerful. So no fine-tuning, no tool changes, no architecture changes. I think those are the big things folks reach for when they're trying to improve their agents. But sometimes it's just about your system prompt, and just adding rules. We've really seen that, and that's why we're really passionate about prompt learning and prompt optimization in general: it's still the lowest-lift way to get massive improvement gains in your agents. With the optimized prompt, performance nears Sonnet 4.5, which is pretty much considered, right now, the state of the art when it comes to coding, and at two-thirds of the cost, which is always a benefit. These are some of the tables here; we'll definitely distribute this so you can take a closer look. But the main point I want y'all to come away with is that 15% is a pretty powerful improvement in performance.

Now, a question I get all the time is about the examples we're using for prompt learning. So how does this really go? We take a dataset. A lot of the time that dataset is going to be a set of examples that didn't perform well: either a human went in, labeled them, and found that they were incorrect, or you have evals that are labeling them incorrect. So you've got all these examples, and that's what we're going to use to optimize the prompt. And the question is, well, aren't you going to overfit based off of these bad examples? But there's this rule of generalization, where prompt learning enforces high-level, reusable coding standards rather than repo-specific fixes. And we are doing a train/test split to ensure that the rules generalize beyond just the local quirks of whatever our training dataset is.
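The train/test split mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: `train_test_split`, `accuracy`, and `toy_grade` are hypothetical helpers, and the grader stands in for a real agent run plus eval, so the numbers say nothing about real benchmarks.

```python
import random


def train_test_split(rows, train_fraction=0.5, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(prompt, rows, grade):
    """Fraction of rows that `grade(prompt, row)` marks as passing."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if grade(prompt, row)) / len(rows)


def toy_grade(prompt, row):
    # Stand-in for "run the agent, then run the eval": a row passes only
    # if the prompt contains the rule that the row depends on.
    return row["needs_rule"] in prompt


rows = [
    {"id": i, "needs_rule": rule}
    for i, rule in enumerate(["handle errors", "add tests"] * 5)
]
train, test = train_test_split(rows, train_fraction=0.7, seed=42)

baseline = "You are a coding agent."
candidate = baseline + " Rules: handle errors; add tests."

# Rules are learned from `train` only; `test` checks that they carry over
# to rows the optimizer never saw.
print(accuracy(baseline, test, toy_grade), accuracy(candidate, test, toy_grade))
```

Holding out the test rows is what lets you tell general rules apart from fixes that only patch the specific failures in the training set.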
But think of it like hiring an engineer at your company: you do want them to overfit to the codebase that they're working on. So we feel that a better term for overfitting here is expertise. We're not training in the traditional sense; we're trying to build expertise. And this is not something that you do once. You're actually going to be continuously running this. More problems are going to come up, and we're going to optimize our prompt for what the application is seeing now. So we don't actually think it's a flaw. We feel like it's expertise instead. We can adapt as needed, mirroring what humans would do if they were taking on a task themselves.

This is just another set of benchmarks, again proving this out. This is a diverse evaluation suite focused on tasks that are difficult for large language models, and we're again seeing success with our improvements.

Now, GEPA just came out recently, and I think that's something everybody's really excited about. I think the previous DSPy optimizers were a little bit more focused on optimizing a metric, and as we talked about, we really want to be using the text modality that these applications are working in, which has a lot of the reasons for how we need to improve. So we definitely wanted to do some benchmarking here. How many of you are familiar with GEPA or have tried it out? All right, cool. I'll just stay a little high-level. The main difference between GEPA and the other DSPy optimizers is that it's actually using reflection on the evaluations while it does the optimization. It's an evolutionary optimization, with Pareto-based candidate selection and probabilistic merging of prompts.
What this really does is take candidate prompts and evaluate them. Then there's a reflection LLM that reviews the evaluations, makes some new changes, and repeats until it feels like it has the right set of prompts. Something that's important to note is that it doesn't just choose a single winner; it tries to keep the top candidates and does the merging from there. But we benchmarked it, and prompt learning actually does do a little bit of a better job, and I think something that's really key is that it does it in a lower number of loops.

Something that we'll talk about in just a second is that it does actually matter what your evals look like and how reliable they are. That's something we really feel strongly about at Arize: you definitely want to be optimizing your agent prompts, but I think a lot of people forget that you should also be optimizing your eval prompts, because if you're using evals as a signal, you can't really rely on them if you don't feel confident in them. So it's important to invest there, applying the same principles to your eval prompts as to your agent prompts, so that you have a really reliable signal that you can trust and then feed into your prompt optimization.

In both of these graphs, the pink line is prompt learning. We also benchmarked against MIPRO, the older DSPy optimization technique that I was mentioning, which is focused on optimizing around a score. And evals really did make the difference. I highlighted it on this slide here, in the bullet: with eval engineering, we were able to do this. We had to make sure that the evals we were using as part of prompt learning were really high quality, because again, this only works as well as the evals themselves. So evals make all the difference. Sometimes that means optimizing an eval prompt too.
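The idea of keeping several top candidates rather than a single winner can be illustrated with a small Pareto-front filter. This is only a sketch of the selection idea as described in the talk, not GEPA's actual implementation; the candidate names and per-example scores are made up.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good as `b` on every example
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, list[float]]) -> list[str]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-example eval scores for three candidate prompts.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],
    "prompt_b": [0.0, 1.0, 1.0],
    "prompt_c": [0.0, 0.0, 1.0],
}
print(pareto_front(scores))
```

Here `prompt_a` and `prompt_b` both survive because each wins on an example the other loses, while `prompt_c` is dropped because `prompt_a` is never worse and sometimes better. Candidates on the front are the ones a merging step would then draw from.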
Again, it's all about making sure you have proper instructions; the same rules apply. So I wanted to walk through all of that. It's a lot of content, but I think it's really important to have the context. Before we jump into the workshop, are there any questions I can answer about what I've discussed so far?

I have a question, or a general comment. I think coding is the greatest example in terms of having structure in evals. One thing I'm curious about is whether you have other examples, more general prompts for interactions with systems that are not as easily quantifiable. I'm curious what experience you have there.

Yeah, is that for evals or just the prompt in general?

Well, I think it's clear how you would set up what the evals would look like for coding. I'm just wondering how you would do that for other types of things.

So the question is whether there's any guidance on how to set up your evals. Coding seems like a very straightforward example: you want to make sure the code's correct, right? But for some of these other agent tasks it's a little bit harder. The advice I usually give folks is that we do have a set of out-of-the-box evals. You can always start with things like QA correctness, or focus on the task. But what I always suggest is getting all the stakeholders in the room: your subject matter experts, security, leadership. Really define what success would look like, and then start converting that into different evaluations. As an example, for Alyx I have some task-level evaluations. I really care: did it find the right data that it should have?
Did it create a filter using semantic search or structured search, that is, did it make the right tool call? Then I care whether it called things in the right order and whether the plan was correct. So it's thinking about what each step was. And even security will say, well, we care how often people are trying to jailbreak Alyx. So it's taking each of those success criteria and converting them into evals, and we do have different tools that can help you. But that's usually the framework I give folks: start with defining success, and then worry about converting it into evals.

Yeah, and just to add to that, for more subjective use cases: for example, Booking.com is one of our clients. When they ask what a good posting for a property is, or what a good picture is, defining that is really hard, right? To you, something might look like a very attractive posting for a hotel, but to someone else it might look really different. Sometimes it's sufficient to just grade it as good or bad and then go from there. So: is this a good picture or a bad picture? Let an LLM decide, and then go from there into specific feedback, like, oh, this was dimly lit, the layout of the room is different, et cetera.

Yeah, thanks. That was actually what I was going to ask: that gives you a binary outcome, which doesn't necessarily give you a great gradient to advance on. Are you then effectively using those questions, like whether it's dimly lit, to get a more continuous space?

Right, exactly. And from there, as you get more signal, you can refine your evaluator further and further and then use those grades. And you can actually put a lot of that in your prompt itself, right?

Okay. I have two questions. I'm not sure if I should ask both of them, or maybe your workshop will answer them.
One is about rules and the rules section, or operating procedures. I'm curious how you do that: do you just continuously refine it in English, and try to reduce the friction of any contradictory rules? That's the first question. The other is that I would love to see the slide on evals again, and to hear a little bit more about how you approach that. My issue in doing this work is whether to have a simulator of the product, with the simulator doing the evaluating, or to do what I'd like to do, which is an end evaluation that I build. I would love to hear you talk about that if you could.

Yeah, absolutely. On the first one, about the instructions: I think you iterate on them over time. A lot of times we take our best shot and write them by hand, right? What we're trying to do with prompt optimization is let the data dynamically change them. LLMs are great at removing redundant instructions, things like that. But the goal is to move away from static instructions. We feel very confident that static instructions are not going to scale and are not going to lead to sustainable performance. The idea with prompt learning is that it's something you run over time. We even see this as a long-running task eventually, where you're building up examples of incorrect things, maybe having a human annotate them, and the task is always running, producing optimized prompts that you can then pull into production. It's a cycle that repeats over time.

Sorry, just to interject. Are you saying that when you're doing this over a long period of time and you have examples, you're just feeding the few-shots back into your rules section?

Kind of. When we get to the actual optimization loop we're going to build, the idea is that you are feeding the data in.
That's going to build a new set of instructions that you would then push to production.

Good.

And I think your second question was around evals, how to start writing them and how to optimize them. Is that right?

Yes.

Yeah, so it's a very similar approach. The data that you're reviewing is just a little bit different. I should have pulled up the loops. Let me try something really quick to show this. There we go. So this is how we see it: you have two co-evolving loops. I've been talking a lot about the one on the left, the blue one, where we're improving the agent: we're collecting failures and using them to do fine-tuning or prompt learning. But you basically want to do the same thing with your evals, where we're collecting failures. Instead of the failures being the output of your agent, we're talking about the eval output. So you have somebody go through and evaluate the evaluators, or you use things like log probs as confidence scores, or a jury as a judge, to determine where the eval is not confident. So you figure out where the eval is low confidence, then you collect that and annotate it, maybe having somebody go through and say, OK, this is where the eval is wrong. It's pretty much the same process for optimizing your eval prompt. I think folks think they can grab something off the shelf or write something once and then just forget about it. But, and I've said it a few times, the left loop only works as well as your evals do.

So I think my question is actually way more static and basic. With this orange circle, are you building a system or simulator for the eval, or are you just talking about a system prompt, a user prompt, and an eval?

Yeah, right now what we're talking about is just the different prompts.
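Using log probs as confidence scores can be sketched like this. The record layout and the 0.8 threshold are assumptions for illustration; the only fixed fact is that a token's probability is the exponential of its log probability. How you obtain the log probability of the judge's label token depends on your provider.

```python
import math


def label_confidence(label_logprob: float) -> float:
    """Convert the log probability of the judge's label token to a probability."""
    return math.exp(label_logprob)


def low_confidence(eval_rows: list[dict], threshold: float = 0.8) -> list[dict]:
    """Return eval results whose label probability falls below the threshold.

    These are the rows to send to a human for annotation in the eval loop.
    """
    return [
        row for row in eval_rows
        if label_confidence(row["label_logprob"]) < threshold
    ]


# Hypothetical judge outputs: the label plus the log probability of that label.
eval_rows = [
    {"id": 1, "label": "correct", "label_logprob": -0.05},   # about 0.95
    {"id": 2, "label": "incorrect", "label_logprob": -0.9},  # about 0.41
]
review_queue = low_confidence(eval_rows)
print([row["id"] for row in review_queue])
```

The rows in `review_queue` play the same role for the eval prompt that agent failures play for the agent prompt: they are the examples a person annotates before the eval prompt is optimized.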
You can definitely do simulations. I think that's a whole different workshop.

Thank you.

Any more? Otherwise we'll go ahead and move to the workshop. Any? No? All right. So here is a QR code for our prompt learning repo. [inaudible] I know it's a little bit clunky; I wasn't sure of a better way. I can also just show you here where to find it. It's in the Arize AI organization, in the prompt learning repo. You just want to pull that. We're going to be running it locally here.

Can you go back to the page with the URL?

Yes, sorry about that. I'll give folks just a few minutes to get set up.

What's your process when you're building a new agent or a new prompt? Do you start by just hacking together a prototype, seeing where it's bad, and then doing evals?

Yeah, I think there are different perspectives on this. Our perspective is that evals should never block you. You need to get started, and you need to just build something really scrappy. We don't think you should waste time doing evals up front. I think it's helpful to pull something out of the box in those situations, just because it's hard to comb through your data. It's something I've experienced with Alyx. When you're getting started, just running tests manually and reviewing them is kind of painful. So I think having evals is helpful, but they shouldn't be a blocker. Pull something off the shelf and start with that. Then, as you iterate and learn where your issues are, you refine your evals as you're refining your agent.

Does it make sense to optimize this for sub-agents? [inaudible] How do you think about it?
So the question is: are you just doing one single prompt, or how do you think about this in a multi-agent setup? Right now we're thinking of these as independent tasks: you optimize your prompts independently, in their own tasks, rather than getting into agent simulation and running them all together. So right now our approach is a little bit isolated. But I definitely see a future where we're going to meet the standard of sub-agents and everything else that's going on right now.

No, I think that's pretty accurate. And also, even in a single-agent use case versus a multi-agent use case, ultimately each of those agents may be specialized, and they have their own prompts that they need to learn from. So I think doing this in isolation still has benefits for the multi-agent system as a whole, and those compound over time, especially in scenarios like [inaudible], et cetera. It relates to what we were saying about overfitting as well. Again, it's the question we get all the time, but really you want to be overfitting on your own codebase, as an engineer would. You don't want to be so generalized that you're no longer [inaudible]. I think it specifically works here. But yeah.

Was everybody able to get into the repo OK? Anybody need help? All right. So we are going to be using OpenAI for this. I think the next thing I'll have everyone do is [inaudible]. And then I'll start walking through our notebook here. We're going to do the JSON webpage prompt example. You're going to find that under notebooks here. We'll give everybody a second to get there. We're just going to make a few additions to this example, to make it run a little faster and work a little better. First, what is this even doing?
This is going to be a very simple example for just a JSON web page prompt. If anybody has a prompt or a use case that they want to code along with, we're absolutely glad to help adapt this to your use case. This is something very simple, just to demonstrate the principles. And we are going to be using OpenAI. We can definitely experiment: if you want to swap in any other providers, we can help you do that. But the goal of this is essentially to iterate through different versions of a prompt using a data set, and optimize.
So the first thing is obviously we need to do some installs. I am just going to have you all update it. It says greater than 2.0.0, but we're going to actually just use, I think, 2.2. And then the next thing, just to make this run a little faster: we're going to run things async, which is missing. So you can go ahead and add these lines in the cell as well.
All right. Everything's kind of all alone. I never know. No one will move to that. Seems to have not. It's cool. Let's talk about the optimization. I talked about it a little bit when I was going through the slides. We are going to be doing some blue. So the general idea is we start out with a data set with some feedback. We'll look through the data set once we get to it. But you're going to want to have either human annotations, the free-text labels, or some evaluation data. The feedback is very important. That's what makes this work. We're then going to pass that to an LLM to do the optimization. And it's basically going to have evals in the loop. So as it's optimizing, it's using that data set to run and assess whether or not it should keep optimizing.
And then it also provides you with data that you can use to gauge which of the prompts that it outputs to use in a production setting.
So we're going to do some configuration. I've written out here what each of these means. We have the number of samples. This is how many rows to sample from the data set. You can set it to zero to use all the data, or use a positive number to limit it for faster experimentation. I think folks use different approaches here. Sometimes you want to just move really quickly, so a small sample. Sometimes you want to be a little bit more representative. I have it set here as 100. Feel free to adjust.
The next thing is train split. I think folks are pretty familiar with the concept of a train/test split. It's just how much of the data we want to use for training, which is what we're using to actually optimize, and how much of it we want to use when we're testing, when we're running the eval on the new prompt.
And there's the number of rules: basically, the specific number of rules to use for evaluation. This just to come in to which prompts to use. As we're running these loops we're outputting a bunch of different prompts, and this is just saying how many we should use for evaluation.
And then the key one, the number of optimization loops. This sets how many optimization iterations to run for our experiment. Each loop basically generates outputs, evaluates them, and refines the prompt. So these are the central settings that control the whole prompt learning loop and how much data we want to use. You can just run these as they are, or if you want to adjust them, feel free.
The next cell is pretty standard. We're just going to grab the OpenAI key, if you haven't already set that up. It's going to prompt you with a pop-up. I'll show you here quick.
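As a rough illustration of the configuration just described, here is a minimal sketch in plain Python. The names `LoopConfig`, `sample_and_split`, and every field name are hypothetical (the notebook's own variable names may differ), and the `train_split`, `num_rules`, and loop-count values are placeholders; only the sample size of 100 and the "zero means use all data" behavior come from the walkthrough.

```python
import random
from dataclasses import dataclass


@dataclass
class LoopConfig:
    # All field names are hypothetical; the notebook's variable names may differ.
    num_samples: int = 100            # 0 means "use every row"
    train_split: float = 0.8          # assumed fraction of rows used for training
    num_rules: int = 10               # assumed value
    num_optimization_loops: int = 5   # assumed value


def sample_and_split(rows, config, seed=42):
    """Shuffle, optionally subsample, then split rows into train and test."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    if config.num_samples > 0:
        rows = rows[: config.num_samples]
    cut = int(len(rows) * config.train_split)
    return rows[:cut], rows[cut:]


config = LoopConfig()
rows = [{"input": f"query {i}"} for i in range(250)]
train, test = sample_and_split(rows, config)
print(len(train), len(test))
```

A small sample keeps each loop fast while you are experimenting; setting `num_samples` to zero trades speed for a more representative data set.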
It's going to pop up there. You can just paste in your API key. Post-operative research. Looking at the data a little bit. And just a library, rather than telling you issues, you can just do this a little bit. But I think the part is to see. So hopefully we get through this. I'm going to be running my IKIFs. It's cool. I'm doing good. But if you have a free one, you want to do me even. Is that the first?
All right. Let's start with the data. So we provided you the data, with queries. You can see here that we're doing the 80/20 split based off of the configuration we set above. We're just going to pull this train set here. And let's just. Yeah. I run the train because I need to use my tools. Yeah, you're right. That's a mistake on my part. It is the 50.
Let's take a look at what this looks like, just so folks can understand. So we're starting here with just some basic input. We don't have any of the feedback in these rows that are printed out here. But you can imagine you can have different correctness labels here, explanations, whatever it is that you have. Some folks use multiple evals, some use just one. Sometimes it's human feedback. Sometimes it's a combination. You really want to have your input and output there.
Should my output of the train set be the same as yours? Not necessarily. Yeah. I didn't know if it was sort of not. Yeah. And all the things are kind of with the. But we could look at it like, you know, if I did this, this should be the same for you. Maybe. Just a little bit. Yeah. That's what you're saying. Okay.
The question was whether the input could be like a chat history and not just. Great question. I think it depends on what you're trying to do. You're doing just a simple kind of system. Yeah. So if you put just the original. The probability of you hitting, you know, a failure in the middle of the. Yeah.
Yeah. It's not easy to say what that looks like. It's not that. They're not trying to. In a way like that. This is sort of about the same US. It's not easy to hate us. It's fine and easy not. Okay. Well, I agree with you and. Oh, you have as you look at that. We go right here. Do you have any egt what? Is something called up and we will take that out. In a lecture, And how we can buy for us like Easter and versus we have some context also. Yeah, so not such the context it should only whatever the manipulative. The metrics and all the prompting. context is to be the state it's gonna be like. Based on the answer is you've changed my context.
You're saying, along with the input, you have additional context to pass in. You can absolutely include that in your data set, so that the prompt learning LLM can understand all of the data as a whole. You just pass that in as a separate column if you want. Most people start with just the input and the feedback. But you can absolutely add whatever other data you think is relevant. And for the rerunning, when we're doing the experiment or testing, you'll definitely always want to have the data that would be required.
And that's one is very interesting. I've come to all some of the projects is putting some API calls. Whatever the project is doing it should be based on the after getting the output. And whatever the context is, my problem is less whatever the tools are like. I have them API call all the. Context engineering and then last.
Yeah, so again, this way we're optimizing just one prompt and not the whole thing end to end. But you definitely want to have everything that's flowing into the prompt that you're optimizing. So if your system prompt takes in the user input and, for example, some data from an external source, you definitely want to provide all that data. Does that make sense?
Because you're saying, like, the trajectory? There are tool calls and what the agent's going to do. And you know what the tool call was. Is it what you're trying to prop or do?
Yeah, we're going to try to replay it as one step. We don't want to do it completely in isolation. So if there's data that flows into that prompt, that's what it's using when it's producing the output, right? So we want to be sure that we're including that. We don't want to exclude anything. But if it's data that comes in at a different step, probably not. You don't want to do it that way. Just think about what's relevant for the step that we're trying to optimize.
All right. So, the initial system prompt. This is something very, very basic. I think you could definitely do a lot better than this, but I just want to illustrate something that we're going to test and optimize. So we're just saying: you're an expert in JSON web page creation, your task is the input. And all these inputs that we're seeing are going to be what we're actually generating outputs for and trying to optimize.
Now, I already touched on this: evaluators are extremely important to make all of this work. So we're going to initialize two evaluators that use LLM-as-a-judge to assess the quality of the generated outputs. If you have any other code-based evaluations, whatever you need to do to evaluate, you can swap those out. But we're going to do evaluate output. This is going to be a comprehensive evaluator that assesses the JSON web page correctness. It takes the input query and the evaluation rules. It's going to provide an output label of correct or incorrect, so a pretty simple binary. Again, you can use multi-label. And then it's going to have detailed explanations as well. And then we have a rule checker.
This is a more specialized evaluator. It performs a granular rule-by-rule analysis and examines whether each rule was complied with. And then the feedback from both of these goes into our optimization loop to iteratively improve the system prompt. Excellation rule by the vision guy. And we'll get to the prompt learning optimizer and creating the more.
So let's see what the actual evaluate output has. We do have some rules in here. Sorry. They will be in the repo. So we're going to open it as a file. We have this LLM provider, and we're using OpenAI here. And then we're going to do our classification evaluator. So we're just calling it. We have an evaluation template that we're reading in from the file here. Then we just have choices, correct and incorrect. Now we're mapping these to a score. And then we're going to be able to add or just sort of. Sometimes numbers are easier than just looking at a bunch of labels. It is optional. You'll want to map these if you have a multi-class use case; you can set the scores. But these are just going to be our choices, like the real set we want our own is it's too. And then all we're doing here is getting a result. I have some printing so you can take a look. So this is going to be something that I'm you're seeing in the middle. So I'm just going to pause here. And then we're going to make a code changes so what you're seeing in probably your version. This is a good time for that. Just going to set up the evaluator. Make sense? The key points: it's going to be the rails, it's going to be the output, and of course, our template.
Yeah, you will want to grab your own OpenAI key here. Set this. And if you want to use a different provider, we can help you swap this out. So I'm going to talk about anybody. I'm going to start walking through the output generation.
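To make the evaluator setup above a little more concrete, here is a minimal sketch of the pattern: a template, a fixed set of allowed labels (the "rails"), and an optional label-to-score mapping. It is not the notebook's code. `EVAL_TEMPLATE`, `CHOICES`, `evaluate_output`, and `stub_judge` are hypothetical names, and `stub_judge` only checks that the output parses as JSON so the example runs without an API key; in the workshop that role is played by an LLM judge.

```python
import json

# Hypothetical template; the workshop reads its template from a file in the repo.
EVAL_TEMPLATE = (
    "You are evaluating a generated JSON web page.\n"
    "Query: {input}\nOutput: {output}\nRules: {rules}\n"
    "Answer with a label (correct or incorrect) and an explanation."
)

# Allowed labels mapped to scores; the mapping to numbers is optional.
CHOICES = {"correct": 1, "incorrect": 0}


def stub_judge(prompt, row):
    """Stand-in for an LLM judge: only checks that the output parses as JSON."""
    try:
        json.loads(row["output"])
    except json.JSONDecodeError as err:
        return "incorrect", f"Output is not valid JSON: {err.msg}"
    return "correct", "Output parses as JSON."


def evaluate_output(row, rules, judge=stub_judge):
    """Fill the template, ask the judge, and keep only labels from CHOICES."""
    prompt = EVAL_TEMPLATE.format(input=row["input"], output=row["output"], rules=rules)
    label, explanation = judge(prompt, row)
    if label not in CHOICES:
        raise ValueError(f"Judge returned a label outside the allowed choices: {label!r}")
    return {"correctness": label, "score": CHOICES[label], "explanation": explanation}


good = evaluate_output({"input": "landing page", "output": '{"title": "Home"}'}, "Use a title.")
bad = evaluate_output({"input": "landing page", "output": "<html>"}, "Use a title.")
print(good["correctness"], bad["correctness"])
```

Restricting the judge to a closed set of labels is what makes the later score mapping and metrics possible; the free-text explanation is the part that feeds the optimizer.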
So you can imagine this as your own agent logic, or the part that you're testing. This is just a function that actually generates the JSON outputs. We're using GPT-4.1 here with JSON format, so you're getting consistent outputs. It takes a data set and a system prompt, generates outputs for all rows, and returns results for evaluation. And it's called each iteration to produce outputs. So this is like our experimentation function that we're running. As we're passing in data and producing new prompts, we need a way to test them and understand how we are moving the needle here. So that's all this is. Pretty straightforward function. Just call generate output. We have that output model. Again, we're using OpenAI; if anybody wants to switch anything around, happy to help. We are using response format because we are dealing with JSON here. So we know that when you just found, I mean, some of the newer models are decent at it, but using response format is really helpful. And then we're also setting temperature to zero. And here is just where we're passing all the data in. So the data set, because again, we want to run this on all of the testing data. The system prompt will be an input. So as we get to the optimization loop, we're going to be passing in a new prompt to this with the data set, and then evaluating. We have our output model that we've already passed, and concurrency, all that good stuff. And it's just returning all of the outputs there.
For the current generation of models, is this one's basically like in AI terms, H and would you say you recommend setting the temperature to zero? Or would you actually want to try to encourage some creativity?
So. And you can use this. Hold on. You can definitely experiment with that, and look at it through the lens of how much consistency you need. For something like a JSON web page, I feel like temperature zero makes sense.
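The generation step described above (run one system prompt over every row, concurrently, and return the outputs in order) can be sketched as follows. This is an illustration, not the notebook's function: `generate_outputs` and `fake_model_call` are hypothetical names, and the fake call just echoes a small JSON object so the example runs offline. The comment shows where a real OpenAI chat completion call with `temperature=0` and a JSON `response_format` would go.

```python
import asyncio
import json


async def fake_model_call(system_prompt, user_input):
    """Stand-in for the model call so the sketch runs without an API key.

    A real version might call the OpenAI chat completions API with
    temperature=0 and response_format={"type": "json_object"}.
    """
    await asyncio.sleep(0)  # yield control, as a network call would
    return json.dumps({"title": user_input, "sections": []})


async def generate_outputs(rows, system_prompt, model_call=fake_model_call, concurrency=5):
    """Generate one output per row, limiting how many calls run at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def one(row):
        async with semaphore:
            return await model_call(system_prompt, row["input"])

    # gather() returns results in the same order as the input rows.
    return await asyncio.gather(*(one(row) for row in rows))


rows = [{"input": "pricing page"}, {"input": "about page"}, {"input": "blog index"}]
# In a notebook with a running event loop, use `await generate_outputs(...)` instead.
outputs = asyncio.run(generate_outputs(rows, "You are an expert in JSON web page creation."))
print(len(outputs))
```

Because the function takes the system prompt as an argument, the same code can be reused on every loop iteration with whichever prompt the optimizer produced last.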
But I definitely don't think that for every agent or every use case you want to use zero. Any other questions before we move on?
All right. So before that, we are using some score mapping. This part is optional. You want to use the metrics that make sense to you. We're not directly using this as like we are kind of like. Easy and to know when and how we optimize, but it's not like we're using this as our sole indicator of success. Here we are just going to have very basic metrics. We chose things like accuracy, F1, precision, recall, just some basic classification metrics for us to understand; because we are using binary mapped scores, we can do that. And so that's what you're seeing happen here. We're mapping to binary and then, based on the score, we calculate the metrics. So a very simple function here.
All right. The good stuff: the optimization loop. So this is the core prompt optimization algorithm. It's a three-part process. We want to generate and evaluate: generate outputs using the current prompt on the test data set and evaluate their correctness. We want to train and optimize: if results are unsatisfactory, generate outputs on the training set, evaluate them, and use the feedback to improve the prompt. And then iterate: we want to repeat until either the threshold is met or all the loops are completed. So if you remember about. And then we can kind of repeat. Based on that or the first. It's going to take our metrics to solve the iteration. It returns the detailed results, including train and test accuracy scores and the optimized prompts. And the wrong. So as I mentioned at the beginning, as we're running these different loops on the experiments, we're going to be producing a lot of different prompts. And so we're getting that information back that you can use. And then these are the results.
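For the score mapping and basic classification metrics mentioned earlier in this part of the walkthrough, a minimal stdlib sketch might look like this. The function name `classification_metrics` is hypothetical, and the example assumes you have reference labels to compare the evaluator's labels against; the label lists below are made up purely for illustration.

```python
LABEL_TO_SCORE = {"correct": 1, "incorrect": 0}


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 scores."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: reference labels vs. labels produced by an evaluator.
reference = ["correct", "correct", "incorrect", "correct", "incorrect"]
predicted = ["correct", "incorrect", "incorrect", "correct", "correct"]
metrics = classification_metrics(
    [LABEL_TO_SCORE[label] for label in reference],
    [LABEL_TO_SCORE[label] for label in predicted],
)
print(metrics)
```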
And these are our key parameters. I'll go through them as we get to the code, but just to give you a heads up. There's the threshold: this is the target accuracy score to stop optimization. It could also be whatever other metric you choose. We have a score. Then the number of loops, the optimization iterations. We've set that, and the number of rules. Again, these are configurations we've already set.
So the optimization loop is going to take in all those parameters that I mentioned, and then it just kicks off, saying hey, we're starting. It's going to do the initial evaluation, so we understand how things are starting off. You can pass in data to skip this initial evaluation. We're running it at the start here. But if you were running in a production setting, you might already have evals, you might already have feedback, and you can adjust this for that. And then it's going to assess the threshold against our initial evaluation. Again, this could be skipped if you're coming from a production setting. But we wanted to start off from scratch, so that we can get a real feel for this.
And then it starts the loop. So we're generating outputs. It's setting that as the train output. So when I printed train, you kind of saw the outputs that kind of stick to head there. And we'll also set correctness, explanation, any rule violations. And then we'll actually use our prompt learning optimizer. This comes with the SDK, the prompt learning SDK that you can use with Arize. So we're sending in that prompt, the model choice, and then the API key. Under the hood, as we talked about in the slides, it's taking that feedback, taking in your original prompt, and trying to optimize to get better results. And then outputting a prompt. And then you can also add in evals, so again, those three feedback columns.
We're looking to get back correctness, explanation, and any rule violations. And then from there, we just kick off the optimizer on our train set, with those feedback columns again, and then any context size limitations you want to add.
Next step: the optimizer is going to produce a new prompt. And then we want to evaluate, so we understand how we're doing; that's what this code block is doing here. So we're running that new prompt again with all the data, understanding our results. And we do that with our test set. Pesful. And then we're getting back our score, and then doing the checks. And then we repeat it all again, until we get our results or we hit the max number of loops. And then we're returning our results. So it's going to be having another one across here. Any questions on that?
Just some results. And more help the questions here. So we do want to obviously see all these results. We don't want them only in memory, where we can never access them again. So we're just saving them off. You can also see every single experiment, so you have all of that data. Berksian will kind of pull this. And you can work with the best. Is. But these are just very basic. I won't spend too much time; we're just saving them to CSV at the end of the day. So this is the CSV format. It includes the iteration number, the number of rules, and the test and training accuracy scores: all the data that we're actually going to use to evaluate whether or not this was successful.
And then we're going to start getting results here. This is a point of all the run to run. As you are running it, you're going to start seeing the different loops and what's coming out as well. And we'll just work through it as it runs. It's probably going to take like 20 or 30 minutes for things to run.
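Putting the walkthrough together, the loop (initial evaluation, threshold check, generate and evaluate on the train set, ask an optimizer for a new prompt, re-evaluate on the test set, repeat) can be sketched in plain Python. This is an illustration of the control flow only, not the SDK's implementation: `optimize_prompt`, `mean_score`, and the three `toy_*` functions are hypothetical stand-ins for the model call, the evaluators, and the prompt learning optimizer.

```python
def mean_score(prompt, rows, generate, evaluate):
    """Average evaluator score of a prompt over a set of rows."""
    scores = [evaluate(row, generate(prompt, row))["score"] for row in rows]
    return sum(scores) / len(scores) if scores else 0.0


def optimize_prompt(initial_prompt, train_rows, test_rows, generate, evaluate, improve,
                    threshold=0.9, max_loops=5):
    """Generate, evaluate, and refine until the threshold or loop limit is hit."""
    prompt = initial_prompt
    test_accuracy = mean_score(prompt, test_rows, generate, evaluate)  # initial evaluation
    history = [{"iteration": 0, "prompt": prompt, "test_accuracy": test_accuracy}]
    for iteration in range(1, max_loops + 1):
        if test_accuracy >= threshold:
            break
        # Collect outputs plus feedback columns on the training set.
        feedback = []
        for row in train_rows:
            output = generate(prompt, row)
            feedback.append({**row, "output": output, **evaluate(row, output)})
        prompt = improve(prompt, feedback)  # the optimizer proposes a new prompt
        test_accuracy = mean_score(prompt, test_rows, generate, evaluate)
        history.append({"iteration": iteration, "prompt": prompt, "test_accuracy": test_accuracy})
    return history


# Toy stand-ins so the sketch runs offline.
def toy_generate(prompt, row):
    return '{"page": "%s"}' % row["input"] if "valid JSON" in prompt else "<html>"


def toy_evaluate(row, output):
    ok = output.startswith("{")
    return {"score": int(ok), "correctness": "correct" if ok else "incorrect",
            "explanation": "parses as JSON" if ok else "output is not JSON"}


def toy_improve(prompt, feedback):
    if any(item["score"] == 0 for item in feedback):
        return prompt + " Always return valid JSON."
    return prompt


rows = [{"input": f"page {i}"} for i in range(4)]
history = optimize_prompt("You are an expert in JSON web page creation.",
                          rows[:2], rows[2:], toy_generate, toy_evaluate, toy_improve)
print([(h["iteration"], h["test_accuracy"]) for h in history])
```

Keeping every iteration in `history` rather than only the final prompt is what lets you compare prompts afterward and pick one for production.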
But I'm happy to take any questions and help everybody out as they run into issues.
One thing, can you scroll down to the part that we need to change? Oh yeah. Yeah, you can. So this line here, when you're doing everything, you do want it to be equals 2.2, because I think there's a little bit of a package issue. So just make sure that the data and the errors, but the ebound is probably why. If not, why don't you get a lot more credit for it? This is the reason why. It uses like a generic evaluation. Yes. And you kind of see the manager, if you go to the. We've kind of just taken that part out of this. We can definitely go through that. So if you vote here. And this line here, we're reading in. Under prompts here. You can find some. About your. Curious.
And this is the reason why everyone hates on Docker. This is why we use Docker. Yeah, absolutely. I don't work a lot. The noble happens with all of us. So I would also recommend adding the async, if you haven't already. It helps it run a lot faster. Also, for the purpose of the workshop, just set loops to one. 36 minutes to run. So we recommend also doing that. Instead of having fun. Obviously, we would recommend doing more than that when you're actually optimizing your prompt, but for now it'll help you get through the workshop.
All right, I just want to call out the last little bit here, the last step, before folks go. It is just to select the prompt that achieved the best test accuracy. So I mentioned how we're saving off all the results. We just have a function that essentially gets the best version, showing you the original and then the best optimized version, which you can then pull and put into your code. I did want to make just one call-out as this scales up.
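The saving-off and best-prompt steps can be sketched with the standard `csv` module. The column names follow the fields mentioned in the walkthrough (iteration number, number of rules, train and test accuracy), but the exact names, the helper functions `save_results` and `load_best_prompt`, and the three records below are all illustrative assumptions rather than the notebook's output.

```python
import csv
import os
import tempfile

FIELDS = ["iteration", "num_rules", "train_accuracy", "test_accuracy", "prompt"]


def save_results(records, path):
    """Write one row per optimization iteration so results outlive the session."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def load_best_prompt(path):
    """Return (original prompt, best prompt, best test accuracy) from a results file."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    best = max(rows, key=lambda row: float(row["test_accuracy"]))
    return rows[0]["prompt"], best["prompt"], float(best["test_accuracy"])


# Made-up records, for illustration only.
records = [
    {"iteration": 0, "num_rules": 10, "train_accuracy": 0.5, "test_accuracy": 0.5,
     "prompt": "original prompt"},
    {"iteration": 1, "num_rules": 10, "train_accuracy": 0.9, "test_accuracy": 0.8,
     "prompt": "optimized prompt v1"},
    {"iteration": 2, "num_rules": 10, "train_accuracy": 0.9, "test_accuracy": 0.7,
     "prompt": "optimized prompt v2"},
]
path = os.path.join(tempfile.mkdtemp(), "results.csv")
save_results(records, path)
original, best, best_accuracy = load_best_prompt(path)
print(best, best_accuracy)
```

Selecting on test accuracy rather than train accuracy matters here: the last iteration is not necessarily the best one, as the made-up records show.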
It gets difficult to manage. And so I want to call out, for those of you who are maybe looking for more of an enterprise solution to this: in Arize, we do have a prompt optimization task. You can have your prompts living in our prompt hub, and data sets with all of your human annotations or evals. You can either create them from traces or just ingest them into Arize. And then from there all you really need to do is give it a task: choose what you want your training data set to be, where the output lives, and where all your feedback columns are. And then you can just kick it off and it will produce an optimized prompt in the hub for you. So if I go over here, I think I have some. It will just create a new version here that says it's an optimized prompt, with all the results. And we are building on this: you can add all your evals to it and have that all run in the loop. I just wanted to call out that if you're not interested in maintaining code loops and having to build the infrastructure yourself, it is something that we do offer in Arize.
Hopefully folks got everything running. We'll be sitting around here for a while so we can help you work through issues, but thanks so much for joining us. Hopefully you learned something useful.
