Why building eval platforms is hard — Phil Hetzel, Braintrust

Evaluating large language model (LLM) agents is crucial because their outputs are highly variable and they are increasingly deployed in production, so teams need confidence in agent behavior to limit brand, compliance, and cost risks.
While initial evals can start simply with spreadsheets, building a robust eval platform quickly becomes a complex systems problem, especially when integrating production observability with offline experimentation.
The primary challenge lies in managing the unique data characteristics of LLM traces—which are large, semi-structured, high-velocity, and require diverse query patterns—making the data layer significantly more complex than the UI.

Start Simple, Then Mature: Begin eval efforts with basic tools like spreadsheets and for-loops to acknowledge the problem, then mature to bespoke UIs and databases for better persistence and collaboration.
Encourage Experimentation: Implement "playground" features for both technical and non-technical users, allowing them to tweak agent parameters (e.g., system instructions) and compare performance across different configurations.
Integrate Observability with Evals: Establish a "flywheel" loop where production trace data informs offline evals. Observe agent behavior in real usage, analyze failure modes, and use that insight to improve agents in a safe, offline environment.
Evals Become a Systems Problem: Recognize that scaling evals and observability for LLM agents is fundamentally a data layer challenge, not just a UI/UX one, due to the unique nature of LLM trace data.
Design for LLM Trace Data: The underlying data platform must handle very large, semi-structured, high-velocity LLM traces with diverse read patterns (low-latency lookup and complex analytical queries like full-text search).
Consider Headless Evals: Anticipate use cases where agents themselves perform evals on other agents (e.g., coding agents improving an agent's quality), requiring a data backend that supports programmatic access beyond a UI.
Automate Tracing and Governance: Look towards implementing automatic tracing through AI proxies or gateways to ensure all LLM interactions are captured for evals and central governance.
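
As a rough illustration of the first two takeaways, the sketch below shows the "spreadsheet and for-loop" stage extended into a playground-style comparison: the same input examples are run against two system-instruction configurations and scored. Everything here is an assumption made for illustration: `run_agent` is a toy stand-in for a real agent call, and the example data, configuration names, and scorers are hypothetical, not taken from the talk or from any particular platform.

```python
from statistics import mean

# Hypothetical input examples: whatever is needed to invoke the agent once.
EXAMPLES = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "eval platforms", "expected": "EVAL PLATFORMS"},
]

# Two hypothetical configurations of the system instructions to compare.
CONFIGS = {
    "baseline": "Uppercase the user's text.",
    "strict": "Uppercase the user's text. Reply with only the result.",
}


def run_agent(system_prompt: str, user_input: str) -> str:
    """Toy stand-in for a real agent call; swap in your own client code."""
    answer = user_input.upper()
    if "only the result" in system_prompt.lower():
        return answer
    return f"Sure! Here you go: {answer}"


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 only when the output is exactly the expected text."""
    return 1.0 if output == expected else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected in output else 0.0


SCORERS = {"exact_match": exact_match, "contains_expected": contains_expected}


def run_experiment(system_prompt: str) -> list[dict]:
    """The for-loop: run every example through the agent and score it."""
    rows = []
    for example in EXAMPLES:
        output = run_agent(system_prompt, example["input"])
        row = {"input": example["input"], "output": output}
        for name, scorer in SCORERS.items():
            row[name] = scorer(output, example["expected"])
        rows.append(row)
    return rows


def summarize(rows: list[dict]) -> dict:
    """Average each score across the rows of one experiment."""
    return {name: mean(row[name] for row in rows) for name in SCORERS}


if __name__ == "__main__":
    for config_name, prompt in CONFIGS.items():
        print(config_name, summarize(run_experiment(prompt)))
```

The rows returned by `run_experiment` are what would otherwise be pasted into a spreadsheet. Keeping the scorers as plain functions means the same functions could later be pointed at production traces.
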

Evals: Short for "evaluations," the process of testing and assessing the performance and behavior of an LLM or AI agent.
LLM: Short for "Large Language Model," a type of artificial intelligence trained on vast amounts of text data to understand and generate human-like text.
Agent: An AI system that can autonomously perceive its environment, make decisions, and take actions to achieve specific goals, often powered by an LLM.
Observability: The ability to understand the internal state of a system (like an AI agent in production) by examining its external outputs (logs, metrics, traces).
Proof of Concept (POC): A small-scale project or demonstration designed to verify a concept or idea's feasibility and potential before full development.
Sandbox: An isolated testing environment where changes can be made and experiments run without affecting the production system.
System Instructions: Specific directives or prompts given to an LLM agent to guide its behavior and responses, often defining its persona or task.
Trace Data: Detailed records of the execution path and events within a system, crucial for debugging and understanding complex AI agent interactions.
Spans: Individual operations or logical units of work within a distributed trace, representing a specific segment of time and activity.
Topic Modeling: A statistical technique used to discover the abstract "topics" that occur in a collection of documents, often used to uncover patterns in user interactions with agents.
AI Proxy/Gateway: An intermediary service that sits between users/applications and an LLM, often used to add functionality like tracing, logging, or security without modifying the core LLM.

All right, it's 11:15. We're going to go ahead and get started. Before we do, everyone say "evals." Evals.

I was telling my colleague Rose, who's at the door, that I was an adjunct professor for a number of years. The first year that I did it, I thought I was going to have this full class of 130 people every single week, eager to learn. And then as the weeks went on, 130 became 60, became 30, became 10. So I always tell myself that whenever I give a talk, only about four or five people are going to show up, but I'm going to be really excited to teach those four or five. Today's a real blessing because we have a packed house here. Everyone's excited to learn about evals, and I'm excited to teach it.

This is what we're going to be talking about today. I'll give you a little bit of an intro about myself and the company that I work for, and an overview of the problem statement. We'll go into the different stages that people go through when they're building eval platforms. And after that, we'll talk about where, at least in my opinion, eval platforms are going to go.

This is me. My name is Phil Hetzel. I lead solutions engineering at Braintrust. I'll go into what Braintrust is in a second. Solutions engineering basically means that I'm the person, and my team are the people, who make sure that customers are getting the most value out of our platform as quickly as possible. So I'm fortunate, because across all of our customers I see what the state of the art is in both evals and agent observability.

Prior to Braintrust, I spent 12 years in consulting and systems implementation. I worked for KPMG for four years. I worked for a company called Slalom Consulting for eight years, where I led the Global Databricks Business Unit. And I noticed, as I was helping my clients with those implementations, that they were great. They were so good at generating these generative AI proofs of concept, but not many were getting to production.
And I wanted to be helpful in making sure that those POCs could get to production. So I actually started using Braintrust because I knew it helped out in this space. I started using it as a user, and I liked the platform so much that I applied for a job, and I've been here for about a year.

Outside of work, I like to play chess, but I'm very bad at it. And I like to spend time with my wife and my dachshund, who is named Pistol Pete. He's pictured. He's the person in brown. He's not the person in black. The person in black is me.

Has anyone heard of Braintrust before? Anyone? A couple of hands. How many people have heard about Braintrust for the first time this week? Okay, great. Wonderful.

Braintrust, just as a reminder: we think of ourselves as an agent quality platform. There are a lot of things that can go into quality. The way that we get to agent quality is through two main pillars, evals and observability, which we think of as really similar problems to solve. Evals are what you're doing with your agent before it gets to production, as you're experimenting, so that you can become confident in your agent. Observability is really similar, but you're already in production. Your agent is in front of real usage from real users, and you want to remain confident that your agent is performing the way you thought it would when you were building it.

So that's Braintrust. I was specifically told not to make this a sales pitch, so that's really the last Braintrust slide that you'll get today, although of course I'm very happy to answer questions about our company this week. Mainly I wanted to talk more conceptually about how people start to mature and build these platforms, speaking from a place where we have a lot of experience in the space.

Here's why evals are important. This sounds obvious, but LLMs have extreme variability. We love LLMs because they're highly variable.
There are so many different types of problems that LLMs can reason their way to solving. That's why we're so attracted to them as a technology. LLMs are also, of course, the brain of the agent, and agents are becoming the norm in how customers interact with companies. People expect an agentic experience now. So if you combine those two things, you really need to be confident in how your agent is going to perform once it is in production. Without that confidence, you can potentially incur a great deal of risk from a brand perspective, a compliance perspective, and even a cost, maintenance, and systems perspective. We want to avoid all of those things and make sure that our customers are having a great experience and that our agents are acting the way we thought they would.

Now, how many people are doing evals right now, but it's just in a Google Sheet or some spreadsheet? There's no shame in that, my friend. Raise that hand high. That's great. I think that's great. Just making the step is really important. It's an acknowledgement of the problem space.

A lot of folks will come to us and say, "Well, I don't really understand Braintrust, because all I need is a way to loop through my agent with a couple of different inputs and display some handwritten notes and scores about that agent." There are three things in what I just mentioned: some way to execute your agent; some UI, sometimes as simple as a spreadsheet, to show those outputs and scores; and a way to gather input examples. What I mean by an input example is the thing that can initiate a run of an agent, the thing that can invoke an agent, whatever information is necessary for that.

It would be a really short presentation if this were all that evals was. I would thank you for your time and walk out of the room, but that's not what you're here for. There is a whole other part of the iceberg.
It's way more complicated than that. There are a lot of things that you end up having to build when you're really serious about evals. We're not going to talk about every single one of these things today, but we will touch on many of them. And of course, if there's anything here that I don't cover that you're interested in, I'll leave some time for questions. I also see a lot of phones up, so I'm going to pause for iceberg pictures.

A couple of things while that's happening. Why is this a complicated problem? We already talked a little bit about how the underlying technology is quite complex. LLMs are not a superficial engine. But building these agents is also a multi-persona problem. It's not just something that engineers do in isolation. Engineers, whether they're product engineers or AI engineers or both; systems engineers to get the thing running; SMEs that have the domain knowledge: all of these people need to be involved. And then lastly, evals themselves become a systems problem. That'll be the last thing that we touch on today.

So what are the different stages of building an eval platform? My friend over there raised a hand proudly about starting out in a spreadsheet. This is a great place to start, and the most important thing is that you just get started. So you've got a spreadsheet and you've got a for loop. You've got a bunch of input examples that you can iterate through, and you have a way to execute your agent. So every time you tweak your agent, you can see how the outputs differ over time.

This is a great place to start because there is no barrier to entry. Everyone has some way to access some type of spreadsheet technology. But the returns can be diminishing, for a couple of reasons. I would call this documenting. It's not really experimenting.
So you have this spreadsheet with a bunch of input examples, and maybe you keep track of the different outputs the agent emitted each time you tweaked it. That can become cumbersome to manage over time, of course. It's really challenging to directly compare experiments over time. You're probably not doing a lot of analytics across those experiments, and the analytics that you are performing are likely coming from some type of human score, which is really valuable but challenging to scale in practice. And this is a team sport, as we were talking about before: we want to make sure that we're bringing a ton of people into the fold, not just technical folks but also non-technical folks. They can add a lot of value to your agent because of their unique domain expertise and proximity to users. They're probably not coming into the spreadsheet, is my point. And it's slow. Each time you eval, you probably have to go through a somewhat cumbersome process to recreate or append to the spreadsheet.

Probably one of the most fun conversations that I have in my job is when a very proud product engineer gets on a call with me, and they puff their chest out and they smirk at me and they say, "Why can't I just vibe code Braintrust?" There's no problem with that. I think if you're just starting your journey, it's a really nice step to go to. So now, instead of being in spreadsheet land, you're making something a little bit more bespoke to bring other people into the fold. You've probably still got a for loop. You've got a nicer UI now, so it's more approachable. And hopefully you've graduated to some database that isn't Excel or Google Sheets. You've probably rolled a new database in something like Neon. So now you have a better story around persistence of evals. And because of this, you bring more people into the fold. You are making UIs that are a little bit more bespoke for your specific users.
The problem here is that you're still not really iterating yet. You're still performing work that is more just reporting, just documentation, rather than encouraging a lot of iteration. So this is more of a reporting tool. How many people have vibe coded their own UI? Yeah, makes sense.

Next step: you want to encourage a lot of experimentation, not just with technical users but with non-technical users. I'm showing this image, which is more aligned to allowing experimentation for non-technical users. But of course, as you're building these platforms, you want to allow for a more SDK-driven experience as well. That just doesn't make for a very nice image in a presentation.

Experimentation, to me, means that you can give a user access to a configuration of an agent in a sandbox, and you allow them to tweak certain parameters within that agent. In my example here, I'm allowing a user in the UI to change the system instructions of an agent running outside of my eval platform, and allowing them to compare two different configurations of that system prompt. And I'm running evals across those two different agent runs so that I can bubble up scores. You can see that in the image. I can bubble up different scores to understand, both technically and functionally, how my agent is behaving. You'll hear about a lot of platforms having a playground feature. You're going to want some type of playground feature for both technical and non-technical users.

This is where the rubber starts meeting the road, because the best way to perform evals is to really think about the failure modes that your agent can fall into and build scoring functions around those failure modes. The best way to find those failure modes in the first place is to have access to production trace data, i.e., your agent in front of real users and real usage. So the next step here is a really important one.
We want to make sure that we can connect what we, at least internally, call the flywheel. Observability and evals, to us, are actually the same problem from a systems perspective.

Funny story: three years ago, when we started, we were only an evals platform. Then we noticed one of our customers was running this massive eval every hour of every day. We reached out to this person and they said, "Oh yeah, I'm just piping all of my production traffic into this database and I'm running an eval against it." So we thought, okay, we should probably just build the ability to trace and observe actual traffic and account for that use case without having to cram it into offline evals.

So it's really important to make sure that we can observe things in production and understand the actual behavior of our agents, and also understand the real lift that the changes we're making to our agents are having. So we're analyzing that data. We pull those actual examples back into an offline environment, and then we improve upon them using offline evals. This is a loop; it's not just a one-off process. You're going to be performing this loop, hopefully, for the lifetime of the agent that you're pushing to production. You should be iterating through this loop as many times as possible. That's how you improve.

As a result of that, you've changed your scope a little. You've widened your scope a lot, actually. You are now a tracing platform. You're now a logging platform in addition to being an offline evals platform. Again, the benefit is that you're starting to get far higher signal from how users are interacting with your agents, and you can use those real interactions. You can almost think about evals as re-running production in a safe environment. You're now getting to that point with this example.
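
To make the flywheel a little more concrete, here is a minimal sketch of the "pull production examples back offline" step, under stated assumptions: the `Trace` shape, the `looks_like_refusal` detector, and the sample traces are all hypothetical and are not Braintrust's data model or API. The idea is only that a failure mode observed in production becomes a filter, and the traces it flags become input examples for the next offline eval run.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical, heavily simplified production trace."""
    trace_id: str
    user_input: str
    output: str


def looks_like_refusal(trace: Trace) -> bool:
    """Hypothetical failure-mode detector: the agent declined to help."""
    return "i can't help" in trace.output.lower()


def harvest_failures(traces, detector):
    """Turn production traces that hit a failure mode into offline eval examples."""
    return [
        {"input": t.user_input, "bad_output": t.output, "source_trace": t.trace_id}
        for t in traces
        if detector(t)
    ]


# Made-up traffic standing in for real production traces.
PRODUCTION_TRACES = [
    Trace("t1", "Cancel my order", "Your order is cancelled."),
    Trace("t2", "Change my address", "Sorry, I can't help with that."),
    Trace("t3", "Where is my refund?", "I can't help with refunds."),
]

if __name__ == "__main__":
    for example in harvest_failures(PRODUCTION_TRACES, looks_like_refusal):
        print(example)
```

Keeping `source_trace` on each harvested example preserves the link back to the production trace it came from, which makes it possible to revisit the original interaction after the agent has been changed.
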
You can also perform online evals: you can point scoring functions at your observability traffic and do things like alerting. These are all things that you could build in when you're at this phase of maturity for running evals.

The bad here: if you build it, you have to manage it. So just because you've vibe coded a platform, guess what? You might get a promotion for it, but managing and continuing to grow your eval platform at the pace that the industry is moving is going to be your job now. That can be an exciting challenge. That's kind of the bet that our company is making, and we're excited to solve that problem.

The more important challenge, though, is agent traces specifically. If you look on the screen, these are really nasty. They're not like normal application traces. They are semi-structured, and a lot of the time they're unstructured. There's just a ton of text inherent to the LLM problems that we're solving. They're very large in addition to being complicated. If you're trying to cram a one-gigabyte trace into a Postgres row, that can lead to a lot of performance problems. And they're numerous. It's high velocity, because there's so much usage happening in production, hopefully, with the agent that you've pushed.

This is how we used to solve this problem, just as an example. If you're at this stage of maturity and you've got traces coming in, you're going to need to account for two query patterns. One: if you're performing observability, you need a way for folks to see their traces instantly. That's very important to people, so you'll need a very low-latency way to ingest data. Two: you need a second layer of persistence for the query pattern of "I want to be able to analyze and aggregate these data." We used to use an open source data warehouse for this. We used to stitch these two sources together through a domain-specific language that we created, called BTQL, that no one liked. Including us. We hated it.
Then we would perform a third level of aggregation using DuckDB in the browser. This worked for us for a bit. Then it stopped working. To use one of our customers as an example: a customer like Notion sends us a ton of unstructured data, and they want to be able to do things like full-text search across a trace. None of these technologies are really equipped to perform text-style analytics, which is a challenge in the LLM domain because there's just so much text.

That leads us to this. Measuring agent quality, performing evals, performing observability: it's actually a systems problem. It's not just a UI/UX problem. We recognize that it's quite easy to vibe code the UI of evals, but it's way, way, way more challenging to create the data layer for running a successful evals and observability platform. Not just from a scale perspective, although that matters, but mostly from a functional perspective: allowing people to do the things they would expect to be able to do, like performing full-text search across millions of traces in their platform of choice.

I talked about this a little bit. The reason this is such a novel problem to solve spans a lot of these dimensions. I won't read through the whole slide, but the data comes in really fast, and the data are really large when they come in. A span is just one part of a trace, and a lot of traditional spans will be a couple of kilobytes. Here we've seen spans that are 10 or 20 megabytes in size, with so much context within those spans, and highly unstructured. And then there are also a lot of different types of read patterns. You might be performing aggregate-type reads, but you also want very low-latency reads. None of these problems are individually unique, but together they make for a very unique problem from a systems perspective.
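
The read patterns described above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `TraceStore` and its method names are hypothetical, and a dictionary plus an inverted index says nothing about the hard parts of the problem (multi-megabyte spans, ingest velocity, durability). It simply shows the three kinds of reads side by side: point lookup by trace ID, full-text search across span text, and a small aggregate.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TraceStore:
    """Toy in-memory trace store; names and structure are illustrative only."""

    def __init__(self) -> None:
        self._spans = defaultdict(list)  # trace_id -> list of spans
        self._index = defaultdict(set)   # token -> set of trace_ids

    def ingest(self, trace_id: str, span_name: str, text: str) -> None:
        """Store a span and index its text for search."""
        self._spans[trace_id].append({"name": span_name, "text": text})
        for token in tokenize(text):
            self._index[token].add(trace_id)

    def get_trace(self, trace_id: str) -> list[dict]:
        """Low-latency pattern: fetch one trace's spans by ID."""
        return list(self._spans.get(trace_id, []))

    def search(self, query: str) -> list[str]:
        """Full-text pattern: IDs of traces containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._index.get(t, set()) for t in tokens))
        return sorted(matches)

    def span_counts(self) -> dict[str, int]:
        """Analytical pattern: a simple aggregate over all traces."""
        return {trace_id: len(spans) for trace_id, spans in self._spans.items()}


if __name__ == "__main__":
    store = TraceStore()
    store.ingest("t1", "llm_call", "User asked about the refund policy for annual plans")
    store.ingest("t1", "tool_call", "lookup_policy(plan='annual')")
    store.ingest("t2", "llm_call", "User asked how to export a page to PDF")
    print(store.get_trace("t1"))
    print(store.search("refund policy"))
    print(store.span_counts())
```

Even in this toy version, each read pattern needs its own data structure, which hints at why a single general-purpose table tends to serve at most one of them well.
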
So what we've done, and what you all would have to endeavor to do if you were building this yourselves, is really think about making the right data platform for traces, so that you can support some of the more functional requirements that eventually come down the line. The example I have here: let's say that you want to let a coding agent loose on your evals platform so that you can be a bit more self-healing, using the coding agent to pull data from your evals platform into context and change your agent within a coding agent session. That's going to be really challenging to do if you can't run a lot of just pure SQL on the data back end of your evals platform. We've actually noticed a lot of these headless-style use cases come up, where people aren't interested in the UI at all. The only thing they're interested in is, "How can I perform evals in a way where I can use Codex or Claude Code to help increase the quality of my agent for me?"

The last problem here that I'll talk about is the "so what" problem. We'll skip this for now for the sake of time. This is how Braintrust does this. We have a blog post about this that just got released, if you're interested.

What comes next, in terms of what you can expect to build into your evals platform: you want to be able to tell folks the unknown unknowns of your agent, i.e., don't make me look across a whole bunch of traces; just tell me how people are using our agent. You want to be able to uncover those unknown unknowns through topic modeling techniques so that you know where to spend your engineering time. You want to make sure that you are building your platform not just for humans but also for agents, because that's one of the main media for how people are creating technology now. We didn't even talk about the non-functional requirements that go into building these platforms, like role-based access control and data masking.
That's also something that's super important, and it comes up when you want to operate at scale. And then lastly, consider adding automatic tracing through some type of AI proxy or gateway, so that people don't even have a choice but to trace their LLMs. You can govern very centrally by adding tracing automatically to your eval platform.

So I appreciate the time. I've got about a minute and 20 seconds left for questions. I can probably take two of them if anyone has any. Yes?

Audience member: So how does Braintrust specifically handle multimodal outputs and inputs in traces?

Yeah, technically, we just put them in some object storage, reference them, and then display them directly in the trace. So if you have an audio file or a video file, you can play it in the trace when someone's reviewing the trace itself. We don't want people to have to exit the platform for that.

The next question was, is prompt management in Braintrust? It could be, or it doesn't have to be. Yeah. Okay, perfect. Thank you so much for your attention today.
