Harness Engineering: How to Build Software When Humans Steer, Agents Execute

Our next speaker is here to speak about harness engineering: how to build software when humans steer and agents execute. Please join me in welcoming to the stage Member of Technical Staff at OpenAI, Ryan Lopopolo.
Good morning, London. I'm super excited to be here today. I'm Ryan Lopopolo, and for the last nine months I have had the privilege of building software exclusively with agents. I am a token billionaire. And I believe that in order for us to get into our AGI future, we want everybody to be token billionaires, to use the models to do the full job. What that means is to lean into the idea that the models are capable of being a full software engineer. I've lived that experience by banning my team from even touching their editors, so they have to work through the models in order to get the job done. Today I'm going to talk to you a little bit about what it means to lean into that and operationalize the way you work, the code bases you live in, and the processes on your teams in order to get the agents to do the full job.
I believe I'm preaching to the choir here when I say that the way we build software has changed. In the last six months, we have seen coding agents take over the world, and capability has continually advanced at a super fast pace: these models, and the harnesses within which they live, take more complex actions and do more complicated work with higher reliability over longer time horizons. The place we've gotten to is that implementation is no longer the scarce resource in the job of software engineering. Code is free. We have an abundance of code to solve the problems that we come across day to day as we run our teams, build software, and solve user problems. Hiring the hands on the keyboards as part of our teams is only constrained by GPU capacity and token budgets. Each engineer in this room today has access to 5, 50, or 5,000 engineers' worth of capacity, 24/7, every day of the year. The only thing that needs to happen, our role, is to figure out how to productively deploy these resources into our code and into our teams to make use of this new capacity. In this world, skill sets are shifting more toward systems thinking, system design, and delegation in order to make use of this abundant capacity to produce code to solve problems.
There are three reasons that this happened, all of which happened in late 2025. For me, the magic moment was GPT-5.2, which, when it came out, was able to do the full job of a software engineer. The models at this point are good enough that they are isomorphic to you and me in their ability to produce high-quality code that solves real user problems in real code bases. Code is free. I know this is maybe a scary thing to hear, because code carries maintenance burden, but it's free to produce and free to refactor, and it is not a thing to get hung up on anymore. We think of code as a burden because it's a synchronous attention drain on the human engineers on our team. But the models are incredibly patient and infinitely parallel, so the ability to produce, maintain, refactor, and delete code is no longer a forcing function on how to allocate resources on your engineering teams.
So to be AGI-pilled here is to believe that the models are capable of producing every line of code we could ever possibly need, figuring out when to delete it, and figuring out when to refactor it or make it more reliable. It's your role as software engineers to figure out how to unblock your team of agents, and the humans driving those agents, so they can drive them over long-horizon work to do the full job. The idea here is that every one of you is a staff engineer: you have as many team members as you can possibly drive concurrently and have tokens to support.
And you need to look one day, one week, six months into the future to figure out what structures you need to put in place to productively harness this infinite capacity to produce code. The scarce resources in this world today are three things: human time, human and model attention, and model context window. In a world where human time and attention are scarce, the role is to think about where that time is going, figure out ways to productively automate it, and move that synchronous human time into higher-leverage activities.
In a world where human time is scarce and human time is required to produce code, we have a stack rank. Things are either P0s or P2s; those P3s will never get done. However, in a world where code is free and infinitely abundant, all those P3s get kicked off immediately, maybe 4x in parallel; we pick one that solves the problem, and in it goes. I've had the privilege of building a ton of agents internally at OpenAI to improve the productivity of my coworkers. And when code is free, all these internal tools can have good localization and internationalization from day one. I can make tools that my colleagues in London, Dublin, Paris, Brussels, Zurich, and Munich are able to experience in their native languages without really having to trade against any of my team's other capacity in order to make high-quality tools. We should be working with the assumption that the best parts of software engineering that we all know, live, and breathe are available in any product that we could ever build, all the time.
Humans no longer need to concern themselves with implementation. The important thing is not the code, but the prompt and the guardrails that got you there. This is why leaving breadcrumbs matters: documentation, ADRs, persona-oriented documentation around what a good job looks like, all the historical logs of tickets and code reviews. This is the process that got you and your teams to the code and products that you have today.
And this is what needs to happen in order to get your agents there as well. Your job is to build systems, software, and structures that enable your team to be successful. To do that, we need to make them legible to the agents that are driving the implementation. That means structuring them in a way that's native to the agents, writing them in a way that respects scarce context, which is this other scarce resource, and figuring out ways to make the tokens that are required to do the job easy to predict. That means making things the same as much as possible, so we can limit the amount of attention the model needs to activate in order to do the job.
Large-scale refactoring in this world is free, so making things the same is something that you are all able to do. There's never going to be a migration that hangs open for six months because you can't get the last parts of the code base done. You can just fire off 15 agents to drive that work to completion. This is what it means to have a migration, right? We can finish them now. Come on, that's good. That's good. Clap.
There's sort of a meta-epistemological question here about what it means to do a good job. Doing a good job as a software engineer is hard. It takes us years of being in the industry to fully internalize what it means to write high-quality, maintainable, reliable code that our teammates are able to build on top of and that is going to accrue leverage to the code base. To do a single patch well probably requires 500 little decisions along the way around the underspecified non-functional requirements that go into producing good code. The models, during their training, have seen trillions of lines of code that make every possible choice of those non-functional requirements that you could ever imagine.
So it's our job to specify those non-functional requirements, to write them down in a way that lets the agents see: this is what it is to do a good, acceptable job that's going to produce a merged patch. And if the agents aren't doing that, it's our job to figure out ways to refine and restrict their output such that the code they write is acceptable. You can simply say: do not produce slop. Don't accept slop, and you won't get slop in your code base. But doing that requires taking short-term velocity hits in order to back up, or double-click into a task, to figure out what it is the agents are struggling with in your environment. Put the guardrails in place so they stop making those mistakes, and then step back and spend your time on higher-leverage activities once you've solved some of the blockers in the short term.
When I think about empowering my team in this way, everyone is an expert in what it is they bring. I have a diverse full-stack team whose members are experts in front-end architecture, back-end scalability, and being product minded, and each of those personas fleshes out the skill set of my team by bringing a different understanding, a different set of solves, for those non-functional requirements. Getting teammates to write those down means that every engineer driving agents gets the best of every single person on my team. I don't need to block on low-signal code review in order to learn what it means to write a good QA plan. Having one engineer on my team document that in a durable way means every agent trajectory is going to get a good QA plan, and we can do this once, in a high-leverage way that we're able to stack on top of.
So how can we get the agents to do a good job? What are some of the tools and techniques we have to essentially prompt inject our agents and continually remind them of what it means to make the specific choices that we expect around those non-functional requirements?
And there's a bunch of ways we can do this. We can write good AGENTS.md files. However, there is auto-compaction, which is a thing that has continued to improve. GPT-5.4 in Codex is fantastic at auto-compaction; I essentially never have to write /new anymore. I've got some pictures on my Twitter of me strapping my laptop into the back of my car so I can keep running inference while I'm commuting to and from work. In this world, you have to build for the expectation that context will get paged out over time. We need to be continually refreshing context as the agent goes about doing a task.
One way we can do that is by having reviewer agents look at the code along the way through the lens of what it means to be successful. We have security and reliability review agents in our code base that run as part of every push and in CI. They look at that documentation and the proposed patch and ask simple things like: are there timeouts and retries on this bit of network code? Does the code that has been introduced have a secure interface that is impossible to misuse? I'm sure everyone here has been paged at some point for network code that failed in production, causing an outage that could have been remediated by a retry and a timeout. And I know I'm guilty of putting that retry and timeout in, merging the bug fix, and otherwise ignoring that I am not a reliable reviewer or author of code with respect to this non-functional requirement. However, taking the time to write some docs, and to write a lint that is bespoke to my code base and that checks every call to fetch to make sure there's a retry and a timeout wrapped around it, means I've durably solved this problem. And I'm able to do it because I lean on this axiom that code is free, that the agents are able to do a good job, and that I can completely migrate the code base to solve this problem durably, once and for all.
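
The bespoke lint described here could look something like the sketch below, written in the shape of an ESLint rule module. It is an illustration under assumptions, not the rule from the talk: the rule name `noBareFetch`, the wrapper name `fetchWithRetry`, and the `docs/reliability.md` path are hypothetical, and the small `CallNode` and `RuleContext` types are local stand-ins for ESLint's own types so that the snippet is self-contained.

```typescript
// Minimal local stand-ins for the parts of ESLint's rule API used below.
type CallNode = { type: "CallExpression"; callee: { type: string; name?: string } };
type RuleContext = { report(descriptor: { node: CallNode; message: string }): void };

// Hypothetical wrapper name; the talk only says fetch must have a retry and a timeout.
const APPROVED_WRAPPER = "fetchWithRetry";

const noBareFetch = {
  meta: {
    type: "problem",
    docs: { description: "Require network calls to go through a retry/timeout wrapper" },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      // ESLint invokes this visitor for every call expression in a file.
      CallExpression(node: CallNode) {
        // Only bare `fetch(...)` is flagged; `obj.fetch(...)` has a non-Identifier callee.
        if (node.callee.type === "Identifier" && node.callee.name === "fetch") {
          context.report({
            node,
            // The message doubles as a prompt: it states the fix, not just the failure.
            message:
              `Do not call fetch() directly. Use ${APPROVED_WRAPPER}() so the request ` +
              "has a timeout and bounded retries. See docs/reliability.md (hypothetical path).",
          });
        }
      },
    };
  },
};
```

In a real workspace the module that defines the wrapper would need to be exempted from the rule through the lint configuration, since it is the one place allowed to call `fetch` directly.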
In order to operate in this way, we need to step back and look at the durable classes of failures that the agents and the humans in the code base are making time after time, figure out why we're spending time on them, devise a solution to systematically eliminate that class of misbehavior, and then continue to observe, refine, and make additional choices on those non-functional requirements.
One really neat trick I use here is that you can write tests about the source code as well, separate from lints. If we know that context is limited, we can write a test that asserts that files are no longer than 350 lines. We're adapting our code base to the harness and to the models, doing a little bit of engineering to be context efficient and squeeze more juice out of the model capability that we have today.
The other thing we can think about is providing good error messages that give actual remediation steps to the model and to humans for how to proceed next. It's not enough to say we've got a lint failure because we're awaiting in a loop, or that we have an `unknown` at this deep part of the code base, and why is the model writing a function called isRecord? What we need to do is provide a prompt, via a lint or a test failure, that says: no, no, no, you shouldn't have an `unknown` here at all, because we parse, don't validate, at the edge, and you certainly have a type here which was derived from Zod. Load-bearing infrastructure for our AI future.
You can just prompt things. Everything I've talked about here today is a prompt. You can do this without touching the model weights at all.
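
A test about the source code itself, with a failure message that carries its own remediation steps, might be sketched like this. The 350-line limit comes from the talk; the function names, the `SourceFile` shape, and the wording of the message are assumptions made for illustration. The check takes file contents as plain data, so in a real repository a test runner would first have to walk the workspace and read the files.

```typescript
// Limit taken from the talk's example; everything else here is a hypothetical sketch.
const MAX_LINES = 350;

type SourceFile = { path: string; contents: string };

function countLines(contents: string): number {
  if (contents.length === 0) return 0;
  const parts = contents.split("\n");
  // A trailing newline produces an empty final element that is not a real line.
  return contents.endsWith("\n") ? parts.length - 1 : parts.length;
}

// Returns one actionable message per oversized file; an empty array means the check passes.
function checkFileLengths(files: SourceFile[], maxLines: number = MAX_LINES): string[] {
  const failures: string[] = [];
  for (const file of files) {
    const lineCount = countLines(file.contents);
    if (lineCount > maxLines) {
      failures.push(
        `${file.path} has ${lineCount} lines (limit ${maxLines}). ` +
          "Split it into smaller modules grouped by responsibility and keep the " +
          "public exports unchanged so callers do not need to be edited.",
      );
    }
  }
  return failures;
}
```

The design point is that the failure text reads as an instruction, so an agent that hits the failure has what it needs to continue without a human typing the fix.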
A kind of funny digression here: it seems like each advancement we've had in the complexity of the way we write code to interact with these models comes from both increasing capability in the models and increasingly niche ways of injecting prompts into those models. Prompts, as I'm sure you're aware, are prompts. Powers: prompts. Rules files: prompts. Skills: prompts. These lint error messages that I am talking about: prompts. Review agents that inject comments onto the PR, which we require the agent to address before it is able to propose it for merge: prompts.
You're going to find lots of ways to insert prompts into your code, and one way you can do that is by embedding agent SDKs into your tests. They're going to review the code base for acceptability using prompts that get embedded into the code. And if I find myself spending a ton of time writing prompts, we can shell out to the agent for that as well. I've pointed Codex at all of the prompting cookbooks we have on the OpenAI developer guide and told it to synthesize a skill out of them for how to write prompts. That means when I find a need to write prompts to improve my agent performance locally in the code, I use the skill to write prompts, which I wrote with the agent looking at the prompts to write the prompts.
All the leverage that you're encoding into your repository, your team, and the agents in this way stacks incredibly well. To pull back to this idea that a single product-minded engineer on my team was able to give us a big lift: they know what it means to write a good QA plan. To write a good QA plan, though, you have to document all the features that you have, the critical user journeys, and how users engage with your applications, web apps, APIs, and services.
Once you write those down, along with how to write a good QA plan and the expectation that all user-facing work has a QA plan, a review agent is able to assert expectations around what it means to prove that you have effectively written the feature. A QA plan indicates what media should be attached to the PR for the humans and agents to know that you've done a good job, which has the consequence of me trusting the output more, needing to shoulder-surf the agent less, and removing myself from the loop even more to delegate more and more of the work to agents. All of this is just making sure the agents have the tools, tokens, and context to do the full job, to remove the need for me as a synchronous driver.
The models crave tokens. We can operationalize our code base to give them tokens to drive them forward, using sub-agents and all these other techniques to refine the agent output. I'm excited to let you all know today that you can just go build things. Do not hesitate to remove yourselves from the loop by getting the agents to do the full job, because they can. Thank you.
Very excited to bring on our guest. We've got Ryan Lopopolo today, who just gave the keynote. Very exciting speaker; the man is full-send harness engineering at OpenAI. A little bit of background: we did a Latent Space episode with him, and we shipped it the other day. The story: he wrote this great article called Harness Engineering, and we were like, wow, this is pure gold, we have to have him on the podcast. He's a token billionaire, spending over a billion output tokens a day. That's like over a thousand dollars, so, you know, the man is really living it. We want to keep this exciting: ask good questions, ask interesting stuff, ask things that people can learn from. But, you know, let's welcome Ryan onto the stage.
Hi folks, how's it going? Excited to be here. It has been fantastic, and I'm excited to walk through what it is that we do and how we work here. I think you had to come on this.
I'm just here, I got blinded by the QR code, so we're good.
So, background: we have about an hour. Scan this QR code and you should get Slido. Slido will let you ask questions; if you see interesting stuff you can thumbs it up, and we'll try to get through them. Unfortunately the first one I can't super do, but let's just kick it off, right: can you show us your actual working setup? I was here beach margarita linear right. I'll say, watch the podcast we put out, where we go through some of the work. But if you want to talk about it, I guess without actually showing us: what's your workflow like, what's your setup, how do you approach a task?
Sure. The way me and my team work is to start with tickets. We have chunks of work that we want to do, features we want to add to our apps, reliability work that we want to do. We give that ticket to an agent along with a couple of skills that enable it to manipulate our app. We want the entry point to the development process to be Codex, not an environment which we build around it, so we kind of do things outside it, right: Codex is the entry point, the same way you would be, and we give it tools and instructions on how to cook. So rather than creating a shell that our app and Codex get spawned into, we have a skill that teaches Codex how to launch the app and how to spin up the local observability stack to give it logging into the entry. We give it a skill that enables it to boot up Chrome DevTools and attach to the application with a local CLI that connects via some daemon that we have. So the whole way we have set up the repository and all of the local dev tools is for Codex to invoke them first.
That means we have a bunch of little mini harnesses within the code base that make it really easy for us to slot in additional guardrails. We have a big package of custom ESLint rules which get wired into every pnpm package in the workspace.
We have another sort of local dev harness that allows us to add higher-level wholesome tests that assert the structure of the code itself, rather than either the syntax or the behavior of the code. Things like package privacy, dependency edges between different layers of our stack, these sorts of things. Making sure that across multiple files Zod schemas are deduplicated, that there's a single canonical implementation of our async helpers, these sorts of things. Because the way we have seen the agents work is to sometimes optimize for local coherence of a package rather than using our shared utilities and things like that. Having observed that behavior, we have built a bunch of little pseudo-linter source code verification things that shake out some of that bad behavior, so the humans don't get distracted paying attention to that in reviews, stuff like that. The setup optimizes for the agent to do the job and for the humans to not have to keep track of the high churn in the code base.
We centralize our leverage around five to ten skills. We don't go super wide on skills, preferring to make the existing skills better, because, at least I find, the infrastructure within the repository and all the local developer tools change super frequently, and I don't really have the bandwidth to keep track of this. So we hide all that complexity beneath the skills that the human has to invoke and let the agent just figure it out. One kind of neat thing here is that when we moved from using the Chrome DevTools Protocol directly to having this daemon thing, I didn't know that had happened for like three weeks. It was totally fine, because Codex was able to do the thing with the documentation and things that we had in place.
For part of this, you can get more detail in your article. So, some background: you wrote a great piece called Harness Engineering.
There's a whole section in there on how you thought about skills: thousands of skills versus simplifying it to just quite a few. But okay, continuing on: how do you stop yourself from over-engineering harnesses? And a little bit of a similar follow-up: do you often build small tools for yourself, if ever? Do you build custom tools?
Yeah, so I think this is gesturing in the direction of the bitter lesson here, which is: how do I make sure the work that I do isn't completely obsoleted by an increase in model capability? The way I have thought about that is doing the bare minimum amount of context management to pull in requirements for the agent to do an acceptable job over the course of its work. Context is a thing that I don't think will ever be obsoleted. The models must be told the requirements of the task, which guardrails to pay attention to, these sorts of things. So a good harness is really operationalized around giving the model text at the right time, so it can look at the work it has done and the information around what a good job looks like. Fundamentally, the models are trained to follow instructions; all the harness should do is surface instructions to the model at the right time. We do want to minimize that too. You don't want to front-load all those instructions, because then you overwhelm the agent, but all of these requirements around what a good job is do need to be paid attention to over the entire course of the PR. So figuring out ways to either defer or just-in-time surface those instructions is kind of the way I go about it.
Surfacing the instructions that way is what a good harness should do. If you know that you want your React components to be decomposed so that they make good snapshot tests for individual, more stateless pieces, you don't need to load that up front. Instead, you should let the agent cook, prototype, and experiment with the UI you want to build, and then at lint or test time say: okay, you've done the work; in order to finish it, you have to break this apart so that your components are as small and as stateless as possible and have local dependencies on hooks instead of prop drilling, or whatever it is that you want the code to look like. Then the agent will say: oh, this is a new instruction for me, let me take the patch as written, modify it to make sure that it adheres to the instructions, and then up it goes to GitHub. And this sort of thing is a good example of what a good harness does.
So a lot of people are asking about the Codex model and the Codex harness. How does that compare to other harnesses, so Claude Code, OpenCode? How do you guys take these decisions into play? You don't work directly on Codex, but if there's stuff you can speak about regarding the Codex harness: what do you guys see as you architect it out?
So one thing that I think is super powerful is this notion that the labs are not just post-training the models, but post-training the models in the context of the harness in which they are primarily deployed. Things like the apply patch tool, or the specific quoting semantics of how to invoke the bash tool, are in the loop for the post-training process for the harnesses from the labs. That means there is leverage to be had by depending on the first-party harnesses directly; at least, this is what I believe. As such, being able to direct through them via things like the SDK, or manipulating the Codex app server directly, means you get to ride the wave of all that leverage in post-training and instead focus on the parts that you care about, which is what correct code looks like.
I have high confidence that things like Claude Code and Codex will continue to get better; that is the responsibility of the teams working on these coding agents. So in my role, where I don't really want to focus on the coding harness at all, the job is figuring out ways to plug into them in ways that steer the agent. That means my job can move up to thinking about differences in model behavior between releases. Rather than deeply understanding the nuts and bolts of the harness, I can think about what it means to drive the behavior that I want based on the observed behavior, rather than the inner mechanics of the thing.
It's a perfect lead-in to the next question, which is: do you have any recommendations for a collaboration platform? When you're in the software development life cycle, is there any platform that you use for agents, engineers, and developers all to collaborate on working on anything? So it's not just about the types of tools that you use.
Yeah, so in this world it has largely been just Markdown files in the repository and GitHub that have been the primary hub-and-spoke sort of thing. When you're collaborating on a document, you open Google Docs, you write something, you ask for feedback, people comment, you apply suggestions, these sorts of things. That is like a little clean-room environment just for the work artifact that you're producing, and a PR has a similar purpose. So we treat that as a big hub-and-spoke broadcast domain where all of the agents and humans collaborate together. And because we optimize for throughput, we don't block on any sort of contribution to that. Folks can either review or not, agents can either review or not, and the implementation agent can acknowledge, defer, or reject any feedback that it gets, really allowing each participant in the production of diffs to make their own judgments around what it means to deliver, receive, and respond to feedback. This has the nice property of not putting the model in a box in a bunch of places where we want it to use its good reasoning. Being super prescriptive, like saying every bit of feedback must be addressed, can have this catastrophic failure mode of your coding agent being bullied by all the reviewers, when really we want to bias toward code being accepted, not perfect, and not drowning in minutiae and these sorts of things.
How should people get started with using coding agents? People that have been writing a lot of code manually: how do they start to transition, what should they offload, how do they come over that barrier? Okay, I'm still checking every PR, copy-pasting from Codex: how should the average engineer start to use these tools?
I think there are two ways to approach this problem. One is to start using the coding agents to improve your confidence in the code itself as it is written today. I think we would all agree that more tests are probably a good thing: asserting that our programs are well specified and behave correctly as our users interact with them is a good thing. And the agents are super good at looking at the existing code, with some context around how it is meant to be used, and writing tests that assert that behavior. So using them to improve your confidence in the quality of the code will also increase the agents' ability to successfully navigate it, which means you don't have to worry as much about doing super detailed review of the agent output.
The other way to think about this is to look at how you are spending your time. Is it staring at your editor writing code? Is it waiting for tests to run? Is it waiting for human review feedback? Is CI slow and you're waiting on that? Maybe you have a ton of flaky tests. Then use the agents to incrementally automate the parts where you are spending your time.
Ultimately, the high-leverage parts of our jobs are to define the work that must be done, prioritize and schedule that work, and then effectively empower folks on our team to do that work. The more we can delegate and move into this sequencing and orchestration role, even if you just think about managing your teams, the more parallel, and the deeper, individual executions of those delegations we're able to do. If I put primitives in place that make it super easy to spin up ways to respond to events off the queue, I don't really need to be in the weeds with every engineer making sure they implement a consumer correctly. These same building-block-style techniques apply really well to the agents and stack really well too.
Fun one: how do you work with agents in your car?
So I have not used the new voice mode that launched in CarPlay recently; I'm not ready for that. Usually what I'll do is kick off a task right before I leave the office, tether my laptop to my phone, buckle it into the back seat, and let it cook in the 30 minutes it takes me to get home. Most of the time, with the skills we invoke that tell the agent, you know, you're operating on a task, you go until the tests are green, I don't have to reach back there and poke "continue" into the thing, and I'm basically able to more fully saturate my day with token consumption. The dream here is that I actually have 50 agents running 24/7 and I don't have to interact with them at all. The way to do that is to define the work well, figure out ways for it to be scheduled automatically, and remove myself from having to click the button. Every time I have to type "continue" to the agent is a failure of the harness to provide enough context around what it means to continue to completion.
Good statement at the end: any time you have to interact with the agent is a failure. Okay, so the following question kind of scales this out. As your org or knowledge map scales, what practical steps do you take to enable progressive disclosure? As you have a larger and larger code base and more people, how do you scale your agents to work better with this?
Yeah, so when I initially started this project, I was working in a blank repository: create-electron-app, a single package, all this sort of stuff. And I eventually ended up with a mess, because there was no package privacy that allowed me to enforce invariants around which APIs are public and which ones are not.
We didn't have concrete hooks in the file system to determine which domains were separate from the other ones. So we ended up going full 10,000-engineer organization, heavy on the architecture: 750 packages in the pnpm workspace, isolated by business logic domain or layer of the stack, with individual small util packages that encapsulate reusable functionality, that we lint on being used, and that we can encode leverage in. I do think that in this world, even if you don't actually have microservices, structuring your repositories so that you can scope the directory subtree you are looking into, and do most of the change there, helps.
Code in the file system is also text, which means it's effectively prompts that you're giving to your coding agent. So making the code as much the same as possible means that, regardless of where in the repository your agent is looking, it develops a ton of transferable context. You should have one way to do a bounded-concurrency helper. You should have one way to construct an observable and instrumented side-effectful command. You have one programming language, you have one way of writing CI scripts, and you should have one way of adding additional lint rules, these sorts of things. It means that the tokens that you want the model to produce are easier to predict and more consistently predicted regardless of where it looks.
So I would say: figure out ways to structure the code so it is local to a subtree in the repository for most of the ways you would interact with that system, and then figure out a way to use these agents to completely migrate the code base to be the same. Empower someone on your team to be a dictator, to say this is the way it must be done, or figure that out together, and write it down. Evolve the code so that it reflects that reality, these sorts of things.
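
One of the structural checks this layout enables, asserting dependency edges between layers of the stack, can be sketched as follows. The layer names, the `PackageInfo` shape, and the `@acme/...` package names in the usage below are all hypothetical; the talk does not say how layers are recorded. In a pnpm workspace, the input list would presumably be assembled from each package's `package.json`.

```typescript
// Hypothetical layers, ordered from lowest to highest. A package may depend only on
// workspace packages in the same layer or a lower one.
const LAYERS = ["util", "domain", "service", "app"] as const;
type Layer = (typeof LAYERS)[number];

type PackageInfo = { name: string; layer: Layer; dependencies: string[] };

function findLayerViolations(packages: PackageInfo[]): string[] {
  const layerOf = new Map<string, Layer>();
  for (const pkg of packages) layerOf.set(pkg.name, pkg.layer);

  const violations: string[] = [];
  for (const pkg of packages) {
    for (const dep of pkg.dependencies) {
      const depLayer = layerOf.get(dep);
      if (depLayer === undefined) continue; // third-party dependency, out of scope here
      if (LAYERS.indexOf(depLayer) > LAYERS.indexOf(pkg.layer)) {
        violations.push(
          `${pkg.name} (${pkg.layer}) must not depend on ${dep} (${depLayer}). ` +
            "Move the shared code into a lower-layer package or invert the dependency.",
        );
      }
    }
  }
  return violations;
}
```

A usage example with two made-up packages:

```typescript
const violations = findLayerViolations([
  { name: "@acme/async-utils", layer: "util", dependencies: ["@acme/billing"] },
  { name: "@acme/billing", layer: "domain", dependencies: ["@acme/async-utils", "zod"] },
]);
// One violation: the util package reaches up into the domain layer.
```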
We've got a few questions on code review. So how do you approach code review now that you have such high velocity? Do you just not read the code? Do you just trust the test coverage? How do you write good tests? How do you offload that stigma of, you know, the mental blocker that "I need to manually check everything before I merge PRs"?

So it's that same sort of idea, where you have to look at where you're spending your time and figure out ways to spend less of it. When we started, the first thing to do was figure out how to get the agent reliably producing code that we would accept. A big challenge we ran into is that, with each engineer producing three to five PRs per day, even on a team of three, merge conflicts were super miserable, because these PRs tended to be pretty big and we were working on the same parts of the codebase. So that's where we moved in two directions. One was to tree out the code a bit more to minimize these merge conflicts. The other was to minimize the amount of time PRs were open, so that we were reducing the likelihood of a merge conflict actually occurring. And the reason PRs were staying open so long was that we needed code review; humans were the blocker in this scenario. So in order to do that piece automatically, I essentially asked every engineer on the team to take one day a week where their entire job was to take every bit of slop we had observed over the course of the week that was making a PR difficult to merge, and figure out ways to categorically eliminate it from ever happening in the first place. This is where we started closing the loop: the feedback that humans were giving on a PR indicates some context failure on behalf of the agent, so we get that into the repository and then figure out ways to automatically prompt and check the agent so that it self-heals when it produces bad behavior.

And this is kind of how you go from synchronous human time spent giving feedback as code review comments, to documentation in the repository, to automatically serving this documentation, either via a failing test or a reviewer agent who is primed to review the code as written in the context of these docs. But all of that happens by putting those docs in a single place that all these processes are able to attach to. We asked folks to basically bucket the types of review feedback they were giving by the persona they were operating as: front-end architect, reliability engineer, scalability, that sort of thing. And then for each of those personas we spun up a review agent that gets triggered on every push and that says, "Is this code good? Surface any P2s or above that would block this PR from merging, based on this documentation that says what good looks like." And with that, and just continuously appending to these files, we started to see slop reduce.

People have questions about your billion tokens. Where do you think those are split up? How much of it is on code review? Where is the majority of that usage coming from? And a follow-up for people who are just getting started: say they've jumped in and gotten a $200 Pro plan. If you had to cut your usage by a fifth, how should people maximize it? So, you know, you don't want to just copy-paste a million lines of code every six hours, prompt it, prompt-cache it, but how should we think about that?

Yeah, so I would say it's probably a third, a third, a third between planning, ticket curation, and documentation, then implementation, then stuff that runs in CI.

Do you use plan mode?

We have used exec plans, which was kind of an early version of this that we published, sort of a proto-skill that says this is how you should structure a plan, with milestones and acceptance criteria.
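To make "a plan with milestones and acceptance criteria" concrete, here is a minimal sketch of such a structure in Python. It is an assumption for illustration only, not the published exec plan format: the `Milestone` and `Plan` classes and the `problems` check are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Milestone:
    """One step of the plan, with the checks that define 'done' for it."""
    name: str
    acceptance_criteria: list[str] = field(default_factory=list)


@dataclass
class Plan:
    """Hypothetical plan shape: a title plus ordered milestones."""
    title: str
    milestones: list[Milestone] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List structural gaps that would leave the agent guessing."""
        issues: list[str] = []
        if not self.milestones:
            issues.append("plan has no milestones")
        for milestone in self.milestones:
            if not milestone.acceptance_criteria:
                issues.append(
                    f"milestone '{milestone.name}' has no acceptance criteria"
                )
        return issues


if __name__ == "__main__":
    plan = Plan(
        title="Add offline mode",
        milestones=[
            Milestone("Cache reads", ["unit tests green", "works with network off"]),
            Milestone("Sync on reconnect"),  # deliberately underspecified
        ],
    )
    for issue in plan.problems():
        print(issue)
```

A check like this could run in CI against plan files, so that an underspecified milestone is caught mechanically rather than by a human reading the plan.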
I haven't really used plan mode as part of the harness at all. My expectation here is that I should be able to drop a ticket in and have it do the job anyway, without diverting through a plan, because most of the time I'm never going to read it anyway. I find that if you do use a plan and you approve it without reading it at all, you're actually encoding a bunch of instructions that you don't necessarily want followed. So if you are going to use plans, my recommendation is to push those up as single PRs with just the plan, where you actually have a human review every line of it, and block on human approval before they get merged and then kicked off. Otherwise you're potentially wasting your time on a rollout with instructions that are bad, so you want to minimize the time that happens. But I do think that getting tokens to be spent in CI is a necessary part here, because writing code is no longer the hard part. Getting code accepted, and advancing the code and product forward, is what it takes to make that written code valuable. And, you know, to honor the aphorism that senior engineers give good code reviews, we expect our senior engineers as agents to do the same.

Someone asked: is code a disposable build artifact?

Yes. I think we touch on this with Symphony, which is this agent orchestrator that we released: this idea that we can publish a library that's actually a super well-defined spec that the code is a compiled artifact of. And I think using LLMs as a fuzzy compiler is an interesting mental model to have. All of the context that we're putting in the codebase for harness engineering is effectively constraints and optimization passes on which code is acceptable to build in the first place. And this is pretty similar to the static analysis and optimization passes that something like LLVM would do in the process of compiling Rust code.
And swapping out one model for another is sort of like changing your code generation backend from, you know, LLVM to Cranelift in the Rust compiler. You would expect that all of the rules around what acceptable Rust code looks like produce valid, sound machine code at the back, even if the generation process is different and you end up with different x86 instructions. So it's the same sort of mindset for LLMs, swapping out different models, that sort of thing. We want the structure around the code to basically limit how it is written to things that would be acceptable to us.

And at a high level, can you give us a picture of what future you're building for? Whatever people call it, harness engineering or context engineering, what does the future look like?

So the future that I want to build toward here is one where I'm able to take a token budget and a quarter, a half, or a year's worth of work, take the human input to rank what is most important, the success metrics, the reliability metrics, give it to the machines, and have them continually work and advance my product forward without my hands explicitly on the wheel at all. As we have gone through very early prototyping to internal alpha, internal beta, external alpha, I have felt that new parts of the software engineering process have started from zero and we've had to build up capability, kind of like these, you know, pentagonal personality charts, where I spike in this direction and maybe I'm weak over here. And when we got to deployed software for the first time, the agents' ability to do QA smoke testing on our built artifacts before they're promoted to distribution was weak. We hadn't invested any time in this. There were no docs. There were no tools that the agents could use to download the built artifact, launch it, and poke around to make sure that our most critical user journeys were well validated and tested.
So because I don't want to be touching the computer, we needed to figure out ways for the agents to build themselves tools to do that part. And there's a whole universe of software engineering outside of writing code. I am triaging user feedback. I'm triaging pages. I am making sure that we don't have any PII leaking into the logs in production. I'm making sure that the Twitter vibes are good and people are enjoying my software. I'm making sure that our user operations staff are supported with well-written runbooks that allow them to triage and mitigate high-volume user issues, and then moving that into the code itself so they don't happen in the first place. And as I no longer have to produce code, my mind can shift to these other higher-level or more squishy activities. But the agents are good enough to do these things too, and figuring out how to write down the processes and the acceptance criteria becomes the metaprogramming part of the job.

I think that's a great way to end it, with an exciting future. Give it up for Ryan, guys.

Thank you, folks.

Harness Engineering: How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI

TL;DR

Takeaways

Vocabulary

Transcript