Live coding session with Claude Code and Boris

TL;DR

The Bun project runs a multi-agent AI setup that automates issue reproduction, code fixes, and code review, sharply reducing the time developers spend on these tasks.
This setup generates and refines pull requests autonomously, shifting the developer's role from debugging to deciding what to merge.
The automation depends on a thoroughly documented development environment and on iteratively removing bottlenecks, so that agents can work effectively and produce code worth merging.

Takeaways

Automate Issue Reproduction: Implement an AI agent (e.g., RoboBun) to automatically reproduce reported issues, generate code fixes, and submit pull requests (PRs) complete with verifying tests.
Leverage Multi-Agent Code Review: Combine AI agents with different strengths for code review; for instance, CodeRabbit for stylistic consistency and adherence to CLAUDE.md, and Claude Code review for subtle, context-heavy bugs.
Prioritize Development Environment Documentation (CLAUDE.md): Maintain thorough documentation of build processes, testing methodologies, and common issues within your codebase, as this is essential for AI agents to understand and correctly execute tasks.
Enable Full Loop Automation: Configure agents to monitor and interpret CI/build logs and error messages, allowing them to autonomously iterate on code fixes until all automated checks pass.
Utilize Auto Mode for Agent Permissions: For continuous, uninterrupted agent operation, enable auto mode for permissions, allowing agents to execute actions without constant manual approvals.
Adopt No Flicker Mode for CLI: Improve the Claude Code CLI experience with no flicker mode, which provides virtualized scrolling, constant memory and CPU usage, and mouse event support.
Iteratively Automate Bottlenecks: Continuously identify and automate sequential bottlenecks in the development workflow, such as writing code, running tests, or verification, to achieve incremental efficiency gains.
AI-Generated PRs as Suggestions: View AI-generated pull requests as suggestions rather than traditional human contributions; this allows for a higher bar for merging, ensuring only valuable and high-quality changes are integrated.
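
The "Enable Full Loop Automation" takeaway can be sketched roughly as follows. This is a minimal illustration, not Bun's actual tooling: `CheckResult`, `iterate_until_green`, `fake_checks`, and `fake_fix` are hypothetical names, and the CI system and the agent are replaced by simple stand-in functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    passed: bool
    log: str  # CI or build output the agent reads before its next attempt


def iterate_until_green(
    run_checks: Callable[[str], CheckResult],
    propose_fix: Callable[[str, str], str],
    patch: str,
    max_attempts: int = 5,
) -> Tuple[bool, str, List[str]]:
    """Run checks, feed failure logs back into the fixer, stop when green."""
    logs: List[str] = []
    for _ in range(max_attempts):
        result = run_checks(patch)
        logs.append(result.log)
        if result.passed:
            return True, patch, logs
        # Hypothetical agent call: revise the patch using the failure log.
        patch = propose_fix(patch, result.log)
    return False, patch, logs


# Stand-ins for a real CI system and a real agent (both hypothetical).
def fake_checks(patch: str) -> CheckResult:
    if "lint-fixed" not in patch:
        return CheckResult(False, "lint error: unused variable")
    return CheckResult(True, "all checks passed")


def fake_fix(patch: str, log: str) -> str:
    return patch + " lint-fixed" if "lint" in log else patch


ok, final_patch, logs = iterate_until_green(fake_checks, fake_fix, "initial patch")
print(ok, len(logs))
```

The attempt cap is a design choice in this sketch: a loop that never goes green hands the collected logs to a person instead of running indefinitely.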

Vocabulary

RoboBun: An AI bot specific to the Bun project that automatically attempts to reproduce reported issues, proposes code fixes, and submits pull requests.
Claude Code: Anthropic's agentic coding tool, used here from the CLI to run AI agents for coding tasks such as code generation and review.
Bun: A fast, all-in-one JavaScript runtime, bundler, transpiler, and package manager.
CodeRabbit: An AI-powered code review tool, used here mainly to catch stylistic issues and check adherence to coding standards.
CI: Continuous Integration; a development practice where code changes are frequently merged into a central repository, followed by automated builds and tests.
PR: Pull Request; a request to merge code changes from one branch into another, typically involving code review before acceptance.
CLAUDE.md: A Markdown file in the repository that gives AI agents project-specific instructions and context for development tasks.
LLMs: Large Language Models; AI models capable of understanding, generating, and processing human language.
No Flicker Mode: A rendering mode for the Claude Code CLI that prevents visual glitches and gives smoother scrolling and interaction.
Auto Mode: A permission setting that lets agents perform actions automatically, reducing the need for manual approval at each step.
Hill Climbing: An optimization technique in which a model iteratively improves toward a target metric by making changes and verifying their impact.
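
The hill climbing idea defined above can be illustrated with a toy loop: propose a change, measure it, and keep it only if the metric improves. This is a sketch under stated assumptions, not how any model or the Bun team implements it. The names `hill_climb`, `score`, and `mutate` are hypothetical, and the metric is a simple quadratic standing in for something like benchmark throughput.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    score: Callable[[float], float],
    mutate: Callable[[float, random.Random], float],
    start: float,
    steps: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Keep a candidate only when the measured score improves."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)  # the verification step
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy metric (an assumption for illustration): highest at x = 3.
def score(x: float) -> float:
    return -(x - 3.0) ** 2


def mutate(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-0.5, 0.5)


best, best_score = hill_climb(score, mutate, start=0.0)
print(round(best, 2), round(best_score, 4))
```

In the session's terms, the target metric would be something like "faster than the reference library on this benchmark", and the mutation step would be the agent editing code rather than nudging a number.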

Transcript

Welcome to the stage, head of Claude Code at Anthropic, Boris Cherny, and creator of Bun at Anthropic, Jarred Sumner. All right, so this is a developer conference. We're going to be doing a little bit of talking, but mostly it's just me coding. So this is for the developers in the room. I'm going to start by talking a little bit about how Bun uses Claude Code to build and maintain Bun, and also how our setup works, because it's a slightly more advanced setup than what's common today. But first I'm going to get a few agents running to fix some GitHub issues. This is classic Jarred, doing work during a talk. So in Bun's repo, every time somebody submits an issue, we have a Claude bot automatically run and try to reproduce the issue. So you can see this person has this side effects issue, and this is one of the most recent issues. And we can see that RoboBun, which is our bot, managed to reproduce the issue and submitted a PR automatically. And all these PRs always have tests. It's one of the actual hard requirements before it can submit a PR. So the challenge here is, does this code look correct? And one of the things we do to check that is, does the test fail in the previous version of Bun and pass in this debug branch? And the bot actually can't submit a PR without that being the case. And so, just to make sure I understand: for every single issue that goes up in the Bun issue tracker, you have RoboBun automatically try to reproduce it before anyone will accept it. Yeah. And this saves a lot of time because we have so many open GitHub issues. It really moves the challenge from just fixing and debugging the issue to, is this the right thing to merge? Like, is this the right fix? How good is it? Is it still like 100% yours? Is it like 10%? We can go to the insights and go to contributors.
And then if we go to the last three months, and this is specifically to main, we can see that RoboBun is now a bigger contributor to Bun than I am. And that's with not merging all of its PRs, for sure. You can see we have a lot of PRs open right now. The challenge is really, how do we know we can merge the PR? And that's the test. And then the other thing that's really interesting about this is we have automatic code review bots that run, and then they go back and forth. So CodeRabbit leaves a comment and then RoboBun leaves a comment, and they go back and forth, and CodeRabbit did the... How about this? And it also marks the comments as resolved when it's done. And you can see there's a lot of back and forth here. It was like 30 comments or something. And so you're using a combination of agents. So this is Claude Code review, and then also CodeRabbit, and you're using them together. Yeah. And I think basically CodeRabbit is good for kind of stylistic issues and things like making sure that it follows the CLAUDE.md. And then the Claude Code review is really good at, here's this really subtle edge case that would have taken me like 30 minutes of reading all the code and having all the context to figure out. So it's really good at surfacing bugs that you need the full context to really understand. And I think basically it's really hard to actually have all this automation without having code review that is in the loop with Claude. The replying is very performative, but it's actually fixing things. And that's also a big part of what used to take so much time, why PRs would take so much time to merge: you'd have to check out the branch locally, fix a lint error, then run the linter locally, then push it back up. And there's all this switching cost, constantly.
And so I think this is an especially good use case for LLMs, because otherwise it just takes up so much time to ship. And I guess especially for the Bun codebase, because it's systems code, it's very easy to repro an issue and then see if the issue was fixed. This is kind of back to what we were talking about before with this kind of verification loop. It's all systems code, so it's really a test case on a particular architecture, and you can essentially repro or verify anything. Yeah, that's one of the things that makes it easier in Bun's codebase: because it's a CLI tool, we don't need to run a browser to test things. But you can also just have something set up to take a screenshot or record a video or those sorts of things. In Bun's case we don't need that, at least not yet. There's a couple of things we could do that for, like we have some frontend stuff where that would be nice. But yeah, I think this is the direction that's really interesting, because it saves so much time. And this is specifically for Bun, but the more generalizable thing, because most products are not open source, is that instead of an issue maybe the starting point is a customer support ticket. So you could imagine automatically passing customer support tickets to a Claude bot to then go and try to reproduce the issue and then submit a PR, and then having code review go back and forth. And that's where I think for a lot of companies it becomes a lot more impactful, because it just saves so much developer time. Would you think of some kind of name for this pattern? Is it like adversarial code review or something like that? Yeah, I don't know. But I do think there are also a few other things about this, where if you just do this then it doesn't quite work. The very first step you need is to make sure the development environment is set up.
I think this has been talked about before, but CLAUDE.md is very important, because otherwise it's going to just submit PRs that don't quite make sense for you to merge. So we very much emphasize in Bun's codebase that it runs this special command to do the build. And this builds and runs the command, so it forwards the arguments, because that's also one confusing thing: because Bun has to be compiled, you want to make sure that it's running the actual changes and not a debug build that's stale. We also go into a lot of detail about how to run tests, how to write tests, where to put the tests. And a lot of, here's all the issues that we've run into previously. Basically the pattern here is, every time that you find yourself repeating something, it should probably go in CLAUDE.md. Because the question now is how do you make it maintainable to have lots of Claudes running all the time. And to do that it needs to be written down, it needs to be documented. So a really small detail is, to make sure that Claude sees the error message, we make it print the error message before the less informative conditions. So this is sort of like, you have Claude write a test and then the test is bad or something about it doesn't work. And then you see this repeated once or twice, and then you just add it to the CLAUDE.md so that every time in the future when you write a test you do it correctly the first time. Yeah. So this is like compound engineering, kind of. And then also it's helpful to give it an overview of where all the folders are, how the code is laid out, the dependencies. Another thing I think is interesting is making sure that it can read your CI errors and build logs.
You want to set up the agent to be able to do the full loop of writing the code, testing that the code works, checking CI, monitoring CI, and reading all the errors. So that way, by the time it gets to a person, everything is set up. The ideal is that you read the code and you have very clear indications that you can be high confidence to merge it. And the only way for that to be true is if it is set up for success. It's interesting. I remember when we first met you were talking about your vision of everyone being able to run hundreds of agents in parallel and how that would work. And I feel like I didn't really get it at the time, although now every night I'm running hundreds of agents. And I feel like now I'm finally there, but this is a thing that you've been thinking about for a long time. So it feels like this is sort of the setup in order to be able to scale up agents way more. You need the self-verification so that agents can run autonomously. Yeah. This is done through many iterations in Bun's codebase. We previously just had a Discord bot where I'd mention the bot and it would spin up a container. It didn't have the CI stuff. It didn't have the code review stuff. And it's so much better now, especially with Opus 4.7. All this stuff is getting so much better. Oh yeah, we can also check on how it's going. It looks like it created a PR. The first one. It wrote tests. So maybe while we look at it, I'm curious just to get a show of hands from people in the room. As people think about their development process, raise your hand if it looks something like this, where you have a bunch of terminal windows or desktop tabs and you're pasting in issues. Okay. So that's maybe half of people.
And then what if it looks like RoboBun or something more like that, where it's closing the loop a little more? It's like the next level of abstraction. Sort of getting there. Yeah. I think it's not surprising, because I think model capabilities are just getting there. I think 4.7 is the first model where it's really felt like it's able to do this. In the past maybe you could do it with a bunch of scaffolding. You'd just throw a bunch of tokens at it and it could kind of work. But now it's efficient enough that you can actually do this day to day. Yeah. Let's see. So the first PR is there. Let's see if it did any others. Okay. It did two PRs. This looks very plausible. It's cool. And this before/after, do you tell it to do that, or is it...? Sometimes it does it. It's pretty good about knowing when it should do that, like when it's a string formatting thing. It also kept one style of the label, which is good, because Node's style is slightly different. Let's see. Does this change look good? Yeah. Mostly what I'm thinking right now is, it did this. This is good because you don't want to write one byte at a time. You want to write in chunks. And then it used saturation too. But I don't like that. Or like, it shouldn't have to do that. Does anyone here actually know Zig? Yes. You're looking up. It's good. And you can see it followed the patterns from the CLAUDE.md: await using. And then that pattern of resolving all the promises at the same time. And so what's your workflow? When you see something like this, are you usually going in and commenting, or are you just going to wait for code review to come in and drop a comment? Usually it depends on how complicated it is. In this case, this one is actually pretty simple. I feel pretty high confidence that if the tests pass, then I would probably merge this.
But I still would wait for the code review, at least for the Claude Code review one, to run, just in case. Because what I really like about that is it will find things that aren't in the diff, that are from tracing the control flow, which is what you want when it's a human reviewing it: somebody who has a lot of context who can think, what are all the edge cases that this might run into. And the signal-to-noise ratio is pretty good. It's something like maybe 10% of the time it's wrong. And compare that to how it used to be with other code review products that we've tried, where basically you had to ignore most of what it said. It's pretty cool. How long has something like this worked? Is it a latest-model thing, or have you had RoboBun or this kind of automated repro, automated fixing, this whole pipeline, how long has that actually been possible? We can probably see this in a chart somewhere. It's kind of a lot of commits, but I think that might not be on main, that might be the Rust thing. Yeah, I heard Bun is going to be written in Rust soon. Is that...? I don't know. I just have a Claude running and we'll see what happens. But you can see the volume of commits there is kind of lower, and then it's definitely gone up a lot. Now really the bottleneck is, do I feel good about merging this, and my confidence that its changes are correct. And that's new, because it used to be that the code wasn't good enough. What do you think is left? Like, what's it going to take? Is there a missing tool or a missing model capability or model version or something for you to feel like RoboBun can fully close the loop: issue comes in and then fix goes out automatically. I think it needs a little bit more. It takes a lot of time to verify the changes are correct.
This was kind of already true when it's a person pushing up a PR. But I think the challenge is, how do we make sure to communicate a sufficient proof that the changes are correct, or make it easier to roll back things. I think those are kind of the two directions. But I think for the majority of simple issues we should probably be pressing merge a lot more, and the bottleneck now is actually CI, fully running the code and making sure all the test stuff works. But I think it's basically there. And the large projects are still nontrivial, but also I've been doing some pretty large PRs lately with Claude, not as much with RoboBun but with Claude Code. Like we recently added support for a built-in image processing library to Bun. I could probably pull up the PR. And that was Claude, and also we did a bunch of follow-up PRs too. Yeah, it's interesting, because I think when I look at different people using Claude Code, everyone is at a kind of different level of sophistication or adoption of this. And I think for me the hardest thing is the model changes very often. So I have to constantly retune and recalibrate to what it can do. And as an engineer it's hard because it's a very weird technology. It's the first technology I've used that's like that. And I sort of feel like the way that you do it is ahead of how the Claude Code team does it for Claude Code itself. And to me, the way the Claude Code team does it is actually very automated, but this is even further ahead. This is almost like full liftoff, a fully closed loop. Like in the last two weeks we've added an HTTP/3 server to Bun. There's a PR for an HTTP/2 server. There's fetch support for HTTP/3 and HTTP/2. There's this image processing API.
There's the ongoing Rust rewrite, which may not ship. That's the most ambitious one I've done so far. And 'done' is too strong a word, because it's very much not done. So even something like this, this is a benchmark. So Claude ran the benchmark for you? Yeah, Claude ran this benchmark. This ran separately, on a Linux box. And I was like, make it faster than sharp. And that's basically what I did. I gave it a few ideas, like, oh, you could try reading this code in JavaScriptCore to figure out how to avoid cloning the typed array when it's not strictly necessary. But it then went and did it and figured it out. And yeah, it's pretty crazy honestly, because none of this would have worked several months ago. Yeah, I feel like within Anthropic, within an AI lab, you call this kind of thing hill climbing, and this is the idea that if you give the model some sort of metric and then you give it a way to verify its result, you can just make it iterate and keep going and keep going until it hits that metric. And this is something 4.7 I think is uniquely good at. And I think it's something really underutilized, because I think it's the first model that's actually very good at that. If you just give it a target, you give it a way to improve the performance, and you give it a way to measure, it will just keep going until it's done. If you let it go in auto mode. Yeah. And you can also see this is another case where the code review comments were really, really helpful, because in this PR there were like a hundred comments or something. And it's just going and fixing everything. Yeah. Yeah. It goes on for a while. And in the meantime you're just working on something else. Yeah. This was not the thing I was 100% focused on. I was maybe like 10% focused on this.
I was doing like five things at once. And this definitely wasn't possible six months ago, three months ago. It's very recent that this is doable. Okay. So how are the sessions doing? Yeah. So we have one PR there. There's another PR coming up, it looks like, pretty soon. This one should be the trickiest one. It mostly looks good though. It looks plausible based on these changes. I wouldn't exactly do it this way, but I think we need a better or more optimized way to do this, because that's a lot of checks. And looking at your setup here, you mostly use the CLI? Yeah. And do you always use auto mode for permissions? Yeah. And before that I used --dangerously-skip-permissions. No, you guys, it can delete stuff if you do that. I don't know. I think I'm not supposed to recommend that. But I think it's just not fun to have Claude wait for you to press approve, because then you go off and do something else and it's just been sitting there. So that's why auto mode is really good, because it's actually a real way to fix that instead of just trusting. And I also notice the input, the little composer, it's stuck to the bottom of the screen, so you're using no flicker mode. Yeah. I'm using no flicker. Honestly I think we should just make that the default, because it's so much better. You can see I can scroll really fast. And you could scroll fast before, but sometimes there would be a flicker. And now there's not. Have folks tried no flicker mode for the CLI? Yeah. A few people. Yeah. So we launched it on April Fools'. So I think in hindsight it came across as a joke a little bit. You actually do CLAUDE_CODE_NO_FLICKER=1 claude. So just set that environment variable. We totally rewrote the renderer that's running in the CLI. So it's using virtualized scrolling, virtualized selection.
And so what this means is constant memory usage, constant CPU usage. And also some nice stuff, like if you make a typo you can actually click around the composer. So you can actually click, and mouse events work, which is pretty crazy for a terminal. So I'm just also having it monitor the PR. You can see it's running some commands, and then it's going to go to sleep for 20 minutes and wake back up. 20 minutes is probably a little bit too long, but it's okay. And what's that, like using a loop or something? I think so. Yeah. And then let's see how else it's doing. And the other ones are still... apparently it fixed an extra bug as well. Okay. So it's been like 20 minutes or so, 25 minutes, and how many PRs have we got? Three PRs. Three PRs. That's not bad. Yeah. And I think we'll get a fourth one once it finishes running this. And then in the meantime, RoboBun is still running and generating even more PRs. Yeah. Every time somebody submits an issue it tries to reproduce it. Yeah. I kind of feel like the way Claude Code makes you think is, every time there's a new bottleneck you have to automate that bottleneck, and then there's always some other bottleneck after, and you move on to that. It's sort of like writing code was the bottleneck and it's no longer the bottleneck. And then verification and running tests, that was the bottleneck and it's no longer the bottleneck. And now there's a deeper layer of verification. Maybe that's it. What do you think of the bottlenecks remaining? It's definitely this deeper layer of verification. I feel like the bottleneck after that's going to be planning: what should we do and what should we not do, and what is the right way to fix this. And ideally Claude would be smart enough, or we could trust Claude enough, to merge the PRs by itself.
And I think in certain projects you could probably do that and just have that be automatic completely. I think it's not yet there for Claude, or sorry, for Bun. But I think it'd be really cool if we had the tooling for us to feel confident enough to do that. So right now RoboBun doesn't build features. It doesn't do feature requests yet. That's true. Yeah, it doesn't do feature requests, but we do also use it that way sometimes. We can also @-mention it in either Discord or Slack and it will try to implement the feature. So sometimes when people are like, hey, Bun is missing this thing, I just @-mention the bot and maybe an hour later there's a PR. A bunch of times somebody has tweeted at me something like, can you fix this bug or whatever, and that's basically what I do. And then I reply with a link to the PR. Should we add a RoboBun account on Twitter? And I think, so it can do feature requests, but I'm hesitant for it to implement literally everything anybody asks for in a GitHub issue, because that's kind of a lot. Because in some ways it's kind of crazy to put something like an image processing library inside Bun. But we talked about engineering taste, and there's an element of taste that goes into that. You felt like that's a good idea. And we're not sure yet if Claude is at the point where it would also think this is a good idea. But you know, at some point in the future it will get there. Yeah. And I do think that PRs become suggestions. Not merging PRs used to be, you feel bad if you don't merge a co-worker's PR because they put work into that. But you don't have to feel bad when it's Claude. So if the PR is wrong for whatever reason, then you can just not merge it. But it does mean that the bar for what you merge is, should it be there.
Because I think there's also a difference with when it's people, because with people you don't want them to feel bad about their lost work. So in some ways it does actually end up raising the bar for what you decide to merge. Yeah, it's interesting. As the bottlenecks move, the dynamics change a little bit. It's sort of like having to trust each other, having to trust people on the team. This changes a little bit. Now it's a little bit more about, do we have the right automation and do we trust the automation, as a group. Yeah. So I think we're almost at time. Is there maybe one last thing we want to show people, where we can check in on the progress that we've made? I don't think so. Yeah, it's still going, one last one, onto that fourth PR. It's going back and forth, like it found a bug and then fixed a bug. Looks like it's about to submit the PR now. Okay, let's wait for this one. It's about to have one more. This is the cool thing about auto mode. In auto mode Claude runs for hours and hours at a time. I run it almost every night. I'll just have a bunch of Claudes running in auto mode. And before this, it just didn't work, because it always got stuck at some kind of permission request. And what's crazy is this entire thing was one prompt. And that just ran for 30 minutes. Yeah, this is all I said. Okay, so it's pushing it. It's about to submit the PR. That sounds like the right fix. This has been an issue that's been open for a long time. And we got a PR. Yeah, we can go in this issue and we can see how many upvotes it has. 20. Yeah, kind of a lot. Cool. Maybe we can pause there. But to me, this is just such a cool vision of where engineering is going. I think for everyone in this room, you know, we're going to see this first. We're going to have to figure it out first, and then everyone else is going to have to figure this out.
So, you know, like we were talking about this morning, just excited to be on this journey together. And like you can see, we haven't figured everything out yet. But I think the mode that we're in is just constantly experimenting, constantly trying to see what the next bottleneck is so we can solve it. It's very exciting because it's so cool. I like this stuff. Cool. Should I, like... yeah, we can leave it.
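
Early in the session, the speakers describe a hard requirement for bot PRs: the new test has to fail on the previous version of Bun before the PR can be submitted. A minimal sketch of such a gate is below. It assumes the same test should also pass on the patched build, and it uses hypothetical names (`regression_test_verdict`, `fake_run_test`) with a stand-in test runner; it is not RoboBun's actual implementation.

```python
from typing import Callable


def regression_test_verdict(
    run_test: Callable[[str], bool],
    baseline_build: str,
    patched_build: str,
) -> str:
    """Classify a bot-written test by running it against both builds."""
    if run_test(baseline_build):
        # Passing on the old build means the test never captured the bug.
        return "reject: test does not reproduce the issue"
    if not run_test(patched_build):
        return "reject: fix does not make the test pass"
    return "accept: fails before, passes after"


# Hypothetical stand-in: pretend only the patched build contains the fix.
def fake_run_test(build: str) -> bool:
    return build == "patched"


print(regression_test_verdict(fake_run_test, "baseline", "patched"))
```

Checking the baseline first filters out tests that would pass regardless of the fix, which is the case a reviewer cannot easily spot from the diff alone.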
