Caching, harnesses, and advisors: Building on Claude at GitHub scale

Running AI inference efficiently at GitHub Copilot's scale hinges on rigorous prompt caching: the team targets cache hit rates above 94% to keep costs down.
Pairing models, as in the "advisor" and "critic" patterns, lets a cheaper model do most of the work and call on a more capable, more expensive model only for the hard parts, balancing quality and cost.
Integrating new AI models requires a methodical approach involving extensive online benchmarking, A/B testing, and continuous optimization of prompts and tools within a multi-model "harness."

Prioritize Prompt Caching for Efficiency: Implement robust prompt caching; at billions of calls, even a 1% efficiency gain means significant cost savings. GitHub needs cache hit rates above 94% to operate at scale, and a rate near 70% usually signals a bug.
Keep Prompts Static: Ensure system prompts and tool prefixes contain no dynamic content (e.g., UUIDs) to prevent cache invalidation and maintain high cache hit ratios.
Use Regression Tests for Tools: When dynamically loading or changing tools, implement thorough regression tests to ensure changes do not negatively impact caching or overall system performance.
Longer Context Windows Can Be Cheaper: Counterintuitively, longer context windows can be more cost-effective than frequent "compaction" (summarization), as compaction generates expensive output tokens and invalidates the cache more often.
Employ Advisor/Critic Models for Cost-Effective Intelligence: Use a "junior" executor model like Haiku for most tasks and sparingly call a "senior" advisor model like Opus for hard problems, or use a "critic" model (Rubber Duck) to review plans or complex implementations before execution.
Measure Outcomes, Not Just Activity: Beyond metrics like acceptance rate, track "survival rate" for generated code to ensure the AI's output delivers lasting value and truly impacts developer flow and team velocity.
Embrace Online Evaluations for Model Integration: While offline benchmarks provide a baseline, online dogfooding and A/B testing with real users are crucial for fine-tuning new models and their integration into a multi-model harness.
Optimize Tooling per Surface: Manage the number and complexity of tools carefully, tuning them specifically for each user surface (e.g., VS Code, CLI, mobile) to avoid confusion and optimize performance.
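
The "keep prompts static" and "regression tests for tools" takeaways can be sketched as a small check. This is a minimal illustration, not GitHub's implementation: the request shape, `build_prefix`, and the tool definitions are hypothetical, and the only dynamic content it looks for is UUIDs.

```python
import hashlib
import json
import re

# Matches canonical 8-4-4-4-12 hex UUIDs, one common kind of per-request content.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Hash the cacheable prefix: the system prompt plus the tool definitions.

    Dict keys are sorted for a stable encoding; list order is preserved, so
    reordering tools also changes the fingerprint.
    """
    canonical = json.dumps(
        {"system": system_prompt, "tools": tools},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_uuids(system_prompt: str, tools: list[dict]) -> list[str]:
    """Return UUID-looking strings that leaked into the prefix."""
    return UUID_RE.findall(system_prompt + json.dumps(tools, sort_keys=True))


def build_prefix(session_id: str) -> tuple[str, list[dict]]:
    """Hypothetical prefix builder; session_id is deliberately kept out of it."""
    system_prompt = "You are a coding assistant. Follow the repository conventions."
    tools = [
        {"name": "read_file", "description": "Read a file from the workspace."},
        {"name": "run_tests", "description": "Run the project's test suite."},
    ]
    return system_prompt, tools


# Regression check: two different sessions must produce an identical prefix.
fp_a = prefix_fingerprint(*build_prefix("session-a"))
fp_b = prefix_fingerprint(*build_prefix("session-b"))
assert fp_a == fp_b, "prefix changed between sessions; the cache would be invalidated"
assert find_uuids(*build_prefix("session-a")) == []
```

Run in CI, a check like this fails as soon as someone interpolates a per-request value into the system prompt or changes the tool list between sessions.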

inference: The process of using a trained AI model to make predictions or generate outputs based on new input data.
prompt caching: Reusing the already-processed prefix of a prompt (system prompt, tool definitions, earlier conversation) across requests so the model does not recompute it, significantly reducing cost and latency.
cache hit ratio: The percentage of times a requested item is found in the cache, indicating the efficiency of the caching system.
system prompt: The initial, often static, instructions or context given to an AI model to define its persona, rules, or task.
tool prefix: The initial part of a prompt that defines the available tools or functions an AI model can use.
multi-model harness: A system or framework designed to integrate and manage calls to multiple different AI models (from various providers or families) within a single application.
context window: The maximum amount of text (tokens) an AI model can process or "remember" at one time during a conversation or task.
compaction: The process of summarizing or shortening conversational history or context to fit within a model's context window, often at the cost of generating more output tokens.
advisor model: A larger, more capable AI model (e.g., Opus) that is sparingly consulted by a smaller, less capable "executor" model (e.g., Haiku) for difficult tasks to optimize cost and intelligence.
critic model (Rubber Duck): An AI model used to review or critique plans, code implementations, or tests generated by another model, improving quality before full execution or deployment.
offline benchmarks: Performance evaluations of AI models conducted in a controlled environment using predefined datasets, often before real-world deployment.
online evals (dogfooding): Real-world testing and evaluation of AI models by internal teams or a subset of users, providing feedback and performance data after deployment.

Please welcome to the stage Chief Product Officer of GitHub, Mario Rodriguez.
Hello, hello everyone. It's great to be back at Code with Claude. I love developer conferences. You know, they're a little bit messy. They're vibrant. They're still dev first, not agent first. And this is where we get to have fun. We're here to have fun and to learn together. Now, you're here because you want to learn more about the platform. And I'm here to share with you some of the top things that we end up doing to run Copilot and all of our inference on top of this platform.
Now, I want to get started with the why. Like, we're very mission driven. Our vision is to empower developers, to provide the best tools out there to advance human progress. But when I talk to customers overall, they're telling me something that they want to achieve. They want to achieve a set of outcomes. And for them, they want to keep their people in flow. They want to keep developers in flow. They want their teams to gain velocity. They want teams to actually achieve more with the people that they have. And they want to do this at scale. And to be able to do it at scale, you have to have efficiency. And you also have to have trust. And I start with this because almost every single product decision that I make is grounded on these pillars. Even when we talk about the platform and integrating with the Claude Platform, I want to make sure that I keep developers in flow. I want to make sure that teams gain that velocity. And I want to make sure that companies can do that with intelligence and trust.
Now, as I think about it going forward, I was thinking really hard: what do I want to say here? If I was sitting there, what I would like to understand is, okay, how do you operate? We do billions and billions, and probably, over a big enough time span, trillions of messages against the platform.
So I wanted to give you the best practices, or maybe not even best practices, because things change so often, almost on a weekly basis. But the key learnings, the things that are at the bottom of almost all the decisions that we make to be able to integrate with the platform and achieve the scale and the efficiency that is necessary. So I want to divide it into three things.
Number one is prompt caching. Without that, we're not dead, but oh my god. The amount of money that we spend compared to what we do, it's incredible, right? So just one percent efficiency on this means a lot to us. It's kind of like high-frequency trading. One percent efficiency means millions overall. So I just want to spend a little bit more time on how we're doing that.
The second thing is we're collaborating with Anthropic on capabilities to make sure that our customers are using the right inference, the right amount of intelligence at the right time. And one of those is this advisor model. We have two ways of looking at it, both through an advisor and a critic. So Brad and I will share a little bit more about that too.
And then the third one is every time a new model drops. If you're integrating with the platform, you know that this happens constantly, and you get a call at 5 AM on a Saturday that's like, hey, we're launching on Tuesday. So how do we actually go through that the right way? How do we make decisions like what is the default model to keep developers in flow? How do we make the decisions to route to the right intelligence at the right time overall?
So I'll start with this. And this is a dashboard. This is not our dashboard, but if you are on top of the platform, I believe Anthropic released this last week, and we got a little preview of it. And what it allows you to understand is how you are doing, you know, your cache hit ratios, how many messages you're sending against the Messages API, right? And I think this is great.
This is the first step, because without data it's really, really hard to make decisions. So if you haven't checked it out, please do so.
Now the one that I want to spend a little bit more time on is ours. And this is just one look. This is not the entirety of all of the dashboards that we have. But it's one where we take a look at deltas between models. In this case, I want you to look mainly at the left-hand side, that's Opus 4.6. And then on the right is 4.7. And then the next column lists the delta between them. So when a new model drops, and sometimes we get it in EAP, we go ahead and run a set of benchmarks against it. So think about Terminal-Bench 2, or one of the SWE-bench ones, and we have our own ones as well. And then we try to decide, okay, how is the model performing? And then we ship it and then we pay attention to that data again. And then after a period of probably 30 days, or maybe sometimes even sooner, we're done with all of the optimizations for that model.
Now you can see over here the median cached tokens, and at the end, the cache rate. For us to operate the service at scale, we need to run above 94, usually 94, 95, 96%. If we operate at 70%, that usually means that we have a bug, believe it or not. We're doing something not right. And then we need to change the approach on how we're calling the model, how we're assembling the prompt, how we're doing the end-to-end to keep that developer in flow. So we pay attention to that. The same thing if I want to change the default model from 4.6 to 4.7: I need to understand what the cache rate difference is, and what it actually means to me. Again, we make billions and billions of calls overall. So just 1% on that, and in this case it's 1.3%, means a lot to us. Now you also have to take into account that, from an input perspective, your cached tokens are only 10% of the cost, right?
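
A rough sketch of the arithmetic behind these numbers, not something shown in the talk. It assumes cached input tokens are billed at 10% of the base input price, as stated above, ignores any extra charge for writing to the cache, and uses an assumed base price of $5 per million input tokens purely for illustration.

```python
def effective_input_price(
    base_price_per_mtok: float,
    hit_rate: float,
    cache_read_multiplier: float = 0.10,
) -> float:
    """Blended input price per million tokens at a given cache hit rate.

    Assumption: cached tokens cost base * cache_read_multiplier and the
    remaining tokens cost the full base price. Cache-write surcharges are
    ignored to keep the arithmetic simple.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached = hit_rate * base_price_per_mtok * cache_read_multiplier
    uncached = (1.0 - hit_rate) * base_price_per_mtok
    return cached + uncached


BASE = 5.0  # assumed dollars per million input tokens, for illustration only
for rate in (0.70, 0.94):
    price = effective_input_price(BASE, rate)
    print(f"{rate:.0%} hit rate -> ${price:.2f} per million input tokens")
```

Under these assumptions the blended price is about $1.85 per million input tokens at a 70% hit rate and about $0.77 at 94%, roughly a 2.4x difference in input spend for the same traffic.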
So there's a 10X difference in there. So if you're constantly invalidating that cache, that's not a very good thing overall, because you're going to be paying 10X more. And then we go across that. In this case, we did a baseline against 4.6 and then also on Haiku. Now imagine it stays like this for a second, and there's a bunch of red, as you can see on this screen. Then we have a decision to make. We have to figure out how to get all of those reds into green. And it's not luck, it's a lot of hard work. And I want to make sure you understand that. Maybe from 50 to 70% it's not, because that's kind of low-hanging fruit. But from 70 to 80, from 80 to 90, and 90 and above, there's a lot of hard work that goes into this, and engineering.
And there are three things that we take a look at, and these are lessons learned. We have made these mistakes before, and that's why I wanted to share them with you. Number one, we put no dynamic content in the prefix, right? You need to keep that prefix as static as possible. As an example, at one moment we had UUIDs in the actual system prompt. Those were getting constantly reset, and that was invalidating the entirety of the cache. So remember, there's the system prompt, then there are tools, then there's the conversation, and then there's the last message that you're sending. That's kind of the hierarchy. So you want to keep that system prompt as stable as humanly possible, with no dynamic content in it.
Then come tools. We made a lot of mistakes in tools at times. If you're loading tools dynamically and you're changing that tools prefix, then all of a sudden the entire conversation gets invalidated again. So you have to do work and you have to make sure you have a lot of regression tests, because you're going to be experimenting with what?
Skills and tools overall. In your end-to-end, you have to have regression tests to make sure that you're not affecting your tools.
Then the third thing overall is that we need to have that cache affinity. And let me tell you, it's really hard when you're doing a multi-model harness. So take, for example, Copilot. The customer can be calling, let's say, Opus, then calling a GPT model, and then going to an OSS model and then coming back to Opus again. And through all of that, I have to make sure that the next Opus call, that last one that happened, actually has the right cache affinity too. So we do a lot of work in our multi-model harness to make sure that that is guaranteed.
So, the next slide. Here's an example of something that we ran, also, to debunk a myth. One of the key things that I've heard many times when I go to customers and they ask me about integrating with the platform is, hey, long context means it's more expensive. And the answer from us is no. And this is a quick test that we ended up running. So there's a smaller context window. And you can see in there that my average compaction went up three times compared to a larger context window. This was something that we simulated. We kept the same model with the same context window. We just ended up doing more compaction on it and filling it at a rate. Now the key thing over here, if you remember the math of input and output tokens, is that the difference is 5x, right? So for Opus, it would be $5 and $25, as an example. In that case, whenever you do compaction, you get 4,000 tokens of output, because you have to summarize the messages, etc. So whenever that happens, you usually end up paying more, because you end up doing a lot more compaction, which means that your output tokens go through the roof. And then your cache rate also gets a little bit more invalidated because of that too.
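
The output-token side of this argument can be put into a small calculation. This is a deliberately one-sided sketch, not the test described in the talk: it only prices the overhead that each compaction adds (writing the summary at the output price, then reading the fresh summary once as uncached input) and ignores what a shorter context saves on later turns. The $5 and $25 per million tokens and the 4,000-token summary are taken from the figures mentioned above; `Prices` and `compaction_overhead` are names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Assumed prices in dollars per million tokens (illustrative only)."""
    input_per_mtok: float = 5.0
    output_per_mtok: float = 25.0


def compaction_overhead(
    compactions: int,
    summary_tokens: int = 4_000,
    prices: Prices = Prices(),
) -> float:
    """Extra dollars spent on compaction events in one session.

    Each compaction is modeled as:
      - writing a summary, billed as output tokens, and
      - reading that new summary once as uncached input, because it
        replaces the previously cached conversation.
    Savings from the shorter context are intentionally not modeled.
    """
    if compactions < 0 or summary_tokens < 0:
        raise ValueError("compactions and summary_tokens must be non-negative")
    write_summary = summary_tokens * prices.output_per_mtok / 1_000_000
    reread_summary = summary_tokens * prices.input_per_mtok / 1_000_000
    return compactions * (write_summary + reread_summary)


few = compaction_overhead(compactions=2)
many = compaction_overhead(compactions=6)  # three times as many compactions
print(f"overhead with 2 compactions: ${few:.2f}")
print(f"overhead with 6 compactions: ${many:.2f}")
```

Because the overhead is linear in the number of compactions, tripling the compaction count triples this part of the bill. Whether that outweighs the cheaper cached reads of a shorter context depends on the workload, which is why it has to be measured per scenario.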
Well, significantly invalidated because of that too. So longer context windows do not mean you're spending more. In fact, what you have to understand is how compaction is being done. And depending on the scenario, you want to manage that for the user appropriately.
So, at a high level: instrument your cache rate, and have a dashboard that actually shows it to you. Invest the time to do it. They're already shipping one for you, so at the very least use that one. But go and invest in deltas, pre-model launch and post-model launch as well. Now, no other efficiency matters until you get that one right. You want to be driving that from 50 to 70 to the 90s overall, and it will take a lot of hard work, a lot of engineering time. And then you want to measure per surface. We have VS Code, which is one of the dashboards that we showed you there. But we also have the Copilot CLI. We also have our coding agent in the cloud. We also have IntelliJ, and we have mobile too. So it's a lot of surfaces where you're going to have to either share the harness across them or tweak individually. So you want to have a very good sense of: we shipped something, did we regress or not; we shipped something, did we improve or not.
Now, another thing that I was telling you about is, okay, if you've got prompt caching already done, what about making sure that the right intelligence is used at the right time? And for that, Anthropic and us have been partnering on this advisor model. And what I want to do is bring up Brad Adams from the Anthropic team to talk more about it. Go ahead.
Right. Oh, thank you, Mario, for being here. It's so great to have a partner like GitHub Copilot. The Copilot team gives such great feedback. We give them very little time to evaluate models before we launch, and they give great and insightful feedback. And it's true on our API features as well.
In fact, if you're using the Claude Platform, a lot of what you're seeing is thanks to the things Mario and his team are doing on Copilot. And it's actually one of those pieces of feedback that I want to talk about today. One of the pieces of feedback we got from the Copilot team is that they really wanted Opus-level intelligence, but at Haiku-level prices. That sounds like a good deal, right? Opus intelligence. So I don't get to write the deals, but what I do get to do is build fun API features. So I want to talk about the advisor strategy.
And really the insight behind the advisor strategy comes from software development teams. We all know that if you take a junior engineer and you give them a mentor who's a senior engineer, that junior engineer gets a lot better. Because the senior engineer looks over their shoulder, they review code, they look at design docs together, the senior engineer can make the junior engineer a lot more productive without taking too much time from the senior engineer. And it turns out the same thing is true for models. You can take a junior model like Haiku, which we're listing here as the executor, and give it access to Opus, and you can very conservatively use those Opus tokens. So in this beautiful diagram that Claude created for me, what you see is that the executor, Haiku, is able to identify every shape as it comes in, no problem, except for one little weird shape. That weird shape is beyond what Haiku can do. So Haiku has a tool: it calls the advisor. It calls Opus, and Opus, because it's a bigger model, does more reasoning, and it knows that. And what we see from this in our evals is that we get close to Opus-level intelligence at much lower prices, because we're being very conservative about the tokens that the advisor actually sends. Because of the price difference between the two, it works out really well. And what I'm excited about is an integration that we worked on with the Copilot team.
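
The executor-plus-advisor idea described above can be sketched with stub functions standing in for the two models. This is an illustration of the control flow only, under stated assumptions: no real API is called, `AdvisorHarness`, `stub_executor`, and `stub_advisor` are hypothetical, and here the harness decides when to escalate, whereas in the talk the executor model itself calls the advisor as a tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# (task, optional hint) -> answer, or None when the executor is stuck
Executor = Callable[[str, Optional[str]], Optional[str]]
# task -> short hint from the larger model
Advisor = Callable[[str], str]


@dataclass
class AdvisorHarness:
    executor: Executor
    advisor: Advisor
    advisor_calls: int = 0  # track how sparingly the expensive model is used

    def solve(self, task: str) -> str:
        # Try the cheap executor first.
        answer = self.executor(task, None)
        if answer is not None:
            return answer
        # Escalate once, then hand control back to the executor with the hint.
        self.advisor_calls += 1
        hint = self.advisor(task)
        answer = self.executor(task, hint)
        if answer is None:
            raise RuntimeError(f"executor could not solve {task!r} even with a hint")
        return answer


KNOWN_SHAPES = {"circle", "square", "triangle"}


def stub_executor(task: str, hint: Optional[str]) -> Optional[str]:
    """Stand-in for the small model: handles familiar shapes on its own."""
    if task in KNOWN_SHAPES:
        return f"{task}: identified"
    if hint is not None:
        return f"{task}: identified using hint ({hint})"
    return None


def stub_advisor(task: str) -> str:
    """Stand-in for the large model: returns a short hint, not a full solution."""
    return f"treat {task} as an irregular polygon"


harness = AdvisorHarness(stub_executor, stub_advisor)
results = [harness.solve(shape) for shape in ["circle", "square", "blob", "triangle"]]
print(results)
print("advisor calls:", harness.advisor_calls)
```

Only the one unfamiliar input triggers an advisor call, so the expensive model contributes a short hint while the cheaper model still produces every answer.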
So let's switch over to the demo machine. And what you're seeing on the left-hand side is just GitHub Copilot with Haiku. And on the right-hand side you're seeing GitHub Copilot with Haiku plus the new advisor tool hooked up. So I'm going to hit Enter here to give this time to run. And there's a little kind of brain-teasery problem here that it has to solve. Both sides get the exact same problem. And we see Haiku has taken off, whereas the right-hand side is consulting the advisor. So it's going a little bit slowly. So I'm keeping my fingers crossed here that Haiku will, I mean, Opus will come back. And so on the left-hand side, Haiku is just spinning away, trying a bunch of things. Just like you would see a junior engineer who's very eager, trying lots of things, but hasn't found it yet. On the right side, the Opus advisor just returned. And because it's Opus, it actually knows this bit of data. And so it's able to bring that back into context. And now we're done with Opus. And everything's still in Haiku. But because it got that little bit of a hint, the right-hand side has finished here. So we see the right-hand side finished. The left is still trying to figure out which way is up here. So just that little bit of a hint from Opus, at very small cost and very small latency, makes Haiku so much more powerful. So this is an experiment we're doing in the GitHub Copilot CLI that we'll release soon. And we're looking forward to you playing with it. Thank you. And welcome back, Mario.
Thank you. Thank you. I love some Opus intelligence at Haiku prices. That's kind of what we've got to do. Okay. So we showed you the advisor model. There's another algorithm that you could employ as well, as you're tinkering with how to get the right call to the right intelligence. And, you know, we're GitHub, we're a little bit weird on some of these things, so we call it Rubber Duck. And here's what Rubber Duck is really all about.
And here's a demo I'm going to show you in a second. It's really about inserting a critique at the right moments in time. So I have a feature. That feature, you know, had an issue, and over here I'm implementing it at the moment. And what I want to highlight, and it's going to come really soon, is that the model now is asking for a critique. And you can see how it's asking for a critique from 4.6 and 4.5 Opus. Then it receives that critique and is able to change the plan prior to implementation, and then continue with that implementation the right way. And then we get that feature and we deploy it. So that's very, very quick. But really what I wanted to highlight is a different model. A different model that acts less as an advisor and more as a critic: hey, here's what I think you should do.
And we insert this in three core places. One of them is after drafting a plan. We have a lot of users that do a plan first, and then after drafting that plan they go into execution. We have another one after a complex implementation. So think about it: I just finished all of this, go and critique. It's like a pre-code review, to a degree. And we do it there sometimes because it ends up saving tokens compared to waiting all the way until an official code review. And then the other one is after writing tests but before running them. In certain places where your CI suite with tests takes a significant amount of time, you can see how that gets you to the end faster and keeps that developer in flow. So those are the three places that we're doing it right now. And from my end, what I really have seen work well is at the plan phase. A lot of these systems are getting really, really good at planning. And if you catch it there, then you end up with the most gains afterwards as well. Now, Rubber Duck is already in experimental. So if you download the Copilot CLI and you enable that experiment, you're going to see it there.
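
The three insertion points described above can be sketched as a gate around a critic call. This is a minimal illustration with a stub critic, not the Copilot CLI implementation: `Stage`, `checkpoint`, and `stub_critic` are hypothetical names, and the revision step is reduced to appending the feedback.

```python
from enum import Enum, auto
from typing import Callable, Optional


class Stage(Enum):
    PLAN_DRAFTED = auto()
    IMPLEMENTING = auto()
    COMPLEX_IMPLEMENTATION_DONE = auto()
    TESTS_WRITTEN = auto()
    TESTS_RUNNING = auto()


# The three moments named in the talk: after drafting a plan, after a complex
# implementation, and after writing tests but before running them.
CRITIQUE_STAGES = frozenset(
    {Stage.PLAN_DRAFTED, Stage.COMPLEX_IMPLEMENTATION_DONE, Stage.TESTS_WRITTEN}
)

# (stage, artifact) -> feedback, or None when the critic has no objections
Critic = Callable[[Stage, str], Optional[str]]


def checkpoint(stage: Stage, artifact: str, critic: Critic) -> str:
    """Ask the critic for feedback only at the configured stages."""
    if stage not in CRITIQUE_STAGES:
        return artifact
    feedback = critic(stage, artifact)
    if feedback is None:
        return artifact
    # A real harness would hand the feedback back to the main model to revise.
    return f"{artifact}\n[revised after critique: {feedback}]"


def stub_critic(stage: Stage, artifact: str) -> Optional[str]:
    """Stand-in for a second model reviewing the artifact."""
    if "rollback" not in artifact:
        return "add a rollback step"
    return None


plan = checkpoint(Stage.PLAN_DRAFTED, "1. migrate schema 2. deploy", stub_critic)
print(plan)
```

Gating on stage keeps critic calls to a handful per task, and catching a problem at the plan stage avoids paying for an implementation that would have to be redone.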
You can also invoke it whenever you want. So you can say, hey, create a plan on this and consult with Rubber Duck. And then you get a quick critique across model families as well.
Now, let's continue moving. The third thing that we wanted to talk about is, okay, how do we take new models at scale? What is the process that we follow? At the very beginning, it was messy for us. But over the last two and a half years or so, we've done a lot of work to make sure we're very methodical about how we take a model. So you can see, Anthropic has a model to try. We onboard that model into what we call CAPI, which is our Copilot API. And then we have an endpoint. And from that endpoint, three things really need to happen. Number one, as you all know, you have to go ahead and work on the harness and work on your prompts. For us, we have a multi-model harness, right? So it's not just one model family. So we have to go over there and make sure that we update the system prompts. It's a very, very close loop with Anthropic on that. We then have to make sure the tool interfaces are right and optimized. We then have to, and we don't do this as much anymore, work on the agent loop a little bit. And then a lot more work goes into context management, into compaction, into making sure that we're hitting the right cache hit rates overall.
Then from there, we do two things. We run offline benchmarks. And then we also do internal dogfooding. Think of that as online benchmarks overall. So a lot of the Microsoft developers use it, a lot of GitHub developers use it. And we have a set of other users that we partner very closely with as well, to give us feedback. So we have that offline, and then we have that internal dogfooding. Then from there, we share findings with Anthropic. And this is a presentation. Before, we used to write a document, and now we write a document and kind of expand it in detail, where we're like, this is everything that we're seeing.
And then we go on a loop with them on changes that we need, either in the API, or even changes to some of those models on the checkpoints that we end up getting. And then from there, we do another loop all the way to the release of the model.
Now, two very important things. Yes, we do offline. And offline gives you an indication, but I would say it's not always going to be reality. You learn a lot more from your online evals overall, and the online experiments after launch, than really the offline. Offline just kind of sets a base. And if that base is consistent, you kind of know what to expect out there. It gives you a pretty good indication of what to expect. But all of the details do not get done through offline. So it usually takes us days, sometimes even weeks, to tune everything for that model through those online experiments. We do a lot of A/B testing and we're very methodical about that. And we do weekly reporting that we report up to Anthropic and also across the entirety of the teams that we have, to make sure that we are tuning the right model at the right time.
So when it comes to optimizing the harness, for us there are kind of four things. One is build prompt and context. Then there's calling the model. Then there's execute tool, and then, depending on the results, you loop back. And the places where I spend the most time with the team are usually execute tool and build prompt and context. Tools are very important to us. The more tools you have, the more confusion ends up happening, and the more you have to tune. So hundreds and hundreds of tools is not a good thing. You want to make sure you're tuning by surface and you're tuning for the exact scenarios with the right tools in that package. So we spend a lot of time in there, optimizing for that model. And then there's check permissions and then run. We end up again in that execution of the tool.
You have to do the right things to pass the right context to it and then to read the right results overall. So whenever we want to introduce a new tool, as an example, we spend a lot of time making sure that we optimize the harness both at the model level and at the tool execution level.
So what does this mean at a high level? There are two things. You have to go in and have both published benchmarks and internal benchmarks. But more importantly, dogfood with A/B tests and make sure that you have online evals set up. And make sure that you can trust them with stats as well. Then the second thing is, you want to measure outcomes, not activity. When I talked about the tools, you know, as an example, acceptance rate on a code line, that's okay. Survival rate is actually an even better metric overall, because if you ended up accepting it and then you ended up deleting it after, well, that did not accomplish the outcome. So even though I could have an amazing acceptance rate, if the survival rate ends up being very low, that means that we did not do the right work. So it's very important, if you're in product, and product really changes in the era of AI, that you're optimizing for the outcomes and not just the individual rates. Like I said, you end up optimizing for a metric, and then that tanks the measurement of your entire product.
Okay, so with that, again, we walked through prompt caching, the advisor strategy, and then we talked a little bit about measurement and how we improve our harness. And then from us, I just want to leave you with a thank you. Really excited again to be with you. Hopefully you learned a couple of things today.
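
The difference between acceptance rate and survival rate can be made concrete with a small calculation. The talk does not define survival rate precisely, so this sketch assumes it means the fraction of accepted suggestions still present in the code after some chosen window; the `Suggestion` record and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    accepted: bool
    # Assumption: True if the accepted code is still present after a chosen
    # window (for example, a few days). The window itself is not modeled here.
    survived: bool = False


def acceptance_rate(suggestions: list[Suggestion]) -> float:
    """Share of all suggestions that the developer accepted."""
    if not suggestions:
        return 0.0
    return sum(s.accepted for s in suggestions) / len(suggestions)


def survival_rate(suggestions: list[Suggestion]) -> float:
    """Share of accepted suggestions that were not deleted afterwards."""
    accepted = [s for s in suggestions if s.accepted]
    if not accepted:
        return 0.0
    return sum(s.survived for s in accepted) / len(accepted)


# Hypothetical sample: 8 of 10 suggestions accepted, but only 2 of those kept.
sample = (
    [Suggestion(accepted=True, survived=True)] * 2
    + [Suggestion(accepted=True, survived=False)] * 6
    + [Suggestion(accepted=False)] * 2
)
print(f"acceptance rate: {acceptance_rate(sample):.0%}")
print(f"survival rate:   {survival_rate(sample):.0%}")
```

In this made-up sample the acceptance rate is 80% while the survival rate is only 25%, the situation described above in which an activity metric looks healthy even though most accepted code did not last.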
