Skip to main content

Getting more out of the Claude Platform

TL;DR

  • Prompt caching is the single most important technique for production agents, offering up to a 90% discount on input tokens, faster response times, and exclusion from API rate limits.
  • Context engineering is crucial for managing token usage and improving model intelligence by proactively deciding what information is truly necessary within Claude's context.
  • The advisor strategy allows agents to achieve high intelligence at a lower cost by leveraging an inexpensive "executor" model for most tasks and consulting a more powerful "advisor" model for critical, high-impact decisions.

Takeaways

  • Implement prompt caching for long-running agents to significantly reduce cost (up to 90% discount on input tokens), lower latency (especially time to first token), and prevent cached tokens from counting against API rate limits.
  • Use the Claude Console's prompt cache dashboard to monitor your agent's cache hit rate and aim for 90% or higher, leveraging the prompt caching skill in Claude Code to guide marker placement and prompt reorganization.
  • Adopt a disciplined approach to context engineering by actively reviewing Claude's full transcript and making proactive decisions about what data is included, rather than relying on abstractions that obscure context.
  • Utilize the Tool Search Tool to defer loading numerous tool declarations, only adding them to the prompt when the model specifically needs them, thereby optimizing context and potentially increasing model intelligence.
  • Employ Programmatic Tool Calling to manage large tool results; instead of stuffing all data into the prompt, the model writes Python code to inspect schemas and extract only the necessary bits into its context.
  • Implement Compaction for long-running agents to summarize and remove stale conversation turns, preventing context window limits from being hit while maintaining the model's understanding of the ongoing thread.
  • Apply the Advisor Strategy by pairing an inexpensive model (e.g., Sonnet) as the primary executor with a more powerful model (e.g., Opus) as an on-demand advisor for complex or high-stakes reasoning tasks, balancing cost and intelligence.
  • Explore Workload Identity Federation for enhanced security concerning API keys and leverage the Anthropic CLI for command-line management and integration with Claude Code.

Vocabulary

Prompt Caching — A technique that stores and reuses common segments of a prompt's KV (key-value) values to reduce processing time, cost, and latency in subsequent inferences. KV values — Key-value pairs used internally by transformer models to represent parts of the input sequence, which can be pre-cached to skip the initial part of inference. Agentic Loop — The iterative process where an AI agent plans, executes actions (like tool calls), observes results, and refines its plan based on new information, often involving multiple turns. Context Engineering — The discipline of deliberately structuring and managing the information (context) provided to an LLM to optimize its performance, cost, and intelligence. Tool Search Tool — A feature that defers the loading of tool declarations into the LLM's prompt until the model determines they are specifically needed for a task, saving context space. Programmatic Tool Calling — A method where the LLM writes and executes Python code to selectively extract and use only relevant data from large tool outputs, rather than including the entire output in its context. Compaction — A technique for long-running agents that summarizes and removes past, unused turns of a conversation or tool calls from the LLM's context to prevent exceeding context window limits. Advisor Strategy — An architectural pattern where a smaller, less expensive "executor" LLM handles most tasks, but can call upon a more powerful and intelligent "advisor" LLM for help with complex or high-impact decisions. Workload Identity Federation — A security mechanism that allows workloads (like AI agents) to authenticate directly with cloud services using short-lived credentials, reducing the need for long-lived API keys. Anthropic CLI — A command-line interface tool provided by Anthropic that allows users to interact with and manage various aspects of the Claude platform programmatically.

Transcript

Please welcome to the stage product management lead claw platform of Anthropic Brad Abrams. Thank you. Well good good afternoon you've almost made it to the end of code with claw. Thank you for hanging in there with us. But I got to tell you you are in the most important session of the day. So you made a good choice being here. Thank you. This is an important session because we're going to talk about putting agents into production. So you're going to have a real world case of agents in production. And you know just to get us started. How many of you already have an agent in production? Not some demo proof of concept but actually in production. Okay now keep your hands up. Keep your hands up if you're happy with the cost, reliability, latency. You can put your hands down but if you're sitting next to one of these guys you should just talk to them after and get the real tips. But we're going to talk through a few things in this session that will help you manage things like cost and latency and reliability a few techniques. So let's drill in here. And it turns out the most important technique to think about is prompt caching. I caught a couple of sessions. A lot of people are mentioning it. But if you're not already doing prompt caching then you absolutely should. And I'm going to tell you I can look at the analytics and I know a few of you are not doing prompt caching. So it's absolutely worth mentioning this. With long running agents, prompt caching is very important because the context continues to grow over tool call after tool call. You have the tool call, the tool resolve, and then the tool call and tool resolve again. And all of that gets appended to the prompt. And so there's a lot of common elements in that prompt that unless you do prompt caching, we're reprocessing that every time. With prompt caching, if you mark which sections are common in your prompt, then we're able to compute the KV values essentially pre pre cash the models, you know, part of the inputs to the models in KVs and save those. And that saves a lot of latency when that happens and a lot of processing time. We skip the whole first part of inference when you do that. And so it's a big cost savings. In fact, it's a 90% discount. So if you're doing a long running agent, you're not doing prompt caching, you're missing out on a 90% discount on input tokens, which is the largest share for most customers. You also get faster response time, especially time to first token. And then a little known fact is that it also prompt cash tokens don't count against your API rate limits. So if you have a rate limit, and you want to like manage that rate limit as well as you can, the cash tokens don't count against that. So we almost 10X those rate limits today, but still if you still want to manage those, you can with this. And some of our customers have done a really amazing job with prompt caching. I just want to highlight a couple of these cursor, a replete, perplexity, have all are all well into the 90s. And they have done significant engineering effort. I've sat with some of these engineering teams, and I can tell you they have done a lot of effort to get prompt caching into the 90s. But I'm going to tell you a hint. So we have a couple of tools that are available to you right now that you can use that make it way easier. The first is on the screen here, you can go and do some deep inspection of what's happening with prompt caching in your app. So I encourage you feel free if you would like to pop open your laptop, go to the claw console and check out the new prompt cash dashboard under analytics. And you can actually see what your production agent is doing in terms of prompt caching. And if it's not in the 90s, you have some work to do. So that's the first thing. The second thing is we've recently launched a skill for a claw code that's an expert at prompt caching. And it's installed by default with claw code. So all you need to do is go to claw code and say improve my cache hit rate. And clawed will walk you through the process of adding the cache control markers in there, maybe reorganizing some of the prompt. Just walk you through that process so you can get a very high cache hit rate. So that's an absolute no brainer. I think you should, I think you should take a look at doing. So let's take a look at a demo of how prompt caching looks. So I'm going to write bin on stage, bin, come on out here. And we're going to take a look at this demo. So bin and I worked on this demo. It's executive dashboard. So let's say you're the CEO of a company and you have all these objectives and we're running it. Wait, bin, is this the demo we agree? We're at code with claw. This looks like a like late 90s share point UI. I don't know. What do you have claw code? Okay, pop up in claw code. Let's see what we can do. Okay, bins got claw code. And it's attached. He's got the source code for this. And we're going to see if we can improve this theme. So how many people want a better theme? Okay, there we go. Okay, so a better theme is this like a little bit more appropriate for the venue that we're in. We are now not some boring 90s CEO. We are the CEO of hero corp AI. And what hero corp does is rinse out superheroes to battle villains and protect metropolis, come to your child's birthday party, whatever they do. And then we're seeing the objectives. This is objective one. I'm told that retention of superheroes is a very important thing. And so objective one is around retaining them. And bin, I don't know. Maybe we're not paying superheroes enough. It looks like it looks like they're a little low here. And you can see some updates from each of the superheroes. And then a CEO, some tasks that we can go do. So, bin, you know, what's the cash hit rate on this? Do you know? No, no, okay, okay. You don't know what the cash hit. Okay, okay. So first off, you got to know what the cash hit rate is. So what we've done is we've implemented a dev console for this little demos or slide open the dev console. And let's take a look. So in this little dev console, what you see here is our context usage tool calls. And then there's this agentic transcript that's happening. So don't you wish all your apps came with this beautiful dev dashboard. But I noticed in this dashboard, one thing that's standing out to me immediately is the cash hit rate is like zero. I mean, I don't know. So, bin, is there something we can do to improve the cash hit rate? So he's going to open up Claude Code, go back to that and just and just improve the cash hit rate. And now we'll rerun that. And notice when we're rerunning it, we're hitting all those same tool calls again. But this time in the agentic transcript, you're seeing cash rights and cash hits. So cash, the first time the inference system sees a prompt segment, it writes it to the cash. So we store those KV values for five minutes by default. You can extend that with some options. And then the next time that a loop comes around, we that becomes a cash hit. So in a normal agentic loop, you'll see some cash hits and some cash reads. So we're doing a little better here. And I think you'll watch over the course of the demo. You'll see that cash hit rate get better. Okay, so that's prompt caching. But if you scroll down a little bit, let's look at some of these other objectives. Ben, look, Ben, I gave you a million tokens of context, one million tokens in this opus 4.7 million tokens. And that's not enough apparently. So with all the tool calls that we're doing to get information from Slack, from Gong transcripts, from Salesforce, all the fact, why don't you probably have a copy of one of those and just show there's just an enormous amount of data that's getting flooded into the context. So even at a million token context, we're running out before we even get through objective one. So we should think about how we want to handle this. As you might guess, I have a couple of techniques for handling this. Let's switch back to the slides and let me describe some techniques for context engineering. Okay, so context engineering is really a discipline. It's the discipline of deciding what belongs in Claude's context. One mistake I see developers doing is using abstractions over top of the platform that obscure what's in the context. And then as a developer, you don't really know what Claude's seeing in its context. And that makes it difficult for you to optimize makes it difficult for you to be a context engineer. So I encourage you to really pay a lot of attention to look at the full transcript that Claude has access to that Claude's using because it'll be very insightful. And the discipline is you making a proactive decision to decide what should be in. And I'm going to talk here about three different tools that we have available today in production for you to use to help manage the context. The first one is about reducing the tool declarations that's in the platform that's in your context. The second is about reducing the tool results that pollute the context. And then finally, compaction reduces all those stale turns that are no longer needed. So let's drill into each of these and see how they look. So tool with tool search tool, we have many customers that have tens or even hundreds of tools loaded. If you have a long running multi use agent, oftentimes it does need many tools to get its job done. That's one of the beauties of LLMs is their general purpose and can do lots of different things. So we want to encourage you to have a lot of tools. But if you look at this without case, if you load all those tools in upfront in the system prompt, then that leaves very little space to do your actual work. So if we look it without, sorry, with tool search tool, what we're doing is defer loading all those tools. You still declare them upfront, but we don't put them in the prompt yet. We defer loading them. And then we load them just in time, just as the model needs them. So if the model, if in a particular, agentic trajectory, the model never needed that tool, then it never gets loaded. And that really optimizes the context pretty well. And we see customers like lovable have reduced their token uses by 10%. And it actually not only doesn't save you money and latency to do that, but what lovable saw, and I think many people will see, it actually increases the intelligence of the model to be more careful about what goes in the context. So that's tool search tool. The next one to look at is programmatic tool calling. So first off, don't you love these animations. Can you tell that I have free tokens from Claude to be able to build these animations? So what programmatic tool calling does is it solves the problem of tools that return too much data. And many of the tools that we just showed you return huge amounts of text that just gets stuffed in the, in the prompt. And you know, that works fine if you're building a little demo, that works fine. But here in this session, we're talking about going to production, not building a demo. So to go into production, you need to be a little bit more thoughtful about this. And you can try to mess with the tools to have them return less. But oftentimes what happens when that is you miss a case. There's a, there is some case where the model needed that data. And then you've removed it. And now the model doesn't have it. So the insight that we had here is that the model is actually pretty good at writing Python code. I don't know if you've noticed that, but the models are actually very, very good at writing Python. So what we do is we expose the model. We see here are the set of tools that are available right now in context. And it writes Python code. You'll notice the first time it writes code to inspect the schema of what's returned. And then the second time like you're seeing here, it knows the schema. So it actually, the tool returns all of its data. It stays in memory. And then the model writes code to pull out just the little bits. So just one bite here, a few words there, and it uses just that in its context. So the model is deciding what it needs in its own context. And with this technique, Kora is saving a lot of money and seeing increased intelligence for Python, a lot of HTML parsing that they're doing. So finally, I call this one the sledge, the sledgehammer technique, because even if you do a great job at managing your tool declarations and the tool results, if you have a long enough running agent, then you will hit the context window limits. So the compaction does is it removes all those unused turns, all those turns of the conversation that are not needed anymore. It just compacts those down to a short summary. And that summary is important. And in fact, there's a lot of like intelligence that goes in to create that summary so that the model can continue on. So when you lose the thread, it kind of knows what it needs to keep doing with compaction. And we see hex is actually using that in production now. Okay, so let's I know you're dying to see let's switch back to the demo and see our superhero hero corp agent. And let's see. We should add tool search tool. Like, you know what, let's just add them all just add all of them. So we're going to add tool search tool, programmatic tool calling and compaction in one go. So now when we reload the page, note, watch the context bar as it goes up there. Notice firstly, it's moving way slower than it did before. Remember before, it loaded in the first objective. And we went all the way to a million tokens. We're calling the exact same tools that are returning the exact same data, but we're being smart. We're using context engineering to understand what goes in. So with much less context, we're able to load the entire page. And I know some of you being counters have already noticed that the cost went way down when we did that as well. So, but let's walk through these one at a time just make sure we follow what exactly each of them do. So let's start with, you see tool search tool in here. So here is what the tool search tool returns. Again, the model looked at the problem we gave it and said, oh, I'm going to need a tool that does this thing, like that does hero retention metrics. And then our system went through all of the hundreds of tools that were there and picked out, I don't know, three or four and returned those. So rather than having 100 in context, we only have three or four. And so something like, I don't know, like hero retention metrics gets called. And then I guess later, we see it. Do we see it? Which one do we see? Do you see, oh, yeah, there it is. There it is. Sorry, I missed it. Hero retention metrics is right there. So we dynamically added this tool just in time. And then as soon as we added it, the model model turned around and called that and we returned the full data from that. Okay, so that was tool search tool. The next one we talked about was programmatic tool calling. So in this one, we flag it as code execution in this in this view and look at just examine the code that the model writes. And we, this is not some special thing I have. We give you exactly the code, the model writes so you can understand what's actually happening. You see it's calling those methods. If you look at the results equals a sync gather, it's calling each of those same tools. The same tools we saw before, it's calling those with the same parameters. But this time, we're not loading all of that into context. The model is store is loading them into a Jason object and is printing out very well documented. You see contact pipeline, gong, exactly where it came from. And then you see that colon, 2500, like that. The model knows exactly which object that it needs. So it doesn't like all of that crap we loaded in the first turn. It doesn't need that. It just needs this one little bit. So it's pulling that out. So it's printing it. And then that is what gets loaded into the model's context. Okay, so that's tool search tool. And then finally, the last one we talked about was compaction. So actually for compaction, why don't we look and see it gets called. Yeah, yeah, it gets called a couple of times in here. So what I did just for demo reasons, we set the compaction threshold pretty small like what like 500 K something like that. And that is a common technique to you don't have to go. I love I launched the million token context. I love a million token context, but you may not need all of that. It may save you some cost and it may save you some latency. And it may give you more intelligent to keep keep the model at at a lower threshold. So that's what we did here. We kept it a lower threshold. And so as the context grew to that threshold, the model we paused execution, sent the entire transcript to another model call, which summarized it and gave this summary. So go ahead and expand the summary. So you can see this is the summary of everything that went on. So it took hundreds of tool calls with all their results and everything that was going on and just compacted it down to like here, here's the few things you need to know to keep going so the model could keep going. So yeah, we hit I think we hit all the things. But I notice it's still costing $10 to run this thing. I mean, I got to tell you the hero if you're selling tokens is a great business. So the second to that is selling superheroes. That's also a really good business, but still $10 per load. I mean, we're going to get in trouble for this thing. So I notice that you're using opus. Yeah, where's it say this? So the model that's being used is opus again, opus 4, 7, absolutely great model. But it's expensive. And it may be better to use a smaller model. So it turns out a small model like sonnet can do tool calling just as well, right code just as well. But there's just a few things that it doesn't do as well. And we'll talk about how to take advantage of that. So let's switch back to slides. And we'll talk about advisor the advisor strategy. So what the problem we're trying to solve with advisor is we want opus level intelligence, but at high coup level costs or sonnet level cost. That sounds great. Right. So the insight behind here really came from engineering teams. So you know, if you've worked with engineering teams that you compare junior engineer with the senior engineer. And that junior engineer will get a lot better. Right. Because the senior engineer won't be doing all the work. They won't be hands on keyboard doing the work for them. But they will do coder views. They'll look at design documents. They'll do coaching. And that's exactly true with models as well. What we see is that if we give high coup away, it can call opus and ask for help. Then an opus can scan over the transcript and see what's happening. Then it can give some advice back to high coup, which really helps a lot. And so you see again, another beautiful animation here. You see that the executor with high coup knows every shape. It knows exactly what to do with each shape except this oddball shape. And the oddball shape, it has to go ask the advisor, the advisor knows, and then it puts the advisor tells the executor what to do with that oddball shape. And we're seeing customers like Bolt are using this to help manage their cost as well. Okay, so let's switch back to the demo and we'll take a look at this step. So yeah, why don't we go ahead and add advisor to this. So what we're going to do with adding advisor is going to switch the model from it was opus. Now you see that model line. It says, sonnet 4.6 plus opus as advisor. So immediately again, I saw the accountants eyes light up because the price went down significantly. Immediately because we're doing all the tool calling and Python code, whatever, all that's being done with sonnet, which is way cheaper than opus. So you get immediate savings. But the question is, are you getting the intelligence as well? So there's one really important objective that we have here. And that is the metropolis renewal. This is contract. I've been very worried about for here a corp. They have to win the metropolis renewal. And if we look at this advisor call, the way it works is sonnet went and looked at all of the gong transcripts, all the data about the metropolis renewal and real and sonnet said, great, this looks good. This is on track. And then, but then it said, oh, maybe I'll just call the advisor and just make sure because this is a high impact deal. And opus reviewed the same transcript. It reviewed that same transcript. And it said, oh, sonnet, you missed a thing buried deep in the transcript. We see that cryo thing is actually needed that the mayor actually wants cryo thing. And this is like way detailed and sonnet just missed it. And so sonnet said on track, but opus caught it. So now you see the advantage not only you're using sonnet for it being cheap. Oh, wait, the marketing team told me not to say it's cheap. It's not cheap. It's inexpensive. So sonnet is inexpensive. And so we're using, we can use that, which is great. But you still get the intelligence of opus, but just on demand, just exactly what is needed. And it catches this cryo thing thing. Okay, so we caught that. And so now you can see it says in red. It's not it's not good. We need to do something. And I don't know about you, but I have some tension about this. I want to make sure that we can can win them a tropicalist contract. So if you scroll down a little bit, there's an actionable thing that the CEO can do. And that is to lock in cryo thing. I also love these superhero names. It turns out it turns out the legal team wouldn't let me use any real superhero names. So we have to use cryo thing. We can lock this in. So should we should we save the contracts? Who's in favor saving the contract here? Yes. Okay, great. Thank you. You're with me. Okay, let's click that and lock in cryo thing. We have it. Okay, so that's this is in the agentic world. This is all you do as a CEO. Just click the button and you're and you're good. Everything saved. Okay, we've got just one minute. Thank you. Ben. Let me wrap us up. So back to slides. What we saw here is we saw, it's very important to do prompt caching. If you're not doing that, ignore the rest of the talk and go do that. If you are doing that, really pay attention to context engineering. Make sure you are aware of what's in the context and then optimize that for what's actually needed. And finally, we talked about using the advisor strategy. So it's really on on demand intelligence. But that's not all amazing stuff that we've shipped in the platform just in the last few months. So I want to call out my favorite things. This workload identity federation. So if you're concerned, if your security team is concerned about losing API keys, you need to worry no longer. If is a great answer for that. And then I just love this ants CLI command line tool. Everything you can do in the console just about is available via command line tool. And the best thing about a command line tool is Claude just loves command line tools. So Claude. If you use it via Claude Code, just tell it about the ants CLI tool and it will go and manage everything for you and do just an amazing job at that. And the thing to really take away here is that betting on the platform means that it's going to the platform is going to keep getting better and your agents are going to keep getting better as new things keep coming out in the platform. So that's all. Thank you very much. I appreciate it.

Feedback / ReportSpotted an issue or have an improvement idea?