The capability curve

TL;DR

Anthropic's Claude models have improved sharply at coding within about a year: on SWE-bench Verified, the score rose from 62% with Sonnet 3.7 to 87% with Opus 4.7, a gain of 25 percentage points, alongside clear improvements in planning, error recovery, and sustained attention.
Newer Claude models can plan before acting, recover from failure states, and maintain coherence over very long runs, which lets developers build more complex and reliable applications.
To take advantage of these gains, developers should regularly update their evals, simplify their application scaffolding, trim their prompts, and give agents more autonomy and room to self-correct.
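
The talk quotes SWE-bench Verified scores of 62% (Sonnet 3.7) and 87% (Opus 4.7). A short worked calculation shows how the same two numbers read as an absolute gain, a relative gain, and a change in failure rate. The variable names are illustrative; only the two scores come from the talk.

```python
# Scores quoted in the talk (percent of SWE-bench Verified tasks solved).
old_score = 62.0  # Sonnet 3.7
new_score = 87.0  # Opus 4.7

# Absolute gain, in percentage points.
absolute_gain = new_score - old_score

# Relative gain, as a percent of the old score.
relative_gain = absolute_gain / old_score * 100

# Failure rates, and how many times smaller the new one is.
old_failure = 100.0 - old_score
new_failure = 100.0 - new_score
failure_ratio = old_failure / new_failure

print(f"absolute gain: {absolute_gain:.0f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
print(f"failure rate: {old_failure:.0f}% -> {new_failure:.0f}% ({failure_ratio:.2f}x lower)")
```

Read this way, the 25-point gain corresponds to a failure rate that drops from 38% to 13%, roughly a threefold reduction.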

Takeaways

Use Claude's improved planning: Give newer models time to strategize before they execute a task. Forcing immediate action can reduce downstream performance.
Trust Claude's improved error recovery: Newer models can backtrack from a problem and try a different path, which yields better task performance and fewer wasted tokens than the "doom loop" behavior of older models.
Use Claude's longer attention span: Newer models stay coherent over hundreds of thousands or even a million tokens, so developers can spend less effort babysitting the context window or chunking work, and can run agents autonomously for longer.
Optimize your evaluation process: Build evals that match your product's actual task distribution, keep them challenging (not saturated) as models advance, and run them on each new frontier model.
Simplify your scaffolding: Review the code, prompts, skills, and tool setups around the model and remove what is no longer needed. More capable models often need less explicit direction and can perform better with a leaner setup.
Refine your prompts regularly: With every new model, review your prompts and cut rules and instructions that are no longer needed. This can improve task performance and save tokens.
Give agents adaptive thinking and tool access: Let the model choose when to think, tune how much it thinks and acts with the effort parameter, and allow controlled, safe access to more tools. One example is Claude Code's auto mode, which runs classifiers over proposed tool calls to decide which ones need explicit human approval.
Design for closed-loop iteration: Build systems in which Claude can inspect its own outputs and iterate on them, for example by giving a front-end coding agent a tool to click around and test the site it just wrote.
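
The advice about keeping evals unsaturated can be made concrete with a small check. The sketch below is a minimal illustration, not a prescribed method: `EvalRun`, `is_saturated`, `headroom`, the 0.95 ceiling, the model names, and the sample counts are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Result of running one model on one eval suite (hypothetical record)."""
    model: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def is_saturated(run: EvalRun, ceiling: float = 0.95) -> bool:
    """Treat an eval as saturated once the pass rate reaches the ceiling.

    The 0.95 default is an arbitrary illustrative threshold.
    """
    return run.pass_rate >= ceiling


def headroom(run: EvalRun) -> float:
    """Fraction of tasks the model still fails, i.e. the remaining signal."""
    return 1.0 - run.pass_rate


# Illustrative numbers only, not real measurements.
older = EvalRun(model="older-model", passed=70, total=100)
newer = EvalRun(model="newer-model", passed=97, total=100)

for run in (older, newer):
    status = "saturated, add harder tasks" if is_saturated(run) else "still informative"
    print(f"{run.model}: pass rate {run.pass_rate:.0%}, {status}")
```

When a new model pushes an eval past the chosen ceiling, the suite no longer separates that model from the next one, which is the cue to add harder tasks drawn from your own product traffic.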

Vocabulary

Agentic coding: A paradigm in which an AI agent autonomously plans, executes, and iterates on coding tasks.
SWE-bench Verified: A benchmark that measures a model's ability to autonomously complete software pull requests.
Doom loop: A failure mode in which a model gets stuck repeatedly proposing solutions that don't work, spiraling until the context is exhausted.
Context window: The maximum amount of information (in tokens) a model can process and keep in view at any given time during a conversation or task.
System prompt: Initial instructions given to a model that define its persona, constraints, or overall behavior for a task.
Evals: Short for evaluations; structured tests or benchmarks used to measure the performance and capabilities of AI models or applications.
Scaffolding: The code, prompts, skills, and tool setups that surround a model and direct it toward its goals within an application.
Adaptive thinking: A technique that lets a model decide how much to plan or strategize before taking action, often tuned with parameters such as effort.
Closing the loop: Designing a system so that an agent can inspect its own outputs, receive feedback, and iteratively refine its work.
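
To make "closing the loop" concrete, here is a minimal sketch of a generate, check, and retry cycle. Everything in it is hypothetical: `closed_loop`, `generate`, `check`, and the stub functions stand in for a real agent call and a real verification tool (such as a test runner or a tool that clicks through a site), neither of which is shown.

```python
from typing import Callable, Optional, Tuple


def closed_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Generate an output, inspect it, and feed problems back in.

    generate: takes the previous feedback (None on the first round)
              and returns a new attempt.
    check:    returns None if the attempt passes, else a feedback string.
    Returns the last attempt and whether it passed within max_rounds.
    """
    feedback: Optional[str] = None
    output = ""
    for _ in range(max_rounds):
        output = generate(feedback)
        feedback = check(output)
        if feedback is None:
            return output, True
    return output, False


# Stub agent: makes a typo first, then fixes it once it gets feedback.
def stub_generate(feedback: Optional[str]) -> str:
    return "button" if feedback else "buton"


# Stub verifier: stands in for a test run or a UI check.
def stub_check(output: str) -> Optional[str]:
    return None if output == "button" else "expected 'button'"


result = closed_loop(stub_generate, stub_check)
print(result)
```

The `max_rounds` bound is a design choice in this sketch: it keeps a failing check from turning into an unbounded retry cycle.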

Transcript

Please welcome to the stage member of technical staff at Anthropic, Alex Albert.

Hello, hello everybody. All right, let me set my water down here. I'm Alex Albert, I'm one of our research PMs here at Anthropic. I just heard from the stage manager that there's a happy hour that just started, so I want to thank all you guys for choosing me over drinks and snacks. I really appreciate that, and I promise this will be worth your 15, 20 minutes here.

It's been really interesting comparing the vibes at this year's conference to last year's conference. Last year, Claude Code was not even GA. It had been out for around two or three months. Folks were just starting to get used to this paradigm of agentic coding. These days, as I'm talking to folks here, it feels like people are doing a lot more. They're trusting Claude. They're shipping things faster with Claude. They're building things that they weren't building before, that haven't been possible to build on the time scales that they are.

Actually, out of curiosity, I want to run a little exercise here. I'm going to say, if the question is you, please raise your hand and just keep your hand up. I want to get a sense of the room. Raise your hand if you feel like Claude has allowed you to go 10 times faster than what you were doing a year ago. Okay. Wow, that was a lot. Keep it up. How about five times faster? All right, 2x faster? Raise your hand if you use Claude. There we go. Okay, love that.

One of the best ways that we've measured Claude's coding intelligence, and how we've seen Claude make this impact on folks, is through this bench called SWE-bench Verified. SWE-bench Verified measures the model's ability to autonomously complete software PRs. About a year ago, our model at the time, Sonnet 3.7, scored 62% on this eval. Today, with Opus 4.7, it scores 87%. That's over a 25% jump in just over a year. To put this in other words, Opus 4.7 is more than three times as likely to succeed on some of those difficult PRs that Sonnet 3.7 was failing on a year ago.

Now, numbers are great, but examples are even better. So to make this a bit more concrete, I have a quick demo here that we put together of the same task, but 12 months apart. So let's get into it. In this example, we're going to be comparing Sonnet 4 to Opus 4.7, and we're going to give them the same task.

Oh, let's go back here and get this demo working. Give this a second. See if it's playing. Oh, there we go. Right. No, still nothing. All right, well, there was a demo here and it was really, really cool, so you're going to have to believe me on that. Oh, here we go.

Okay, Sonnet 4 working within Claude Code. The task that we gave it was to recreate claude.ai with a single prompt. So you can see how it does here. We get this really generic black-and-white chat application. We can enter in a prompt, we'll fire that off, and immediately we just hit an error. So it doesn't really even work. Basically, it just made a UI for us.

Now let's run this same task but with Opus 4.7 instead and see how it does. We're going to have the same setup. It's going to be running within Claude Code. It's going to be writing a bunch of lines of code, calling a ton of tools, and eventually we're going to get an output. And immediately this output looks better. We have the Claude color scheme; it knows that already. We actually get a response back from the Claude API when we send it a message. We can start a new chat and it still remembers the old chat, so we're keeping track of things. It even renders the visualization inline like the claude.ai app does. And like a true developer, it even implemented dark mode. In addition to that being a better output, it also did it in fewer lines of code, so it's more efficient as well.

Now, this talk isn't about any one of these models in particular. What I do want to focus on is what it means to build on something that is getting meaningfully better like this, month over month.

Now, before I get into the specific tips, I want to take a look at where these model gains are actually landing. Starting first with Claude's planning ability: older models would have this particularly bad failure mode where they would act first and then think later. It's kind of like me when I'm having to build a new set of IKEA furniture. I'm going straight into it, and then only once it's a mess am I coming back and being like, "Ah, maybe I should actually read the instructions." I don't know if anybody else is the same on that, but that was basically the same as Sonnet 3.7. Newer models are much more thorough. They take their time upfront to actually think about the problem, strategize a little bit, plan out what they're going to do, and then dive into the action and dive into writing those lines of code. What this means for you as a developer is that you should allow Claude that time to think. Give it some time to actually think about the problem and what it's going to do. Don't force it to just jump straight into that action, because this can reduce your downstream performance.

The second major area where we're seeing gains is in Claude's error recovery. Previous models would hit this thing that we call a doom loop. The model would run into a problem. It would try to propose a solution. That solution wouldn't work. And then all of a sudden it's really stuck, and it's just going to keep spiraling and spiraling until eventually that context stalls out and you've got to just clear everything. Newer models are much more adaptable. If they hit a problem, they're able to now backtrack out of it and actually think about it in a different way and maybe take a different path. What this means for you is that you're going to get better task performance from Claude with fewer wasted tokens, because it's not going to get caught in those loops.

The final area where we're seeing these gains is in Claude's attention over long runs. Older models would often lose the plot as they were working on something. Maybe they would start forgetting things. They're not paying attention as much. Those instructions that you put in the system prompt aren't being paid attention to as much on that thread. Newer models are much more able to hold coherence across runs. They're able to remember those system prompt instructions and stay focused over the course of hundreds of thousands or even a million tokens. In terms of your applications and how you use Claude, this means that you don't need to babysit the context window as much. You don't need to chunk up work, and you can trust that Claude can operate for you on long runs autonomously.

Adding these things all up together, we have better planning, we have fewer failures and better error recovery, and we have agents running for longer. And this compounds into better end-to-end task performance. Our customers are seeing this as well. Vercel saw that in its planning, Opus 4.7 was actually writing proofs for systems code before implementing a single line of code. Windsurf saw in their evals that Claude was really sustaining its attention over their longest agentic runs. And Shopify found that as the model was coding, it was actually going back and iteratively refining its outputs.

So how can you as a developer see those same things our customers are seeing in your applications? Well, somewhat counterintuitively, it starts not with your application but with your evals. If you can measure something, you can improve on it. So it's important that (a) you have evals, and (b) these evals are measuring something close to your product distribution that you actually want to improve on. That might sound pretty obvious, but it's often something that I see lead developers astray, when they try to eval their product on something adjacent to their use case but not on their actual task distribution. So maybe, for example, they have a coding agent, and instead of evaling that coding agent on traffic they have, or traffic of a similar pattern to what their users are doing, they're evaling it on an academic coding benchmark.

The second thing you want to do once you have those evals is to make sure that they're not saturated. As models are getting smarter, evals continuously need to get harder and harder in order to get more signal from frontier models. You want to make sure that the evals you're building are growing alongside the model, so that you can continue to get new signal as more intelligent models come out.

Finally, once you have those evals and you've ensured they're not saturated, test them on the newest frontier models. I've found that sometimes the best optimization you can make to your app is simply swapping in the latest model. So it's worth spending a meaningful amount of time testing and trying out models as they come.

The second tip for you here is to take a second look at your scaffolding. When I'm saying scaffolding, what I'm referring to is the code and the prompts and the skills and the tool setups, everything that kind of surrounds the model and directs it towards its goals. With newer models, you might not need some of the things that you needed before. So maybe instead of a multi-step workflow, you can just let the model work on a task in one thread. Often you can actually boost your performance by removing things from your scaffolding instead of adding them.

Now, alongside your scaffolding, you also want to take a second look at your prompts. Prompts begin to build up model over model. And after a long amount of time and many different model generations, it becomes somewhat of a hideous mess of different rules and instructions, and you're not really sure why you added one thing and what it's doing now for your future model. With every new model, take a second look at those prompts and cut down on things that might not be needed anymore. This will help your task performance and also save you tokens as well.

The final tip here is to give the model room to work. Like I was mentioning earlier when I was talking about planning, it's important to let Claude choose when to think. Use adaptive thinking, and dial the amount of tokens that Claude is thinking and the amount of actions that it's taking with the effort parameter.

The second thing here is to allow your agents more access to tools in a controlled way. Now, some of you hearing that might get a little bit nervous. I'm not saying to just let it do anything, but there are methods that you can take to allow Claude to actually execute on more systems in a safe way. One example of this is Claude Code auto mode. In one of our recent engineering blog posts, we talked about auto mode. Auto mode actually runs classifiers over the tool calls that Claude is proposing, and we can determine if that tool call needs explicit human approval or not. This allows us to let Claude run in the background for longer, and it can work more autonomously without needing a human to step in.

The final tip here is to ensure that you're closing the loop for your agent. Design your system so that Claude can actually inspect its own outputs and iterate on them. Going back to that coding agent example, if you have an agent that's working on front-end applications, maybe you want to give it a [computer] use tool so that it can click around on the site and QA test different bugs or features that it writes. Models are continuously getting better at verifying and iterating on their outputs, so it's important to allow them the affordances to do so.

And with that, that is a quick look at the capability curve. I want to thank everybody for coming out today. I know this is the last talk of the day. I'll be hanging around in the reception, so please come find me. I'd love to chat about how we can make Claude better for you. Thank you.
