Spec-Driven Development: Agentic Coding at FAANG Scale and Quality

For those of you who haven't heard of us, Kiro is an agentic IDE. We launched general availability this most recent Monday, I think the 17th, but we launched public preview on July 14th. So we've been out there for a few months, getting customer feedback, all that good stuff. We're going to talk a little bit about using spec-driven development to sharpen your AI toolbox. I did a show of hands, and about a quarter of the people here are familiar with spec-driven dev.
My name's Al Harris, principal engineer at Amazon. I've been working on Kiro for the last half. And we're a very small team. We were basically three or four people sitting in a closet, doing what we thought we could do to improve the software development lifecycle for customers. So we were charged with building a development tool that improved the experience for spec-driven development. We were theoretically funded out of the org that supported things like Q Dev, but we were purposefully a very different product from the Q ecosystem, just to take a different take on these things. We wanted to work on helping you scale AI dev to more complex problems, improve the amount of control you have over AI agents, and improve the code quality and maintainability, reliability I should say, of what you get out the other end of the pipe.
Now we're back to new content. So our solution was spec-driven dev. We took a look at some existing stuff out there and said, hey, vibe coding is great, but vibe coding relies a lot on me as the operator getting things right. That is me giving guardrails to the system, that is me putting the agent through kind of a strict workflow. We wanted spec-driven dev to represent the holistic SDLC, because we've got 25, 30 years of industry experience building software, building it well, and building it with different practices, right? We've gone through waterfall, XP.
We have all these different ways that we represent what a system should do, and we want to effectively respect what came before. So this animation looked a lot better. It was initially just the left diamond, but the idea was, hey, you basically are iterating on an idea. I think half of software development is discovering requirements, and that discovery doesn't just happen by sitting there and thinking about what should the system do, what can the system do. We realized, working on this, that the best way to make these systems work is to actually synthesize the output and be able to feed that back really quickly into things like your input requirements. You actually do the design and realize, oh, if we do this, there's a side effect here we didn't consider, and we need to feed that back into the input requirements. And so this compression of the SDLC evolved to bring structure into the software development flow.
We wanted to take the artifacts that you generate as part of a design. That's the requirements that maybe a product manager or developer writes, and that's going to be the acceptance criteria: what does success look like at the end of this? And then we want to take the design artifacts that you might review with your dev team, or with stakeholders, and say, this is what we're going to go build and implement. And we want to make sure that you can do this all in some tight inner loop. That was initially what spec-driven dev was.
What spec-driven development in Kiro is today, or at least was before GA, is this: you give us a prompt and we take that and turn it into a set of clear requirements with acceptance criteria. We represent these acceptance criteria in the EARS format. EARS stands for the Easy Approach to Requirements Syntax. It's effectively a structured natural language representation of what you want the system to do.
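To make "structured natural language" concrete, here is a minimal sketch of how an event-driven EARS-style sentence could be split into its parts without an LLM. The `WHEN ... THEN the <system> SHALL ...` shape is assumed from the "when, then, shall" phrasing in the talk; the `EarsRequirement` class, the `parse_ears` function, and the sample sentence are hypothetical and are not Kiro's actual parser.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed template: "WHEN <trigger> THEN the <system> SHALL <response>."
# A comma may stand in for THEN ("WHEN <trigger>, the <system> SHALL ...").
EARS_EVENT = re.compile(
    r"^WHEN\s+(?P<trigger>.+?)"
    r"(?:,\s*(?:THEN\s+)?|\s+THEN\s+)"
    r"(?:[Tt]he\s+)?(?P<system>.+?)"
    r"\s+SHALL\s+(?P<response>.+?)\.?$"
)


@dataclass(frozen=True)
class EarsRequirement:
    trigger: str   # the event that activates the requirement
    system: str    # the component that must respond
    response: str  # what the component is required to do


def parse_ears(sentence: str) -> Optional[EarsRequirement]:
    """Return the parsed requirement, or None if the sentence is not in EARS form."""
    match = EARS_EVENT.match(sentence.strip())
    if match is None:
        return None
    return EarsRequirement(
        trigger=match.group("trigger"),
        system=match.group("system"),
        response=match.group("response"),
    )


if __name__ == "__main__":
    print(parse_ears(
        "WHEN a session ID is provided THEN the agent SHALL load the prior conversation."
    ))
    # A free-form sentence has no WHEN/SHALL skeleton, so it is rejected.
    print(parse_ears("The agent should probably remember things."))
```

The point of the fixed keywords is that a plain deterministic parser can recover the trigger, the system, and the required response, which is what later makes it possible to turn a requirement into a checkable property.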
Now, for the first four and a half months this product existed, the EARS format looked like kind of an interesting decision we made, but just that, sort of interesting. With our general availability launch on Monday, we have finally started to roll out some of the side effects of that, which is property-based testing. So now your EARS requirements can be translated directly into properties of the system, which are effectively invariants that you want to deliver.
For those of you who have not done property-based testing in the past, using something like, I think it's Hypothesis in Python or fast-check in Node, and Clojure's spec library is another example: these are approaches to testing your software system where you're effectively trying to produce a single test case that falsifies the invariant that you want to prove. If you can find any counterexample, then you can say this requirement is not met. If you cannot, you can say with a high degree of confidence that the system does exactly what you're saying it does, where the word "high" is doing a little bit of heavy lifting, because it depends on how well you write your tests. We'll get a little bit more into property-based testing, PBT, a little later.
But this is the first step of many we're taking to take the structured natural language requirements and tie them with a through line all the way to the finished code and say, if the properties of the code meet the initial requirements, we have a high degree of confidence that you have reliably shipped the software you expected to ship. So with spec-driven dev, we take your prompt, we turn it into requirements, we pull the design out of that, we define properties of the system, and then we build a task list and you can run your task list. Effectively, the spec then becomes the natural language representation of your system.
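As a rough illustration of "try to produce a single test case that falsifies the invariant", the sketch below hand-rolls a tiny property checker using only the standard library. In practice a library such as Hypothesis would generate and shrink the inputs; everything here (`gen_history`, `find_counterexample`, the round-trip property, and the deliberately buggy `buggy_save`) is a hypothetical example, not code from the talk.

```python
import json
import random
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def gen_history(rng: random.Random, size: int) -> List[Message]:
    """Generate a random conversation history with `size` messages."""
    alphabet = 'abc \n"é'
    return [
        {
            "role": rng.choice(["user", "agent"]),
            "text": "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8))),
        }
        for _ in range(size)
    ]


def save(history: List[Message]) -> str:
    return json.dumps(history)


def buggy_save(history: List[Message]) -> str:
    # Deliberate bug: silently keeps only the last three messages.
    return json.dumps(history[-3:])


def load(blob: str) -> List[Message]:
    return json.loads(blob)


def round_trips(save_fn: Callable[[List[Message]], str]) -> Callable[[List[Message]], bool]:
    """Property: loading what was saved returns the original history."""
    return lambda history: load(save_fn(history)) == history


def find_counterexample(
    prop: Callable[[List[Message]], bool], runs: int = 200, seed: int = 0
) -> Optional[List[Message]]:
    """Return the first generated input that falsifies `prop`, or None."""
    rng = random.Random(seed)
    for run in range(runs):
        case = gen_history(rng, size=run % 10)  # cycle through sizes 0..9
        if not prop(case):
            return case
    return None


if __name__ == "__main__":
    print(find_counterexample(round_trips(save)))        # no counterexample found
    print(find_counterexample(round_trips(buggy_save)))  # a history with four messages
```

Finding no counterexample is evidence, not proof: the confidence depends on how well the generator covers the input space, which is the "heavy lifting" caveat above.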
It has constraints, it has concerns around functional requirements and non-functional requirements, and it's this set of artifacts that you're delivering. I don't think I have the slide in this deck, but ultimately the way I look at a spec is that it is, one, a set of artifacts that represents the state of your system at a point in time T. It is, two, a structured workflow that we push you through to reliably deliver high-quality software, and that is the requirements, design, and execution phases. And three, it is a set of tools and systems on top of that which help us deliver reproducible results. One example of that is property-based testing. Another example, which is a little less obvious but which we can talk about later, is, I don't know what to call it, requirements verification. We scan your requirements for ambiguity, we scan your requirements for invalid constraints, e.g. you have conflicting requirements, and we help you resolve those ambiguities using classic automated reasoning techniques.
I could talk a little bit more about the features of Kiro, but I think that's maybe less interesting for this talk because we want to talk about spec-driven dev. We have all the stuff you would expect, though. We have steering, which is sort of memory, sort of like Cursor rules. We have MCP integration, we have images, yadda yadda, and we have agent hooks.
So let's talk a little bit about sharpening your toolchain. I'm going to take a break really quick here and pause for a moment for folks in the room who have maybe tried downloading Kiro or something else: are there any questions right now before we dive into how to actually use spec-driven dev to achieve a goal? No questions. It could be a good sign, or it could mean I'm not talking about anything that's particularly interesting. So I actually want to talk in some concrete detail here.
This is a talk I gave a few months ago on how to use MCPs in Kiro. One of the challenges that people who had tested out Kiro had (that might be a little easier to see) was that they felt the flow we were pushing them through was a little bit too structured. Like, you don't have access to external data, you don't have access to all these other things you want. And so one thing that we said on our journey here towards running your... oh, you know what, slides out of order. Here's my nice AI-generated image.
So you can use MCP. Everybody here, I assume, is familiar with MCP at this point, and Kiro integrates MCP the same way all the other tools do. But what I think people don't do enough is use their MCPs when they're building their specs. You can use your MCP servers in any phase of the spec-driven development workflow, that is, requirements generation, design, and implementation, and we'll go through an example of each.
So first of all, setting up MCP in Kiro is fairly straightforward. We have the Kiro panel here, which is the little ghost. Then you can go down to your MCP servers and click the plus button. My favorite way to do it is to just ask Kiro to add an MCP and give it some information on where it is. It can usually figure it out from there, or you just give it the JSON blob and it'll figure it out. Once you have your MCP added, you'll see it in the control panel down here, and you can enable it, disable it, allow-list tools, disable tools, et cetera. So you can manage context that way. One thing worth noting: changing MCP, and changing tools in general, is a cache-breaking operation. So if you're very deep into a long session, maybe you don't tweak your MCP config, because it will slow you down dramatically.
But let's talk about MCP in spec generation. The Kiro team uses Asana, for reasons I don't know, but it's our task tracker of choice.
So one thing I want to do is maybe go and say, I don't want to write the requirements for a spec from scratch. My product team has already done some thinking. We've iterated in Asana to break a project down. This is not always how things work, but it's sometimes how things work. So in this case, I have a task in Asana. I don't know, I did the wrong thing. That's what I get for zooming in. So I have this task in Asana that says, add the view, model, and controller to this API. In this case, this was a particular demo app that I can share in a few minutes. It's kind of peeking out under here, but we even had some details about what we wanted to have happen.
Now I can go into Kiro and just say, start executing task XYZ, with the URL from Asana. Kiro is going to recognize this is an Asana URL. I have the Asana MCP installed. It goes and pulls down all the metadata there, and from there starts determining what to work on. Oh, it's funny, these titles are backwards. This one is basically "create a spec for my open Asana tasks." Again, go pull all the tasks from Asana, and then for each one generate a requirement based on the task. I think I had something like six tasks assigned to me. One is do user management, one is do some sort of property management. It'll pull them in and generate the requirements. And then in this case, the title is wrong, apologies: start executing task. This is where I want to go and do the code synthesis for this.
I will take a quick break here to talk about how you can do this in practice. For those of you who are following along in the room, feel free to fire up your Kiro, open a project, and then pick an MCP server. I'll share a few repos here really quick that you can play around with. So I have an MCP server implemented. I have this Lofty Views, which I think is the one from the Asana demo. And these should all be public. Let me just double check. Yeah. OK.
So for example, if you wanted to extend mine, I have a Nobel Prize MCP, because, perhaps unsurprisingly, there is a Nobel Prize API that it curls. You can use uvx to install it, or you can git clone alharris-at/nobel-mcp. This is just one example. Another one, if you want to play around with the sample that's in the video, is alharris-at/lofty-views. I'll leave these both up on the screen for a few moments for folks who do want to copy the URLs. Oh no. Let's put you on the same window.
So what I'll demo quickly is using an MCP to make spec generation much easier or more reliable. Here I have, let's see, I've got a lot of MCPs. Let's use the GitHub MCP. Oh no, ignore me. OK, well, I have the Fetch MCP. So in this case I could, for example, come in here and say, hey, I've generated a bunch of tasks for the Lofty Views app. This is basically a very simple CRUD web app. But I want Kiro to use the Fetch MCP to pull examples from similar products that exist on the internet. You could also use the Brave Search or Tavily Search MCP servers, but in this case I'll just use Fetch because I've got it enabled. Oh, actually, we can run the web server and use Fetch. That's a good example. The point is that at any point in the workflow of generating a spec, you can go through and use your MCP servers to get things working.
Now, this is what I get for not using a project in a while. We'll cancel that. We can actually do something a little more interesting, which is a separate project I've been working on. I've been working on an AgentCore agent. I know that project works, which is the reason I have it up here. What did I call it? Well, maybe we'll do live demos at the end. So the most basic thing you can do with Kiro is just use MCP servers. But any tool uses MCP servers. I actually don't think that's particularly interesting.
So let's say, in this process of trying to sharpen our spec-driven dev toolkit, we've finished up with the 200 grit. We've added some capabilities with MCP. It's useful, but it's not going to be a game changer for us. I want to come in here and actually get up to the 400 grit. Let's start to get a really good polish on this thing. I want to customize the artifacts produced, because you've got this task list, you've got this requirements list. "I don't agree with what you put in there, Al," you could say. A lot of people do. And that's a great starting point.
So here's something I heard earlier in the week, earlier in the conference: people like to use wireframe mocks. In spec-driven dev you're using natural language, using specs as a control surface to explain what you want the system to do, and therefore I want to be able to actually put UI mocks in here. The trivial case is that I just come in here when Kiro has asked me, does the design look good, are you happy? And I say, this looks great, but could you include wireframe diagrams in ASCII for the screens we're going to build here? This is again from that Lofty Views thing. I'm adding a user management UI, but I want to actually see what we're proposing to build, not just the architecture of the thing.
So Kiro's going to sit here and churn for a few seconds. But you can add whatever you want to any of these artifacts, because they're natural language. They're structured, which means we want some reproducibility in what they look like, but ultimately what they look like doesn't matter, because we've got the any machine here, the agent sitting there, that can help translate it to what it needs to be. So Kiro's churning away here. It's thinking, thinking. And then it's going to spit out these text-wrapped ASCII diagrams. I'll fix the wrapping here in a second in the video. But ultimately, it does whatever you want.
So if you want additional data in your requirements, you can do that. If you want additional data in the design like this, you can easily add that. Here we've got these wireframes in ASCII that help me rationalize what we're actually about to ship. And then I can continue to chat and say, actually, in the design, maybe I don't want this Add User button to be up at the top the entire time, in which case I could chat with it to make that change easily. Now we're on the same page up front instead of later, at implementation time. So we've again left-shifted some of the concerns. That's one example: I want to add UI mocks to the design of a system. This is a quick snapshot of the end state there, and now my design does have these UI mocks.
Another example, one that I actually like a little bit more, is including test cases in the definition of tasks. Today the tasks that Kiro will give you will be kind of the bullet points of the requirements and the acceptance criteria you need to hit. But I want to know that at the end state of this task being executed, we have a really crisp understanding that it is correct, not just "done." Anybody who has used an agent can probably testify that LLMs are very good at saying, I'm done, I'm happy, I'm sure you're happy, I'm just going to call it complete. Oh, yeah, the tests don't pass, but they're annoying. I tried three times to make them work, I'm just going to move on. No, I don't want that. I want to actually know that things are working.
So in this case, I've asked Kiro to include the explicit unit test cases that are going to be covered. My task here, for example, creating this AgentCore memory checkpointer, is going to list all the test cases that need to pass before it's complete. And then I can use things like agent hooks to ensure those are correct. We'll run this sample a little later in the talk.
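One way to picture "correct, not just done" is a completion gate that refuses to close a task until every test case named in it has passed. This is only a sketch of the idea, under the assumption that test results are available as a name-to-pass/fail mapping; the `Task` class, the function names, and the test names are hypothetical and do not reflect how Kiro tasks or agent hooks are implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    title: str
    # Test cases written into the task definition up front.
    required_tests: List[str] = field(default_factory=list)


def missing_or_failing(task: Task, results: Dict[str, bool]) -> List[str]:
    """List required tests that failed or never ran."""
    return [name for name in task.required_tests if not results.get(name, False)]


def is_complete(task: Task, results: Dict[str, bool]) -> bool:
    """A task counts as complete only if it names tests and all of them passed."""
    return bool(task.required_tests) and not missing_or_failing(task, results)


if __name__ == "__main__":
    task = Task(
        title="Create the memory checkpointer",
        required_tests=[
            "test_put_then_get_returns_same_checkpoint",
            "test_get_unknown_session_returns_none",
        ],
    )
    # The agent reports "done", but one required test never ran.
    results = {"test_put_then_get_returns_same_checkpoint": True}
    print(is_complete(task, results))         # False
    print(missing_or_failing(task, results))  # ['test_get_unknown_session_returns_none']
```

Treating a test that never ran the same as a failing test is the design choice that closes the "tests were annoying, moving on" loophole.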
This is the thing I've got ready as a little demo. So this is another example where, again, you're working on your tool bench, you have all these capabilities and primitives that you control, and you can tweak the process to work for you, not just follow the process that I think is the best one.
And then, last but not least, the 800 grit. At this point, we're getting a final polish on the tool. We might be stropping next. You can iterate on your artifacts, but you can also iterate on the actual process that runs. One thing I do a lot is I'll be chatting with Kiro and I say, hey, in this case, I want to add memory to my AgentCore agent. Let's dump conversations to an S3 file at the end of every execution. Kiro is going to say, that's great, I know how to do that, I'm going to research exactly how to do that thing, I will achieve this goal for you. But ultimately, what I've done is introduce a bias up front. I'm steering the whole thing toward using S3 as the storage solution just because maybe I'm familiar with it, but it's probably not the best way to go about it.
So after it had synthesized the design and all the tasks and all this stuff, I came back and said, well, we didn't need to stick to this rigid spec-driven dev workflow that has been defined by Kiro. I can ask for alternatives. Like, is this the idiomatic way to achieve session persistence? I don't know. Maybe there's a better way. Maybe, if we're talking AWS services, it's S3, it's Dynamo, yadda yadda. Kiro is going to come in here and say, good question, let me research. It's going to go through and call a bunch of MCP tools that I've given it access to. This ties back to the earlier point that you should be using MCP. And then it comes back with a recommendation for something I didn't know was a feature, which is AgentCore Memory.
It says it's more idiomatic and future-proof, which is maybe TBD and should be checked a little closer. Or you could use S3, which is the thing you recommended. Now, I bet there are far more than two options here. So you could probably keep asking the agent, are there other options, yadda yadda, and it would go and keep looking. But you should not lock yourself into the rigid flow that is the starting point here.
So that's actually, I think, it for my deck. What I will do is just run through that sample I had up there. So basically, let me delete it, and I'll do a live demo of specs in Kiro and how we can fine-tune things a little bit. This project is a Node.js app. It is a CDK app. Again, I'm not trying to sell more AWS. These are just the technologies I'm familiar with, so I can move a lot more quickly. I wanted to know a little bit about AgentCore, which is a new AWS offering, and as somebody building an agent, I should probably be familiar with it. I'm not familiar enough with it. We've got some other people here who know a lot about it, so I'm going to put my hand up a little bit and say, you know, you caught me.
So I set up a CDK stack, which is just IaC technology to deploy software. I'm familiar with it, and I love it. I have a stack here that lets me deploy whatever an AgentCore runtime is. I don't know. I asked Kiro to do it. We vibe coded this part. So we vibe coded the general structure. We've got an agent. We've got IaC set up. I then vibe coded in commitlint. I added Husky. A few things like this that I like for my own TypeScript projects. So Prettier and ESLint, I think. So we have a basic project here that I know I can deploy to my personal AWS account. Now I'm going to come in here and, oh, and then importantly, this is super important, because I don't know how the hell AgentCore works.
And I could go read the docs, but the docs are long and they're complicated, and I'm really just trying to build out a POC to learn about it myself. So I added two MCP servers. Oh no, maybe I didn't. Let me check. Oh, okay, yes, sorry. Buried down here at the bottom. So this is my Kiro MCP config. I added one important MCP server here, which is the AWS documentation one. There are other ways to get documentation. You can use things like Context7, but in this case, this is vended by AWS, so I have some confidence that it might be correct. I used this to help the agent have knowledge about what technologies exist. And I think I used Fetch quite a bit as well. So these are the two MCP servers I provided the system. Let's close that and move on.
So, I'll just rerun this from scratch. What I had done yesterday evening, or maybe the evening before, was I sat down with this system basically working, and now I want to start doing spec-driven development. I want to add this session ID concept, and then I want to read and write conversations to an S3 file, blah blah blah. This is the whole bias thing I showed you earlier. We're going to fire that off through Kiro. It's going to start chugging away. Then it's going to see if the spec exists. Okay, the folder does exist. It's probably going to realize there are no files there and start working away.
From here, I'll live demo. It's going to read through the requirements. It's going to read through existing docs, read through existing files, gather the context it needs, churn away. But in a moment, once it generates the initial requirements and design, I am going to challenge it to use its own MCP servers: I want you to go and do some research on the best way to do this and provide me some proposals. And this is why I was hoping to get the clip-on mic working, because I'm going to set this down for a moment.
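For readers following along, the two-server setup described above might look roughly like the JSON produced below. This is a sketch under assumptions: the `mcpServers` layout with `command` and `args` is the common convention for MCP client configs, `uvx` as the launcher and the two package names are assumed from the publicly available AWS documentation and Fetch servers, and the `disabled` and `autoApprove` fields should be checked against Kiro's own documentation before use. The exact config shown on screen in the talk is not reproduced in the transcript.

```python
import json
from typing import Any, Dict


def build_mcp_config() -> Dict[str, Any]:
    """Build an illustrative MCP config with a docs server and a fetch server."""
    return {
        "mcpServers": {
            # Assumed package name for the AWS documentation MCP server.
            "aws-docs": {
                "command": "uvx",
                "args": ["awslabs.aws-documentation-mcp-server@latest"],
                "disabled": False,
                "autoApprove": [],  # no tools auto-approved in this sketch
            },
            # Assumed package name for the reference Fetch MCP server.
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"],
                "disabled": False,
                "autoApprove": [],
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON blob; where it is saved depends on the tool's conventions.
    print(json.dumps(build_mcp_config(), indent=2))
```

Keeping the server list short matters for the reason given earlier in the talk: every enabled tool adds context, and changing the tool set mid-session breaks the cache.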
So, you know: I don't know if this is the best way to do this. Go read docs, go use Fetch, yadda yadda. It's going to keep churning away here, and then come back to me after it's probably got a few ideas to propose. This is an example of me just using additional capabilities: use Fetch, use the docs MCP, use whatever you can to get the best information, and don't take the things that I said at face value. These are usually things we have to prompt pretty hard to get the agent to do, but if you're doing it in real time, it works fairly well. Again, all of these agents are very easily swayed. So just because I said something in the steering docs, it may or may not actually be the most important thing from the agent's perspective down the road.
Okay, so it's done a little bit of research and understands that LangGraph, which is the agent framework we're using, already has a notion of persistence. And actually, in this case, it did not use the MCP for the AgentCore docs, so it didn't find that AgentCore has a notion of persistence. So let's pretend I still don't know about this, as if I hadn't dry-run this a few days ago. We might find that later, in the design phase. What I think it's going to do is iterate over all my requirements here and tune the requirements based on what it now knows about LangGraph and how it can natively integrate with the checkpointing. But it's still really crisply bound to this S3 decision that I made implicitly in the ask. So that is just something to be aware of. Anything you put in the prompt is effectively grounding the agent, for better or worse.
You see, it's still iterating. So, yeah, it comes through and says, does this look good? I'm going to say, looks great, let's go to the design phase. So now Kiro is going to take my requirements and take me into the design phase of this project.
I can make this a little bit bigger, but here's an example of what I meant by these EARS requirements. The user story here is: as a dev, I want to implement a custom S3-based checkpointer so the agent can use LangGraph's native persistence mechanism with S3. Great. That sounds reasonable to me as a person co-authoring these requirements. This here, this "when, then, shall" syntax, is the EARS format. The structured natural language is really important for us to pass this through non-LLM-based models and give you more deterministic results when we parse out your requirements. Because ultimately, our goal is to use the LLM, not as little as possible, but less and less over time. We want to use classic automated reasoning techniques to give you high-quality results, not just whatever the latest model is going to tell you.
So here it's gone through. This is a big old design doc. Let's actually just look at this in markdown. Sure, you've got a server, the checkpointer goes to S3, that makes sense. Pseudocode. Again, in a real scenario, maybe I read this a little more closely. And this is the new thing we shipped on the 17th: now Kiro is going to go through and do this step of formalizing requirements into correctness properties. Right now, what the system is doing is taking a look at those requirements we generated, the requirements we agreed upon with the system earlier ("these look good, I agree with them," yadda yadda), taking a look at the design, and extracting correctness properties of the system that we want to run property-based testing for down the road. This is something that may or may not matter for you in the prototyping phase, but should matter for you significantly when you're going to production. Because if these properties are correct, and these properties are all met, the system aligns one-to-one with the intent of the requirements you provided.
So while this is chugging away, any questions yet? Any folks curious about this? Yeah.
What would you say is the main difference between this and planning mode?
I haven't used planning mode in a couple of weeks, and things move so fast it's a little wild. But I think ultimately what we would say is that Kiro's spec-driven dev is not just LLM-driven; it is actually driven by a structured system. With planning mode, I'm not sure if there's actually a workflow behind it that takes you through things, but this is our take on it, for sure. I'm not familiar enough to give a more concrete example, unfortunately.
I mean, what Cursor does is basically build you a plan. Just an execution plan.
Okay. Oh, I see. So I think the fundamental difference there is: does that plan get committed anywhere, or is it just ephemeral? Okay. So what I want over time is not just how we make the changes we care about, but the documentation and specification of what the system does. The long-term goal I have is that, as Kiro, we are able to do a bidirectional sync. That is, as you continue to work with Kiro, you're not just accruing these task lists. (And I'm just going to tell it to go ahead to the tasks.) We're not just accruing task lists; if I come back and, let's say, change the requirements down the road, we will mutate a previous spec. So I'm looking at really just a diff of requirements. As you go through the greenfield process, you're going to produce a lot of green in your PRs, which is maybe not the best, because I'm just reviewing three huge new markdown files.
But the next time, or the subsequent times, that I go and open that doc up, I want to be seeing, oh, you've relaxed this previous requirement, you've added a requirement, and that has this implication on the design doc. That is the process the Kiro team uses internally to talk about changes to the Kiro system. Our design doc reviews have in general been replaced by spec reviews. Somebody will take a spec from markdown and blast it into our wiki, basically using an MCP tool we use internally. Then we'll review that thing and comment on it in a design session, as opposed to writing a markdown file or wiki page from scratch. So it becomes, well, it's actually not like an ADR, because it's not point-in-time. It is this living documentation about the system. But yeah. Thanks for the question. There's one over here.
This may be more of a spec-driven development question, but is there a template for the set of files that you fill out? Right now you're in the design.md. Is the design.md based on a template? And is it a single file, or are there...
Oh, great question. So the question was, and correct me if I'm wrong here, are there a set of templates that are used for the system? And is the question driving at whether you can change the templates, or just whether they exist? Okay. So the question is, are there a set of templates? There are, implicitly, in our system prompts for how we take care of your specs. You'll see here at the top nav bar that right now we're really rigid about these requirements, design, and task list phases, but we know that doesn't work for everybody. For example, we get this feedback from a lot of internal Amazonians, actually: I want to start with the design.
I have an idea for a technical design and I don't necessarily know what the requirements are yet. Maybe design is even the wrong word. I want to start with a technical note. This comes up a lot for refactoring, actually. So I want to refactor this to no longer have a dependency on something. Here's a good example. In Kiro, we use a ton of mutexes around the system to make sure that we're locking appropriately when the agent is taking certain actions, because we don't want different agents to step on each other's toes. But maybe I want to challenge the requirements of the system so I can remove one of these mutexes. Or semaphores, I should say. So I might start with something like a technical note, and from there extract the requirements that I want to share with the team and say, I had to play with it for a little while to understand what I wanted to build. I still want to generate all these rich artifacts.
So today it's this structured workflow. We're playing around a lot with making that a little more flexible. But the structure is important, because the structure lets us build reproducible tooling that is not just an LLM. I think that's an important distinction we make: our agent is not just an LLM with a workflow on top of it. The back end may be an LLM, or it may be other neurosymbolic reasoning tools under the hood. So we try to keep that distinction clear: you're not just talking to Sonnet or Gemini or whatever, you're talking to an amalgam of systems, depending on what type of task you're executing at any point in time. Although when you're chatting, you are talking to just an LLM. But yeah, we have a template for the requirements. We have a template for this design doc, because there are sections that we think are important to cover.
And again, if you disagree and you're like, I don't care about the testing strategy section, just ask the agent. Similarly, the task list is structured because we have UI elements that are built on top of it, like task management. We'll get there when we do some property-based testing, but there's some additional UI we'll add for things like optional tasks. So we need the structure there for that tooling to work, for example. Yeah, thank you for the question. Anything else before we truck on? Cool.
I need somebody to remind me what we were doing. Oh, that's right. We went through and synthesized the spec for adding memory and some amount of persistence to my agent. I didn't introduce you to this project. This project is called Gramps. It is an agent that I'm deploying to AgentCore to learn about it. I mentioned that, but what I didn't tell you is that it is a dad joke generator. A very expensive one, since we're powering it via LLMs. The prompt is effectively: you're a dad joke generator; jokes should be clean; they should be based on puns; bonus points if they're slightly corny but endearing. So we're deploying this to the back end.
The reason I want memory is that every time I ask the dad joke generator for a joke, it gives me the same damn joke. That's just super boring, and my kids are not going to be excited about it. So I want memory, so that as I come back in the same session, I get different jokes over and over again. That's the context on the project. We've come through here and generated this thing. We did the task list. I said, hey, is this the idiomatic way to do it? What I know is that we're not using AgentCore's memory feature, which is probably a big oops.
And so, quick show of hands: do we want to make the mistake and go all the way to synthesis and deployment, or should we fix it now? We want to fix it now, because we know better? No, I want to make the mistake. Let's keep on trucking. I have three yeses in a room full of nothing. So we're going to make the mistake and then come back and fix it later.
So let's say "run all tasks in order". The reason I say "in order", which seems very specific, is that this is a preview build of Kiro, and somebody just added "I should only do one task at a time" to the system prompt. I found that if I say "run all tasks", it thinks I somehow mean do them all in parallel. That will be fixed before these changes get out to production.
So Kiro is going to keep going through here and chewing away on the system in the back end. It has steering docs that explain how to do its job, which I guess I should show you. Steering, again, is like memory. I have some steering on how to do commits, how I like my commits to look, but also steering on things like: how do you actually deploy this thing? How do you deal with AgentCore? How do you run the commands necessary to deploy this to my local dev account? Those are mostly an example, again, of sharpening your tools. I went through this painful process of figuring out that you have to use this parameter on the CDK command, you have to use this flag, otherwise it doesn't work correctly. Once I go through that pain of learning, I just say, "Kiro, write what you learned into a steering doc", and it will usually do a very good job of summarizing. That is how I automatically generated this AgentCore LangGraph workflow .md file.
So it's just going to go away here and truck on and do its job, and we can watch it in the background. In the interim, I think at this point we're at a pretty flexible spot.
So for folks who want to, feel free to use Kiro and try out spec-driven dev on your own. I'm going to keep running this in the background and take questions and comments, but that's it for the scheduled part of today.
How does Kiro work for existing large codebases?
Yeah, the question was how Kiro works for large, existing codebases, basically the brownfield use case. The answer is that it depends on what you're trying to do. For spec-driven dev, you can ask Kiro to research what already exists. When you start a new spec, it will usually start by reading through the working tree. But the agent is generally starting from scratch, and so it has to understand the system. In practice, what that means is that if your system already has good separation of concerns, and the components in your system are highly cohesive, it's going to do a great job. It's going to be able to say: this is the module that does this thing; I don't need to keep 18 things in my context to do my job. And it's going to do well.
Let's take an example off the top of my head. If you were trying to launch an IDE very quickly leading up to an AWS launch, and you took on a lot of tech debt along the way that you need to unwind (nobody here would do that, I'm sure, but in case you did, like me), then your agent might have a much harder time traversing the codebase, in the same way that a dev would. So from that perspective, the more reliable things like your test suite are, and the more understandable things like module separation and decomposition of concerns are, the better the agent will do. And the inverse is true, of course. Now, for things like understanding the codebase, this is a bad example, because this is a very small codebase.
But we do have things like code search and workspace. I don't know if you'd call these context providers. You can come in here and say, I want to do code... what is it? I might have turned this off, actually. I did turn it off, because this codebase is small enough. We do things like indexing in the background, so you can do semantic search over what you've got. But in general, even if you're just chatting, Kiro should go in and do background search to figure out how to do its job. As the codebase scales up, it's probably going to do less well overall, but that's one thing we're working on as a team. Did that answer your question, or did I glance off the side a bit? Got it. Okay, cool. What else?
How long are you willing to wait for indexing to complete? One example I have is Code OSS. If it's not supremely obvious from looking at it, Kiro is a Code OSS fork, just like Cursor and Windsurf. One of the challenges we've had is that the Code OSS codebase is very large. There are other big ones out there, but that's my go-to large codebase, because I get to work in it fairly frequently. There is definitely some perceived slowdown when you're dealing with something that large, especially when you talk about codebase indexing. It's a very active area of work for us, though. We're trying to do things like remove indexing from the critical path, so that you're not waiting on a slowed-down render thread because indexing is running. But in practice, there should not be a hard limit. Again, the agent may practically do less well, but we're going to be talking in a couple of weeks at re:Invent about how some of the features in Kiro were built via spec in a codebase we did not understand particularly well, because we're just not VS Code devs. And Kiro did a fine job of it.
But again, that's a testament to the fact that that codebase is reasonably well structured. If you've taken the time to understand how it works, it's very understandable. If you've not, it might be a little opaque to stare at.
In terms of indexing, is it just putting files into context, or is there a way to create some kind of vector database of the whole codebase and then query it?
Yes, so the question was, what do you mean by indexing, because indexing can be a bunch of different things. What I mean is that the agent is actually not handed the index. We try to keep the agent context as small as possible. We use the index mostly for secondary effects: if you're doing a code search, or if I do something like search with # for a file in here, like server. We use it more for these types of UI than for giving it to the agent, because the agent (this is anecdotal, and based on our benchmarks) does better when given less context but given the tools to understand where to go find things. Something we've heard a lot about at this conference is incremental disclosure. Again, we don't want to load too much at the beginning of the context and conversation with the agent. We want the agent to self-discover the right context for the task. Thank you.
How are you managing session length? Using compression, or pruning?
Yeah. So the question was how we manage session length. We have no incremental pruning or incremental summary today. You basically just accrete context until you hit your limit. I think right now I'm on Auto, which has something like a 200K token limit, similar to the Sonnets. So we don't have a very sophisticated algorithm here yet. We've looked at a few things, but our number one concern, actually, is prompt caching hit rate.
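As a rough illustration of the "accrete context until you hit your limit" behavior described above, here is a minimal sketch. It is not Kiro's implementation: the `Session` class, the `on_limit` hook, and the whitespace-based `count_tokens` are all hypothetical, and the 200,000 figure is only the approximate limit quoted in the talk.

```python
from typing import Callable, List, Optional

TOKEN_LIMIT = 200_000  # approximate limit quoted in the talk


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


class Session:
    """Append-only conversation context with a hard token cap (illustrative only)."""

    def __init__(self, limit: int = TOKEN_LIMIT,
                 on_limit: Optional[Callable[["Session"], None]] = None) -> None:
        self.limit = limit
        self.on_limit = on_limit  # hypothetical hook, e.g. a one-off summarization
        self.messages: List[str] = []
        self.used = 0

    def append(self, message: str) -> bool:
        """Add a message if it fits; otherwise fire the hook and reject it."""
        cost = count_tokens(message)
        if self.used + cost > self.limit:
            if self.on_limit is not None:
                self.on_limit(self)
            return False
        # Earlier messages are never rewritten, so the prefix stays stable.
        self.messages.append(message)
        self.used += cost
        return True
```

One plausible reason to prefer this shape: prompt caches are generally keyed on a stable prefix, so an append-only history leaves earlier turns cacheable, whereas pruning or rewriting earlier turns would change the prefix.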
In a normal use case I can achieve something like 90 to 95% cached token usage per turn, which means that my interactions are very fast, or at least much faster than the alternative, which is sending 160K tokens to Bedrock cold. That's one of the reasons we've not done much experimentation with incremental summary. Our summarization feature kicks in when you hit the cap. It's not great. We're trying to ship an improved version very shortly, e.g. in the next couple of weeks, which should be faster. Today it's a one-off operation that can take 30 or 45 seconds, which is a horrendous experience. We're hoping to fix that and make it a real-time experience.
Spec-driven dev has less to do with performance and more to do with reproducibility and accuracy of the agent, and whether we can give you the right result. The way I think we talk about it internally as a team is this. If I spend 10 seconds giving a prompt to the agent and it goes off and gets it wrong, it's no skin off my back. I burned however many tokens and a couple of cents of credit usage with whoever my provider is, but I spent 10 seconds generating a prompt. If I spend five to 10 minutes with the system producing a detailed design doc, or let's say even a detailed set of requirements, I want it to do a fairly good job. If I spend an hour generating a design doc, reviewing it with my team, and then synthesizing from that, I want it to get it right. So the goal is not just latency, but accuracy. It's a both-and, you need to do both, but spec comes more from the goal of having highly reproducible output. I'm going to go over here first, and then you.
How do each of these task agents pass context to each other?
And then, are you only supposed to run the parent task? Because it just finished 3.1, 3.2, and 3.3, but then it still thought that 3.1 wasn't done and ran that and 3.2.
Oh, okay. So the question is about when you're in the UI running tasks, and I can pull up my task list here. If I just hit start, start, start, each of these is going to be a new session, which means the context is completely unique. Personally, if I've got the context space to afford it, I just say "do all the tasks", because I find that more understandable, and I think I actually get better performance. But by default, each task will be a new session that has no shared context with the previous ones. The session is effectively just seeded with your specification, and then something like: you're working on a spec that does all this stuff, block of text, and you are doing this task; don't do any other tasks, just do this one. So that sounds like a bug.
Have you looked at subagents for certain things?
We don't have subagents yet in Kiro. It's something we're working on.
Because if we click on task three, I've got 3.1, 3.2, and 3.3 in there, separated, and they could have different agents working on them.
Yeah. We do have custom agents in the Kiro CLI. The Kiro CLI has a concept of custom agents, which can be run sort of as a task. It's something we're playing with right now in Kiro desktop. And I think you have another one.
Sorry for this, but about the spec folder: as you do more and more of these tasks over time, is it all in one requirements, design, tasks set? Is your whole project defined there, or is it grouped?
That's a good question. The question was, as you generate more specs over time, are you just creating one massive spec? And no. Let me open a different project.
This is, for example, the Kiro extension, which is a 1P extension inside the Kiro IDE. This is where the agent itself lives. We have pruned some specs, but there are specs in here that we can talk through, or I can just demo. The way I think about it, a spec represents a feature or problem area in the project. I can blast this a little larger. For example, some of these are just experiments. We've done things like: could we have a prompt registry? Could we have a prompt file loader? Let me try something on the chat UI. They may or may not make it all the way to production. Each of these maybe represents a few days of work for an SDE.
AGENTS.md support is a good one, where I said: research what AGENTS.md is, build it the way you built steering, and support it in the same way. We're fairly unlikely to come back and revisit this spec in the future, so I may actually just delete it, which is what we've done with some of the older ones.
A good example of one that we might come back to is our message history sanitizer. One issue we had early in the development of Kiro is that we would send invalid sequences of messages. Let's say the Anthropic API required tools to be in the same order they were invoked in the responses, but the system wasn't doing that. So we built this whole sanitizer system that has a bunch of requirements. Let's see, very specifically: "When a conversation is validated, the system shall verify that each user input has either non-empty content or tool responses." We had cases where empty strings would get passed in, but there was a tool response.
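To make that acceptance criterion concrete, here is a minimal sketch of a check in its spirit. It is not Kiro's sanitizer: the `UserMessage` shape and both function names are assumptions made up for illustration, and the real system presumably validates much more (such as tool ordering).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserMessage:
    """Hypothetical user turn: free-text content plus any tool responses."""
    content: str = ""
    tool_responses: List[str] = field(default_factory=list)


def is_valid_user_message(msg: UserMessage) -> bool:
    """A user input is valid if it has non-empty content OR at least one tool response."""
    return bool(msg.content.strip()) or bool(msg.tool_responses)


def find_invalid_inputs(conversation: List[UserMessage]) -> List[int]:
    """Return the indexes of user inputs that violate the criterion."""
    return [i for i, msg in enumerate(conversation) if not is_valid_user_message(msg)]
```

Under this reading, an empty string accompanied by a tool response passes, which matches the case mentioned in the talk, while a message with neither is flagged.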
This is a good example where we've come in over time and added, maybe not new requirements, but to the acceptance criteria of the requirements, as new validation rules are uncovered.
How do you handle that? For example, you have a feature that needs to change. Do you go back and update that spec?
Yeah. Usually you'll see, and let me just start a new chat here. No, that's a terrible idea. So here, in spec mode, I've made a request to add UI telemetry to the thing. It says, "I'll help you add it," and it will check whether there are any relevant runbooks and look at the codebase. It might go do a little research here. And then, flip of a coin again, it's an LLM, so it may or may not discover the existing spec. But ideally, after doing its research, it will say: there exists a spec already for things like UI telemetry, I'm going to go amend that one. If it doesn't, I would come in and just ask it to, as the operator of the system. Over time, again, we want that to be easier, so that you as a user don't have to think about it so much. We can watch it while it chugs along.
Is there anything configured in here that makes it work better with AWS?
Not really. The question was, is there anything in here that is preconfigured to make it work better with AWS? No, and that is on purpose. We are brought to you by AWS, Andy Jassy and Jeff B. are paying my check, but we're not an AWS product that's deeply integrated with the rest of the AWS ecosystem. Now, that said, I still get emails when somebody asks why the other thing we built at AWS is not working with Kiro.
But similarly, if you're building on GCP, or Azure, or wherever, or you're running some on-prem system, the product should work just as well for you. That's our goal.
A good answer, potentially, is the AWS documentation MCP. There are MCP servers that you can add into any of these tools that will make that kind of thing work.
Yeah, that's a good point. In this case, I actually had to add the AWS documentation MCP here. We could of course have bundled this natively, but I don't want to ship this to customers who don't need it. Again, AWS docs are not the only docs that we might care about.
By the way, coming back to your question: it did find the existing spec for telemetry. It read different sections of it, and now it's actually making amendments to it. We can follow the diff as it shows up here. It has added new requirements to the pre-existing spec. So this is effectively another case where we're mutating the system, as opposed to adding to a never-ending spiel of specs.
I guess what I'm wondering is, how does it decide where to put the spec? If you break your project down into these different categories, I would imagine there's crossover.
Yeah, I mean, that's software development in a nutshell, right? How do you actually define the seams between different parts of your system, different concerns, the product?
And if I have a task that cuts across a lot and changes three or four things, is it going to change three or four specs and then run tasks across those three or four?
Oh, no, it should not do that. Again, I don't have a good example at hand that we can do for that.
But my perspective would be this, for something cross-functional. By the way, the question was: say I have a spec for security requirements, a spec for API design, like the API shapes, and a spec for logging, and I am changing something in the public API interface that is a security-facing concern, because we're redacting PII from logging. I think that's a semi-tangible use case that we can all imagine coming down from our governance teams. I would imagine that you either pick one of those specs to load the requirements into, or you create a cross-functional spec. I think it comes down to you as the operator making that decision, in much the same way as with the implementation: you would not necessarily implement a PII redaction module as a standalone thing. It's going to be a cross-cutting theme across your codebase, I'd imagine.
Also a great example: there are multi-root workspaces, which went to GA on Monday. So in the example you just went through, you can drag in those different projects, even the front-end ones, even if they live separately, and still work across them.
Yeah, thanks, Berk.
So in the mental model, the spec generates the code after that?
Yeah, so we have now effectively synthesized the spec. We sat down and defined the requirements, design, and task list. I've had Kiro go through and run all the tasks in this spec. It ran them one at a time, it worked on small, bite-sized pieces of work, chunk by chunk, and now this is done.
So what we've actually produced is not just the completed spec. It went into my agent and did a few things in the CDK repo: because it's doing persistence to S3, I'm sure it added a bucket and some bucket encryption. It then went into the agent and added the S3 checkpoint saver. It looks like it created a checkpointer, it adds this to the graph, and it passes this all the way through the system. And the S3 checkpointer here, I'm sure, has some knowledge of how to write the checkpoints to and from S3. So we have gone beyond defining the system: we've now produced it and delivered it, including property tests, I believe.
I have an answer to an earlier question about AWS-specific features that make it easier to work with: the Kiro CLI comes with the use_aws tool.
Yeah, so Rob's point is that the Kiro CLI, which we just rebranded this week, has a use_aws tool, which is basically a wrapper over the AWS SDK to make some of those things easier. But again, bring your own "use GCP" tool as an MCP server, if that's your cloud of choice. And I believe, don't quote me on this, the CLI is kind of new to me, that you can turn off tools in the CLI as well. Let me know if that's not right. So in the desktop product today, you can't control the native, built-in tools, but in the CLI you can.
I intuitively get the benefit of having a spec. Have you done any work to measure how a project or problem goes with a spec versus without?
Yeah, we do have benchmarks. I don't have the data offhand. I think part of that is in our blogs. If you go to kiro.dev/blog, we talk really crisply about what things like property-based testing give to task accuracy. The science team is always working on that stuff.
I remember the blog about specs. It comes from the database person.
Yeah, the distinguished engineer for databases. I think that one has the data that you're after.
How does it work? I understand the feature side of it, but how does it work on the non-functional side, like latency, for the larger, harder problems?
Well, that is ultimately the goal here, right? We're saying you're making a slightly larger investment up front, but we believe the structure we're bringing is going to increase the accuracy of your result. So while we've got a team of people who are working on making spec better, my job when I fly back to Seattle is to make Kiro as a whole much faster. One, execution, the lagginess in the UI; but two, how do we get tokens through the system faster? How do we get responses to you faster, so that you're not sinking as much cost into Kiro to use a spec?
I'm not talking about the Kiro tool itself. I mean the code generated from the spec.
Oh, okay. You mean the non-functional requirements of the generated code. That's going to come down to what you're specifically trying to do. One of the slides I had was about how to tweak the process and the artifacts for your use cases. You could very easily add something like: I want non-functional requirements for speed, runtime, and things like lock contention to be considered in the design phase. That's something you could certainly add.
So you could say it should be coded in Rust or Java?
Yeah, totally. Again, I'm familiar with Node, so I'm doing everything here in Node, but you can use this with any language.
I think technically we say we support Java, Python, JavaScript, TypeScript, and Rust, but in practice there's no reason this doesn't work with any language. I mean, it's just an LLM. There's nothing language-specific or framework-specific in the system. There was a conference earlier this week hosted by Tessl, who are doing specs as a sort of knowledge base. Their argument is that as long as you've got the right grounding docs in there, it should not matter what you're building. That's all informed by the context you're building for your system.
This is also a really good point for steering. With steering you can get the agent to develop code the way you want. Being a developer is all about making trade-offs, and the problem with Kiro out of the box is that it's so polite. It's trying to be everything to everyone. Especially with latency and cost and other things like that, just tell it in steering what you want it to prioritize, and that will influence any code that gets generated. So if there's something that's very specific to your use case, your industry, or whatever, just drop it in the steering file.
Yeah, that's exactly right. For example, I will have Kiro generate commits for me. One of the things I personally care about is that I can track commits I generate versus commits that Kiro generates. So my steering doc, although short, includes, very specifically, my requirement that Kiro attribute the co-author as Kiro agent. Which is trivial, but I also want it to happen every time. So in this case, I just generated a commit co-authored by Kiro agent. That's an example; you could add whatever you want there. Not just something related to commits, but you could do code style.
You could do, you know, code style, code coverage. Whenever you add a spec where you're adding a new module, make sure that you add it with coverage minimums of 90%, because that's the thing I care about. You can put anything you want in there.
The good news is that it looks like what we built works. Kiro is very happy with itself, at least, and it looks like all tests passed. We can deploy this to the back end and see how things work. We're technically just about at time. If anybody has any other questions, I'm going to stick around here for a while. But thank you all for joining, listening, and learning a little bit more about spec-driven dev.

Spec-Driven Development: Agentic Coding at FAANG Scale and Quality — Al Harris, Amazon Kiro

TL;DR

Takeaways

Vocabulary

Transcript