- Building a "software factory" with AI agents shifts the human role from a direct code "worker" to a "manager" who delegates tasks, oversees processes, and ensures consistent, verifiable outputs.
- A robust software factory relies on well-structured code, dynamic guardrails to steer agent behavior, and comprehensive testing mechanisms that enable agents to self-verify their work.
- Scaling the factory involves automating human-in-the-loop tasks, providing agents with "enablers" like specialized skills and isolated environments, and continuously improving the system based on agent feedback and performance.
Building your own software factory — Eric Zakariasson, Cursor
- Levels of AI Autonomy: Progress from basic "spicy autocomplete" to a "pair programmer," then to AI generating most code (human reviews), and finally to a "software factory" where AI agents autonomously ship, test, and build code with human input limited to defining intent.
- Foundational Code Structure: Design your codebase with modularity, co-located code, and established usage patterns (e.g., authentication methods, test scripts) to provide clear references and make it easier for agents to learn from and reproduce.
- Implement Dynamic Guardrails: Create rules, checks, and hooks to guide agents and prevent costly mistakes in sensitive code areas. These guardrails should emerge dynamically based on observed agent misbehavior, effectively becoming automated SOPs.
- Enable Agent Self-Verification: Equip agents with the ability to verify their own work through various tests, including unit, integration, UI, and end-to-end tests (e.g., using Playwright for browser interactions), ensuring changes are correct.
- Provide Agent Enablers: Empower agents with "skills" (e.g., feature flagging for safe deployments) and access to reproducible isolated development environments (e.g., separate VMs) to foster greater autonomy and infinite scalability.
- Shift to a Managerial Mindset: Transition from hands-on coding to managing a fleet of agents. This involves scoping and parallelizing work, preserving "tribal knowledge," and front-loading comprehensive context for agents to execute longer, asynchronous tasks.
- Automate Human-in-the-Loop Tasks: Identify routine human interventions (e.g., copying logs, aggregating user feedback, extracting specs) and automate them using agent skills or integrations to eliminate bottlenecks and enhance efficiency.
- Continuous Factory Improvement: Establish feedback loops by observing agent outcomes and identifying "off-rail" behavior (e.g., incorrect database schemas, poor UI). Use these insights to refine guardrails, build design systems, and update agent learning mechanisms (e.g., via "continue learning" plugins).
Dogfooding — The practice of an organization using its own products or services internally.
Software Factory — A highly automated system where AI agents autonomously generate, test, build, and ship software with minimal human oversight, akin to a physical manufacturing assembly line.
Guardrails — Rules, checks, and hooks implemented to constrain AI agents' actions, preventing them from making undesirable or costly changes, especially in sensitive code areas.
Levels of Autonomy — A framework describing the progression of AI agent capabilities in software development, from basic assistance to full autonomous operation.
Primitives and Patterns — In the context of a software factory, these refer to the fundamental structural elements (e.g., modular code) and established ways of doing things (e.g., boilerplate code, specific service methods) within a codebase that agents can learn from and reproduce.
Enablers — Tools or capabilities provided to AI agents, such as "skills" or access to isolated development environments, that allow them to perform tasks more autonomously and effectively.
Feature Flagging — A technique that allows changes to be deployed to production in a disabled state, then activated for specific users or groups, enabling safe autonomous deployment by agents.
Verifiable Systems — Software systems designed such that AI agents can automatically test and confirm the correctness and functionality of their own changes (e.g., via unit, integration, or UI tests).
Human-in-the-loop — Describes tasks or processes where human intervention is still required, typically for critical decisions, feedback, or bridging disparate systems, which can often be targets for automation in a software factory.
Tribal Knowledge — Unwritten information, practices, and insights specific to a project, team, or codebase that are important for agents (and humans) to understand for effective development.
Okay, so we're starting five minutes early. Hey everyone, I'm Eric, I'm an engineer at Cursor and I mostly work on developer experience and product. Today I wanted to talk to you about my experiences working at Cursor, dogfooding the product, and getting to a place where you can build your own software factory: what that takes and the practical steps for getting there. To be honest, I don't think we're really there yet. Subparts of the product and subparts of the company are running fairly autonomously, but building a software factory takes a lot of work. I mean, look at real-life factories producing hardware. There are a lot of assembly lines, there are a lot of people that go into this, a lot of managing, observability and all that, and there are a lot of concepts we can borrow from that world and put into the software world. So anyway, here go my observations from doing this. But first, the agenda. I want to talk about levels of autonomy, the precursor to the factory, pun intended, then building the factory, running the factory, and then scaling the factory, and I want to finish with some Q&A for any kind of questions. Okay, so for the levels of autonomy, Dan Shapiro put out this blog post, I think in January or February. It has six different stages of autonomy for automating software. Karpathy also previously used Cursor as an example of going from tab to agent and all that. But I think this encapsulates it really, really well. So we have the spicy autocomplete at the start, and that's kind of where Cursor started in '22, '23, like ages ago at this point. And we gradually moved up the ladder, making software creation more autonomous and letting the agents do more work.
And I think most people adopting the AI tools are somewhere between level two and level three, where you have a pair programmer and you're essentially just going back and forth with the agent, asking questions, getting suggestions, asking the agent to do work, and eventually finishing your tasks. The step above that would be having the AI generate the majority of the code, which we can see here at level three, where you as a human are more reviewing it, kind of in the loop, following traces and all that. But as you progress further, you're becoming more and more of a manager, and we'll talk about this more later. Then there's level four. I think this is where I'm at at this point for most software projects, where I'm delegating as much work as possible to agents and probably reviewing the outputs before I actually review the code. Because I still look at the code sometimes. And lastly we have the software factory, which is essentially a black box. Dan Shapiro calls it the dark factory, where you don't really have any insight. It's just agents going around doing their thing, shipping the code, testing the code, building the code, all that. And you as a manager just provide the intent and the instructions and the goal for what you want out of the factory. Okay. So why do you even want to create a factory? First of all, throughput. You probably want to create more code with fewer resources. You can run agents 24/7, you don't have to rely on humans that need sleep and food and all that. You can just have more agents. Another thing with factories is that you have assembly lines, and assembly lines produce consistent outputs. So if you build your factory right, you can probably have very consistent output.
But initially, if you don't have the right setup, you might feel like the agents are getting more and more probabilistic and you're losing a lot of the determinism, because they just go off and do random things. That is probably a sign that you need to build more guardrails for the factory. And I think this is a function of the model capabilities as well. As the models get better, they can follow instructions better and just execute on whatever you want them to do. And thirdly, you might want to have a factory because you can leverage your taste better. You can get more of your creativity out instead of waiting for you as a human to create and produce the software yourself. And then the obligatory then and now: this is what it used to look like. This is a Tesla factory from a couple of years ago, and it is kind of what we're getting at here. Okay, let's get straight into it. So to build a factory, what do you actually need? I like to think of this as primitives and patterns. So how do you structure the code? Is this a modularized code base? Do you have things scattered all over the place? Is it co-located code, etc.? It's about the distance to locating things: if you have an agent ls-ing a folder, it can discover all the relevant files at once instead of having to grep and search all of the code base. It can be very isolated and work within one single part of the code base. And the same goes for humans: if you have an easy time onboarding yourself to a new code base, an agent probably will too. The second thing is usage patterns. Do you have specific methods and services for authenticating a user? Do you have startup scripts? Do you have a way to write tests, etc.? Do you have this boilerplate in place?
Because if you do, you can point the agent to existing references and just ask it to reproduce them over time. So those are some of the primitives and structures of the code base. The second one would be guardrails. You want to let the agents be free, but not too free. So you want to have some rules and checks and hooks in place. For example, a hook you might want to have is for touching a specific part of the code base. Maybe the agent should not be able to change the most sensitive parts, like encryption of sensitive data or authentication or anything like that, where a mistake could be very, very costly for the company or for you as a human. Rules are probably the most misunderstood concept since we launched Cursor rules. There's Cursor Directory, which launched a good collection of different rules, and the assumption was usually that you should just install every rule that you can depending on what software stack you're using. For example, if you're using Next.js, maybe you should have Next.js rules. But what I've found, and what I'm seeing amongst our users and internally, is that rules should just emerge dynamically. If you're finding agents going off the rails, you should probably create a rule for that. And it should be sort of like an SOP showing agents what they can and cannot do. And again, the models are getting so good at following specific rules that they usually don't go off the rails anymore. And I think that's just going to extrapolate over time as well. And of course tests: can the agent verify its own work, and can it run tests and know, oh, I messed something up, or I made a change in this specific area of the code but it still passes? I can still run the code and the check looks good. And lastly, which I think is probably most exciting, is the enablers: what can you allow the agents to do to actually let them be free? Skills are good for this.
Just giving agents more capabilities, skills and MCPs, accessing external context, getting an understanding of how to implement a certain thing. I'm going to show you later what we are doing in the Cursor code base. For example, feature flagging. Can we give the agents a skill to add a feature flag? So when we launch them autonomously, they can just flag the actual changes made, merge the PR, and come back to us like, hey, if you want to try this, just turn on this flag. If you don't like it, we'll just revert the PR. If you like it, we can expand it to more users. And lastly, what kind of environment are you letting the agents run in? Can your agents start your dev environment? Can you just ask them, hey, start my project, and let them do that without having any human in the loop? Because if that's the case, you can probably have them run on their own. You can scale it up infinitely on separate VMs. And then this checklist is what I'm usually following when thinking about building the actual factory. Part of that is: is it runnable? There's a typo in here. I blame my Swedish. Then: is the context that the agents need accessible? Can they interface with Linear or Notion or Datadog or Slack, etc., so they can see the broader context of the intent that the user has? And lastly, where I think people should be spending a lot more time, is building verifiable systems. How can the agents themselves verify their own work? Whether that's through unit tests or integration tests or UI tests, actually clicking around in the DOM and trying to reproduce things that are actually happening for the end user. This is arguably easier for backend systems, where there's no UI really happening and you can have clearer contracts and boundaries of what should work and what shouldn't. Whereas for web and UI and all that, you actually need to click around and make sure things work.
The buttons actually have a loading spinner, etc. Okay. So this is part of building the factory. If we switch over to Cursor here, I'm not sure if you've seen this, but this is Cursor 3. We launched this a couple of weeks ago and it's a complete rewrite of Cursor. There's no VS Code anymore. Most of you are probably familiar with this type of Cursor, where we have files and sidebars and a lot of different things. Whereas this is a bit more streamlined for an agent-first workflow, and we'll get to why we created this at a later point. But I wanted to show you some rules, etc. Let's see where I put them. So for example, I built this music agent project, and if you've used Ableton before, you probably recognize this. Yeah. Yeah, yeah, I'll expand it. More? Okay. Yeah. So if you've used Ableton or any music production software, you probably recognize this interface. Oops. It's not really working at this size. But what I essentially asked the agent to do here is: can you start a local dev server? And we can see that it worked for a while. It explored some files, read package.json. And based on this, there is a start script. So package.json, all these dependency files, are so in-distribution for the models that they know they should immediately go to package.json in a JS project, where it exists, to look for a start script. And this is a good example of having a pattern that is predefined and making your code base more in-distribution in that way. Because now it's super easy for the agents to understand, oh, I should just go in here and start a server. So it started a server. It's running on localhost:3000. And let's see here. We can see that we had this AGENTS.md file. So AGENTS.md is like Cursor rules, but it works across many different harnesses.
And what I wanted to accomplish with this project is essentially building a factory around this idea of building an online music creation tool. To do that, I forced myself never to write any code myself, to try not to look at the code that much either, and just try to figure out what systems and structures I need around this. Immediately, it became pretty clear that we need a way to start a project. We need a way for the agent to verify some work. So the agent created these end-to-end tests using Playwright, so it can just spawn browsers, go through routes, et cetera, click around and get elements by test ID, making sure that for every change it makes, for example, the play button still works, or I can add notes to this project here without anything breaking. So these are some examples of how you can create verifiable outputs like that. Okay, we have Vitest, we have this, et cetera. So let's see here, if you go back. Oh yeah, another option here, from casually scrolling on Twitter. A different way to verify the work is using automated code review. You can ask the agent to just review the changes it made, or you can use a more integrated tool like Bugbot that we have in Cursor, which looks at PRs on GitHub, reviews them, and comes back. And this is also one piece of the whole factory: you should have multiple different stages where you plan it, you produce it, you review it, and you essentially follow the whole SDLC, but you automate and codify this work. And I want to show you this as well. So we launched updated cloud agents in the last couple of weeks, where we gave each agent its own separate VM, and you can have them create a very reproducible environment in the cloud. And this essentially allows you to scale infinitely. But we also gave the agent a tool to test its own work by controlling the computer.
So for example we have Glass here, which is the interface, and I asked the agent to, let's see here: Glass agents, see if they're rough with the keyboard, Control-Tab, etc., so better accessibility and using the keyboard to navigate the agents. And I asked it to make the change and then record this with the full editor, because the first one was just a sidebar. So what we got back here is just a video of the agent actually testing its own work. We can see that it has this highlighted row, I'm not sure if you can see that. But it's just some context for me as a human to verify the work. And then it's actually clicking around and using the keyboard to navigate. So with this, we're getting kind of far with the factory. A lot of the things are automated: review is automated, the testing is automated. We have some rules to steer the agents, etc. But there's still a lot more to do. So I think when you have this in place, the most important thing you can do is shift your mindset. You are going to look way less at code. You are going to go from worker to manager. Instead of just doing the work yourself, you're overseeing a lot of agents doing the work for you. This also means going from sync to async, because most of the work is going to happen in the background. You can still tap in and see what's going on for different agents. But the more agents you spawn over time, the harder a time you're going to have understanding what's going on in each of them. So then you need a way to aggregate these changes upwards. And I think it's so interesting that it's just the same as in human organizations. All the same principles kind of follow. You start with a very small team, and then you add more and more people because you need to get more throughput, and all of a sudden you need a manager to oversee things, and then you add more managers, and then you need a manager of the managers.
And this is essentially what's going to happen with agents too. You are just going to keep on going up the levels of abstraction. So when you're a manager, you need to start thinking about how you scope and parallelize the work, because you want to get high throughput. But it's not necessarily good to make all the changes at once. For example, if you have two different tasks working on the same part of the code base, you're going to get merge conflicts. So you still need to plan out, scope, and parallelize the work. One unit of work can always be one agent. So then how do you take a long, long list of things you want to do and actually make the most out of that and run as many agents as you can? And to do this, I think it's important that you preserve tribal knowledge of the code base. You still understand what's going on in the different systems. You know how data flows, what the users want, which parts are critical and which aren't. So not outsourcing too much to the agents, but being very direct and managing them pretty well. And when you're going from sync to async, you are going to need to trust the agents a lot more, because you are going to send them off to do longer and longer tasks. And when you do that, you need to get more context in up front. So you kind of front-load the context for the agents, either into a plan or a long spec, and then you send them off and let them go. And once you start doing this regularly, you're going to start to feel the agents. You're going to understand the models and you're going to see, these are the weaknesses, these are the strengths, and you get this alignment with the models. So you know how to prompt them and what intent to give them.
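The scoping point above (two tasks touching the same part of the code base lead to merge conflicts) can be sketched as a small planning step. This is a minimal illustration, not something shown in the talk: the `Task` structure, the `plan_batches` function, and the idea of declaring the expected paths up front are all assumptions, and paths are compared as opaque labels rather than real file trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One unit of work for one agent (hypothetical structure)."""
    name: str
    paths: frozenset[str]  # areas of the code base the task is expected to touch


def plan_batches(tasks: list[Task]) -> list[list[Task]]:
    """Greedily group tasks into batches whose declared paths do not overlap.

    Tasks in the same batch can run in parallel; a task that shares a path
    with an earlier task is pushed into a later batch.
    """
    batches: list[list[Task]] = []
    claimed: list[set[str]] = []  # paths already claimed within each batch
    for task in tasks:
        for batch, used in zip(batches, claimed):
            if used.isdisjoint(task.paths):
                batch.append(task)
                used.update(task.paths)
                break
        else:
            # No existing batch is free of overlap, so start a new one.
            batches.append([task])
            claimed.append(set(task.paths))
    return batches


tasks = [
    Task("add-login-button", frozenset({"ui/auth"})),
    Task("fix-audio-engine", frozenset({"audio"})),
    Task("restyle-login", frozenset({"ui/auth"})),
]
plan = plan_batches(tasks)
```

Here the first two tasks land in the same batch and the third waits for the next one, because it declares the same path as the first. A real planner would need a better notion of overlap than exact label equality.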
And again, as the models keep getting better, you can give them shorter prompts than you used to, but you still have to provide the intent and be very clear about what change you want the agents to make. And there are no shortcuts to this, from what I've found and from what the team has found. You just have to spawn a shit load of agents and let them do the work and see what happens. And as long as you have good safety guardrails, you can just let them do that. So you probably shouldn't let them push to prod straight away. Yeah, so this kind of comes down to this: personally, I'm always using isolated environments in different VMs. I just tweeted about this actually, because on one hand, if you're sharing the workspace, you can have git worktrees, where you essentially have shallow copies of the code base on the same machine and you can reuse services. But you're still going to have to branch every database or cache or user management to have reproducible and separate environments. If you are going to make a lot of changes at once, you want to know that they are pure and they're not having side effects on the other branches. And that's why I've found it much better to just use cloud agents, where I spawn a VM, and this VM can run a database, internal tooling, other stuff, and the Cursor app itself, and then have the agent just work in that isolated environment. It is more expensive, and it's going to take a lot more work to set up your factory or your environment to support this. But once you have it set up properly, you can scale this to 100 or 1000 agents. I'm not sure how many we are running today, but I bet it's multiple thousands a day, just agents running in copies of the code base. So that's what I would recommend. Yeah, so when you're a manager, your job changes quite a bit.
So you have to look at your system as a whole. You have to think about where the human in the loop is needed. For example, do you have a log service like Datadog, and do you need to copy the logs, go into the code base, paste them, and run the agents to identify and trace down issues? Or do you have user feedback that you need to copy from Twitter into somewhere else and let the agents do something with that? Do you have a Notion thing where you have all your specs, and you need to copy them out of Notion or export them into Markdown and then hand them to agents? There's probably a way to automate all these different things, whether it's skills or MCP or separate automations. So think about where the human in the loop is needed and try to automate that away. The second thing is: how can you catch agents going off, not doing what you actually wanted them to do? And this is the perfect flywheel for improving your factory as well. If you can see agents creating wrong schemas in your database because they don't follow naming conventions, etc., that's probably a rule somewhere. Or if they are just producing really ugly UI, there's probably a way for you to create a design system and make the agents aware of the design system, so they can incorporate that and use it for the next iteration you do. And then you take all these learnings and you use them to actually improve the factory. And thirdly, it comes to scaling the factory. So now you have your environment set up, you know how to be a manager and manage a fleet of agents, you scope the tasks and you do all this. So how do you actually take it from five agents to ten agents to 50 to 100 agents? And the thing is, again, not looking at code is going to be a real thing as the models get better, and they are getting better.
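The schema example above, where agents ignore naming conventions and the fix is "probably a rule somewhere", can also be backed by an automated check that an agent runs on its own output. The sketch below is an illustration under assumptions: the talk does not say which convention is used, so snake_case is a stand-in, and `naming_violations` and the dictionary-shaped schema are hypothetical.

```python
import re

# Assumed convention: lowercase snake_case for table and column names.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def naming_violations(schema: dict[str, list[str]]) -> list[str]:
    """Return one message per table or column name that is not snake_case.

    `schema` maps a table name to its column names (hypothetical shape).
    """
    problems: list[str] = []
    for table, columns in schema.items():
        if not SNAKE_CASE.fullmatch(table):
            problems.append(f"table '{table}' is not snake_case")
        for column in columns:
            if not SNAKE_CASE.fullmatch(column):
                problems.append(f"column '{table}.{column}' is not snake_case")
    return problems


proposed = {
    "user_accounts": ["id", "createdAt"],
    "AudioClips": ["clip_id"],
}
violations = naming_violations(proposed)
```

An empty list means the proposed schema passes. Any messages can be fed back to the agent as concrete corrections, which is the same loop the talk describes for turning observed misbehavior into guardrails.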
So observing the outcomes is kind of the same thing as previously: where do they go off the rails, what are they producing, what are the artifacts, etc. How can you make it so that the agents can also verify their own work and verify the outcomes that they produce? You should set up automations. You should look again at the things you're doing repetitively. So one thing we could do here, for example, is go to Cursor, go to this music agent again, and ask it, looking at my chat history, what repetitive tasks I'm doing. So we can ask the agent to look at this and identify potential opportunities. So it's searching the agent transcripts and producing some kind of artifact from this. Yeah, let's see how this goes. I actually built this into a plugin. Oh, let's see here. Planning-execution loops, restarting product direction. Let's see here. Ableton-like UI iteration. I should probably put this in a rule saying, make it look like Ableton. Tooling housekeeping, etc. So this project is very short-lived, but if you're looking at an actual production thing where you have prompted a lot over time, you're probably going to find things that you are doing recurrently. And I want to show you some things that we are automating at Cursor. Some of these are not that obvious all the time, but one is, for example, let's see here, daily review. So I have this automation for checking my own daily review. It's going to look at Slack, it's going to look at GitHub, and it's going to send me a summary of the things I've done over the last day. I would previously have done this by writing down my notes, maybe thinking about what I got done today, or by running an agent with access to MCP. But now I can just put this on a schedule and have it done automatically for me. I want to show you a different one. For example, read merged PR comments. This is also a way for you to learn over time.
So for all the PRs that we merge in our main repository, we can look at the comments and see what humans actually reviewed here and what they said about the changes I made. Because if a human actually goes in and reviews a PR and leaves a comment, there's probably high value, high signal, and high intent in that comment, and we can then store that for later so the agents can actually learn over time. We have another one which I can show you here. This one. Yeah, again, the code owners. We essentially had this problem where we had code owners in our code base, and they were kind of right most of the time, like 80% of the time, but for the other 20% of the time they caused a lot of bottlenecks for us internally. We were blocked from merging the PR, we needed someone else to review it for us, and maybe they were in a different time zone. So what we started doing was building this agentic code owner thing, and what it essentially does is look at PRs and check, first of all, what's the risk level of this. Is it just changing a variable name, or is it changing a constant that controls how long a trial subscription is, or something like that? If it is low risk, it can just approve the PR, because we don't want to block our own engineers on these things. But if we can see that it is a high-risk PR, then we can find out who made changes to this previously and pull in their feedback and make the most out of this: first of all making the code safe and not breaking systems, but also, for the user that actually made the initial change, keeping them in the loop, keeping them up to date, and refreshing their context on what's going on here. So it goes both ways, and there are multiple value adds from doing this. Let's see if there's one more.
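The routing logic described above (approve low-risk PRs, pull in previous authors for high-risk ones) can be sketched in a few lines. This is only an illustration: in the talk the risk level is judged by an agent looking at the change itself, whereas this sketch substitutes a much cruder stand-in, a list of sensitive path prefixes. `SENSITIVE_PREFIXES`, `PullRequest`, and `triage` are hypothetical names, not Cursor's implementation.

```python
from dataclasses import dataclass

# Hypothetical prefixes treated as high risk; a real list would come from the team.
SENSITIVE_PREFIXES = ("auth/", "billing/", "crypto/")


@dataclass
class PullRequest:
    changed_files: list[str]
    previous_authors: dict[str, str]  # file path -> last human author (assumed known)


def triage(pr: PullRequest) -> tuple[str, list[str]]:
    """Return ("approve", []) for low-risk PRs, else ("request_review", reviewers)."""
    risky = [path for path in pr.changed_files if path.startswith(SENSITIVE_PREFIXES)]
    if not risky:
        return "approve", []
    # Pull in the people who last touched the risky files, without duplicates.
    reviewers = sorted({pr.previous_authors[p] for p in risky if p in pr.previous_authors})
    return "request_review", reviewers


low_risk = PullRequest(["ui/button.tsx"], {})
high_risk = PullRequest(
    ["billing/trial.ts", "ui/button.tsx"],
    {"billing/trial.ts": "alice"},
)
```

Calling `triage(low_risk)` approves the change outright, while `triage(high_risk)` asks for review from the previous author of the billing file. Path prefixes would miss cases like the trial-length constant mentioned in the talk if it lived outside those folders, which is why the talk's version has an agent assess the change.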
No, I think that was pretty much it. Or yeah, I have this one more thing called continue learning. So continue learning is another type of automation that I created a couple of weeks ago as well, and it essentially does what we did with the agent. We look at the previous transcripts we have, and we can then extract memories and learnings from what we said previously. Say we're correcting the agent to do a certain thing, like use this component instead of that component, or always have very verbose descriptions of the things that you're doing. Instead of me going in every time and asking it to do this, I can create a rule. But I'm kind of lazy, so I don't really remember to create a rule, so instead we can have this continue learning plugin that looks through the transcripts and stores this as a rule for you. So these are all examples of systems to automate yourself away and to automate things that the agent can do for you, and I think that's the important part of building these factories: how can you identify the flywheels and loops where you can automate yourself away by building systems? Okay, and yeah, you are going to move up abstractions. So now you're managing five to ten agents, but tomorrow you might be managing an agent managing other agents, and that is just going to grow. You're going to have a lot of subagents under you, working for you. Cool. So what I want you to take away from this is: be very clear about the intent, and really think about what's the actual problem to solve here and what we want to get out of this. Don't outsource important decisions. Make sure you're staying in the loop for important decisions, whether this is safety or security or databases or payments and authentication. Some things are really important and should not be decided by agents but by humans.
Build tools and systems. Try to find these flywheels, codify them, get them into your systems, and let the agents have access to them. Store context for later, whether that is agent transcripts or artifacts of things you think look good, because this is going to help the agent know what good and bad looks like over time. And it's going to change, so storing the context and building the tools and keeping them up to date is more important than actually doing the work, because this is going to provide the framework and the guardrails for the agents. And lastly, let the agents be free. Think about what they need. I have a friend at Lovable, and he mentioned that they set up a Slack channel and gave the agent a tool, a vent tool, so the agent could complain about things while it was running. And the agent started complaining: hey, I can't access this image, I'm very frustrated about this. And it posted that straight into a Slack channel. They set it up as a joke, but then they started scrolling through and said, oh, this is actually very valuable, we should probably give the agent access to reading images. And they did, and then it started complaining about something else that was a problem with the harness. So find ways to let the agents be free. I think that's a very important thing. Okay, that's kind of it, and that's the direction and the things we have found building Cursor and taking Cursor towards a software factory. I hope you learned a thing or two and can take away some of this. I'm happy to take any questions about anything Cursor. Yeah, or actually, now we have the microphones coming here. Thank you very much. I have a question about code quality or architectural quality. When agents ship tons of code and you can barely review it, how do you ensure the code is extensible and so on?
I mean, you can establish hooks or guardrails for measurable things, like, I don't know, the number of lines in a file should not be more than some limit. But architecture is not measured this way. And agents have this completion bias: they want to finish the task as soon as possible and they don't think ahead. They don't have a picture of how the code will evolve in the future. They just want to finish the task now. Thank you.
Yeah, it's a good question. I think we as humans have the same problem, but it just takes a lot more time for us to discover it. The good thing about agents and models being essentially completion machines is that they will look at existing references and continue forward on that same path. So if you have existing things you can point them to, I think that's very important. If you don't, I think there's a case where you let the agents do one-off implementations here and there, and then eventually you have another agent refactoring, like we do as humans as well, to generalize and build abstractions and all these things. So how can you build a system to detect this and verify that the abstractions that are getting built are also good and in line with what you want to do? But I think it's going to be a lot more architectural review for humans, and scoping and planning of what the architecture and system design should look like. But yeah, it's a tough problem.
Thank you. Hello Eric, thank you for the talk. When it comes to the activities of building the factory, one thing that I observe, for example when building things like rules in a team, is that because it's so new, almost everybody feels "this is a rule for me and I don't want to inflict it on other people." I notice this creation of silos, where each engineer ends up having their own separate, different factory.
Do you have any advice on how to bring it to the point where the whole team is contributing to the creation of the factory?
It's a great question. I think it's hard. I think it's very cultural as well. We developers have always created our own tools and we want to have our own custom setup, but at some point we have to unify on a certain structure. Historically we have had PR reviews and all these kinds of things as a ceremony to align on the code that's being produced and make sure it's consistent. I think we're going to take the same principles and apply them to the tools we're building as well, and to the guardrails and enablers and primitives. So, I don't know, establishing some kind of forum where you can discuss these things and plan: what do we want the factories to look like? What are the components we need? What are the integrations we need? Do you have any examples of things that are specific to people? Is it flavor, or is it bigger changes that the agents are doing?
When it comes to rules: one person wants to write the test first, and they create the rule to write the test first, but they know that somebody else doesn't want to do it that way. So then they have the rule only on their machine. They don't share it, because it is unique to how they work. So they are collaborating. The whole team is collaborating on creating the codebase, but not on creating the factory, on deciding: does the factory write the test first or not? That is a big decision, and it is hard to align everybody and get it accepted. With all of these rules, not everybody is going to be completely on board, and in most cases it doesn't matter when you differ a little bit, but it is hard to do.
Yeah, I guess it is a human problem and a human change that we need to make. It is a good question. I'll think about it.
Thank you. Thanks for the talk. A lot of the patterns resonate.
I was wondering what is needed, what kind of patterns you can suggest to take it to the next level if you work on enterprise, brownfield, mission-critical systems that cannot fail and cannot be insecure. If you look at the recent supply chain attacks, and you give your agents sandboxes, maybe that is not even enough. The humans remain accountable, and we can't say "it is not my fault, my agent did that." Do you have any extra patterns, or is it just inherent that we have to keep reading the code, which may feel like reading assembly in the 80s or something?
I think if you can spend a lot of compute and tokens up front, before you as a human actually need to be involved, that is a pattern that we found to be pretty successful. One thing is manually writing tests for very critical parts of the system and then just letting the agents run them a lot. The second part is building automations. Our security team built a security system, which is an automation that looks for very specific invariants of the system, and they run 10 of these on certain PRs that change certain files. And then, yeah, I think it is a bit contextual as well, but just spending a lot of tokens beforehand and trying to find different variants, almost red teaming.
So one thing I did is, instead of focusing on velocity and throughput, focus on quality.
Sorry, what?
I use AI to focus on quality and just improve the tests and make it completely AI ready.
Yeah, I think that is very good, because if you as a human trust the tests, you are probably trusting the output even though you don't look at the code, and that's kind of where we're going.
So, thanks for a great presentation. I find myself kind of lacking in using guardrails, especially rules and hooks.
Partly because, historically, the knowledge of how to do that properly was very scattered and decentralized across the whole web. You would have these exotic GitHub repos that would try to centralize this knowledge, or maybe you'd have some Medium articles, or maybe Cursor, the company, would do a blog post on this. Still, it was evolving a lot, and the capabilities of the models themselves, especially on instruction following, are also evolving and getting better. And it always felt kind of like duct taping to me. So I'm wondering, basically, can we have AI help us with that? Meaning, could Cursor, for example, give us proactive agents, or maybe some new setup, or maybe wizard-style setups where we could identify our workflow and then have AI help build rules and guardrails and all those artifacts for us? So maybe a proactive agent that would scan our workflow globally and then help us build those artifacts. What do you think about it? Do you think about this in the company? Do you maybe work on that?
Yeah, totally. I think there are two places we can do this now. One is in the product itself, with the continual learning plugin. Let's see here. Oh, I don't have it installed. We have a marketplace. Yeah, with the continual learning plugin, to actually look at your transcripts and extract rules and memories and all that. That's one way to do it. Then there's another world where you change the weights of the model depending on what your codebase looks like and what your engineers are doing in a specific team [unclear], and it's true continual learning, not this hacky plugin. You actually bake that into the model so it actually knows what your preferences are, etc.
But totally, memory and rules and all that, I think that's going to become more and more important over time, because that's what's lacking. That's what's preventing me from having a lot of trust in agents sometimes: I say something and they forget about it, because they're like stateless machines. So how do we capture this knowledge? I think we should put a lot more time and effort into building these systems.
If I might just follow up on that. You seem to first start a project, or dive into a project that already exists in the codebase, and then build the rules on top of that. What if we want to have rules first and then start a new codebase, a new project? How do we actually get good rules? Do you think humans should still do that, or can we also automate it? Do we have new best workflows for that?
I think it's hard, because my perspective on rules is that they are the bridge between the model behavior and the human behavior: how do we steer the models so that they follow what I as a human want to do? And in a new project I'm not really sure what I want to do. I kind of want to outsource that to the model and see what it does. Can I run different models? Do I want to combine them, or do I want to scrap everything? So I think it's hard. The best example of a rule that I can think of internally is for Bugbot. When we're doing database migrations, we're not really using foreign keys on the database, for performance reasons. And the models think the right way to do it is to use foreign keys, so they will always add a foreign key. But when it hits GitHub and a PR is created, we have Bugbot look at it, and when it's reviewing it says, "Oh, I have this rule saying we should never use foreign keys." So then it flags this. That's the gap between the human and the model: the intent we have versus what they have.
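To make the foreign-key example concrete, here is a small deterministic stand-in in Python. It is only a sketch: the talk describes Bugbot applying a rule during PR review and does not say how that works, so the function name, the regular expression, and the idea of scanning raw migration SQL are all assumptions for illustration.

```python
import re

# Matches the two common ways a foreign key is declared in SQL DDL.
# This is a heuristic: REFERENCES can also appear in GRANT statements.
FOREIGN_KEY_PATTERN = re.compile(r"\bFOREIGN\s+KEY\b|\bREFERENCES\b", re.IGNORECASE)


def find_foreign_keys(migration_sql):
    """Return (line_number, line) pairs for lines that declare a foreign key."""
    findings = []
    for number, line in enumerate(migration_sql.splitlines(), start=1):
        code = line.split("--", 1)[0]  # ignore trailing SQL line comments
        if FOREIGN_KEY_PATTERN.search(code):
            findings.append((number, line.strip()))
    return findings


if __name__ == "__main__":
    # Hypothetical migration used only to demonstrate the check.
    example = """CREATE TABLE orders (
  id BIGINT PRIMARY KEY,
  user_id BIGINT REFERENCES users(id)
);"""
    for number, line in find_foreign_keys(example):
        print(f"line {number}: foreign key not allowed by team rule: {line}")
```

A check like this could run as a hook or CI step on migration files, leaving the review agent to handle cases that a regular expression cannot express.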
So I think rules should emerge dynamically over time. And before that, you should probably just use ephemeral things like specs and plans.
There's one over here. Oh yeah. It doesn't work. It works.
So, thank you Eric for the talk. Evolution and trust is a big point. I'd like to know how you effectively automate GUI testing and user acceptance testing, if you could show something of your workflow.
Totally. The main way I do it is using, let's see here. Oh yeah, I have this one, for example. The main way I do it is using the Cursor cloud agents with the computer use that we have. So I'm going to publish this. I don't know. That's bad. I guess we're not doing that. I have this website with seven components, like a button, a dropdown, etc. They are web components, and I'm generating each of these components with a different model, because I want to compare what a Composer dropdown looks like versus what a GPT-5.4 dropdown looks like, and I put this in a grid. But when I created this website there was an error. I had this "view code" button so I can see the generated code, and it was not working because the model didn't bundle the actual code. So I went to Cursor and wrote: when clicking "view code" on that component, it says it could not load the code. It's a very short description. What the agent did, you can see here: it spawned my local server and started clicking around and pressing enter. We can see the cursor up here. And it created this Screen Studio-esque recording where it's chopping and speeding up and zooming in, etc. Here it takes a while, because computer use is fairly slow and consumes a lot of tokens. And we can see we have this "view code" button, and now we can actually see it's working too. Since this is very much a side project for me, I'm not really going to look at the code.
I'm just going to see that this works and I'm going to merge it. But you can keep on prompting a model to do very specific things for you. You can ask it to follow specific instructions, like a login flow, for example: you should click the button, you should log in. Most of these login steps are probably so much in distribution that you can just prompt the model with "go to this URL and click login," and it will understand which steps it needs to take. But then you can ask the model to input a wrong password or a wrong email and see what the results from the website are. Maybe the website reports wrong credentials, and then the agent would understand, "Oh, I need to put in the right credentials." So just like you would hire a QA consultant and give them instructions, you would give the same instructions to the agent. That is one way to do it. I guess the other way would be more Playwright or Puppeteer, just automating the browser, which is a bit more deterministic, as you can review it, check it in, and have other people reuse it. Does that answer the question?
My question was going more towards user acceptance testing, to check: does this thing actually look right? Testing a login you can automate; you don't need an agent for that. Does the website look right? Is it consistent through all the pages that are generated, and things like that?
Yeah. Then I use cloud agents for that a lot. There was one case, I can't remember exactly, but I think I did some changes in the docs, and I just asked it to open every single instance where this word is referenced, take a screenshot, and give it back to me. Then I could just look at all the different screenshots, see that everything was good, and merge the code. So letting the agents do the navigation and clicking around and the testing for you.
I think it works surprisingly well. This was very much an AGI moment for me when we launched this internally sometime last year. So have you had time to try cloud agents in Cursor? You should.
Which one, this one?
Well, that was my initial question: how long did it take?
Very straightforward. For this one, I did no specific setup. For our own repository, when running Cursor so we can actually reproduce something like this demo here, it's running all the backend services for Cursor. It's running all the frontend things. And this is a lot of different things, so the VM is quite beefy. But as long as we give the right instructions, it's working really well. What we did was create this internal CLI that the agent could use. We call it something like cursor dev tool: cursor dev tool backend start, cursor dev tool frontend start. And that abstracts everything away that actually needs to happen, like OrbStack running ClickHouse and Postgres and Redis, and then the frontend running Electron and then Glass here. They just coexist as two different processes. And the agent has access to everything, just as a human would. And you can have it authenticated if you store a snapshot where you are authenticated.
Oh, sorry, sorry, sorry. Okay, okay. My bad, my bad. Yeah, this one I don't really have. I can probably look it up. I would guess this is about $1, something like that. This initial one would probably be $1, and for the other ones I just asked it to re-record a bunch of different things. I can look it up later, totally. Yeah, I guess it depends on which model you're using too.
Okay. My question is about handoff between humans and agents whenever you are using different tools. In my current setup, I have a product owner and a functional analyst that work together. And they prototype very fast.
Basically without much thinking about the backend, the architectural choices, or whatever. And then they pass control down to the delivery team, which uses Cursor and has to make that stuff actually work. Which best practices do you suggest to enforce a proper workflow between people who basically don't know what they are doing from a technical point of view, of course, and the people who need to take that thing, which maybe has some poor choices in it, such as "okay, use that database," or then Claude Code changes its mind and they move from Supabase to Turso to any other kind of fancy database that happens to be in that environment, and then bring it to sound architectural choices, moving from Claude Code to Cursor?
I think what we're doing internally is we have one or two PMs, and they are building a lot of different prototypes. Sometimes it's actually in the real product itself. They're maybe using cloud agents and just prompting them. They get a video like this back of the changes, and it's like, "Oh, it kind of looks like I wanted it to," and then they tweak the design a bit. But the code might be really bad or not following best practices. If we had a good factory, it probably would follow them. But if that's the case, we hand off a link to the cloud agent. We just copy the link and send it to the engineers: "Hey, this is something that we want to build. Does this make sense? Can we do this?" And then you have a lot of intent already expressed. The other case is that the PMs have a separate repo called something like prototypes. And it's just an HTML file, a mega HTML file, reproducing the Cursor UI or the dashboard.
Yeah, the problem is the migration.
So, just a particular use case: my PO and functional team built a very fancy demo using Prisma and Turso, whatever database, and storing data on Vercel Blob storage, and then my delivery team had to migrate that to use SQL Server and C# and Aspire for the backend. And the migration was really painful.
Yeah.
Also because when they use the agent freely with no constraints, the agent sometimes decided to use Next.js, some other time decided to use Vite, and another time decided to use something else. Putting constraints in the form of rules on that agent shapes it down the path, but the problem is that we need to write a lot of rules and keep them consistent, and it is not easy to manage all that work. So we are shifting a lot of effort from having people write code to having people write guidelines and rules and whatnot, and making all the pieces talk to each other.
I see, yeah. I guess if the POs and PMs can't have access to the actual codebase, just handing off an artifact is the minimum viable intent. Back in the days that used to be Figma prototypes, right? You can click around and get a feeling for it. Now you can have them at even higher fidelity, where you have an interactive prototype using web technology without touching anything on the backend. It doesn't have to be a working thing for real if it's just an internal prototype, but just enough that your engineers can understand: "Oh, this is the intended thing. If I click this, that should happen, or if I enter some text here and click send, a row should show up here." And I think all that can be done in the frontend, kind of like a hackathon.
So you don't intend to migrate the prototype into something that becomes production, really, but rather rewrite it.
Yeah, I think so. I think rewriting, and I think setting clear expectations from the engineers to the PMs about what engineers want from the product organization and what's most helpful for them. So maybe vibe coding complete SaaS products isn't the most efficient thing.
Thank you for the presentation. My question: as we're building more and more agents and they become part of our time-critical processes, how do you see brownouts and blackouts as a new risk, and what's your view on how it can be mitigated and the impact reduced?
Yeah, it's a really good question. I think it comes down to what we talked about: the humans are still accountable for the things that are being shipped. So the humans need to build systems and observability and monitoring around the changes that are being made. I think that still comes down to understanding which are the system-critical areas of the codebase and making sure you have good observability and understanding of everything that goes on. Maybe every line should be human written in these critical parts, or at least always reviewed by one or two people. And yeah, it's easy to vibe code close to the sun and fly too close. So I think it's also a culture thing, where you have to make sure that the humans are still accountable for things getting shipped. But setting up good systems to understand the changes being made, I think that's important. And tests.
Eric, thanks a lot for the talk. I'm assuming you're probably one of the people around the world with the best understanding of how to use these technologies, so this question takes a step back from the technology and thinks about processes and how you manage yourself in your work days. I wonder how long these tasks are, or how long you get to be away from your agents without babysitting them, and how you actually invest this time. Let's say you have five, ten, fifteen minutes: how do you make the best of your time? And maybe how many agents do you have in parallel, like mental processes, and how do you manage yourself? Thanks a lot.
Yeah, it's a great question. I think there are two levers to pull. One is the scope of the change: the larger the scope, the longer the agents are going to run, and if you want them to run for a really long time, you want to have a very verifiable system so they can check their own work, etc. The other thing is how much you can parallelize: how many of these agents can you spawn off? And I think the sad reality, in some sense, is that there's going to be a lot of context switching. I probably work in four different repos, or four different areas of the codebase, at the same time, whether that is through a single feature that requires frontend, backend, database, testing, yada yada, or five completely different things. It could be docs, it could be side projects I'm exploring, it could be fixing a bug from a Twitter user. But usually it ranges from about five to ten agents running asynchronously in the cloud at all times. And while I'm waiting for these I'm either scrolling Twitter, it's true, and I also have the browser in Cursor now, so I can just stay in here and do it, or I have a synchronous task going where I'm a bit back and forth. Maybe that's fixing a small thing in the codebase, or maybe that's
planning the next thing. Maybe I'm sourcing Notion and Slack and creating a spec in Cursor using a model. So I love to plan synchronously and then execute the plans asynchronously. And once that is done, one of my cloud agents is probably done as well, so I can come back and review that and keep on prompting it a bit, maybe merging in some parts. I still need to test manually. Maybe I need to download a copy of Glass or Cursor 3, test it manually, and say, "This looks good to me, let's go ahead and merge."
Thank you. A quick question. This factory building leaves us with a scattered ecosystem of a lot of markdown files. Is there an easy way to organize these files and keep an overview of the factory you have actually built? Maintaining a factory would require you to have an overview of the processes you want your coding agents to go through. What tools do you use, what methods do you recommend, how do you keep a mental map of the factory you have built, and how do you maintain it?
Yeah, it's a really good question. I think it's somewhat unsolved as well. One of the reasons we've rebuilt Cursor to look like this, instead of the traditional IDE, is the fact that we are using more agents and we need a better control panel where you can see all the agents and manage them and spawn them, etc. So what's going to happen with Cursor 3, and this is the first stab at a multi-agent workstation, is that these are going to be nested agents. You're going to open this one up and you're going to have 10 agents in here, so you can still introspect them and see what's going on by following the traces. But you're probably also going to have, somewhere here, some kind of project view where you can see aggregated status updates: here's what everyone is working on, here's the latest, here's what you as a human need to review. So I think these are product
things that we're going to build into Cursor. But to set the spec for the factory, I would probably have a folder in your codebase where you outline how certain things should work. Maybe that's just markdown files saying here are some best practices, probably the rules, and establishing some kind of council to decide what goes into the factory and what doesn't, and what we are lacking to improve the factory. As long as it's something the agents can understand and read, which is files, that's probably what I would do, and just store them in your codebase, checked in somewhere.
Thank you. I'm just thinking about teams of the future. A year or two ago it was very reasonable to have an engineering team that might be several hundred people, several thousand people. What does this do to that, and what roles are there in the engineering team? This is kind of akin to becoming something somewhere between a product manager and an architect. So what roles do engineers have?
Yeah, I think that's very accurate. It's hard to predict what the second, third, fourth order effects of this will be. It's definitely writing less code, looking at less code, spawning more agents. We're still building software for humans, mostly, so how do we know what other humans want? How do we talk to our customers? How do we market what we're building? How do we do all these things and bring them into the actual factory? Who sets the direction? What's the intent? All these things are coming from somewhere: either it's creativity from someone's head or it's actual user demand. So having someone doing that is going to be very important. Having someone aligning that between the different humans in the org, I think, is going to be important. Having people building the scaffolding for the
other agents, just building the assembly lines where the agents can actually run, I think that's also going to be important. But to what magnitude, and how many people it's going to be, I don't know. It's really hard. You can do a lot with the models right now with a very small team if you have the right setups in place, depending on the domain you're working in. I don't know, do you have any predictions?
I see issues from a labor perspective. If you're working in an incredibly agentic environment, what happens to training new grads, hiring new grads, and the future from that perspective? What happens with office politics and land grabbing? Because basically your value now lies in your ability to configure and set up your own agentic team, not in your ability to program and be productive anymore. The 10x engineer is no longer about, you know, words per minute. It's prompting.
Yeah, token usage.
Am I paid in tokens? Is there a leaderboard? [unclear] and then my token usage takes away from that. How do you optimize?
We've got to train the model to be more political, I think that's the solution, right? [unclear] I guess we're going to have more of that if the agents are doing our work.
Hi Eric, thanks for the talk. I was wondering: you are probably using some kind of issue tracking tool at Cursor, like Atlassian or Jira. I was wondering if you are using agents to automatically check and read tasks directly from Jira, for example, and spawn sub-agents to perform the work, or if there is always a human that starts the work using Cursor.
So we're using Linear for issue management, and we have this first-party integration as well, so for
every ticket that's getting created in Linear we spawn a cloud agent. One place where I interface with this most is feature flags. If a feature flag for a specific thing has been rolled out at 100% for two weeks, the system signals, "Hey, this is a stale feature flag at this point, you can remove it." We have this create an issue automatically in Linear, and since that is hooked up with Cursor, it triggers a cloud agent to remove the feature flag. So it's completely automatic once the system knows it's rolled out to everyone, and I can just look at the code and say, "Okay, we can merge this, the feature flag is no longer active." And we do this for everything. Once you post something in Slack, we either have a Linear Slack agent look at it, or we have a Cursor automation look at the message that was posted, triage it, and look for duplicates, or, if it's determined to be easy, start to implement the fix for it immediately. And this is an example of where a human is in the loop where they might not have to be. It could be me going on Twitter and seeing a tweet saying something is broken with the plan mode button dropdown. I can copy that into Slack and then have the agent perform the work. But there's probably a way where we can just source this feedback immediately, without me having to scan it and triage it and copy-paste it. So that's a bit of how we work with Linear and issue management. But since we're spawning a cloud agent for every single thing, it provides a good way for us to dogfood the product and test it out. I'm not sure if I would recommend that for everyone, because it can be quite costly, as cloud agents are a little expensive.
Do you have something in the roadmap to run something locally? I'm just thinking of an alternative
called dev containers, and opening in that. But do you have something planned in the roadmap for that?
I think the closest thing you can do is probably just prompt the agent to run for a really long time. It's kind of the same thing as running local models. I've probably tried, like once a month, running the best open-source local model and seeing how it works in Cursor, but it's never the same experience as using GPT or Claude or Composer. And the same goes for running really long things locally. I've found it to not work that well, because if it's running for a long time it's probably going to reuse your local database and your other local stuff, and it's going to prevent you from doing other work locally unless you create a VM on your own machine. And if you do, you could probably... wait, never mind, just ignore everything I said. We launched Cursor workers. We launched it yesterday. A Cursor worker is a way for you to run the same infrastructure and orchestration layer as we do for cloud agents, but on any machine you might have. So you can do... not right now. We can do cd dev, let me see. Yeah, so we have the agent CLI, and there's now a worker, and you can call worker start. From here we have a worker running, and this worker is going to show up in here. Let's see, so we can do self-hosted. Let's see here. Oh, I don't think it's hooked up yet, there's a different account I'm running it on. But essentially you can run this on any kind of machine and get access to it from Cursor cloud. So you can spawn multiple of these on your own machine, or you can run a Mac mini, or you can have a VM in any cloud platform provider.
Just to follow up on that: you're saying that we can have isolated environments locally using this command?
Yeah.
So it still calls the frontier models or
Composer models? Yes, exactly. So this is going to leverage the Cursor harness, but it runs wherever you're spawning this daemon.
Yeah, that's interesting, thank you.
So I built this Cursor claw thing where I have one running on my Mac mini, and that has access to iMessage and calendar and all these other things. Yesterday we launched automations as well, so I can get a daily report or weekly report of everything that's going on on my machine that I might want to know. On a specific instance that is running the agent daemon, you will get access to this in Slack and on the web, and in the mobile app that's coming at some point not too far out.
What's wrong with, so what, oh, it's going to use swifty wise, so it's probably going to be compatible with i-possible problem. Got it.
Yeah, I just want to ask quite a simple question. When you have more than one developer in your company and you're spawning hundreds and hundreds of agents to do a lot of different kinds of work, how do you ensure you don't step on each other's toes doing the same kind of work? And internally, do you still use Scrum or agile ways of working, or is that already kind of going out of the window?
Yeah, what are we doing? We're not really following any traditional methodologies in that sense. We do have monthly goals and things we want to get shipped. But I think since everyone has so much power at their fingertips with agents, this causes people to have extreme ownership over certain things. So for the longest time there was one guy building MCP and rules and all kinds of accessibility by himself. And now we have maybe one person focusing on MCP, but they can own everything around MCP and they don't really need to interact that much with other teams. But at some point that's going to break too.
So far in the history of Cursor we have found ways of getting around this. The agentic code owners thing was probably one place where we stepped on each other's toes, where the code owners were misconfigured. So instead of having a deterministic thing, can we just pull in the relevant people at the relevant time? Something like that is probably going to happen with other problems that are going to surface in the future.
Thank you.
I think computer use is the one thing that's still in early access. I think we haven't shipped it GA yet, but it's coming for sure. So this should be completely on parity with the cloud agents.
Okay, can you describe the profile of these kind of mixes between product managers and engineers that take this ownership?
Hmm, yeah. So I guess the archetypes we have: there are PMs. They talk a lot internally at Cursor: they talk with go-to-market, with sales, they talk with engineers, they talk with users. They just product manage and keep everything together in a way, and also shield engineers from various things. And then we have designers. Designers work, I would say, 50/50 in Figma and code at this point. All of them code, all of them push to production. But it's a lot of exploratory work, like what does it look like when you have 10 nested subagents? You can't really feel that in Figma; you've got to actually develop and prototype that. And they work with PMs. And then we have engineers, of course. But I think Cursor is very fortunate to build developer products, so developers are building developer products. They have good taste, they know what good and bad look like, they know what developers want and don't want. And I think because of that they can take such ownership and they can go with
the concept and go really, really far. Whereas the PM might be setting more of the business and the overall overarching direction, and then the engineers and designers collaborate on what this actually looks like in code, but also how it should feel and how it should look for a developer.
Makes sense. Are there analysts in this mix as well, or is that done by the product managers?
Oh yeah, that's a good one. So we have a data team as well, data scientists and analysts, and they are also working closely with PMs, of course, on understanding how users are using the product and where the bottlenecks are, but also with engineers on instrumenting the code in the right way and understanding feature flags and why certain users hit certain paths and some don't. So everyone is just working together. And I think the way we structure the teams is pretty much by domain: extensibility might be one team, cloud might be one team. And cloud should still be extensible, so then they also have to work together. But we try to keep it modularized and not ship our organization that much.
Thanks.
Cool, I guess one final question, if there is one.
So from time to time I've messed up and started a cloud agent in the wrong repo or something, and it just went off on tangents. I came back an hour later and it was separately trying to get access to that repo. Is there any way to catch these agents that don't provide any value? They just continue doing stuff, but they're not really making progress.
Yeah, I think that's on us, for sure. Over the last year we have made a lot of improvements to the cloud agents. Initially, when they worked, they were extremely useful, but most of the time they weren't. Again, cloud agents also come from this internal need of us just wanting to run things asynchronously. Because of that, we have put a lot of effort into making our own code base work really well in cloud agents, so sometimes we have to create new projects and jump into other projects and talk to our customers to understand where things fall short. And we try to have instrumentation: does the agent run for X amount of hours or minutes, does it touch any files at all, is it going in circles, loop detection, these kinds of things. This is part of the observability I was talking about before. Most of that should happen on our side, but there are always going to be very specific contextual things where you, as the code base owner, need to set up certain things. But yeah, we're working on improving it, and if you have any examples, please come to me and I'll try to take a look.
I think the worst was when I started it on the wrong repo and it just pulled out the Slack MCP and tried to get access in 10 different ways and failed.
Yeah, yeah, we could make that better. We're working on it.
All right, thanks everyone for coming. I'll be around for the next two days as well, so please grab me if you want to discuss anything Cursor, or anything at all, actually.
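The instrumentation signals mentioned in that answer (long runtime, no files touched, going in circles) can be illustrated with a small sketch. This is not Cursor's implementation: the `AgentEvent` record, the `diagnose` function, and both thresholds are hypothetical names and values chosen only to show the idea.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEvent:
    """One step in a hypothetical agent trace (assumed shape, not a real API)."""
    minute: float            # minutes since the run started
    action: str              # e.g. "slack_mcp.auth" or "edit_file"
    files_touched: int = 0   # number of files changed by this step


def diagnose(events, max_idle_minutes=30.0, max_repeats=5):
    """Return reasons the run looks stuck; an empty list means it looks healthy."""
    reasons = []
    if not events:
        return reasons

    # Signal 1: the run has gone on for a while without touching any file.
    runtime = events[-1].minute - events[0].minute
    touched = sum(e.files_touched for e in events)
    if runtime >= max_idle_minutes and touched == 0:
        reasons.append("no_file_changes")

    # Signal 2: the same action keeps recurring without producing changes.
    fruitless = Counter(e.action for e in events if e.files_touched == 0)
    if fruitless and max(fruitless.values()) >= max_repeats:
        reasons.append("possible_loop")

    return reasons
```

An agent that retries the same access call ten times over an hour would trip both signals, while a short run that reads one file and edits another would trip neither. Real loop detection would likely need to compare arguments and outputs as well, not just action names.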
TL;DR
- Building a "software factory" with AI agents shifts the human role from a direct code "worker" to a "manager" who delegates tasks, oversees processes, and ensures consistent, verifiable outputs.
- A robust software factory relies on well-structured code, dynamic guardrails to steer agent behavior, and comprehensive testing mechanisms that enable agents to self-verify their work.
- Scaling the factory involves automating human-in-the-loop tasks, providing agents with "enablers" like specialized skills and isolated environments, and continuously improving the system based on agent feedback and performance.
Takeaways
- Levels of AI Autonomy: Progress from basic "spicy autocomplete" to a "pair programmer," then to AI generating most code (human reviews), and finally to a "software factory" where AI agents autonomously ship, test, and build code with human input limited to defining intent.
- Foundational Code Structure: Design your codebase with modularity, co-located code, and established usage patterns (e.g., authentication methods, test scripts) to provide clear references and make it easier for agents to learn from and reproduce.
- Implement Dynamic Guardrails: Create rules, checks, and hooks to guide agents and prevent costly mistakes in sensitive code areas. These guardrails should emerge dynamically based on observed agent misbehavior, effectively becoming automated SOPs.
- Enable Agent Self-Verification: Equip agents with the ability to verify their own work through various tests, including unit, integration, UI, and end-to-end tests (e.g., using Playwright for browser interactions), ensuring changes are correct.
- Provide Agent Enablers: Empower agents with "skills" (e.g., feature flagging for safe deployments) and access to reproducible isolated development environments (e.g., separate VMs) to foster greater autonomy and infinite scalability.
- Shift to a Managerial Mindset: Transition from hands-on coding to managing a fleet of agents. This involves scoping and parallelizing work, preserving "tribal knowledge," and front-loading comprehensive context for agents to execute longer, asynchronous tasks.
- Automate Human-in-the-Loop Tasks: Identify routine human interventions (e.g., copying logs, aggregating user feedback, extracting specs) and automate them using agent skills or integrations to eliminate bottlenecks and enhance efficiency.
- Continuous Factory Improvement: Establish feedback loops by observing agent outcomes and identifying "off-rail" behavior (e.g., incorrect database schemas, poor UI). Use these insights to refine guardrails, build design systems, and update agent learning mechanisms (e.g., via "continue learning" plugins).
Vocabulary
Dogfooding — The practice of an organization using its own products or services internally.
Software Factory — A highly automated system where AI agents autonomously generate, test, build, and ship software with minimal human oversight, akin to a physical manufacturing assembly line.
Guardrails — Rules, checks, and hooks implemented to constrain AI agents' actions, preventing them from making undesirable or costly changes, especially in sensitive code areas.
Levels of Autonomy — A framework describing the progression of AI agent capabilities in software development, from basic assistance to full autonomous operation.
Primitives and Patterns — In the context of a software factory, these refer to the fundamental structural elements (e.g., modular code) and established ways of doing things (e.g., boilerplate code, specific service methods) within a codebase that agents can learn from and reproduce.
Enablers — Tools or capabilities provided to AI agents, such as "skills" or access to isolated development environments, that allow them to perform tasks more autonomously and effectively.
Feature Flagging — A technique that allows changes to be deployed to production in a disabled state, then activated for specific users or groups, enabling safe autonomous deployment by agents.
Verifiable Systems — Software systems designed such that AI agents can automatically test and confirm the correctness and functionality of their own changes (e.g., via unit, integration, or UI tests).
Human-in-the-loop — Describes tasks or processes where human intervention is still required, typically for critical decisions, feedback, or bridging disparate systems, which can often be targets for automation in a software factory.
Tribal Knowledge — Unwritten information, practices, and insights specific to a project, team, or codebase that are important for agents (and humans) to understand for effective development.
Transcript
Okay, so we're starting five minutes early. Hey everyone, I'm Eric, I'm an engineer at Cursor and I mostly work on developer experience and product. Today I wanted to talk to you about my experiences working at Cursor, dogfooding the product, and getting to a place where you can build your own software factory: what that takes and the practical steps to get there. To be honest, I don't think we're really there yet; sub-parts of the product and sub-parts of the company are running fairly autonomously. But building a software factory takes a lot of work. I mean, look at real-life factories producing hardware. There are a lot of assembly lines, there are a lot of people that go into this, a lot of managing, observability and all that, and there are a lot of concepts we can borrow from that world and put into the software world. So anyway, here are my observations from doing this. But first, the agenda. I want to talk about levels of autonomy, the precursor to the factory (pun intended), building the factory, running the factory, and then scaling the factory, and I want to finish with some Q&A for any kind of questions. Okay, so for the levels of autonomy, Dan Shapiro put out this blog post, I think in January or February. We have six different stages of autonomy for automating software. Karpathy also previously used Cursor as an example of going from tab to agent and all that. But I think this encapsulates it really, really well. So we have the spicy autocomplete at the start. That is kind of where Cursor started in '22, '23, ages ago at this point. And we gradually moved up the ladder, making software creation more autonomous and letting the agents do more work.
And I think most people adopting AI tools are somewhere between level two and level three, where you have a pair programmer and you're essentially just going back and forth with the agent: asking questions, getting suggestions, asking the agent to do work, and eventually finishing the task. The step above that would be having the AI generate the majority of the code, which we can see here in the developer level three, where you as a human more review it, kind of in the loop, following traces and all that. But as you progress further, you're becoming more and more of a manager, and we'll talk about this more later. Eventually there's level four. I think this is where I'm at at this point for most software projects, where I'm delegating as much work as possible to agents and probably reviewing the outputs before I actually review the code, because I still look at the code sometimes. And lastly we have the software factory, which is essentially a black box. Dan Shapiro calls it the dark factory, where you don't really have any insight. It's just agents going around doing their thing: shipping the code, testing the code, building the code, all that. And you as a manager just provide the intent and the instructions and the goal for what you want out of the factory. Okay. So why do you even want to create a factory? First of all, throughput: you probably want to create more code with fewer resources. You can run agents 24/7; you don't have to rely on humans that need sleep and food and all that. You can just have more agents. Another thing with factories is that you have assembly lines, and assembly lines produce consistent outputs. So if you build your factory right, you can probably have very consistent output.
But initially, if you don't have the right setup, you might feel like the agents are getting more and more probabilistic and you're losing a lot of the determinism, because they just go off and do random things. That is probably a sign that you need to build more guardrails for the factory. And I think this is a function of the model capabilities as well: as the models get better, they can follow instructions better and just execute on whatever you want them to do. And thirdly, you might want to have a factory because you can leverage your taste better. You can get more of your creativity out instead of waiting for you as a human to produce the software that you're creating. And then the obligatory then and now: this is what it used to look like. This is a Tesla factory from a couple of years ago, and it is kind of what we're getting at here. Okay, let's get straight into it. So to build a factory, what do you actually need? I like to think of this as primitives and patterns. So how do you structure the code? Is this a modularized code base? Do you have things scattered all over the place? Is it co-located code, etc.? It's about the distance in locating things: if you have an agent ls-ing a folder, it can discover all the relevant files at once instead of having to grep and search the whole code base. It can be very isolated, working within one single part of the code base. And the same goes for humans: if you have an easy time onboarding yourself to a new code base, an agent probably will too. The second thing is usage patterns. Do you have specific methods and services for authenticating a user? Do you have startup scripts? Do you have a way to write tests, etc.? Do you have this boilerplate in place?
Because if you do, you can point the agent to existing references and just ask it to reproduce them over time. So those are some of the primitives and structures of the code base. The second one would be guardrails. You want to let the agents be free, but not too free. So you want to have some rules and checks and hooks in place. For example, a hook you might want to have is on touching a specific part of the code base. Maybe the agent should not be able to change the most sensitive parts, like encryption of sensitive data or authentication or anything like that, where a mistake could be very, very costly for the company or for you as a human. Rules are probably the most misunderstood concept since we launched Cursor rules. There's Cursor Directory, which launched a good collection of different rules, and the assumption was usually that you should just install every rule that you can depending on what software stack you're using. For example, if you're using Next.js, maybe you should have Next.js rules. But what I've found, and what I'm seeing among our users and internally, is that rules should just emerge dynamically. If you're finding agents going off the rails, you should probably create a rule for that. And it should be sort of like an SOP showing agents what they can and cannot do. And again, the models are getting so good at following specific rules that they usually don't go off the rails anymore. And I think that's just going to extrapolate over time as well. And of course tests: can the agent verify its own work, and can it run tests and know, oh, I messed something up, or, I made a change in this specific area of the code but it still passes. I can still run the code and the check looks good. And lastly, which I think is probably most exciting, is the enablers: what can you allow the agents to do to actually let them be free? Skills are good for this.
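As a rough illustration of the hook idea from the guardrails part above (blocking agent changes to sensitive areas such as authentication or encryption), here is a minimal sketch. The patterns, function names, and return shape are all assumptions made for the example; the talk does not describe how Cursor hooks are actually configured.

```python
from fnmatch import fnmatch

# Hypothetical protected areas; a real project would list its own.
PROTECTED_PATTERNS = ["src/auth/*", "src/crypto/*", "*.pem"]


def blocked_paths(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that match a protected pattern."""
    return [p for p in changed_paths if any(fnmatch(p, pat) for pat in patterns)]


def check_change(changed_paths):
    """Return (allowed, message) for a proposed set of file changes."""
    hits = blocked_paths(changed_paths)
    if hits:
        return False, "Needs human review: " + ", ".join(hits)
    return True, "ok"
```

A check like this could run before an agent's change is accepted, so that edits outside the protected areas go through and anything inside them is escalated to a human. Note that `fnmatch` lets `*` match path separators, so `src/auth/*` also covers nested folders.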
Just giving agents more capabilities: skills and MCPs, accessing external context, getting an understanding of how to implement a certain thing. I'm going to show you later what we are doing in the Cursor code base. For example, feature flagging. Can we give the agents a skill to add a feature flag? So when we launch them autonomously, they can just flag the actual changes made, merge the PR, and come back to us like, hey, if you want to try this, just turn on this flag. If you don't like it, we'll just revert the PR. If you like it, we can expand it to more users. And lastly, what kind of environment are you letting the agents run in? Can your agents start your dev environment? Can you just ask them, hey, start my project, and let them do that without having any human in the loop? Because if that's the case, you can probably have them run and scale it up infinitely on separate VMs. And then this checklist is what I'm usually following when thinking of building the actual factory. Part of that is: is it runnable? There's a typo in here; I blame my Swedish. Is the context that the agents need accessible? Can they interface with Linear or Notion or Datadog or Slack, etc.? Then they can see the broader context of the intent that the user has. And lastly, where I think people should be spending a lot more time, is building verifiable systems. How can the agents themselves verify their own work? Whether that's through unit tests or integration tests or UI tests, actually clicking around in the DOM and trying to reproduce things that are actually happening for the end user. This is arguably easier for backend systems, where there's no UI really and you can have clearer contracts and boundaries of what should work and what shouldn't. Whereas for web and UI and all that, you actually need to click around and make sure things work.
The buttons actually have a loading spinner, etc. Okay. So this is part of building the factory. So if we switch over to Cursor here, I'm not sure if you've seen this, but this is Cursor 3. We launched this a couple of weeks ago and it's a complete rewrite of Cursor. There's no VS Code anymore. Most of you are probably familiar with this type of Cursor: we have files and sidebars and a lot of different things. Whereas this is a bit more streamlined for an agent-first workflow. And we'll get to why we created this as well at a later point. But I wanted to show you some parts of some rules, etc. Let's see where I put them. So for example, I built this music agent project, and if you've used Ableton before, you probably recognize this. Yeah. Yeah, yeah, I'll expand it. More? Okay. Yeah. So if you've used Ableton or any music production software, you probably recognize this interface. Oops. It's not really working at this size. But what I essentially asked the agent to do here is, can you start a local dev server? And we can see that it worked for a while. It explored some files, read package.json, and based on this, there is a start script. So package.json, all these dependency files, are so in distribution for the models that they know we should immediately go to package.json in a JS project, where it exists, to look for a start script. And this is a good example of having a pattern that is predefined and making your code base more in distribution in that way. Because now it's super easy for the agent to understand, oh, I should just go in here and start a server. So it started a server. It's running on localhost:3000. And let's see here. We can see that we had this AGENTS.md file. So AGENTS.md is like Cursor rules; it works across many different harnesses.
And what I wanted to accomplish with this project is essentially building a factory around this idea of building an online music creation tool. To do that, I forced myself never to write any code myself, and to try not to look at the code that much either, and just try to figure out what systems and structures I need around this. Immediately, it became pretty clear that we need a way to start the project. We need a way for the agent to verify its work. So the agent created these end-to-end tests using Playwright, so it can just spawn browsers, go through routes, etc., click around, and get elements by test ID, making sure that for every change it makes, for example, the play button still works, or I can add notes to this project here without anything breaking. So these are some examples of how you can create verifiable outputs like that. Okay, we have Vitest, we have this, etc. So let's see here, if we go back. Oh yeah, another option here; I was casually scrolling on Twitter. A different way to verify the work is using automated code review. You can ask the agent to just review the changes it made, or you can use a more integrated tool like Bugbot that we have in Cursor, which looks at different PRs in GitHub, reviews them, and comes back. And this is also one piece of the whole factory: you should have multiple different stages where you plan it, you produce it, you review it, and you essentially follow the whole SDLC, but you automate and codify this work. And I want to show you this as well. So we launched updated cloud agents in the last couple of weeks, where we gave each agent its own separate VM, and you can have them create a very reproducible environment in the cloud. And this essentially allows you to scale infinitely. But we also gave the agent a tool to test its own work by controlling the computer.
So for example we have Glass here, which is the interface, and I asked the agent to, let's see here: Glass agents, see if they're rough with the keyboard, Control-Tab, etc., so better accessibility and using the keyboard to navigate the agents. And I asked it to make the change and then record this with the full editor, because the first one was just a sidebar. So what we got back here is just a video of the agent actually testing its own work. So we can see that it has this highlighted row; I'm not sure if you can see that. But it's just some context for me as a human to verify the work. And then it's actually clicking around and using the keyboard to navigate. So with this we're getting kind of far with the factory. A lot of the things are automated: review is automated, the testing is automated, we have some rules to steer agents, etc. But there's still a lot more to do. So I think when you have this in place, the most important thing you can do is shift your mindset. You are going to look way less at code. You are going to go from worker to manager. Instead of doing the work yourself, you're overseeing a lot of agents doing the work for you. This also means going from sync to async, because most of the work is going to happen in the background. You can still tap in and see what's going on for different agents. But the more agents you spawn over time, the harder a time you're going to have understanding what's going on in each of them. So then you need a way to aggregate these changes upwards. And I think it's so interesting that it's just the same as in human organizations. All the same principles kind of follow. You start with a very small team, and then you add more and more people because you need to get more throughput, and all of a sudden you need a manager to oversee things, and then you add more managers, and then you need a manager of the managers.
And this is essentially what's going to happen with agents too. But you are just going to keep on going up the levels of abstraction. So when you're a manager, you need to start thinking about how you scope and parallelize the work, because you want to get high throughput. But it's not necessarily good to make all the changes at once. For example, if you have two different tasks working on the same part of the code base, you're going to get merge conflicts. So you still need to plan out, scope, and parallelize the work. One unit of work can always be one agent. So then how do you take a long, long list of things you want to do and actually make the most out of that and run as many agents as you can? To do this, I think it's important that you preserve tribal knowledge of the code base. You still understand what's going on in the different systems. You know how data flows, what the users want, which parts are critical and which parts aren't. So not outsourcing too much to the agents, but being very direct and managing them pretty well. And when you're going from sync to async, you are going to need to trust the agents a lot more, because you are going to send them off to do longer and longer tasks. And when you do that, you need to give more context up front. So you front-load the context for the agents, say into a plan or a long spec, and then you send them off and let them go. And once you start doing this regularly, you're going to start to feel the agents. You're going to understand the models and you're going to see, these are the weaknesses, these are the strengths, and you get this alignment with the models. So you know how to prompt them and what intent to give them.
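The scoping point above (two tasks touching the same part of the code base will produce merge conflicts, so they should not run side by side) can be sketched as a simple greedy grouping. This is only an illustration under assumed inputs: `plan_batches` and the task and path names are hypothetical, and it presumes you can estimate up front which paths each task will touch.

```python
def plan_batches(tasks):
    """Greedily group tasks so that no two tasks in a batch share a path.

    `tasks` maps a task name to the set of paths it is expected to touch.
    Returns a list of batches; each batch is a list of task names that
    could be handed to separate agents at the same time.
    """
    batches = []  # list of (task names, union of paths claimed by the batch)
    for name, paths in tasks.items():
        paths = set(paths)
        for names, claimed in batches:
            if claimed.isdisjoint(paths):
                names.append(name)
                claimed |= paths  # in-place update of the batch's claimed set
                break
        else:
            # No existing batch is conflict-free for this task: start a new one.
            batches.append(([name], set(paths)))
    return [names for names, _ in batches]
```

For example, with one task touching `src/transport`, one touching `src/piano-roll`, and a third touching both `src/transport` and `src/audio`, the first two land in one batch and the third waits for the next. Paths are compared exactly here, so the granularity (file versus folder) is a choice the person scoping the work has to make.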
And again, as the models keep getting better, you have to give them shorter and fewer prompts than you used to, but you've still got to provide the intent and be very clear about the change you want the agents to make. And there are no shortcuts to this, from what I've found and from what the team has found. You've just got to spawn a shitload of agents and let them do the work and see what happens. And as long as you have good safety guardrails, you can just let them do that. So you probably shouldn't let them push to prod straight away. Yeah, so this kind of comes down to: personally, I'm always using isolated environments in different VMs. I just tweeted about this, actually. On one hand, if you're sharing the workspace, you can have git worktrees, where you have shallow copies, essentially, of the code base on the same machine and you can reuse services. But you're still going to have to branch every database or cache or user management to have reproducible and separate environments. If you are going to make a lot of changes at once, you want to know that they are pure and that they're not having side effects on the other branches. And that's why I've found just using cloud agents to be much better: I spawn a VM, and this VM can run a database, internal tooling, other stuff, and the Cursor app itself, and then the agent just works in that isolated environment. It is more expensive, and it's going to take a lot more work to set up your factory or your environment to support this. But once you have it set up properly, you can scale this to 100 or 1,000 agents. I'm not sure how many we are running today, but I bet it's multiple thousands a day, just agents running in the same code base or copies of it. So that's what I would recommend. Yeah, so when you're a manager, your job changes quite a bit.
So you have to look at your system as a whole. You've got to think about where the human in the loop is needed. For example, do you have a log service like Datadog, and do you need to copy the logs, go into the code base, paste them, and run the agents to identify and trace down issues? Or do you have user feedback that you need to copy-paste from Twitter into somewhere else and let the agents do something with that? Do you have a Notion workspace where you have all your specs, and you need to copy-paste from Notion or export them into Markdown and then hand them to agents? There's probably a way to automate all these different things, whether it's skills or MCP or separate automations. So think about where the human in the loop is needed and try to automate that away. The second thing is: how can you catch agents going off and not doing what you actually wanted them to do? This is the perfect flywheel for improving your factory as well. If you see agents creating wrong schemas in your database because they don't follow your naming conventions, etc., that's probably a rule somewhere. Or if they are just producing really ugly UI, there's probably a way for you to create a design system and make the agents aware of it, so they can incorporate that and use it in the next iteration. And then you take all these learnings and you use them to actually improve the factory. And thirdly, it comes to scaling the factory. So now you have your environment set up, you know how to be a manager and manage a fleet of agents, you scope the tasks, and you do all this. So how do you actually take it from five agents to ten agents to 50 to 100 agents? And the thing is, again, not looking at code is going to be a real thing if the models get better, and they are getting better.
So it's about observing the outcomes, the same as before: where do they go off the rails, what are they producing, what are the artifacts? How can you make it so that the agents can also verify their own work and the outcomes they produce? You should set up automations, and you should look again at the things you're doing repetitively.
One thing we could do here, for example: if we go to Cursor and go to this music agent again, I can ask it, looking at my chat history, what repetitive tasks I'm doing. So we can ask the agent to look at this and identify potential opportunities. It's searching the agent transcripts and producing some kind of artifact from this. I actually built this into a plugin. Let's see here: planning-execution loops, product direction, Ableton-like UI iteration. I should probably put this in a rule saying "make it look like Ableton." Tooling, housekeeping, et cetera. This project is very short-lived, but if you look at an actual production codebase where you have prompted a lot over time, you're probably going to find things that you do recurrently.
And I want to show you some things that we at Cursor are automating. Some of these are not that obvious, but one is, for example, daily review. I have this automation for my own daily review: it looks at Slack, it looks at GitHub, and it sends me a summary of the things I've done over the last day. Previously I would have done this by writing down my notes, thinking about what I got done today, or by prompting an agent with access to MCP. But now I can just put this on a schedule and have it done automatically. I want to show you a different one, for example: read merged PR comments. This is also a way for you to learn over time.
So for all the PRs that we merge in our main repository, we can look at the comments: what did humans actually review here, and what did they say about the changes I made? Because if a human actually goes in, reviews a PR, and leaves a comment, there's probably high value, high signal, and high intent in that comment, and we can store it so the agents can learn over time.
We have another one which I can show you here. This one, the code owners one. We essentially had this problem where we had code owners in our codebase, and they were right most of the time, say 80% of the time, but the other 20% of the time they caused a lot of bottlenecks for us internally. We were blocked on merging a PR, we needed someone else to review it, and maybe they were in a different time zone. So we started building this agentic code owner, and what it essentially does is look at PRs and check, first of all, what the risk level is. Is it just changing a variable name, or is it changing a constant that controls how long a trial subscription is, or something like that? If it is low risk, it can just approve the PR, because we don't want to block our own engineers on these things. But if we can see that it is a high-risk PR, we can find who made changes to this code previously and pull in their feedback. That makes the most of this: first of all, making the code safe and not breaking systems, but also keeping the person who made the initial change in the loop, keeping them up to date and refreshing their context on what's going on. So it goes both ways, and there are multiple value-adds from doing this. Let's see if there's one more to show.
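The triage flow described for the agentic code owner can be sketched in a few lines. This is only an illustration under stated assumptions, not Cursor's implementation: in the talk the risk level is judged by an agent reading the PR, whereas this sketch swaps in simple path patterns as a stand-in. The names `HIGH_RISK_PATTERNS`, `Decision`, and `triage` are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical patterns for sensitive areas; a real setup would define its own,
# and the talk's version uses an agent's judgment rather than path matching.
HIGH_RISK_PATTERNS = ["billing/*", "auth/*", "migrations/*"]


@dataclass
class Decision:
    risk: str             # "low" or "high"
    auto_approve: bool    # low-risk PRs are approved without blocking engineers
    reviewers: list[str]  # previous authors pulled in for high-risk PRs


def triage(changed_files: list[str], previous_authors: dict[str, list[str]]) -> Decision:
    """Approve low-risk changes; route high-risk ones to earlier authors."""
    risky = [
        path for path in changed_files
        if any(fnmatch(path, pattern) for pattern in HIGH_RISK_PATTERNS)
    ]
    if not risky:
        return Decision(risk="low", auto_approve=True, reviewers=[])
    # Collect everyone who previously touched the risky files, without duplicates.
    reviewers = sorted({a for path in risky for a in previous_authors.get(path, [])})
    return Decision(risk="high", auto_approve=False, reviewers=reviewers)
```

For example, `triage(["docs/readme.md"], {})` yields a low-risk, auto-approved decision, while a change under `billing/` is held and routed to whoever touched that file before.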
No, I think that was pretty much it. Or, yeah, I have one more thing, called continual learning. Continual learning is another type of automation that I created a couple of weeks ago, and it essentially does what we just did with the agent. We look at the previous transcripts and extract memories and learnings from what we said, for instance when we corrected the agent to do a certain thing, like "use this component instead of that component" or "always give very verbose descriptions of what you're doing." Instead of going in and asking for this every time, I could create a rule, but I'm kind of lazy and don't remember to create one. So instead we have this continual learning plugin that looks through the transcripts and stores this as a rule for you.
These are all examples of systems to automate yourself away and to automate things that the agent can do for you. I think that's the important part of building these factories: how can you identify the flywheels and loops where you can automate yourself away by building systems?
And you are going to move up in abstraction. Now you're managing five to ten agents, but tomorrow you might be managing an agent that manages other agents, and that is just going to grow: you're going to have a lot of sub-agents under you, working for you.
So, what I want you to take away from this: be very clear about the intent, and really think about what the actual problem to solve is and what you want to get out of it. Don't outsource important decisions; make sure you stay in the loop for them, whether that's safety, security, databases, payments, or authentication. Some things are really important and should not be decided by agents but by humans.
Build tools and systems: try to find these flywheels, codify them, get them into your systems, and give the agents access to them. Store context for later, whether that's agent transcripts or artifacts of things you think look good, because this helps the agent know what good and bad look like over time. And it's going to change, so storing the context, building the tools, and keeping them up to date is more important than actually doing the work, because this provides the framework and the guardrails for the agents.
And lastly, let the agents be free: think about what they need. I have a friend at Lovable who mentioned that they gave the agent a vent tool, so the agent could complain about things while it was running, and the complaints were posted straight into a Slack channel. The agent started complaining: "Hey, I can't access this image, I'm very frustrated about this." They set it up as a joke, but then they started scrolling through and realized it was actually very valuable: "we should probably give the agent access to reading images." They did, and then the agent started complaining about something else that was a problem with the harness. So find ways to let the agents be free. I think that's a very important thing.
Okay, that's kind of it. That's the direction, and the things we have found building Cursor and taking Cursor towards a software factory. I hope you learned a thing or two and can take some of this away. I'm happy to take any questions about anything Cursor. Or actually, now we have the microphones coming.
Thank you very much. I have a question about code quality, or architectural quality. When agents ship tons of code and you can barely review it, how do you ensure the code is extensible and so on?
I mean, you can establish hooks or guardrails for measurable things, like, I don't know, the number of lines in a file should not exceed some limit. But architecture is not measured this way. And agents have this completion bias: they want to finish the task as soon as possible and they don't think ahead. They don't have a picture of how the code will evolve in the future; they just want to finish the task now. Thank you.
Yeah, it's a good question. I think we as humans have the same problem, but it just takes a lot more time for us to discover it. The good thing about agents and models being essentially completion machines is that they will look at existing references and continue down that same path. So if you have existing things you can point them to, I think that's very important. If you don't, I think there's a case for letting the agents do one-off implementations here and there, and then eventually having another agent refactor, like we do as humans as well: one to generalize, build abstractions, and all these things. So how can you build a system to detect this and verify that the abstractions getting built are also good and in line with what you want to do? But I think it's going to be a lot more architectural review for humans, and scoping and planning of what the architecture and system design should look like. But yeah, it's a tough problem. Thank you.
Hello Eric, thank you for the talk. When it comes to the activities of building the factory, one thing I observe, for example with building things like rules in a team, is that because it's so new, almost everybody feels "oh, this is a rule for me and I don't want to inflict it on other people." And I notice this creation of silos, where each engineer ends up having their own separate factory.
Do you have any advice on how to bring it to the point where the whole team is contributing to the creation of the factory?
It's a great question. I think it's hard. I think it's very cultural as well. We developers have always created our own tools and we want our own custom setup, but at some point we have to unify on a certain structure. Historically we have had PR reviews and similar things as a ceremony to align on the code being produced and make sure it's consistent. I think we're going to take the same principles and apply them to the tools we're building as well, and to the guardrails, enablers, and primitives. So, I don't know, establishing some kind of forum where you can discuss these things and plan: what do we want the factory to look like? What are the components we need? What are the integrations we need? Do you have any examples of specific things? Is it flavor, or is it bigger changes that the agents are making?
When it comes to rules: one person wants to write the test first, and they create a rule to write the test first, but they know that somebody else doesn't want to work that way. So they keep the rule only on their machine. They don't share it, because it is unique to how they work. The whole team is collaborating on creating the codebase, but not on creating the factory, on deciding: have we decided that the factory writes the test first or not? That is a big decision, and it is hard to align everybody and get it accepted. With all of these rules, not everybody is going to be completely on board, and in most cases it doesn't matter if you differ a little bit, but it is hard to do.
Yeah, I guess it is a human problem and a human change we have to make. It's a good question. I'll think about it. Thank you.
Thanks for the talk. A lot of the patterns resonate.
I was wondering what is needed, what kind of patterns you can suggest, to take it to the next level if you work on enterprise, brownfield, mission-critical systems that cannot fail and cannot be insecure. If you look at the recent supply-chain attacks, even giving your agents sandboxes may not be enough. The humans remain accountable, and we can't say "it's not my fault, my agent did that." Do you have any extra patterns, or is it just inherent that we have to keep reading the code, which may feel like reading assembly in the 80s or something?
I think the pattern we've found to be pretty successful is spending a lot of compute and tokens upfront, before you as a human actually need to be involved. So one thing is manually writing tests for very critical parts of the system and then letting the agents run them a lot. The second part is building automations. Our security team built a security automation that looks for very specific invariants of the system, and they run 10 of these on certain PRs that change certain files. I think it's a bit contextual as well, but yeah: spending a lot of tokens upfront, trying to find different variants, almost red teaming.
So one thing I did is, instead of focusing on velocity and throughput, focus on quality. Sorry, what? I use AI to focus on quality, to improve the tests and make it completely AI-ready.
Yeah, I think that is very good, because if you as a human trust the tests, you can probably trust the output even without looking at the code, and that's kind of where we're going.
So, thanks for a great presentation. I find myself lacking in using guardrails, especially rules and hooks.
Partly because historically the knowledge of how to do that properly was very scattered and decentralized across the whole web. You would have these exotic GitHub repos that would try to centralize this knowledge, or some Medium articles, or maybe Cursor the company would do a blog post on this. Still, it was evolving a lot, and the capabilities of the models themselves, especially instruction following, are also evolving and getting better. It always felt like duct-taping to me. So I'm wondering, basically, can we have AI help us with that? Meaning, could Cursor, for example, give us proactive agents, or some new setup, maybe wizard-style setups, where we could identify our workflow and then have AI help build rules, guardrails, and all those rule artifacts for us? So maybe a proactive agent that would scan our workflow globally and then help us build those artifacts. What do you think about it? Do you think about this in the company? Do you work on that?
Yeah, totally. I think there are two places we can do this. One is in the product itself, with the continual learning plugin. Let's see here. Oh, I don't have it installed. We have a marketplace. So, with the continual learning plugin you look at your transcripts and extract rules and memories and all that; that's one way to do it. Then there's another world where you change the weights of the model depending on what your codebase looks like and what your engineers on a specific team are doing, and that's true continual learning, not this hacky plugin. You actually bake that into the model, so it knows what your preferences are, et cetera.
But totally, memory and rules and all that are going to become more and more important over time, because that's what's lacking. That's what sometimes prevents me from having a lot of trust in agents: I say something and they forget about it, because they're like stateless machines. So how do we capture this knowledge? I think we should put a lot more time and effort into building these systems.
If I might just follow up on that: you seem to first start a project, or dive into a project that already exists in the codebase, and then build the rules on top of that. What if we want to have rules first and then start a new codebase, a new project? How do we get good rules for that? Do you think humans should still do that, or can we automate that as well? Do we have any best workflows for that?
I think it's hard, because my view of rules is that they are the bridge between the model's behavior and the human's behavior: how do we steer the models so that they follow what I as a human want to do? And in a new project I'm not really sure what I want to do. I kind of want to outsource that to the model and see what it does. Can I run different models? Do I want to combine them, or do I want to scrap everything? So I think it's hard. The best example of a rule that I can think of internally is for Bugbot. When we do database migrations, we don't use foreign keys in the database, for performance reasons. But the models think the right way to do it is to use foreign keys, so they will always add one. When that hits GitHub and a PR is created, we have Bugbot review it, and it says, "oh, I have this rule saying we should never use foreign keys," so it flags it. So that's the gap between the human and the model: the intent we have versus what they have.
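A rule like the foreign-key one can also be checked mechanically before review. The following is a minimal sketch, assuming migrations are plain SQL text; it is not how Bugbot works, and the function name `find_foreign_keys` and the regular expression are hypothetical choices for illustration.

```python
import re

# Hypothetical pattern: explicit FOREIGN KEY constraints and inline REFERENCES clauses.
FK_PATTERN = re.compile(r"\b(FOREIGN\s+KEY|REFERENCES)\b", re.IGNORECASE)


def find_foreign_keys(migration_sql: str) -> list[tuple[int, str]]:
    """Return (line_number, line_text) for each line that declares a foreign key."""
    hits = []
    for number, line in enumerate(migration_sql.splitlines(), start=1):
        code = line.split("--", 1)[0]  # ignore trailing SQL line comments
        if FK_PATTERN.search(code):
            hits.append((number, line.strip()))
    return hits
```

A hook or CI step could call this on each changed migration and fail when the list is non-empty. A regex check is deliberately crude (it does not parse SQL), which is one reason an agent reviewer with the rule in context may still be the better fit.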
So I think rules should emerge dynamically over time. And before that you should probably just use ephemeral things like specs and plans. So, there's one over here. Oh yeah. It doesn't work. It works.
So thank you, Eric, for the talk. Evolution and trust is a big point. I'd like to know how you effectively automate GUI testing and user acceptance testing. Could you show something of your workflow?
Totally. The main way I do it is, let's see here, I have this one for example. The main way I do it is using the Cursor cloud agents with the computer use that we have. So I'm going to publish this. I don't know. That's bad. I guess we're not doing that. I have this website with seven web components, like a button, a dropdown, et cetera, and I'm generating each of these components with a different model, because I want to compare what a Composer dropdown looks like versus a GPT-5.4 dropdown, and I put them in a grid. But when I created this website there was an error: I had this "view code" button so I could see the generated code, and it was not working because the model didn't bundle the actual code. So I went to Cursor and wrote: "when clicking view code on that component, it says it couldn't load code." It's a very short description. What the agent did, you can see here: it spawned my local server and started clicking around and pressing Enter. We can see the cursor up here. And it created this Screen Studio-esque recording, where it's chopping, speeding up, zooming in, et cetera. It's taking a while, because computer use is fairly slow and consumes a lot of tokens. We can see we have this "view code" button, and now we can see it's working too. Since this is very much a side project for me, I'm not really going to look at the code.
I'm just going to see that this works, and I'm going to merge it. But you can keep prompting the model to do very specific things for you, such as following specific instructions for a login flow: you should click the button, you should log in. Most login steps are probably so much in distribution that you can just prompt the model with "go to this URL and log in," and it will understand which steps it needs to take. But then you can ask the model to enter a wrong password or a wrong email and see what the website returns. Maybe the website reports wrong credentials, and then the agent would understand, "oh, I need to put in the right credentials." Just as you would hire a QA consultant and give them instructions, you give the same instructions to the agent. So this is one way to do it. I guess the other way would be more Playwright or Puppeteer, just automating the browser, which is a bit more deterministic, since you can review it, check it in, and have other people reuse it. Does that answer the question?
My question was more about user acceptance testing: does this thing actually look right? Testing a login you can automate; you don't need an agent for that. Does the website look right? Is it consistent across all the generated pages, and so on?
Yeah. I've used cloud agents for that a lot. There was one case, I can't remember exactly, but I think I made some changes in the docs, and I just asked it to open every single instance where this word is referenced, take a screenshot, and give it back to me. Then I could look at all the different screenshots, see that everything was good, and merge the code. So, letting the agents do the navigation, the clicking around, and the testing for you.
I think it works surprisingly well. This was very much an AGI moment for me when we launched this internally sometime last year. So have you had time to try cloud agents in Cursor? You should. Which one, this one?
Well, it was the initial question: how long did it take?
Very straightforward. For this one, I did no specific setup. For our own repository, where we're running Cursor itself so we can reproduce demos like this one, it's running all the backend services for Cursor and all the frontend things, and that's a lot of different things, so the VM is quite beefy. But as long as we give the right instructions, it works really well. What we did was create an internal CLI that the agent can use; we call it the cursor dev tool: "cursor dev tool backend start," "cursor dev tool frontend start." That abstracts away everything that actually needs to run: OrbStack running ClickHouse, Postgres, and Redis, and then the frontend running Electron and Glass. They just coexist as two different processes, and the agent has access to everything, just as a human would. And you can have it authenticated if you store a snapshot where you are authenticated.
Oh, sorry, sorry. Okay, my bad. This one I don't really have; I can probably look it up. I would guess this is about $1, something like that. Probably this initial one would be $1, and for the other ones I just asked it to re-record a bunch of different things. I can look it up later, totally. I guess it depends on which model you're using too.
Okay. My question is about handoff between humans and agents whenever you are using different tools. In my current setup, I have a product owner and a functional analyst working together, and they prototype very fast.
Basically without thinking much about the backend, the architectural choices, or whatever. And then they pass control down to the delivery team, which uses Cursor and has to make that stuff actually work. Which best practices do you suggest to enforce a proper workflow between the people who basically don't know what they are doing from a technical point of view, of course, and the people who need to take something that may contain poor choices, such as "okay, use that database," or where Claude Code changed its mind and moved from Supabase to Turso or any other kind of fancy database in that environment, and then turn that into sound architectural choices, moving from Claude Code to Cursor?
I think what we're doing internally is, we have one or two PMs and they are building a lot of different prototypes. Sometimes it's in the real product itself: they're using cloud agents and just prompting them, and they get a video like this back of the changes. It's like, "oh, it kind of looks like I wanted," and then they tweak the design a bit. But the code might be really bad, or not following best practices, which, if we had a good factory, it probably would. If that's the case, we hand off a link to the cloud agent. We just copy the link and send it to the engineers: "hey, this is something that we want to build. Does this make sense? Can we do this?" And then you have a lot of intent already expressed. The other case is that the PMs have a separate repo called something like "prototypes," and it's just an HTML file, a mega HTML file, reproducing the Cursor UI or the dashboard.
Yeah, the problem is the migration.
So, just a particular use case I had: my PO and functional team built a very fancy demo using Prisma and Turso, or whatever database, storing data on Vercel Blob storage, and then my delivery team had to migrate that to SQL Server, C#, and Aspire for the backend. And the migration was really painful. Yeah. Also because when they used the agent freely, with no constraints, the agent sometimes decided to use Next.js, other times Vite, and other times something else. Putting constraints in the form of rules on that agent steered it down the right path, but the problem is that we need to write a lot of rules and keep them consistent, and it is not easy to manage all that work. So we are shifting a lot of effort from having people write code to having people write guidelines, rules, and so on, and making all the pieces talk to each other.
I see, yeah. I guess if the POs and PMs can't have access to the actual codebase, handing off an artifact is the minimum viable intent. Back in the day that used to be Figma prototypes, right? You can click around and get a feeling for it. Now you can have them at even higher fidelity: an interactive prototype using web technology without touching anything on the backend. It doesn't have to be a real working thing if it's just an internal prototype, but just enough that your engineers can understand, "oh, this is the intended behavior: if I click this, that should happen, or if I enter some text here and click send, a row should show up here." And I think all of that can be done in the frontend, kind of like a hackathon.
So you don't think we should migrate the prototype into something production-ready, but rather rewrite it?
Yeah, I think so. I think rewriting, and setting clear expectations between the engineers and the PMs: what engineers want from the product organization and what's most helpful for them. So maybe vibe coding complete SaaS products is not the most efficient thing.
Thank you for the presentation. My question: as we're building more and more agents and they become part of our time-critical processes, how do you see brownouts and blackouts as a new risk, and what's your view on how they can be mitigated and the impact reduced?
Yeah, it's a really good question. I think it comes down to what we talked about: the humans are still accountable for what's being shipped, so the humans need to build systems, observability, and monitoring around the changes being made. That still comes down to understanding which areas of the codebase are system-critical and making sure you have good observability and understanding of everything that goes on there. Maybe every line should be human-written in these critical areas, or at least always reviewed by one or two humans. And yeah, it's easy to vibe code too close to the sun. So I think it's also a culture thing, where you have to make sure that the humans are still accountable for things getting shipped. But setting up good systems to understand the changes being made, I think that's important. And tests.
Eric, thanks a lot for the talk. I'm assuming you're probably one of the people in the world with the best understanding of how to use these technologies, so this question takes a step back from the technology and is about processes and how you manage yourself in your workday. I wonder how long these tasks are, or how long you get to be away from your agents without babysitting them, and how you actually invest this time. Let's say you have five, ten, fifteen minutes: how do you make the best of it? And how many agents do you have in parallel, as mental processes, and how do you manage yourself? Thanks a lot.
Yeah, it's a great question. I think there are two levers to pull. One is the scope of the change: the larger the scope, the longer the agents are going to run, and if you want them to run for a really long time you want a very verifiable system, so they can check their own work, et cetera. The other is how much you can parallelize: how many of these agents can you spawn off? And I think the sad reality, in some sense, is that there's going to be a lot of context switching. I probably work in four different repos, or four different areas of the codebase, at the same time, whether that's a single feature that requires frontend, backend, database, testing, yada yada, or five completely different things. It could be docs, it could be side projects I'm exploring, it could be fixing a bug from a Twitter user. Usually it ranges from five to ten agents, with about five agents running asynchronously in the cloud at all times. While I'm waiting for these, I'm either scrolling Twitter (it's true; I also have the browser in Cursor now, so I can just stay in here and do it) or I have a synchronous task going, where I'm a bit back and forth. Maybe that's fixing a small thing in the codebase, or maybe that's
planning the next thing. Maybe I'm pulling from Notion and Slack and creating a spec in Cursor using a model. I love to plan synchronously and then execute the plans asynchronously. Once that is done, one of my cloud agents is probably done as well, so I can come back, review it, and keep prompting it a bit, maybe merging. Some parts I still need to test manually: maybe I need to download a copy of Glass or Cursor 3, test it manually, and say "this looks good to me, let's go ahead and merge."
Thank you. A quick question: this factory building leaves us with a scattered ecosystem of lots of Markdown files. Is there an easy way to organize these files and keep an overview of the factory you have actually built? Maintaining a factory requires an overview of the processes you want your coding agents to go through. What tools do you use, what methods do you recommend, how do you keep a mental map of the factory you have built, and how do you maintain it?
Yeah, it's a really good question. I think it's somewhat unsolved as well. One of the reasons we've rebuilt Cursor to look like this, instead of the traditional IDE, is that we are using more agents and we need a better control panel where you can see all the agents, manage them, spawn them, et cetera. So what's going to happen with Cursor 3 (this is the first stab at a multi-agent workstation) is that these are going to be nested agents: you open this one up and you have 10 agents in here, so you can still introspect them and follow the traces to see what's going on. But you're probably also going to have, somewhere here, some kind of project view where you can see aggregated status updates: here's what everyone is working on, here's the latest, here's what you as a human need to review. So I think these are product
things that we're going to build into Cursor. But to set the spec for the factory, I would probably have a folder in your codebase where you outline how certain things should work. Maybe that's just markdown files saying here are some best practices, maybe the rules, and establishing some kind of council to decide what goes into the factory and what doesn't, and what we're lacking to improve the factory. As long as it's something the agents can understand and read, which is files, that's probably what I would do: just store them in your codebase, checked in somewhere.

Thank you.

I'm just thinking about teams of the future. A year or two ago it was very reasonable to have an engineering team of several hundred people, several thousand people. What does this do to that, and to the roles on an engineering team? This is kind of akin to becoming something between a product manager and an architect. So what roles do engineers have?

Yeah, I think that's very accurate. It's hard to predict the second-, third-, fourth-order effects of this happening. It's definitely writing less code, looking at less code, spawning more agents. We're still building software for humans, mostly, so how do we know what other humans want? How do we talk to our customers? How do we market what we're building? How do we do all these things and bring them into the actual factory? Who sets the direction? What's the intent? All these things are coming from somewhere: either it's creativity from someone's head or it's actual user demand. So having someone doing that is going to be very important. Having someone aligning that between the different humans in the org, I think, is going to be important. Having people building the scaffolding for the
other agents, just building the assembly lines where the agents can actually run, I think that's also going to be important. But to what magnitude, and how many people it's going to be, I don't know. It's really hard. You can do a lot with the models right now with a very small team if you have the right setups in place, depending on the domain you're working in. I don't know, do you have any predictions?

I see issues from a labor perspective. If you're working in an incredibly agentic environment, what happens to training new grads, hiring new grads, and the future from that perspective? What happens with office politics and land grabbing? Because basically your value now lies in your ability to configure and set up your own agentic team, not in your ability to program and be productive anymore. The 10x engineer is no longer about words per minute; it's about prompting.

Yeah, yeah, token usage. Am I paid in tokens? Yeah, a leaderboard. You know, I'm going to be talking [inaudible], and then my token usage takes away from that. How do you optimize? We've got to train the model to be more political; I think that's the solution, right? We need more watercooler talk. I guess we're going to have more of that if the agents are doing our work.

Hi Eric. Hey. Great talk. I was wondering, you're probably using some kind of issue tracking tool at Cursor, like Atlassian's Jira. I was wondering if you are using agents to automatically check and read tasks directly from Jira, for example, and spawn sub-agents to perform the work, or if there is always a human who starts the work using Cursor.

So we're using Linear for issue management, and we have this first-party integration as well, so for
every ticket that gets created in Linear, we spawn a cloud agent. Where I interface with this most is feature flags: if we have a feature flag for a specific thing that's rolled out, and it's been rolled out at 100% for two weeks, the system signals, hey, this is a stale feature flag at this point, you can remove it. So we have this create an automatic issue in Linear, and since that is hooked up with Cursor, it triggers a cloud agent to remove the feature flag. So it's completely automatic once the system knows it's rolled out to everyone, and I can just look at the code and say, okay, we can merge this, the flag is no longer active. And we do this for everything. Once you post something in Slack, we either have a Linear Slack agent look at it, or we have a Cursor automation look at the message that was posted, triage it, and look for duplicates, or, if it's determined to be easy, start implementing the fix immediately. And this is an example of where a human is in the loop where they might not have to be. It could be me going on Twitter and seeing a tweet that something is broken with the plan mode button dropdown; I can copy that into Slack and have the agent perform the work. But there's probably a way where we can just source this feedback immediately without me having to scan it, triage it, and copy-paste it. So that's a bit of how we work with Linear for issue management. And since we're spawning a cloud agent for every single thing, it provides a good way for us to dogfood the product and test it out. I'm not sure I would recommend that for everyone, though, because it can be quite costly, as cloud agents are a little expensive.

Do you have something on the roadmap to run something locally? I'm just thinking of an alternative
called dev containers, and opening in that. But do you have something planned on the roadmap for that?

I think the closest thing you can do is probably just prompt the agent to run for a really long time. It's kind of the same thing as running local models. I've tried it, probably once a month, running the best open-source local model and seeing how it works in Cursor, but it's never the same experience as using GPT or Claude or Composer. And it's the same thing with running really long things locally: I've found it doesn't work that well, because if it's running for a long time, it's probably going to reuse your local database and your other local stuff, and it's going to prevent you from doing other work locally unless you create a VM on your own machine. And if you do, you could probably... wait, never mind, just ignore everything I said. We launched Cursor workers. We launched it yesterday. A Cursor worker is a way for you to run the same infrastructure and orchestration layer as we do for cloud agents, but on any machine you might have. So you can do, not right now, we can do cd dev, let me see, yeah. So we have the agent CLI, and there's now a worker, and you can call worker start. From here we have a worker running, and this worker is going to show up in here. Let's see, we can do self-hosted. Let's see here. Oh, I don't think it's hooked up yet; it's a different account I'm running it on. But essentially you can run this on any kind of machine and get access to it from Cursor cloud. So you can spawn multiple of these on your own machine, or you can run a Mac mini, or you can have a VM on any cloud platform provider.

Just to follow up on that: you're saying that we can have isolated environments locally using this command?

Yeah.

So it still calls the frontier models or
Composer models?

Yes, exactly. So this is going to leverage the Cursor harness, while it runs wherever you're spawning this daemon.

Yeah, that's interesting. Thank you.

So I built this Cursor claw thing, where I have one running on my Mac mini, and that has access to iMessage and calendar and all these other things. Yesterday we launched automations as well, so I can get a daily report or weekly report of everything that's going on on my machine that I might want to know about. On a specific instance that's running the agent daemon, you will get access to this in Slack and on the web, and in the mobile app that's coming at some point not too far out.

[inaudible]

Oh, it's going to use SwiftUI, so it's probably going to be compatible with iPad as well, probably.

Got it.

Yeah, I just want to ask quite a simple question. When you obviously have more than one developer in your company, and you're spawning hundreds and hundreds of agents to do a lot of different kinds of work, how do you ensure you don't step on each other's toes doing the same kind of work? And even how you're running internally: do you still use Scrum or agile ways of working, or are those already going out of the window?

Yeah, what are we doing? We're not really following any traditional methodologies in that sense. We do have monthly goals and things we want to get shipped. But I think since everyone has so much power at their fingertips with agents, this causes people to have extreme ownership over certain things. For the longest time there was one guy building MCP and rules and all kinds of extensibility by himself. Now we have maybe one person focusing on MCP, but they can own everything around MCP and they don't really need to interact that much with other teams. But at some point that's going to break too,
and so far in the history of Cursor we have found ways of getting around this. The agentic code owners thing was probably one place where we stepped on each other's toes, where the code owners were misconfigured. So we asked: instead of having a deterministic thing, can we just pull in the relevant people at the relevant time? Something like that is probably going to happen with other problems that surface in the future.

Thank you.

I think computer use is the one thing that's still in early access. I think we haven't shipped it as GA yet, but it's coming for sure. So this should be completely on parity with the cloud agent.

Okay, describe the profile of these people who are a mix between product managers and engineers and take this ownership.

Hmm, yeah. So I guess the archetypes we have: there's the PM. They talk a lot internally at Cursor: they talk with go-to-market, with sales, with engineers, with users. They just product manage and keep everything together in a way, and also shield engineers from various things. Then we have designers. Designers work, I would say, 50/50 in Figma and code at this point. All of them code, and all of them push to production. But it's a lot of exploratory work: what does it look like when you have 10 nested sub-agents? You can't really feel that in Figma; you've got to actually develop and prototype that. And they work with PMs. And then we have engineers, of course. But I think Cursor is very fortunate to build developer products, so developers are building a developer product, and they have good taste: they know what good and bad look like, they know what developers want and don't want. And I think because of that they can take such ownership, and they can go with
the concept and go really, really far. So the PM might be setting more of the business and the overall, overarching direction, and then the engineers and designers collaborate on what this actually looks like in code, but also how it should feel and how it should look for a developer.

Makes sense. Are there analysts in this mix as well, or is that done by the product managers?

Oh yeah, that's a good one. So we have a data team as well, data scientists and analysts, and they are also working closely with PMs, of course, understanding how users are using the product and where the bottlenecks are, but also with engineers, instrumenting the code in the right way and understanding feature flags and why certain users hit certain paths and some don't. So everyone is just working together. I think the way we structure the teams is pretty much by domain: extensibility might be one team, cloud might be one team, and cloud should still be extensible, so then they also have to work together. But we try to keep it modularized and not ship our organization that much.

Thanks.

Cool, I guess one final question, if there is one.

So from time to time I've messed up and started a cloud agent in the wrong repo or something, where it just went off on tangents; I came back to it an hour later and it was desperately trying to get access to that repo. Is there any way to catch these agents that don't provide any value, that just continue doing stuff but aren't really making progress?

Yeah, I think that's on us for sure. Over the last year we have made a lot of improvements to the cloud agents. Initially, when they worked they were extremely useful, but most of the time they didn't. Again, cloud agents came from this internal need of us just wanting to run things asynchronously, and because of that we have put a lot of effort into making our own codebase work really well in cloud agents. So sometimes we have to create new projects, jump into other projects, and talk to our customers to understand where things fall short. We've tried to have instrumentation for things like: does the agent run for X amount of hours or minutes, does it touch any files at all, is it going in circles, loop detection, these kinds of things. This is part of the observability I was talking about before. Most of that should happen on our side, but there are always going to be very specific contextual things that you, as the codebase owner, need to set up. But yeah, we're working on improving it, and if you have any examples, please come to me and I'll try to take a look.

I think the worst was when I started it on the wrong repo and it just pulled out the Slack MCP and tried to get access in 10 different ways and failed.

Yeah, yeah, we could make that better a good deal. Working on it. All right, thanks everyone for coming. I'll be around for the next two days as well, so please grab me if you want to discuss anything Cursor, or anything at all, actually.
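
The instrumentation mentioned in the last answer (how long an agent has been running, whether it has touched any files, whether it is going in circles) could be sketched roughly as follows. This is only an illustration, not Cursor's implementation: the `AgentSnapshot` fields, the `stall_reason` function, and both thresholds are hypothetical names and values assumed for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentSnapshot:
    """Hypothetical summary of one running agent, as an observer might record it."""
    minutes_running: float
    files_touched: int
    recent_actions: List[str] = field(default_factory=list)  # e.g. tool-call names


def stall_reason(
    snap: AgentSnapshot,
    max_idle_minutes: float = 30.0,  # assumed threshold, tune per codebase
    loop_window: int = 3,            # assumed: N identical actions in a row
) -> Optional[str]:
    """Return a short reason if the agent looks stuck, otherwise None."""
    # Check 1: the agent has run for a long time without touching any file.
    if snap.minutes_running >= max_idle_minutes and snap.files_touched == 0:
        return "no_file_changes"

    # Check 2: the last few actions are all the same, a crude loop signal.
    tail = snap.recent_actions[-loop_window:]
    if len(tail) == loop_window and len(set(tail)) == 1:
        return "repeating_action"

    return None
```

Under these assumptions, an agent that spends an hour retrying the same access request without editing anything, like the wrong-repo case described above, would be flagged by the first check, and a shorter run of identical tool calls would be flagged by the second. A real system would likely need richer signals than exact action-name repetition.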