
Replacing 12K LoC with a 200 LoC Skill — David Gomes, Cursor

TL;DR

  • Cursor replaced a complex, code-heavy Git worktree feature with a lightweight, Markdown-based "skill" definition, significantly reducing maintenance overhead.
  • This refactor leverages Cursor's existing agent skill and sub-agent primitives, allowing users to initiate parallel worktree operations and multi-model comparisons via simple slash commands.
  • While the new implementation offers benefits like multi-repo support and improved comparison, it introduces challenges such as agents deviating from their worktree and reduced discoverability, which are being addressed through prompt improvements, evals, and RL training.

Takeaways

  • Original Worktree Feature: Cursor initially implemented Git worktrees as a full-blown feature with extensive code for creating, managing, isolating, and cleaning up worktrees for AI agents.
  • Markdown-Based Refactor: The entire worktree feature (including Best-of-N) was re-implemented using simple Markdown-based "skills" (effectively commands), resulting in a deletion of approximately 15,000 lines of code.
  • Leveraging Existing Primitives: The new system utilizes Cursor's agent skills for defining agent instructions and sub-agents for spawning multiple AI instances, enabling parallel execution on individual worktrees.
  • Slash Commands for Access: Users interact with the feature via chat slash commands like /worktree to start an agent in an isolated worktree and /best-of-N to compare multiple models on the same task.
  • Key Advantages: Reduced maintenance burden for developers, ability for users to switch to a worktree mid-chat, support for multi-repo setups, and a superior experience for comparing and combining results from different models in Best-of-N.
  • Challenges Encountered: Agents may deviate or hallucinate and work outside their assigned isolated worktree, the process can feel slower because worktree creation is visible in chat, and the feature's discoverability is reduced.
  • Addressing Issues: Solutions include improving agent prompts through evals (automated evaluations) and RL (Reinforcement Learning) training for Cursor's Composer model, along with better system reminders for agents.
  • Future Directions: Cursor 3.0 will feature a more native worktree implementation within its new agentic UI, and the team is exploring non-Git based parallelization primitives to address current limitations (slowness, disk space, Git-only).
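
The Best-of-N flow described in the takeaways (one sub-agent per model, each in its own worktree, with the parent waiting for all of them before comparing) can be sketched roughly as below. This is only an illustration of the orchestration shape, not Cursor's implementation, which per the talk lives in a Markdown prompt. The names `best_of_n`, `comparison_table`, the `run_subagent` callback, and the `worktrees/<model>` path layout are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical callback: (model, task, worktree_path) -> summary of the sub-agent's work.
SubAgentRunner = Callable[[str, str, str], str]


def best_of_n(task: str, models: List[str], run_subagent: SubAgentRunner) -> Dict[str, str]:
    """Run one sub-agent per model in parallel, each with its own worktree path."""
    # One worktree per model so the sub-agents cannot interfere with each other.
    worktrees = {model: f"worktrees/{model}" for model in models}
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {
            model: pool.submit(run_subagent, model, task, worktrees[model])
            for model in models
        }
        # The parent waits for every sub-agent before comparing anything.
        return {model: future.result() for model, future in futures.items()}


def comparison_table(results: Dict[str, str]) -> str:
    """Render the per-model summaries as a small Markdown table for the user."""
    rows = ["| Model | Summary |", "| --- | --- |"]
    rows += [f"| {model} | {summary} |" for model, summary in results.items()]
    return "\n".join(rows)
```

In the skill itself the "runner" is the parent agent spawning real sub-agents; here it is just a function argument, so the sketch can be exercised with a stub.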

Vocabulary

  • Git worktrees: Separate, lightweight working copies of a repository that allow developers to work on multiple branches or tasks concurrently without interfering with the main checkout.
  • Agent skills: Configurable sets of instructions, often defined in Markdown, that an AI agent follows to perform specific tasks or workflows.
  • Sub-agents: Secondary AI agents that are spawned or managed by a primary agent to handle specific sub-tasks, often in parallel.
  • Best-of-N: A feature where the same task or prompt is given to N different AI models or agents, allowing users to compare their outputs and choose the best one.
  • Parent agent: The primary AI agent responsible for orchestrating tasks, potentially spawning sub-agents, and synthesizing their results.
  • Evals: (Evaluations) Automated test suites or metrics used to assess the performance, accuracy, and behavior of AI models or the effectiveness of prompts.
  • RL: (Reinforcement Learning) A machine learning approach where an agent learns to make optimal decisions through trial and error by interacting with an environment and receiving rewards or penalties.
  • Prompting: The process of crafting specific inputs or instructions (prompts) for a generative AI model to guide its output toward a desired response or action.
  • Hallucinate: (Rendered as "elucinate" in the raw transcription) When an AI model generates information that is factually incorrect, nonsensical, or not grounded in its training data or context.
  • Primary checkout: The main working directory of a Git repository that a developer typically works from, distinguished from isolated Git worktrees.
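
As a concrete illustration of the worktree and primary-checkout terms above, here is a minimal sketch of the Git commands involved in creating, listing, and removing a worktree. The `git worktree add -b`, `git worktree list --porcelain`, and `git worktree remove` subcommands are standard Git; the helper names, the branch name, and the paths in the usage note are hypothetical and are not taken from Cursor's skill.

```python
import subprocess
from typing import List


def worktree_add_cmd(repo: str, path: str, branch: str, base: str = "HEAD") -> List[str]:
    """Command that creates a new branch and checks it out in a separate worktree."""
    return ["git", "-C", repo, "worktree", "add", "-b", branch, path, base]


def worktree_list_cmd(repo: str) -> List[str]:
    """Command that lists all worktrees of the repo in a script-friendly format."""
    return ["git", "-C", repo, "worktree", "list", "--porcelain"]


def worktree_remove_cmd(repo: str, path: str) -> List[str]:
    """Command that deletes a worktree that is no longer needed."""
    return ["git", "-C", repo, "worktree", "remove", path]


def run(cmd: List[str]) -> None:
    """Execute one of the commands above, raising if Git reports an error."""
    subprocess.run(cmd, check=True)
```

For example, `run(worktree_add_cmd("/path/to/repo", "/path/to/repo-fix-footer", "agent/fix-footer"))` would leave the primary checkout untouched and give an agent a separate directory to work in.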

Transcript

Hi everyone, how are you all doing? Thank you for coming today. I'm going to be talking about how Markdown is basically the new code. As Tejia has already sort of previewed, we recently replaced a lot of code in the Cursor application with just Markdown, just a skill. And in today's talk, I'm going to share a bit of the journey of going from a full-blown feature with a lot of code, a lot of dependencies, a lot of complexity and tests, to a much more lightweight, streamlined version of effectively the same feature, but with just a single skill.

Before I start, though, I have to give you a little recap of Git worktrees and how they work in Cursor. If you haven't heard of worktrees in Git, and I'm sorry for the widescreen, they're effectively separate checkouts of your repos that allow you to work in parallel. So different agents can be working on the same task, or on different tasks, at the same time without interfering with each other. If you've never used this feature before in Cursor, the way it works is that you can spin up an agent on an individual worktree, and you will see, for example, the same file in two different worktrees, and you can see that they look different because the agent is doing some work on the worktree, but not on your primary checkout. And any time the agent runs commands or lints, anything it does will be isolated and scoped to that Git worktree.

With this feature, you can also work in parallel, at the same time, on the screen. You can have these grids of agents working for you. And if you say, hey, open a PR, the agent will open a pull request from that worktree with the changes that it produced inside that worktree. And one of the coolest things about this feature is that it allows you to give the same task to different models at the same time and then compare what different models do on the same prompt.
So if you haven't heard of this, we call it Best-of-N, and it's effectively a way for you to have different models compete on the same task. And then you can even preview the changes if it's a front-end project you're working on. You can compare all the different visual implementations and then choose the one you prefer.

Now, if you have never heard about any of what I'm talking about today, I will also just say that it all came out around October of last year, alongside Cursor 2.0. And when we initially shipped it, it came with a lot of complexity. We had to write all the code for creating worktrees, managing these worktrees, and feeding them into the agent as context. We also had to make sure that the agents were scoped and isolated and could not escape the worktree they were working on. We also have something called setup scripts, which users can configure and have Cursor run any time an agent starts operating on a given worktree. We also have the judging. I didn't show you this before, but there's a little thumbs-up icon on one of the models. That's just a judge that we run that tells you which implementation looks the best based on different criteria. And then we also had to make some changes to the harness and introduce some system reminders to help the agents stay on track in these worktrees. And then finally, there's some cleanup complexity as well, because people like to spin up hundreds of these worktrees and then their disk usage blows up, and we have to help them by cleaning up the worktrees that stayed behind.

Now, in our new implementation, the one that I'm going to be talking about today, we were able to get rid of most of these things. In fact, I recently opened the PR removing this entire feature from Cursor, and it was a massive deletion of code. I think it was around 15,000 lines of code deleted.
The new implementation of the feature is almost as good as the previous one, and it is much more lightweight for us to maintain. And it even has some benefits compared to the previous implementation that I'll be talking about today.

So how were we able to replace an entire feature with a skill? We decided that there were two primitives we could leverage to let Cursor users use worktrees. One is agent skills and the other is sub-agents. Both of these are existing Cursor features. You can learn more about them in our docs. We have a page for skills and we have a page for sub-agents. We realized that if we took these two things together, we could basically re-implement both the Cursor worktrees feature and the Cursor Best-of-N feature with just Markdown.

And this is a little video of how it works. As a user, I can now say /worktree and then give it some task. I'll say, fix the typo in the footer of the website. And this agent will run in an isolated worktree and do its work there. The way the skill is written is actually really simple. I can show you most of it. It doesn't fit on the screen, but it's basically a set of instructions telling the model how to create worktrees, to run the setup scripts that the user might have configured, and then to stay on that checkout. We want to make sure that when the agent is operating on a worktree, it stays in that checkout.

The Best-of-N skill is very similar. It's actually even smaller. The entire skill fits on the screen here with a small font. What we're doing here is instructing the parent agent to go and create a sub-agent for each model, and then have each sub-agent create its own worktree and work inside that worktree. And then we also tell it to wait for all the sub-agents and, when they're done, please provide some commentary.
Please let the user know what the different implementations by the different sub-agents look like. Maybe you can grade them, maybe you can critique them, and maybe you can help the user choose which one is the best. And please give that to the user in some nice table format or something. But again, it's only around 40 lines of code, and it's all Markdown. It's not even code. And the previous version of this was maybe 4,000 lines of code.

Some of the considerations we have to have in this skill: the skill must be cross-platform compatible. We have Windows-specific instructions, and we have Linux and macOS instructions as well. We also instruct the parent model to run the setup scripts that the user might have configured for each worktree. And then this is the hardest part. We'll spend a bit of time on this in the talk today. We have to instruct the model to stay on that worktree, right? We have to really say, hey, do not ever work outside this and do not ever escape. And we do that with some aggressive prompting, effectively.

So the new commands are /worktree and /best-of-N, to start agents in isolated worktrees and to start multiple agents on the same task. And then we also have apply worktree and delete worktree: apply worktree brings over changes from the side worktree into your primary checkout, and delete worktree does just what you would expect.

A little note is that these are not actually skills in Cursor. They're actually commands, but the way these commands work in Cursor is extremely similar to how skills work, in that their prompts only get loaded into the context if the user chooses to load them. And the only reason we did them as commands and not as skills is so that the prompts for them can be controlled on our servers, in our backend. This means I can iterate on these prompts without you having to update your Cursor version.
If I make some improvements to these prompts, the next time you use them, you're going to get the latest version of the prompts. But effectively they work like skills.

This is a demo of the Best-of-N skill, or command, where I'm giving the same task to Kimi, Grok, Composer, GPT and Opus. What you will see is that the parent agent starts by spinning up five sub-agents on the five different models that I specified, and each one is going to have its own worktree and its own context. And then Opus takes a little longer, as expected. And then at the end, the parent model, as instructed, will do that comparison across all the different sub-agents. It'll say these two models did basically the same thing. This one did something that none of the others did. And you can even talk to the parent agent and say, oh, I like this part that Opus did, and I like this part that GPT did. Can you mash them together? And the parent agent will do that for you.

So let's talk about some of the pros of the new implementation. And then I'll talk about some of the cons, some of the things we lost with this refactor. The main pro of re-implementing this entire feature as a skill is that I have a lot less code to maintain. Selfishly, I'm going to be spending a lot less time maintaining this feature. And this is an advanced feature. We're not talking about a feature that is used by 90% of Cursor users. Far from it. Worktrees are kind of an advanced thing. Only the Cursor power users who love parallelizing and having these grids of agents are using worktrees. So it's not the kind of feature where we want to be spending a lot of time on maintenance.

Another advantage is that our users can now switch into a worktree halfway through a chat. That was not possible before. We didn't want to pollute the prompt UI too much with all these dropdowns and settings.
And so now that it's just a slash command, it's much easier for users to switch to a worktree halfway through a chat. They can start talking about something, and then if they decide they want to work on the side, they can do that with /worktree.

Another big advantage is that the previous implementation did not work if you were working on multiple repos at the same time. It's very common to have a multi-repo setup where maybe your front end and your backend are separate repos. In the past, you could not do worktrees in this kind of setup. It was just disabled. With the new /worktree command, everything works fine. The agent will make sure to create a worktree on each repo. And then if you open a PR, it'll open two PRs, one for each repo. It works quite well.

Another advantage of the new skill implementation is that the judging experience at the end, knowing which model did what for Best-of-N, is far superior. The parent now has a lot more context over what each of the sub-agents did, and the user can even ask the agent to stitch together different pieces and bits from the different implementations, which was not possible in the previous implementation. You had to choose one sub-agent or one model and just stick with that.

Now let's talk about some of the cons. We have a forums link here where we're actually getting some mixed feedback on the new implementation. Some people were really accustomed to the old way the feature used to work. And if you're curious, you can go and see that not everyone is happy with the change, at least for now, but we're tracking it.

What are the problems? Number one, it's very hard for the agent to stay on track. With our previous approach, the agent had to stay on track. We didn't let the model ever touch any files outside its worktree. It was physically impossible for it to do so. Now we're trusting the model.
So you could say it's a bit vibes-based, because we're basically saying, hey, operate on this directory and then, knock on wood, please don't forget about this. And especially over long sessions, it's quite possible that the model will forget where it should be operating. And sometimes these models, especially the worse models, will kind of hallucinate, or they'll go a bit haywire and start doing things they shouldn't. But we're working on this.

Another con is that it feels slower, because you're seeing the agent create the worktree and you're seeing that in your chat. It's not actually slower, but it does feel like the agent is kind of wasting time doing something that should be done for it in advance. We're also looking at some improvements here.

And then finally, it's much harder to find the feature now. Before, whenever you opened Cursor, you had this dropdown that would ask: do you want to run this task locally, do you want to run it in the cloud, or do you want to run it in a worktree? Now that entire dropdown is gone. So if you want to use worktrees, you have to know the feature exists so you can actually type /worktree. So the discoverability is a bit worse. But as I mentioned before, this is an advanced power-user feature, which we're okay with being less discoverable in general.

So how can we make the skill better? As I mentioned, the biggest problem right now is that the agent is not always staying on track. There are two ways that we're going to improve this. One is with evals, and then using those evals to improve the prompts, and the other is through RL and training. At Cursor, we train our own model called Composer. And for Composer 2, the latest version of this model, we didn't have any RL tasks with these prompts. We didn't have any tasks, in all of the many, many thousands of tasks that we used for RL, actually operating in this type of environment.
So we're working on adding a bunch of these tasks into our RL pipeline so that by the time we launch Composer 3 or 4 or 5, at least our own model will be much better at this. Obviously, we cannot improve the models that other companies develop, but we've been sharing feedback with all the other labs and model providers on this kind of thing.

And for evals, I've been working on some evals for this feature, and it wasn't my first time, but I'm fairly early in my eval-writing journey. I was actually very surprised that if you use something like Braintrust, and shout out to Braintrust, they've been super helpful, writing these kinds of evals is actually super, super easy. You don't have to know almost anything about evals, and you can just prompt the agent and it'll do everything for you. Effectively, what I'm doing is I spin up the Cursor CLI. It's headless, so it's great for evals. And then I have two scorers: one that checks whether the model did any work in its worktree, as expected, and another which is the reverse of that, which is, did the model do any work in the primary checkout, where it shouldn't be doing any work?

So far the evals I've written are pretty simple, so I haven't been able to simulate extremely long sessions, which is when the models start performing worse. But even so, I've already understood that not all models are equally good at this. For example, Haiku, which is a smaller, less intelligent model, will very often deviate and start working in the primary checkout. But the other models that I've been testing, such as Composer and Grok, are doing much better. I still have to improve the evals a lot more to make them more complicated. But the hope is that as soon as I can start to find patterns here, I can actually go and improve the prompts.
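
The two scorers described above can be sketched as plain functions. This is a hedged illustration only: it does not use the Braintrust API or the Cursor CLI, and it assumes the eval harness has already collected the list of absolute, POSIX-style file paths the agent changed during a run (for example from `git status` in each checkout). The function names and the 0.0/1.0 scoring convention are assumptions made for the sketch.

```python
from pathlib import PurePosixPath
from typing import List


def _is_inside(path: str, root: str) -> bool:
    """True if `path` is `root` itself or lies somewhere below it."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    return p == r or r in p.parents


def worked_in_worktree(changed_paths: List[str], worktree: str) -> float:
    """Scorer 1: did the agent change at least one file inside its worktree?"""
    return 1.0 if any(_is_inside(p, worktree) for p in changed_paths) else 0.0


def stayed_out_of_primary(changed_paths: List[str], primary: str, worktree: str) -> float:
    """Scorer 2: did the agent avoid changing files in the primary checkout?"""
    # A worktree may be nested under the primary checkout, so files inside the
    # worktree are never counted as leaks.
    leaked = [
        p for p in changed_paths
        if _is_inside(p, primary) and not _is_inside(p, worktree)
    ]
    return 0.0 if leaked else 1.0
```

A run that edits only files under its worktree scores 1.0 on both; a run that also edits a file in the primary checkout scores 0.0 on the second scorer, which is the deviation the talk describes.
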
And then another thing we can do is have better system reminders to the models, instructing them to stay on track and to not deviate from the worktree that they are supposed to be working in.

Okay, so what's next? The first thing is we're actually going to take a small step back here, and we're going to have a much more complete and native worktree implementation in the new Cursor agent window. If you've been following, we recently announced Cursor 3.0. Part of 3.0 is a more agentic interface for coding, where you can still edit code and still see code, but the UI and the UX are much more optimized around the agent and the chat interface. We believe this kind of interface is the right place for a proper worktree implementation. The kind of person who is more likely to be doing a bunch of local parallelization is usually the same type of person who is more likely to use this type of UI. So we're taking a small step back there and building a proper worktrees implementation that is more native, not so much agentic, in the new UI.

Also, we're improving the skills, as I mentioned, through this continued work on evals and then RL and other training work. And then finally, we are actually looking into other parallelization primitives that are not Git worktrees. If you've used Git worktrees, you might know that they can be a bit slow to create, and they also use a lot of disk space on your computer. And then finally, they only work in Git repos. So if you're using something other than Git, there's really no local parallelization primitive in Cursor. In the near future, we hope to share more about this, but we're looking into some other solutions for local parallelization that don't involve Git and don't involve Git worktrees. So, yeah, stay tuned for that.

Thank you all for coming to the talk today. I'm sure many of you have questions, and I'm going to be around all day. Feel free to grab me anytime, and I'm happy to chat with anyone. Thank you.
