Automating Large Scale Refactors with Parallel Agents - Robert Brennan, OpenHands

Thank you all for joining for Automating Massive Refactors with Parallel Agents. I'm super excited to talk to you today about what we're doing with OpenHands to automate large-scale chunks of software engineering work: lots of toil related to tech debt, code maintenance, and code modernization. These are tasks that are super automatable. You can throw agents at them, but they tend to be way too big for a single agent to one-shot, so they involve a lot of what we call agent orchestration. We're going to talk a little bit about how we do that with OpenHands, and also more generically.
A little bit about me: my name is Robert Brennan, and I'm the co-founder and CEO of OpenHands. My background is in dev tooling. I've been working in open source dev tools for over a decade now, and I've been working in natural language processing for about the same amount of time. It's been really exciting over the last few years to see those two fields suddenly converge as LLMs got really good at writing code, and I'm super excited to be working in this space.
OpenHands is an MIT-licensed coding agent. It started about a year and a half ago, when Cognition launched their demo video of Devin, a fully autonomous software engineering agent. My co-founder and I saw that and got super excited about what was possible and what the future of software engineering work might look like, but we realized that shouldn't happen in a black box. If our job is going to change, we want that change to be driven by the software development community. We want to have a say in that change. So we started OpenHands as a way to give the community a say in driving what the future of software engineering might look like in an AI-powered world.
Hopefully it's not controversial to say that software development is changing. I know my own work has changed a great deal in the last year and a half.
I would say that now pretty much every line of code I write goes through an agent. Rather than opening my IDE and typing out the code, I'm asking an agent to do the work for me. I'm still doing a lot of critical thinking. A lot of the mentality of the job hasn't changed, but what the actual work looks like has changed quite a bit.
What I want to convince you of is that it's still changing. We're still just in the first innings of this change. We still haven't realized all the impact that large language models have already brought to the job and are going to continue to bring as they improve. I would say that even if you froze large language models today and they didn't get any better, you would still see the job of software engineering change very drastically over the next two to three years as we figure out ways to operationalize the technology. There are still a lot of psychological and organizational hurdles to adopting large language models within software engineering, and we're seeing a lot of those hurdles disappear as time goes on.
A brief history of how we got here. Everything started, I would say, with what I call context-unaware code snippets. Some of the first large language models turned out to be very good at writing chunks of code, especially things they had seen over and over again. You could ask for bubble sort. You could ask for small algorithms, how to access a SQL database, things like that. The model was able to generate a little bit of code, and you could see it understood the logic a bit. But this was totally context-unaware. It was just dropping code into a chat window that you'd asked for. It had no idea what project you were working on or what the context was.
Shortly thereafter, we got context-aware code generation. GitHub Copilot as autocomplete was probably the best example here. These tools lived within your IDE, so they could see where you were typing and what code you were working on.
It could generate code that was specific to your codebase, that referenced local variable names, that referenced table names in your database. That was a huge, huge improvement for productivity. No more copy-pasting back and forth between the ChatGPT window and your IDE. Now, all of a sudden, you can see the little robot has eyes: it can see inside your codebase and generate relevant code for your project.
And then I think the giant leap happened in early 2024 with the launch of Devin, and then the next day OpenDevin, now OpenHands. This is where we first started to see autonomous coding agents. This is when an AI started not just writing code, but running that code. It could Google an error message that came out, find a Stack Overflow answer, and apply that to the code. It could add debug statements to the code, run it, and see what happens. It could basically automate the entire inner loop of development. This was a huge step-function forward. You can see the little robot gets arms in this picture. This is a huge jump. It means you're now able to just write a couple of sentences of English, give them to an agent, and let it churn through the task until it has something that's actually working, running, and tested.
And now what we're seeing is parallel agents, what we're calling agent orchestration. Folks are figuring out how to get multiple agents working in parallel, sometimes talking to each other, sometimes spinning up new agents under the hood: agents creating agents. This is, I would say, the leading edge of what's possible. People are just starting to experiment with this, and they're just starting to see success with it at scale. But there are some really good tasks that are very amenable to this sort of workflow, and it has the potential to clear out a huge amount of the tech debt that sits under pretty much every software company.
A little bit about the market landscape here.
Again, you can see that same evolution from left to right. We started with IDE plugins, like Copilot inside our existing IDEs. Then we got AI-powered IDEs, IDEs with AI tacked onto them. I would say your median developer is adopting local agents now, maybe running Claude Code locally for one or two things, maybe some ad hoc tasks.
Your early adopters, though, are starting to look at cloud-based agents, agents that get their own sandbox running in the cloud. This allows those early adopters to run as many agents as they want in parallel. It also lets them run those agents much more safely than if they were running on a local laptop. If an agent is running on your local laptop, there's nothing stopping it from doing rm -rf /, trying to delete everything in your home directory, installing some weird software, whatever it might do. Whereas if it has its own containerized environment somewhere in the cloud, you can run a little more safely, knowing that the worst it can do is ruin its own environment. You don't have to sit there babysitting it, hitting the Y key every time it wants to run a command. So those cloud-based environments are more scalable and a bit more secure.
And then I would say at the far right here, we're really just seeing the top 1% of early adopters start to experiment with orchestration. This is the idea that you don't only have these agents running in the cloud, but you have them talking to each other. You're coordinating those agents on a larger task. Maybe those agents are spinning up sub-agents within the cloud that have their own sandbox environments. There's some really cool stuff happening there.
With OpenHands, we generally started with cloud agents. We've moved back a little bit: we now have a local CLI, similar to Claude Code, in order to meet developers where they are today. These types of experiences are much more comfortable for developers. We've been using autocomplete for decades.
It just got a million times better with GitHub Copilot. I would say the experiences on the right side are very foreign to developers. It feels very strange to hand off a task to an agent, or a fleet of agents, and let them do the work for you. For me at least, it feels like the jump I made when I went from being an IC to being a manager: that's what it feels like to go from writing code myself to giving that code to agents. It's a very, very different way of working, and I think it's one that developers have been very slow to adopt. But the top one percent or so of engineers that we've seen adopt the stuff on the right side of this landscape have been able to get massive lifts in productivity and tackle huge backlogs of tech debt that other teams just weren't getting to.
Some examples of where you would want to use orchestration rather than a single agent. Typically these are tasks that are very repeatable and very automatable. Some examples are basic code maintenance tasks. In every codebase, there's a certain amount of work you have to do just to keep the lights on: keeping dependencies up to date and making sure that any vulnerabilities get resolved.
We have one client, for instance, that is using OpenHands to remediate CVEs throughout their entire codebase. They have tens of thousands of developers and thousands of repositories. Every time a new vulnerability gets announced in an open source project, they have to go through their entire codebase, figure out which of the repos are vulnerable, and submit a pull request to each of those codebases to actually resolve the CVE: update whatever dependency is affected and fix breaking API changes. They've seen a 30x improvement in time to resolution for these CVEs by doing orchestration at scale. They basically have a setup now that runs every time a new CVE gets announced.
They kick off an OpenHands session to scan a repo for that vulnerability, make any code changes that are necessary, and open a pull request. All the downstream team has to do is click merge and validate the changes. You can also do this for things like automating documentation and release notes.
There are also a bunch of modernization challenges that companies face. For instance, you might want to add type annotations to your Python codebase if you're working with Python 3. You might want to split your Java monolith into microservices. These are the sorts of tasks that are still going to take a lot of thought from an engineer. You can't just one-shot it with Claude Code and say, "refactor my monolith into microservices." But it is still very rote work, a lot of copying and pasting code around. So if you're thoughtful about orchestrating agents together, they can do this.
Then there are a lot of migrations, like migrating from old versions of Java to new versions of Java. We're working with one client to migrate a whole bunch of Spark 2 jobs to Spark 3. We've used OpenHands to migrate our entire front end from Redux to Zustand. So you can do these very large migrations. Again, it's very rote work that still takes a lot of thinking about how you're going to orchestrate the agents.
And then a lot of tech debt work: detecting unused code and getting rid of it. We have one client who is using our SDK to scan their Datadog logs and, every time there's a new error pattern, go into the codebase and add error handling or fix whatever problem is cropping up. So there are a lot of things that are a little too big for a single agent to one-shot, but they're super automatable. Agents are good at these tasks as long as you're thoughtful about orchestrating them.
A bit about why these aren't one-shottable tasks. Some of it is technological problems, and some of it is more like human psychological problems.
On the technology side, you have a limited amount of context that you can give to the agent. Extremely long-running tasks, or tasks that span a very large codebase, usually run out of context, and you have to compact that context, possibly to the point that the agent gets lost. We've all seen the laziness problem. I've tried to launch some of these sorts of tasks, and the agent comes back to say, "Okay, I migrated three of your hundred services. You'll need to hire a team of six people to do the rest." Agents often lack domain knowledge within your codebase; they don't have the same intuition that you do for the problem. And errors compound when you go on these really long trajectories with the agent. A tiny error in the beginning is going to compound over time: the agent is going to repeat that error over and over again for every single step it takes in its task.
And then on the human side, we have this intuition for the problem that we can't always convey. Say you want to break your monolith into microservices. You probably have a mental model of how that's going to work. If you just tell the agent "break the monolith into microservices," it's going to take a shot in the dark based on patterns it has seen in the past, without any real understanding of your codebase. We also have some difficulty decomposing tasks for agents and understanding what an agent can actually get done in one shot.
You also need intermediate review, intermediate check-ins from the human as the agent is doing its work. We'll talk a little bit about what that looks like later. Again, this is not something you can just tell an agent to do and expect the final result. You have to approve things as the agent goes along. And then there's not having a clear definition of done. If you don't really know what finished looks like for this project, it's hard to tell the agent.
With all these types of orchestration tasks, I want to be super clear that we don't expect every developer to be doing agent orchestration. We think most developers are going to use a single agent locally for the sort of ad hoc tasks that are common for engineers: building new features, fixing a bug, things like that. Running an agent locally, in a familiar environment alongside an IDE, is probably going to be the common workflow for at least the next couple of years.
What we're seeing is that a small percentage of engineers who are early adopters of agents, and who are really excited about agents, are finding ways to orchestrate agents to tackle huge mountains of tech debt at scale. They get a much bigger lift in productivity for that smaller, select set of tasks. You're not going to see a 3,000% lift in productivity for all of software engineering; you're probably going to get more of that 20% lift that everybody's been reporting. But for some select tasks, like CVE remediation or codebase modernization, you can get a massive, massive lift and do years' worth of work in a couple of weeks.
I want to talk a little bit about what these workflows look like in practice. This loop probably looks pretty familiar if you're used to working with local agents. It's a very typical loop that looks like the inner loop of development for non-AI coding as well. Basically, you give the agent a prompt. It does some work in the background. Maybe you babysit it and watch everything it's doing, and hit the Y key whenever it wants to run a command. The agent finishes, you look at the output, you see whether the tests are passing and whether this actually satisfies what you asked for. Then maybe you prompt it again to get a little closer to the answer, or maybe you're satisfied with the result, and you commit the results and push.
For bigger orchestrated tasks, this becomes a little bit more complicated.
Basically, what you need to do is this. You, maybe hand in hand with an agent, want to decompose your task into a series of tasks that can be executed individually by agents. Then you'll send off an agent for each one of those individual tasks, and you'll run one of those agent loops for each of them. And then finally, at the end, you, maybe with the help of an agent, are going to need to collate the output from all those individual agents into a single change and merge that into your codebase.
Very importantly, there's still a lot of human in the loop here. You need to review not just the final output of the collated result, but the intermediate outputs from each agent. The goal is not to automate this process 100%; it's something like 90% automation. That's still an order-of-magnitude productivity lift. This is really tricky to get right. This is where a lot of thought comes into the process: how do I break down the task so that I can verify each individual step, and so that I can actually automate this whole process without just ending up with a giant mess of code?
This is the typical git workflow that I like to use for tasks like this. Typically, we will start a new branch on our repository. We'll add some high-level context to that branch using an AGENTS.md or, in OpenHands, what we call a microagent. It's just some markdown that explains what we're doing here, so the agent knows, okay, we're migrating from Redux to Zustand, or we're going to migrate these Spark 2 jobs to Spark 3. You might want to put some kind of scaffolding in place. We'll talk a little bit more about examples of scaffolding later. Then you're going to create a bunch of agents based on that first branch.
The idea is that the agents are going to be submitting their work into that branch, so it's basically going to accumulate our work as we go along. Then eventually, at the end, we can rip out our scaffolding and merge that branch into main.
If you're just getting started with this, I would suggest limiting yourself to about three to five concurrent agents. Any more than that and your brain starts to break. But for folks that have really adopted orchestration at scale, we see them running hundreds or even thousands of agents concurrently. Usually one human is not in the loop reviewing every single one; maybe those agents are sending out pull requests to individual teams, things like that. So you can scale up very aggressively once you start to get a feel for how this works and you feel like you have a very good way of getting that human in the loop.
I'm going to kick it off to my co-worker, Howard, here. He's going to talk about a very large-scale migration, basically eliminating code smells from the OpenHands codebase, that he did using our refactor SDK.
OpenHands excels at solving tightly scoped tasks. Give it a focused problem, something like "fix my failing CI" or "debug this endpoint," and it delivers. But like all agents, it can stumble when the scope grows too large. Let's say I want to refactor an entire codebase, maybe to enforce strict type checking, update a core dependency, or even migrate from one framework to another. These are not small tasks. They're sprawling, interconnected changes that can touch hundreds of files. To tackle problems at this scale, we're using the OpenHands agent SDK to build tools designed specifically to orchestrate collaboration between humans and multiple agents.
As an example, let's work to eliminate code smells in the OpenHands codebase. Here's the repository structure. Just the core agent definition has 380 files spanning 60,000 lines of code.
That tells us a lot about the volume of the code, but not much at all about the structure. So let's use our new tools to visualize the dependency graph of this chunk of the repository. Here each node represents a file. The edges show dependencies: who imports whom. And as we keep zooming out, it becomes clear that this tangled web is why refactoring at scale is hard.
To make this manageable, we need to break the graph into human-sized chunks: PR-sized batches that an agent can handle and a human can understand. There are many ways to batch, based on what's important to you. Graph-theoretic algorithms give strong guarantees about the structure of the edges between the induced batches. But for our purposes, we can simply use the existing directory structure to make sure that semantically related files appear inside the same batch.
Navigating back to the dependency graph, we can see that the colors of the nodes are no longer randomly distributed. Instead, they correspond to the batch each file is associated with. Zooming out and zooming back in, we can usually find a cluster of adjacent nodes that are all the same color, which indicates that an agent is going to handle all of those files simultaneously.
Of course, this graph is still large and pretty tangled. To get a simpler view, we'll build a new graph where the nodes are batches and the edges between those nodes are dependencies inherited from the files they contain. This view is much simpler: we can see the entire structure at a glance, and it is something we can work through batch by batch.
Using this graph, we can identify batches that have no dependencies and inspect the files they contain. This batch, for example, has just one file. It's an init file, so it's probably empty. Let's check. This tool is intended for human-AI collaboration, so once we know this file is an empty init, we can determine that it's a good place to start.
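
As a rough illustration of the batching idea described above, here is a minimal Python sketch, not the tool shown in the demo. It groups files into batches by parent directory, lifts file-level imports to batch-level edges, and lists the batches whose dependencies are already handled. The file names, the `FILE_DEPS` data, and all function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def batch_by_directory(files):
    """Group file paths into batches, one batch per parent directory."""
    batches = defaultdict(list)
    for path in sorted(files):
        batches[str(PurePosixPath(path).parent)].append(path)
    return dict(batches)


def build_batch_graph(file_deps, batches):
    """Lift file-level imports to batch-level dependencies.

    file_deps maps each file to the files it imports. A batch depends on
    another batch when any of its files imports a file in that other batch.
    """
    owner = {f: name for name, members in batches.items() for f in members}
    graph = {name: set() for name in batches}
    for src, imported in file_deps.items():
        for dst in imported:
            if src in owner and dst in owner and owner[src] != owner[dst]:
                graph[owner[src]].add(owner[dst])
    return graph


def ready_batches(graph, done=frozenset()):
    """Batches not yet done whose dependencies have all been completed."""
    finished = set(done)
    return sorted(b for b, deps in graph.items()
                  if b not in finished and deps <= finished)


# Hypothetical file-level import graph (file -> files it imports).
FILE_DEPS = {
    "agent/core/loop.py": ["agent/utils/log.py", "agent/tools/bash.py"],
    "agent/tools/bash.py": ["agent/utils/log.py"],
    "agent/utils/log.py": [],
    "agent/utils/__init__.py": [],
}

batches = batch_by_directory(FILE_DEPS)
graph = build_batch_graph(FILE_DEPS, batches)
print(ready_batches(graph))                        # batches with no dependencies
print(ready_batches(graph, done={"agent/utils"}))  # what unblocks next
```

Batching by directory is the simplest choice and gives no guarantee that the batch graph is acyclic; a real tool would need to detect and handle cycles between directories.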
We can look inside this batch and decide whether we want to edit it ourselves or simply review its contents. Of course, when refactoring code, it's important to consider the complexity of what you're moving. This batch is trivial. Let's find one that's a little more complex. Here's a batch that has four files with considerably more code, and the complexity is correspondingly higher. These measures are a useful signal to a human that we should be more careful about how we test and review the batch.
So how are we actually going to deal with the code? It's a two-step process. Before we can fix the code by removing the code smells, we need to identify what's wrong in the first place. Enter the verifier. There are several different ways to define the verifier, based on what you care about. You can make it programmatic, so that it calls a bash command. This is useful if your verification is checking unit tests, or running a linter or a type checker. Instead, since we're looking for code smells, I'm going to be using a language model that looks at the code and tries to identify any problems it has, based on a set of rules that I've provided.
Now let's go back to our first batch and actually run this verifier. Remember, this batch is trivial, and fortunately the verifier recognizes it as such. It comes back with a nice report indicating which code smells it found, and the status is set to completed: green, good. This change in status is also reflected in the batch graph. Navigating back to the batch graph, we can see that we have exactly one node marked completed, and the rest are still yet to be handled. This already gives a good sense of the work we've done and how it fits into the bigger picture. Of course, confirming that there are no code smells in a trivial file is very boring.
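
The programmatic flavor of verifier mentioned above (run a command, then turn the result into a status for the batch) might look roughly like the following. This is a sketch under assumptions, not the refactor SDK's API: the `VerificationReport` type, the `verify_batch` function, and the status strings are hypothetical, and the commands are stand-ins so that the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class VerificationReport:
    batch: str
    status: str  # "completed" when the command succeeds, otherwise "failed"
    output: str


def verify_batch(batch, command, timeout=300):
    """Run a command for one batch; exit code 0 means the batch passes."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    status = "completed" if result.returncode == 0 else "failed"
    return VerificationReport(batch, status, (result.stdout + result.stderr).strip())


# Stand-in commands (assumptions). In practice this could be a test runner,
# a linter, or a type checker invoked on the files in the batch.
passing = [sys.executable, "-c", "print('no issues found')"]
failing = [sys.executable, "-c", "import sys; print('2 issues found'); sys.exit(1)"]

print(verify_batch("agent/utils", passing))
print(verify_batch("agent/core", failing))
```

A language-model verifier would keep the same shape: it returns a report and a status for the batch, but produces them by reviewing the code against a set of rules rather than by checking an exit code.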
We have to ensure that there are no code smells across the entire batch graph. So let's go back to our batches and continue verifying until we run across a failure. We'll keep going in dependency order, making sure that we pick nodes that don't have any dependencies on batches we have yet to analyze. The next batch is just as simple as the first, but because the init file is a little more involved, the report that's generated is a little more verbose.
Continuing down the list, we come across the batch we identified earlier, with some chunky files and relatively high code complexity. And this batch happens to be our first failure: the status turns red instead of green. This batch has more files than we've seen in the past, so the verification report is proportionally longer. Looking through it, it lists, file by file, the code smells that were found. I see one file is particularly egregious, with five issues. We'll have to come back to that.
If we zoom all the way back out to the batch graph and look at the status indicators, we'll see two green nodes representing the batches the verifier successfully verified. We'll also see the red node representing the batch that just failed verification. Our stated goal is to turn this entire graph green, so this red node presents a bit of an issue. To turn the red node into a green node, we need to address the problems the verifier found, using the next step of the pipeline: the fixer.
Just like the verifier, the fixer can be defined in a number of different ways. A programmatic fixer can run a bash command, or you can feed the entire batch to a language model and hope it addresses the issues in a single step.
But by far the most powerful fixer we have uses the OpenHands agent SDK to spin up an independent agent that has access to all sorts of tools: it can run tests, examine the code, look at documentation on the internet, and do whatever it needs to address these issues. So let's go back to our failing batch, run the fixer, and see what happens.
This part of the demo is sped up considerably. But because we're exploring these batches in dependency order, while we're waiting we can continue to go down the list, running our verifiers and spinning up new instances of the OpenHands agent using the SDK, until we come across a node that's blocked because one of its upstream dependencies is still incomplete. When the fixer is done and the batch status is updated, we'll need to rerun verification before the status can turn green.
Looking at the report that the fixer returned, there's not much information, just the title of a PR. We've set it up such that every fixer produces a nice, tidy pull request ready for human approval. Just because this refactor is automated doesn't mean it doesn't need to be reviewed.
And here's the generated PR. The agent did an excellent job of summarizing the code smells it identified and the changes it made to address them. It also left helpful notes for the reviewer, and notes for anybody who works on this part of the code in the future. When we look at the contents of the PR, we see a very clean diff. All the changes are tightly focused on addressing the code smells that we identified, and only a small amount of code was modified, the bulk of which is a single refactor. Not all PRs will be this small, but our batching strategy and our instructions ensure that the code changes stay well scoped.
That keeps the work manageable for the agent, and it also makes it much easier to review. From here, the full process for removing code smells from the entire codebase is clear. Use the verifier to identify problems, use the fixer to spin up agents and address those problems, review and merge those PRs, unblock new batches, and repeat until the entire graph is green. We've already used this tool to make some significant changes to the codebase, including stricter typing and improved test coverage. We could not have done it without the OpenHands agent SDK powering everything under the hood.
So that's the OpenHands refactor SDK, powered by our agent SDK. We're going to walk through a bit of a workshop to build something a little simpler, but very similar. We'll get a couple of agents working together to fix tasks that were discovered by an initial agent.
I want to talk a little bit about strategies for both decomposing tasks and sharing context between these agents. These are both really big, important parts of agent orchestration.
For effective task decomposition, you're really looking to break down your very big problem into tasks that a single agent can solve, that a single agent can one-shot: something that fits in a single commit, a single pull request. This is super important, because you don't want to be constantly iterating with each of the sub-agents. For each one, you want a pretty good guarantee that it is just going to one-shot the thing, and that you'll be able to understand the result and merge it into your ongoing branch.
You want to look for things that can be parallelized. This is going to be a huge way to increase the speed of the task. If you're just executing a bunch of different agents serially, you might as well have a single agent moving through the task serially.
The more you can parallelize, and the more agents you can get working at once, the faster you're going to move through the task and iterate.
You want things that you can verify as correct very easily and quickly. Ideally, you'll have something where you can just look at the CI/CD status and have good confidence that if everything is green, you're good. Maybe you'll need to click through the application itself, or run a command yourself, to verify that things look good to you. But you want to be able to understand very quickly whether an agent has done the work you asked for or not.
And you want clear dependencies and ordering between tasks. You'll notice these criteria are pretty similar to how you might break down work for an engineering team. You need to make sure that you have tasks that are neatly separable, so that different people on your team can execute them in parallel and then collate the results together. You want to know that once task A is done, we can start tasks B, C, and D, and once those are done, we can do E. So it's very similar to breaking down work for a team of engineers.
There are a few different strategies for breaking down a very large refactor like the ones we've talked about. The simplest, most naive one is to just go piece by piece: you run an agent per file, or per directory, or maybe per function or class. This is a fairly straightforward way to do things, and it works well if those pieces can be executed without depending on one another too much. A good example is adding type annotations throughout a Python codebase. Then at the very end, once you've migrated every single file, say, you can collate all those results into a single PR.
A slightly more sophisticated approach would be to create a dependency tree.
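
The ordering constraint mentioned above (task A first, then B, C, and D, then E) can be sketched with Python's standard-library `graphlib`. This is only an illustration: the task names and the `execution_waves` helper are hypothetical. Each wave contains tasks whose dependencies are all finished, so every task in a wave could be handed to a separate agent at the same time.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group tasks into waves; tasks within a wave can run in parallel.

    dependencies maps each task to the tasks that must finish before it.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the tasks form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything currently unblocked
        waves.append(ready)
        sorter.done(*ready)  # marking tasks done unblocks their dependents
    return waves


# Hypothetical task graph: task -> tasks it depends on.
TASKS = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["A"],
    "E": ["B", "C", "D"],
}

print(execution_waves(TASKS))  # [['A'], ['B', 'C', 'D'], ['E']]
```

In a real run you would call `done` for each task as its agent finishes rather than waiting for the whole wave, so that newly unblocked tasks can start as early as possible.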
The idea here is to add some ordering to that piece-by-piece approach: you start with the leaf nodes in your dependency graph. You start with, maybe, your utility files, and those get migrated over first. Then anything that depends on those is going to have those initial fixes in place, and the dependents can start working through their part of the process. You basically work your way up to whatever the entry point of the application is. This is often a better way to proceed. It's a more principled approach for how you're going to order these tasks.
Another approach is to create some kind of scaffolding that allows you to live in both the pre-migration and post-migration worlds. We did this, for example, when migrating our React state management system. We basically had an agent set up some scaffolding that would allow us to work with both Redux and Zustand at the same time. It was pretty ugly, not something you would actually want to keep, but it allowed us to test the application as each individual component got migrated from the old state management system to the new one. Then we sent off parallel agents for each of the components. They got each component done, and at the very end, once everything was using Zustand, we ripped out all the scaffolding so there was no more mention of Redux. Having that scaffolding in place allowed us to validate, as each agent finished its work on just that one component, that the application still works and the component still works. We didn't have to do everything all at once before we got some kind of human feedback on the agents' work.
I also want to talk about context sharing. As you go through a large-scale project like this, you're going to learn things, right? You're going to figure out that your original mental model wasn't actually complete.
I didn't actually understand the problem correctly. You might have a few agents running, and they're all hitting the exact same problem. You want to share the solution to that problem so they're not all getting stuck, right? There are a bunch of different strategies for doing this context sharing between agents.

One strategy, the most naive thing you can do, is share everything. Basically, every agent sees every other agent's context. This is not great. It's basically the same thing as just having a single agent working iteratively through the task. You can blow through the context window if you do something like this. So this is not the way to go.

A better approach would be to have the human being manually pass information to the agents. If you have a chat window with each agent, you can just paste in something like "use library 1.2.3 instead of 1.2.2." The human can also modify an AGENTS.md or a microagent to pass messages to these agents. This involves a lot of manual human effort. It involves a lot more babysitting of the agents. So it's not super scalable.

You can also have the agents share context with each other through a file like AGENTS.md. You can have the agents actually modify this file themselves. Maybe they write notes into the file as they learn new things. The downside here is that sometimes agents will try to record unimportant things. They can get kind of aggressive about pushing information to this file, so some kind of human in the loop can help.

And then last, this is probably the most bleeding-edge idea here, but you can basically give each agent a tool that allows it to send messages to other agents. It can be a broadcast message that goes out to all the other agents, or it could be a point-to-point conversation. This is super fun to experiment with, and we're experimenting with this now with our SDK.
But it's tricky to get right. Once you get agents talking to each other, you're increasing the level of non-determinism in the system. I have an example here on the right. This is from [inaudible]. They had two agents just talking to each other. They just entered into a loop of wishing each other zen perfection.

Well, now I want to work through an exercise. I would love it if you all want to follow along. You can access this presentation for copy-pasting purposes at duct.fh.openhands-workshop. We'll work through some coding exercises with the OpenHands SDK, specifically to do CVE remediation at scale. We're going to write a script that will take in a GitHub repository, scan it for open source vulnerabilities, or CVEs, and then send off a parallel agent for every single vulnerability we find to solve it in a pull request. So, duct.fh.openhands-workshop. Let me know if anybody can't access it. It's giving you a slideshow? Yes, it should be the slideshow. There will be copy-pasteable prompts and links starting around slide 29. I'll get there.

So in terms of how this process is going to work, basically we're going to start with one agent that runs a CVE scan on this repository. It's going to scan for vulnerabilities. What's nice about using an agent for this is that it can look at the repository and decide what kind of scan to run for vulnerabilities, right? It might use Trivy to scan a Docker image. It might run npm audit on a package.json. So it can basically detect the programming language and figure out how to scan for CVEs here. Then once we have our list of vulnerabilities, we're going to run a separate agent for each individual vulnerability. Each of these agents is going to research whether or not it's solvable, update the relevant dependency, fix any breaking API changes throughout the codebase, and then open up a pull request. What's nice about this is that we can merge these individual PRs as they're ready.
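
The plan above, one scan followed by an independent worker per vulnerability, can be sketched with a thread pool. This is only an illustration of the fan-out shape under stated assumptions: `fix_vulnerability` is a hypothetical placeholder for a full agent run, the CVE identifiers are made-up placeholders, and nothing here uses the OpenHands SDK.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder identifiers (assumption); a real list would come from the scan step.
VULNERABILITIES = ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"]


def fix_vulnerability(cve_id):
    """Hypothetical stand-in for one agent run; returns the branch it would push."""
    if cve_id == "CVE-0000-0002":
        # Simulate a vulnerability that turns out not to be solvable.
        raise RuntimeError("no fixed version available")
    return f"fix/{cve_id.lower()}"


def run_all(cve_ids, max_workers=4):
    """Run one worker per vulnerability and collect successes and failures separately."""
    solved, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fix_vulnerability, cve_id): cve_id for cve_id in cve_ids}
        for future in as_completed(futures):
            cve_id = futures[future]
            try:
                solved[cve_id] = future.result()
            except Exception as exc:  # one failed worker must not stop the rest
                failed[cve_id] = str(exc)
    return solved, failed


solved, failed = run_all(VULNERABILITIES)
print(f"solved {len(solved)} of {len(VULNERABILITIES)}; failed: {sorted(failed)}")
```

Because each result is collected as it completes and failures are recorded per item, a single unsolvable vulnerability only shows up in `failed` and the remaining branches are still available to review.
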
What's nice about solving these in parallel is that we get a bunch of different PRs, so we can merge them as they're ready. If one agent gets stuck, or one of the vulnerabilities isn't solvable, all the other ones are still going to work. Maybe we get to 90% or 95% solved. We don't have to get to 100% in order to get any value here.

Just some quick pseudocode for what this is going to look like. This is an example using the OpenHands SDK of how to create an agent. You can see we create our language model. We then pass that language model to an agent object, along with some tools: terminal, file editor, a task tracker for planning. We give it a workspace, and then we just tell it what we want it to do and hit run. So it's a pretty naive little hello world example. We'll see how it gets a little bit more complicated as we progress through this particular task. The benefit of that first agent is that we're going to iterate through all of the vulnerabilities we get back out. For each one, we'll send out a new agent asking it to solve that CVE.

So to get started here, I would say create a new GitHub repository. You can start saving your work there. You're also going to need both a GitHub token and an LLM token. If you sign up for OpenHands at app.all-hands.dev, you can get $10 of free LLM credit there. If you're already an existing user, let me know. I can bump up your existing credit for the purposes of this exercise. Then we're going to start an agent server. This is basically a Docker container that's going to host all the work that our agents are doing. This is a great way, again, to run agents securely and more scalably. So instead of running the agents on our own machine to solve these CVEs, we're going to run them inside of a container. If we were doing thousands of CVEs, we could run this in something like a Kubernetes cluster so that we have as many workspaces as we want for our agents.
But for the purposes of this exercise, we'll just run one Docker container as a single workspace for our agents. Then we can create an AGENTS.md or an OpenHands microagent to start working through this task. I'm going to be using the OpenHands CLI as we go here. You're all welcome to check out the OpenHands CLI. You can also use Cursor or Claude Code or whatever you're used to using as we work our way through a CVE remediation process with OpenHands.

I'm going to give it a couple of minutes. I'm going to walk through creating my GitHub repo, getting my GitHub token, etc. If you have any trouble, feel free to raise your hand and I'll come around and help get it all set up. [crosstalk]

All right, so I've got my new GitHub repo here. I'm going to add a quick OpenHands microagent here. I'm just going to say we're building a process for remediating CVEs with agents, with links to the docs for the OpenHands SDK. That gives OpenHands a little bit of context. [inaudible] We now have this repo set up.

To get a token, I'm actually going to do that over here, so that I don't expose my token, but you can go to GitHub settings under your profile, then Developer settings, Personal access tokens, and I'll do classic tokens. All right, classic tokens: give it a name, and then the repo scope is really all you'll need. That way we can open pull requests to solve the CVEs involved. I do use a classic token, not the newer kind. [inaudible] It does the same thing, so you're welcome to use either.

I guess you could create a new repository. [inaudible] So what permissions do we need? Just the repo permission. [inaudible]

You go to API keys in your profile here. You can get your OpenHands API key here, your LLM API key here.
I won't show it, but this is all you need to use our LLM. [inaudible]

Okay, last, I'm going to start up the agent server here. You probably want to copy-paste this out of the presentation. I have my repo cloned, and I run the agent server container. If you do want to work with the OpenHands CLI, you may need to install OpenHands, and that's going to give you the OpenHands CLI. Again, you can use Claude Code, Cursor, whatever else if you want. You might just need a little more time with the setup. I'm going to get my LLM key and GitHub token set up.

So I'm going to start with this first prompt. Basically what we're going to do is point our agent at the OpenHands SDK, point it at the documentation, and just ask it to check that our LLM API key is working, that it can actually do an LLM completion. So it'll be a very basic hello world, just to get started here. I'm going to tell it I'm using the OpenHands key that I generated at app.all-hands.dev. So I'm telling it to use this OpenHands-prefixed Claude Sonnet 4 model. You can replace this with Anthropic if you want to use just a regular Anthropic API key. You may need to set this model a bit differently if you're using OpenAI. It's using LiteLLM, so you can look at the LiteLLM docs to figure it out. If you have an OpenAI key or another provider's key, you can look at the LiteLLM docs to see which model string to use. And I'm just going to copy this as is.

Sorry, what's the equivalent of AGENTS.md, what's the one for OpenHands? So you could just create a file called AGENTS.md, if you're working with a tool that uses that. In OpenHands we have what's called a microagent. Let me get into it. So .openhands/microagents/repo.md is the description of the repository you're in.
And I just gave it a couple of links to the SDK documentation and the repository for the SDK, so that it has access to the API docs there. This is kind of optional stuff. It only makes things a little easier for the agent as it works.

It thinks it's got something good, so let's see what's going on. I'm setting my environment variables here. [inaudible] It looks like the agent didn't quite get the API docs right. [inaudible] Let's check again.

[crosstalk] You can check the version, if you're on 0.9.6 or whatever. [inaudible]

I just tried to install OpenHands and got an error: failed to install entry points. I'm new-ish to the Python world, so I assume I did something wrong. You could try it on [inaudible] 11. That's just what I'm on, but yeah, give it a try.

I was able to run this on, like, all-hands.dev. Yeah, cool. And a minute later it created the PR. Looks good. Why are you doing it through the CLI? Really just for presenting. Normally, I actually prefer to work through the web UI here. I think being able to run the script and show it working locally is a little bit better for a demo. I actually like to work through the web UI and then have the agent push, and then pull locally if I really want to work locally. I think that was just extra steps for presenting purposes that we'd have to go through to use the web UI.

Looks like I've got the right API key set here. Just got to fix the output. Should we get a 200 back? Let's see. Yeah, you should. Yeah, something like this.
I finally got it to where the LLM says hello. [inaudible] Has anybody managed to get an LLM connection working? Thanks. I created the file, just so you can see what the script looks like. [inaudible] Basically, you can see we create an LLM. We set the model we want to use and the API key we want to use. Then it just sends a quick message to the LLM to make sure it's actually working. Let's move on to the next step.

So here we're going to actually start to do some work with the agents. We're going to tell the agent we're working with that we want to use the SDK to create a new agent. It's going to take a GitHub repository. It's going to connect to a remote workspace running at localhost:8000. Again, that's the Docker start command from before. If you haven't already run that, now's a good time to get Docker running. Docker runs this agent server. It's going to clone our repository into that Docker container. We're going to create an agent that's going to work inside that Docker container. And we're going to tell that agent to scan this repository for vulnerabilities.

With the OpenHands CLI, is there a way to interrupt it and get it to stop? Ctrl-P to pause. And then can I insert my corrections? Yeah, you can type a message or just type continue.

I got the CLI to install with the dash-AI package. I saw on PyPI that there's a dash-AI version. And it says in the docs that the dash-AI one is deprecated. But it is a usable CLI. [inaudible] Did you get the dash-AI one to work? Because as soon as I tried to run it, it crashed. Oh. [inaudible] But yeah, it installed and then it did just work. It does give a deprecation warning when I check the version. So yeah, there it is. If you want to download an executable binary from our release page, that might be more straightforward.
You can also run it in Docker. If you look at our CLI docs, I think there's a uv run option as well. [crosstalk] Okay. Oh, thank you.

Supposedly I have an agent working here. Let's see, I'll run it with [inaudible]. Now, we should have a few CVEs in there. Let's see if we find any vulnerabilities. OpenHands will visualize the output here. You can see the agent working even with the SDK. [inaudible] It's updating its task list. It's cloning the repository. It doesn't have Trivy, so it's installing Trivy. [inaudible] So we're running Trivy now.

Let me show a bit about what this generated code looks like. You can see we have the same LLM from our first step. Now we're actually passing that into an agent. We're also giving it a terminal tool and a file editor tool. We're creating this remote workspace that's connected to our Docker container so that it works in its own environment. We create what's called a conversation, which is basically one chunk of context that the agent is going to manage as it goes about its work. I pass a task with clear instructions for what it's supposed to do, and then send that message, that task.

It looks like that initial scanner agent is almost done. It looks like that agent ran just fine, with these results. [inaudible] It's finding vulnerabilities.

So the next thing I'm going to ask it to do is basically reach into its environment to get the vulnerability list out. The idea is we're going to have it save the vulnerabilities to a JSON file. Then, on that workspace object inside of Docker, we're going to execute a command to get those vulnerabilities back out. We also have some options for manipulating files in the workspace. But for now, we're just going to iterate over the vulnerabilities in that JSON file, just so we can see that we're able to reach into this workspace and get some information back out.

There you go. So we got some vulnerability results. The agent's finished. Let's see if our script can get them back.

So there's an action and an observation. It might be running a command, and then an observation comes back with the output. [inaudible] You can take the events, and there are two kinds: actions and observations. The LLM comes back with an action to take, basically a tool call, and then the observation is like a tool call result.

If anyone's stuck on anything, I'm happy to help. Just raise a hand. What's that? Number four. I just did number three. Nice. Yeah. Looks like it's printing the CVE list. Yeah. That looks good. [crosstalk]

So that's five little sections. For sure. Yeah, yeah, yeah. No, I think there are definitely better ways to organize this code than to have one single script. It's just easier for demo purposes. I do have a demo repo. I think it's OpenHands slash CVE demo, which uses separate classes. There's a single CVE agent subclass [inaudible]. It's a little bit more organized than just this one script. [crosstalk] And we finish in a few minutes, so we need to move on.
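
The step described earlier in the walkthrough, where the script reads the saved vulnerabilities JSON back out of the workspace and iterates over it, can be sketched with the standard `json` module. The record layout below (`id`, `package`, `installed`, `fixed`) and the sample text are assumptions for illustration, since the talk does not show the file's actual schema, and the code that would fetch the text from the workspace is deliberately left out.

```python
import json

# Hypothetical file contents (assumption); in the workshop this text would come
# from running a command inside the remote workspace.
RAW_OUTPUT = """
[
  {"id": "CVE-0000-0001", "package": "example-lib", "installed": "1.0.0", "fixed": "1.0.1"},
  {"id": "CVE-0000-0002", "package": "other-lib", "installed": "2.3.0", "fixed": null}
]
"""


def parse_vulnerabilities(raw):
    """Parse the scanner output, keeping only well-formed records that carry an id."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of vulnerability records")
    return [record for record in records if isinstance(record, dict) and "id" in record]


vulnerabilities = parse_vulnerabilities(RAW_OUTPUT)
for vuln in vulnerabilities:
    fixed = vuln.get("fixed") or "no fix listed"
    print(f'{vuln["id"]}: {vuln.get("package")} {vuln.get("installed")} -> {fixed}')
```

Validating the shape before iterating is a small safeguard, because the file is written by an agent and its format is not guaranteed to match what the script expects.
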
I'll be moving off. That's the question. Yeah. The last one is on. I'm going to try to do it. That's the one. Hardware. Yeah. Maybe it was. I'm sorry. I don't want to finish right now. This time this was one of the assumptions. There were actually ways to stop 1000. It's faster. Right. No. Yeah. Hardware. Yeah. It would have. We created a fine effort to make a small audio consortium on all the smaller ones. We had different groups in the group, we're being around, so we got to try this. So many of our nights, we have to say that we, generalize, we're actually to a share of our friends. We've all come to the same level, which is the most we've found in the company. We've been doing this for six years.
