
How Datadog built a universal machine tool for Claude Code

TL;DR

  • AI agents are rapidly transforming software development, shifting the bottleneck from human coding to managing and verifying autonomous agent-built systems.
  • To handle the growing complexity and ambiguity, a new paradigm of "machine tools" for software is proposed, inspired by industrial manufacturing's need for precision and repeatability.
  • "Tamper" is introduced as a structural machine tool, enabling agents to generate precise, declarative specifications that are formally verified and hot-reloadable, paving the way for "dark factories" where agents build and evolve systems autonomously.

Takeaways

  • AI coding tools like Claude Code have significantly accelerated software construction, allowing a single human to build complex distributed systems, such as a Kafka-like streaming service, in a matter of days.
  • The primary bottleneck in software development has shifted from human construction to the coordination, verification, and shipping of agent-generated work to production.
  • Engineers' roles are evolving from directly writing code to "shaping the work" for agents, which involves designing the factory, defining constraints, setting outcomes, and establishing robust verification loops.
  • Agent-built tools often lack reusability and operational consistency due to their improvisational nature, creating a need for standardized "machine tools" for software.
  • "Tamper" addresses this by having agents produce precise, declarative blueprints of operational domains, explicitly defining state machines and transitions for safer, more manageable system evolution.
  • These blueprints are compiled into "transition tables," making critical control logic data-like, allowing agents to dynamically and safely modify system behavior through hot-reloading without full redeployments.
  • A multi-level verifier within Tamper acts as a gate, ensuring the safety and correctness of agent-generated changes before they are loaded into runtime.
  • The rigor previously limited to high-assurance software (e.g., aviation) becomes cost-effective for general software development, allowing for the scaling of agent-built mission-critical systems.

Vocabulary

  • Machine tools: A metaphor for software tools that enable agents to produce precise, repeatable, and verifiable software components, similar to jigs or CNC machines in manufacturing.
  • Claude Code: The AI coding tool used at Datadog that the speaker cites as evidence of the rapid increase in AI capability for software generation.
  • Distributed queuing system: A system that manages and processes messages or tasks across multiple interconnected computers, ensuring reliability and scalability (e.g., Kafka).
  • Formal modeling: A rigorous approach to describing and analyzing system behavior using precise mathematical or logical notations to minimize errors in critical parts.
  • Evolutionary optimization harness: A controlled environment where parts of code are improved through variation, feedback, and adaptation, inspired by natural selection.
  • Dark factory: A metaphor for a software development process where AI agents work autonomously, designing, building, and operating software without constant human intervention.
  • Blueprint: In the context of Tamper, a precise, declarative specification of an operational domain that defines states, legal transitions, effects, and invariants.
  • Transition table: A data-like representation of a system's critical control logic and state changes, derived from a blueprint, designed for dynamic and safe modification by agents.
  • Property tests: A testing technique where properties that the code should satisfy are defined, and the test framework generates various inputs to check whether these properties hold.

Transcript

A VP at Datadog, Sesh Malah.

I noticed it's 4 p.m. How's the energy going right now? How are you liking the conference so far? Great. All right, let's get some blood flowing. Show of hands if you have heard of machine tools before, like what's on the slide. Have you used any? I was expecting zero hands anyway, so that's the talk. These are tools like jigs, fixtures, gauges, and mills that you see in manufacturing to produce precise and repeatable machine parts, the kind that you assemble into larger machines like engines, aircraft, nuclear reactors, and the lunar landing modules that we saw this morning. Machine tools were a breakthrough during industrialization: they enabled scale through interchangeability, standardization, and precision. That was the inspiration for my talk and for what we built at Datadog, and I will share with you how all this fits into building software with Claude Code.

What you're seeing on the slide is my view of the last 18 months. I think there were different views of this graph at multiple sessions today. This is non-scientific, so please don't consider it my eval. It's purely personal. It is the scale of my ambition with the models. For most of 2025, the models were useful to me, but within very narrow boundaries: local changes, small functions, tests, glue code, and throwaway prototypes that I learned a lot from. Then around late 2025, and I think you have all noticed this, the slope started to change, like this exponential. I started trusting Claude with larger and more ambiguous systems code work.

Before I talk about machine tools, I want to go through the lineage of some ideas that led me here. This is 2024. It's a write-up. We were building a distributed queuing system called Courier from scratch. All of this was before agents, all by hand. Can you believe it? Software was still built by hand then.
The hard part, for any distributed system, human-built or agent-built, is not just building the parts, but making the interactions between them observable, testable, and verifiable. So we were rigorous with formal modeling and simulations, and you see various techniques in this post. All of this is classical systems work: you identify the parts where mistakes would be expensive or hard to reverse, and you raise the rigor for those parts so that mistakes don't slip into production.

The next idea came around September 2025. We called it BitsEvolve. It's a closed-loop evolutionary optimization harness inspired by AlphaEvolve from DeepMind. The idea is that you let parts of the code improve themselves in a narrow, controlled harness. I think we've seen some announcements today at the keynote about dreaming and various loops; this was us trying it in September 2025. It's an ensemble, or a council, of models, big and small, and they generate variants of whatever code you designate for improvement. In our case, it's hot functions, blocks of code. Then a cascade, which you see on the right-hand side of the screen, with benchmarks, tests, and production observability, decides what survives. It's like natural selection. This was the first glimpse for me that parts of software could maybe be cultivated like living organisms, like plants or microbes, growing through variation with feedback and adaptation. However, the insight was that this kind of evolution is only as good as the environment it adapts within. Bad benchmarks produce bad evolutions. With weak observability, your optimizations are shallow.

Then, during this period of the leap, the exponential that started, we were building whole distributed systems with Claude Code. I mentioned the Opus 4.5 inflection point, where I raised my ambition. It wasn't sudden.
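The BitsEvolve loop described above (generate variants, run them through a cascade, keep what survives) can be sketched in a few lines. This is a toy illustration, not Datadog's harness: `evolve`, `mutate`, `gates`, and `score` are hypothetical names, the "code" being evolved is just an integer, and a random mutation stands in for the model ensemble.

```python
import random


def evolve(seed, mutate, gates, score, generations=20, population=8, rng=None):
    """Return the best candidate found that passes every gate in the cascade."""
    rng = rng or random.Random(0)
    best = seed
    for _ in range(generations):
        # Stand-in for the model ensemble: produce variants of the current best.
        variants = [mutate(best, rng) for _ in range(population)]
        # Cascade: a variant must pass every gate (tests, benchmarks, ...) to survive.
        survivors = [v for v in variants if all(gate(v) for gate in gates)]
        for candidate in survivors:
            if score(candidate) > score(best):
                best = candidate
    return best


# Toy usage: gates are simple bounds checks, and the score peaks at 7.
mutate = lambda x, rng: x + rng.choice([-1, 1])
gates = [lambda x: x >= 0, lambda x: x <= 10]
score = lambda x: -abs(x - 7)
result = evolve(0, mutate, gates, score)
```

The sketch also shows the limitation the speaker names: the loop can only climb toward whatever `gates` and `score` reward, so a weak cascade yields weak results.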
I didn't wake up one day and start trusting Claude with whole systems, or even my databases. We have a lot of them. It happened progressively. I tried harder tasks, then larger subsystems, failed a lot, learned a lot through a lot of experiments, and then started seeing successes around this period. So we decided to be more ambitious: could we build something as big as Kafka? Show of hands if you've heard of Kafka. OK, more hands now. Kafka is a streaming service. So we attempted it: can we build this from scratch? It's repeating the same playbook that we had for Courier. It's a queuing system, pretty close. Same rigor, so you will see the cascade on the right-hand side of this slide, but with Claude Code doing most of the construction and one human building it. In a few days, to our disbelief, we had a fully functional, Kafka-comparable system working. We called it Helix. The source code, the methodology, and the details are all linked in this post. Feel free to check them out later. But taking Helix to production now requires building mileage, and that has been challenging for us.

So the next natural move for us was this. I spoke about BitsEvolve earlier. Could BitsEvolve evolve parts of Helix the same way it evolved hot functions? The ambition was a big jump, from small functions to "can the whole component evolve?" Can it provide the mileage we needed to get to production? The answer was: not quite. The surface area was too large. Even with the verification cascade that I showed you on the previous slide, which is quite rigorous, there were too many places where the human had to interpret and correct. It was too multi-turn, too interactive. So we asked: OK, can we look for a narrower surface? Can we dial back the ambition a little bit? This post was about that. We chose our metrics aggregation server. We are Datadog, so we have lots of metrics.
Could we improve the materialization logic live, not offline like what we did with BitsEvolve before? Can we optimize it per customer, with a proof-carrying path around the change, so a human doesn't have to review every candidate that is generated?

If you look at the flow across these four projects, you can observe a pattern: each project exposed the bottleneck for the next. There were many talks and comments today about the bottleneck moving. With Courier, the bottleneck was construction, with humans building systems through careful design, modeling, and verification. It took us one year to build Courier. By comparison, it took three days to build our Kafka-like system. In, like, what? Twelve months. With BitsEvolve, the bottleneck moved to the feedback loop: the model produces variation, but the harness decides what survives. And with Helix, the Kafka-like system that we built, the bottleneck moved again. Agents could build large parts of the system, as we have seen, but then the human has to coordinate shipping the work to production through tools and mechanisms built for humans. That's the Amdahl's law that Dario was talking about earlier today.

So that's the jump we are making, if you look at the center of this slide: from mechanization to industrialization. Mechanization means agents are doing more of the work now. Industrialization, to borrow the metaphor, and that's why I introduced machine tools, means work becomes repeatable, verifiable, controllable, and scalable. The idea of a machine tool sounds cool, I know, but why do we need them now for agent-built software? Aren't these just for landing lunar modules? Because the complexity and the ambiguity are growing each time we try to raise our ambition level. We started with targeted changes on existing systems. In the last four months, about 90% of Datadog used AI coding tools for production code. That's roughly 3,000 engineers.
And Claude Code drove at least two thirds of that. Most of the work, as I described with Helix, was still driven by a single human: one engineer steering one or more agent sessions. And the work was moving across this map, more complex to generate on one axis and more ambiguous to verify on the other. Here are a few concrete examples. I'm not going to enumerate all of these, but feel free to take a picture if useful. If any of these resonate with you, I'm happy to chat more afterward. The main point I want to underline is that these are generating personalized flows in our software development lifecycle, because one human can do a lot more than they used to.

For engineers in the software delivery lifecycle I was talking about, the word "flow" used to mean a direct relationship between intent and code. You understood the problem. You wrote the code. You tested it. You reviewed it. You shipped it. You operated it. You repeated that over and over again. But with agents, that abstraction level is changing rapidly. I haven't seen code in a while. I've made my peace with it because I've been a manager for a while. But many people are still going through their seven stages of grief over not seeing code on a day-to-day basis. You're no longer writing the code. You're shaping the work. You're deciding what the agent should see (we saw "outcomes" in the keynotes today), what tools it should have, what success means, and how failure should be detected. All of this is powerful. It's like everyone has been promoted three levels up the management chain, which engineers didn't sign up for. It's a huge leap, and it's also disorienting. You push against gravity very fast, and it can feel sickening. We haven't acclimated to this altitude of working, specifically engineers who love looking at code.
Before this jump in model capabilities, the human team was the factory. Tools were designed around human attention and judgment, and the operational memory of what's actually happening in production lived in the human organization, in our collective minds. Connecting back to Courier, like I said, only 12 months ago this was the world we were in. Only four months ago it was still true, and then it started to change. That's the inflection point with Claude Code and Opus 4.5. We are starting to see one lead human coordinate multiple interactive sessions. I've seen many screenshots of parallel sessions cranking out stuff. It's disorienting for me personally to watch: three, four, five agents working on different parts. I heard Jarrett today saying he's doing 10 things at a time; the stuff he was showing was only 10% of whatever his time was being spent on.

But these tools were still human-shaped. The agents were two orders of magnitude faster, and this toolchain isn't built for their speed. So the human became the bridge between agent execution and the human-shaped systems. And now all this operational knowledge, the kind you need when you wake up at night and something is broken, is just in that person's head, and probably in some markdown files that agents wrote down along the way.

That was further amplified with Claude Managed Agents. Agents are doing a lot more background work. It's compute. And they start taking judgment-bearing roles, meaning they are not just following your instructions anymore; they're making their own decisions. And they're running longer: for hours, sometimes overnight, sometimes for days. I don't know the longest task; someone benchmarked 20 hours, 28 hours. They construct their own tools in these sessions. They write their own code. And the mismatch is still there: each agent invents its own tools, its own glue, its own conventions. That system becomes really hard to share and operate.
You can see the blur between what the agent sessions produced as intermediate tools and what your product, your code, is doing. You get a lot of output fast. A lot of it is useful, but some of it can look like false progress, and most of the tool construction only makes sense in that local session. You start to see that blur. This is where I felt we need something more structural. If agents are going to build and operate large parts of our systems and databases, which are mission-critical, they need the equivalent of this machine tool concept that I am trying to introduce.

So Tamper is a machine tool. The idea is that instead of the agent inventing disconnected tools for every local need, it produces precise specifications of the intent and the problem domain. It is a machine tool in the same sense as a jig or a CNC machine. Have you seen those computer-aided machines where you give them a specification of what your screw threading needs to be? They are extremely repeatable; you can run them and build aircraft and things like that with them. In this case, the agent does not improvise the final mechanism each time. It produces a precise description, and it iterates with Tamper, or a Tamper-like mechanism, to make something work first and then later turn that into something repeatable, checkable, and reusable. So you could actually build a software factory around your codebase.

That's the concept of a dark factory. Simon Willison of simonwillison.net, a pretty amazing blog, and I think one of the most influential AI voices right now in teaching how to work with agents to build software, has been popularizing this phrase, "dark factory." I think it's a pretty good encapsulation of a software process where the agents keep working without humans on the virtual factory floor. You can turn off the lights.
The human role now becomes designing the factory, the constraints, the outcomes, and the verification loop, so that this thing can run for hours, days, and weeks producing what you wanted it to produce. Something like Tamper can fill this role of a machine tool to build such factories.

Let's look at a dark factory concretely, with Helix as the target. As I shared, Helix is a Kafka-like streaming service. Streaming, like Kafka, is probably one of the five most expensive services we run at Datadog. So we have been shadowing our production workloads with Helix. In some cases, we actually believe it can be materially cheaper than our current production solution. Can you believe it? We took about a week to build it, we started shadowing, and we saw opportunities for it to be two to five times cheaper. But getting from this promising state to production still takes work on the mileage. The system needs to earn it: it has to run, and multiple people have to be able to operate it, not just the person who built it. So we created a bunch of synthetic workloads that model our production shapes, and we constructed a factory.

The software factory for Helix uses Tamper in three distinct ways. First, as an agent control plane for Claude Managed Agents, where the sessions, roles, work queues, and operational lifecycle are more precisely managed. Second, as a way for agents to build their own tools as small Tamper apps, bridging the SDLC tooling like Git, CI, and deployment. And third, as a Helix control API: the interface and the lifecycle surface around the Helix data plane to exercise this workload.

And there was a surprise for me: it started to feel more general than agent infrastructure. A lot of software, if you squint, is just control logic around databases, APIs around state, policies around mutation, lifecycle transitions, and integrations with external systems.
So Tamper could be universal in the sense that it can be applied to any software that has the shape I described. Before I go deeper into how Tamper works, you might be wondering at this point: why is this different from asking Claude Code to build a CRUD app in TypeScript or Python? Claude can do that very well. We have seen lots of PRs and lots of code going out. However, in normal CRUD apps, the control logic is spread across routes, database constraints, service code, background jobs, and documentation. It may all have good tests and coverage, but the operational model, which is generally state-machine-shaped, is mostly implicit in the codebase. The idea is that Tamper makes that state machine explicit. This isn't particularly novel. We have had this for decades with runtimes like Erlang/OTP and other actor runtimes, and more recently with workflow engines and durable execution runtimes like Temporal. They have all popularized these precise runtimes, so you can run mission-critical applications.

So let's look at some Tamper internals. On the build path, Tamper asks for a model of the operational domain I'm describing, in the form of a blueprint. What are the states? Which transitions are legal? Who can request them? What effects are allowed? What invariants must hold? What happens if a tool call fails? The agent will often iterate and arrive at this blueprint over multiple turns. Think of it as its own plan mode: instead of arriving at a markdown file with some description, it arrives at precise declarative artifacts. Then you have a compiler equivalent in Tamper that verifies the blueprint, and it can be hot-deployed into a runtime. There are other runtimes that do this. The Erlang BEAM does that, if you have heard of it. Anybody heard of the Erlang BEAM? There you go. The run path feels the same as any other CRUD-shaped API. You won't notice the difference.
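To make the blueprint questions above concrete (states, legal transitions, who may request them, allowed effects, invariants, failure handling), here is a minimal sketch of such a declarative artifact as plain Python data. The talk does not show Tamper's actual blueprint format, so every name here (`Blueprint`, `Transition`, the `helix_rollout` states and roles) is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transition:
    action: str                       # name of the requested action
    source: str                       # state in which the action is legal
    target: str                       # state the entity moves to
    roles: frozenset                  # who may request the action
    effects: tuple = ()               # named side effects allowed to fire
    on_failure: Optional[str] = None  # state to enter if an effect's tool call fails


@dataclass(frozen=True)
class Blueprint:
    name: str
    states: frozenset
    initial: str
    transitions: tuple
    invariants: tuple = ()            # invariant names, checked elsewhere

    def legal_actions(self, state):
        """List the actions that are legal in the given state."""
        return sorted(t.action for t in self.transitions if t.source == state)


# Hypothetical blueprint for a rollout, loosely following the talk's example.
rollout = Blueprint(
    name="helix_rollout",
    states=frozenset({"planned", "rolling", "completed", "failed"}),
    initial="planned",
    transitions=(
        Transition("start_rolling", "planned", "rolling", frozenset({"operator"}),
                   effects=("patch_statefulset",), on_failure="failed"),
        Transition("mark_progress", "rolling", "completed", frozenset({"operator"})),
        Transition("fail", "rolling", "failed", frozenset({"operator", "builder"})),
    ),
    invariants=("terminal_states_have_no_exits",),
)
```

Because the whole operational model sits in one small value, a question such as "what can happen to a rollout that is currently rolling?" is answered by `rollout.legal_actions("rolling")` rather than by reading routes and service code.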
It's important to note that the agent is not generating arbitrary application code directly. We have raised the abstraction. It's generating a structured description that compiles into our runtime shape. The compilation step is outside the model. It's the same as when you write Rust code and give it to the Rust compiler. Tamper turns this blueprint into something called formal state transitions. This is very common in functional programming and actor runtimes, if you have used them. This is the most important technical detail. "Formal" here doesn't mean every possible property is proved. It isn't theoretical. It means the basic shape of the application is represented as a precise transition system. And when you have that, it is a much better basis for reasoning, for both humans and agents, than arbitrary code.

Pardon my completely brutalist design of this slide, with no syntax highlighting. This is an agent-written spec for a Helix rollout. The spec looks like this: states, actions, and triggers. I actually got an idea this morning when I saw the announcement on Claude managed triggers. Maybe this could be just a pass-through: you could declare a trigger here and then have Claude run it. But the idea is that you define your states, actions, and triggers, or the agent does it, Claude does it. In this case, in the deployment description, "start rolling" is only valid when the entity is in "planned", and the entity then moves to "rolling". That triggers a side effect, like going and patching the Kubernetes StatefulSet, the details of which are not important. And then a callback comes back to either mark the progress or fail.

That artifact is declarative. What Tamper does is take this spec and compile it into a transition table, a concept where the critical control logic is data-like. It is not spaghetti, imperatively encoded in code. It's data-like, which makes it interchangeable and checkable. And it is not hidden in some improvised chain of service methods.
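The step from a declarative spec to a data-like transition table can be sketched as follows. This is an assumption about the general technique, not Tamper's compiler: the record shape in `SPEC` and the names `compile_table` and `step` are hypothetical.

```python
# Hypothetical declarative spec: a list of plain records, one per legal transition.
SPEC = [
    {"action": "start_rolling", "from": "planned", "to": "rolling",
     "effect": "patch_statefulset"},
    {"action": "mark_progress", "from": "rolling", "to": "completed", "effect": None},
    {"action": "fail", "from": "rolling", "to": "failed", "effect": None},
]


def compile_table(spec):
    """Turn the spec into a lookup table keyed by (state, action)."""
    table = {}
    for row in spec:
        key = (row["from"], row["action"])
        if key in table:
            # Two rows for the same (state, action) would make the machine ambiguous.
            raise ValueError(f"ambiguous transition for {key}")
        table[key] = (row["to"], row["effect"])
    return table


def step(table, state, action):
    """Look up one action; anything not in the table is rejected."""
    try:
        return table[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not legal in state {state!r}") from None


TABLE = compile_table(SPEC)
next_state, effect = step(TABLE, "planned", "start_rolling")
```

The control logic is now a dictionary that can be diffed, inspected, and checked before it runs, which is the property the speaker contrasts with logic spread across service methods.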
This is easier for agents to work with, and they can change it dynamically with safety. That's the promise. I have seen people writing Erlang scripts and hot-deploying them into BEAM. I think that's a pretty creative way for agents to use runtimes that are extremely hardened in higher-assurance systems. In this case, if a rollout needs a new state or a rollback path, the agent can make a targeted spec change and hot-reload it. So the iteration speed is even faster. You don't have to go through CI and deploy and everything. When you leave agents running overnight, I think they can make some pretty good progress without compromising on safety, PR reviews, code reviews, and things like that.

And we have policy gates. Because the transition table is data-like, Tamper evaluates the state transitions and the policy decisions together. Who can mutate a deployment rollout? Which actions can an operator agent take, in the team of agents working in the dark factory? You can say these operator agents can only do so much, or that these actions are forbidden for a builder agent. And as a human you can specify independently whether a completed rollout can be rolled again, or whether a failed tool result can continue on its rollout path, and so on.

Then we have side effects and an effect system, also very popular in typed languages like TypeScript. Effects are deliberately small, typed operations here in Tamper. Keeping them small prevents the state machine from becoming a back door for arbitrary application behavior. But if you need arbitrary application behavior, you can package it as Wasm modules. Familiar with Wasm? Come on. OK, a few hands. So this is where code generated by an LLM can live. It's a very narrow workspace. What that means is it makes troubleshooting easier. The last building block for Tamper is the verifier.
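The policy gate and hot-reload ideas above can be sketched together. This is a toy under stated assumptions, not Tamper's runtime: `Runtime`, `verify`, the role names, and the table contents are all hypothetical, and the verifier here is a trivial stand-in.

```python
import threading


class Runtime:
    """Holds the active transition table and policy; both are plain data."""

    def __init__(self, table, policy):
        self._lock = threading.Lock()
        self._table = dict(table)
        self._policy = dict(policy)

    def request(self, role, state, action):
        with self._lock:
            # Policy and transition legality are checked together, before any effect.
            if action not in self._policy.get(role, set()):
                raise PermissionError(f"{role!r} may not request {action!r}")
            if (state, action) not in self._table:
                raise ValueError(f"{action!r} is not legal in state {state!r}")
            return self._table[(state, action)]

    def hot_reload(self, new_table, verify):
        """Swap in a new table only if the verifier gate accepts it."""
        if not verify(new_table):
            raise ValueError("verifier rejected the new table")
        with self._lock:
            self._table = dict(new_table)


def verify(table, states=frozenset({"planned", "rolling", "completed"})):
    """Stand-in verifier: every transition must stay inside the known states."""
    return all(src in states and dst in states for (src, _), dst in table.items())


table_v1 = {("planned", "start_rolling"): "rolling",
            ("rolling", "mark_progress"): "completed"}
policy = {"operator": {"start_rolling", "mark_progress", "rollback"},
          "builder": {"mark_progress"}}
runtime = Runtime(table_v1, policy)

# A targeted change: add a rollback path and reload it without a redeploy.
table_v2 = {**table_v1, ("rolling", "rollback"): "planned"}
runtime.hot_reload(table_v2, verify)
```

After the reload, an operator can request `rollback` from `rolling`, while a builder asking for `start_rolling` is refused by the policy check regardless of what the table allows.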
Verification is basically the bottleneck right now for pretty much everything; we have seen so many announcements and discussions around it. In Tamper, the verifier is a gate before the transition table is loaded into the runtime. That's what allows it to say: this is safe, you can load it live. And it has multiple levels, like a Swiss cheese pattern. Not every level needs to find everything exhaustively. Claude is generally very good at making judgment calls about whether to run all levels or just some, the same way it uses compiler output. Level one checks the algebra of the automaton. Level two model-checks the reachable state graph: can any path reach a bad state? Level three runs schedules and injects faults: do invariants survive timing and failure conditions? And level four uses property tests, randomized testing with action sequences, a mechanism I highly encourage all of us to start learning. None of this is exhaustive on day one. Every discovered gap compounds the verifier. I also heard Boris mention the compounding effect: you add tests and you keep fixing the gaps. A missing condition revealed in production or in simulation can get added to the model or the test suite, and it compounds.

With all of this going, I can't predict much about where this could go. But the idea is that each artifact of a Tamper app or bundle is concise: very few lines, and it fits in your head. I have spent a lot of time running mission-critical infrastructure for Datadog, waking up at night thousands of times over the past three or four years. You won't be able to keep complex mission-critical logic in your head when you have to operate it. Even for a complex business domain like banking or financial systems, you should still be able to encode your business logic this way, as something a human can read.
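Three of the verifier levels described above can be approximated on a small transition table with the standard library alone. This is a sketch of the general techniques (well-formedness check, reachability search, randomized action sequences), not Tamper's verifier; the table, the invariant, and all function names are hypothetical, and level three (schedules and fault injection) is left out.

```python
import random
from collections import deque

TABLE = {
    ("planned", "start_rolling"): "rolling",
    ("rolling", "mark_progress"): "completed",
    ("rolling", "fail"): "failed",
}
STATES = {"planned", "rolling", "completed", "failed"}
TERMINAL = {"completed", "failed"}


def check_well_formed(table, states):
    """Level 1 (sketch): every transition stays inside the declared states."""
    return all(src in states and dst in states for (src, _), dst in table.items())


def reachable(table, initial):
    """Level 2 (sketch): breadth-first search over the reachable state graph."""
    seen, queue = {initial}, deque([initial])
    while queue:
        current = queue.popleft()
        for (src, _), dst in table.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def terminal_is_final(history):
    """Invariant: once a terminal state is entered, no state follows it."""
    return all(state not in TERMINAL for state in history[:-1])


def random_walks_hold(table, initial, invariant, runs=200, max_len=10, seed=0):
    """Level 4 (sketch): random action sequences; the invariant must hold throughout."""
    rng = random.Random(seed)
    actions = sorted({action for (_, action) in table})
    for _ in range(runs):
        state, history = initial, [initial]
        for _ in range(max_len):
            nxt = table.get((state, rng.choice(actions)))
            if nxt is None:
                continue  # illegal request: rejected, state unchanged
            state = nxt
            history.append(state)
            if not invariant(history):
                return False
    return True


def gate(table, bad_states=frozenset()):
    """Accept a table only if every level that was run passes."""
    return (check_well_formed(table, STATES)
            and not (reachable(table, "planned") & bad_states)
            and random_walks_hold(table, "planned", terminal_is_final))
```

A table that quietly lets a completed rollout start again, for example `{**TABLE, ("completed", "start_rolling"): "rolling"}`, passes the first two levels and is caught only by the random walks, which illustrates the Swiss cheese point: no single level has to catch everything.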
If the generated artifact is thousands of lines of tangled code, we are back where we started. And I'm not yet entirely convinced that an LLM can just review all of it and sign off on it. If it's a few small artifacts with explicit roles, then both humans and agents can modify the system without disturbing everything around it. I still feel like this is just good systems engineering anyway, and higher-assurance software like aviation and financial systems has been built this way for decades. However, the cost of such rigor with humans was too high for general software until now. Agents are changing that calculus.

Going back to my manufacturing metaphor, the win was not that one artisan could build a brilliant machine. The win was that machines built with machine tools had parts that were composable, inspectable, and replaceable, so that we could build larger machines. My claim is that for agent-built software, we need that kind of rigor to scale from here on, if you really want to build databases and put them in production.

I'll end where I started. If agents can build software autonomously inside factories with such discipline and rigor, maybe, just maybe, we don't need to stop at dark factories. I don't know why there is a "dark" in there. It kind of sounds sad. Software built this way can feel like an organism that we can grow, cultivate, and evolve through feedback, selection, and adaptation. It looks like, I don't know, agriculture, or directed evolution, where you choose the direction in which your software needs to go. You can choose for Kafka to be a queue because more customers are using it as a FIFO queue, or decide: no, we need more buffers, we don't need all these brokers. Or maybe I'm just dreaming, right? We have dreams now in Claude. That's all I have. Thank you so much for listening.
