Building pi in a World of Slop — Mario Zechner

TL;DR

  • Current AI coding agents often produce complex, low-quality code ("garbage") because they learn from vast internet data, including "slop," and struggle with global context and long-term consequences.
  • The speaker advocates for tools like "Pi," a self-modifying and highly extensible agent harness that prioritizes human control and adaptability to individual workflows.
  • Developers should maintain agency by carefully scoping agent tasks, using them for non-critical, repetitive work, and manually writing or thoroughly reviewing important code.

Takeaways

  • Claude Code became problematic for the speaker due to feature bloat, increased bugs, lack of observability, limited model choice, and shallow extensibility, leading to a loss of control over the development context.
  • Minimalist approaches to agent harnesses, such as those that only provide keystroke access to a terminal (e.g., Terminal Bench), can outperform complex systems, suggesting current agent development is in an experimental "fuck around and find out" phase.
  • The "Pi" agent harness is designed with a minimal core that is super extensible and allows the agent to modify itself using documentation and code examples, providing full control over tools and workflow.
  • Pi extensions are TypeScript modules that support hot reload during a session, enabling rapid iteration and customization of tools, commands, event listeners, and state management.
  • Agents, as currently used, compound errors with zero learning, no bottlenecks, and "delayed pain" (felt by the humans), leading to enterprise-grade complexity within weeks when left unchecked.
  • Avoid agents for critical decision-making or generating entire products; their decisions are local, based on learned internet "garbage," and they lack the human ability to learn from "pain" or understand system-wide implications.
  • Effective agent tasks are highly scoped, modular, non-mission-critical, and boring, or involve reproducing user issues, allowing humans to evaluate and finalize the reasonable output.
  • Prioritize human discipline and agency: slow down, figure out what you're building and why, learn to say no to unnecessary features, and write critical code by hand or review it line by line.

Vocabulary

  • Claude Code: A specific commercial coding agent harness discussed and criticized in the talk.
  • Dogfooding: The practice of a company using its own products or services internally.
  • Observability: The ability to infer the internal states of a system by examining its outputs.
  • Extensibility: The degree to which a system can be expanded or modified to add new capabilities.
  • Coding Agent Harness: A framework or environment that allows an AI agent to interact with development tools and execute coding tasks.
  • Terminal Bench: A specific benchmark for coding agent harnesses that provides only minimal interaction (sending keystrokes to a terminal).
  • Self-modifying agents: AI agents capable of altering their own code, configuration, or behavior based on instructions or observations.
  • Hot reload: A development feature that allows code changes to be applied to a running application without needing to restart it, speeding up iteration.
  • Clankers: A pejorative term used in the talk to refer to AI agents that automatically generate low-quality or irrelevant contributions (e.g., issues, pull requests) to open-source projects.
  • Boo-boos: A colloquial term used in the talk to refer to errors or bugs introduced into a codebase.

Transcript

I'm Mario, I build Pi in a world of slop, and this is a tragedy in three acts. Just to talk about this real quick: a bunch of people on the internet gave me money for ad space on my torso, and all of that goes to a charity. So yeah, thanks, guys. So, act one: building Pi. In the beginning there was Claude Code, and it was good, right? We all basically got catnipped by that thing and stopped sleeping. There was stuff before that, but Claude Code was the one thing that clicked with me the most. And to preface all of this, I love the Claude Code team. They are brilliant, talented people, super high velocity, and they also created the entire game, so major props to them. So this is not a roast. This is just me, an old man, telling you why I stopped using Claude Code and built my own thing. I started using Claude Code in about April 2025, I think, thanks to Peter, because he told us the agents are working now. And back then it was simple and predictable and fit my workflow. But eventually the token madness got hold of them, I think, and the team got bigger, and they started dogfooding and all that stuff and built a lot of features I don't need, which is fine, I can just ignore them. But with velocity and more features come more bugs. And that's bad, because I used to work at construction sites, and if my hammer breaks every day, I get really mad, and if my development tools break every day, I also get mad. So there was this flicker, which is a running gag, and here's Thariq telling us that Claude Code is now a game engine, and here's Mitchell from Ghostty telling us: no, it's not. And eventually they fixed the flicker, but then other stuff broke, and I think they're now on the third iteration of their TUI renderer. Yeah, but that's just the symptom. The real problem is that my context wasn't my context. Claude Code is the thing that controls my context. And behind my back, Claude Code does things to the context.
So you have the system prompt, which changes on every release, including the tool definitions. They would remove tools, modify tools. It's not good. They would insert system reminders at the most inopportune place in your context, telling the model: here's some information, it may or may not be relevant to what you're doing. It actually says it may or may not be relevant to what you're doing. And that kind of confused the model and kind of broke my workflows. On top of all that, there's zero observability, because that's how the TUI is constructed, and I like knowing what my agents are doing. There's zero model choice, which is obvious. It's the native Anthropic harness, so it makes sense for them to want you to use Claude, right? And there's almost zero extensibility. Some of you might have written some hooks for Claude Code, but I'm telling you, the number of hooks is small and their depth is very shallow, and every time a hook triggers, what actually happens is that a new process gets spawned, basically the command you specified for the hook to be executed, and I don't find that particularly efficient. So I took a step back and looked around for alternatives, and I'd like to especially call out Amp and Factory Droid, the Porsche and Lamborghini of coding agent harnesses. If you can afford them, please use them. They're at the frontier, they're really good, and the teams are fantastic. There's a bunch of other options, and I have history in OSS, so naturally I gravitated towards OpenCode. Again, brilliant team, super execution velocity, and they don't sell you hype, they sell you tools that work for the most part. I started looking under the hood of OpenCode, which is back to context handling as well, because that's the most important part for me, and I found a bunch of things. Given some conditions, OpenCode would just prune tool output after a specific minimum amount of tokens, and that basically lobotomizes the model.
There's also LSP server support, which means every time your model calls the edit tool, OpenCode goes to the LSP server that's connected, asks whether there are any errors, and if so, injects that as part of the edit tool result. Which is bad, because think about how you edit code. You're not writing a line of code, checking the errors, writing the next line, checking the errors. You don't do that. You finish your work and then you check the errors. This confused the model. There's a bunch of other things, like storing the individual messages of a session in JSON files. Each message is a JSON file on disk. There was also this, and this happens to all of us, no blame there, but it's not great if by default a server spins up and CORS headers are set in such a way that any website you open in your browser can now access your OpenCode server. And entirely unrelated to all of this, I started looking into benchmarks for coding agent harnesses and found Terminal Bench, which is a pretty good benchmark, all things considered. And the funny part about it is that it's the most minimal kind of thing you can think of. All it gives the model is a tool to send keystrokes to a tmux session and read the output of the tmux session. There are no file tools, no sub-agents, none of that stuff. And it's one of the best-performing harnesses on the leaderboard. Here's the leaderboard from December 2025. Irrespective of model family, Terminus scores higher, mostly even higher than the native harness of that model. So what does that tell us? I formed two theses. First, we are in the fuck-around-and-find-out phase of coding agents, and the current form is not their final form, right? The second thesis is: we need better ways to fuck around. And for me, that means self-modifying, malleable agents, things that the agent itself can modify and I can modify depending on my workflow.
So I stripped away all the things, built a minimal core, but made it super extensible and made it so that the agent can modify itself. With some creature comforts; it's not entirely bare bones. So that's Pi. It's an agent that adapts to your workflow instead of the other way around. It comes with four packages: an AI package, which is basically just an abstraction across providers and context handoff between providers; an agent core, which is just a while loop and the tool calling; a TUI framework, I come out of game development, so I built a thing that actually doesn't flicker too much; and the coding agent itself. Here's Pi's system prompt. That's it. Eventually, the industry created a new standard called skills, which is basically just markdown files. So we added that as well, and that needs to go into the system prompt. So, begrudgingly, we had to add a couple more lines. And finally, here's the magic that makes Pi able to modify itself. We ship the documentation, which was handcrafted by me and an agent, and code examples of extensions. And all we need to do for the agent to modify itself is tell it: here's the documentation, here's some code that shows you how to modify yourself by writing extensions. It comes with four tools. That's all it has: read, write, edit, bash. Here are the tool definitions. Don't read the text, just look at the size. That's it. Here's what happens when you start a new session in one of these tools. So the thing is, the models are actually reinforcement-trained up the wazoo. They do know what a coding agent is, because coding agent harnesses are basically what they are trained in when they are post-trained. You don't need 10,000 tokens to tell them "you are a coding agent". They know, because they are coding agents now. Pi is also YOLO by default, because my security needs are different from yours. And I don't think a little dialog that pops up every time you call bash, asking you to approve, is a smart security mechanism.
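
The "agent core, which is just a while loop and the tool calling" can be sketched roughly as follows. This is a minimal illustration, not Pi's code: every type and function name here is hypothetical, the model is a scripted stand-in instead of a real provider call, the file system is an in-memory map, and the bash tool is a stub that runs nothing.

```typescript
// Hypothetical sketch: none of these names come from Pi's source.
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };
type Tool = (args: Record<string, string>) => string;
type Model = (context: Message[]) => ModelTurn;

// In-memory stand-in for the file system so the sketch has no side effects.
const files = new Map<string, string>();

// The four tools named in the talk: read, write, edit, bash.
const tools: Record<string, Tool> = {
  read: ({ path }) => files.get(path) ?? `error: ${path} not found`,
  write: ({ path, content }) => {
    files.set(path, content);
    return `wrote ${content.length} chars to ${path}`;
  },
  edit: ({ path, oldText, newText }) => {
    const current = files.get(path);
    if (current === undefined) return `error: ${path} not found`;
    if (!current.includes(oldText)) return `error: text not found in ${path}`;
    // Replace the first occurrence only; the callback avoids "$" patterns in newText.
    files.set(path, current.replace(oldText, () => newText));
    return `edited ${path}`;
  },
  // A real harness would spawn a shell here; this stub only echoes the command.
  bash: ({ command }) => `(stub) would run: ${command}`,
};

// The loop: ask the model, run the tools it requested, feed results back, repeat.
function runAgent(model: Model, prompt: string, maxTurns = 20): Message[] {
  const context: Message[] = [{ role: "user", content: prompt }];
  let turns = 0;
  while (turns++ < maxTurns) {
    const turn = model(context);
    context.push({ role: "assistant", content: turn.text });
    if (turn.toolCalls.length === 0) break; // no tool calls means the model is done
    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      const result = tool ? tool(call.args) : `error: unknown tool ${call.name}`;
      context.push({ role: "tool", content: result });
    }
  }
  return context;
}

// Usage with a scripted "model" that writes a file, edits it, then stops.
const scripted: ModelTurn[] = [
  { text: "writing", toolCalls: [{ name: "write", args: { path: "notes.txt", content: "hello world" } }] },
  { text: "editing", toolCalls: [{ name: "edit", args: { path: "notes.txt", oldText: "hello", newText: "bye" } }] },
  { text: "done", toolCalls: [] },
];
let step = 0;
const transcript = runAgent(() => scripted[step++], "create notes.txt, then fix the greeting");
```

The turn cap is a design choice of this sketch, not something the talk mentions; it simply keeps a misbehaving model from looping forever.
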
So instead, I give you so much rope that you can build anything that fits your specific security needs. There's also stuff that's not built in. I'm a heathen, because this is how I do it. But if you don't like that, then just ask Pi to build you sub-agent support or plan mode or MCP support, whatever you need. So extensibility comes with a bunch of table stakes and then with the extensions themselves. Extensions are simply TypeScript modules. In the simplest case, it's a TypeScript file on disk, and you point Pi at that. Here's an extension loaded as part of the harness. And with that, you basically get an extension API that lets you hook into everything and define stuff for the harness to expose to the model. That includes tools, slash commands, and shortcuts. You can listen in on any kind of event and react, and then save state in the session that's optionally provided to the agent as well, or stored there for tools that analyze sessions as part of your organizational workflows. You can do custom compaction, custom providers, and you have full control over the tools. You can modify everything in Pi. And you can then bundle all of that up and put it on npm or on GitHub, because I think we don't need to reinvent another bunch of silos called marketplaces. We already have package managers. And all of that hot reloads. So if you develop an extension for Pi, you do so in the session, and you hot reload the changes and see the effects immediately, which is great. That's also a game development thing. In game development you want very low iteration times. So, a couple of examples. Claude Code, or rather Anthropic, shipped /btw ("by the way"), which lets you talk to the agent while it goes on its main quest. I posted this little prompt on Twitter jokingly, and somebody built it in five minutes with more features. And they didn't have to fork or clone Pi. They just let the agent write the extension based on the prompt.
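
To make the extension idea concrete, here is a minimal sketch of a host that lets extensions register tools, commands, and event listeners, and that replaces an extension's registrations when it is loaded again under the same id. The API shape (`ExtensionApi`, `ExtensionHost`, `registerTool`, and so on) is invented for this illustration and should not be read as Pi's actual extension API; a real harness would also re-import the module from disk, which is omitted here.

```typescript
// Hypothetical extension API, invented for this sketch; Pi's real API may differ.
type ToolHandler = (args: Record<string, string>) => string;
type EventHandler = (payload: string) => void;

interface ExtensionApi {
  registerTool(name: string, handler: ToolHandler): void;
  registerCommand(name: string, run: () => string): void;
  on(event: string, handler: EventHandler): void;
}

// An extension is a function that receives the API and registers things.
type Extension = (api: ExtensionApi) => void;

interface Registrations {
  tools: Map<string, ToolHandler>;
  commands: Map<string, () => string>;
  listeners: Map<string, EventHandler[]>;
}

class ExtensionHost {
  private loaded = new Map<string, Registrations>();

  // Loading under an existing id discards the old registrations: the "hot reload".
  load(id: string, extension: Extension): void {
    const regs: Registrations = { tools: new Map(), commands: new Map(), listeners: new Map() };
    extension({
      registerTool: (name, handler) => { regs.tools.set(name, handler); },
      registerCommand: (name, run) => { regs.commands.set(name, run); },
      on: (event, handler) => {
        const list = regs.listeners.get(event) ?? [];
        list.push(handler);
        regs.listeners.set(event, list);
      },
    });
    this.loaded.set(id, regs);
  }

  private all(): Registrations[] {
    return Array.from(this.loaded.values());
  }

  toolNames(): string[] {
    const names: string[] = [];
    for (const regs of this.all()) names.push(...Array.from(regs.tools.keys()));
    return names;
  }

  callTool(name: string, args: Record<string, string>): string {
    for (const regs of this.all()) {
      const handler = regs.tools.get(name);
      if (handler) return handler(args);
    }
    return `error: unknown tool ${name}`;
  }

  runCommand(name: string): string {
    for (const regs of this.all()) {
      const run = regs.commands.get(name);
      if (run) return run();
    }
    return `error: unknown command ${name}`;
  }

  emit(event: string, payload: string): void {
    for (const regs of this.all()) {
      for (const handler of regs.listeners.get(event) ?? []) handler(payload);
    }
  }
}

// Usage: load an extension, then "edit" it and load it again under the same id.
const host = new ExtensionHost();
const seen: string[] = [];

const greeterV1: Extension = (api) => {
  api.registerTool("greet", ({ name }) => `hello ${name}`);
  api.registerCommand("/greeter-version", () => "v1");
  api.on("session_start", (payload) => { seen.push(payload); });
};
host.load("greeter", greeterV1);
const before = host.callTool("greet", { name: "Mario" });

const greeterV2: Extension = (api) => {
  api.registerTool("greet", ({ name }) => `hi ${name}`);
};
host.load("greeter", greeterV2);
const after = host.callTool("greet", { name: "Mario" });
```

Keeping registrations grouped per extension id is what makes the reload cheap in this sketch: nothing from the old version survives, so there is no stale command or listener to clean up by hand.
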
Here's Nico, one of the most prolific extension writers. I don't know what the fuck is going on here. It's a chat room for all of his Pi agents, and they talk with each other. I would never use this, but all of this is custom, including the UI. Or you can play NES games. Or you can play Doom. And there's a bunch of other examples I'm not going to talk about. So how do you build a Pi extension? You don't. You tell Pi to build it for you based on your specifications. And then you just iterate with it on that and hot reload during the session. I'm going to skip that example as well. And if you don't like building things yourself, and I hope you do like building things yourself, but if you don't, you can look on npm or use our little search interface on top of npm to find packages for sub-agents, MCP, and so on. So does it actually work? Well, here's the Terminal Bench leaderboard from October, before Pi had compaction. I added that for Peter's Claw thingy. It got sixth place. But none of this is actually about Pi. If anything, I basically want you to regain control of your tools and workflows. So build your own. And if you want to know more about Pi and OpenClaw, go to this talk, please. And then eventually Peter happened. He put Pi inside of OpenClaw. It's a tentacle, which meant my open source project became the target of a lot of OpenClaw instances, unbeknownst to their users. So this is act two: OSS in the age of clankers. Clankers are destroying OSS. Here's tldraw: they closed down the issue and pull request tracker. Here are OpenClaw's trackers. Here's mine. Half of that is OpenClaw instances that post garbage. So I started to rage against the clankers. If you send a pull request, it gets auto-closed with a comment that asks you to please write a nice issue in your human voice, no longer than a screen's worth of text.
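
The pull request filter described here, together with the vouching step the talk goes on to describe, can be sketched as a small policy function. This is an illustration under assumptions, not the automation actually running on the Pi repository: the file format (one account name per line, `#` for comments), the comment wording, and all names are hypothetical.

```typescript
// Hypothetical sketch of the auto-close-and-vouch policy; names and file format are assumptions.
type PullRequest = { author: string; title: string };
type Decision = { action: "allow" } | { action: "close"; comment: string };

const CLOSE_COMMENT =
  "Please open a short issue in your own words first (no more than a screen of text).";

// Assumed file format: one account name per line, blank lines and "#" comments ignored.
function parseVouchFile(text: string): Set<string> {
  const vouched = new Set<string>();
  for (const line of text.split("\n")) {
    const name = line.trim().toLowerCase();
    if (name !== "" && !name.startsWith("#")) vouched.add(name);
  }
  return vouched;
}

// Unknown authors get auto-closed with the comment; vouched authors are let through.
function triage(pr: PullRequest, vouched: ReadonlySet<string>): Decision {
  if (vouched.has(pr.author.toLowerCase())) return { action: "allow" };
  return { action: "close", comment: CLOSE_COMMENT };
}

// Called after the maintainer has approved a human-written issue from this author.
function vouch(vouched: Set<string>, author: string): void {
  vouched.add(author.toLowerCase());
}

// Usage
const vouchedAuthors = parseVouchFile("alice\n# approved after a good issue\n\nBob\n");
const firstAttempt = triage({ author: "newcomer", title: "Fix typo" }, vouchedAuthors);
vouch(vouchedAuthors, "newcomer");
const secondAttempt = triage({ author: "newcomer", title: "Fix typo" }, vouchedAuthors);
```

The filter works only because of the behavior the talk points out: an automated account never comes back to read the closing comment, so it never gets vouched.
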
And if I see that, I write "looks good to me", and your account name gets put in a file in the repository, and the next time you send a pull request, it's let through. Clankers don't read that comment. They don't go back once they've posted a pull request. So that's a perfect filter. Mitchell eventually turned that into Vouch. Here's a clanker. I also label them. If you've had interactions with OpenClaw, your issues get deprioritized. I also built tools where I embed issue and pull request texts into 3D space, so I see clusters of issues. I also invented OSS vacation. I just close the tracker whenever I want. So I have my life back. So does this work? Yes, sort of. Which leads me to act three: slow the fuck down. Everything's broken. And then there are people who say, our product has been 100% built by agents. Yes, we know. It fucking sucks now. Congratulations. And I'm hearing this from my peers, and this is entirely unhealthy. So here's how we should not work with agents and why, at least in my opinion. I wrote this on my blog a while ago. But the basic gist is: we have armies of agents, and you're using Beads and you don't know that it's basically uninstallable malware. And Anthropic built a C compiler. It kind of works, but actually doesn't. And we're hoping the next generation of models will fix it. And here is Cursor building a browser. And that's also super fucking broken. But the next generation will fix it. And SaaS is dead. Software is solved in six months. And my grandma just built herself a Spotify with her OpenClaw. Come on, people. So agents are actually compounding boo-boos, which is my word for errors, with zero learning, no bottlenecks, and delayed pain. The delayed pain is for you. Here's your code base with a human, with one agent, and with ten agents. How much of the agent code can you review? Here's the same code base expressed in number of boo-boos per day. How many of those boo-boos do you think you'll find?
Then you say, oh, I have a review agent. Let me introduce you to the wonderful world of the ouroboros. It doesn't work. It catches some issues. The problem is that agents are merchants of learned complexity. Where did they learn that complexity from? From the internet. What's on the internet? All our old garbage code. There are some pearls on the internet, really well-designed systems. But 90% of the code on the internet is our old garbage. That's what the models learn from. And every decision of an agent is local, especially if the code base is so big that it doesn't fit into its context, and if you let it go wild and add abstractions everywhere that are intertwined. So that leads to a lot of abstractions and duplication. And backwards compatibility. Who has seen that in the output of their agent? It's fucking annoying. Or defense in depth. So yeah, you get enterprise-grade complexity within two weeks, with just two humans and ten agents. Congratulations. And then you say, but my detailed spec! Yes, sure. You know what we call a sufficiently detailed spec? A program. So if you leave blanks in your spec, what do you think happens? How does the model fill in the blanks, and with what? It fills them in with the garbage that it learned on the internet from our old code, which is garbage or mediocre. And then you say, but humans also! Yes, humans are horrible, fallible beings, but they can learn. And they are bottlenecks. There are only so many boo-boos they can add to your code base on a daily basis. And humans feel pain, which is a very interesting property, because humans hate pain. And once there's too much pain, the human has a bunch of options. They can quit their job. They can blame someone else and make them fix it. Or everybody bands together and starts refactoring the shit out of the garbage code base. Agents will happily keep shitting into your code base. And no, your AGENTS.md and super complex memory systems will not save you.
Agents don't learn the way we learn. Then there are my most beloved people: "I don't even read the code anymore." Congratulations. Something is broken and your users are screaming. So who are you going to call? Not yourself, because you haven't read the code. So you're relying on your agents, but they are now also overwhelmed, because the code base is so humongous that there's absolutely zero chance they can get all the context they need to fix the issues. And long context windows are a hack, as most of you will find out this year as everybody's switching to one-million-token context windows. And agentic search is also failing. So the agent patches locally and fucks shit up globally. If you see this in your code base, you're fucked. You can't trust your code base anymore, and also not your tests, because your agents wrote your tests, so good game. So here's how I think we should work. There's a bunch of properties of good agent tasks. That means scope. If you can scope a task in such a way that the agent is guaranteed to find all the things it needs to do a good job, you're done. That means modularizing your code base. If you can give it a function to evaluate how well it did the job, even better: hill climbing, auto-research. Anything non-mission-critical, let it vibe. Boring stuff, let it vibe. Reproduction cases for user issues, which usually contain only partial information: perfect. I don't spend any time doing that anymore. Or if you don't have a human near you: rubber duck. So there are lots of tasks you can use them for and save time. At the end of that, you evaluate. You take what's reasonable, most of it isn't, and then finalize. My final slide, more or less. Slow the fuck down. Figure out what you're building and why, and don't just build because your agent can do it now. That's stupid. Learn to say no. This is your most valuable capability at the moment. Fewer features, but the ones that matter, and then use your agents to polish the shit out of them.
And delight your users, not your token-maxing desires. Cap the amount of generated code that you need to review. Non-critical code? Sure, vibe-slop ahead. Critical code? Read every fucking line. See the keynote after me for more info on that. So how do you know what's critical? Any guesses? Well, you read the fucking code. If you do anything important, write it by hand. You can use a clanker to help you with that, but don't let it make the decisions for you, because as we've learned, all the decisions it makes are learned from the internet. And that friction is the thing that builds the understanding of the system in your head, which is important. It's also where you learn new things. And all of this requires discipline and agency, and all of this still requires humans. Thank you.
