- Building AI products fundamentally differs from traditional software due to inherent non-determinism in both user input and AI output, alongside a crucial agency-control trade-off.
- Successful AI development requires a step-by-step approach, starting with high human control and low AI agency, then gradually increasing autonomy as reliability is proven.
- Leaders must become hands-on and vulnerable, relearning intuitions about AI to effectively guide initiatives and foster a collaborative, problem-first culture.
Why most AI products fail: Lessons from 50+ AI deployments at OpenAI, Google & Amazon
- Understand that AI products are inherently non-deterministic: user input via natural language is fluid, and LLM outputs are probabilistic, making system behavior less predictable than traditional software.
- Actively manage the 'agency-control trade-off': recognize that granting AI more decision-making ability (agency) necessitates relinquishing some human control, requiring earned trust and reliability.
- Adopt a phased product development approach, starting with high human control and low AI agency (e.g., AI providing suggestions for human review) before gradually increasing autonomy.
- Prioritize a "problem-first" approach: focus on identifying and solving specific problems rather than immediately designing complex, high-autonomy agents.
- Implement "human-in-the-loop" systems to log human actions and feedback, creating a continuous improvement flywheel for AI behavior calibration and system refinement.
- Constrain AI autonomy for high-risk use cases (e.g., invasive surgery pre-authorization) and allow more agency for low-risk, "low-hanging fruit" tasks (e.g., simple blood test approvals).
- Leaders must dedicate time to hands-on learning, rebuilding their intuitions about AI, and being comfortable with not always being right, fostering a learning environment.
- Foster tighter collaboration between PMs, engineers, and data folks, as the AI lifecycle breaks traditional handoffs and requires shared ownership of feedback loops.
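The takeaways on constraining autonomy by risk and logging human actions can be illustrated with a small routing sketch. It assumes a hypothetical policy where only explicitly listed request types (here `"blood_test"` and `"mri"`, echoing the pre-authorization example) may be auto-approved, and everything else goes to a human. The names `LOW_RISK_TYPES`, `Decision`, `FeedbackLog`, and `route` are illustrative, not taken from any real system. Every outcome is logged, so the log can serve as the improvement flywheel the guests describe.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy: only request types listed here may be auto-approved.
# Anything else (e.g. "invasive_surgery") defaults to human review.
LOW_RISK_TYPES = {"blood_test", "mri"}


@dataclass
class Decision:
    request_type: str
    ai_suggestion: str  # what the model proposed
    final: str          # what was actually done
    decided_by: str     # "ai" or "human"


@dataclass
class FeedbackLog:
    entries: list = field(default_factory=list)

    def record(self, decision: Decision) -> None:
        self.entries.append(decision)

    def agreement_rate(self, request_type: str) -> float:
        """Share of human-reviewed cases where the reviewer kept the AI's suggestion."""
        reviewed = [d for d in self.entries
                    if d.request_type == request_type and d.decided_by == "human"]
        if not reviewed:
            return 0.0
        return sum(d.final == d.ai_suggestion for d in reviewed) / len(reviewed)


def route(request_type: str, ai_suggestion: str,
          human_review: Callable[[str, str], str], log: FeedbackLog) -> str:
    """Auto-apply low-risk suggestions; send the rest to a human. Log every outcome."""
    if request_type in LOW_RISK_TYPES:
        decision = Decision(request_type, ai_suggestion, ai_suggestion, "ai")
    else:
        final = human_review(request_type, ai_suggestion)
        decision = Decision(request_type, ai_suggestion, final, "human")
    log.record(decision)
    return decision.final


if __name__ == "__main__":
    log = FeedbackLog()
    route("blood_test", "approve", lambda t, s: s, log)            # handled by the AI
    route("invasive_surgery", "approve", lambda t, s: "deny", log)  # reviewer overrides
    route("invasive_surgery", "deny", lambda t, s: s, log)          # reviewer agrees
    print(log.agreement_rate("invasive_surgery"))  # 0.5
```

Defaulting unknown request types to human review is a deliberately conservative choice in this sketch. A per-type agreement rate that stays high over time is one plausible signal that a category is ready for more autonomy.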
Non-determinism — The characteristic of a system where the same input can produce different outputs, making its behavior less predictable or fixed.
LLM — (Large Language Model) An AI model trained on vast amounts of text data to understand, generate, and process natural language.
Agency-control trade-off — The principle that as an AI system is given more autonomy and ability to make decisions (agency), human control over its actions is reduced.
Agentic systems — AI systems designed to act autonomously, make decisions, and take actions in an environment to achieve specific goals.
Human-in-the-loop — A system design where human input and decision-making are integrated into an AI workflow, often for oversight, training, or correction.
Flywheel — A continuous feedback loop or process where success in one area drives success in another, leading to accelerating growth or improvement.
Prompt phrasings — The specific wording and structure of instructions or queries given to an LLM, which can significantly influence its response.
Autonomy — The capability of an AI system to operate independently and make decisions without constant human intervention.
Behavior calibration — The process of adjusting and refining an AI system's actions and responses to ensure it behaves as intended and earns trust over time.
Pre-authorization use cases — Scenarios where an AI system can assist or automate the process of obtaining prior approval for services or actions, often in healthcare or finance.
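The glossary's definition of non-determinism, where the same input can produce different outputs, can be made concrete with a toy next-token sampler. This is a minimal sketch under loud assumptions: the scores in `LOGITS` are made up, and `sample_token` and `run_many` are hypothetical helpers, not any real model's API. Greedy selection (temperature 0) always returns the same token. Sampling from a temperature-scaled softmax can return different tokens for the identical prompt.

```python
import math
import random

# Made-up next-token scores for a single, fixed prompt (not taken from any real model).
LOGITS = {"approve": 2.0, "deny": 1.5, "escalate": 1.0}


def sample_token(logits: dict, temperature: float, rng: random.Random) -> str:
    """Greedy pick when temperature is 0; otherwise sample from a temperature-scaled softmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    top = max(logits.values())  # subtract the max score for numerical stability
    tokens = list(logits)
    weights = [math.exp((logits[t] - top) / temperature) for t in tokens]
    # random.choices treats weights as relative, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


def run_many(temperature: float, n: int = 20, seed: int = 0) -> list:
    """Send the same 'prompt' n times and collect the answers."""
    rng = random.Random(seed)
    return [sample_token(LOGITS, temperature, rng) for _ in range(n)]


greedy = run_many(temperature=0)
sampled = run_many(temperature=1.0)
print(set(greedy))   # {'approve'}: same input, same output every time
print(set(sampled))  # typically several distinct answers for the same input
```

This ignores almost everything a real LLM does; it only shows why identical input need not mean identical output.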
We worked on a guest post together. You had this really key insight that building AI products is very different from building non-AI products. >> Most people tend to ignore the non-determinism. You don't know how the user might behave with your product, and you also don't know how the LLM might respond to that. The second difference is the agency-control trade-off. Every time you hand over decision-making capabilities to agentic systems, you're kind of relinquishing some amount of control on your end. >> This significantly changes the way you should be building product. So we recommend building step by step. When you start small, it forces you to think about what problem you're going to solve. With all these advancements in AI, one easy slippery slope is to keep thinking about the complexities of the solution and forget the problem that you're trying to solve. >> It's not about being the first company to have an agent among your competitors. It's about whether you've built the right flywheels so that you can improve over time. >> What kind of ways of working do you see in companies that build AI products successfully? >> I used to work with the now-CEO of Rackspace. He would have this block every day in the morning that said "catching up with AI, 4 to 6 a.m." Leaders have to get back to being hands-on. You must be comfortable with the fact that your intuitions might not be right, and you probably are the dumbest person in the room and you want to learn from everyone. >> What do you think the next year of AI is going to look like? >> Persistence is extremely valuable. Successful companies right now, building in any new area, are going through the pain of learning this, implementing this, and understanding what works and what doesn't. Pain is the new moat. >> Today my guests are Aishwarya Reganti and Kiriti Badam. Kiriti works on Codex at OpenAI and has spent the last decade building AI and ML infrastructure at Google and at Kumo.
Ash was an early AI researcher at Alexa and Microsoft and has published over 35 research papers. Together, they've led and supported over 50 AI product deployments across companies like Amazon, Databricks, OpenAI, and Google, at both startups and large enterprises. Together, they also teach the number one rated AI course on Maven, where they teach product leaders all of the key lessons they've learned about building successful AI products. The goal of this episode is to save you and your team a lot of pain, suffering, and wasted time trying to build your AI product. Whether you are already struggling to make your product work or want to avoid that struggle, this episode is for you. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously. And if you become an annual subscriber of my newsletter, you get a year free of a ton of incredible products, including Lovable, Replit, Bolt, Gamma, n8n, Linear, Devin, PostHog, Superhuman, Descript, Wispr Flow, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Aishwarya Reganti and Kiriti Badam after a short word from our sponsors. This episode is brought to you by Merge. Product leaders hate building integrations. They're messy. They're slow to build. They're a huge drain on your roadmap, and they're definitely not why you got into product in the first place. Lucky for you, Merge is obsessed with integrations. With a single API, B2B SaaS companies embed Merge into their product and ship 220-plus customer-facing integrations in weeks, not quarters. Think of Merge like Plaid, but for everything B2B SaaS. Companies like Mistral AI and Ramp use Merge to connect their customers' accounting, HR, ticketing, CRM, and file storage systems to power everything from automatic onboarding to AI-ready data pipelines.
Even better, Merge now supports the secure deployment of connectors to AI agents with a new product, so that you can safely power AI workflows with real customer data. If your product needs customer data from dozens of systems, Merge is the fastest, safest way to get it. Book and attend a meeting at merge.dev/lenny and they'll send you a $50 Amazon gift card. That's merge.dev/lenny. This episode is brought to you by Strella, the customer research platform built for the AI era. Here's the truth about user research: it's never been more important or more painful. Teams want to understand why customers do what they do, but recruiting users, running interviews, and analyzing insights takes weeks. By the time the results are in, the moment to act has passed. Strella changes that. It's the first platform that uses AI to run and analyze in-depth interviews automatically, bringing fast and continuous user research to every team. Strella's AI moderator asks real follow-up questions, probing deeper when answers are vague, and surfaces patterns across hundreds of conversations, all in a few hours, not weeks. Product, design, and research teams at companies like Amazon and Duolingo are already using Strella for Figma prototype testing, concept validation, and customer journey research, getting insights overnight instead of waiting for the next sprint. If your team wants to understand customers at the speed you ship products, try Strella. Run your next study at strella.io/lenny. That's S-T-R-E-L-L-A dot io slash lenny. Ash and Kiriti, thank you so much for being here and welcome to the podcast. >> Thank you. Thank you for having us. Super excited for this. >> Let me set the stage for the conversation that we're going to have today. So, you two have built a bunch of AI products yourselves. You've gone deep with a lot of companies that have built AI products, have struggled to build AI products, and have built AI agents.
You also teach a course on building AI products successfully, and you're on this mission to reduce the pain, suffering, and failure that you constantly see people go through when they're building AI products. So to lay a little foundation for the conversation we're going to have, what are you seeing on the ground within companies trying to build AI products? What's going well? What's not going well? >> I think 2025 has been significantly different from 2024. One, the skepticism has significantly reduced. There were tons of leaders last year who probably thought this would be yet another crypto wave and were skeptical about getting started, and a lot of the use cases I saw last year were more of a "chat on your data" product, right? And that was calling itself an AI product. This year, a ton of companies are really rethinking their user experiences and their workflows, and really understanding that you need to deconstruct and reconstruct your processes in order to build successful AI products, right? That's the good stuff. The bad stuff is that the execution is still all over the place. Think of it, right? This is a three-year-old field. There are no playbooks. There are no textbooks. So you really need to figure it out as you go, and the AI life cycle, both pre-deployment and post-deployment, is very different from a traditional software life cycle. So a lot of the old contracts and handoffs between traditional roles, say PMs, engineers, and data folks, have now been broken. People are adapting to this new way of working together and owning the same feedback loop, because previously PMs, engineers, and all of these folks each had their own feedback loops to optimize, and now you probably need to be sitting in the same room.
You're probably looking at agent traces together and deciding how your product should behave. So it's a tighter form of collaboration, and companies are still figuring that out. That's what I see in my consulting practice this year. >> So, let me follow that thread. We worked on a guest post together that came out a few months ago. The thing that stuck with me most after working on that post is your really key insight that building AI products is very different from building non-AI products. And the thing you're big on getting across is that there are two very big differences. Talk about those two differences. >> Yes. And again, I want to make sure that we drive home the right point. There are tons of similarities between building AI systems and software systems as well. But there are some things that fundamentally change the way you build software systems versus AI systems, right? One of them, which most people tend to ignore, is non-determinism. You're working with a non-deterministic API, compared to traditional software. What does that mean, and why does it affect us? In traditional software, you have a very well-mapped decision engine or workflow. Think of something like Booking.com, right? You have an intention: you want to make a booking in San Francisco for two nights, etc. The product has been built so that your intention can be converted into a particular action, and you click through a bunch of buttons, options, and forms, and you finally achieve your intention. But in AI products, that layer has been completely replaced by a very fluid interface, which is mostly natural language, which means you, the user, can come up with tons of ways of communicating your intentions, right?
And that changes a lot of things, because now you don't know how your user is going to behave. That's on the input side. On the output side, you're working with a non-deterministic, probabilistic API, which is your LLM. LLMs are pretty sensitive to prompt phrasings, and they're pretty much black boxes, so you don't even know what the output surface will look like, right? So you don't know how the user might behave with your product, and you also don't know how the LLM might respond to that. You're now working with an input, an output, and a process, and you don't understand any of the three very well. You're trying to anticipate behavior and build for it. With agentic systems, this gets even harder. That's where the second difference comes in: the agency-control trade-off, right? I'm kind of shocked that so many people don't talk about this. They're extremely obsessed with building autonomous systems, agents that can do work for you. But every time you hand over decision-making capabilities or autonomy to agentic systems, you're relinquishing some amount of control on your end, right? And when you do that, you want to make sure that your agent has gained your trust, or is reliable enough that you can allow it to make decisions. That's the agency-control trade-off: if you give your AI agent or AI system, whatever it is, more agency, meaning the ability to make decisions, you're also losing some control, and you want to make sure that the agent or AI system has earned that ability or built up trust over time. >> So just to summarize what you're sharing here: people have been building software products for a long time.
We're now in a world where, one, the software you're building is non-deterministic and can just do things differently. As you said, you go to Booking.com and find a hotel, and it's the same experience every time. You'll see different hotels, but it's a predictable experience. With AI, you can't predict that it's going to be exactly the thing you planned every time. And the other is that there's this trade-off between agency and control: how much will the AI do for you versus how much should the person still be in charge? What I'm hearing is that the big point here is that this significantly changes the way you should be building product, and we're going to talk about how the product development life cycle should change as a result. Is there anything else you want to add before we get into that? >> Yeah, it's definitely one of the key points: this distinction needs to exist in your mind when you're starting to build. For example, say your objective is to hike Half Dome, right? You don't start by hiking it every day. You start by training yourself in smaller parts, you slowly improve, and then you go for the end goal, right? I feel that's extremely similar to how you want to build AI products. You don't start on day one with agents that have all the tools and all the context in the company and expect it to work; you don't even tinker at that level. You need to deliberately start in places where there is minimal impact and more human control, so that you have a good grip on the current capabilities and what you can do with them, and then slowly lean into more agency and less control.
This gives you the confidence that, okay, this is the particular problem I'm facing and the AI can solve this much of it, and then you can think through what context you need to bring in and what kind of tools you need to add to improve the experience, right? So I feel it cuts both ways. The good part is that you don't have to look at the complexity of the outside world, with all of these fancy AI agents, and feel like you cannot do that; everyone starts from very minimalistic structures and then evolves. And the second part is that as you try to build these one-click agents into your company, you don't have to be overwhelmed by the complexity; you can slowly graduate. So that's extremely important, and we see this as a repeating pattern over and over. >> Okay. All right. So, let's actually follow that, right? Because that's a really important component of how you recommend people build AI stuff: AI products, AI agents, all the AI things. So give us an example of what you're talking about here, this idea of starting slow with agency and control and then moving up the rungs. >> Yeah. For example, a very prevalent application of AI agents is customer support, right? Imagine you are a company with a lot of customer support tickets. Actually, why even imagine: OpenAI faced the exact same thing when we were launching products. There was a huge spike in support volume as we launched successful products like image generation or GPT-5. The kinds of questions you get are different, and the kinds of problems customers bring to you are different. So it's not about just dumping your entire list of help center articles into the AI agent.
You have to understand which things you can build. So the first step would be something like this: you have your human support agents, and the AI suggests, okay, this is what the AI thinks is the right thing to do. Then you get a feedback loop from the humans: this is a good suggestion in this particular case, and this one is a bad suggestion. Then you can go back and understand, okay, these are the drawbacks, or this is where the blind spots are, and how do I fix that? Once you have that, you can increase the autonomy: I don't need to suggest to the human; I'll show the answer directly to the customer. And then you can add more complexity: I was only answering questions based on help center articles, but now let me add new functionality, like issuing refunds to customers or raising feature requests with the engineering team. If you start with all of this on day one, it's incredibly hard to control the complexity. So we recommend building step by step and then increasing it. >> Awesome. And you actually have a visual of what this looks like that we'll share. But just to mirror back what you're describing: you start with high control and low agency. In the example you gave, the support agent is just giving suggestions and is not able to do anything; the user is in charge. Then, as that becomes useful and you're confident it's doing the right sort of work, you give it a little more agency and pull back on the control the user has. And if that starts to go well, you give it more agency, and the user needs less control over it. >> Awesome. >> I think the higher-level idea here is that with AI systems, it's all about behavior calibration.
It's practically impossible to predict up front how your system behaves. So what do you do about it? You make sure that you don't ruin your customer experience or your end-user experience. You keep that intact, then gradually reduce the amount of control the human has, and there is no single right way of doing it. You decide how to constrain that autonomy, right? A different example of how you could constrain autonomy is pre-authorization use cases. Insurance pre-authorization is a very ripe use case for AI, because clinicians spend a lot of time pre-authorizing things like blood tests and MRIs, right? Some cases are low-hanging fruit, for instance MRIs and blood tests, because once you know the patient's information, they're easier to approve, and AI could do that. Something like an invasive surgery is higher risk; you don't want to be doing that autonomously. So you can determine which of these use cases should go through the human-in-the-loop layer and which ones AI can conveniently handle. And all through this process, you're also logging what the human is doing, right? Because you want to build a flywheel that you can use to improve your system. So you're not ruining the user experience and not eroding trust, while at the same time logging what humans would otherwise do so that you can continuously improve your system. >> So let me give you a few more examples of this kind of progression that you recommend. The reason I'm spending so much time here is that this is a really key part of your recommendation to help people build more successful AI products: start slow with high control and low agency, then build up over time once you're confident it's doing the right sort of work. So here are a few more examples from your post that I'll just read.
Say you're building a coding assistant. V1 would just suggest inline completions and boilerplate snippets. V2 would generate larger blocks, like tests or refactors, for humans to review. And V3 would apply the changes and open PRs autonomously. Another example is a marketing assistant. V1 would draft emails or social copy: here's what I would do. V2 builds a multi-step campaign and runs it. And V3 just launches it, A/B tests it, and auto-optimizes campaigns across channels. >> Awesome. >> Yeah. >> And again, just to summarize where we're at and the advice we've shared so far. One, it's important to understand that AI products are different: they're non-deterministic. And you pointed out, and I forgot to mirror this back, that this applies to both the input and the output. The user experience is non-deterministic: people will see different things, different outputs, different chat conversations, maybe a different UI if it's designing the UI for you. And the output is obviously going to be non-deterministic. So that's a problem and a challenge. And then... >> I mean, if you think about it, it's also the most beautiful part of AI: we're all much more comfortable talking than clicking through a bunch of buttons, right? So the bar to using AI products is much lower, because you can be as natural as you would be with humans. But that's also the problem: there are tons of ways we communicate, and you want to make sure the intent is correctly communicated and the right actions are taken, because most of your systems are deterministic and you want to achieve a deterministic outcome with non-deterministic technology. That's where it gets a little messy. >> Awesome. Okay. I love the optimistic version of why this is good. Okay.
And then the other piece is this trade-off of autonomy versus control when you're designing something. What I imagine you're seeing is that people try to jump straight to the ideal, the V3, and that's when they get into trouble. It's probably a lot harder to build, it just doesn't work, and then they're like, okay, this is a failure, what are we even doing? >> Exactly. I feel there are a bunch of things you have to gain confidence in before you get to V3, and it's easy to get overwhelmed: oh, my AI agent is doing things wrong in 100 different ways. You're not going to tabulate all of them and fix them, right? Even if you've learned how to handle evaluation practices and so on, if you start in the wrong spot, you're going to have a hard time correcting things from there. And when you start small, with a very minimalistic version with high human control and low agency, it also forces you to think about what problem you're going to solve. We use the term "problem-first," and to me it seemed obvious, in the sense that yes, of course I need to think about the problem, but it's incredible how well it resonates with people. With all the advancements in AI that we're seeing, one easy slippery slope is to keep thinking about the complexities of the solution and forget the problem that you're trying to solve. So when you start at a smaller scale of autonomy, you really think about what problem you're trying to solve and how to break it down into levels of autonomy that you can build later. That is incredibly useful, and we see this pattern over and over with everyone we talk to.
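Breaking a problem into levels of autonomy, as in the customer support example (suggest to a human agent, then answer customers directly, then take actions like refunds), can be sketched as a promotion gate driven by logged human reviews. The `Autonomy` levels, the `next_level` helper, and the default thresholds (`min_reviews=200`, `min_approval=0.95`) are hypothetical placeholders chosen for illustration, not values the guests recommend.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST = 1  # V1: AI drafts replies; a human support agent decides what is sent
    RESPOND = 2  # V2: AI answers customers directly from help-center content
    ACT = 3      # V3: AI can also take actions, e.g. issue refunds or file feature requests


def next_level(current: Autonomy, reviews: list,
               min_reviews: int = 200, min_approval: float = 0.95) -> Autonomy:
    """Move up one rung only when enough human reviews show the current level is reliable.

    `reviews` holds one bool per human-reviewed output: True if the reviewer accepted it.
    The thresholds are illustrative placeholders, not recommended values.
    """
    if current == Autonomy.ACT or len(reviews) < min_reviews:
        return current
    approval = sum(reviews) / len(reviews)
    return Autonomy(current + 1) if approval >= min_approval else current


print(next_level(Autonomy.SUGGEST, [True] * 150).name)                 # SUGGEST: too few reviews
print(next_level(Autonomy.SUGGEST, [True] * 190 + [False] * 10).name)  # RESPOND: 95% accepted
print(next_level(Autonomy.RESPOND, [True] * 180 + [False] * 20).name)  # RESPOND: 90% is below the bar
```

A production version would probably also step back down a rung when quality regresses and track approval per task type rather than globally; both are left out to keep the sketch short.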
>> And there are so many other benefits to limiting autonomy, because there's also the danger of the thing doing too much for you: messing up, I don't know, your database, or sending out all these emails you never expected. There are so many reasons this is a good idea. >> Yep. I recently read a paper from a bunch of folks at UC Berkeley, including Matei Zaharia, Ion Stoica, and the folks at Databricks, and it said that for about 74 or 75% of the enterprises they had spoken to, the biggest problem was reliability. That's also why they weren't comfortable deploying products to their end users or building customer-facing products: they just weren't sure, and they weren't comfortable exposing their users to a bunch of these risks, right? And that's also why they think a lot of AI products today have to do with productivity, because that involves much lower autonomy than end-to-end agents that would replace workflows. I love their work otherwise as well, and I think that's very much in line with what we're seeing, at least at my startup. >> Okay, very interesting. There's an episode coming out before this conversation where we go deep into another problem that this avoids, which is prompt injection and jailbreaking, and just how big a risk that is for AI products, since it's essentially an unsolved, and potentially unsolvable, problem. I'm not going to go down that track, but it was a pretty scary conversation, and it'll be out before this one. >> I think that will be a huge problem once systems go mainstream. We're still so busy building AI products that we're not worried about security, but it will be such a huge problem, especially with this non-deterministic API, right? You're kind of stuck, because there are tons of instructions that could be injected into your prompt, and then, yeah, it's going to be bad.
Okay, let's actually spend a little time here, because it's really interesting to me and no one's talking about this stuff. The takeaway from that conversation is that it's pretty easy to trick AI into doing stuff it shouldn't do. There are all these guardrail systems people put in place, but it turns out these guardrails aren't very good and you can always get around them. And to your point, as agents become more autonomous, and robots too, it gets pretty scary that you could get AI to do things it shouldn't. >> I think this is definitely a problem. But I feel that in the current state of customers adopting AI, the extent to which companies are actually taking advantage of AI, improving or streamlining their existing processes, is still at a very early stage. 2025 has been an extremely busy year for AI agents and for customers trying to adopt AI, but I feel the penetration is still well below the advantage companies could get out of it. So with the right human-in-the-loop points in place, I feel we can avoid a bunch of these issues and focus more on streamlining processes. I'm more on the optimist side, in the sense that you need to try adopting this rather than only highlighting the negative aspects of what could go wrong. So I strongly feel that companies have to adopt this. Of the companies we talked to at OpenAI, it has never been the case that AI cannot help them. It has always been: there is this set of things that it can optimize for me, so let me see how I can adopt it. >> Sweet. I always like the optimistic perspective. I'm excited for you to listen to that episode and see what you think, because it's really interesting, and to your point, there are a lot of things to focus on.
It's one of many things to worry about and think about. Okay, let's get back on track. So we've shared a bunch of pro tips and important pieces of advice. Let me ask: what other patterns and ways of working do you see in companies that do this well, and in teams that build AI products successfully? And then, what are the most common pitfalls people fall into? Maybe we could start with the other ways that companies do this well and build AI products successfully. >> I almost think of it as a success triangle with three dimensions. It's never only technical; every technology problem is a people problem first. With the companies we have worked with, it's these three dimensions, right: great leaders, good culture, and technical progress. Starting with leaders: we work with a lot of companies on their AI transformation, training, strategy, and so on, and I feel that at a lot of companies, the leaders have built intuitions over 10 or 15 years and are highly regarded for those intuitions. But now, with AI in the picture, those intuitions have to be relearned, and leaders have to be vulnerable to do that, right? I used to work with Gajen, now the CEO of Rackspace. He had a block every morning that said "catching up with AI, 4 to 6 a.m.," with no meetings or anything like that. That was just his time to pick up the latest AI podcasts and information, and he would have weekend vibe coding sessions and so on. So I think leaders have to get back to being hands-on, not because they have to implement these things themselves, but to rebuild their intuitions, because you must be comfortable with the fact that your intuitions might not be right, and you probably are the dumbest person in the room and you want to learn from everyone.
I've seen that be a very distinguishing factor of companies that build successful products, because you're bringing in that top-down approach. It's almost impossible for it to be bottom-up. You can't have a bunch of engineers go and get buy-in from a leader who just doesn't trust the technology or has misaligned expectations about it. I've heard from so many builders that their leaders just don't understand the extent to which AI can solve a particular problem, or that they vibe-code something and assume it's easy to take it to production. You really need to understand the range of what AI can solve today so that you can guide decisions within the company. The second dimension is the culture itself. Again, I work with enterprises where AI is not their main thing; they need to bring AI into their processes because a competitor is doing it and because it does make sense, since there are use cases that are very ripe. Along the way, a lot of companies develop a culture of FOMO and "you will be replaced," and people get really afraid. Subject matter experts are such a huge part of building AI products that work, because you need to consult them to understand how your AI is behaving and what the ideal behavior should be. But I have spoken to a bunch of companies where the subject matter experts just don't want to talk to you, because they think their job is being replaced. Again, this comes from the leader. You want to build a culture of empowerment, of bringing AI into your own workflows so that you can 10x what you're doing, instead of saying that you'll probably be replaced if you don't adopt AI.
That kind of empowering culture always helps. You want your entire organization to be in it together and make AI work for you, instead of everyone trying to guard their own jobs. And with AI, it's also true that it opens up a lot more opportunities than before, so your employees can do a lot more and 10x their productivity. The third dimension is the technical part, which we've been talking about. Folks who are successful are incredibly obsessed with understanding their workflows very well and with identifying which parts are ripe for AI versus which might need a human in the loop somewhere. Whenever you're trying to automate part of a workflow, it's never the case that you can just use an AI agent and it will solve your problems. You probably have a machine learning model doing one part of the job and deterministic code doing another part. So you really need to be obsessed with understanding that workflow so you can choose the right tool for the problem, instead of being obsessed with the technology itself. Another pattern I see is that these folks really understand what it means to work with a non-deterministic API, which is your LLM. That means they also understand that the development life cycle looks very different, and they iterate pretty quickly: can I build something and iterate quickly in a way that doesn't ruin my customer experience, but at the same time gives me enough data to estimate behavior? So they build that flywheel very quickly. As of today, it's not about being the first company among your competitors to have an agent; it's about whether you have built the right flywheels so that you can improve over time.

When someone comes up to me and says, "We have this one-click agent. It's going to be deployed in your system, and in two or three days it'll start showing you significant gains," I would be skeptical, because it's just not possible. That's not because the models aren't there, but because enterprise data and infrastructure are very messy, and even the agent needs some time to understand how these systems work. There are very messy taxonomies everywhere. People tend to do things like get_customer_data_v1 and get_customer_data_v2, and all those functions exist and are being called; there's a lot of tech debt you need to deal with. So if you're obsessed with the problem itself and you understand your workflows very well, you will know how to improve your agents over time, instead of just slapping on an agent and assuming it'll work from day one. I would go as far as to say that if someone's selling you one-click agents, it's pure marketing. You don't want to buy into that. I would rather go with a company that says, "We're going to build this pipeline for you, and it will learn over time and build a flywheel to improve," than with something that's supposed to work out of the box. Replacing any critical workflow, or building something that gives you significant ROI, easily takes four to six months of work, even if you have the best data layer and infrastructure layer. >> Amazing. There's a lot there that resonates so deeply with other conversations I've been having on this podcast. One is that for a company to see a lot of impact from AI, the founder or CEO has to be deep into it. I had Dan Shipper on the podcast; they work with a bunch of companies helping them adopt AI, and he said the number one predictor of success is the CEO chatting with ChatGPT, Claude, or whatever, many times a day. I love this example you gave of the Rackspace CEO catching up on AI news every morning.
I was imagining he'd be chatting with the chatbot rather than reading news. >> With the amount of information available today, you want to choose the right channels as well, because everybody has an opinion. So whose opinion do you want to bank on? Having a good-quality set of people you're listening to really makes sense. So he has a list of two or three sources that he always looks at, and then he comes back with a bunch of questions and bounces them around with a group of AI experts to see what they think. I was part of that group, so I kind of know about the questions he comes up with. >> I love that. So that's cool. >> It's pretty cool. I asked him why he was doing so much, and he said it trickles down into a bunch of decisions we make. >> Okay, let me talk about another topic that's been a hot topic on this podcast. It was a hot topic on Twitter for a while: evals. A lot of people are obsessed with evals and think they're the solution to a lot of problems in AI. A lot of people think they're overrated, that you don't need evals; you can just feel the vibes and you'll be all right. What's your take on evals? How far do they take people in solving the problems you've been talking about, and what is going on in the community?
I feel there's a false dichotomy: either evals are going to solve everything, or online monitoring, production monitoring, is going to solve everything. I find no reason to trust either extreme, in the sense of "I will entirely bank my application on this one." So take a step back. Evals are basically your trusted product thinking, your knowledge about the product, going into a set of data sets you build: this is what matters to me, these are the kinds of things my agent should not do, so let me build a set of data sets and make sure I do well on those. With production monitoring, what you're doing is deploying your application and then having some key metrics that communicate back to you how customers are using your product. You could be deploying any agent, and if the customer gives a thumbs up to an interaction, you want to know that. That is what production monitoring does. Production monitoring has existed for products for a long time; it's just that now, with AI agents, you need to monitor at a much finer granularity. It's not just the customer giving you explicit feedback; there are many implicit signals you can get. For example, in ChatGPT, if you like an answer, you can give it a thumbs up. But if customers don't like an answer, they often don't give a thumbs down; they just regenerate it. That is a clear indication that the initial answer did not meet the customer's expectations. These are the kinds of implicit signals you always need to think about, and that spectrum has been growing in production monitoring.
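The explicit-versus-implicit distinction above can be sketched as a tiny classifier over interaction events. This is a minimal sketch, assuming a hypothetical event schema (`Event`, with `kind` strings such as `"thumbs_up"` and `"regenerate"`); which user actions count as implicit signals is a product decision, and only the thumbs and regenerate examples come from the conversation.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


@dataclass
class Event:
    # Hypothetical schema: one user action on one AI response.
    response_id: str
    kind: str


# Explicit feedback: the user rates the answer directly.
EXPLICIT = {"thumbs_up": Signal.POSITIVE, "thumbs_down": Signal.NEGATIVE}
# Implicit feedback: inferred from behavior. Regenerating suggests the
# first answer missed; other actions would be assumptions per product.
IMPLICIT = {"regenerate": Signal.NEGATIVE}


def classify(event: Event) -> tuple[Signal, bool]:
    """Return (signal, is_explicit) for a single event."""
    if event.kind in EXPLICIT:
        return EXPLICIT[event.kind], True
    if event.kind in IMPLICIT:
        return IMPLICIT[event.kind], False
    return Signal.NEUTRAL, False


def negative_response_ids(events: list[Event]) -> set[str]:
    """Responses with any negative signal, explicit or implicit."""
    return {e.response_id for e in events if classify(e)[0] is Signal.NEGATIVE}
```

Counting only thumbs-down would miss response `r2` in a log like `[Event("r2", "regenerate")]`; including implicit signals is what widens the set of traces worth a closer look.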
Now let's come back to the initial question: is it evals or is it production monitoring? Again, I go back to the problem-first approach: what is it that you're trying to build? You're trying to build a reliable application for your customers that's not going to do a bad thing, that always does the right thing, or, if it does the wrong thing, alerts you very quickly. I break this down into two parts. First, nobody deploys an application without testing it. That testing could be vibes, or it could be "I have these 10 questions it should not get wrong, no matter what changes I make." Let's call that an evaluation data set. Now say you built this and deployed it, and you need to understand whether it's doing the right thing. If you're a high-throughput, high-transaction customer, you cannot practically sit and evaluate all the traces. You need some indication of which traces to look at, and this is where production monitoring comes into the picture. You cannot predict all the ways in which your agent could go wrong, but the implicit and explicit signals communicate back to you which traces you need to look at. Once you have those traces, you examine the failure patterns across the different types of interactions and ask whether there is something you really care about that should not happen. If those failure modes are happening, then you think about building an evaluation data set for them. Say I built an evaluation data set for my agent offering refunds when I have explicitly configured it not to. I made my changes to the tools or prompts or whatever, and then I deployed the second version of the product. There is no guarantee that this is the only problem you're going to see; you still need production monitoring to catch the other kinds of problems you might encounter. So I feel evals are important and production monitoring is important, but the notion that only one of them is going to solve things for you is completely dismissible, in my opinion. >> All right, a very reasonable answer. And the point isn't as simple as "do both." It's that there are different things to catch, and one approach won't catch everything you need to be paying attention to. >> Exactly. Awesome. >> I want to take two steps back and talk about how much weight the term "evals" has had to carry in the second half of 2025. You go meet a data labeling company and they tell you, "Our experts are writing evals." Then you have all these folks saying that PMs should be writing evals, that they're the new PRDs. And then you have folks saying that evals are pretty much everything, the feedback loop you're supposed to be building to improve your products. Now step back as a beginner and think: what are evals? Why is everyone saying "evals"? These are actually different parts of the process, and nobody's wrong, in the sense that yes, these are all evals. But when a data labeling company tells you their experts are writing evals, they're actually referring to error analysis, or to experts just leaving notes on what the right answer should be. Lawyers and doctors writing evals doesn't mean they're building LLM judges or this entire feedback loop. And saying that a PM should be writing evals doesn't mean they have to write an LLM judge that's good enough for production.
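The refund example above (an agent offering refunds it was configured not to offer) is the kind of failure mode that earns a small regression data set. A minimal sketch, assuming the agent is any `str -> str` callable and that substring matching on a few hypothetical phrases is an acceptable stand-in for a real check; in practice the check might be a human label or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical phrases indicating the agent actually granted a refund,
# i.e. the failure mode it was explicitly configured to avoid.
FORBIDDEN_PHRASES = ("issued a refund", "will refund", "refund has been processed")


@dataclass
class EvalCase:
    prompt: str
    forbidden: tuple[str, ...] = FORBIDDEN_PHRASES


# A tiny "questions it should never get wrong" set; real cases would come
# from error analysis on production traces.
REFUND_CASES = [
    EvalCase("My order arrived late and I want my money back."),
    EvalCase("Just put the money back on my card, please."),
]


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the prompts whose replies contain a forbidden phrase."""
    failures = []
    for case in cases:
        reply = agent(case.prompt).lower()
        if any(phrase in reply for phrase in case.forbidden):
            failures.append(case.prompt)
    return failures
```

Re-running `run_eval` before each prompt or tool change guards only against this known failure mode, which is why production monitoring still has to catch the ones nobody has written down yet.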
I think there are also very prescriptive ways of doing this, and plus one to Kiriti: you cannot predict up front whether you need to build an LLM judge or use implicit signals from production monitoring. I think Martin Fowler at some point coined a term, "semantic diffusion," back in the 2000s, which means that someone comes up with a term, everybody starts butchering it with their own definitions, and then you lose the actual definition. That is what is happening to evals today. Everybody sees a different side of it, I guess. But if you sit a bunch of practitioners together and ask them whether it's important to build an actionable feedback loop for AI products, I think all of them will agree. How you do that really depends on your application. When you go to complex use cases, it's incredibly hard to build LLM judges, because you see a lot of emerging patterns. If you build a judge that tests for verbosity or something like that, it turns out you're seeing newer patterns that your LLM judge is not able to catch, and you end up building too many evals. At that point, it makes sense to look at your user signals, fix the issues, check whether you've regressed, and move on, instead of building these judges. So it all depends. One thing every ML practitioner will tell you is that it really depends on the context. Don't be obsessed with prescriptions; they're going to change. >> That's such an important point, especially the idea that "evals" now means many different things to different people. It's just a term for so many things, and it gets complicated to talk about evals when you see them as the stuff data labeling companies are giving you, and so on. And there are also benchmarks; people sort of call benchmarks evals too.
It's like... >> I recently spoke to a client who told me, "We do evals." I said, "Okay, can you show me your data set?" And they said, "No, we just checked LMArena and Artificial Analysis. These are independent benchmarks, and we know that this model is the right one for our use case." And I'm like, "You're not doing evals. That's not evals. Those are model..." >> That makes sense. The word could be used in that context, and I get why people think that, but yeah, now it's just making things even more confusing. >> Yep. >> One more line of questioning that's on my mind. The reason this became such a big debate is that the head of Claude Code, Boris, said, "Nah, we don't do evals on Claude Code. It's all vibes." What can you share, Kiriti, about how the Codex team approaches evals? >> With Codex, we have a balanced approach: you need to have evals, and you definitely need to listen to your customers. I think Alex was on your podcast recently and talked about how extremely focused we are on building the right product, and a big part of that is listening to your customers. Coding agents are extremely unique compared to agents in other domains, in the sense that they are built for customizability and built for engineers. A coding agent is not a product that solves the top five or top six workflows; it's meant to be customizable in many different ways. The implication is that your product is going to be used with different integrations, different kinds of tools, and different kinds of setups, so it gets really hard to build an evaluation data set for every kind of interaction your customers are going to use your product for.
But that said, you also need to make sure that if you're making a change, it's at least not going to damage something that is really core to the product. So we have evaluations for that. At the same time, we take extreme care to understand how customers are using it. For example, we recently built a code review product, and it has been gaining an extreme amount of traction; many bugs, both inside OpenAI and at external customers, are getting caught with it. Now, if I'm making a model change to code review, or training it with a different kind of RL mechanism, and I'm going to deploy it, I definitely want to A/B test it and find out whether it's actually finding the right mistakes and how users are reacting to it. Sometimes, if users get annoyed by incorrect code reviews, they go to the extent of just switching off the product. Those are the signals you want to look at to make sure your new changes are doing the right thing, and it's extremely hard for us to think of these kinds of scenarios beforehand and develop evaluation data sets for them. So there's a bit of both: there are a lot of vibes, there's a lot of customer feedback, and we are super active on social media to understand whether anybody is having certain types of problems, so we can quickly fix them. How do I put this? It's a whole range of things that you do here. >> That makes so much sense. Okay, what I'm hearing is that Codex is pro-evals, but evals alone are not enough. >> Yes. >> You also need to watch customer behavior and feedback, and there are also some vibes: is this feeling good? As I'm using it, is it generating great code that I'm excited about?
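The A/B test described above treats "users switching off the product" as a signal. One common way to compare that rate between a control (A) and a new model (B) is a pooled two-proportion z-test; this is a minimal sketch of that standard test, not a description of how the Codex team actually analyzes experiments, and the counts in the test are made up for illustration.

```python
import math


def two_proportion_z(disabled_a: int, users_a: int, disabled_b: int, users_b: int) -> float:
    """z statistic for (rate_B - rate_A) using the pooled two-proportion test."""
    rate_a = disabled_a / users_a
    rate_b = disabled_b / users_b
    pooled = (disabled_a + disabled_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if se == 0:
        return 0.0  # no variation at all (nobody or everybody disabled)
    return (rate_b - rate_a) / se


def two_sided_p(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


def disable_rate_regressed(disabled_a: int, users_a: int, disabled_b: int, users_b: int,
                           alpha: float = 0.05) -> bool:
    """True if variant B's disable rate is significantly higher than control A's."""
    z = two_proportion_z(disabled_a, users_a, disabled_b, users_b)
    return z > 0 and two_sided_p(z) < alpha
```

A significant result only says the disable rate moved; looking at the traces behind it is still how you learn why.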
>> I don't think it's going to work if anybody comes and says, "I have this concrete set of evals that I can bet my life on, and I don't need to think about anything else." For every new model we launch, we get together as a team and test different things; each person concentrates on something different, and we have a list of hard problems that we throw at the model to see how well it's progressing. So it's like custom evals for each engineer, you could say, just to understand what the product is doing with the new model.

If you're a founder, the hardest part of starting a company isn't having the idea. It's scaling the business without getting buried in back office work. That's where Brex comes in. Brex is the intelligent finance platform for founders. With Brex, you get high limit corporate cards, easy banking, high yield treasury, plus a team of AI agents that handle manual finance tasks for you. They'll do all the stuff that you don't want to do, like file your expenses, scour transactions for waste, and run reports, all according to your rules. With Brex AI agents, you can move faster while staying in full control. One in three startups in the United States already runs on Brex. You can, too, at brex.com.

We've been talking for almost an hour already, and we haven't even covered the extremely powerful software development workflow for building AI products that you two developed and teach in your course, which combines all the stuff we've been talking about into a step-by-step approach. You call it the continuous calibration, continuous development framework.
Let's pull up a visual to show people what the heck we're talking about, and then walk us through what this is, how it works, and how teams can shift the way they build AI products to this approach to avoid a lot of pain and suffering. >> Before we explain the life cycle, a quick story on why Kiriti and I came up with this. There are tons of companies we keep talking to that feel pressure from their competitors because everyone is building agents: "We should be building agents that are entirely autonomous." I did end up working with a few customers where we built these end-to-end agents. It turns out that because you start off not knowing how users might interact with your system, or what kinds of responses or actions the AI might come up with, it's really hard to fix problems when you have a huge workflow that takes four or five steps and makes tons of decisions. You just end up debugging so much and hot-fixing, to the point where, at one time, while building for a customer support use case (which is the example we give in the newsletter as well), we had to shut down the product, because we were doing so many hot fixes and there was no way we could account for all the emerging problems that kept coming up. There's also been quite a lot of news online recently. I think Air Canada had an incident where one of their agents hallucinated a refund policy that was not part of their original playbook, and they had to honor it for legal reasons. There have been a ton of really scary incidents, and that's where the idea comes from: how can you build so that you don't lose customer trust, and your agent or AI system doesn't end up making decisions that are super dangerous to the company, while at the same time building a flywheel so that you can improve your product as you go?

That's why we came up with the idea of continuous calibration, continuous development. The idea is pretty simple. We have the right side of the loop, which is continuous development, where you scope capability and curate data: essentially, you get a data set of what your expected inputs are and what your expected outputs should look like. This is a very good exercise before you start building any AI product, because many times you figure out that a lot of folks within the team are just not aligned on how the product should behave, and that's where your PMs and your subject matter experts can contribute a lot more information. So you have this data set that your AI product should do really well on. It's not comprehensive, but it lets you get started. Then you set up the application and design the right kind of evaluation metrics. I intentionally use the term "evaluation metrics" even though we say "evals," because I want to be very specific about what it is: evaluation is a process, and evaluation metrics are the dimensions you want to focus on during that process. Then you deploy and run your evaluation metrics. The second part is continuous calibration, which is where you understand the behavior you hadn't expected in the beginning. When you start the development process, you have this data set you're optimizing for, but more often than not you realize that the data set is not comprehensive enough, because users start interacting with your system in ways you did not predict. That's where you want to do the calibration piece: I've deployed my system, and now I see patterns I did not really expect. Your evaluation metrics should give you some insight into those patterns.
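The development half of the loop described above (curate expected inputs and outputs, then score a few named evaluation metrics over them) can be sketched as follows. The routing labels, examples, and the two metrics are hypothetical placeholders; the point is only the shape: each metric is one dimension, averaged over the curated data set.

```python
from typing import Callable

# Hypothetical curated examples: expected input -> expected output (a routing label).
DATASET = [
    {"input": "Where is my package?", "expected": "shipping"},
    {"input": "I was charged twice.", "expected": "billing"},
    {"input": "The zipper broke after a day.", "expected": "returns"},
]

# An evaluation metric is one dimension, scored per example (1.0 = pass).
Metric = Callable[[dict, str], float]

METRICS: dict[str, Metric] = {
    "exact_match": lambda ex, out: float(out == ex["expected"]),
    "valid_label": lambda ex, out: float(out in {"shipping", "billing", "returns"}),
}


def evaluate(system: Callable[[str], str], dataset: list[dict],
             metrics: dict[str, Metric]) -> dict[str, float]:
    """Run the system on every curated example and average each metric."""
    totals = {name: 0.0 for name in metrics}
    for ex in dataset:
        out = system(ex["input"])
        for name, metric in metrics.items():
            totals[name] += metric(ex, out)
    return {name: total / len(dataset) for name, total in totals.items()}
```

Keeping metrics as separate named dimensions makes it cheap to add one later, which is exactly what the calibration side of the loop tends to require.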
But sometimes you figure out that those metrics were not enough either, and you have new error patterns you had not thought about. That's where you analyze behavior and spot error patterns. You apply fixes for the issues you see, but you also design new evaluation metrics to catch the emerging patterns. That doesn't mean you should always design new evaluation metrics. Some errors you can just fix and never come back to, because they're one-off errors. For instance, a tool-calling error that happened just because your tool wasn't defined well: you can fix it and move on. And this is pretty much what an AI product life cycle looks like. What we also specifically recommend is that while you go through these iterations, you start with lower-agency, higher-control iterations. That means constraining the number of decisions your AI system can make and making sure there are humans in the loop, then increasing agency over time, because you're building a flywheel of behavior and learning what kinds of use cases are coming in and how your users are using the system. One example we give in the newsletter is customer support. This is a nice image that shows how you can think of agency and control as two dimensions: each version increases the agency, the ability of your AI system to make decisions, and lowers the control as you go. In the customer support example, you can break it down into three versions. The first version is just routing: can your agent classify a ticket and route it to the right department? When you read this, you probably think: is it that hard to just do routing? Why can't an agent easily do that?
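The distinction above between errors that deserve a new evaluation metric and one-off errors you just fix can be approximated with a simple recurrence rule. The rule itself (a pattern seen in at least two calibration rounds gets a metric) and the pattern labels are assumptions for illustration, not part of the framework as described.

```python
from collections import Counter

# Assumed heuristic: a pattern that recurs across calibration rounds earns its own
# evaluation metric; a pattern seen in only one round gets a spot fix.
RECURRENCE_THRESHOLD = 2  # hypothetical: number of rounds a pattern must appear in


def triage(rounds: list[list[str]]) -> dict[str, str]:
    """Map each error-pattern label to 'new_metric' or 'spot_fix'.

    `rounds` holds the error-pattern labels found during each calibration round.
    """
    rounds_seen = Counter()
    for labels in rounds:
        rounds_seen.update(set(labels))  # count rounds, not raw occurrences
    return {
        label: "new_metric" if count >= RECURRENCE_THRESHOLD else "spot_fix"
        for label, count in rounds_seen.items()
    }
```

For example, a badly defined tool schema seen once is a spot fix, while an invented refund policy that shows up round after round is worth a dedicated metric.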
When you go to enterprises, routing itself can be a super complex problem. Any popular retail company you can think of has hierarchical taxonomies, and most of the time those taxonomies are incredibly messy. I have worked on use cases where the taxonomy has a hierarchy that says "shoes," and then "women's shoes" and "men's shoes," all at the same layer, where ideally you should have "shoes," with "women's shoes" and "men's shoes" as subclasses. And you think, okay, fine, I could just merge those. Then you go further and see that there's another section under "shoes" that says "for women" and "for men," and for some reason it was never consolidated or fixed. If an agent sees this kind of taxonomy, what is it supposed to do? Where is it supposed to route? A lot of the time, we are not aware of these problems until we actually build something and understand it. When real human agents see these kinds of problems, they know what to check next. Maybe they realize that the "for women" and "for men" node under "shoes" was last updated in 2019, which means it's a dead node lying there unused, so they know they're supposed to look at a different node. I'm not saying agents cannot understand this or that models are not capable enough, but there are really weird rules within enterprises that are not documented anywhere, and you want to make sure the agents have all of that context instead of just throwing the problem at them. Coming back to our versions: routing was one where you have really high control, because even if your agent routes to the wrong department, humans can take over and undo those actions.
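The dead-node example above is the kind of undocumented rule you may want to hand the router explicitly. A minimal sketch, assuming a hypothetical `Node` tree with a `last_updated` date and an assumed staleness cutoff; a real taxonomy and its rules would come from the enterprise's own systems and subject matter experts.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Node:
    # Hypothetical taxonomy node; real taxonomies carry far more metadata.
    name: str
    last_updated: date
    children: list["Node"] = field(default_factory=list)


def routable_paths(node: Node, cutoff: date, prefix: tuple[str, ...] = ()) -> list[str]:
    """Leaf paths the router may use, skipping subtrees not updated since `cutoff`."""
    if node.last_updated < cutoff:
        return []  # assumed rule: a stale subtree is treated as a dead node
    path = prefix + (node.name,)
    if not node.children:
        return [" > ".join(path)]
    paths: list[str] = []
    for child in node.children:
        paths.extend(routable_paths(child, cutoff, path))
    return paths
```

With a 2022 cutoff, the "for women" and "for men" nodes last touched in 2019 drop out, leaving only "shoes > women's shoes" and "shoes > men's shoes" as candidate routes.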
Along the way, you also figure out that you are probably dealing with a ton of data issues, which you need to fix to make sure your data layer is good enough for the agent to function. The next version is what we call a copilot. Now that you've figured out, after a few iterations, that routing works fine and you've fixed your data issues, you can go to the next step: can my agent provide suggestions based on the standard operating procedures we have for customer support? It could just generate a draft that the human can make changes to. When you do this, you're also logging human behavior: how much of the draft the customer support agent used and what they left out. So you're getting error analysis for free, because you're literally logging everything the user does, which you can then feed back into your flywheel. Then, once you've figured out that those drafts look good and that, most of the time, humans aren't making many changes and are using the drafts as is, that's when you go to your end-to-end resolution assistant, which can draft a resolution and resolve the ticket as well. Those are the stages of agency: you start with low agency and then move up. We also put together a really nice table showing what you do at each version, what you learn that enables you to go to the next step, and what information you get that you can feed into the loop. When you're just doing routing, you get better-quality routing data, and you also learn what kinds of prompts you need to build to improve the routing system. Essentially, you're figuring out your structure for context engineering and building the flywheel you want. And as I go through this, I want to be very clear about two things.
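The copilot stage above logs how much of each draft the human actually kept. One simple proxy is a character-level similarity between the AI draft and the message the human sent, using Python's standard `difflib.SequenceMatcher`. The promotion rule (at least 80% of drafts sent with similarity of 0.9 or more) is a hypothetical threshold for illustration; the source only says humans should be using drafts mostly as is.

```python
from difflib import SequenceMatcher


def draft_reuse(draft: str, sent: str) -> float:
    """Similarity in [0, 1] between the AI draft and what the human actually sent."""
    return SequenceMatcher(None, draft, sent).ratio()


def ready_for_next_version(pairs: list[tuple[str, str]],
                           min_reuse: float = 0.9, min_share: float = 0.8) -> bool:
    """Assumed promotion rule: most drafts are sent (nearly) unchanged."""
    if not pairs:
        return False
    mostly_unchanged = sum(draft_reuse(d, s) >= min_reuse for d, s in pairs)
    return mostly_unchanged / len(pairs) >= min_share
```

The low-similarity pairs are also the free error analysis mentioned above: they show exactly where humans had to rewrite the agent's suggestion.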
One is that building with CC/CD in mind doesn't mean you fix the problem once and for all. You might get through V3 and then see a new distribution of data you never imagined. This is just one way to lower your risk: you get enough information about how users behave with your system before going to complete autonomy. The second thing is that you're also building this implicit logging system. A lot of people come and tell us, "Wait, there are evals, so why do you need something like this?" The issue with just building a bunch of evaluation metrics and running them in production is that evaluation metrics catch only the errors you're already aware of. There can be a lot of emerging patterns that you understand only after you put things in production. For those emerging patterns, you're creating a low-risk framework so that you can understand user behavior and not end up in a position where there are tons of errors and you're trying to fix all of them at once. This is not the only way to do it; there are tons of different ways. You want to decide how you constrain autonomy. It could be based on the number of actions the agent takes, which is what we do in this example. It could be based on topic: there are some domains where it's quite high-risk to make a system completely autonomous for certain decisions, while for other topics it's okay to make it completely autonomous, depending on the complexity of the problem. That's where you really want your product managers, engineers, and subject matter experts to align on how to build the system and continuously improve it. The idea is just behavior calibration, without losing user trust while you do that calibration.
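The two ways of constraining autonomy mentioned above (by number of actions and by topic) can be combined into a single gate that decides when a human must approve. This is a minimal sketch; the topic names, the risk set, and the action cap are hypothetical values that product managers, engineers, and subject matter experts would actually have to agree on.

```python
from dataclasses import dataclass

# Hypothetical policy values; real ones come from PMs, engineers and SMEs.
HIGH_RISK_TOPICS = {"refund_policy_exception", "account_closure"}
MAX_AUTONOMOUS_ACTIONS = 2


@dataclass
class ProposedRun:
    topic: str
    actions: list[str]


def needs_human_approval(run: ProposedRun) -> bool:
    """Require a human when the topic is high-risk or the plan has too many steps."""
    return run.topic in HIGH_RISK_TOPICS or len(run.actions) > MAX_AUTONOMOUS_ACTIONS
```

Raising `MAX_AUTONOMOUS_ACTIONS` or shrinking `HIGH_RISK_TOPICS` is one concrete way to move a version along the agency axis once calibration looks stable.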
I guess. >> We'll link folks to this actual post if they want to go really deep. You basically go through all of these steps one by one, with a bunch of examples. And the idea here, as you said, is that everything you're describing is about making it continuous and iterative, moving along this progression toward higher autonomy and less control. Even calling it continuous calibration, continuous development communicates that it's an iterative process. And just to be clear, this naming is kind of an homage to CI/CD, continuous integration, continuous deployment. >> Sweet. And the idea here is that this is the version of that for AI, where instead of just integrating unit tests and deploying constantly, it's >> running evals, looking at results, iterating on the metrics you're watching, figuring out where it's breaking, and iterating on that. Awesome. Okay, so again, we'll point people to this post if they want to go deeper. That was a great overview. Before I go to a different topic, is there anything else about this framework specifically that you think is important for people to know? >> I think one of the most common questions we get is, how do I know if I need to go to the next stage, or if this is calibrated enough? There's not really a rule book you can follow, but it's all about minimizing surprise. Let's say you're calibrating every one or two days, and you figure out that you're not seeing new data distribution patterns and your users have been pretty consistent in how they behave with the system. Then the amount of information you gain is very low, and that's when you know you can actually go to the next stage. And it's all about the vibes at that point: do you know you're ready? You're not receiving any new information.
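The "minimizing surprise" heuristic can be approximated by comparing the mix of request types seen in consecutive calibration windows. The sketch below assumes each logged interaction already carries an intent label, uses Jensen-Shannon divergence as a stand-in measure of how much new information a window adds, and treats the 0.05 threshold and three stable windows as placeholder values to tune, not recommendations from the episode.

```python
import math
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Turn a window of intent labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_m(dist: Dict[str, float]) -> float:
        return sum(v * math.log2(v / m[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)


def ready_for_next_stage(windows: List[List[str]],
                         max_divergence: float = 0.05,
                         stable_windows: int = 3) -> bool:
    """windows: one list of intent labels per calibration cycle, oldest first.

    Ready only if each of the last `stable_windows` cycles introduced no
    never-before-seen label and shifted the mix by at most `max_divergence`.
    """
    if len(windows) < stable_windows + 1:
        return False
    seen = {label for w in windows[:-stable_windows] for label in w}
    recent = windows[-(stable_windows + 1):]
    for prev, cur in zip(recent, recent[1:]):
        if set(cur) - seen:
            return False  # a new kind of request appeared: still learning
        if js_divergence(label_distribution(prev), label_distribution(cur)) > max_divergence:
            return False
        seen.update(cur)
    return True
```

The same comparison, run continuously after a stage change, can also flag when calibration has drifted, for example when users start asking kinds of questions the system never saw before.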
But it also really helps to understand that sometimes there are events that can completely mess up the calibration of your system. An example is that GPT-4o doesn't exist anymore, or it's going to be deprecated in the APIs as well. So most companies that were using 4o have to switch to GPT-5, and GPT-5 has very different properties. That's where your calibration is off again, and you want to go back and do this process again. Sometimes users also start behaving differently with systems over time. User behavior evolves even with consumer products: you don't talk to ChatGPT the same way you were talking to it, say, two years ago, just because the capabilities have increased so much. And people get excited when these systems can solve one task; they want to try them out on other tasks as well. We built a system for underwriters at some point. Underwriting is a painful task: there are loan applications that are 30 or 40 pages long. The idea for this bank was to build a system that could help underwriters look up policies and information about the bank so that they could approve loans. For a good three or four months, everybody was pretty impressed with the system. We had underwriters actually report gains in terms of how much time they were spending, etc. After three months, we realized that they were so excited with the product that they started asking very deep questions we had never anticipated. They would just throw the entire application document at the system and ask, "For a case that looks like this, what did previous underwriters do?" For a user, that just seems like a natural extension of what they were doing, but the build behind it has to change significantly. Now you need to understand what "for a case like this" means in the context of the loan itself.
Is it referring to people in a particular income range, or people in a particular geography, and so on? Then you need to pick up historical documents, analyze those documents, and then tell them, "This is what it looks like," versus just saying that there's a policy X, Y, and Z and looking up that policy. So something that might seem very natural to an end user might be very hard to build as a product builder, and you see that user behavior also evolves over time. That's when you know you want to go back and recalibrate. >> What do you think is overhyped in the AI space right now, and even more importantly, what do you think is underhyped? >> As I said, I'm super optimistic about the different things going on in AI. So I wouldn't say overhyped, but something I feel is kind of misunderstood is the concept of multi-agents. People have this notion of, "I have this incredibly complex problem. Now I'm going to break it down: hey, you are this agent, take care of this; you're this agent, take care of that." And if they somehow connect all of these agents, they think that's the agent utopia. That's not to say there aren't incredibly successful multi-agent systems being built; there's no doubt about that. But I feel a lot of it comes down to how you limit the ways in which the system can go off track. For example, if you're building a supervisor agent, and there are subagents that actually do the work for the supervisor agent, that is a very successful pattern. But coming in with the notion of "I'm going to divide the responsibilities based on functionality and somehow expect all of it to work together through some sort of gossip protocol," the idea that you can do that is extremely misunderstood.
I don't think the current ways of building and current model capabilities are quite there yet for building those kinds of applications. So I'd call that misunderstood rather than overrated. Underrated: it's probably hard to believe, but I still feel coding agents are underrated, in the sense that you can go on Twitter and Reddit and see a lot of chatter about coding agents, but when you talk to an engineer at any random company, especially outside the Bay Area, you can see the amount of impact these coding agents could create, and the penetration is very low. So I feel like 2025 and 2026 are going to be incredible years for optimizing all of these processes, and that is going to create a lot of value with AI. >> That's really interesting on that first point. So the idea there is that you'll probably be more successful building and using an agent that does its own subagent splitting of work, versus, say, a bunch of Codex agents where you do this task, you do that task. >> You can have agents do these things and you as a human orchestrate them, or you can have one larger agent that orchestrates all of these things. But letting the agents communicate through a peer-to-peer kind of protocol, especially in, say, a customer support use case, makes it incredibly hard to control which agent is replying to your customer, because you need guardrails everywhere, and things like that. >> Yeah. Okay. Great picks. Okay, Ash, what have you got? >> Can I say evals? Will I be cancelled? >> In which category? Which bucket do they go in? >> Overrated. >> Overrated. Okay, go for it. We won't let you get cancelled. >> Just kidding. I think evals are misunderstood. They are important, folks. I'm not saying they're not important.
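The supervisor pattern Kiriti contrasts with peer-to-peer "gossip" between agents can be sketched as follows. The sub-agents, the keyword-based `classify` router, and the blocked-phrase guardrail are hypothetical stand-ins for LLM calls and real policy checks; the point is structural: sub-agents never reply to the customer directly, so guardrails live in one place.

```python
from typing import Callable, Dict


# Hypothetical sub-agents. In a real system each would wrap an LLM call with its
# own prompt and tools; plain functions keep this sketch runnable offline.
def billing_agent(request: str) -> str:
    return "Refund of the duplicate charge has been queued."


def shipping_agent(request: str) -> str:
    return "Your package is scheduled for delivery tomorrow."


SUB_AGENTS: Dict[str, Callable[[str], str]] = {
    "billing": billing_agent,
    "shipping": shipping_agent,
}

# Assumed guardrail list, for illustration only.
BLOCKED_PHRASES = ("guarantee", "legal advice")


def classify(request: str) -> str:
    # Stand-in for the supervisor's routing decision (an LLM call in practice).
    return "billing" if "charge" in request.lower() else "shipping"


def guardrail(reply: str) -> bool:
    # True if the reply is safe to send to the customer.
    return not any(phrase in reply.lower() for phrase in BLOCKED_PHRASES)


def supervisor(request: str) -> str:
    """Delegate to exactly one sub-agent and check its reply before it goes out.

    Sub-agents never message each other or the customer, so every outgoing
    reply passes through this single guardrail check.
    """
    draft = SUB_AGENTS[classify(request)](request)
    if not guardrail(draft):
        return "Escalated to a human agent."
    return draft


print(supervisor("I was charged twice"))
```

In a peer-to-peer design, each agent that can answer the customer would need its own copy of `guardrail`, which is the control problem described above.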
But I think this idea of "I'm going to keep jumping across tools and pick up and learn every new tool" is overrated. I'm still old school and feel like you really need to be obsessed with the business problem you're trying to solve. AI is only a tool; try to think of it that way. Of course you need to be learning about the latest and greatest, but don't be so obsessed with just building quickly. Building is really cheap today; design is more expensive. Really thinking about your product, what you're going to build, and whether it's really going to solve a pain point is way more valuable today, and that will only become more true in the near future. So really obsessing about your problem and design is underrated, and just rote building is overrated, I guess. >> Awesome. Okay, a similar sort of question from a product point of view: what do you think the next year of AI is going to look like? Give us a vision of where you think things are going to go by, say, the end of 2026. >> Yeah, I feel there's a lot of promise in background agents, or proactive agents, which are going to understand your workflow even more. If you think about where AI is failing to create value today, it's mainly about not understanding the context. And the reason it's not understanding the context is that it's not plugged into the right places where actual work is happening. As you do more of this, you can give the agent more context, and then it starts to see the world around you and understand what set of metrics you're optimizing for or what kinds of activities you're trying to do. It's a very easy extension from there to actually get more out of it and let the agent prompt you back.
We already do this with ChatGPT Pulse, which gives you a daily update of things you might care about, and it's very nice to have that jog your brain: "Oh, this is something I haven't thought about; maybe this is good." Now extend this to more complex tasks, like a coding agent that says, "Okay, I've fixed five of your Linear tickets, and here are the patches; just review them at the start of your day." I feel that is going to be extremely useful, and I see it as a strong direction in which products are going to be built in 2026. >> That is so cool. So essentially, agents anticipating what you want to do and getting ahead of you: "I've solved these problems for you," or "I think this is going to crash your site; maybe you should fix this thing right here," or "I see a spike here; let's refactor our database." Amazing. What a world. Okay, Ash, what have you got? >> I am all in on multimodal experiences in 2026. I think we made quite some progress in 2025, not just in terms of generation but also understanding. Until now, LLMs have been our most commonly used models, but as humans we are multimodal creatures. I would say language is probably one of our last forms of evolution. As the three of us are talking, we're constantly getting so many signals: "Oh, Lenny is nodding his head, so I'll go in this direction," or "Lenny's bored, so let me stop talking." So there's a chain of thought behind your chain of thought, and you're constantly altering it, and that non-language dimension of expression hasn't been explored as well.
So if we could build better multimodal experiences, that would get us closer to humanlike conversational richness. And given the kinds of models we have, there are also a bunch of boring tasks that are ripe for AI if multimodal understanding gets better. There are so many handwritten documents and really messy PDFs that cannot be parsed even by the best models as of today, and if that becomes possible, there will be so much data we can tap into. >> Awesome. I just saw Demis from DeepMind, Google AI, whatever they call the whole org, talking about this. He thinks that's going to be a big part of where they're going: combining the image model work, the LLM, and also their world model stuff, Genie, I think it's called. >> So that's going to be a wild time. Okay, last question. If someone wants to get better at building AI products, what are one or two skills that you think they should lean into and develop? >> I think we covered a bunch of best practices for AI products: start small, get your iteration going well, build a flywheel, and all of that. But if you look at it from a 10,000-foot level for anybody building today, like I was saying, implementation is going to be ridiculously cheap in the next few years. So really nail down your design, your judgment, your taste, and all of that. And in general, if you're building a career, I feel that your formative years, say the first two or three years of your career, have always been focused on execution mechanics. Now we have AI that can help you ramp up pretty quickly, and after a few years, I think everybody's job becomes about your taste, your judgment, and what is uniquely you. Nail down that part and try to figure out how you can bring in that kind of perspective.
And it doesn't mean you need to be significantly older or have years of experience. We recently hired someone, and we use a very popular app for tracking our tasks. We've been using it for years, and we pay a high subscription fee for it. And this guy just came to the meeting with his own vibe-coded app, onboarded us to all of it, and said, "Okay, let's start using this." I think that kind of agency and ownership to really rethink experiences is what will set people apart. I'm not blind to the fact that vibe-coded apps have high maintenance costs, and maybe as we scale as a company we'll have to replace it or think of better approaches, but we're a small company right now. I was really shocked, because I had never thought of it. If you've been used to working in a certain way, you associate a cost with building, and I feel like folks who grew up in this age have a much lower cost associated with it in their minds. They just don't mind building something and going ahead with it, and they're also very enthusiastic about trying out new tools. That's also probably why AI products have this retention problem: everybody's so excited about trying out these new tools. But essentially, it's about having agency and ownership. And I think this is also going to be the end of the busy-work era. You can't be sitting in a corner doing something that doesn't move the needle for a company. You really need to be thinking about end-to-end workflows and how you can bring in more impact. I think all of that will be super important. >> That reminds me, I just had Jason Lemkin on the podcast. He's very smart on sales and go-to-market, runs SaaStr, and he replaced his whole sales team with agents. He had 10 salespeople; now he has 1.2 and 20 agents.
One of the agents was just tracking everyone's updates to Salesforce and updating it automatically for them based on their calls. And one of the salespeople was like, "Okay, I quit." And it turns out he wasn't really doing anything. >> He was just sitting around >> and he's like, "Okay, this will catch me. I've got to get out of here." >> Yes. >> So to your point, it'll be harder to sit around and twiddle your thumbs. I think that's really right. >> Yeah. To add on to that, I feel like persistence is also extremely valuable, especially given that for anybody who wants to build something, the information is at your fingertips even more than in the past decade. You can learn anything overnight and take that sort of Iron Man approach. So having that persistence, going through the pain of learning this, implementing this, understanding what works and what doesn't, developing multiple approaches, and then solving the problem: I feel that is going to be the real moat as an individual. I like to call it "pain is the new moat." It's super useful to have this, especially when you're building these AI products. >> Say more about this. I love this concept: pain is the new moat. Is there more there? >> Yeah. Successful companies building in any new area right now are successful not because they're first to market or because they have some fancy feature that customers like. They went through the pain of understanding what the set of non-negotiable things is and traded those off against the features or model capabilities they can use to solve the problem. This is not a straightforward process.
There's no textbook for this, no straightforward way, no well-trodden path to get here. So a lot of the pain I was talking about is just going through the iteration of "Okay, let's try this, and if this doesn't work, let's try that." The knowledge you build across the organization, or across your own lived experience: that pain is what translates into the moat of the company. This could be a product, evals, or something else you built, and I feel that is going to be the game changer. >> That is awesome. It's like turning coal into a diamond. >> Diamond. Yes. >> Okay. I feel like we've done a great job helping people avoid some of the biggest issues people consistently run into building AI products. We've covered so many of the pitfalls and the ways to actually do it correctly. Before we get to our very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave listeners with? >> Be obsessed with your customers. Be obsessed with the problem. AI is just a tool, so make sure you're really understanding your workflows. 80% of so-called AI engineers and AI PMs spend their time actually understanding their workflows very well. They're not building the fanciest, coolest models or workflows around them. They're actually in the weeds understanding their customers' behavior and data. And whenever a software engineer who's never done AI before hears the phrase "look at your data," I think it's a huge revelation to them, but it's always been the case. You need to go there: look at your data, understand your users, and that's going to be a huge differentiator. >> It's a great way to close it. AI isn't the answer; it's a tool to solve the problem. With that, we have reached our very exciting lightning round. I've got five questions for both of you. Are you ready? >> Yay. Yes. >> All right.
You can both answer them, or you can each pick the ones you want to answer; up to you. What are two or three books you find yourself recommending most to other people? >> For me, it's a book called When Breath Becomes Air, Lenny. It was written by Paul Kalanithi. He was an Indian-origin neurosurgeon who was diagnosed with lung cancer at 31 or 32, and the whole book is his memoir, written after he was diagnosed. It's really beautiful, especially because I read it during COVID, and all we ever wanted to do during COVID was stay alive. There are a bunch of really nice quotes in the book, but I remember one where he was arguing against a very popular quote by Socrates, "The unexamined life is not worth living," or something like that. That means you really need to be thinking about your choices and understanding your values, your mission, and all of that. And Paul says, "If the unexamined life is not worth living, was the unlived life worth examining?" Which means: are you spending so much time just understanding your mission and purpose that you've forgotten to live? I think everybody who's in the AI era, building and continuously reinventing themselves, needs to take a pause and live for a bit. I guess they need to stop evaling life so much. What really... >> I was going to say, that's where my mind went. Generate some evals for your life. Oh my god, we've gone too far. >> Yep. Yeah. That's my favorite book. >> I like science fiction books more. I really like the Three-Body Problem series. It's a three-book series.
It has elements of grander science fiction, life outside Earth and how it impacts human decision-making, and it also has elements of geopolitics and how important or valuable abstract science is to human progress. When that gets stopped, it's not noticeable in everyday life, but it can cause devastating effects. So I feel AI helping in these areas, for example, is going to be extremely crucial, and that book is a nice example of what could happen otherwise. >> Completely agree. Absolutely love it; it might be my favorite sci-fi book, or series even. And it's three books; you have to read all three. By the way, I find that it only got really good about one and a half books in. So if anyone's tried it and thought, "What the heck is going on here?", just keep reading, get to the middle of the second one, and then it gets mind-blowing. >> Yes. If you love sci-fi and you're into AI, you've got to read a book called A Fire Upon the Deep by Vernor Vinge. >> Mhm. >> Check it out. It's incredible. I saw Noah Smith recommend this book in his newsletter. There are sequels to it, but this is the one. It's so incredible, and it turns out it's about AGI and superintelligence and all these things. It's so epic, and no one's heard of it. >> Thank you. >> There you go. I'm giving you one back. Okay, next question. What's a favorite recent movie or TV show that you've really enjoyed? >> I started re-watching Silicon Valley, and I think it's so true. It's so timeless. Everything is repeating all over again. Anybody who watched it a few years ago should start re-watching it, and you'll see that it's eerily similar to everything happening right now with the AI wave. >> That's a good idea, to rewatch it. I love that their whole business was a compression algorithm.
It's maybe a precursor to LLMs in some small way. Very good. All right, Kiriti, what have you got? >> I'm going to digress and say not a movie or a TV show: there's a game I picked up recently called Expedition 33. It has nothing to do with AI, but it's an incredibly well-made game in terms of the gameplay, the cinematics, the story, and the music. It's been amazing. >> I love that you have time to play games. That's a great sign. I love that. At OpenAI, I'm just imagining there's nothing else going on except coding and... >> Yeah, it has been incredibly hard to find time for that. >> That's good. That's a good sign. I'm happy to hear this. Okay. What's a favorite product that you've recently discovered that you really love? >> For me, it's Wispr Flow. I've been using it quite a bit, and I didn't know I needed it so much. The best part is that it's a contextual transcription tool, which means if you go to Codex and start using Wispr Flow, it starts identifying variables and all of that. And it's so seamless in going from transcription to instruction: you could say something like, "I'm so excited today, add three exclamation marks," and it seamlessly switches and adds those three exclamation marks instead of writing out "add three exclamation marks." I think it's pretty cool. If you're not using it, you should try it. I'll do a plug: get Wispr Flow for free for an entire year >> for a year, for free, by becoming an annual subscriber of my newsletter. >> And that's how I got access to it, Lenny. >> There we go. I pitched this deal, and I think people don't truly understand how incredible it is. They're like, "No way, this is real?" It's real, along with 18 other products. lennysproductpass.com. Check it out. Moving on. Kiriti? >> Awesome. I'm actually a stickler for productivity. I keep experimenting with new CLI tools and things that can make me faster.
Raycast has been amazing. I've discovered all these new shortcuts you can use to open different things, type in shortcut commands, and so on. And caffeinate is another thing I recently discovered from my teammates. It prevents your Mac from sleeping, so you can run a really long Codex task locally for four or five hours, let it build the thing, and then wake up and say, "Okay, this is good. I like this." >> That's hilarious. That combo, Codex and caffeinate. You guys need to build that yourselves: an OpenAI version of that, or the Codex agent should just keep your Mac from sleeping. That's so funny. By the way, Raycast is also part of Lenny's Product Pass: one year of Raycast free. >> Lenny didn't tell us about these, folks. These are actually our favorites. >> These are just two of 19 products. No caffeinate, though; I don't know if that's even paid. Okay, let's keep going. Do you have a favorite life motto that you find yourself coming back to in work or in life? >> For me, it's something my dad told me when I was a kid, and it's always stuck: "They told him it couldn't be done, but the fool didn't know it, so he did it anyway." Be foolish enough to believe that you can do anything if you put your heart into it. Especially now, because you have so much data at hand that could point to the fact that you'll probably be unsuccessful.
Look at how many podcasts made it to more than a thousand subscribers, or how many companies hit more than 1 million y. There's always data to show you that you won't be successful, but sometimes just be foolish and go ahead with it. >> That's great. Yeah, for me, I'm more of an overthinker, so I really like this quote from Steve Jobs: you can only connect the dots looking backwards. A lot of the time there are numerous choices and you don't really know the optimal one to pick, but life works in ways that let you look back and say, "Oh, these were actually beautiful in terms of how I transitioned." So I feel that is extremely useful: keep moving forward, keep experimenting. >> Final question. Whenever I have two guests on the podcast at once, I like to ask this: what's something that you admire about the other person? >> With Kiriti, it's that he's pretty calm and very grounded, and he's always been my sounding board. I can throw a ton of ideas at him, and he's able to anticipate the kinds of issues we might run into. He's extremely kind and lets his work speak instead of doing a lot of talking, I guess. But if I had to pick one, I think he's the most incredible husband. >> A reveal. Little do people know. >> Yeah. We've been married for four years, and they've been the most beautiful four years of my life. >> Oh wow. Okay. How do you follow that? >> Yeah, it's super hard to follow that. I would say I'm extremely privileged to work with really smart people at great companies in Silicon Valley.
And the unique thing that sets Aishwarya apart from any other smart folks I've worked with is that she has this really amazing knack for teaching and explaining things in a very understandable, easy-to-comprehend way. That, combined with persistence, is super useful, especially in this fast-moving AI world we're in. There are so many new things coming up that it feels overwhelming, but when I hear her explain, "This is how you make sense of this entire thing; this is where it plugs in," I feel like, "Oh, that is so simple. I can also do that." So she empowers a lot of people by simplifying things and explaining them in the most understandable way. I feel that is an incredible quality. >> Amazing. How sweet. I've got to do this all the time; I need more of this. That was great. Okay, final questions. Where can folks find stuff that you're working on and find you online? Share your course link, and tell us how listeners can be useful to you. >> I write a lot on LinkedIn. So if you want to listen to pragmatists who've been in the weeds working on AI products and what they're seeing, you can follow my work. We also have a GitHub repository with about 20K stars, and that repository is all about good resources for learning AI. It's completely free. And if you liked what we spoke about today, we also run a super popular course on building enterprise AI products; we'll leave a link to it. The course is a lot about unlearning mindsets and following a problem-first approach instead of a tool-first or hype-first approach. So you can check that out as well. And if you don't want to do the course, we write a lot, we give out a lot of free resources, and we have free sessions. So make sure you follow our work. >> Yeah, I would also add that you can find me on LinkedIn.
I don't write a lot, I guess, but I'm super excited to talk about any complex product you're building. If you have thoughts on how you can use coding agents to make your life better, or on the problems you're seeing, my DMs are always open and we can have a great discussion. >> Awesome. Well, Kiriti and Ash, thank you so much for being here. >> Thank you so much. >> Thank you, Lenny. This was so much fun. >> So much fun. Bye, everyone. >> Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.
TL;DR
- Building AI products fundamentally differs from traditional software due to inherent non-determinism in both user input and AI output, alongside a crucial agency-control trade-off.
- Successful AI development requires a step-by-step approach, starting with high human control and low AI agency, then gradually increasing autonomy as reliability is proven.
- Leaders must become hands-on and vulnerable, relearning intuitions about AI to effectively guide initiatives and foster a collaborative, problem-first culture.
Takeaways
- Understand that AI products are inherently non-deterministic: user input via natural language is fluid, and LLM outputs are probabilistic, making system behavior less predictable than traditional software.
- Actively manage the 'agency-control trade-off': recognize that granting AI more decision-making ability (agency) necessitates relinquishing some human control, requiring earned trust and reliability.
- Adopt a phased product development approach, starting with high human control and low AI agency (e.g., AI providing suggestions for human review) before gradually increasing autonomy.
- Prioritize a "problem-first" approach: focus on identifying and solving specific problems rather than immediately designing complex, high-autonomy agents.
- Implement "human-in-the-loop" systems to log human actions and feedback, creating a continuous improvement flywheel for AI behavior calibration and system refinement.
- Constrain AI autonomy for high-risk use cases (e.g., invasive surgery pre-authorization) and allow more agency for low-risk, "low-hanging fruit" tasks (e.g., simple blood test approvals).
- Leaders must dedicate time to hands-on learning, rebuilding their intuitions about AI, and being comfortable with not always being right, fostering a learning environment.
- Foster tighter collaboration between PMs, engineers, and data folks, as the AI lifecycle breaks traditional handoffs and requires shared ownership of feedback loops.
Vocabulary
Non-determinism — The characteristic of a system where the same input can produce different outputs, making its behavior less predictable or fixed.
LLM — (Large Language Model) An AI model trained on vast amounts of text data to understand, generate, and process natural language.
Agency control trade-off — The principle that as an AI system is given more autonomy and ability to make decisions (agency), human control over its actions is reduced.
Agentic systems — AI systems designed to act autonomously, make decisions, and take actions in an environment to achieve specific goals.
Human-in-the-loop — A system design where human input and decision-making are integrated into an AI workflow, often for oversight, training, or correction.
Flywheel — A continuous feedback loop or process where success in one area drives success in another, leading to accelerating growth or improvement.
Prompt phrasings — The specific wording and structure of instructions or queries given to an LLM, which can significantly influence its response.
Autonomy — The capability of an AI system to operate independently and make decisions without constant human intervention.
Behavior calibration — The process of adjusting and refining an AI system's actions and responses to ensure it behaves as intended and earns trust over time.
Pre-authorization use cases — Scenarios where an AI system can assist or automate the process of obtaining prior approval for services or actions, often in healthcare or finance.
Transcript
We worked on a guest post together. You had this really key insight that building AI products is very different from building non-AI products. >> Most people tend to ignore the non-determinism. You don't know how the user might behave with your product, and you also don't know how the LLM might respond to that. The second difference is the agency-control trade-off. Every time you hand over decision-making capabilities to agentic systems, you're relinquishing some amount of control on your end. >> This significantly changes the way you should be building product. So we recommend building step by step. When you start small, it forces you to think about what problem you're going to solve. With all these advancements in AI, one easy slippery slope is to keep thinking about the complexities of the solution and forget the problem that you're trying to solve. >> It's not about being the first company among your competitors to have an agent. It's about whether you've built the right flywheels so that you can improve over time. >> What kind of ways of working do you see in companies that build AI products successfully? >> I used to work with the now-CEO of Rackspace. He would have this block every day in the morning that said "catching up with AI, 4 to 6 a.m." Leaders have to get back to being hands-on. You must be comfortable with the fact that your intuitions might not be right, that you're probably the dumbest person in the room, and that you want to learn from everyone. >> What do you think the next year of AI is going to look like? >> Persistence is extremely valuable. Successful companies building in any new area right now are going through the pain of learning this, implementing this, and understanding what works and what doesn't. Pain is the new moat. >> Today my guests are Aishwarya Reganti and Kiriti Badam. Kiriti works on Codex at OpenAI and has spent the last decade building AI and ML infrastructure at Google and at Kumo.
Ash was an early AI researcher at Alexa and Microsoft and has published over 35 research papers. Together, they've led and supported over 50 AI product deployments across companies like Amazon, Databricks, OpenAI, and Google, at both startups and large enterprises. They also teach the number one rated AI course on Maven, where they teach product leaders all of the key lessons they've learned about building successful AI products. The goal of this episode is to save you and your team a lot of pain, suffering, and wasted time trying to build your AI product. Whether you are already struggling to make your product work or want to avoid that struggle, this episode is for you. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously. And if you become an annual subscriber of my newsletter, you get a year free of a ton of incredible products, including Lovable, Replit, Bolt, Gamma, n8n, Linear, Devin, PostHog, Superhuman, Descript, Wispr Flow, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Aishwarya Reganti and Kiriti Badam after a short word from our sponsors. This episode is brought to you by Merge. Product leaders hate building integrations. They're messy, they're slow to build, they're a huge drain on your roadmap, and they're definitely not why you got into product in the first place. Lucky for you, Merge is obsessed with integrations. With a single API, B2B SaaS companies embed Merge into their product and ship 220-plus customer-facing integrations in weeks, not quarters. Think of Merge like Plaid, but for everything B2B SaaS. Companies like Mistral AI and Ramp use Merge to connect to their customers' accounting, HR, ticketing, CRM, and file storage systems to power everything from automatic onboarding to AI-ready data pipelines.
Even better, Merge now supports the secure deployment of connectors to AI agents with a new product, so that you can safely power AI workflows with real customer data. If your product needs customer data from dozens of systems, Merge is the fastest, safest way to get it. Book and attend a meeting at merge.dev/lenny and they'll send you a $50 Amazon gift card. That's merge.dev/lenny. This episode is brought to you by Strella, the customer research platform built for the AI era. Here's the truth about user research: it's never been more important or more painful. Teams want to understand why customers do what they do, but recruiting users, running interviews, and analyzing insights takes weeks. By the time the results are in, the moment to act has passed. Strella changes that. It's the first platform that uses AI to run and analyze in-depth interviews automatically, bringing fast and continuous user research to every team. Strella's AI moderator asks real follow-up questions, probing deeper when answers are vague, and surfaces patterns across hundreds of conversations, all in a few hours, not weeks. Product, design, and research teams at companies like Amazon and Duolingo are already using Strella for Figma prototype testing, concept validation, and customer journey research, getting insights overnight instead of waiting for the next sprint. If your team wants to understand customers at the speed you ship products, try Strella. Run your next study at strella.io/lenny. That's S-T-R-E-L-L-A dot io slash Lenny. Ash and Kiriti, thank you so much for being here, and welcome to the podcast. >> Thank you. Thank you for having us. Super excited for this. >> Let me set the stage for the conversation that we're going to have today. You two have built a bunch of AI products yourselves. You've gone deep with a lot of companies that have built AI products, have struggled to build AI products, and have built AI agents.
You also teach a course on building AI products successfully, and you're on this mission to reduce the pain, suffering, and failure that you constantly see people go through when they're building AI products. So to lay a little foundation for the conversation, what are you seeing on the ground within companies trying to build AI products? What's going well? What's not going well? >> I think 2025 has been significantly different from 2024. One, the skepticism has significantly reduced. There were tons of leaders last year who probably thought this would be yet another crypto wave and were skeptical about getting started, and a lot of the use cases I saw last year were more like chat-on-your-data, and that was calling itself an AI product. This year, a ton of companies are really rethinking their user experiences and their workflows, and really understanding that you need to deconstruct and reconstruct your processes in order to build successful AI products. That's the good stuff. The bad stuff is that the execution is still all over the place. Think of it: this is a three-year-old field. There are no playbooks. There are no textbooks. So you really need to figure it out as you go, and the AI life cycle, both pre-deployment and post-deployment, is very different from a traditional software life cycle. So a lot of old contracts and handoffs between traditional roles, say PMs and engineers and data folks, have now been broken, and people are adapting to this new way of working together and, in a way, owning the same feedback loop. Previously, I feel like PMs and engineers and all of these folks had their own feedback loops to optimize, and now you probably need to be sitting in the same room.
You're probably looking at agent traces together and deciding how your product should behave. So it's a tighter form of collaboration, and companies are still figuring that out. That's what I see in my consulting practice this year. >> So let me follow that thread. We worked on a guest post together that came out a few months ago, and the thing that stuck with me most after working on that post is your really key insight that building AI products is very different from building non-AI products. The thing you're big on getting across is that there are two very big differences. Talk about those two differences. >> Yes. And I want to make sure that we drive home the right point. There are tons of similarities between building AI systems and software systems as well. But there are some things that fundamentally change the way you build software systems versus AI systems. One of them, which most people tend to ignore, is non-determinism. You're working with a non-deterministic API, as compared to traditional software. What does that mean, and why should it affect us? In traditional software, you have a very well-mapped decision engine or workflow. Think of something like booking.com: you have an intention, say you want to make a booking in San Francisco for two nights, and the product has been built so that your intention can be converted into a particular action. You click through a bunch of buttons, options, and forms, and you finally achieve your intention. In AI products, that layer has been replaced by a very fluid interface, which is mostly natural language. That means the user can come up with tons of ways of saying or communicating their intentions.
And that changes a lot of things, because now you don't know how your user is going to behave. That's on the input side. On the output side, you're working with a non-deterministic, probabilistic API, which is your LLM. LLMs are pretty sensitive to prompt phrasings, and they're pretty much black boxes, so you don't even know what the output will look like. So you don't know how the user might behave with your product, and you also don't know how the LLM might respond to that. You're now working with an input, an output, and a process, and you don't understand any of the three very well. You're trying to anticipate behavior and build for it. With agentic systems, this gets even harder, and that's where the second difference comes in: the agency-control trade-off. I'm shocked that so many people don't talk about this. They're extremely obsessed with building autonomous systems, agents that can do work for you. But every time you hand over decision-making capabilities or autonomy to agentic systems, you're relinquishing some amount of control on your end. When you do that, you want to make sure that your agent has earned your trust, or is reliable enough that you can allow it to make decisions. That's the agency-control trade-off: if you give your AI agent or AI system more agency, which is the ability to make decisions, you're also losing some control, and you want to make sure that the agent has earned that ability or has built up trust over time. >> So just to summarize what you're sharing here: people have been building software products for a long time.
We're now in a world where the software you're building is, one, non-deterministic: it can just do things differently. As you said, you go to booking.com and find a hotel, and it's the same experience every time. You'll see different hotels, but it's a predictable experience. With AI, you can't predict that it's going to be exactly the thing you planned it to be every time. And the other is this trade-off between agency and control: how much will the AI do for you versus how much should the person still be in charge. What I'm hearing is that the big point here is that this significantly changes the way you should be building product, and we're going to talk about how the product development life cycle should change as a result. Is there anything else you want to add before we get into that? >> Yeah, this distinction definitely needs to exist in your mind when you're starting to build. For example, say your objective is to hike Half Dome in Yosemite. You don't just go hike it on day one; you train yourself in smaller parts, slowly improve, and then go for the end goal. I feel that's extremely similar to how you want to build AI products: you don't start on day one with agents that have all the tools and all the context in the company and expect it to work, or even tinker at that level. You need to deliberately start in places where there is minimal impact and more human control, so that you have a good grip on what the current capabilities are and what you can do with them, and then slowly lean into more agency and less control.
That gives you the confidence that, okay, this is the particular problem I'm facing and the AI can solve it to this extent, and then you can think through what context you need to bring in and what tools you need to add to improve the experience. I feel this is good in two ways. First, you don't have to look at the complexity of the outside world, all of these fancy AI agents out there, and feel like you can't do that; everyone is starting from very minimalistic structures and then evolving. Second, as you're trying to build these one-click agents into your company, you don't have to be overwhelmed by the complexity; you can slowly graduate. So that's extremely important, and we see this as a repeating pattern over and over. >> Okay. So let's actually follow that, because that's a really important component of how you recommend people build AI stuff: AI products, AI agents, all the AI things. Give us an example of what you're talking about here, this idea of starting slow with agency and control and then moving up the rungs. >> Yeah. A very prevalent application of AI agents is customer support. Imagine you're a company with a lot of customer support tickets. Actually, no need to imagine: OpenAI faced the exact same thing when we were launching products. There was a huge spike in support volume as we launched successful products like image generation or GPT-5, and the kinds of questions you get are different, and the kinds of problems customers bring to you are different. So it's not just about dumping your whole list of help center articles into the AI agent.
You work out what you can build. Initially, the first step would be something like this: you have your human support agents, and the AI suggests what it thinks is the right thing to do. Then you get that feedback loop from the humans: this is a good suggestion in this particular case, and this is a bad one. You can go back and understand what the drawbacks are, or where the blind spots are, and how to fix them. Once you have that, you can increase the autonomy: instead of suggesting to the human, you show the answer directly to the customer. Then you can add more complexity: you were only answering questions based on help center articles, but now you add new functionality, like issuing refunds to customers or raising feature requests with the engineering team. If you start with all of this on day one, it's incredibly hard to control the complexity. So we recommend building step by step and increasing it over time. >> Awesome. And you have a visual that we'll share of what this looks like. But just to mirror back what you're describing: this idea of starting with high control and low agency. In the example you gave, the support agent is just giving suggestions and isn't able to do anything; the user is in charge. Then, as that becomes useful and you're confident it's doing the right sort of work, you give it a little more agency and pull back on the control the user has. And if that starts to go well, you give it more agency and the user needs less control over it. >> Awesome. >> I think the higher-level idea here is that with AI systems, it's all about behavior calibration.
It's practically impossible to predict up front how your system will behave. So what do you do about it? You make sure that you don't ruin your customer experience or your end-user experience. You keep that as is, but gradually reduce the amount of control that the human has, and there is no single right way of doing it. You can decide how to constrain that autonomy. A different example of how you could constrain autonomy is pre-authorization use cases. Insurance pre-authorization is a very ripe use case for AI, because clinicians spend a lot of time pre-authorizing things like blood tests, MRIs, and so on. Some cases are more like low-hanging fruit, for instance MRIs and blood tests, because as soon as you know the patient's information, it's easier to approve them, and AI could do that. Something like an invasive surgery is higher-risk; you don't want to be doing that autonomously. So you can determine which of these use cases should go through the human-in-the-loop layer and which ones AI can conveniently handle. And all through this process, you're also logging what the human is doing, because you want to build a flywheel that you can use to improve your system. So you're not ruining the user experience or eroding trust, and at the same time you're logging what humans would otherwise do so that you can continuously improve your system. >> So let me give you a few more examples of this kind of progression that you recommend. The reason I'm spending so much time here is that this is a really key part of your recommendation to help people build more successful AI products: start slow with high control and low agency, and build up over time once you've built confidence that it's doing the right sort of work. Here are a few more examples from your post that I'll just read.
Say you're building a coding assistant. V1 would be to just suggest inline completions and boilerplate snippets. V2 would be to generate larger blocks, like tests or refactors, for humans to review. And V3 is to apply the changes and open PRs autonomously. Another example is a marketing assistant. V1 would be to draft emails or social copy: here's what I would do. V2 is to build a multi-step campaign and run it. And V3 is to launch it, A/B test it, and auto-optimize campaigns across channels. >> Awesome. >> Yeah. >> And just to summarize where we're at, to give people the advice we've shared so far: one is that it's important to understand that AI products are different. They're non-deterministic. And you pointed out, and I forgot to mirror back this point, that this happens on both the input and the output. On the input side, the user experience is non-deterministic: people will see different things, different outputs, different chat conversations, maybe a different UI if it's designing the UI for you. And the output obviously is going to be non-deterministic too. So that's a problem and a challenge, and then... >> If you think of it, it's also the most beautiful part of AI. We're all much more comfortable talking than clicking through a bunch of buttons, right? So the bar to using AI products is much lower, because you can be as natural as you would be with humans. But that's also the problem: there are tons of ways we communicate, and you want to make sure that the intent is rightly communicated and the right actions are taken, because most of your systems are deterministic, and you want to achieve a deterministic outcome with non-deterministic technology. That's where it gets a little messy. >> Awesome. Okay. I love the optimistic version of why this is good. Okay.
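Ash's point about wanting a deterministic outcome from non-deterministic technology can be made concrete with a small guard between the model and the system that acts. This is a minimal sketch, not something described in the episode: it assumes a hypothetical booking backend (in the spirit of the booking.com example) that only accepts three fixed actions, and the action names, `Decision`, and `parse_action` are invented for illustration. It normalizes the different ways a model might phrase the same action and escalates anything it cannot validate to a human rather than guessing.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: the only actions the deterministic backend knows how to execute.
ALLOWED_ACTIONS = frozenset({"book_hotel", "cancel_booking", "check_status"})


@dataclass(frozen=True)
class Decision:
    action: Optional[str]  # one of ALLOWED_ACTIONS, or None if unrecognized
    needs_human: bool      # True when the output could not be validated


def parse_action(llm_output: str) -> Decision:
    """Map free-form model text onto a fixed, deterministic action set.

    The same intent can come back with stray whitespace, a trailing period,
    or different casing, so normalize those first. Anything that still does
    not match an allowed action is escalated to a human, not executed.
    """
    normalized = llm_output.strip().rstrip(".").lower()
    if normalized in ALLOWED_ACTIONS:
        return Decision(action=normalized, needs_human=False)
    return Decision(action=None, needs_human=True)


if __name__ == "__main__":
    # Three phrasings of the same intent, plus one reply the guard cannot map.
    for raw in ["Book_Hotel.", "  book_hotel\n", "BOOK_HOTEL", "Sure! Happy to help you book."]:
        print(repr(raw), "->", parse_action(raw))
```

A real system would more likely ask the model for structured output and validate it against a schema; the design choice the sketch illustrates is the same either way: the set of possible actions stays fixed, and uncertainty routes to a person.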
And then the other piece is this trade-off of autonomy versus control when you're designing a thing. What I imagine you're seeing is that people try to jump to the ideal, V3, immediately, and that's when they get into trouble. It's probably a lot harder to build, it just doesn't work, and then they say, okay, this is a failure, what are we even doing? >> Exactly. There are a bunch of things that you actually have to get confidence in before you get to V3, and it's easy to get overwhelmed: oh, my AI agent is doing these things wrong in 100 different ways, and you're not going to tabulate all of them and fix them, even if you've learned how to deal with evaluation practices and so on. If you're starting in the wrong spot, you're going to have a hard time correcting things from there. And when you start small, with a very minimalistic version with high human control and low agency, it also forces you to think about what problem you're going to solve. We use this term, problem-first, and to me it seemed obvious: yes, I do need to think about the problem. But it's incredible how well it resonates with people, because with all the advancements in AI that we're seeing, one easy slippery slope is to keep thinking about the complexities of the solution and forget the problem that you're trying to solve. When you start at a smaller scale of autonomy, you really start to think about what problem you're trying to solve and how to break it down into levels of autonomy that you can build later. That's incredibly useful, and we keep repeating this pattern over and over with everyone we talk to.
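The start-small advice, the V1-to-V3 ladders Lenny read out, the support-agent progression, and Ash's pre-authorization example fit together as one loop: route each case by autonomy level and risk, log what the human decides, and only move up a level once the logged acceptance rate is high enough. The sketch below is a hypothetical illustration under stated assumptions, not the guests' method: the level names, the 0.9 promotion threshold, the 50-review minimum, and the risk labels are all invented for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Tuple


class Autonomy(IntEnum):
    SUGGEST = 1     # V1: the AI suggests, a human decides
    DRAFT = 2       # V2: the AI produces the full output, a human reviews it
    AUTONOMOUS = 3  # V3: the AI acts on its own


# Hypothetical risk labels: these cases keep a human in the loop at every level.
HIGH_RISK = frozenset({"invasive_surgery"})


@dataclass
class AutonomyLadder:
    level: Autonomy = Autonomy.SUGGEST
    promote_at: float = 0.9  # assumed acceptance-rate threshold
    min_reviews: int = 50    # assumed minimum evidence before promoting
    log: List[Tuple[str, bool]] = field(default_factory=list)  # (case_type, accepted)

    def needs_human(self, case_type: str) -> bool:
        """High-risk cases always go to a human; others only until V3."""
        return case_type in HIGH_RISK or self.level < Autonomy.AUTONOMOUS

    def record_review(self, case_type: str, accepted: bool) -> None:
        """Log the human's verdict on an AI output; this log feeds the flywheel."""
        self.log.append((case_type, accepted))

    def acceptance_rate(self) -> float:
        if not self.log:
            return 0.0
        return sum(accepted for _, accepted in self.log) / len(self.log)

    def maybe_promote(self) -> bool:
        """Move up one level once enough logged reviews show the AI is reliable."""
        if (
            self.level < Autonomy.AUTONOMOUS
            and len(self.log) >= self.min_reviews
            and self.acceptance_rate() >= self.promote_at
        ):
            self.level = Autonomy(self.level + 1)
            self.log.clear()  # trust has to be re-earned at the new level
            return True
        return False


if __name__ == "__main__":
    ladder = AutonomyLadder()
    for _ in range(50):
        ladder.record_review("blood_test", accepted=True)
    print(ladder.maybe_promote())                  # True: now at Autonomy.DRAFT
    print(ladder.needs_human("blood_test"))        # True: still reviewed until V3
    print(ladder.needs_human("invasive_surgery"))  # True at every level
```

Clearing the log on promotion is one possible choice; a production system would probably also track acceptance per case type rather than pooling everything into a single rate.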
>> And there are so many other benefits to limiting autonomy, because there's also the danger of the thing doing too much for you: messing up your database, or sending out all these emails you never expected. There are so many reasons this is a good idea. >> Yep. I recently read a paper from a group of folks at UC Berkeley, basically Matei Zaharia, Ion Stoica, and the folks at Databricks, and it said that for about 74 or 75% of the enterprises they had spoken to, the biggest problem was reliability. That's also why they weren't comfortable deploying products to their end users or building customer-facing products: they just weren't sure, and they weren't comfortable exposing their users to a bunch of these risks. That's also why they think a lot of AI products today have to do with productivity, because that's much lower autonomy compared to end-to-end agents that would replace workflows. I love their work otherwise as well, and I think that's very much in line with what we're seeing at my startup. >> Okay, very interesting. There's an episode that'll come out before this conversation where we go deep into another problem that this avoids, which is prompt injection and jailbreaking, and just how big of an x-risk that is for AI products, where it's essentially an unsolved and potentially unsolvable problem. I'm not going to go down that track, but it was a pretty scary conversation, and it'll be out before this one. >> I think that will be a huge problem once systems go mainstream. We're still so busy building AI products that we're not worried about security, but it will be such a huge problem, especially with this non-deterministic API again. You're kind of stuck, because there are tons of instructions that you could inject within your prompt, and then, yeah, it's going to be bad.
Okay, let's actually spend a little time here, because it's really interesting to me and no one's talking about this stuff. The gist of the conversation we had is that it's pretty easy to trick AI into doing stuff it shouldn't do. There are all these guardrail systems people put in place, but it turns out these guardrails aren't actually very good, and you can always get around them. And to your point, as agents become more autonomous, and with robots, it gets pretty scary that you could get AI to do things it shouldn't. >> I think this is definitely a problem. But in the current spectrum of customers adopting AI, I feel we're still at a very early stage in the extent to which companies can actually take advantage of AI to improve or streamline their existing processes. 2025 has been an extremely busy year for AI agents and for customers trying to adopt AI, but I feel the penetration is still not as deep as it could be. So with the right human-in-the-loop points, I feel we can avoid a bunch of these things and focus more on streamlining processes. I'm more on the optimist side, in the sense that you need to try and adopt this instead of only highlighting the negative aspects of what could go wrong. So I feel strongly that companies have to adopt this. At OpenAI, it has never been the case with any company we talked to that AI cannot help them. It has always been: there's this set of things it can optimize for me, so let me see how I can adopt it. >> Sweet. I always like the optimistic perspective. I'm excited for you to listen to this and see what you think, because it's really interesting, and to your point, there are a lot of things to focus on.
It's one of many things to worry about and think about. Okay, let's get back on track. We've shared a bunch of pro tips and important pieces of advice. Let me ask: what other patterns and ways of working do you see in companies and teams that build AI products successfully? And what are the most common pitfalls people fall into? Maybe start with the other ways companies do this well. >> I almost think of it as a success triangle with three dimensions. It's never purely technical; every technology problem is a people problem first. With the companies we've worked with, it's these three dimensions: great leaders, good culture, and technical progress. Starting with leaders: we work with a lot of companies on their AI transformation, training, strategy, and so on, and I feel that at a lot of companies the leaders have built intuitions over 10 or 15 years, and they're highly regarded for those intuitions. But now, with AI in the picture, those intuitions have to be relearned, and leaders have to be vulnerable to do that. I used to work with Gajen, now the CEO of Rackspace. He would have this block every day in the morning that said "catching up with AI, 4 to 6 a.m." He would not have any meetings then; that was just his time to pick up the latest AI podcasts and information, and he would have weekend vibe-coding sessions and so on. So I think leaders have to get back to being hands-on, not because they have to implement these things themselves, but to rebuild their intuitions, because you must be comfortable with the fact that your intuitions might not be right. You're probably the dumbest person in the room, and you want to learn from everyone.
I've seen that be a very distinguishing factor of companies that build successful products, because you're bringing in that top-down approach. It's almost impossible for it to be bottom-up. You can't have a bunch of engineers go and get buy-in from the leader if the leader just doesn't trust the technology or has misaligned expectations about it. I've heard from so many folks who are building that their leaders just don't understand the extent to which AI can solve a particular problem, or they vibe code something and assume it's easy to take it to production. You really need to understand the range of what AI can solve today so that you can guide decisions within the company. The second one is the culture itself. I work with enterprises where AI is not their main thing, and they need to bring AI into their processes because a competitor is doing it and because it does make sense: there are use cases that are very ripe. Along the way, I feel a lot of companies develop this culture of FOMO and "you will be replaced," and people get really afraid. Subject matter experts are such a huge part of building AI products that work, because you really need to consult them to understand how your AI is behaving or what the ideal behavior should be. But I have spoken to a bunch of companies where the subject matter experts just don't want to talk to you, because they think their job is being replaced. Again, this comes from the leader. You want to build a culture of empowerment, of augmenting your own workflows with AI so that you can 10x what you're doing, instead of saying that you'll probably be replaced if you don't adopt AI.
That kind of empowering culture always helps: you want your entire organization to be in it together and make AI work for them, instead of trying to guard their own jobs. With AI, it's also true that it opens up a lot more opportunities than before, so your employees can do a lot more than before and 10x their productivity. The third dimension is the technical part, which we've talked about. Folks who are successful are incredibly obsessed with understanding their workflows very well, and with augmenting the parts that are ripe for AI versus the ones that might need a human in the loop somewhere. Whenever you're trying to automate part of a workflow, it's never the case that you can use an AI agent and it will solve all your problems. You probably have a machine learning model doing some part of the job and deterministic code doing another part. So you really need to be obsessed with understanding the workflow, so you can choose the right tool for the problem instead of being obsessed with the technology itself. Another pattern I see is that these folks really understand what it means to work with a non-deterministic API, which is your LLM. They also understand that the development life cycle looks very different, and they iterate pretty quickly: can I build something and iterate on it quickly in a way that doesn't ruin my customer experience, but at the same time gives me enough data to estimate behavior? So they build that flywheel very quickly. As of today, it's not about being the first company among your competitors to have an agent; it's about whether you have built the right flywheels so that you can improve over time. When someone comes up to me and says, "We have this one-click agent.
It's going to be deployed in your system, and in two or three days it'll start showing you significant gains," I would be skeptical, because it's just not possible. That's not because the models aren't there, but because enterprise data and infrastructure are very messy, and even the agent needs a while to understand how these systems work. There are very messy taxonomies everywhere. People tend to do things like get_customer_data_v1 and get_customer_data_v2, and all those functions exist and are still being called. There's a lot of tech debt you need to deal with. So if you're obsessed with the problem itself and you understand your workflows very well, you will know how to improve your agents over time, instead of just slapping on an agent and assuming it'll work from day one. I would go as far as to say that if someone's selling you one-click agents, it's pure marketing, and you don't want to buy into that. I would rather go with a company that says, "We're going to build this pipeline for you, and it will learn over time and build a flywheel to improve," than with something that's supposed to work out of the box. Replacing any critical workflow, or building something that can give you significant ROI, easily takes four to six months of work, even if you have the best data and infrastructure layers.

>> Amazing. There's a lot there that resonates deeply with other conversations I've been having on this podcast. One is that for a company to see a lot of impact from AI, the founder or CEO has to be deep into it. I had Dan Shipper on the podcast; his team works with a bunch of companies helping them adopt AI, and he said the number one predictor of success is the CEO chatting with ChatGPT, Claude, or whatever, many times a day. I love the Rackspace example you gave, catching up on AI news every morning.
I was imagining he'd be chatting with the chatbot rather than reading news.

>> With the amount of information out there today, you want to choose the right channels as well, because everybody has an opinion. Whose opinion do you want to bank on? Having a good-quality set of people you're listening to really makes sense. So he has a list of two or three sources that he always looks at, and then he comes back with a bunch of questions and bounces them around with a group of AI experts to see what they think. I was part of that group, so I know the kind of questions he comes up with.

>> I love that.

>> It's pretty cool. I asked him, "Why are you doing so much?" And he said it trickles down into a bunch of decisions they make.

>> Okay, let me move to another topic that's been a hot topic on this podcast, and was a hot topic on Twitter for a while: evals. A lot of people are obsessed with evals and think they're the solution to a lot of problems in AI. A lot of other people think they're overrated, that you don't need evals; you can just feel the vibes and you'll be all right. What's your take on evals? How far do they take people in solving the problems you've been talking about?
>> I feel there's a false dichotomy: either evals are going to solve everything, or online monitoring, meaning production monitoring, is going to solve everything. I find no reason to trust either extreme, in the sense of banking my entire application on one of them. If you take a step back, evals are basically your trusted product thinking, your knowledge about the product, going into a set of datasets that you build: this is what matters to me, these are the kinds of things my agent should not do, so let me build datasets to make sure it does well on those. In production monitoring, you're deploying your application and then tracking key metrics that communicate back to you how customers are using your product. If a customer gives a thumbs-up to an interaction with your agent, you want to know that. That is what production monitoring does. Production monitoring has existed for products for a long time; it's just that with AI agents, you need to monitor at much finer granularity. It's not only customers giving you explicit feedback; there's a lot of implicit feedback you can get as well. For example, in ChatGPT, if you like an answer you can give it a thumbs-up, but when customers don't like an answer, they sometimes don't give a thumbs-down; they regenerate the answer instead. That is a clear indication that the initial answer did not meet the customer's expectations. These are the kinds of implicit signals you always need to think about, and that spectrum has been growing in production monitoring.
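The triage described here, using explicit and implicit feedback to decide which traces a human should read first, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product does it: the event names (`thumbs_down`, `regenerate`, `abandoned_session`) are hypothetical, and so is the choice to weight an explicit signal twice as heavily as an implicit one.

```python
from dataclasses import dataclass, field

# Hypothetical event names; a real product defines its own telemetry schema.
EXPLICIT_NEGATIVE = {"thumbs_down"}
IMPLICIT_NEGATIVE = {"regenerate", "abandoned_session"}


@dataclass
class Trace:
    trace_id: str
    events: list = field(default_factory=list)


def review_priority(trace: Trace) -> int:
    """Score a trace by its negative feedback; explicit signals count double (assumed weights)."""
    score = 0
    for event in trace.events:
        if event in EXPLICIT_NEGATIVE:
            score += 2
        elif event in IMPLICIT_NEGATIVE:
            score += 1
    return score


def traces_to_review(traces, limit=10):
    """Return IDs of traces with any negative signal, most concerning first."""
    flagged = [t for t in traces if review_priority(t) > 0]
    flagged.sort(key=review_priority, reverse=True)
    return [t.trace_id for t in flagged[:limit]]


traces = [
    Trace("t1", ["thumbs_up"]),
    Trace("t2", ["regenerate"]),
    Trace("t3", ["regenerate", "thumbs_down"]),
]
print(traces_to_review(traces))  # ['t3', 't2']
```

At high throughput, the `limit` is what keeps human review tractable: nobody reads every trace, only the ones the signals point at.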
Now let's come back to the initial question: is it evals or is it production monitoring? Again, I go back to the problem-first approach: what is it that you're trying to build? You're trying to build a reliable application for your customers that won't do a bad thing, that always does the right thing, or, if it does the wrong thing, alerts you very quickly. I break this down into two parts. Nobody deploys an application without testing it first. That testing could be vibes, or it could be: I have these 10 questions it should not get wrong no matter what changes I make, so let me build that and call it an evaluation dataset. Now say you built this and deployed it, and you need to understand whether it's doing the right thing. If you're a high-throughput, high-transaction customer, you cannot practically sit and evaluate every trace. You need some indication of what to look at, and this is where production monitoring comes into the picture. You cannot predict all the ways in which your agent could go wrong, but the implicit and explicit signals tell you which traces you need to look at. Once you get those traces, you examine the failure patterns you're seeing across different types of interactions and ask: is there something I really care about that should not happen? If those failure modes are happening, then I need to think about building an evaluation dataset for them. Let's say my agent keeps offering refunds where I have explicitly
configured it not to. So I built an evaluation dataset for that, made my changes to tools or prompts or whatever, and deployed the second version of the product. There is no guarantee that this is the only problem you're going to see; you still need production monitoring to catch the other kinds of problems you might encounter. So evals are important and production monitoring is important, but the notion that only one of them is going to solve things for you is, in my opinion, completely dismissible.

>> All right, a very reasonable answer. And the point isn't simply "do both." It's that there are different things to catch, and one approach won't catch everything you need to be paying attention to.

>> Exactly. Awesome.

>> I want to take two steps back and talk about how much weight the term "evals" has had to carry in the second half of 2025. You meet a data labeling company and they tell you their experts are writing evals. Then you have folks saying PMs should be writing evals, that evals are the new PRDs. And then you have folks saying evals are pretty much everything: the feedback loop you're supposed to build to improve your products. Step back as a beginner and think: what are evals, and why is everyone saying "evals"? These are actually different parts of the process, and nobody's wrong, in the sense that yes, these are all evals. But when a data labeling company tells you its experts are writing evals, they're actually referring to error analysis, or experts leaving notes on what the right answer should be. Lawyers and doctors writing evals doesn't mean they're building LLM judges or the entire feedback loop. And saying a PM should be writing evals doesn't mean they have to write an LLM judge that's good enough for production.
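A minimal sketch of the refund example: a small, hand-written evaluation dataset that is rerun before each new version ships. It assumes a hypothetical agent interface that returns the reply text plus the names of the tools it called, and a hypothetical `issue_refund` tool; the two agents are stubs standing in for the buggy and the fixed versions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical, simplified agent interface: message in, (reply text, names of tools called) out.
AgentFn = Callable[[str], Tuple[str, List[str]]]


@dataclass
class EvalCase:
    user_message: str
    forbidden_tools: Set[str] = field(default_factory=set)  # tools the agent must never call here


def run_evals(agent: AgentFn, cases: List[EvalCase]) -> List[str]:
    """Run every case and return the messages where the agent called a forbidden tool."""
    failures = []
    for case in cases:
        _reply, tool_calls = agent(case.user_message)
        if case.forbidden_tools & set(tool_calls):
            failures.append(case.user_message)
    return failures


# Cases written after production monitoring surfaced unwanted refunds (hypothetical tool name).
REFUND_CASES = [
    EvalCase("My order arrived late, refund me now.", {"issue_refund"}),
    EvalCase("I was charged twice, can you fix it?", {"issue_refund"}),
]


def agent_v1(message: str) -> Tuple[str, List[str]]:
    # Stub of the buggy version: refunds whenever the word "refund" appears.
    if "refund" in message.lower():
        return "Refund issued.", ["issue_refund"]
    return "Routing you to billing.", ["route_ticket"]


def agent_v2(message: str) -> Tuple[str, List[str]]:
    # Stub of the fixed version: hands every refund request to a human.
    return "I've passed this to a billing specialist.", ["route_ticket"]


print(run_evals(agent_v1, REFUND_CASES))  # ['My order arrived late, refund me now.']
print(run_evals(agent_v2, REFUND_CASES))  # []
```

A passing run only says the known failure is gone; as Kiriti notes, production monitoring is still what surfaces the next failure mode this set doesn't cover.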
There are also very prescriptive takes on how to do this, and plus one to KD: you cannot predict up front whether you need to build an LLM judge or use implicit signals from production monitoring. I think Martin Fowler had a term for this back in the 2000s, "semantic diffusion," which means that someone comes up with a term, everybody starts butchering it with their own definitions, and you lose the actual definition. That is what is happening to evals today; everybody sees a different side of it. But if you sit a bunch of practitioners together and ask them whether it's important to build an actionable feedback loop for AI products, I think all of them will agree. How you do that really depends on your application. With complex use cases, it's incredibly hard to build LLM judges, because you see a lot of emerging patterns. If you built a judge to test for verbosity, say, it turns out you're seeing newer patterns that your judge can't catch, and you just end up building too many evals. At that point it makes more sense to look at your user signals, fix the issues, check whether you've regressed, and move on, instead of building these judges. So it all depends. One thing every ML practitioner will tell you is that it really depends on the context. Don't be obsessed with prescriptions; they're going to change.

>> That's such an important point, especially that "evals" just means many things to different people now. It's a term for so many things, and it's complicated to talk about evals when you see it as the stuff data labeling companies give you, and so on. And there are also benchmarks; people call benchmarks evals, too.
>> I recently spoke to a client who told me, "We do evals." I said, "Okay, can you show me your dataset?" And they said, "No, we just checked LMArena and Artificial Analysis. These are independent benchmarks, and we know this model is the right one for our use case." And I'm like, "You're not doing evals. Those are model benchmarks."

>> That makes sense. I get why people use the word in that context, but now it's just making things even more confusing.

>> Yep.

>> One more line of questioning that's on my mind. The reason this became a big debate is that Boris, the head of Claude Code, said, "Nah, we don't do evals on Claude Code. It's all vibes." Kiriti, what can you share about how the Codex team approaches evals?

>> With Codex, we have a balanced approach: you need to have evals, and you definitely need to listen to your customers. I think Alex was on your podcast recently, talking about how extremely focused we are on building the right product, and a big part of that is listening to your customers. Coding agents are quite unique compared to agents in other domains, in the sense that they're built for customizability and built for engineers. A coding agent is not a product that solves the top five or six workflows; it's meant to be customizable in many different ways. The implication is that your product is going to be used with different integrations, different tools, and different kinds of setups, so it gets really hard to build an evaluation dataset for every kind of interaction your customers will use your product for.
That said, you also need to know that if you make a change, it's at least not going to damage something that's really core to the product, so we have evaluations for that. At the same time, we take extreme care to understand how customers are using it. For example, we recently built a code review product, and it has been gaining an extreme amount of traction; many bugs are getting caught with it, both at OpenAI and at external customers. Now, say I'm making a model change to code review, or training it with a different kind of RL mechanism. Before deploying it, I definitely want to A/B test it to see whether it's actually finding the right mistakes and how users are reacting to it. Sometimes, if users get annoyed by incorrect code reviews, they go as far as switching off the product. Those are the signals you want to watch to make sure your new changes are doing the right thing, and it's extremely hard for us to think of these scenarios beforehand and develop evaluation datasets for them. So there's a bit of both: there are a lot of vibes and a lot of customer feedback, and we're super active on social media to find out whether anybody is having certain types of problems, so we can fix them quickly. It's a whole range of things you do here.

>> That makes so much sense. Okay, what I'm hearing is that Codex is pro-evals, but evals aren't enough. You need to...

>> Yes.

>> ...also watch customer behavior and feedback, and there's also some vibes: is this feeling good? As I'm using it, is it generating great code that I'm excited about?
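One way to read the "users switching off code review" signal in an A/B test is a two-proportion z-test on the disable rate in each arm. The sketch below assumes that choice of test (the conversation doesn't name one), independent users, and reasonably large arms; the counts are made up for illustration.

```python
import math


def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided z-test for a difference between two rates, using the pooled standard error.

    x_a / n_a: users who switched the feature off in the control arm.
    x_b / n_b: the same for the treatment arm (e.g. a new model or RL change).
    Returns (rate difference, z statistic, two-sided p-value).
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in either arm; cannot test")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return p_b - p_a, z, p_value


# Hypothetical counts: 50 of 1,000 control users vs. 80 of 1,000 treatment users disabled reviews.
diff, z, p = two_proportion_z_test(50, 1000, 80, 1000)
print(f"disable rate +{diff:.1%}, z={z:.2f}, p={p:.4f}")
```

A significant rise in the disable rate would argue against shipping the change even if offline evals look fine, which is exactly the kind of signal that is hard to anticipate with a dataset written beforehand.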
>> If anybody comes and says, "I have this concrete set of evals that I can bet my life on, and I don't need to think about anything else," it's not going to work. For every new model we launch, we get together as a team and test different things, with each person concentrating on something different. We have a list of hard problems that we throw at the model to see how well it's progressing. So you could say it's custom evals for each engineer, just to understand what the product is doing with the new model.

If you're a founder, the hardest part of starting a company isn't having the idea. It's scaling the business without getting buried in back-office work. That's where Brex comes in. Brex is the intelligent finance platform for founders. With Brex, you get high-limit corporate cards, easy banking, high-yield treasury, plus a team of AI agents that handle manual finance tasks for you. They'll do all the stuff that you don't want to do, like file your expenses, scour transactions for waste, and run reports, all according to your rules. With Brex AI agents, you can move faster while staying in full control. One in three startups in the United States already runs on Brex. You can, too, at brex.com.

>> We've been talking for almost an hour already, and we haven't even covered the extremely powerful software development workflow for building AI products that you two developed and teach in your course, which combines everything we've been talking about into a step-by-step approach to building AI products. You call it the continuous calibration, continuous development framework.
Let's pull up a visual to show people what we're talking about, and then walk us through what this is, how it works, and how teams can shift the way they build AI products to this approach to avoid a lot of pain and suffering.

>> Before we explain the life cycle, a quick story on why Kiriti and I came up with this. There are tons of companies we keep talking to that feel pressure from competitors who are all building agents, so they think they should be building fully autonomous agents too. I did end up working with a few customers where we built these end-to-end agents. It turns out that because you start off not knowing how users might interact with your system, or what responses or actions the AI might come up with, it's really hard to fix problems when you have a huge workflow taking four or five steps and making tons of decisions. You just end up debugging so much and hot-fixing. At one point we were building a customer support use case, which is the example we give in the newsletter as well, and we had to shut down the product because we were doing so many hot fixes and there was no way we could keep up with all the emerging problems. There has also been quite a bit of news online recently: I think Air Canada had an incident where one of its agents hallucinated a refund policy that was not part of their original playbook, and they had to honor it for legal reasons. There have been a ton of really scary incidents like that. That's where the idea comes from: how can you build so that you don't lose customer trust, and your agent or AI system doesn't end up making decisions that are super dangerous to the company itself, while at the same time building a flywheel so that you can improve your
product as you go? That's why we came up with the idea of continuous calibration, continuous development (CC/CD). The idea is pretty simple. On the right side of the loop is continuous development, where you scope capability and curate data: essentially, you get a dataset of what your expected inputs are and what your expected outputs should look like. This is a very good exercise before you start building any AI product, because you often discover that a lot of folks on the team are just not aligned on how the product should behave, and that's where your PMs and subject matter experts can contribute a lot more information. So you have a dataset that your AI product should do really well on. It's not comprehensive, but it lets you get started. Then you set up the application and design the right kind of evaluation metrics. I intentionally use the term "evaluation metrics" rather than "evals" because I want to be very specific: evaluation is a process, and evaluation metrics are the dimensions you want to focus on during that process. Then you deploy and run your evaluation metrics. The second part is continuous calibration, where you understand which behaviors you hadn't expected in the beginning. When you start the development process, you have a dataset you're optimizing for, but more often than not you realize that dataset is not comprehensive enough, because users start interacting with your system in ways you did not predict. That's where you want to do the calibration piece: I've deployed my system, and now I see patterns I did not really expect. Your evaluation metrics should give you some insight into those patterns.
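A minimal sketch of the continuous-development side, using the customer-support routing example that comes up next: a curated dataset of expected inputs and outputs, plus evaluation metrics kept as named dimensions that an evaluation step applies. The department labels, the `escalation_rate` metric, and the keyword router (a stand-in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    ticket: str
    expected_department: str


# Curated up front with PMs and subject matter experts: expected inputs and outputs (hypothetical).
DATASET = [
    Example("My sneakers arrived in the wrong size", "returns"),
    Example("I was charged twice for one order", "billing"),
    Example("Where is my package?", "shipping"),
]


def routing_accuracy(preds, data):
    """Share of tickets sent to the expected department."""
    return sum(p == ex.expected_department for p, ex in zip(preds, data)) / len(data)


def escalation_rate(preds, data):
    """Share of tickets the router handed to a human instead of guessing."""
    return sum(p == "human_review" for p in preds) / len(data)


# Evaluation metrics are the dimensions; evaluate() is the process that applies them.
METRICS = {"routing_accuracy": routing_accuracy, "escalation_rate": escalation_rate}


def evaluate(router, data=DATASET):
    preds = [router(ex.ticket) for ex in data]
    return {name: metric(preds, data) for name, metric in METRICS.items()}


def keyword_router(ticket):
    # Hypothetical stand-in for an LLM-based router, so the example runs offline.
    text = ticket.lower()
    if "charged" in text:
        return "billing"
    if "package" in text:
        return "shipping"
    return "human_review"


print(evaluate(keyword_router))
```

Keeping metrics in a dictionary makes the calibration step cheap: when a new error pattern shows up in production, adding a dimension is one more entry rather than a new harness.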
But sometimes you find that those metrics were not enough either, and you have new error patterns you hadn't thought about. That's where you analyze behavior and spot error patterns. You apply fixes for the issues you see, but you also design new evaluation metrics to catch those emerging patterns. That doesn't mean you should always design an evaluation metric. Some errors you can just fix and not come back to, because they're one-off errors: for instance, a tool-calling error that happened just because your tool wasn't defined well. You can fix it and move on. That is pretty much what an AI product life cycle looks like. What we also specifically recommend is that, while you go through these iterations, you start with lower-agency, higher-control iterations. That means constraining the number of decisions your AI system can make and making sure there are humans in the loop, then increasing agency over time, because you're building a flywheel of behavior and learning what kinds of use cases are coming in and how users are using the system. This is a nice image that shows how you can think of agency and control as two dimensions: each version increases the agency, or the ability of your AI system to make decisions, and lowers the control as you go. The example we give in the newsletter is a customer support agent, which you can break down into three versions. The first version is just routing: can your agent classify a particular ticket and route it to the right department? When you read this, you probably think: is it so hard to just do routing? Why can't an agent easily do that?
When you go to enterprises, routing itself can be a super complex problem. Any popular retail company you can think of has hierarchical taxonomies, and most of the time those taxonomies are incredibly messy. I have worked on use cases where the taxonomy has some hierarchy that lists shoes, women's shoes, and men's shoes all at the same layer, when ideally you'd have shoes, with women's shoes and men's shoes as subclasses. You think, okay, fine, I can just merge that. Then you go further and see that there's also another section under shoes that says "for women" and "for men," and for some reason it was never consolidated or fixed. If an agent sees this kind of taxonomy, what is it supposed to do? Where is it supposed to route? A lot of the time, you're not aware of these problems until you actually build something and understand it. When real human agents see these kinds of problems, they know what to check next. Maybe they realize that the "for women" and "for men" node under shoes was last updated in 2019, which means it's just a dead node lying there unused, so they know to look at a different node. I'm not saying agents or models aren't capable of understanding this, but there are really weird rules within enterprises that are not documented anywhere, and you want to make sure your agents have all of that context instead of just throwing the problem at them. Coming back to the versions: routing is one where you have really high control, because even if your agent routes to the wrong department, humans can take control and undo those actions.
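Undocumented rules like "a node nobody has touched since 2019 is dead" have to reach the agent somehow. The sketch below assumes one simple encoding: filter stale taxonomy nodes out before offering route options. The `TaxonomyNode` fields, the paths, and the three-year cutoff are hypothetical; in practice the rule would come from the subject matter experts who know the taxonomy, and it could just as well be written into the agent's context instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TaxonomyNode:
    path: str           # hypothetical "parent/child" path
    last_updated: date  # assumed to be tracked by the catalog system


def routable_nodes(nodes, as_of, max_age_days=3 * 365):
    """Keep only nodes touched within max_age_days; older ones are treated as dead."""
    return [n.path for n in nodes if (as_of - n.last_updated).days <= max_age_days]


taxonomy = [
    TaxonomyNode("shoes/womens", date(2024, 5, 1)),
    TaxonomyNode("shoes/mens", date(2024, 5, 1)),
    TaxonomyNode("shoes/for-women", date(2019, 3, 12)),  # the stale duplicate from the story
    TaxonomyNode("shoes/for-men", date(2019, 3, 12)),
]

print(routable_nodes(taxonomy, as_of=date(2025, 6, 1)))  # ['shoes/womens', 'shoes/mens']
```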
Along the way, you also figure out that you're probably dealing with a ton of data issues that you need to fix, to make sure your data layer is good enough for the agent to function. The second version is what we call a copilot. Once you've figured out that routing works fine after a few iterations and you've fixed your data issues, you can go to the next step: can my agent provide suggestions based on the standard operating procedures we have for customer support? It could just generate a draft that the human can edit. When you do this, you're also logging human behavior: how much of the draft the customer support agent used, and what they omitted. So you're actually getting error analysis for free, because you're logging everything the user does, and you can build that back into your flywheel. After that, once you see that the drafts look good and most of the time humans aren't making many changes and are using the drafts as is, that's when you want to go to the end-to-end resolution assistant, which can draft a resolution and resolve the ticket as well. Those are the stages of agency: you start with low agency and go up to high. We also put together a really nice table showing what you do at each version, what you learn that enables you to go to the next step, and what information you get to feed into the loop. When you're just doing routing, you get better-quality routing data, and you also learn what kinds of prompts you need to build to improve the routing system. Essentially, you're figuring out your structure for context engineering and building the flywheel you want. As I go through this, I want to be very clear about two things.
One is that building with CC/CD in mind doesn't mean you fix the problem once and for all. You may have gone through V3 and then see a new distribution of data you never imagined. This is just one way to lower your risk: you get enough information about how users behave with your system before going to complete autonomy. The second thing is that you're also building an implicit logging system. A lot of people come and tell us, "Wait, there are evals, so why do you need something like this?" The issue with just building a bunch of evaluation metrics and running them in production is that evaluation metrics catch only the errors you're already aware of. There can be a lot of emerging patterns that you only understand after you put things in production. For those emerging patterns, you're creating a low-risk framework so that you can understand user behavior and not end up in a position where there are tons of errors and you're trying to fix all of them at once. This is not the only way to do it; there are many ways to decide how you constrain autonomy. It could be based on the number of actions the agent takes, which is what we do in this example. It could be based on topic: there are some domains where it's pretty high-risk to make a system completely autonomous for certain decisions, while for other topics it's okay to make it completely autonomous, depending on the complexity of the problem. That's where you really want your product managers, engineers, and subject matter experts to align on how to build the system and continuously improve it. The idea is behavior calibration without losing user trust as you calibrate.
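The V1 to V3 progression, a per-topic cap on autonomy, and the "error analysis for free" logging from the copilot stage can be combined in one sketch. Everything here is an assumption for illustration: the `Stage` names, the topic ceilings, the use of character-level similarity from Python's `difflib` as a proxy for how much of a draft the human kept, and the 0.95 and 0.8 thresholds.

```python
import difflib
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    ROUTE = 1    # V1: classify and route; humans can undo a wrong route
    DRAFT = 2    # V2: copilot drafts a reply; a human edits and sends it
    RESOLVE = 3  # V3: end-to-end resolution


# Hypothetical per-topic ceilings: high-risk topics stay at lower agency.
TOPIC_CEILING = {"order_status": Stage.RESOLVE, "billing": Stage.DRAFT, "legal": Stage.ROUTE}


def allowed_stage(topic, rollout):
    """Act at the lower of the global rollout stage and the topic's ceiling (unknown: route only)."""
    return min(rollout, TOPIC_CEILING.get(topic, Stage.ROUTE))


@dataclass
class DraftLog:
    topic: str
    ai_draft: str
    sent_reply: str

    def similarity(self):
        # 1.0 means the human sent the draft untouched; lower means heavier edits.
        return difflib.SequenceMatcher(None, self.ai_draft, self.sent_reply).ratio()


def untouched_rate(logs, topic, threshold=0.95):
    """Share of a topic's drafts that humans sent (nearly) as is."""
    relevant = [log for log in logs if log.topic == topic]
    if not relevant:
        return 0.0
    return sum(log.similarity() >= threshold for log in relevant) / len(relevant)


def ready_for_resolve(topic, logs, min_untouched=0.8):
    """Promote a topic to V3 only if its ceiling allows it and drafts rarely need edits."""
    if TOPIC_CEILING.get(topic, Stage.ROUTE) < Stage.RESOLVE:
        return False
    return untouched_rate(logs, topic) >= min_untouched


logs = [
    DraftLog("order_status", "Your order ships tomorrow.", "Your order ships tomorrow."),
    DraftLog("order_status", "It is in transit.", "It is in transit."),
    DraftLog("billing", "Please restart the app.", "Sorry! Please clear your cache and reinstall."),
]
print(allowed_stage("billing", Stage.RESOLVE).name)  # DRAFT
print(ready_for_resolve("order_status", logs))      # True
print(ready_for_resolve("billing", logs))           # False
```

The topic ceiling is a hard cap that logging cannot override, while the untouched-draft rate only decides promotion within it; that mirrors the split between "constrain by topic" and "earn autonomy through observed behavior."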
>> We'll link folks to the actual post if they want to go really deep; you go through all of these steps with a bunch of examples. The idea, as you said, is that everything you're describing is continuous and iterative, moving along a progression toward higher autonomy and less control. Even the name, continuous calibration, continuous development, communicates that it's an iterative process. And just to be clear, the naming is an ode to CI/CD, continuous integration and continuous deployment. The idea is that this is the version of that for AI: instead of just integrating unit tests and deploying constantly, you're running evals, looking at results, iterating on the metrics you're watching, figuring out where things break, and iterating on that. Awesome. Again, we'll point people to the post if they want to go deeper. That was a great overview. Before I move to a different topic, is there anything else about this framework specifically that you think is important for people to know?

>> One of the most common questions we get is: how do I know whether I should go to the next stage, or whether the system is calibrated enough? There's no rulebook you can follow, but it's all about minimizing surprise. Say you're calibrating every one or two days, and you find that you're not seeing new data distribution patterns and your users have been pretty consistent in how they use the system. Then the amount of information you gain is very low, and that's when you know you can go to the next stage. It's all about the vibes at that point: do you know you're ready? You're not receiving any new information.
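There's no rulebook here, but "not seeing new data distribution patterns" can be given a number. This sketch assumes you label each calibration window's traffic (for example by intent or error pattern) and uses the Jensen-Shannon divergence between consecutive windows as the measure of surprise. The labels, the 0.05 threshold, and the three-window requirement are hypothetical, and the result is an input to the team's judgment rather than a replacement for it.

```python
import math
from collections import Counter


def js_divergence(a_labels, b_labels):
    """Jensen-Shannon divergence (base 2, so between 0 and 1) of two label distributions."""
    if not a_labels or not b_labels:
        raise ValueError("both windows need at least one label")
    a, b = Counter(a_labels), Counter(b_labels)
    keys = set(a) | set(b)
    p = {k: a[k] / len(a_labels) for k in keys}
    q = {k: b[k] / len(b_labels) for k in keys}
    m = {k: (p[k] + q[k]) / 2 for k in keys}

    def kl_to_m(x):
        # KL(x || m); zero-probability terms contribute nothing, and m[k] > 0 whenever x[k] > 0.
        return sum(x[k] * math.log2(x[k] / m[k]) for k in keys if x[k] > 0)

    return (kl_to_m(p) + kl_to_m(q)) / 2


def little_new_information(windows, threshold=0.05, stable_windows=3):
    """True if each of the last `stable_windows` window-to-window shifts is below threshold."""
    if len(windows) < stable_windows + 1:
        return False
    recent = windows[-(stable_windows + 1):]
    return all(js_divergence(prev, cur) < threshold for prev, cur in zip(recent, recent[1:]))


steady = ["refund"] * 50 + ["shipping"] * 30 + ["billing"] * 20
shifted = ["refund"] * 20 + ["similar_past_cases"] * 80  # users start asking something new
print(little_new_information([steady] * 4))               # True
print(little_new_information([steady] * 3 + [shifted]))   # False
```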
But it also really helps to understand that sometimes there are events that can completely mess up the calibration of your system. An example is that GPT-4o doesn't exist anymore, or it's going to be deprecated in the APIs as well. So most companies that were using 4o need to switch to GPT-5, and GPT-5 has very different properties. That's where your calibration is off again, and you want to go back and do this process again. Sometimes users also start behaving with systems differently over time. User behavior evolves even with consumer products, right? You don't talk to ChatGPT the same way you were talking to it, say, two years ago, just because the capabilities have increased so much. And also, people get excited when these systems can solve one task, and they want to try them out on other tasks as well. We built a system for underwriters at some point, right? Underwriting is a painful task. There are loan applications that are 30 or 40 pages long. The idea for this bank was to build a system that could help underwriters pick up policies and information about the bank so that they could approve loans, right? For a good three or four months, everybody was pretty impressed with the system. We had underwriters actually report gains in how much time they were spending, etc. After three months, we realized that they were so excited about the product that they started asking very deep questions we never anticipated. They would just throw the entire application document at the system and go, "For a case that looks like this, what did previous underwriters do?" For a user, that just seems like a natural extension of what they were doing, but the system behind it has to change significantly. Now you need to understand what "for a case like this" means in the context of the loan itself.
Is it referring to people in a particular income range, or to people in a particular geo, and so on? Then you need to pick up historical documents, analyze them, and tell the underwriter, "Okay, this is what it looks like," versus just saying there's a policy X, Y, and Z and you want to look up that policy. So something that seems very natural to an end user might be very hard to build as a product builder. And you see that user behavior also evolves over time, and that's when you know you want to go back and recalibrate. >> What do you think is overhyped in the AI space right now, and even more importantly, what do you think is underhyped? >> As I said, I'm super optimistic about the different things going on in AI. So I wouldn't say overhyped, but something I feel is kind of misunderstood is the concept of multi-agents. People have this notion of, "I have this incredibly complex problem. Now I'm going to break it down: hey, you are this agent, take care of this; you're that agent, take care of that." And they think that if they somehow connect all of these agents, they'll reach agent utopia. That's not to say there aren't incredibly successful multi-agent systems out there; there's no doubt about that. But I feel a lot of it comes down to how you limit the ways in which the system can go off track. For example, if you're building a supervisor agent, with sub-agents that actually do the work for the supervisor agent, that is a very successful pattern. But coming in with the notion of "I'm going to divide the responsibilities based on functionality and somehow expect all of it to work together through some sort of gossip protocol": the idea that you can do that is extremely misunderstood.
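The supervisor pattern described above, where only one agent faces the customer and sub-agents never talk to each other, can be sketched as follows. This is an illustrative toy, not a real agent framework: the sub-agents are plain stub functions standing in for model calls, and the keyword routing, the `Supervisor` class, and the blocked-phrase guardrail are all hypothetical. The point it shows is structural: guardrails live at a single choke point instead of being duplicated across peers.

```python
from typing import Callable, Dict, Set

SubAgent = Callable[[str], str]

HANDOFF = "Let me connect you with a human agent."


def billing_agent(task: str) -> str:
    # Stub standing in for an LLM-backed sub-agent.
    return f"billing: handled '{task}'"


def shipping_agent(task: str) -> str:
    # Stub standing in for another LLM-backed sub-agent.
    return f"shipping: handled '{task}'"


class Supervisor:
    """Only the supervisor replies to the customer; sub-agents never message each other."""

    def __init__(self, sub_agents: Dict[str, SubAgent], blocked_phrases: Set[str]):
        self.sub_agents = sub_agents
        self.blocked_phrases = blocked_phrases

    def route(self, message: str) -> str:
        # Hypothetical keyword routing; a real system might ask a model to pick.
        text = message.lower()
        for name in self.sub_agents:
            if name in text:
                return name
        return "fallback"

    def guardrail(self, draft: str) -> str:
        # One choke point for guardrails instead of one per peer agent.
        if any(phrase in draft.lower() for phrase in self.blocked_phrases):
            return HANDOFF
        return draft

    def handle(self, message: str) -> str:
        name = self.route(message)
        if name == "fallback":
            return HANDOFF
        return self.guardrail(self.sub_agents[name](message))
```

Because every reply passes through `Supervisor.handle`, it is always clear which agent answered the customer, which is exactly what becomes hard to control in a peer-to-peer setup.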
I don't think current ways of building and current model capabilities are quite there yet for building those kinds of applications. So I'd call it misunderstood rather than overrated. As for underrated: it's probably hard to believe, but I still feel coding agents are underrated. You can go on Twitter or Reddit and see a lot of chatter about coding agents, but when you talk to an engineer at any random company, especially outside the Bay Area, you can see how much impact these coding agents could create, and yet the penetration is very low. So I feel 2025 and 2026 are going to be incredible years for optimizing all of these processes, and that's going to create a lot of value with AI. >> That's really interesting on that first point. So the idea there is that you'll probably be more successful building and using an agent that does its own sub-agent splitting of work, versus, say, a bunch of Codex agents where you tell one to do this task and another to do that task. >> You can have agents do these things and orchestrate them yourself as a human, or you can have one larger agent that orchestrates all of them. But letting the agents communicate through a peer-to-peer kind of protocol, especially in, say, a customer support use case, makes it incredibly hard to control which agent is replying to your customer, because you need to place your guardrails everywhere, and things like that. >> Yeah. Okay. Great picks. Okay, Ash, what have you got? >> Can I say evals? Will I be cancelled? >> In which category? Which bucket do they go in? >> Overrated. >> Overrated. Okay, go for it. We won't let you get cancelled. >> Just kidding. I think evals are misunderstood. They are important, folks. I'm not saying they're not important.
But I think this habit of "I'm going to keep jumping across tools, and pick up and learn every new tool" is overrated. I'm still old school and feel like you really need to be obsessed with the business problem you're trying to solve. AI is only a tool; try to think of it that way. Of course, you need to be learning about the latest and greatest, but don't be so obsessed with just building quickly. Building is really cheap today. Design is more expensive. Really thinking about your product, what you're going to build, and whether it will actually solve a pain point is way more valuable today, and that will only become more true in the near future, right? So really obsessing over your problem and your design is underrated, and rote building is overrated, I guess. >> Awesome. Okay. A similar sort of question from a product point of view: what do you think the next year of AI is going to look like? Give us a vision of where you think things are going to be by, say, the end of 2026. >> Yeah, I feel there's a lot of promise in background agents, or proactive agents, that are going to understand your workflow even better. If you think about where AI is failing to create value today, it's mainly about not understanding the context. And the reason it doesn't understand the context is that it's not plugged into the right places where the actual work is happening, right? As you do more of this, you can give the agent more context, and it starts to see the world around you and understand the set of metrics you're optimizing for or the kinds of activities you're trying to do. From there, it's a very easy extension to actually get more out of it and let the agent prompt you back.
We already do this with ChatGPT Pulse, which gives you a daily update on things you might care about, and it's very nice to have that jog your brain: "Oh, this is something I haven't thought about; maybe this is good." Now extend this to more complex tasks, like a coding agent that says, "Okay, I've fixed five of your Linear tickets, and here are the patches; just review them at the start of your day." I feel that's going to be extremely useful, and I see it as a strong direction in which products are going to be built in 2026. >> That is so cool. So essentially, agents anticipating what you want to do and getting ahead of you: "Here, I've solved these problems for you," or "I think this is going to crash your site; maybe you should fix this thing right here," or "I see a spike here; let's refactor our database." Amazing. What a world. Okay, Ash, what have you got? >> I am all in on multimodal experiences in 2026. I think we made quite some progress in 2025, not just in generation but also in understanding. Until now, LLMs have been our most commonly used models, but as humans we are multimodal creatures. I would say language is probably one of our most recent forms of evolution. As the three of us are talking, we're constantly picking up so many signals: "Oh, Lenny is nodding his head, so I'll go in this direction," or "Lenny's bored, so let me stop talking." So there's a chain of thought behind your chain of thought, and you're constantly altering it. With language alone, that dimension of expression isn't explored as well.
So if we could build better multimodal experiences, that would get us closer to human-like conversational richness. And also, given the kinds of models we have, there are a bunch of boring tasks that are ripe for AI if multimodal understanding gets better. There are so many handwritten documents and really messy PDFs that can't be parsed even by the best models as of today, and if that becomes possible, there will be so much data we can tap into. >> Awesome. I just saw Demis from DeepMind, Google, whatever they call the whole org, talking about this. He thinks that's going to be a big part of where they're going: combining the image model work, the LLM, and also their world model work, Genie, I think it's called. So that's going to be a wild time. Okay. Last question. If someone wants to get better at building AI products, what are one or two skills you think they should lean into and develop? >> I think we covered a bunch of best practices for AI products: start small, get your iteration going well, build a flywheel, and all of that. But again, if you look at it from the 10,000-foot level, for anybody building today, like I was saying, implementation is going to be ridiculously cheap in the next few years. So really nail down your design, your judgment, your taste, and all of that. And in general, if you're building a career, I feel that for the past few years your formative years, say the first two or three years of your career, were always focused on execution mechanics and all of that. Now we have AI that can help you ramp up pretty quickly, and after a few years, I think everybody's job becomes about your taste, your judgment, and what is uniquely you. Nail down that part and try to figure out how you can bring in that kind of perspective.
And it doesn't have to mean that you're significantly older or have years of experience. We recently hired someone, and we use this very popular app for tracking our tasks, right? We've been using it for years, and we pay a high subscription fee for it. And this guy just came to the meeting with his own vibe-coded app. He onboarded us to all of it and said, "Okay, let's start using this." I think that kind of agency and ownership to really rethink experiences is what will set people apart. I'm not blind to the fact that vibe-coded apps have high maintenance costs, and maybe as we scale as a company we'll have to replace it or find better approaches. But we're a small company right now, and I was really shocked, because I had never thought of it. If you've been used to working in a certain way, you associate a cost with building, and I feel folks who grew up in this age associate a much lower cost with it in their minds. They just don't mind building something and going ahead with it. They're also very enthusiastic about trying out new tools; that's probably also why AI products have this retention problem, because everybody's so excited to try out new tools. But essentially, it comes down to agency and ownership. And I think it's also going to be the end of the busy-work era, right? You can't be sitting in a corner doing something that doesn't move the needle for the company. You really need to be thinking about end-to-end workflows and how you can bring in more impact. I think all of that will be super important. >> That reminds me, I just had Jason Lemkin on the podcast. He's very smart on sales and go-to-market, runs SaaStr, and he replaced his whole sales team with agents. He had 10 salespeople; now he has 1.2 and 20 agents.
And one of the agents was just tracking everyone's updates to Salesforce and updating it automatically for them based on their calls. And one of the salespeople was like, "Okay, I quit." And it turns out he wasn't really doing anything. >> He was just sitting around. >> And he's like, "Okay, this will catch me. I've got to get out of here." >> Yes. >> So to your point, it'll be harder to sit around and twiddle your thumbs. I think that's really right. >> Yeah. To add on to that, I feel persistence is also extremely valuable, especially since, for anybody who wants to build something, the information is at your fingertips even more than in the past decade, right? You can learn anything overnight and take that sort of Iron Man approach. So having that persistence, going through the pain of learning this, implementing it, understanding what works and what doesn't, developing multiple approaches, and then solving the problem: I feel that is going to be the real moat for an individual. I like to call it "pain is the new moat," and I feel it's super useful to have, especially when you're building these AI products. >> Say more about this. I love this concept. Pain is the new moat. Is there more there? >> Yeah. Successful companies building in any new area right now are successful not because they're first to market or because they have some fancy feature that customers like. They went through the pain of understanding their set of non-negotiables and trading those off against the features or model capabilities they can use to solve the problem. This is not a straightforward process, right?
There's no textbook for this, no straightforward way or well-trodden path to get there. So a lot of the pain I was talking about is just going through the iteration of "Okay, let's try this, and if it doesn't work, let's try that." The knowledge you build across the organization, or through your own lived experience: that pain is what translates into the moat of the company, right? It could be a product of evals or something else you built, and I feel that is going to be the game changer. >> That is awesome. It's like turning coal into a diamond. >> Diamond. Yes. >> Okay. I feel like we've done a great job helping people avoid some of the biggest issues people consistently run into building AI products. We've covered so many of the pitfalls and the ways to actually do it correctly. Before we get to our very exciting lightning round, is there anything else you want to share or leave listeners with? >> Be obsessed with your customers. Be obsessed with the problem. AI is just a tool, so make sure you really understand your workflows. 80% of so-called AI engineers and AI PMs spend their time actually understanding their workflows very well. They're not building the fanciest, coolest models or the workflows around them. They're actually in the weeds understanding their customers' behavior and data. Whenever a software engineer who's never done AI before hears the phrase "look at your data," I think it's a huge revelation to them, but it's always been the case. You need to go there. Look at your data, understand your users, and that's going to be a huge differentiator. >> That's a great way to close it. AI isn't the answer; it's a tool to solve the problem. With that, we have reached our very exciting lightning round. I've got five questions for both of you. Are you ready? >> Yay. Yes. >> All right.
You can both answer them, or you can pick which ones you want to answer. Either way, up to you. What are two or three books you find yourself recommending most to other people? >> For me, it's a book called When Breath Becomes Air, Lenny. It was written by Paul Kalanithi, an Indian-origin neurosurgeon who was diagnosed with lung cancer in his mid-thirties. The whole book is his memoir, written after he was diagnosed, and it's really beautiful, especially because I read it during COVID, when all we ever wanted to do was stay alive. There are a bunch of really nice quotes in the book, but I remember one where he argues against a very popular quote by Socrates: "The unexamined life is not worth living." That means you really need to think about your choices and understand your values, your mission, and all of that. And Paul says, "If the unexamined life was not worth living, was the unlived life worth examining?" In other words, are you spending so much time understanding your mission and purpose that you've forgotten to live? And I think everybody in the AI era who's building and continuously going through this phase of reinventing themselves needs to take a pause and live for a bit. I guess they need to stop eval-ing life so much. >> What, really? >> I was going to say, that's where my mind went. Generate some evals for your life. >> Oh my god, we've gone too far. >> Yep. Yeah. That's my favorite book. >> I like science fiction books more, so I really like The Three-Body Problem series. It's a three-book series.
It has elements of grand science fiction, life outside Earth and how it impacts the human decision-making process, and it also has elements of geopolitics and how important and valuable abstract science is to human progress. When that gets stopped, it's not noticeable in everyday life, but it can have devastating effects. So I feel AI helping in these areas, for example, is going to be extremely crucial, and that book is a nice example of what could happen otherwise. >> Completely agree. I absolutely love it; it might be my favorite sci-fi book, or series even. And there are three; you have to read all three, by the way. I find that it only got really good about one and a half books in. So if anyone's tried it and thought, "What the heck is going on here?", just keep reading, get to the middle of the second one, and then it gets mind-blowing. >> Yes. >> If you love sci-fi and you're into AI, you've got to read a book called A Fire Upon the Deep by Vernor Vinge. >> Mhm. >> Check it out. It's incredible. I saw Noah Smith recommend it in his newsletter, and there are sequels to it, but this is the one. It's so incredible, and it turns out it's about AGI and superintelligence and all these things. It's just so epic, and no one's heard of it. >> Thank you. >> There you go. I'm giving you one back. Okay, next question. What's a favorite recent movie or TV show that you've really enjoyed? >> I started re-watching Silicon Valley, and I think it's so true. It's so timeless. Everything is repeating all over again. Anybody who watched it a few years ago should start re-watching it, and you'll see that it's eerily similar to everything that's happening right now with the AI wave. >> That's a good idea, to rewatch it. I love that their whole business was a compression algorithm.
It's maybe a precursor to LLMs in some small way. Very good. All right, Kiriti, what have you got? >> I'm going to digress and say not a movie or a TV show, but there's a game I picked up recently called Expedition 33. It has nothing to do with AI, but it's an incredibly well-made game in terms of the gameplay, the cinematics, the story, and the music. It's been amazing. >> I love that you have time to play games. That's a great sign. At OpenAI, I'm just imagining there's nothing else going on except coding and... >> Yeah, it has been incredibly hard to find time for that. >> That's good. That's a good sign. I'm happy to hear this. Okay. What's a favorite product that you've recently discovered that you really love? >> For me, it's Wispr Flow. I've been using it quite a bit, and I didn't know I needed it so much. The best part is that it's a contextual transcription tool, which means if you go to Codex and start using Wispr Flow, it starts identifying variables and all of that. And it's so seamless in going from transcription to instruction: you could say something like, "I'm so excited today, add three exclamation marks," and it seamlessly adds those three exclamation marks instead of writing out "add three exclamation marks." I think it's pretty cool. If you're not using it, you should try it. >> I'll do a plug. Get Wispr Flow free for an entire year by becoming an annual subscriber of my newsletter. >> And that's how I got access to it, Lenny. >> There we go. I pitched this deal. I think people don't truly understand how incredible this is. They're like, "No way. Is this real?" It's real. And 18 other products. lennysproductpass.com. Check it out. Moving on. Kiriti? >> Awesome. I'm actually a stickler for productivity. I keep experimenting with new CLI tools and other things that can make me faster.
So I feel Raycast has been amazing. I've discovered all these new shortcuts you can use to open different things, type in shortcut commands, and so on. And caffeinate is another thing I recently discovered from my teammates. It prevents your Mac from sleeping, so you can run a really long Codex task for four or five hours locally, let it build the thing, and then wake up and be like, "Okay, this is good. I like this." >> That's hilarious. That combo, Codex and caffeinate. You guys need to build that yourselves: an OpenAI version of it, or the Codex agent should just keep your Mac from sleeping. That's so funny. By the way, Raycast is also part of Lenny's Product Pass. One year of Raycast free. >> Lenny didn't tell us about these, folks. These are actually our favorites. >> These are just two of 19 products. No caffeinate, though. I don't know if that's even paid. Okay, let's keep going. Do you have a favorite life motto that you find yourself coming back to in work or in life? >> For me, it's something my dad told me when I was a kid, and it has always stuck: "They told him it couldn't be done, but the fool didn't know it, so he did it anyway." Be foolish enough to believe that you can do anything if you put your heart into it. Especially now, because you have so much data at hand pointing to the fact that you probably will be unsuccessful.
Look at how many podcasts made it to more than a thousand subscribers, or how many companies hit more than 1 million. There's always data to show you that you won't be successful, but sometimes just be foolish and go ahead with it. >> That's great. >> Yeah, for me, I'm more of an overthinker, so I really like this quote from Steve Jobs: you can only connect the dots looking backwards. A lot of the time there are numerous choices and you don't really know the optimal one to pick, but life works in a way that lets you look back and say, "Oh, these were actually beautiful in terms of how I transitioned." So I feel that's extremely useful: keep moving forward, keep experimenting. >> Final question. Whenever I have two guests on the podcast at once, I like to ask this question. What's something that you admire about the other person? >> With Kiriti, he's pretty calm and very grounded. He's always been my sounding board. I can throw a ton of ideas at him, and he's able to anticipate the kinds of issues we might run into. He's extremely kind and lets his work speak instead of doing a lot of talking, I guess. But if I had to pick one, I think he's the most incredible husband. >> Reveal! Little did people know. >> Yeah. We've been married for four years, and they've been the most beautiful four years of my life. >> Oh wow. Okay. How do you follow that? >> Yeah, it's super hard to follow that. I would say I'm extremely privileged to have worked with really smart people at great companies in Silicon Valley.
And the unique thing that sets Aishwarya apart from any other smart folks I've worked with is that she has this amazing knack for teaching and explaining things in a very understandable, easy-to-comprehend way. Combined with persistence, that's super useful, especially in this fast-moving AI world we're in. There are so many new things coming up that it feels overwhelming, but when I hear her talk, "This is how you make sense of this entire thing; this is where it plugs in," I feel like, "Oh, that's so simple. I can do that too." So she empowers a lot of people by simplifying things and explaining them in the most understandable way. I feel that is an incredible quality. >> Amazing. How sweet. I've got to do this all the time. I need more of this. That was great. Okay. Final questions. Where can folks find the stuff you're working on and find you online? Share your course link, and then tell us how listeners can be useful to you. >> I write a lot on LinkedIn. So if you want to hear from pragmatists who've been in the weeds working on AI products about what they're seeing, you can follow my work. We also have a GitHub repository with about 20K stars, all about good resources for learning AI. It's completely free. And if you liked what we talked about today, we also run a super popular course on building enterprise AI products; we'll leave a link to it. The course is a lot about unlearning mindsets and following a problem-first approach instead of a tool-first or hype-first approach. So you can check that out as well. And if you don't want to do the course, we write a lot, we give out a lot of free resources, and we have free sessions. So make sure you follow our work. >> Yeah, I would also add that you can find me on LinkedIn.
I don't write a lot, I guess, but I'm super excited to talk about any complex product you're building. And if you have thoughts on how to use coding agents to make your life better, or on the problems you're seeing, my DMs are always open and we can have a great discussion. >> Awesome. Well, Kiriti and Ash, thank you so much for being here. >> Thank you so much. >> Thank you, Lenny. This was so much fun. >> So much fun. Bye, everyone. >> Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.