DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners

Thanks everybody for joining. I'm here to talk to you today about DSPy. Feel free to jump in with questions or anything throughout the talk. I don't plan on spending the full hour and a half or so. I know it's the last session of the day, so we'll keep it casual. I'll start with a little bit of background. I don't want to go through too many slides, but I'm technically a consultant, so I have to do some slides. We will dive into the code for the latter half, and there's a GitHub repo that you can download to follow along and play around with on your own. So how many people here have heard of DSPy? Almost everyone, that's awesome. How many people have actually used it day-to-day in production or anything like that? Three. OK, good. So hopefully we can convert some more of you today. So, high level, DSPy, and this is straight from the website: it's a declarative framework for how you can build modular software. And that's most important for someone like myself. I'm not necessarily an engineer who is writing code all day, every day. As I mentioned before, I'm more of a technical consultant, so I run across a variety of different problems. It could be an investigation for a law firm. It could be helping a company understand how to improve their processes or how to deploy AI internally. Maybe we need to look through 10,000 contracts to identify a particular clause or paragraph. And so DSPy has been a really nice way for me personally and my team to iterate really, really quickly in building these applications. Most importantly, building programs. It's not iterating with prompts and tweaking things back and forth. It is building a proper Python program, and DSPy is a really good way for you to do that. So I mentioned before, there's a repo online. If you want to download it now and get everything set up, I'll put this on the screen later on. But if you want to go here, just download some of the code. It's been put together over the past couple of days.
It's not going to be perfect production-level code. It's much more utilities and little things here and there to demonstrate the usefulness, to demonstrate the point of what we're talking about today. In that, I will walk through all of these different use cases: a sentiment classifier, going through a PDF, some multimodal work, a very, very simple web research agent, detecting boundaries of a PDF document. You'll see how to summarize basically arbitrary-length text, and then we'll go into an optimizer with GEPA. But before we do that, just to level set: the biggest thing for me personally is that DSPy is a really nice way to decompose your logic into a program that treats LLMs as a first-class citizen. So at the end of the day, you're fundamentally just calling a function that, under the hood, just happens to be an LLM. And DSPy gives you a really nice, intuitive, easy way to do that with some guarantees about the input and output types. Of course, there are structured outputs. Of course, there are other ways to do this, Pydantic and others. But DSPy has a set of primitives that, when you put it all together, allows you to build a cohesive, modular piece of software that you then happen to be able to optimize. And we'll get into that in a minute. So just a few reasons why I'm such an advocate. It sits at this really nice level of abstraction. I would say it doesn't get in your way as much as a LangChain. And that's not a knock on LangChain. It's just a different kind of paradigm in the way that DSPy is structured. And it allows you to focus on things that actually matter. So you're not writing choices[0].message.content. You're not doing string parsing. You're not doing a bunch of stuff under the hood. You're just declaring your intent of how you want the program to operate and what you want your inputs and outputs to be.
Because of this, it allows you to create computer programs, as I mentioned before, not just tweak strings and send them back and forth. You are building a program first. It just happens to also use LLMs. And really the most important part of this: Omar Khattab, the founder of this, or the original developer of it, had this really good podcast with a16z, I think just two or three days ago, and he put it a really nice way. He said it's built with a systems mindset. It's really about how you are encoding or expressing your intent of what you want to do, most importantly in a way that's transferable. So the design of your system, I would imagine, or your program, isn't going to move necessarily as quickly as the model capabilities are under the hood. We're seeing releases almost every single day, different capabilities, better models. And so DSPy allows you to structure it in a way that retains the control flow, that retains the intent of your system, your program, while allowing you to bounce from model to model to the extent that you want to or need to. Convenience comes for free. There's no parsing JSON, things like that. Again, it sits at a nice level of abstraction where you can still understand what's going on under the hood. If you want to, you can go in and tweak things. But it allows you to focus on just what you want to do, while retaining the level of precision that I think most of us would like to have in building our programs. As mentioned, it's robust to model and paradigm shifts. So you can, again, keep the logic of your program, but keep those LLM calls infused basically inline. Now, that being said, there are absolutely other great libraries out there: Pydantic AI, LangChain, and there are many, many others that allow you to do similar things. Agno is another one. This is just one perspective, and it may not be perfect for your use case. For me, it took a little bit to grok how DSPy works.
And you'll see why that is in a minute. So I would just recommend keeping an open mind. Play with it, run the code, tweak the code, do whatever you need to do, and just see how it might work for you. And really, this talk is more about ways that I've found it useful. It's not a dissertation on the ins and outs of every nook and cranny of DSPy. It's more that I've run into these problems myself, now I naturally reach for DSPy to solve them, and this is why. And the hope is that you can extrapolate some of this to your own use cases. So we'll go through everything fairly quickly here. But the core concepts of DSPy really come down to arguably five, or these six that you see on the screen here. We'll go into each of these in more detail, but at a high level, signatures specify what you want your function call to do. This is where you specify your inputs and your outputs. Inputs and outputs can both be typed. And you defer the rest, basically the how, the implementation of it, to the LLM. We'll see how that all comes together in a minute. Modules themselves are ways to logically structure your program. They're based off of signatures, so a module can have one or more signatures embedded within it, in addition to additional logic. And it's based off of PyTorch in terms of the methodology for how it's structured. You'll see how that comes to be in a minute. Tools: we're all familiar with tools, MCP, and others. And really, tools, fundamentally, as DSPy looks at them, are just Python functions. So it's just a way for you to very easily expose Python functions to the LLM within the DSPy ecosystem, if you will. Adapters live in between your signature and the LLM call itself. As we all know, prompts are ultimately just strings of text that are sent to the LLM. Signatures are a way for you to express your intent at a higher level. And so adapters are the things that sit in between those two.
So it's how you translate your inputs and outputs into a format; it basically explodes out from your initial signature into the format that is ultimately the prompt that is sent to the LLM. There's some debate, or some research, on whether certain models perform better with XML, as an example, or BAML, or JSON, or others. And so adapters give you a nice, easy abstraction to basically mix and match those at will. Optimizers are the most interesting and, for whatever reason, the most controversial part of DSPy. This is kind of the first thing that people think of; when they hear of DSPy, they think optimizers. We'll see a quote in a minute. It's not optimizers first. It is just a nice added benefit, a nice capability that DSPy offers in addition to the ability to structure your program with signatures and modules and everything else. And metrics are used in tandem with optimizers; they basically define how you measure success in your DSPy program. So the optimizers use the metrics to determine if they're finding the right path, if you will. So, signatures. As I mentioned before, it's how you express your intent, your declarative intent. It can be super simple strings, and this was the weirdest part for me initially, but it's one of the most powerful parts of it now. Or it can be more complicated class-based objects. If you've used Pydantic, that's basically what it runs on under the hood. So this is an example of one of the class-based signatures. Again, it's basically just a Pydantic object. What's super interesting about this is that the names of the fields themselves act almost as mini prompts. They're part of the prompt itself. And you'll see how this comes to life in a minute. But what's ultimately passed to the model from something like this is, it will say, OK, your inputs are going to be a parameter called text. And it's based off of the name of that particular parameter in this class. And so these things are actually passed through.
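To make the "field names act as mini prompts" idea concrete, here is a minimal sketch of how a declarative signature could be expanded into prompt text. It is a toy stand-in, not DSPy's own classes or its real adapter output: `ToySignature`, `Field`, and `render_prompt` are hypothetical names, and the wording of the rendered prompt is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One input or output of a signature: a name, a type, and an optional description."""
    name: str
    type_name: str = "str"
    desc: str = ""


@dataclass
class ToySignature:
    """Hypothetical stand-in for a class-based signature (not DSPy's own class)."""
    instructions: str
    inputs: list[Field] = field(default_factory=list)
    outputs: list[Field] = field(default_factory=list)


def render_prompt(sig: ToySignature) -> str:
    """Turn the declarative signature into the text a model would actually see."""
    def describe(fields):
        return "\n".join(
            f"- {f.name} ({f.type_name})" + (f": {f.desc}" if f.desc else "")
            for f in fields
        )

    return (
        f"{sig.instructions}\n\n"
        f"Your input fields are:\n{describe(sig.inputs)}\n\n"
        f"Your output fields are:\n{describe(sig.outputs)}"
    )


sentiment = ToySignature(
    instructions="Classify the sentiment of the text.",
    inputs=[Field("text", "str", "the passage to classify")],
    outputs=[Field("sentiment", "int", "lower is more negative, higher is more positive")],
)
prompt = render_prompt(sentiment)
print(prompt)
```

The point of the sketch is only that the field names and descriptions end up verbatim in the prompt, which is why naming them clearly matters.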
And so it's very important to name your parameters in a way that is intuitive for the model to pick up. And you can add some additional context, or what have you, in the description field here. So most of this, if not all of this, yes, it is proper typed Python code. But it also serves almost as a prompt, ultimately, that feeds into the model. And that's basically translated through the use of adapters. And just to highlight here, the ones that are a little bit darker and bold, those are the things that are effectively part of the prompt that's been sent in. And you'll see how DSPy works with all this and formats it in a way that, again, allows you to just worry about what you want, worry about constructing your signature, instead of figuring out how best to word something in the prompt. Go ahead. Sure. Yeah. That's exactly right. So the question, for folks online, is: what if I already have a great prompt? I've done all this work. I'm an amazing prompt engineer. I don't want my job to go away, or whatever. Yes. So you can absolutely start with a custom prompt or something that you have demonstrated works really well. And you're exactly right, that can be done in the docstring itself. There are some other methods for you to inject system instructions or add additional things at certain parts of the ultimate prompt. And, of course, you can just inject it into the final string anyway. It's just a string that is constructed by DSPy. So absolutely, this does not prevent you from adding in some super prompt that you already have. And to your point, it can serve as a nice starting point from which to build the rest of the system. Here's a shorthand version of the same exact thing, which, the first time I saw it, was baffling to me. But that's exactly how it works.
That's because you're basically, again, deferring the implementation of the logic, or what have you, to DSPy and the model to figure out what you want to do. So in this case, if I want a super, super simple sentiment classifier, this is basically all you need. You're just saying, OK, I'm going to give you text as an input, and I want the sentiment as an integer as the output. Now, you probably want to specify some additional instructions to say, OK, for your sentiment, a lower number means negative, a higher number is more positive sentiment, et cetera. But it gives you a nice, easy way to scaffold these things out so that you don't have to worry about creating this whole prompt by hand. It's like, OK, I just want to see how this works. And then if it works, I can add the additional instructions and I can create a module out of it, or whatever it might be. It is this shorthand that makes experimentation and iteration incredibly quick. So, modules. It's the base abstraction layer for DSPy programs. There are a bunch of modules that are built in, and these are a collection of prompting techniques, if you will. And you can always create your own module. So to the question before, if you have something that you know works really well, sure, put it in a module. That's now the base assumption, the base module that others can build off of. And all of DSPy is meant to be composable and optimizable. When you deconstruct your business logic, or whatever you're trying to achieve, using these different primitives, it's intended to fit together and flow together. We'll get to optimizers in a minute, but at least in my and my team's experience, just being able to logically separate the different components of a program while basically inlining LLM calls has been incredibly powerful for us.
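The shorthand signature described above, text in and sentiment out as an integer, is usually written as the string "text -> sentiment: int". As a rough illustration of what such a string encodes, here is a toy parser. It is not DSPy's parser; `parse_shorthand` is a hypothetical helper, and it deliberately ignores cases a real parser has to handle, such as types that themselves contain commas.

```python
def parse_shorthand(signature: str) -> dict:
    """Split 'a, b: int -> c: float' into typed input and output fields."""
    inputs_part, outputs_part = signature.split("->")

    def parse_fields(part: str) -> dict:
        fields = {}
        for item in part.split(","):
            name, _, type_name = item.partition(":")
            # Fields without an explicit type default to str in this sketch.
            fields[name.strip()] = type_name.strip() or "str"
        return fields

    return {"inputs": parse_fields(inputs_part), "outputs": parse_fields(outputs_part)}


parsed = parse_shorthand("text -> sentiment: int")
print(parsed)
```

In DSPy itself, the same string would be handed to a predictor, for example `dspy.Predict("text -> sentiment: int")`, and the library takes care of turning it into a prompt and parsing the typed result.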
And it's just an added benefit that, because we're in the DSPy paradigm, we happen to also be able to optimize it at the end of the day. So it comes with a bunch of standard modules built in. I don't use some of these bottom ones as much, although they're super interesting. The base one at the top there is just dspy.Predict. That's literally just an LLM call, just a vanilla call. Chain of thought probably isn't as relevant anymore these days because models have kind of ironed that out. But it is a good example of the types of prompting techniques that can be built into some of these modules. Basically all it does is add some of the strings from the literature to say, OK, let's think step by step, or whatever that might be. Same thing for ReAct and CodeAct. ReAct is basically the way that you expose tools to the model. It's wrapping and doing some things under the hood, taking your signatures and injecting the Python functions that you've given it as tools. Basically, ReAct is how you do tool calling in DSPy. Program of Thought is pretty cool. It kind of forces the model to think in code and then return the result. It comes with a Python interpreter built in, but you can give it a custom one, some type of custom harness, if you want to. I haven't played with that one too much, but it is super interesting. If you have a highly technical problem or workflow where you want the model to inject reasoning in code at certain parts of your pipeline, that's a really easy way to do it. And then some of these other ones are basically just different methodologies for comparing outputs or running things in parallel. So here's what one looks like. Again, it's fairly simple. It's a Python class at the end of the day. You do some initialization up top. In this case, you're seeing the shorthand signature up there.
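Since the slide itself is not in the transcript, here is a rough sketch of the module shape being described: a plain Python class with initialization up top and a `forward` method that mixes a model call with ordinary code. It uses the time-entry cleanup the speaker goes on to describe. The class name, the rules, and `stub_predictor` are all hypothetical; the stub replaces the LLM call so the example runs without a model, whereas a real module would subclass `dspy.Module` and hold a predictor such as `dspy.Predict`.

```python
from typing import Callable


def stub_predictor(entry: str) -> str:
    """Stand-in for an LLM call; a real module would call a predictor here."""
    return entry.replace("reviewing", "reviewed")


class TimeEntryCleaner:
    """Hypothetical module: one model call followed by deterministic rules."""

    def __init__(self, predictor: Callable[[str], str] = stub_predictor):
        # Initialization: declare which predictor this module uses.
        self.change_tense = predictor

    def forward(self, entry: str) -> str:
        # Step 1: the "LLM" step (stubbed) rewrites the tense.
        text = self.change_tense(entry).strip()
        # Step 2: hard-coded business logic that needs no model.
        if text:
            text = text[0].upper() + text[1:]
        if text and not text.endswith("."):
            text += "."
        return text

    def __call__(self, entry: str) -> str:
        return self.forward(entry)


cleaner = TimeEntryCleaner()
result = cleaner("reviewing contract for indemnification clause")
print(result)
```

Keeping the model call and the deterministic rules in one `forward` method is the design choice worth noticing: only the part that genuinely needs language understanding is delegated to the model.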
So this module, just to give you some context, is an excerpt from one of the Python files in the repo. It's basically taking in a bunch of time entries and making sure that they adhere to certain standards, making sure that things are capitalized properly, or that there are periods at the end of the sentences, or whatever it might be. That's from a real client use case where they had hundreds of thousands of time entries and needed to make sure that they all adhered to the same format. This was one way to do that very elegantly, at least in my opinion. Up top, you define the signature. It's adding some additional instructions that were defined elsewhere, and then saying that for this module, the change-tense call is going to be just a vanilla predict call. And then when you actually call the module, you enter the forward function, where you can basically intersperse the LLM call, which would be the first one, and then do some hard-coded business logic beneath it. Tools, as I mentioned before, are just vanilla Python functions, wrapped in DSPy's tool interface. Under the hood, DSPy uses LiteLLM, so there needs to be some kind of coupling between the two, but fundamentally, any type of tool that you would use elsewhere, you can also use in DSPy. This is probably obvious to most of you, but here's just an example. You have two functions, get_weather and search_web, and you include them with a signature. So in this case, I'm saying the signature is: I'm going to give you a question, please give me an answer. I'm not even specifying the types. It's going to infer what that means. I'm giving it the get_weather and search_web tools. And I'm saying, OK, do your thing, but only go five rounds, just so it doesn't spin off into something crazy. And then the call here is literally just calling the ReAct agent that I created above with the question, what's the weather like in Tokyo?
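To illustrate why "tools are just Python functions" is enough, here is a small sketch of how a function's name, docstring, and type hints can be collected into a description a model could choose from, and how a requested call is dispatched. The function bodies are hard-coded placeholders and `describe_tool` is a hypothetical helper, not DSPy's tool interface.

```python
import inspect


def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    return f"Sunny in {city}"  # Hard-coded placeholder, not a real lookup.


def search_web(query: str) -> str:
    """Search the web and return a short summary."""
    return f"Results for {query}"  # Hard-coded placeholder.


def describe_tool(fn) -> dict:
    """Collect what a model needs to know to call a plain Python function."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "args": {name: p.annotation.__name__ for name, p in sig.parameters.items()},
    }


tools = {fn.__name__: fn for fn in (get_weather, search_web)}
catalog = [describe_tool(fn) for fn in tools.values()]
print(catalog)

# Dispatching a tool call the model might request, by name and keyword arguments.
answer = tools["get_weather"](city="Tokyo")
print(answer)
```

In DSPy, functions like these would be passed to `dspy.ReAct` together with a signature such as "question -> answer" and an iteration cap (`max_iters=5` would match the "only go five rounds" limit mentioned above), and the library handles the description and dispatch.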
We'll see an example of this in the code session, but basically what this would do is give the model the prompt and the tools and let it do its thing. So, adapters. I covered this a little bit before; they're basically prompt formatters, if you will. The description from the docs probably says it best. An adapter takes your signature, the inputs, and other attributes, and converts them into some type of message format that you have specified or that the adapter has specified. As an example, with the JSON adapter, taking, say, a Pydantic object like we defined before, this is the actual prompt that's sent to the LLM. You can see the input fields. So this would have been defined as, OK, clinical note, of type string, and patient info, as a patient details object, which would have been defined elsewhere. And then this is the definition of the patient info. It's basically a JSON dump of that Pydantic object. Go ahead. So there's a base adapter by default? Yeah. That's good for most cases. That's right. Yeah, the question was whether there's a base adapter, and would this be an example of where you'd want to do something specific? The answer is yes. There's a guy, Prashanth, and I have his Twitter at the end of this presentation, who's been great. He did some testing comparing the JSON adapter with the BAML adapter. And you can see, just intuitively, even for us humans, that the way this is formatted is a little bit more intuitive. It's probably more token efficient too. If you compare the messy JSON that's here with the, I guess, slightly better formatted BAML that's here, it can actually improve performance by five to ten percent, depending on your use case. So it's a good example of how you can format things differently. The rest of the program wouldn't have changed at all. You'd just specify the BAML adapter, and it totally changes how the information is presented under the hood to the LLM.
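As a rough illustration of why the formatting choice matters, the sketch below renders the same nested schema two ways: a verbose JSON-Schema-like dump and a compact, type-definition-like layout loosely inspired by BAML. Neither is the exact output of DSPy's JSON or BAML adapters, and the individual fields inside `PatientAddress` and `PatientDetails` are made up for the example. Character count is only a crude proxy for token count.

```python
import json
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class PatientAddress:
    street: str
    city: str
    zip_code: str


@dataclass
class PatientDetails:
    name: str
    age: int
    address: PatientAddress


def json_style(cls) -> dict:
    """Verbose, JSON-Schema-like description of a dataclass."""
    props = {}
    for f in fields(cls):
        if is_dataclass(f.type):
            props[f.name] = json_style(f.type)
        else:
            props[f.name] = {"type": f.type.__name__}
    return {"type": "object", "properties": props, "required": [f.name for f in fields(cls)]}


def compact_style(cls, indent: int = 0) -> str:
    """Compact, type-definition-like description (loosely BAML-inspired)."""
    pad = "  " * indent
    lines = ["{"]
    for f in fields(cls):
        if is_dataclass(f.type):
            lines.append(f"{pad}  {f.name}: {compact_style(f.type, indent + 1)},")
        else:
            lines.append(f"{pad}  {f.name}: {f.type.__name__},")
    lines.append(pad + "}")
    return "\n".join(lines)


verbose = json.dumps(json_style(PatientDetails), indent=2)
compact = compact_style(PatientDetails)
print(compact)
print(len(verbose), "characters vs", len(compact), "characters")
```

Both renderings describe exactly the same structure; only the presentation to the model changes, which is the property that lets an adapter be swapped without touching the rest of the program.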
Multimodality. I mean, this obviously is more at the model level, but DSPy supports multiple modalities by default, so images, audio, and some others. And it's the same type of thing: you just feed it in as part of your signature, and then you can get some very nice, clean output. This allows you to work with them very easily, very quickly. And for those eagle-eyed participants, you can see the first line up there is attachments. It's probably a lesser-known library. Another guy on Twitter, who is awesome, Maxime, I think it is, created this library that is basically a catch-all for working with different types of files and converting them into a format that's super easy to use with LLMs. He's a big DSPy fan as well, so he made basically an adapter that's specific to this. But that's all it takes to pull in images, PDFs, whatever it might be. You'll see some examples of that. It has made my life super, super easy. Here's another example of the same sort of thing. This is a PDF of a Form 4, a public SEC form from Nvidia. Up top, I'm just giving it the link. I'm saying, OK, attachments, do your thing: pull it down, create images, whatever you're going to do. I don't need to worry about it. I don't care about it. This is super simple RAG. But basically, OK, I want to do RAG over this document. I'm going to give you a question, I'm going to give you the document, and I want the answer. And you can see how simple that is, literally just feeding in the document. How many shares were sold? Interestingly, here, and I'm not sure if it's easy to see, you actually have two transactions. So it's going to have to do some math, likely, under the hood. And you can see here the thinking and the ultimate answer. Go ahead. On the RAG step, is it creating a vector store of some kind, or creating embeddings and searching over those? Is there a bunch going on in the background? This is poor man's RAG. I should have clarified.
This is literally just pulling in the document images. And I think attachments will do some basic OCR under the hood, but it doesn't do anything other than that. That's all we're feeding in here. The actual document object that's being fed in is literally just the text that's been OCRed and the images. The model does the rest. All right. So, optimizers. Let's see how we're doing. OK. Optimizers are a super powerful, super interesting concept. There's been some research that argues, I think, that it's just as performant, if not, in certain situations, more performant than fine-tuning would be, for certain models, for certain situations. There's all this research about in-context learning and such. And so if you want to go fine-tune and do all that, nothing stops you. But I would recommend at least trying this first to see how far you can get without having to set up a bunch of infrastructure and go through all of that. See how the optimizers work. Fundamentally, what it allows you to do is this: DSPy gives you the primitives you need and the organization you need to be able to measure and then quantitatively improve that performance. And I mentioned transferability before. This transferability is enabled, arguably, through the use of optimizers. Say I have a classification task that works really well with 4.1, but maybe it's a little bit costly because I'm running it a million times a day. Can I try it with 4.1 nano? OK, maybe it's at 70%, whatever it might be. But I run the optimizer on 4.1 nano, and I can get the performance back up to maybe 87%. Maybe that's OK for my use case. But I've now just dropped my cost profile by multiple orders of magnitude. And it's the optimizer that allows you to do that type of model and use case transferability, if you will. But really, all it does at the end of the day, under the hood, is iteratively optimize or tweak that prompt, that string under the hood.
And because you've constructed your program using the different modules, DSPy handles all of that for you under the hood. So if you compose a program with multiple modules and you're optimizing against all of that, DSPy will by itself optimize the various components in order to improve the overall input-to-output performance. And we'll take it from the man himself, Omar: DSPy is not an optimizer. I've said this multiple times. It's just a set of programming abstractions, or a way to program. You just happen to be able to optimize it. So again, the value that I've gotten, and my team has gotten, is mostly because of the programming abstractions. It's just this incredible added benefit that you are also able, should you choose, to optimize it afterwards. I was listening to Dwarkesh and Karpathy the other day, and as I was prepping for this talk, this hit home perfectly. I was thinking about the optimizers. And someone smarter than me can please correct me, but I think this makes sense. He was basically talking about how using an LLM as a judge can be a bad thing, because the model being judged can find adversarial examples and degrade the performance, or basically create a situation where the judge is not scoring something properly. He's saying that the model will find these little cracks. It'll find these little spurious things in the nooks and crannies of the giant model and find a way to cheat it. Basically, LLM as a judge can only go so far until the other model finds those adversarial examples. If you invert that and flip it on its head, it's this property that the optimizers for DSPy are taking advantage of: they find the nooks and crannies in the model, whether it's a bigger model or a smaller model, to improve the performance against your data set. So that's what the optimizer is doing. It's finding these nooks and crannies in the model to optimize and improve that performance.
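The idea of "iteratively tweak the prompt and keep what scores better" can be sketched in a few lines. This is a deliberately tiny toy, assuming a fixed list of candidate instructions and a fake model; real optimizers such as GEPA generate and refine candidates themselves and are far more elaborate. `toy_model`, `evaluate`, and `optimize` are hypothetical names.

```python
def toy_model(instruction: str, text: str) -> str:
    """Stand-in for an LLM whose behavior depends on the prompt wording."""
    label = "negative" if "stinks" in text or "awful" in text else "positive"
    # This fake model only answers in lowercase when told to.
    return label if "lowercase" in instruction else label.upper()


def exact_match(expected: str, predicted: str) -> float:
    return 1.0 if expected == predicted else 0.0


def evaluate(instruction: str, dataset) -> float:
    """Average metric score of one candidate prompt over the dataset."""
    scores = [exact_match(label, toy_model(instruction, text)) for text, label in dataset]
    return sum(scores) / len(scores)


def optimize(candidates, dataset):
    """Keep whichever candidate instruction scores best on the metric."""
    best, best_score = None, -1.0
    for instruction in candidates:
        score = evaluate(instruction, dataset)
        if score > best_score:
            best, best_score = instruction, score
    return best, best_score


dataset = [("This hotel stinks", "negative"), ("I'm feeling pretty happy", "positive")]
candidates = [
    "Classify the sentiment.",
    "Classify the sentiment. Answer with one lowercase word.",
]
best_instruction, best_score = optimize(candidates, dataset)
print(best_instruction, best_score)
```

The second instruction wins here only because it happens to match a quirk of the fake model, which is a small-scale version of the "nooks and crannies" point: the search rewards whatever wording this particular model responds to, as measured on your data.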
So, a typical flow. I'm not going to spend too much time on this, but it's fairly logical. You construct your program, which is decomposing your logic into the modules. You use your metrics to define basically the contours of how the program works. And you optimize all of that through to get your final result. There's another talk that Chris Potts just gave, maybe two days ago, where he made the point, and this is what I was mentioning before, that with GEPA, and you probably saw some of the talks the other day, the optimizers are on par with or exceed the performance of something like GRPO and other fine-tuning methods. So, pretty impressive. I think it's an active area of research. People that are smarter than me, like Omar and Chris and others, are leading the way on this. But the point being, I think prompt optimization is a pretty exciting place to be and, if nothing else, is worth exploring. And then finally, metrics. Again, these are the building blocks that allow you to define what success looks like for the optimizer. This is what it's using. And you can have many of these. We'll see examples of this where, again, at a high level, your program works on inputs and produces outputs, and the optimizer is going to use the metrics to understand: OK, my last tweak to the prompt, did it improve performance? Did it degrade performance? The way you define your metrics provides that direct feedback for the optimizer to work on. So here's another example, a super simple one from that time entry example I mentioned before. Metrics can either be fairly rigorous, in terms of does this equal one, or some type of equality check, or a little bit more subjective, where you're using an LLM as a judge to say whether this generated string adheres to these various criteria, whatever they might be. But that itself can be a metric.
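The two kinds of metric mentioned above, a strict equality check and a softer quality score, can be sketched as plain functions. The `(example, pred, trace=None)` shape follows the convention DSPy uses for metrics as I understand it; the field names `sentiment` and `cleaned_entry` and the specific formatting rules are assumptions for the example, and a rule-based score stands in for where an LLM judge might be used.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None) -> bool:
    """Rigorous check: the predicted label must equal the gold label."""
    return example.sentiment == pred.sentiment


def format_metric(example, pred, trace=None) -> float:
    """Softer check: fraction of formatting rules a cleaned time entry satisfies."""
    text = pred.cleaned_entry
    checks = [
        text[:1].isupper(),      # starts with a capital letter
        text.endswith("."),      # ends with a period
        "  " not in text,        # no double spaces
    ]
    return sum(checks) / len(checks)


gold = SimpleNamespace(sentiment=1)
good = SimpleNamespace(sentiment=1, cleaned_entry="Reviewed the contract.")
bad = SimpleNamespace(sentiment=8, cleaned_entry="reviewed the  contract")

print(exact_match_metric(gold, good), exact_match_metric(gold, bad))
print(format_metric(gold, good), format_metric(gold, bad))
```

A graded score like `format_metric` gives an optimizer more signal than a pass/fail check, because a prompt tweak that fixes one rule out of three still registers as progress.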
And so all of this is a very long-winded way of saying that, in my opinion, this is probably most, if not all, of what you need to construct arbitrarily complex workflows, data processing pipelines, business logic, whatever that might be, different ways to work with LLMs. If nothing else, DSPy gives you the primitives that you need in order to build these modular, composable systems. If you're interested, here are some people to follow online. There are many, many more, and there's a Discord community as well. But usually these people are on top of the latest and greatest, so I would recommend giving them a follow. You don't need to follow me. I don't really do much. But the others on there are really pretty good. OK, so the fun part: we'll actually get into some code. If you haven't had a chance, now's your last chance to get the repo. I'll just go through a few different examples here of what we talked about. OK, so I'll set up Phoenix, which is from Arize, and which is basically an observability platform. I just did this today, so I don't know if it's going to work or not. But we'll see. We'll give it a shot. Basically, what this allows you to do is have a bunch of observability and tracing for all the calls that are happening under the hood. We'll see if this works. We'll give it another five seconds. But it should, I think, automatically do all this stuff for me. Yeah, so let's see. Yeah, all right, so something's up. OK, cool. So I'm just going to run through the notebook, which is a collection of different use cases, basically putting into practice a lot of what we just saw. Feel free to jump in with any questions, anything like that. We'll start with this notebook. There are a couple of other, more proper Python programs that we'll walk through afterwards. But really, the intent is a rapid-fire review of different ways that DSPy has been useful to me and others. So, load in the .env file.
Usually, I'll have some type of config object like this, where I can very easily use these later on. I like what I call model mixing. So if I have a super hairy problem, or some workload that I know will need the power of a reasoning model like GPT-5 or something else like that, I'll define multiple LMs. One will be 4.1, one will be 5. Maybe I'll do a 4.1 nano, Gemini 2.5 Flash, stuff like that. And then I can intermingle or intersperse them, depending on what I think, or what I'm reasonably sure, the workload will be. You'll see how that comes into play in terms of classification and others. I'll pull in a few others here. I'm using OpenRouter for this, so if you have an OpenRouter API key, I would recommend plugging that in. So now I have three different LLMs I can work with. I have Claude, I have Gemini, and I have 4.1 mini. And then I'll ask each of them, basically, who's best between Google, Anthropic, and OpenAI. All of them are hedging a little bit. They say subjective, subjective, undefined. All right, great. It's not very helpful. But because DSPy works on Pydantic, I can define the answer as a Literal. So I'm basically forcing it to only give me those three options. And then I can go through each of those, and you can see each of them, of course, chooses their own organization. The reason those came back so fast is that DSPy has caching built in automatically under the hood. So as long as nothing has changed in terms of your signature definitions, or basically, if nothing has changed, and this is super useful for testing, it will just load it from the cache. I ran this before, and that's why those came back so quickly. But that's another super useful piece here. Let's see. OK. Let's make sure we're up and running. So if I change this to hello with a space, you can see we're making a live call. OK, great. We're still up. So, a super simple sentiment classifier. Obviously, this can be built into something arbitrarily complex. Let me make this a little bit bigger.
But I'm basically giving it the text and the sentiment that you saw before, and I'm adding that additional specification to say, OK, lower is more negative, higher is more positive. I'm going to define that as my signature. I'm going to pass this into just a super simple Predict object. And then I'm going to say, OK, well, this hotel stinks. OK, it's probably pretty negative. Now, if I flip that to "I'm feeling pretty happy". Whoops. Good thing I'm not in a hotel right now. You can see "I'm feeling pretty happy" comes out to 8. And this might not seem that impressive, and it's not really. But the important part here is that it demonstrates the use of the shorthand signature. So I have the string, I have the integer, and I pass in the custom instructions, which would be in the docstring if I used the class-based method. The other interesting and useful part about DSPy is that it comes with a bunch of usage information built in. Because this one is cached, it's going to be an empty object. But when I change it, you can see that I'm using Azure right now. For each call, you get this nice breakdown. I think it's from LiteLLM. It allows you to very easily track your usage, token usage, et cetera, for observability and optimization and everything like that. Just nice little tidbits that are part of it here and there. Let me see. We saw the example before in the slides, but I'm going to pull in that Form 4 from online. I'm going to create this doc object using attachments. You can see some of the stuff it did under the hood. It used pdfplumber and created Markdown from it, pulled out the images, et cetera. Again, I don't have to worry about all that. Attachments makes that super easy. I'm going to show you what we're working with here. In this case, we have the Form 4. And then I'm going to do that poor man's RAG that I mentioned before. OK, great. How many shares were sold in total?
It's going to go through that whole chain of thought and bring back the response. That's all well and good. But the power of DSPy, in my mind, is that you can have these arbitrarily complex data structures. That's fairly obvious, because it uses Pydantic and everything else. But you can get a little creative with it. So in this case, I'm going to say, OK, a different type of document analyzer signature. I'm just going to give it a document. And then I'm just going to defer to the model on defining the structure of what it thinks is most important from the document. So in this case, I'm defining a dictionary object. And so it will, hopefully, return to me a series of key-value pairs that describe important information in the document in a structured way. And so you can see here, again, this is probably cached. I did it all on one line in this case, but I'm saying I want to do chain of thought using the document analyzer signature. And I'm going to pass in the input field, which is just the document here. I'm going to pass in the document that I got before. And you can see here, it pulled out a bunch of great information in a super structured way. And I didn't have to really think about it. I just deferred all of this to the model, to DSPy, for how to do it. Now, of course, you can do the inverse and say, OK, I have a very specific business use case. I have something specific in terms of the formatting or the content that I want to get out of the document. I define that as just your typical Pydantic classes. So in this case, I want to pull out whether there are multiple transactions, the schema itself, and important information like the filing date. I'm going to define the document analyzer schema signature. Again, a super simple input field, which is just the document itself, which is parsed by attachments and gives me the text and the images. And then I'm passing in the document schema parameter, which has the document schema type, which is defined above.
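
As a rough picture of the "specific business use case" schema described above, the sketch below models a filing with nested transactions. The talk uses Pydantic classes; this version uses standard-library dataclasses so it runs anywhere, and every field name and value is an assumption, since the exact schema is not shown in the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    # Hypothetical fields; the talk does not list the exact schema.
    security: str
    shares: int
    action: str  # "buy" or "sell"


@dataclass
class DocumentSchema:
    form_type: str
    filing_date: str
    transactions: list[Transaction] = field(default_factory=list)

    def total_shares_sold(self) -> int:
        """Sum the shares across all sell transactions."""
        return sum(t.shares for t in self.transactions if t.action == "sell")


# Made-up values, standing in for what a model would extract.
result = DocumentSchema(
    form_type="4",
    filing_date="2024-01-15",
    transactions=[
        Transaction("Common Stock", 1000, "sell"),
        Transaction("Common Stock", 250, "sell"),
        Transaction("Common Stock", 400, "buy"),
    ],
)
print(result.filing_date, result.total_shares_sold())
```

Once the output is a typed object rather than free text, a question like "how many shares were sold in total?" becomes ordinary attribute access and arithmetic.
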
And this is effectively what you would pass into structured outputs, but just doing it in the DSPy way, where it's going to give you the output in that specific format. So you can see it pulls out things super nicely: filing date, form type, the transactions themselves, and then the ultimate answer. And it's nice because it exposes it in a way that you can use dot notation, so you can very quickly access the resulting objects. So looking at adapters, I'll use another little tidbit from DSPy, which is inspect_history. For those who want to know what's going on under the hood, inspect_history will give you the raw dump of what's actually going on. So you can see here the system message that was constructed under the hood was all of this. You can see the input fields are the document, and the output fields are the reasoning and the schema. It's going to pass these in. And then you can see here the actual document content that was extracted and put into the prompt, with some metadata. This is all generated by attachments. And then you get the response, which follows this specific format. So you can see the different fields that are here. And it's this relatively arbitrary response format, basically, for the field names, which is then parsed by DSPy and passed back to you as the user. So I can do response.document_schema and get the actual result. To show you what the BAML adapter looks like, we can basically do two different calls. So this is an example from my buddy [name unclear] online again. What we do here is define a Pydantic model, a super simple one: patient address, and then patient details. Patient details has the patient address object within it. And then we're going to create a super simple DSPy signature to say, take a clinical note, which is a string, and the patient info is the output type. So I'm going to run this two different ways, the first time with the smart LLM that I mentioned before, just using the built-in adapter.
So I don't specify anything there. And then the second one will be using the BAML adapter, which is defined there. So I guess there are a few things going on here. One is the ability to use Python's context manager, which is the line starting with "with", which allows you to break out of whatever the global LLM has been defined as and use a specific one just for that call. So you can see, in this case, I'm using the same LLM, but if I wanted to change this to the Anthropic LLM or something, I think that should work. Basically, what that's doing is just offloading that call to whatever LLM you're defining for that particular call. And something happened. I'm on a VPN. So let's kill that. Sorry, AlixPartners. OK, great. So we had two separate calls. One was to the smart LLM, which is, I think, 4.1. The other one was to Anthropic. Everything else is exactly the same, the note is exactly the same, et cetera. We got the exact same output. That's great. But what I wanted to show here is the adapters themselves. So in this case, I'm doing inspect_history with n equals 2, so I'm going to get both of the last two calls. And we're going to see how the prompts are different. And so you can see here, the first one is the built-in JSON schema, this crazy long JSON string. Yeah, LLMs are good enough to handle that, but probably not for super complicated ones. And then you see here, the second one uses the BAML notation, which, as we saw in the slides, is a little bit easier to comprehend. And on super complicated use cases, you can actually get a measurable improvement. Multimodal example, same sort of thing as before. I'll pull in the image itself. Let's just see what we're working with. OK, great. We're looking at these various street signs. And I'm just going to ask it a super simple question: it's this time of day, can I park here now, and when should I leave?
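
The difference between the two adapter prompts can be approximated with a small sketch that renders the same nested type in two ways: as a JSON-schema-style dictionary and as a compact outline. The compact format below is only BAML-inspired and is not the exact output of the BAML adapter. The field names are hypothetical, and dataclasses stand in for the Pydantic models used in the talk.

```python
import json
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class PatientAddress:
    street: str  # hypothetical field names throughout
    city: str


@dataclass
class PatientDetails:
    name: str
    age: int
    address: PatientAddress


_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_json_schema(cls) -> dict:
    """Build a verbose JSON-schema-style description of a dataclass."""
    props = {}
    for f in fields(cls):
        if is_dataclass(f.type):
            props[f.name] = to_json_schema(f.type)
        else:
            props[f.name] = {"type": _JSON_TYPES[f.type]}
    return {"type": "object", "properties": props, "required": [f.name for f in fields(cls)]}


def render_compact(cls, indent: int = 0) -> str:
    """Render the same structure as a short, indented outline."""
    pad = "  " * (indent + 1)
    lines = ["{"]
    for f in fields(cls):
        if is_dataclass(f.type):
            rendered = render_compact(f.type, indent + 1)
        else:
            rendered = f.type.__name__
        lines.append(f"{pad}{f.name}: {rendered},")
    lines.append("  " * indent + "}")
    return "\n".join(lines)


verbose = json.dumps(to_json_schema(PatientDetails))
compact = render_compact(PatientDetails)
print(len(verbose), len(compact))
print(compact)
```

Both renderings carry the same structure; the compact one simply spends fewer characters on it, which is the intuition behind the easier-to-read prompt shown in the demo.
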
And you can see I'm just passing in, again, the super simple shorthand for defining a signature, and then I get back the Boolean in this case, and a string of when I can leave. So modules themselves are, again, fairly simple. You just kind of wrap all this in a class. Question? Oh, good question. Yeah, so when you do, yes. So for those online, the question was, does it always return reasoning by default? When you call dspy.ChainOfThought, as part of the module, where it's built in, it's adding the reasoning automatically into your response. You're not defining that. It's a great question. It's not defined in the signature, as you can see up here. But it will add that in and expose it to you, to the extent that you want to retain it for any reason. So if I changed this to Predict, you wouldn't get that same response. You would literally just get that part. So that's actually a good segue to the modules. Modules are basically just wrapping all of that into some type of replicable logic. And so we're giving it the signature here. We're saying self.predict. In this case, it's just a demonstration of how it's being used as a class. So I'll just add this module identifier and a sort of counter. But this can be any type of arbitrary business logic or control flow or database action, whatever it might be. When this image analyzer class is called, this function would run. And then when you actually invoke it, this is when it's actually going to run the core logic. And so you can see I'm instantiating it, the analyzer, AI123. And then I'll call it. Great. And you can see the counter incrementing each time I actually make the call. So, a super simple example. We don't have a ton of time, but I'll show you some of the other modules and how that kind of works out. In terms of tool calling, it's fairly straightforward. I'm going to define two different functions, Perplexity search and get URL content, and create a bio agent module.
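
The module pattern described above (wrap a predictor in a class, put business logic in a forward function, and let calling the object run it) can be sketched without any model at all. The `Module` base class here is a tiny stand-in written for this example, not `dspy.Module` itself, and `stub_predict`, the identifier, and the file name are hypothetical.

```python
from typing import Callable


class Module:
    """Tiny stand-in for the module pattern: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class ImageAnalyzer(Module):
    def __init__(self, module_id: str, predict: Callable[[str, str], dict]):
        self.module_id = module_id
        self.predict = predict  # in DSPy this would be a predictor built from a signature
        self.call_count = 0

    def forward(self, image: str, question: str) -> dict:
        # Arbitrary business logic can live here: counters, control flow, database calls.
        self.call_count += 1
        result = self.predict(image, question)
        return {"module_id": self.module_id, "call": self.call_count, **result}


def stub_predict(image: str, question: str) -> dict:
    """Hypothetical predictor that returns a fixed answer instead of calling a model."""
    return {"can_park": False, "leave_by": "unknown"}


analyzer = ImageAnalyzer("demo-1", stub_predict)
first = analyzer("street_signs.jpg", "Can I park here now?")
second = analyzer("street_signs.jpg", "Can I park here now?")
print(first["call"], second["call"])  # 1 2
```
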
So this is going to define Gemini 2.5 as this particular module's LLM. It's going to create an answer generator object, which is a ReAct call. So I'm going to basically do tool calling whenever this is called. And then the forward function is literally just calling that answer generator with the parameters that are provided to it. And then I'm creating an async version of that function as well. So I can do that here. I'm going to say, OK, identify instances where a particular person has been at their company for more than 10 years. It needs to do tool calling to do this, to get the most up-to-date information. So what this is doing is basically looping through. And it's going to call that bio agent, which is using the tool calls in the background. And it will make a determination as to whether their background is applicable per my criteria. In this case, Satya is true. Brian should be false. But here's what's interesting while that's going. Similar to the reasoning object that you get back for chain of thought, you can get a trajectory back for things like ReAct. So you can see what tools it's calling, the arguments that are passed in, and the observations for each of those calls, which is nice for debugging and, obviously, other uses. I want to get to the other content, so I'm going to speed through the rest of this. This is basically an async version of the same thing. So you would run both of them in parallel, same idea. I'm going to skip the GEPA example here just for a second. I can show you what the output looks like. But basically, what this is doing is creating a data set. It is showing you what's in the data set. It's creating a variety of signatures. In this case, it's going to create a system that categorizes and classifies different help messages, basically, that are part of the data set. So my sink is broken, or my light is out, or whatever it is. I want to classify whether it's positive, neutral, or negative, and the urgency of the actual message.
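
The async variant mentioned above, where several agent calls run in parallel, follows the usual `asyncio.gather` pattern. In this sketch the agent is a hypothetical stub and the tenure numbers are made up; a real agent would call a model with tools instead of reading a dictionary.

```python
import asyncio

# Made-up tenure data standing in for what tool calls would look up.
FAKE_TENURE_YEARS = {"person_a": 12, "person_b": 3}


async def bio_agent(person: str, min_years: int) -> dict:
    """Hypothetical async agent; a real one would call an LM with tools here."""
    await asyncio.sleep(0.01)  # simulates network latency
    years = FAKE_TENURE_YEARS[person]
    return {"person": person, "meets_criteria": years > min_years}


async def main() -> list[dict]:
    people = ["person_a", "person_b"]
    # gather() runs all agent calls concurrently and preserves input order.
    return await asyncio.gather(*(bio_agent(p, 10) for p in people))


results = asyncio.run(main())
print(results)
```

In a notebook, where an event loop is already running, `await main()` would take the place of `asyncio.run(main())`.
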
It's going to categorize it. And then it's going to pack all those different modules into a single support analyzer module. And then from there, what it's going to do is define a bunch of metrics, which are based off of the data set itself. So it's going to say, OK, how do we score the urgency? This is a very simple one where, OK, it either matches or it doesn't. And there are other ones where it can be a little bit more subjective. And then you can run it. This is going to take too long. It probably takes 20 minutes or so. But what it will do is basically evaluate the performance of the base model and then apply those metrics and iteratively come up with new prompts to improve on that. Now, I want to pause here just for a second, because there are different types of metrics. In particular, GEPA uses feedback from the teacher model in this case. It can work with the same level of model, but in particular, when you're trying to use, say, a smaller model, it can actually provide textual feedback. So it says, not only did you get this classification wrong, but it's going to give you some additional information or feedback, as you can see here, on why it got it wrong and what the answer should have been. You can read the paper, but that basically allows it to iteratively find that kind of Pareto frontier of how it should tweak the prompt to optimize it based off that feedback. It basically just tightens that iteration loop. You can see there's a bunch here, and then you can run it and see how it works. But just to give you a concrete example of how it all comes together. So we took a bunch of those examples from before. We're basically going to do a bit of categorization. So I have things like contracts. I have images. I have different things that one DSPy program can comprehend and do some type of processing with.
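
A metric that returns textual feedback alongside a score, as described above for GEPA, might look roughly like the sketch below. The return type here is a plain `NamedTuple` chosen for this example; the exact metric signature and return object that DSPy's GEPA optimizer expects are not shown in the talk, and the example message is made up.

```python
from typing import NamedTuple


class ScoreWithFeedback(NamedTuple):
    score: float
    feedback: str


def urgency_metric(gold: dict, pred: dict) -> ScoreWithFeedback:
    """Exact-match metric that also explains a miss in plain text."""
    if pred["urgency"] == gold["urgency"]:
        return ScoreWithFeedback(1.0, "Urgency matched the label.")
    return ScoreWithFeedback(
        0.0,
        f"Predicted urgency '{pred['urgency']}' but the label is "
        f"'{gold['urgency']}' for message: {gold['message']!r}.",
    )


# Made-up example in the spirit of the help-message data set.
gold = {"message": "My sink is broken and flooding", "urgency": "high"}
miss = urgency_metric(gold, {"urgency": "low"})
hit = urgency_metric(gold, {"urgency": "high"})
print(miss.score, miss.feedback)
```

The score alone says the prediction was wrong; the feedback string says what was expected, which is the extra signal a reflective optimizer can use when rewriting the prompt.
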
So this is something that we see fairly regularly: we might run into a client situation where they have just a big dump of files. I don't really know what's in it. They want to, maybe, find SEC filings and process them a certain way. They want to find contracts and process those a certain way. Maybe there are some images in there and they want to process those a certain way. So this is an example of how you would do that. If I start at the bottom here, this is a regular Python file, and it uses DSPy to do all those things that I just mentioned. So we're pulling in the configurations. We're setting the regular LM, the small one, and one we'll use for images. As an example, Gemini models might be better at image recognition than others. So I might want to defer to or use a particular model for a particular workload. So if I detect an image, I will route the request to Gemini. If I detect something else, I'll route it to 4.1 or whatever it might be. So I'm going to process a single file. And what it does is use our handy attachments library to put it into a format that we can use. And then I'm going to classify it. And it's not super obvious here, but I'm getting a file type from this classify file function call. And then I'm doing some different type of logic, depending on what type of file it is. So if it's an SEC filing, I do certain things. If it's a certain type of SEC filing, I do something else. If it's a contract, maybe I'll summarize it. If it's something that looks like city infrastructure, in this case the image that we saw before, I might do some more visual interpretation of it. So if I dive into classify file super quick, it's running the document classifier. And all that is doing is basically a predict on the image from the file, and making sure it returns a type, "what is this?", which would be the document type. And so you can see here, at the end of the day, it's a fairly simple signature.
And so what we've done is basically take the PDF file, in this case, take all the images from it, and take the first image or first few images, in this case a list of images, as the input field. And I'm saying, OK, just give me the type, what is this? And I'm giving it an option of these document types. So obviously, this is a fairly simple use case. But it's basically saying, given these three images, the first three pages of a document, is it an SEC filing, is it a patent filing, is it a contract, is it city infrastructure? Pretty different things, so the model really shouldn't have an issue with any of those. And then we have a catch-all bucket for other. And then, as I mentioned before, depending on the file type that you get back, you can process them differently. So I'm using the small model to do the same type of Form 4 extraction that we saw before, and then asserting, basically, in this case, that it is what we think it is. For a contract, in this case, we're saying, let's see, I have like 10 more minutes, so we'll stop after this file. But for the particular contract, we'll create this summarizer object. So we'll go through as many pages as there are. We'll do some, basically, recursive summarization of that using a separate DSPy function. And then we'll detect some type of boundaries of that document, too. So we'll say, I want the summaries, and I want the boundaries of the document. And then we'll print those things out. So let's just see if I can run this. It's going to classify it. It should come back as a contract. So do you just rely on the model to determine city infrastructure? Yeah, the question was, am I just relying on the model to determine if it's city infrastructure? Yes. I mean, this is more just a workshop, quick and dirty example. It's only because there's one picture of the street signs. And if we look in the data folder, I have a contract, some image that's irrelevant, the Form 4 SEC filing, and then the parking one, too. They're pretty different.
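
The classify-then-route flow described above can be sketched as a typed label plus a dispatch table. Everything here is a stand-in: `classify_file` uses a filename check where the talk uses a model call on the first pages, the handler strings replace real processing, and the label names and paths are assumptions.

```python
from typing import Callable, Literal, get_args

DocumentType = Literal["sec_filing", "patent_filing", "contract", "city_infrastructure", "other"]


def classify_file(path: str) -> DocumentType:
    """Hypothetical classifier; the talk asks a model to pick the type instead."""
    if "form4" in path:
        return "sec_filing"
    if "contract" in path:
        return "contract"
    if "parking" in path:
        return "city_infrastructure"
    return "other"


# Each document type maps to its own processing step (and, in practice, its own model).
HANDLERS: dict[str, Callable[[str], str]] = {
    "sec_filing": lambda path: f"extract Form 4 transactions from {path}",
    "contract": lambda path: f"summarize {path} and detect section boundaries",
    "city_infrastructure": lambda path: f"run visual interpretation on {path}",
}


def process_file(path: str) -> tuple[str, str]:
    doc_type = classify_file(path)
    # Guard against labels outside the allowed set, as a Literal-typed output would.
    if doc_type not in get_args(DocumentType):
        raise ValueError(f"unexpected document type: {doc_type}")
    handler = HANDLERS.get(doc_type, lambda p: f"no special handling for {p}")
    return doc_type, handler(path)


for path in ["data/contract.pdf", "data/form4.pdf", "data/random.png"]:
    print(process_file(path))
```
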
The model should have no problem, out of those categories that I gave it, categorizing it properly. In some type of production use case, you would want much more stringent classification, or maybe even multiple passes of classification, maybe using different models to do that. But yeah, given those options, at least the many times I ran it, it had no problem. So in this case, I gave it one of these contract documents, and it ran some additional summarization logic under the hood. So if I go to that super quick, you can find all this in the code. But basically what it does is use three separate signatures to decompose the contents of the contract and then summarize them up. So it's basically just iteratively working through each of the chunks of the document to create a summary that you see here at the bottom. And then, just for good measure, we're also detecting the boundaries of the document to say, OK, out of the 13 pages, you have the main document, and then some of the exhibits or the schedules that are a part of it. So let me just bring it up super quick, just to show you what we're working with. This is just some random thing I found online. And you can see, it said the main document was from page zero to six. So that's zero, one, two, three, four, five, six. Seems reasonable. Now we have the start of schedule one. Schedule one, it says, is the next two pages. That looks pretty good. Schedule two is just the one page, nine to nine. That looks good. And then schedule three runs through to the end of the document. That looks pretty good too.
And so the way we did that under the hood was basically to take the PDF, convert it to a list of images, and then for each of the images, pass those to a classifier, and then use that to, well, let's just look at the code. But basically, take the list of those classifications and give that to another DSPy signature to say, given these classifications of the document, give me the structure, and basically give me a key-value pair of the name of the section and two integers, a tuple of integers, that determine the boundaries, essentially. So that's what that part does. If we go back, so city infrastructure, I'll do this one super quick, just because it's pretty interesting in how it uses tool calls. And while this is running, I should use the right one. Yeah. Yeah. So let's just go to that super quick. So that should be the boundary detector. There's a blog post on this that I published probably in August or so that goes into a little bit more detail. The code is actually pretty crappy in that one. It's going to be better here. But basically what it does is, this is probably the main logic. So for each of the images in the PDF, we're going to call classify page. We're going to gather the results. So it's doing all that asynchronously and all that. Saying, OK, here are all the different page classifications that there are. And then I pass the output of that into a new signature that says, given a tuple of page and classification, and I don't even define it here, give me this, I don't know, relatively complicated output: a dictionary of string to tuple of integer, integer. And I give it this set of instructions to say just detect the boundaries. This is all very much non-production code, obviously. But the point is that you can do these types of things super, super quickly. I'm not specifying much, not giving it much context. And it worked pretty well. It's worked pretty well in most of my testing.
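
To show the shape of the boundary output described above, a dictionary from section name to a tuple of first and last page, here is a deterministic sketch. In the talk this step is delegated to a model through a signature; the `detect_boundaries` helper below is a hypothetical stand-in that simply collapses runs of identical page labels, and the labels are made up to match the 13-page example.

```python
from itertools import groupby


def detect_boundaries(page_labels: list[str]) -> dict[str, tuple[int, int]]:
    """Collapse consecutive identical page labels into (first_page, last_page) spans."""
    boundaries: dict[str, tuple[int, int]] = {}
    page = 0
    for label, run in groupby(page_labels):
        length = len(list(run))
        # Assumes each section is contiguous; a label that reappears later
        # would overwrite its earlier span in this sketch.
        boundaries[label] = (page, page + length - 1)
        page += length
    return boundaries


# Per-page classifications for a 13-page document (zero-indexed pages).
labels = ["main"] * 7 + ["schedule_1"] * 2 + ["schedule_2"] + ["schedule_3"] * 3
print(detect_boundaries(labels))
# {'main': (0, 6), 'schedule_1': (7, 8), 'schedule_2': (9, 9), 'schedule_3': (10, 12)}
```

A rule like this only works when the page labels are already clean, which is why the demo hands the job to a model that can also re-inspect page images when a boundary looks doubtful.
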
Now obviously there's a ton of low-hanging fruit in terms of ways to improve that, optimize it, et cetera. But all this is doing is taking that signature, these instructions, and then I call ReAct. And all I give it is the ability to basically self-reflect and call get page images. So it says, OK, I'm going to look at this boundary. Well, let me get the page images for these three pages to make sure that the boundary is correct. And then it uses that to construct the final answer. And so this is a perfect example of the tight iteration loop that you can have in building it. But then you can also take advantage of the model's introspective ability, if you will, to use function calls against the data itself, the data it generated itself, et cetera, to kind of keep that loop going. Question? [audience question unclear] I mean, yes, I think that's probably reductive of its full potential, but generally that's correct. I mean, yes, you can use structured outputs, but you have to do a bunch of crap, basically, to coordinate feeding all of that through to the rest of the program. Maybe you want to call the model differently, or use XML here, or use a different type of model, or whatever it might be. So absolutely, I'm not saying this is the only way to create these applications, or that you shouldn't use Pydantic or shouldn't use structured outputs. You absolutely should. It's just that once you wrap your head around the primitives that DSPy gives you, you can start to very quickly build these types of things. I mean, these are prototypes right now, but if you want to take this to the next level, to production scale, you have all the pieces in front of you to build that. Any other questions? I've probably got about five minutes left. Good. Yeah, so, GEPA. And actually I'll pull up, I ran one right before this.
This uses a different algorithm called MIPRO, but basically, with the optimizers, as long as you have well-structured data, and for the machine learning folks in the room, which is probably everybody, obviously the quality of your data is very important. You don't need thousands and thousands of examples necessarily, but as long as you have enough, maybe 10 to 100 inputs and outputs, and if you're constructing your metrics in a way that is relatively intuitive and that accurately describes what you're trying to achieve, the improvement can be pretty significant. And so that time entry corrector thing that I mentioned before, you can see the output of it here. It's iterating through, it's measuring the output metrics for each of these, and then you can see all the way at the bottom, once it goes through all of its optimization stuff, the actual performance of the basic versus the optimized model. In this case it went from 86 to 89. And then interestingly, and this is still in development, this one in particular, you can break it down by metric, so you can see where the model is optimizing better, performing better, across certain metrics. And this can be really telling as to whether you need to tweak your metric, maybe you need to decompose your metric, or maybe there are other areas within your data set or the structure of your program that you can improve. But it's a really nice way to understand what's going on under the hood. And if you don't care about some of these and the optimizer isn't doing as well on them, maybe you can throw them out too. So it's a very flexible way of doing all that. Yeah, yeah, so the output of the optimizer is basically just another, it's almost like a compiled object, if you will. DSPy allows you to save and load programs as well, so the output of the optimizer is basically just a module that you can then serialize and save off somewhere, or you can call it later as you would any other module.
And it's just manipulating the phrasing of the prompts, or what is it actually doing? Yeah, yeah, under the hood, it's literally just iterating on the actual prompt itself. Maybe it's adding additional instructions and saying, well, I keep failing on this particular thing, like not capitalizing the names correctly. I need to add to my upfront criteria in the prompt an instruction to the model to say, you must capitalize names properly. And Chris, who I mentioned before, has a really good way of putting this, and I'm going to butcher it now, but the optimizer is basically finding latent requirements that you might not have specified initially upfront. But based off of the data, it's kind of like a poor man's deep learning, I guess. It's learning from the data, it's learning what is doing well and what is doing not so well. And it's dynamically constructing a prompt that improves the performance based off of your metrics. It's all, yeah. Yeah, the question being, is it all LM guided? Yes, particularly for GEPA, it's using LMs to improve LMs' performance. So it's using the LM to dynamically construct new prompts, which are then fed into the system and measured, and then it kind of iterates. So it's using AI to build AI, if you will. Yeah. [audience question, partly inaudible] Oh, it absolutely is. You can get it under the hood. I mean, the question was, why don't you just get the optimized prompt? You can, absolutely. What else? So what else is there other than the prompt? The DSPy object itself. So the module, the way things, well, we can probably look at one if we have time. [inaudible] Yeah, yeah, sure. Let me see if I can find one quick. But fundamentally, at the end of the day, yes, you get an optimized prompt, a string that you can dump somewhere if you want to. Actually, there are a lot of pieces to the signature, right? So it's like how you describe your fields. Yes. This is a perfect segue, and I'll conclude right after this.
I was playing around with this thing called DSPy Hub that I created as a repository of optimized programs. So basically, if you're an expert in whatever, you optimize an LLM against this data set, or have a great classifier for city infrastructure images or whatever, then, kind of like Hugging Face, you can download something that has been pre-optimized. And then what I have here is the actual loaded program. This would be the output of the optimization process, or it is, and then I can call it as I would anything else. And so you can see here, this is the output, and I used the optimized program that I downloaded from this hub. And if we inspect the loaded program, you can see under the hood, it's a Predict object with a string signature of time and reasoning. Here is the optimized prompt, ultimately. This is the output of the optimization process, this long string here. And then the various specifications and definitions of the inputs and outputs. So it's up to your use case. A document classifier might be a good example. If in my business I come across documents of a certain type, I might optimize a classifier against those. And then I can use that somewhere else, on a different project or something like that. So out of 100,000 documents, I want to find only the pages that have an invoice on them, as an example. Now sure, 100%, you can use a typical ML classifier to do that. That's great. This is just an example, but you can also theoretically train or optimize a model to do that type of classification, or some type of generation of text, or what have you, and then you have the optimized state of it, which lives in your data processing pipeline. And you can use it for other types of purposes or give it to other teams or whatever it might be. So it's just up to your particular use case.
Something like this hub, maybe it's not useful because each individual's use case is so hyper-specific. I don't really know, but yeah, you can do with it whatever you want. Probably the last question, yeah. [audience question unclear] So is the question more about continuous learning? Ish, like how would you do that here? Well, how are you thinking? Yeah, well then you can. Yeah, that's right. It would basically be added to the data set, and then you would use the latest optimized program and just keep optimizing off of that. A ground truth data set, that's right. You would collect the other data, yes. If you're a good engineer, you could probably do it, but I'm not recommending replacing ML models with optimized DSPy programs for particular use cases. Maybe classification is a terrible example, I recognize that. But for other areas, in theory, yes, you can do something like that. But for particular LLM tasks, and I'm sure we all have interesting ones, if you have something that is relatively well-defined, where you have known inputs and outputs, it might be a candidate for something worth optimizing. If nothing else, to transfer it to a smaller model, to preserve the level of performance at a lower cost. That's really the biggest benefit I see. All right, last question. [audience question about cost and large context] Yeah, so the question was, can DSPy be expensive? And then for large context, how have you seen that and how have you managed that? The expensive part is totally up to you. If you call a function a million times asynchronously, you're going to generate a lot of costs. I do think DSPy makes it easier to call things, but it's not inherently expensive. It might, to your point, add more content to the prompt. Like, sure, the signature is a string, but the actual text that's sent to the model is much longer than that. That's totally true. I wouldn't say that it's a large cost driver.
I mean, again, it's ultimately more of a programming paradigm. So you can write your own compressed adapter, if you want, that reduces the amount that's sent to the model. In terms of large context, it's kind of the same answer, I think. If you're worried about that, maybe you have some additional logic, either in the program itself or in an adapter or as part of the module, that keeps track of that. Maybe you do some context compression or something like that. There have been some really good talks about that over the past three days, obviously. I have a feeling that that will kind of go away at some point. Either context windows get bigger or context management is abstracted away somehow. I don't really have an answer, that's more of an intuition. But DSPy, again, kind of gives you the tools, the primitives, for you to do that should you choose, and to track that state, track that management over time. So I think that's it. We're going to get kicked out soon. So thanks so much for your time. I really appreciate it.
