- Generic, shallow, or outdated AI-generated content, termed "AI Slop," highlights the need for well-structured systems and human oversight in technical content creation.
- Effective AI system design involves balancing "autonomy" — from simple prompts to complex multi-agent systems — to control cost, maintain flexibility, and handle dynamic tasks.
- The presented "Deep Research" system combines an exploratory research agent with a deterministic writing workflow to augment human technical writers, focusing on high-quality, cited content production.
Full Workshop: Build Your Own Deep Research Agents - Louis-François Bouchard, Paul Iusztin, Samridhi
- Combat "AI Slop" with Structured Systems: AI-generated content is often generic, meaningless, or outdated. Counter this by implementing systems that perform thorough research, enforce structured writing, and incorporate human feedback.
- Choose the Right Autonomy Level: Employ an "Autonomy Slider" mental model to decide between simple prompts, advanced workflows, or full agent systems, balancing control, cost, and the need for dynamic behavior.
- Workflows vs. Agents: Use augmented language model workflows for tasks with predetermined, sequential steps. Opt for agents when the system must take autonomous actions, react dynamically to its environment, or make complex plans (e.g., calling APIs, branching logic).
- Manage Context Budget Effectively: Be aware of "context rot," where LLM performance degrades well before the maximum context window is reached. Actively manage context by pruning, summarizing, retrieving, or delegating tasks to tools or sub-agents.
- Delegate with Tools and Multi-Agents: Leverage tools as specialist capabilities within a single agent for complex but sequential tasks. Consider multi-agent systems for very large context loads (e.g., >200k tokens) or highly distributed, autonomous decision-making.
- Augment, Don't Replace, Human Writers: AI systems should primarily augment human content creators, not replace them. The human touch remains essential for relatability, humor, and connecting with the audience in high-quality technical content.
- Iterate and Integrate Specialized APIs: Continuously test and refine agent designs based on feedback. Integrate specialized external APIs for specific functions like dynamic web scraping (Firecrawl), precise web search with sources (Perplexity/Gemini with grounding), and YouTube video transcription (Gemini).
AI Slop — Generic, meaningless, shallow, or outdated content generated by artificial intelligence models.
Autonomy Slider — A conceptual model representing the spectrum of AI system complexity, from simple prompts to full agent systems, inversely correlating with control and directly with cost.
Workflow — An augmented language model system (with data, tools, memory) that executes a predefined, sequential series of steps without dynamic decision-making or reaction to an environment.
Agent — An AI system capable of taking autonomous actions, reacting dynamically to its environment, and planning which tools or steps to execute.
Context Window — The maximum amount of text (tokens) an LLM can process at once.
Context Rot — The observed degradation of an LLM's performance and ability to leverage information, often occurring well before the stated maximum context window limit is reached.
Deep Research System — An advanced AI system designed to thoroughly research a given topic by planning searches, using web access and tools, inspecting sources, synthesizing information, and iterating to produce a comprehensive report.
MCP (Model Context Protocol) — An open standard for giving AI agents access to tools and data through servers that expose tools, prompts, and resources.
Grounding — The process of ensuring an AI model's responses are based on factual information from specific, verifiable sources, often by providing those sources to the model.
Hello everyone, not too loud? I hope it's fine. Perfect. So yeah, we'll kick off the workshop and introduce ourselves shortly. But first, we just wanted to present this slide, because that's basically LinkedIn this year. It's the type of content you get when you ask ChatGPT: a very generic response. And there are a few things wrong with this response that we don't really like seeing. The obvious ones are the slop words, the AI slop, so all the "delve" and "intricacies" that we all know about. But there are more, and fortunately there aren't even any em dashes in there. But there are some other problems, like the lines are a bit up. But the one that is more general is the "most companies miss" or "most people miss" pattern; it does that often. These are some examples from LinkedIn posts where they all say "most teams", "most people", "most AI projects". And I'm actually guilty of that one. So we learn as we do. And there are some other problems with hallucinations or, here, outdated information. This was generated like last week, and GPT-4 is obviously not state of the art. And likewise, there are some AI slop phrases that we see a lot: the obvious "rapidly evolving" one, but also the classic "it's not about something, but something else" that it generates a lot. And the worst of all is that typically this is really meaningless and shallow. It doesn't provide any value or anything useful. And so what we need here is proper research and proper writing around these language models.
And to give more context on this workshop: at Towards AI, we create courses and tons of videos, training, and technical content. And to do that, we need to create technical content around AI engineering.
So we need technical writers, which means we need AI engineers, writers, and editors that iterate together, and a lot of time, to create good storytelling, or just a good article, a good lesson, or a good video in general, which in the end costs a lot of money. So what we tried to do is automate this process, which we did, or at least a part of it, because another part is really a human thing: making a good story that relates to others. And so that's what we did. We built a system for replacing this whole process of doing very thorough research and then technical writing. So what it does is: we give it a topic, like "what is harness engineering". And we have our own deep research agent that will search various websites and use tools to do lots of things. And then another system writes an actual technical article with code, images, and ideally non-slop writing in there. And actually, since we build courses, we used the system to build a course that teaches how to build the system. So it was a really fun project to do in general, and it allowed us to iterate and test a lot and to get user feedback directly from our students. So it was really nice.
And that's what we try to share here in this two-hour workshop, or at least a much more compact version, with a smaller, simpler deep research system and a writing agent strictly for shorter content, for a LinkedIn type of post. And we also try to share what we learned building it. You can open or get the GitHub repository; it's public. In the first 30 minutes, I will just talk with slides, so you won't be needing it. But after that, my colleagues will jump in with the code, so it might be worth having. And we will show this QR code again later on, so you can get it in time. But basically, I will cover what we learned and some terminology, because we are an education company and that's what we like to do.
So I will cover some basics to bring everyone to the same level, or at least to our definitions of things, and share what we learned by building it. And then Samridhi will take over to talk about the actual deep research agent and follow with the rest.
So who are we? On my end, I'm the CTO and co-founder of Towards AI. My name is Louis-François. I was a PhD student in AI until ChatGPT came out, when I basically switched to being an educator full time and building Towards AI to focus on that. I've been doing videos to explain research papers for seven years now. And obviously right now, I'm more around AI engineering. And I've been developing in the space since 2020, at the first startup I was hired at. So I will let Samridhi introduce herself.
Hi everyone. I'm Samridhi. I am a machine learning engineer. I'm also a technical writer and consultant helping Towards AI. Cool. So nice to meet you all.
Hello, everyone. I'm Paul Iusztin, and I've been building software for eight years. I've been in the AI teaching space for four-plus years. I'm also the author of the bestselling LLM Engineer's Handbook. And I'm excited to show this workshop to you guys today.
So we always start with a problem, and here it's pretty general: I'll start by introducing the AI engineering problem space. Basically, all our decisions as AI engineers are governed by constraints that typically apply less to software engineers, like the cost per task, which can swing a lot based on the model you use and the architecture you use. We have latency requirements, with reasoning models and other things that we can control. We obviously have some quality to respect, and data privacy that we need to be careful of. Fortunately, we have a stack to help us, from prompt engineering and context engineering to tool orchestration or building our own evaluation system.
Then we like to see that as some sort of autonomy slider that goes from very simple, just prompting a model, to building agent systems. The more complexity you add, from prompting to more advanced workflows to agent systems, the more autonomy you add, but also the less control you have over your whole system. And obviously, the higher the cost. And to us, that's very important to choose when we build for clients. So that's why we try to teach what we build in this loop. But to us, it's really important to build the right system, where we don't add too much autonomy where it can bring problems and uncertainty. And obviously most people are interested in building agents, but most of these agents that our clients want are actually super simple workflows, or at least workflows that we can come up with pretty easily. And so I want to start by defining everything from workflows up to multi-agent systems, finishing with this deep research example that we have made.
So what is a workflow? A workflow is very simple. It's just a language model, but augmented. Language models just take in tokens, or text, and generate tokens, or text. So you cannot do much with text alone other than reading and writing emails, and some of us do more than that. So ideally you want to add stuff to it, like data that it may not have access to, whether it's your company's data or some other data it needs to best answer the user. You can give it access to tools to do other things than just generating tokens. And you can add memory so that it can remember the interaction and be much more useful. But still, this is not an agent. It's just a thing we can build easily that can be reliable. And a workflow can be even more complex than that. You can chain prompts together to reduce the latency and complexity, or to do many more things based on strict conditions.
You can even do that with a router to decide automatically, based on conditions, what you want to do, whether it is to use a smaller model or just do a different task and then come back. You can do that in parallel to be more efficient, or to improve the results with majority voting and other techniques. And you can even add loops to automatically improve the results based on feedback from a judge. But all this is still not agentic. It's just a workflow that we can build, adding all these things. We can even combine them. We can build really advanced workflows.
So when does that thing become an agent? And here again, I'm using an illustration from Anthropic. It's a definition we really like. It's just when it needs to take actions and, more importantly, when it can react to what happens in the environment. So very simply, it should be able to take autonomous actions and react to what happens, as I said. And ideally be able to plan, to decide which tool to use and, more importantly, which tool not to use, how to respond, and everything.
So when should we use this thing, an agentic system, compared to a workflow? Well, typically, we always want to use the simplest solution. If just a prompt works, that's ideal. And when we build things, we always try to start with questions and use our autonomy slider. So it's just a list of simple questions we can answer. The first would be: does the model already know enough about the task to be done, whether it is a user asking questions or some other task? If so, obviously, you can just prompt it and do it. And ideally add some examples, some few-shot examples, just because it really helps the model to adapt somewhat on the fly. That would be just simple prompting.
Then if you need external context, and it's not too big, somewhere under around 200,000 tokens, you can just paste it in the prompt right away, ideally placed before the user's question, and use context caching so that it's more efficient to, for example, answer several questions about the same report, or your privacy documents, or whatever internal documents you may have. Now if the context is not known until the user asks the question, this is where you may try to inject knowledge on the fly based on the question, whether it is because your data is private and you need to retrieve it to answer the question, or because it's something recent that the models haven't been trained on, or domain specific. And if you want to do all this in different ways or depending on conditions, this is where you use a more advanced workflow with predetermined steps and sequences.
And so in practice, an example of a workflow we built for a client was support ticket handling, where basically you always have the same steps in the same order: the system receives the ticket, classifies it, routes it to the right team, drafts a response, validates it against the policy of the team, and then sends it. And here the key point is that these steps are always the same and the order never changes: it always needs to do these six steps in the same order. And so building this as an agent would just add overhead without adding anything extra. You don't need to dynamically react to anything, right? Here you just need to execute these steps. So a workflow definitely makes sense in this case.
So for an agent, what you want to ask yourself is whether your system needs to take actions or to be able to branch dynamically, whether it is to call different APIs depending on what the user needs, write to the database or not, do reasoning, and other things. This is where an agent can be useful.
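The support ticket example (receive, classify, route, draft, validate, send, always in that order) can be sketched as a plain fixed pipeline. This is a minimal illustration, not the client system: `call_llm`, the team names, and the toy "no refund promises" policy are all assumptions made so the sketch runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    text: str
    category: str = ""
    team: str = ""
    draft: str = ""
    approved: bool = False
    sent: bool = False
    log: list = field(default_factory=list)


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call so the sketch runs offline."""
    if prompt.startswith("Classify"):
        return "billing" if "invoice" in prompt.lower() else "technical"
    return "Thanks for reaching out. We are looking into your request."


# Assumed routing table; real team names would come from the client.
TEAM_BY_CATEGORY = {"billing": "finance", "technical": "support-engineering"}


def receive(t: Ticket) -> Ticket:
    t.text = t.text.strip()
    return t


def classify(t: Ticket) -> Ticket:
    t.category = call_llm(f"Classify this ticket: {t.text}")
    return t


def route(t: Ticket) -> Ticket:
    t.team = TEAM_BY_CATEGORY[t.category]
    return t


def draft_response(t: Ticket) -> Ticket:
    t.draft = call_llm(f"Draft a reply for the {t.team} team: {t.text}")
    return t


def validate(t: Ticket) -> Ticket:
    # Toy policy check (assumption): replies must never promise a refund.
    t.approved = "refund" not in t.draft.lower()
    return t


def send(t: Ticket) -> Ticket:
    t.sent = t.approved
    return t


# The order is fixed in code: nothing here plans or reacts dynamically.
STEPS = [receive, classify, route, draft_response, validate, send]


def handle_ticket(text: str) -> Ticket:
    ticket = Ticket(text=text)
    for step in STEPS:
        ticket = step(ticket)
        ticket.log.append(step.__name__)
    return ticket
```

Because the step list is a constant, the control flow is fully predictable, which is the property that makes a workflow cheaper and easier to debug than an agent for this kind of task.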
And an example we had was a CRM platform in Canada that wanted to create a chatbot for their users, their clients, to be able to generate marketing content automatically. Initially they reached out to us because they were applying for an AI grant, and they wanted a multi-agent system to do a lot of different things, with agents for everything, which did sound really good for the grant because it was AI focused. But it seemed really overkill to make many agents to do something as simple as a marketing chatbot. So in the end, what we always do is try to really understand what the client needs and what the client wants, which are oftentimes very different. And by talking with them, we figured out that the actual workflow was always somewhat the same, always simple. The agent needed to make a plan to decide what it should do, then retrieve the data specific to the client, generate the content, validate it, and fix it if needed. And so the tasks were always sequential, the same sort of task each time. They were super coupled, always the same thing about generating marketing content, just for different clients and different formats. And the whole sequence is context dependent. So whether it is the plan, the retrieval, the generation, or the fix, it all needs to have the context in mind to be able to generate the best content possible. So splitting decisions across different agents, as the client wanted, would have caused a lot of issues, whether it is information loss or just handoff errors with tool calls and other things that would reduce reliability. So instead we just built one agent but used tools as its capabilities. And this is really great because tools can have their own system prompt, their own validation logic, even their own LLM calls. So you can do tons of things with tools. In our example, we had a validation tool and one tool specific to each format, like the SMS tool or the email tool.
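The "one agent, tools as specialists" design can be sketched roughly as follows. Everything here is illustrative: `fake_llm`, the tool prompts, the 160-character SMS cap, and the keyword-based `plan` are assumptions standing in for real model calls and real planning.

```python
from typing import Callable, Dict, List


def fake_llm(system_prompt: str, user_input: str) -> str:
    """Hypothetical model call; echoes its inputs so the sketch runs offline."""
    return f"[{system_prompt}] {user_input}"


def sms_tool(brief: str) -> str:
    # Each tool carries its own system prompt and its own validation logic.
    text = fake_llm("Write a short SMS", brief)
    return text[:160]  # format-specific rule: hard length cap


def email_tool(brief: str) -> str:
    body = fake_llm("Write a marketing email", brief)
    return f"Subject: {brief[:40]}\n\n{body}"


TOOLS: Dict[str, Callable[[str], str]] = {"sms": sms_tool, "email": email_tool}


class MarketingAgent:
    """Single decision maker: the global context never leaves this object."""

    def __init__(self, client_profile: str) -> None:
        self.context: List[str] = [f"client: {client_profile}"]

    def plan(self, request: str) -> str:
        # A real agent would let the model pick the tool; a keyword rule stands in.
        return "sms" if "sms" in request.lower() else "email"

    def run(self, request: str) -> str:
        fmt = self.plan(request)
        brief = f"{request} | {self.context[0]}"
        content = TOOLS[fmt](brief)
        self.context.append(f"{fmt}: {content}")
        return content
```

The tools stay narrow and independently testable, while planning and client context live in one place, so there is no handoff between agents to lose information.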
And the result is that we use tools as specialists, but the global context stays within our only agent, the decision maker and the planner. But if we do that, it means that everything stays within our agent. So we have a constraint, which is the context window of the model that we are using. That context is made of many things: the system prompt, obviously, with the instructions that we give it; the tool definitions, with the different schemas and how to use these tools; and few-shot examples of the task to be done. It's really great to teach by examples, as we've seen with models; they learn very quickly that way for most tasks. You can also have retrieved data, depending on what you want to do, and even the entire conversation history, if it's a chatbot or whatever type of agent that evolves over time.
So all of this can take up a lot of space, which causes a problem, because as we go through each of these steps, the context grows and the performance degrades, which we call context rot. And the problem is that this happens well before the actual context window limit of, like, one million tokens, for example. It worsens quite fast after around 200,000 tokens; it keeps changing, but it's around 200,000 now. And this is mostly because of the lost-in-the-middle problem: we basically teach these long-context models to handle long context by feeding them a large corpus, inserting a random fact in it, and having them retrieve that fact. So it's basically more about retrieving one specific thing, and it doesn't teach those models to leverage the whole book, the whole context, to answer the question. So it's definitely not ideal, but it would be too expensive to build that kind of large dataset. So this way of training them causes the problem of us having to manage this context budget, which means always keeping the context as lean and relevant as possible, both to reduce cost in terms of token count and to improve the metrics that we are using.
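One simple way to keep a context budget is to always keep the system prompt and then admit conversation turns from newest to oldest until the budget is spent. The sketch below is only an illustration: the four-characters-per-token estimate is a rough assumption rather than a real tokenizer, and `fit_to_budget` is a hypothetical helper, not something from the workshop repository.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(system_prompt: str, history: List[str], budget: int) -> Tuple[List[str], int]:
    """Keep the system prompt plus the most recent turns that fit within `budget` tokens."""
    used = estimate_tokens(system_prompt)
    kept: List[str] = []
    # Walk from the newest turn backwards and stop at the first one that does not fit.
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used
```

Dropping old turns is the bluntest option; summarizing them or delegating work to a tool with its own context follows the same accounting, just with a different way of shrinking what stays in the main agent.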
And we can use many techniques for that. We can trim the content, we can summarize it, we can retrieve based on criteria, and there are also many interesting techniques in the Claude Code leak, like a compaction method that I recommend looking into. Those are really powerful ones, but the one I'm most interested in for this talk is the delegation that we can do to either tools or sub-agents with their own context, which is what most harnesses that we use are doing.
And so this leads us to multi-agent systems, which we basically want to use when we have too much context, mostly over around 200k tokens, or when the context just becomes way too large. We can delegate to tools or to agents. And there can be other reasons, like if you need autonomous decision making, or tool variability, or just security compliance, if you need to have one agent running locally inside one hospital, for example. I was in the healthcare field before, and we used to have to keep everything local. So that can be a reason for multi-agent systems.
And why am I talking about all this? Because what we've seen is that AI products are never just "build an agent" or "build a multi-agent system". They basically combine all of that: they combine tools and workflows, some parts are more specific and predefined, and some parts have more flexibility. And we believe that AI engineers need to understand all of these to build complex systems. And deep research actually implements that. It's a very good example of a complete system, since it integrates all of these techniques.
And before we show how we built our own and allow you to use it as well, let me just quickly cover what a deep research system is, just in case. It's a reasoning system that will plan what it needs to research. It has some autonomy to decide what to research. It has tools and internet or web access. It's reliable. It will cite its sources.
It has feedback loops with itself, but also with the human, for human feedback. So this is much more agentic, as we see it: it evolves in an environment, which is typically the web, user-provided sources, or just API access. And it's goal-driven. It has one objective. We don't tell it how to do it exactly. We just tell it to do research about something. So it's typically much more than just a chatbot. It basically replaces someone who would do very thorough research about a specific topic. It plans its search, it inspects, it pivots, it synthesizes information, and it can iterate. So deep research systems are really, to us, one of the best projects to learn how to build such a complex end-to-end system.
So we decided to build our own. In this case, our own is basically end-to-end technical article production from a topic, because we create course lessons in our case. So it came out of a real need, which is the most important part. So the goal here is to take a topic, research it deeply, and write a very good technical article, with code that is actually runnable and with images that are useful, not just randomly generated images throughout. And there are a few challenges when doing this. First, you need really high precision and recall when doing the research, because you need as many relevant sources as possible, but you don't want too many of them because of the limited context problem. You need to reduce hallucinations and AI slop, obviously, and incorporate a lot of human feedback, especially in the writing aspect.
And as we always do with projects, and I think you all should do as well, we started by asking ourselves questions to better understand how we should do it, and if we should do it at all. So the first question is: is this worthwhile? Is there a solution outside of what we could build that already exists? Our analysis is that high-quality technical content is expensive to produce.
We do that every day, and we need a lot of people, and it needs a lot of time and review. It's just really time consuming and really expensive, especially because these people need to be somewhat senior AI engineers to be able to explain anything in AI engineering. So they need expertise, but to teach someone, not to build a successful product, so it's expensive and not that creative. So yeah, it would be ideal to automate most of this process. And as I said, the research needs strong recall and precision, and on the other side, the writing needs to be more constrained: it needs to avoid hallucinations, be of high quality, and follow the writing tone and the structure that you want. And in terms of deep research tools, we use all of them pretty much daily, and they are way too exhaustive: they gather way too much content and they have a lot of noise. It's not ideal in our case. We do use them, but we decided to build our own just to be able to do exactly what we wanted.
So the decision is yes, this was worthwhile, but as writer augmentation, not to replace writers, because as we've seen, even if you build the best system possible these days, it's really hard to create a piece of content, whether it is a video or a written lesson, that connects with the reader or the viewer. So we need a human touch so that it's relatable to someone. You need to make jokes, good jokes, not the same ones all the time, and ones that are appropriate to the context. So it's really difficult to automate all this, and we want to keep the human in the loop there.
Now, the first actual question we asked ourselves is what the architecture should be: an agent, multiple agents, or a workflow. And the idea here is that first we have the research part, which is really exploratory.
It needs to find lots of sources. It needs to explore the web to understand what it needs to find and whether it's missing information, so it needs to be able to iterate, to pivot, to search again. It needs a lot of flexibility. And on the other side, the writing agent is much more deterministic. It needs to follow a specific tone and a specific structure; it needs constraints much more than it needs flexibility. You don't want AI slop: you don't want specific words or specific sentences, and you do want specific formats. So it's much more constrained. So we have a conflict here, where the research needs flexibility but the writing needs constraints. And so the decision is to split into two systems: the research agent, which is more exploratory, dynamic, and agentic, and the writer system, which is more deterministic and consistent. And we have tons of these questions, and we show in our course exactly how to build these two systems, but here it's just two hours, so we focus a bit more on the research agent, especially for the next few questions.
So here, how should the research agent communicate with the writer workflow, the writer step? Well, we pivoted a few times after testing it, and that's the most important part: actually use your product, or have a potential user use your product, to get feedback as early as possible. And on our end, we used it to build the course about it, so it was really easy to iterate and to teach what we learned and what we did. And what we saw is that our users, which were myself and the team, always either used both agents or didn't use the system at all. So we weren't alternating between the research and the writer. If we were to alternate, it was just with the writer at the end, to suggest new topics or new information to add to the article or remove from it.
But the research was already pretty much perfect and done beforehand, as you will see with Samridhi. And if a big change was needed, we would just rerun the process, so we didn't need to have proper orchestration in place. And the two agents work on the same artifacts. The research agent produces a research.md file with all its research summarized and sends it to the writer. So they need to have ways to communicate. So the decision that we made is to separate the two agents and run them sequentially without orchestration, so just a script, a very basic script. And our two agents are in the same project so that it's easier to maintain, but they also share dedicated files. So it's easy to have them communicate with each other, but also to have them communicate with us through these files. And having them together in the same project is useful for AI-assisted coding with Claude Code or whatever system you are using.
The next question for the research agent is: how should it behave? So we really needed to understand what process we would want out of this research assistant before building it. And again, we refined this after testing. In the end, we really needed a good human-written guideline, which would basically be just what the topic is about, anything I'm interested in covering in this lesson, and useful links that we think are relevant. Sometimes I send AI news that's around some specific topics. Sometimes it's my own videos or other topics that we covered. And obviously there's, like, a creator bias here, but we saw that the more we defined the agent's goal in the guideline, so the less freedom it had, the better the results were. Then, if we give links in that guideline that we send to the agent, the agent needs to scrape web pages, and it needs to be able to digest YouTube videos and GitHub repositories, even if they are private to the user, to myself, to gather all the necessary context.
And then after that, it needs to do its actual deep research thing: search the web for anything that is missing and needs understanding, then revise all this context and write it into a very nice final artifact that will be sent to the writer agent.
So how did we do all that? How did we scrape the web content that we give links to, and do web searches? Well, as always, we simplify the problem and don't reinvent the wheel. We use existing systems and APIs and divide the problem into simpler problems. So first, for the scraping solution, we use Firecrawl and agent scraping, and this is for many reasons. Samridhi will talk a bit more about the research agent, so I won't cover it too much, but it's really useful to have this kind of scraping for more dynamic websites and other types of websites that can block scraping. And for web searches, we used to use Perplexity and now use Gemini with grounding to get precise answers with sources directly, which we can then scrape if we need to; otherwise, we can already prompt it well enough to get the information we want from this grounded query.
Now for YouTube videos, because there is a lot of great information on YouTube, we ideally want to pass just the text to the agent, because video is quite heavy. We tried a lot of things, but typically you need to download the video and extract the transcript. Fortunately, Gemini now handles this directly with a YouTube URL: you just give it the YouTube URL and ask questions about it, or even ask it to provide the transcript, so it's very useful. We decided to use that, obviously. And for the GitHub content, we just used the GitHub.js library to get specific content into Markdown, and we can provide it with a token to access our private repositories, so it's really simple, and you will see that in the repository.
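Since each kind of link in the guideline goes to a different service (web pages to a scraper, YouTube URLs to Gemini, GitHub repositories to a repo-to-Markdown step), a first step could be to sort the links by type. The sketch below only does the sorting with the standard library; `classify_link` and `group_links` are hypothetical helper names, and no external API is called.

```python
from typing import Dict, List
from urllib.parse import urlparse

YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtu.be"}


def classify_link(url: str) -> str:
    """Return 'youtube', 'github', or 'web' based on the URL's host name."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in YOUTUBE_HOSTS:
        return "youtube"  # would be handed to the video-transcription step
    if host == "github.com":
        return "github"  # would be handed to the repo-to-Markdown step
    return "web"  # everything else goes to the web scraper


def group_links(urls: List[str]) -> Dict[str, List[str]]:
    """Group guideline links by the handler that should process them."""
    groups: Dict[str, List[str]] = {"youtube": [], "github": [], "web": []}
    for url in urls:
        groups[classify_link(url)].append(url)
    return groups
```

For example, `group_links(["https://youtu.be/abc", "https://example.com/post"])` puts the first URL under `"youtube"` and the second under `"web"`.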
For the framework, we use MCP with FastMCP so that we can have tools for the different processes, and we can have an MCP prompt to basically give the agent a recipe on what to do and how to use the tools, which is very useful. We will talk in depth about that in the next hour and a half. Then for the research setup, we basically just use Python, uv for managing dependencies, and GitHub, obviously. And for the models, we recently changed everything to Gemini, mostly because of the YouTube thing that I mentioned, but also because we saw that it works really well and we can alternate between Pro and Flash, the more expensive one and the cheaper one, depending on the complexity of the task. So we use them alternately in the code, and one thing that we really like is that it has a free tier for our students, so it's really convenient for us. But obviously we also compared with the other models; we were using Opus before, and other Claude models, and even OpenAI's, but we've seen that Gemini is much more interesting, especially compared to OpenAI these days. So we are using Gemini, but you need to compare different models to be sure that you are using the right one.
And finally, for the more theory part: how does the research agent communicate with the user? It's inside the MCP prompt; we have specific steps to stop and interact with the user. And all these interactions are handled through files, whether it is the guideline, the research file, or other files that Paul will talk about in the writer part. And by the way, everything I mentioned and much more is in a sort of cheat sheet that we also linked in the README of the repository. So you can definitely access it for free; obviously it's on GitHub. And now is the part where we talk about the deep research agent, so you can scan the code to get the repository if you haven't. And Samridhi will take over.
Sorry everyone, just give me one minute. Awesome.
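The file-based handoff described in this part of the talk (a human guideline goes in, the research step writes research.md, the writer step reads it, and a basic script runs the two in order) can be sketched as below. This is a minimal illustration under assumptions: `run_research` and `run_writer` are placeholder functions, and the `guideline.md` and `article.md` file names are invented for the example; only research.md is named in the talk.

```python
from pathlib import Path


def run_research(guideline_path: Path, research_path: Path) -> None:
    """Placeholder research step: reads the human guideline, writes research.md."""
    topic = guideline_path.read_text(encoding="utf-8").strip()
    notes = f"# Research notes\n\nTopic: {topic}\n\n- source: (placeholder)\n"
    research_path.write_text(notes, encoding="utf-8")


def run_writer(research_path: Path, article_path: Path) -> None:
    """Placeholder writer step: reads research.md, writes a draft article."""
    notes = research_path.read_text(encoding="utf-8")
    topic_line = next(line for line in notes.splitlines() if line.startswith("Topic:"))
    article_path.write_text(f"# Draft article\n\n{topic_line}\n", encoding="utf-8")


def main(workdir: Path) -> Path:
    """Run the two systems sequentially; the shared files are the only interface."""
    guideline = workdir / "guideline.md"  # written by a human (assumed file name)
    research = workdir / "research.md"    # research agent output, writer input
    article = workdir / "article.md"      # writer output (assumed file name)
    run_research(guideline, research)
    run_writer(research, article)
    return article
```

Because every intermediate artifact is a file on disk, a person can open, edit, or replace research.md between the two steps, which is how the human feedback stays in the loop without any orchestration framework.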
So I'll be talking about the deep research agent. What is a deep research agent? As Louis already explained, it can research any topic by searching the web, it can analyze YouTube videos, and finally it gives you a cited report. How did we build it? We used MCP, which is an open standard for giving agents access to tools and data. The key design choice here is that the agent is the brain that does all of the reasoning. As Louis highlighted, identifying any gaps in the research is done by the agent. It decides whether anything is missing, whether it needs to run multiple research queries, or whether it has to search for something else. That part is done by the agent, whereas the MCP server handles all of the capabilities: it exposes the tools, the resources, and the prompts. There is a library called FastMCP that makes creating MCP servers very easy. It hides the complexity, so you do not have to implement the protocol or the communication between the agent and the server yourself; all of that is handled by FastMCP. We're using Claude Code as our agent harness, but the code is independent of that. If you prefer Cursor or GitHub Copilot, you can swap Claude Code for any agent harness you like. I'll keep coming back to three things: an MCP server exposes tools, prompts, and resources. So what are tools? Tools are the actions an agent can take, such as doing deep research, running a Google search, analyzing a video and returning a transcript, or compiling a report. All of these action-oriented things are done by tools. The second thing the MCP server exposes is prompts, which are basically instructions.
When you're talking to ChatGPT, you give it instructions. In this case, prompts are instructions the agent can follow. We have a detailed prompt that I'll show you in a minute with the code. Finally, the third thing is resources, which is the data the agent can read. This is static data, such as the model names we're using, the versions we have, and any feature flags. Those are all part of the resources. The agent can read them, but no action is performed. In terms of tools, we have three. The first is the deep research tool. It runs a Google search and gives us sources for everything: it provides answers to any query we have, along with all of the sources behind them. We're using the Gemini API for this. The second tool is the analyze YouTube video tool. For this as well, we're using the Gemini API: you take any YouTube URL, paste it in, and it gives you a transcript. Finally, we have the compile research tool, which compiles the output of the previous two tools. We'll talk more about the .memory folder and its details when I show you the code. So I just talked about the parts of the agent that we have, and this is the overall architecture. We are using Claude Code as the MCP client. Just to highlight: for us, Claude Code is both the MCP client and the LLM. It is the brain, and it is also the MCP client that talks to our MCP server. The FastMCP server has tools, resources, and prompts, and these are the details of each tool. The first two tools write to a folder called .memory.
We wanted to preserve all of this because it makes logging easier, and it makes it easier for us to verify the results we're getting. So we write all of our results to a folder. The third tool reads from this folder and creates a Markdown file for us called research.md. The first two tools make API calls to Gemini 3, and everything is read from and written to this folder. So now I'm going to show you the code. Okay, this is how we have structured our code. You don't have to be intimidated by it; we're going to simplify it a lot. These are the two main folders, the research folder and the writing folder. Paul is going to talk about the writing part, so I'll just go through the research part. If you remember, the first thing I talked about at the start of my presentation was the MCP server, and this is the MCP server file. This is how you set up the MCP server. We load some settings here, which are basically the names of the models we use; I can show them to you in a minute. Then you set up the server by giving it a name and a version, and you register three things. This is something you should remember: we're registering tools, resources, and prompts. Then we set up basic logging, and we're using something called Opik for observability, which Paul is going to talk about in more detail. And finally we create the MCP server. So let's see how we have registered the tools. If we go to the Python tools file, this is how we have registered them.
To register a tool with FastMCP, you need three things. The first is, of course, the name of the tool. Here, the name of the tool is deep research. The second is the arguments. We're passing it two arguments: the first is the working directory, and the second is the question or query we're going to ask. Then you need a description of what your tool does, and you have to be as precise as possible when writing it. We also have descriptions for the arguments. After the tool definition, we have the code for observability, and finally we return the tool's implementation. That's it. This is how you define a tool, and there are no complications here. But let's go into a bit more detail about the tool itself. How does this tool work? For the deep research tool, the first step is validating all of our paths. We check that the working directory is valid, and then we ensure that the memory folder, the folder we write all our results to, exists. We do all of this validation here, and then we run the grounded search on the query. Once we get the result, we return the status, whether it was successful or not, the query we sent, and the answer we got. So we get a detailed response, which has been very helpful for our team when debugging and going through the data we get, so that we can refine the research results. Now let's go into more detail about how we run the research.
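The tool flow just described (validate the paths, make sure the memory folder exists, run the search, return a status with the query and the answer) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `search` callable stands in for the Gemini grounded-search call, and the `deep_research.md` file name is a hypothetical choice.

```python
from pathlib import Path
from typing import Callable
import tempfile

MEMORY_DIR_NAME = ".memory"  # folder name mentioned in the talk


def deep_research(working_dir: str, query: str,
                  search: Callable[[str], dict]) -> dict:
    """Validate paths, run a search, persist the answer, return a status dict.

    `search` is a hypothetical stand-in for the grounded-search API call and
    is expected to return {"answer": str, "sources": list}.
    """
    root = Path(working_dir)
    if not root.is_dir():
        return {"status": "error", "query": query,
                "message": f"working directory not found: {working_dir}"}

    memory = root / MEMORY_DIR_NAME
    memory.mkdir(exist_ok=True)  # make sure the results folder exists

    result = search(query)

    # Append every answer so earlier queries stay available for debugging.
    log_path = memory / "deep_research.md"  # hypothetical file name
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"## {query}\n\n{result['answer']}\n\n")

    return {"status": "success", "query": query,
            "answer": result["answer"], "sources": result["sources"]}


def fake_search(query: str) -> dict:
    # Offline stub so the sketch runs without an API key.
    return {"answer": f"Stub answer for: {query}", "sources": []}


with tempfile.TemporaryDirectory() as tmp:
    response = deep_research(tmp, "What are agent skills?", fake_search)
    print(response["status"], "-", response["answer"])
```

Returning the query alongside the answer keeps each response self-describing, which is what makes the saved results easy to inspect later.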
If you look at the run grounded search function, we have a prompt that we've already written. If you go here, you can see the research prompt. We're basically telling the model to provide a detailed, comprehensive answer to the question, to focus on official, authoritative references, to include as many relevant details as possible, and to cite its sources clearly. This is the prompt we're using for the deep research tool. Then we call the Gemini API and get the answer text and the sources. Every time you get sources from the Gemini API, they are not in a structured form, so in this part of the code we're structuring them in a better way. When I run it, you'll see that we get the output in a structure with the URL, the title, and the snippets. Finally, we return the research results. So those are all the details of our first tool. Moving on to our second tool, the YouTube video analysis: it has the same structure as the deep research tool I showed you. The first thing is the name of the tool. The second is the arguments, which in this case are the working directory and the YouTube URL. Then come the description of what the tool does and the descriptions of the arguments, and then we return the execution of the tool. Going into more detail, here is the implementation of the analyze YouTube video tool. Similar to what we did for the deep research tool, we validate all of the paths here.
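For the source-structuring step of the deep research tool mentioned above, here is a small sketch of turning loose source records into a URL, title, and snippets structure. The shape of the raw input (dicts with `url`, `title`, and `snippet` keys) is an assumption made for illustration; the real grounding metadata returned by the API looks different and would need its own extraction code.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    url: str
    title: str
    snippets: list[str] = field(default_factory=list)


def structure_sources(raw_chunks: list[dict]) -> list[Source]:
    """Merge loose (url, title, snippet) records into one entry per URL.

    The input shape is hypothetical; adapt the key names to whatever the
    search API actually returns.
    """
    by_url: dict[str, Source] = {}
    for chunk in raw_chunks:
        url = (chunk.get("url") or "").strip()
        if not url:
            continue  # skip records without a usable link
        title = chunk.get("title") or url  # fall back to the URL as title
        source = by_url.setdefault(url, Source(url=url, title=title))
        snippet = (chunk.get("snippet") or "").strip()
        if snippet and snippet not in source.snippets:
            source.snippets.append(snippet)  # keep each snippet once
    return list(by_url.values())


raw = [
    {"url": "https://example.com/a", "title": "Page A", "snippet": "First fact."},
    {"url": "https://example.com/a", "title": "Page A", "snippet": "Second fact."},
    {"url": "https://example.com/b", "title": "", "snippet": "Other fact."},
    {"title": "No link", "snippet": "Dropped."},
]
for source in structure_sources(raw):
    print(source.url, source.title, source.snippets)
```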
Then we extract the YouTube video ID using this get video ID function. If we do not get a video ID directly from the YouTube URL that was provided, we try to sanitize the input and extract the ID. This is the line where we call the Gemini API. Similar to the deep research tool, we return all of these details so that we can keep track of things and see what output we got. Now let's go into the analyze YouTube video implementation. Similar to the deep research tool, we have a prompt for the YouTube transcription, and here it is. It guides the Gemini API to give us the transcript in a certain format, and it lists the instructions and the details we do not want in the output. Here we divide the request we send to the Gemini API into two parts. In the first part, we send the YouTube URL as a file URI. In the second part, we send the prompt I showed you. This is a very cool thing. Gemini is a multimodal model, and when you send a YouTube URL as a file URI, it actually sees the video. It goes through it part by part and does not rely on any existing transcript. That's why, when we run the YouTube video analyzer tool, you'll see that it takes two to three minutes: it is actually going through your entire video. So it is really multimodal. Here we send the request to Gemini, and once we get the output, we clean it into a certain format.
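To make the two steps above concrete, here is a rough sketch of extracting a video ID and building the two-part request. It only builds the payload as a plain dictionary and does not call any API. The 11-character ID pattern, the list of accepted URL shapes, and the helper names are assumptions for illustration; the payload mirrors the REST-style shape in which the video is passed as `file_data.file_uri` next to a text part.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: YouTube video IDs are 11 characters from this alphabet.
_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{11}$")


def get_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) == 2 and parts[0] in ("embed", "shorts", "live"):
                candidate = parts[1]

    if candidate and _ID_PATTERN.match(candidate):
        return candidate
    return None


def build_transcript_request(url: str, prompt: str) -> dict:
    """Build a two-part request body: the video as a file URI, then the prompt."""
    video_id = get_video_id(url)
    if video_id is None:
        raise ValueError(f"not a recognizable YouTube URL: {url}")
    canonical = f"https://www.youtube.com/watch?v={video_id}"
    return {"contents": [{"parts": [
        {"file_data": {"file_uri": canonical}},  # part 1: the video itself
        {"text": prompt},                        # part 2: the instructions
    ]}]}


request = build_transcript_request(
    "https://youtu.be/dQw4w9WgXcQ?t=10",
    "Transcribe this video and summarize the key ideas.",
)
print(request["contents"][0]["parts"][0])
```

Normalizing every accepted URL to one canonical form keeps the saved results consistent, whichever link style the user pasted.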
Then we output it and save it in the .memory folder. Finally, coming to the last tool, the compile research tool: it collects and compiles all of the results we get from the previous two tools. If you look at its implementation, you'll see that it writes to the output research.md file and then gives us a success status or a return message. So we have seen how we registered the tools in the MCP server. Similarly, we have the code for registering the resources. Here are all the things we share with the agent as resources: the name of the server, the version we are using, which Gemini model we are using, and which model handles YouTube transcription. All of these details are provided to the agent through resources. And then the MCP prompt. This is how we have registered it: we return something called workflow instructions, which is the detailed prompt we have written. In this prompt, we tell the agent which tools are available, the workflow it can follow to use them, and some additional details such as the working directory and where all of the results should be stored. So this is the detailed prompt we give to the agent. Awesome. One more thing I wanted to highlight here is our .env file. The settings are exposed as resources, but this is where you have to add your Google API key. You can also add the Opik API key if you want observability. That is the .env file.
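As a rough sketch of what a compile step like the one described above might do, the following reads whatever the earlier tools left in the .memory folder and writes a single research.md file. The assumption that intermediate results are stored as Markdown files, and the heading layout of the output, are illustrative choices rather than the repository's actual format.

```python
from pathlib import Path
import tempfile


def compile_research(working_dir: str) -> dict:
    """Combine every Markdown file in .memory into one research.md report."""
    root = Path(working_dir)
    memory = root / ".memory"
    if not memory.is_dir():
        return {"status": "error", "message": "no .memory folder found"}

    sections = []
    for path in sorted(memory.glob("*.md")):  # sorted for a stable order
        content = path.read_text(encoding="utf-8").strip()
        if content:
            sections.append(f"## {path.stem}\n\n{content}")

    if not sections:
        return {"status": "error", "message": "nothing to compile"}

    output = root / "research.md"
    output.write_text("# Research\n\n" + "\n\n".join(sections) + "\n",
                      encoding="utf-8")
    return {"status": "success", "output_path": str(output),
            "sections": len(sections)}


with tempfile.TemporaryDirectory() as tmp:
    memory_dir = Path(tmp) / ".memory"
    memory_dir.mkdir()
    (memory_dir / "deep_research.md").write_text("Answer with sources.")
    (memory_dir / "transcript.md").write_text("Video transcript.")
    print(compile_research(tmp))
```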
Apart from that, we have the scripts here, which help us with the code throughout. These are additional scripts we have written to manage some of the tutorials. Going back: now we have the entire architecture, so how exactly do we connect it with our agent harness, which in our case is Claude Code? It's very straightforward for Claude. You just need to create an .mcp.json file in the project root and give it a command like this. In our case, we are running the MCP server locally; we're not hosting it anywhere. It uses standard input and standard output for communication, so this works. But if your MCP server is hosted elsewhere, you would need a URL. Notion is a very popular example: it has its own hosted MCP server, and you can access it through its URL. So this part of the configuration would change slightly: instead of the uv command, you would have that URL here. You can connect to any MCP server that is available, Snowflake, Notion, anything you want. One more thing I would like to highlight is that, as I said at the beginning of my talk, we're using Claude Code as an agent harness, but you can use anything you want. If you like Copilot, you can use it for this; any agent harness can use the MCP server. Okay, so let's see where that file is. You have to create that .mcp.json file in the root directory, and I can explain it to you; it's very simple. Here we have the name of the MCP server, which is deep research.
And this is the command. The virtual environment you're creating and any packages you're downloading are all handled by uv, so you don't have to do anything. Then we have this command, which runs the server file I showed you a couple of minutes ago. Claude runs the server file as a subprocess, so our server is running locally, and that is all this command does. It says run, fastmcp, run, and then the path to our server file. And this is the environment file that I told you you're supposed to create. Okay, now it's time to see how the code works, so I'll go to Claude. If you go to Claude and run /mcp, you can see that I have two servers here. Because of my .mcp.json file, it automatically detected both servers. We'll just look at the deep research server right now. Here are all the details: you can see the command, the arguments, and all of the capabilities the server is exposing. You can choose view tools and see all of the details about the tools: the tool name, the description I showed you, and all of the arguments and parameters. Similarly, you can see these details for all of the other tools. Now I'm going to run one of the tools and see how Claude does it. I've already written out this question. This video is from a previous AI talk about AI agents, and I'm going to tell Claude to analyze it for me. It should automatically detect the tool that's being exposed by the MCP server.
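For readers who want to see roughly what the configuration file described above looks like, here is a sketch that builds it as a Python dictionary and prints the JSON. Treat every concrete value as an assumption: the server name `deep-research`, the server path, the `--env-file` flag, and the `type`/`url` keys for a hosted server are illustrative, so check your harness's documentation and the repository for the real entries.

```python
import json


def local_server_entry(server_path: str, env_file: str = ".env") -> dict:
    """Entry for a server the harness starts as a local subprocess (stdio)."""
    return {
        "command": "uv",
        # uv resolves the environment, then FastMCP runs the server file.
        "args": ["run", "--env-file", env_file, "fastmcp", "run", server_path],
    }


def remote_server_entry(url: str) -> dict:
    """Entry for a server hosted elsewhere and reached over HTTP (assumed keys)."""
    return {"type": "http", "url": url}


config = {
    "mcpServers": {
        # Hypothetical name and path, for illustration only.
        "deep-research": local_server_entry("src/research/server.py"),
    }
}

# The printed JSON is what would go into .mcp.json in the project root.
print(json.dumps(config, indent=2))
```

The local entry gives the harness a command to launch, while a hosted server only needs an address, which is the "slight change" between the two setups.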
You can see here that it has automatically picked up the tool we are exposing and is trying to use it. On the side, you can see the .memory folder I told you about, where we track all of our results. It is being created, and the transcript is going to be written to a file in it. As I told you, Gemini is watching, or analyzing, this video in real time, which is why it takes a couple of minutes to produce the transcript. Basically, every time it gets a question like this, Claude Code looks at what it can do and which tools are available. Then it sends a request to the MCP server asking it to execute the tool. Once the MCP server executes the tool, it sends the result back to Claude Code, which then analyzes that result further. We can wait a minute for that. Yes, it has created this Markdown file. It says that this video is from the AI Engineer Code Summit, it has provided all of the details, and it is a pretty detailed transcript, as we can see here. It also gives its own input at the end. I really like that we store the whole transcript but also get a neat summary at the bottom with the key ideas shared in the video. And it tells us where the transcript is. So this is a short example of running one of the tools to see whether Claude is able to use it. Now I will try to run the entire pipeline. I've already written a question for that, telling it to do end-to-end research.
Now I'm using both the deep research tool and the YouTube video tool, to see how it runs them together. It says that it is doing the analyze YouTube video part. When you have a big question like this, the way the workflow works is that Claude breaks the question down into parts and then decides which tool it needs for each. It has a whole reasoning process going on: given the tools I have available, I need the analyze YouTube video tool for the YouTube video, and I need the deep research tool for the first part of the question. So it divides the question into two parts and then calls both tools. Okay, let's see. Sorry, it usually takes some time. Sure. Yes, I'm actually going to talk about that; it's the next part. I tried to remove the skill initially by moving it to a temporary folder so that it doesn't pick up the skill, because it tends to do that as well. Not about that, just the prompt part though. No, for us, we're using the skill and replacing the prompt part. I'll show the entire thing to you in just a minute; I think that might clarify your question. So here, it analyzed the YouTube video again, and it didn't give me the entire answer for the first part. Just give me one second. I have to clear Claude's context, because it tends to reuse whatever I asked it previously, and I would also like to delete the memory folder. While this is running and generating research for us, I can talk about the next slide, which is agent skills.
I'm sure all of you have heard a lot about agent skills; they are very popular right now. A skill is basically a compact, concise way of telling an agent about capabilities and workflows. Tools do the work, and skills tell the agent how to do it. We talked about the workflow prompt I showed you, where we told our agent which tools are available and how it should execute them. Instead of writing that in the prompt, we can create a skill out of it. What is the benefit of doing that? When one thing already solves the problem, why should we create another? The reason is that skills have something called progressive disclosure. As I'm going to show you in a minute, when Claude lists the skills, it doesn't load all of their content. It only loads the name and the description of each skill. Once you run a query that matches the skill, it loads the entire thing, and once that query has been executed, it wipes everything off, so it does not stay in the context. Whereas when you use a prompt, everything stays in the context and clutters it, and you want to avoid that. So skills are a very clean way of doing this. They are also very shareable: we write a lot of skills internally for our team and share them with each other. And they are more maintainable, because you can check them into GitHub and keep everything in one go-to place. So, okay.
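The progressive disclosure idea can be illustrated with a small sketch: list the skills using only the name and description from each file's front matter, and read the full body only when a skill is actually used. This is a simplified model of the behavior described above, not how any particular harness implements it. The one-folder-per-skill layout with a `SKILL.md` file follows the common skill convention, and the tiny `key: value` parser is a deliberate simplification of real YAML front matter.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class SkillSummary:
    name: str
    description: str
    path: Path


def split_front_matter(text: str) -> tuple[dict, str]:
    """Split '---' delimited front matter (simple key: value lines) from the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter found
            return meta, "\n".join(lines[index + 1:]).strip()
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


def list_skills(skills_dir: Path) -> list[SkillSummary]:
    """Cheap listing: keep only the name and description of each skill."""
    summaries = []
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        meta, _body = split_front_matter(skill_file.read_text(encoding="utf-8"))
        summaries.append(SkillSummary(
            name=meta.get("name", skill_file.parent.name),
            description=meta.get("description", ""),
            path=skill_file,
        ))
    return summaries


def load_skill_body(summary: SkillSummary) -> str:
    """Full load: only called once a query has been matched to the skill."""
    _meta, body = split_front_matter(summary.path.read_text(encoding="utf-8"))
    return body


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "research"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: research\ndescription: Run the research workflow.\n---\n\n"
        "# Workflow\n1. Analyze videos.\n2. Run research queries.\n"
    )
    for skill in list_skills(Path(tmp)):
        print(skill.name, "-", skill.description)
        print(load_skill_body(skill))
```

Only the short summaries would need to sit in the model's context up front; the body is fetched on demand and can be dropped afterwards.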
We have the results for this, but before that, I would like to show you how we have written the skills. We've written the research skill, which replaces the prompt we have. Here you write the name of the skill and the description. The description is necessary because when your agent tries to match a query to a skill, it looks at this part of the skill, so you need a good description here. This part is called the front matter, and it is what gets loaded first. Then this is the rest of the skill. In our case, we are reusing the research workflow prompt we already had instead of writing all of the details in here. You could do that, but it's a design choice for us. And this is the rest of the body of the skill. Okay, so how can you see all of those skills, and what do I mean when I say it only loads the front matter? If you go to Claude again and run the skills command, you'll be able to see them. These are some skills I've already downloaded, and these are the project skills we have created, in particular the research skill I was talking about. Every time you list them like this, only the front matter is loaded, not the full skill. If you want to use a skill, you can invoke it explicitly like this, or Claude can automatically pick it up when you ask a question. Now I'm going to ask it the same question I was asking through the MCP prompt: research what AI agent skills are, and also analyze the YouTube video for me. Yes, it is loading the research workflow prompt. It is able to do that because our server is running locally, so it can read the prompt, since everything is in the same folder.
And now you can see here that it has started the deep research. It first ran the analyze YouTube video tool, and then it is running the deep research tool. One amazing thing I would like to highlight: my initial query was, can you give me details about what agent skills are? Every time, our agent harness identifies the gaps in the output it gets. So for my initial question, it identifies whatever gaps exist and then asks a different question each time to fill them. It is doing all of the reasoning behind the scenes and making sure all of those gaps are filled in the research we're doing. It usually takes around two to three minutes to run this entire thing, so let's see. Sure. Yes, we built it originally with the MCP portion, and now we are transitioning, but we want to keep our original code. Not the server part, but the prompt we have can essentially be replaced with skills. We have a lot of detail and different components within the server, which is why we wanted to keep it. Yes, it's more about the complexity in that case. So here we can see that we've got the output for the video, and now we're just going to wait for it to produce the research output. Here you can see the initial question I asked, and it has given me the answer for that. This is the first run, and now it's doing more research on it.
It gave us an answer, but it identified some gaps in it. So this is the second query it ran, and you can see here that it is a different question. It ran two queries and decided that this is good enough. Then it says: all three research tasks are completed; let me run one more targeted query and then complete everything. That's why we wanted all of the reasoning to be done by the agent, because this is what a human would do. If you're reading an article or doing any sort of research, you try to identify the gaps and fill them. That's what it is trying to do, and that's why we needed the brain and wanted to make use of it. This is the third query it is running for me, and finally we expect it to create a compiled report, which should be ready any minute. Sure. I think for that we would have to look at the observability, and check and identify why it is not seeing the latest code. But we also gave it a video from 2020, so maybe it's because of that; the video I asked it to transcribe is about that. To answer that properly, I would go into the observability part of the code and see what is going on there. Finally, it's asking me whether it can create a Markdown file for me. So here is the research.md file that our final tool creates. It is really detailed: it has the comparison tables and all of the details. I think this is a really nice way to compile everything we have gathered so far. It also adds the YouTube video transcript at the bottom. So, yeah. Thank you so much, everyone.
Now I'll hand it over to Paul. Awesome. Thank you, Louis and Samridhi, for those great slides. Just give me a second to set everything up over here. Okay. So now we are moving to part three of this whole system, the LinkedIn writing workflow, which transforms the research from the deep research agent into polished posts that pass slop detectors, so we can transform all of you into LinkedIn influencers. Okay, as a high-level overview, this is our full system architecture. We already implemented the research agent, which outputs the research.md file. That file is the input to the writing workflow, which also takes the guideline.md file as input. The guideline is basically a fancy way of modeling the user input, and this is how we guide the generation process. The final output will be the LinkedIn post. Now let's zoom into the architecture, which is split into three big pieces. The first one is where we build up the context, and ultimately we end up with a big system prompt. Remember, this is not an agent; it is just a workflow. In reality, to write a piece of content, whether it is a LinkedIn post, a video, or an article, you don't really need agents. As Louis said in the beginning, this process is very static: you always go through the same steps, so an agent would just make everything more complicated. Of course, you can put agents in some parts of it, but we won't show that in this workshop. So the first part is to load the context, which is, as I said, the guideline.md file, which is the user input, and the research. Then we have some static files, the profiles, which I will dig into in more detail a bit later. We take all of these, build a system prompt, and pass everything to the LLM. That is phase two.
So far, nothing fancy: we just create this huge system prompt and call the LLM to create the first draft of the post. The last phase is to apply the evaluator-optimizer loop, which is probably the most interesting part, where we let the LLM review and edit the post in a loop a couple of times. This is super important because you can apply it not only to LinkedIn posts but to any type of content: video transcripts, financial reports, medical reports, articles, book chapters, or very detailed lessons like the ones in our course, where we needed to handle text snippets, images, code snippets, references, and all these little details that the LLM sometimes gets right but often does not. And even if it gets them right 80% of the time, it's annoying to go and fix the rest manually. That's the whole point: all of this is automated. We'll dig into this process in more depth later on. This is just one of our examples, the beginning of a LinkedIn post that we asked the workflow to generate. Just for fun, I want to copy this post, the exact same post, and put it into this pretty popular slop score detector, so you can trust me that this actually works, and hit analyze. And yay, not slop. This is actually less sloppy than a human, as you can see over there at the bottom. Of course, not all the posts are this free of slop, but most of them are really good and sound human. Remember, this is LinkedIn language, so we need to sound a bit like LinkedIn. But ultimately, as you can see, it doesn't have any em dashes, weird adverbs or verbs, and things like this. It reads very nicely, and the structure is skimmable, as we would like for LinkedIn.
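The evaluator-optimizer loop mentioned above can be sketched as a small control loop: review the draft, stop if there are no issues, otherwise revise and review again, with a cap on the number of rounds. In the real workflow both steps would be LLM calls; here they are replaced by rule-based stand-ins (a banned-word check and a word remover) purely so the sketch runs offline. The function names and the banned-word list are illustrative assumptions.

```python
import re
from typing import Callable

BANNED_WORDS = ("delve", "tapestry", "vibrant")  # illustrative list


def evaluator_optimizer(
    draft: str,
    evaluate: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Alternate review and revision until no issues remain or rounds run out."""
    rounds = 0
    while rounds < max_rounds:
        issues = evaluate(draft)
        if not issues:
            break  # the reviewer is satisfied
        draft = revise(draft, issues)
        rounds += 1
    return draft, rounds


def review_banned_words(draft: str) -> list[str]:
    # Stand-in for an LLM reviewer: report each banned word that appears.
    return [word for word in BANNED_WORDS
            if re.search(rf"\b{re.escape(word)}\b", draft, re.IGNORECASE)]


def remove_reported_words(draft: str, issues: list[str]) -> str:
    # Stand-in for an LLM editor: strip the reported words from the draft.
    for word in issues:
        draft = re.sub(rf"\b{re.escape(word)}\b\s*", "", draft,
                       flags=re.IGNORECASE)
    return draft


final, used_rounds = evaluator_optimizer(
    "We delve into a vibrant ecosystem of agents.",
    review_banned_words,
    remove_reported_words,
)
print(used_rounds, "round(s):", final)
```

The round cap matters in practice: an LLM reviewer can keep finding new complaints, so the loop needs a budget rather than waiting for a perfect score.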
You can check the full post at this link in the GitHub repository and also try the slop test yourself using that link.
Okay, so now let's dig deeper into the first phase of this workflow, where our core focus is understanding how we can control the generation. As we said in the beginning, we don't want to just write, "hey, I want a LinkedIn post on topic X". We want to dump our ideas, values, and thoughts into a post that actually sounds like us. So the first trick is to structure this guideline, which is the user input, beyond just "write a post on X", so that it guides what we want to write. We created a template for this that we fill in: what topic we want to address, what angle we want, some key points, the narrative flow, and things like this. And this is the only piece that changes from post to post. From the whole system, this is what we need to write ourselves, as humans, to get a different post. You can see an example here from the repo, along with many others, which I will show a bit later.
The second trick is to add these writing profiles, which tell the LLM how to write the post. You don't really want the LLM to just do its own thing; you want to guide it. And these are static: with the guideline, you define what you want to write, and with the profiles, you define how you want to write it. It's basically the styling layer on top.
We mainly created three profiles, which are themselves Markdown files. In the structure profile, we define things such as how many characters a LinkedIn post has on average and the core structure of a LinkedIn post: the hook, the body, the call to action. In the terminology profile, we define things such as the active voice and the AI slop words and expressions that we want to ban. Unfortunately, this is all you can do. You just need to keep track of a huge list of "delve", "tapestry", "vibrant", and all those words that we love, and kindly ask the LLM not to use them, until you start using them as a human as well. The last one is the character profile, which is static too, but you can configure it. For example, in our writing workflow we configured it with my, Paul Iusztin's, biography, so it knows a bit about me, how I like to write, my style, and things like this. This is how you can add a bit of personality. So again, we put all of this into the system prompt, plus the guideline that we defined before. These are static, so we define them just once, and you can access them under the writing profiles directory and look around. They are Markdown files of a few hundred lines each.
The last trick is to add few-shot examples. This is nothing fancy, but in reality the hard part is getting them. This falls under the data collection part of your system. In our particular use case, we added three LinkedIn posts from my own writing. The key idea is to use high-quality, representative few-shot examples and make them as diverse as possible in topic, length, structure, and so on. One question I think many people are curious about is: why three LinkedIn posts? I know three is a magic number, but in reality, I just guessed it.
The thing with few-shot examples is that, because you always pass them in your system prompt, you want the lowest number possible that gets the job done. People usually say three, five, ten, twenty few-shot examples, but you usually want to start with a bigger number and trim it down as much as possible while it still works. When it stops working, you put the last one back in. You want to keep your system prompt with the few-shot examples as small as possible because you always pass it to the LLM, which translates to more cost, more latency, and even degraded performance as the context grows. For that, I made a dataset available in the GitHub repository, where I extracted 20 random LinkedIn posts from my profile that got more traction. I used that as a dataset for this writing workflow and, further down the line, to configure other things. One thing to highlight is that probably the only way to reduce this load on the system prompt from few-shot examples is fine-tuning, but that's often overkill and adds a lot of friction to your tech stack.
So the last part of the system is the evaluator-optimizer pattern, which is probably the fanciest of all of this. It contains two LLM calls and, this is really important, two different context windows: the writer and the reviewer. The writer first writes the draft, and the reviewer, with a completely different context window, takes that draft and reviews it. This is super important to avoid bias, because LLMs tend to be biased toward liking what they have already written. Putting the draft into a completely new context window can remove that bias.
So how this works: the reviewer checks the draft's adherence to the guideline, which is the user input; to the research, basically to remove hallucinations; and to the profiles, meaning the structure, terminology, and character profiles. Then the editor, which is often the writer itself, applies the reviews. In our example, we run this loop three or four times to write the post, review it, edit it, optimize it, and so on. So it's similar to a fine-tuning optimization loop.
A few tricks that we observed over time. Just keeping all those versions in our working directory helps, because writing is subjective, and if you apply this reviewer too aggressively, you might not like the result. So you want to look through the last versions and pick the one you like the most. Also, the evaluator-optimizer pattern usually works with a score: the reviewer gives a score to the input, and you loop until you reach a specific threshold. But because creative work is subjective, that threshold is very hard to quantify, and the loop becomes noisy and unreliable. We realized that it's easier to just set a fixed number of iterations and let the user run more iterations manually if they want to.
Now let's dig a bit deeper into how the reviewer works. As input, it has the current state of the post. As context, it gets the guideline, the research, and the profiles, which are basically the rules the writer needs to follow. And the output is a set of Pydantic objects. I think this is the most interesting part: you constrain the LLM to output a list of structured objects, and this becomes powerful once you give a Pydantic schema to an LLM.
Pydantic objects have this property where you can attach a Field to each attribute and explain what that field means. This is a prompt engineering technique that guides the LLM a lot better in understanding what each attribute of the returned object needs to contain. For example, our review model has profile, location, and comment attributes. An example output is: we violated the terminology profile in paragraph two, where the LLM used the banned term "leverage". This way, when the editor gets these reviews, it understands exactly what was violated. From our tests, we found that using structured output performs a lot better than just letting the LLM spit out whatever it wants. Again, you can check this code in this Python file. The code itself is pretty simple, so unfortunately I won't have time to dig too deep into it.
Now, on the editor side: it gets this list of Pydantic objects and the current state of the post. It also gets all the context that the writer initially had, because the editor is actually the writer itself. This is similar to how a team of writers and editors works: I write something, I give it to a team of editors, I get the reviews, and I apply them. Another important thing is that reviews are not created equal. People like to give reviews on everything, and this is also true for LLMs. Because we loop a couple of times, we realized that it's super important to put a priority on them. We always want to prioritize the guideline first, which is the user input, then the research, and then the profiles. This is super powerful when reviews clash on the same paragraphs or sentences, because it lets the LLM know what to pick and what to apply.
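The review loop described above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the repository's code: a standard-library dataclass stands in for the Pydantic model mentioned in the talk, the `PRIORITY` table and the `prioritize` and `run_loop` names are hypothetical, and the toy reviewer and editor replace what would be two separate LLM calls.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed ordering from the talk: guideline first, then research, then any profile.
PRIORITY = {"guideline": 0, "research": 1}
PROFILE_PRIORITY = 2


@dataclass(frozen=True)
class Review:
    profile: str   # rule source that was violated, e.g. "guideline" or "terminology"
    location: str  # where in the post, e.g. "paragraph 2"
    comment: str   # what went wrong and how to fix it


def prioritize(reviews: list[Review]) -> list[Review]:
    """Order reviews so the editor resolves clashes in favor of higher-priority sources."""
    return sorted(reviews, key=lambda r: PRIORITY.get(r.profile, PROFILE_PRIORITY))


def run_loop(
    draft: str,
    review: Callable[[str], list[Review]],
    edit: Callable[[str, list[Review]], str],
    iterations: int = 4,
) -> list[str]:
    """Run a fixed number of review/edit rounds and keep every version."""
    versions = [draft]
    for _ in range(iterations):
        reviews = prioritize(review(versions[-1]))
        versions.append(edit(versions[-1], reviews))
    return versions


# Toy stand-ins for the reviewer and editor LLM calls.
def toy_review(post: str) -> list[Review]:
    if "leverage" in post:
        return [Review("terminology", "paragraph 1", "Banned term 'leverage' used.")]
    return []


def toy_edit(post: str, reviews: list[Review]) -> str:
    for r in reviews:
        if "leverage" in r.comment:
            post = post.replace("leverage", "use")
    return post


versions = run_loop("We leverage agents.", toy_review, toy_edit, iterations=4)
```

Returning the full list of versions, rather than only the last one, mirrors the point that writing is subjective and the user may prefer an earlier iteration.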
Okay, again, you can check the system prompt in this Python file in the repo. And here is a concrete example where we can see post v0, which is what the LLM spits out before the evaluator-optimizer loop, and this is how it looks after four review iterations. We can see that the text is nicely formatted for LinkedIn, and the first sentence, which is the most important, is a lot punchier. This holds for the first part of the post and also for the second part. Again, it modified the structure, the wording, everything, to look more like something you would expect on LinkedIn. You can check the whole post and all the versions from zero to four in the examples directory.
Now let me show you how this directory looks. To test this code, you have three levels. I created a simple make command just to check that everything works. Nothing crazy here: if this works end to end, it means your code works locally. Then you can pass that to Claude Code. Similar to how the research agent works, we serve everything through an MCP server, and then we connected this MCP server with the write post skill. The thing is that this takes three or four minutes to run, so I already ran it to avoid wasting your time. But here is how you can run it. You take this write post skill and just ask the LLM. We take, for example, the same example that I showed before, copy the relative path, and ask it to use the guideline from this dir. It will then know to pick up the guideline and the research associated with it and write a post. And the output is here.
So we see we have the four versions of the post, actually five versions, and also the guideline, which I think is the most interesting one. As you can see, instead of just kindly asking the LLM to write about something, you need to be very explicit about what you want from it. You need to put in an angle, the target audience you want, the key points you want to cover, the tone, and things like this. There is also a character limit that overrides the static profiles. We encoded in the system prompt that, within this guideline, you can override the default profiles if you want something different. Here you can go wild and input anything you want, but I just wanted to highlight that you need to put in some effort to think it through. You cannot fully automate everything if you don't want to sound like pure slop. On top of this, we have a small image generator. We haven't put a lot of effort into it, but this prompt generates the post using the writing workflow and then asks another tool to automatically generate an image for that post.
So, to wrap up what we built in the writing workflow: we built a pipeline that generates, reviews, and edits the post, which you can apply to any type of content. We tested it with almost every type of content, and it works really well; you just need to adapt the profiles, the examples, and things like this. Then we learned how to control the generation with the guideline, the profiles, and the few-shot examples. Again, for a different content type, you will need to adapt these. And then how to serve everything as an MCP server and skills. And yes, for this local example, you could do everything through skills; you don't really need MCP servers to make it work. But from my experience, MCP servers help you distribute logic.
Skills are like your local setup, your local hack setup that works great for you. But if you want to distribute business logic, you don't really want to ask someone: hey, download those skills, install these uv dependencies over there, plug in those CLI tools, oh, and you also need those credentials. It quickly becomes a mess to distribute this at scale. An MCP server solves that problem, and skills can help you personalize how you use it. If you want a skill to be more than a prompt and actually run code, it works, but it can quickly become a mess to set up if it's too complicated.
So now let's move on to part four of the system, which is observability, more exactly monitoring and evals. We will use the writing workflow as the example for evals, and we did monitoring for both. We will go really quickly over monitoring because, theory-wise, there's not much to say. The core problem behind monitoring, which in my opinion exploded when we started to use agents and workflows, is that debugging them purely through logs is hard. I personally can't really understand what's going on inside the terminal, especially when you see the agent's thinking, which hides a lot of stuff. So you need a tool that monitors this nicely and captures all your traces: all the LLM and tool calls, your inputs and outputs, your metadata, everything about your run, plus latency and cost tracking. As a bonus, it also stores your traces so you can build an AI evals layer on top of everything, which I'll dig into a bit later. So now let's look at our monitoring logs. We used Opik to monitor both the research agent and the writing workflow. Let's just look at it, because it's a lot easier to understand that way.
For monitoring, you usually have three big concepts. You have threads, which in our use case is the whole workflow of writing a post end to end, meaning the post itself plus the image. For example, as you can see, this workflow has 44 messages, which capture all the back-and-forth between the user and the LLM. You can intuitively see it as a conversation thread, which is why it's called a thread. Then you have the traces, and here you can dig deeper into what's going on. A thread contains multiple traces. For example, for a generate post trace, we have the high-level overview of the generate post call, where we can see how long it took to run, how many tokens it consumed, and how much it cost. And here we can see all the tool calls and LLM calls that happen under the hood: the models used, the cost per model, the latency per model, and all of that. On the right, we can also see the high-level input and output of the system, some metadata, the token usage, and basically everything that happened within that run, which helps you a lot to quickly understand what's happening. We have something similar for the research agent as well. Here the thread is a lot easier to understand. For example, we have this latest thread, which captures all the tool calls: the deep research tool calls, the tool call that creates the final report, and all the back-and-forth. And then the traces each capture just a single tool call, where, as you can see, we have just one LLM call per deep research tool call.
Okay, so now let's move on to AI evals. I want to start with why evals matter, because it's not necessarily a cool topic, but it can do a lot for your system.
Let's take our writing workflow example, and say we want to check how well it works. We all start with vibes, with vibe checking. When we generate just one post, we read it and say, hey, it's cool, awesome, it works. Then we have 10 posts. We read them, maybe, maybe not; maybe we read two or three, call it a day, and say it works for the others as well. When we want to check how well it works on 100 posts, it quickly becomes impossible. And imagine that you do this at the beginning, but as your system evolves, you start to plug in more features, and every feature can break everything. This is standard practice in software engineering, but here you work with prompts, and just one word somewhere can break all your features if you're not careful. So you need this layer on top.
Evals fit into three big layers. The first is optimization, which is very similar to training a model. We have this AI evals layer, which lets us quantify how well our writing workflow writes LinkedIn posts. Then, when we want to improve it, we have a score that tells us whether we are moving in the right direction. On every change, we run this AI evals layer, and we know if we did better or worse. The next layer is regression testing, which is similar to testing in classic software engineering: whenever we start working on a new feature, we ensure that we don't break current functionality. And layer three is to run this in production on live traces, to get warnings, errors, and alarms when users actually use your system. How this differs from normal unit, integration, and regression tests is that everything starts from the dataset. This is a data problem; we are actually building a model here.
So here is how we built it for our LinkedIn posts, where we were lucky enough to already have the data. I extracted 20 real posts from my LinkedIn. Why 20? It is a big enough number for a workshop, but in reality you'll probably need to go to at least 100. Then I reverse-engineered the guideline and the research: I derived the guideline from the post, and then I ran the deep research agent on top of the guideline to find whatever I needed to support that post. Then I generated the output: I put the guideline and the research in as inputs and generated new posts. This is super important because you want to see results from your real system. When you want to generate synthetic data of some sort, never ask the LLM to directly generate the output. Always ask it to help you generate the input, but never the output. You want the output from your real system, because that's what you're ultimately testing.
Then I went through and labeled each output with a binary pass or fail label and a two-to-three-sentence critique on why exactly I gave it that label. I usually stop when I find the first error. It is easier for me while labeling, and also for the LLM to understand: whenever I see the first failure, I stop, write "fail" and two or three sentences, and move on. Finally, I split all of this into train, dev, and test sets, because remember that here we're building a machine learning model, so we need to treat it as such and use a classic split. Only when we have this dataset and these splits can we build the evaluator and evaluate the evaluator. For some reason, when people build LLM judges, they often think they can skip this last step, which is probably the most important one.
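A labeled sample and the train/dev/test split described above might look like the sketch below. The `LabeledSample` fields follow the talk (guideline, research, generated post, pass/fail label, critique), but the class name, the `split_dataset` helper, the split sizes, and the seed are hypothetical choices for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSample:
    guideline: str       # reverse-engineered user input
    research: str        # output of the research agent for that guideline
    generated_post: str  # output of the real writing workflow, never hand-written
    label: str           # "pass" or "fail", assigned by the human expert
    critique: str        # two or three sentences explaining the label


def split_dataset(
    samples: list[LabeledSample], dev_size: int, test_size: int, seed: int = 42
) -> tuple[list[LabeledSample], list[LabeledSample], list[LabeledSample]]:
    """Shuffle deterministically, then carve out dev and test; the rest is train."""
    if dev_size + test_size >= len(samples):
        raise ValueError("dev and test together must leave at least one train sample")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]  # later used as few-shot examples for the judge
    return train, dev, test


# Placeholder data: 20 samples, with assumed split sizes of 5 dev and 5 test.
samples = [
    LabeledSample(f"guideline {i}", f"research {i}", f"post {i}", "pass", "Reads well.")
    for i in range(20)
]
train, dev, test = split_dataset(samples, dev_size=5, test_size=5)
```

Fixing the seed keeps the splits stable across runs, so a sample used as a few-shot example never leaks into the split it is evaluated on.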
And again, the most important and interesting piece here is the dataset that we created, because creating an LLM judge is very easy once you have the dataset. So let's see how this LLM judge works in our scenario. As input, it takes the generated post that it needs to label, but also the profiles, research, and guideline. This is super important because the LLM judge needs to understand the context used to generate this post. It outputs the pass or fail label plus the critique, exactly the same labels as in our dataset. Then we pass the train split from the dataset as few-shot examples. This is the most important part, because the system prompt of the LLM judge can be extremely simple if it has the right few-shot examples in place to tell it what decisions to make. The few-shot examples look like this. They have the writer input, which is the guideline and research, so the input of the writing workflow. Then the writer output, which is the post generated by the workflow. And then the labels: the pass or fail label and the two-to-three-sentence critique. And that's it. You can check out our system prompt in this file, but there's nothing fancy. The most important part is building the dataset.
Our last step is to measure the judge's reliability. Ultimately, the LLM judge here is just a binary classifier: it outputs a pass or fail label. We use an LLM judge because it's easy, but we could have used any other model that gets the job done and can output these pass and fail labels. When we measure judge reliability, our final goal is to align the LLM judge with the domain expert.
How will we do that? By testing it against the dev and test splits. This is a process very similar to training any other binary classifier. There's nothing new here; the name "LLM judge" is just fancier. We test the LLM judge against the dev and test splits and use the F1 score, which combines precision and recall, to measure how well the judge performs on these splits. The process looks like this. We first run the LLM judge on the dev split, compute the F1 score, and adjust the judge's prompt and examples to maximize it, because when we first start this process the F1 score will most probably be low, and we want to get it as high as possible. We repeat until we converge; usually you need to do this a couple of times. When you think you're done and the LLM judge is ready to go, you run it on the test split as the final validation step. So it's basically training a machine learning model. After this step, we have an LLM judge that is ready to run on new samples of data, for which it will output the binary pass or fail label and the critique.
To wire together everything from this section: we first build the dataset. Based on that, we build the LLM judge. Then we calibrate the judge, and only after that can we run it on real data. Usually all these steps are managed by an observability platform. We used Opik here, but you can use any other platform or build something in-house. It doesn't really matter; the idea is that everything should be managed by one cohesive platform. Okay, so now, as a quick demo, let's emulate this calibration step as closely as possible, because the judge is already calibrated. I will just show you how the judge performs on the dev split.
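The F1 score mentioned above is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). A minimal sketch of computing it between human labels and judge labels follows. The function name and the choice of "pass" as the positive class are assumptions for illustration; the workshop's harness may define these differently.

```python
def f1_score(human: list[str], judge: list[str], positive: str = "pass") -> float:
    """F1 of the judge's labels against the human expert's labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)  # both say positive
    fp = sum(h != positive and j == positive for h, j in pairs)  # judge too lenient
    fn = sum(h == positive and j != positive for h, j in pairs)  # judge too strict
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: the judge agrees with the human on three of four samples.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
dev_f1 = f1_score(human_labels, judge_labels)  # precision 1.0, recall 0.5
```

During calibration this number is recomputed on the dev split after each prompt or example change, and the test split is scored only once at the end.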
Under the hood, with Opik, we built a very simple evaluation harness that allows us to run these calibration steps and then run the judge on production traces. The code for our use case is pretty straightforward. So let me show you around a bit while it runs. What it does now is take all the generated posts and run the LLM judge on top of them. Then what I wanted to show you is this F1 score, which we compute from the outputs of the LLM judge against our labels from the dataset. In the meantime, I think it's more interesting to quickly see how our dataset looks, to get a sense of it. Here we have a single sample from our dataset, where what's interesting is that we have a link to the media of the post, to the guideline, to the generated post: links to everything we need as input and output for both the deep research agent and the writing workflow. Then, just by setting this scope, we plug samples in as few-shot examples for the post generator or for the evaluator, or assign them to the dev split, and so on. This is how we make sure we do the splits the right way between the judge and the generator. For simplicity, we have all those files linked together in this YAML file, locally within the GitHub repository, so you can check everything and plug and play with minimal configuration.
Okay, if we go back to the terminal, we see that it got a perfect score. So now let's run this on the test split. The goal is to check that our LLM judge did not overfit. We ran it on the dev split and got a perfect score, which is usually a strong signal of overfitting. So if the F1 score on the test split is not one, or around one, it means that our LLM judge overfitted.
Again, these are the exact same steps as before, but now we run on the test split. Let's see the final F1 score, which is again one, which is good. The scores are this good because we use just five samples per split. If you expand your splits, which should have at least 20 to 30 samples, the score will probably not be perfect. Okay, let me show how these experiments look inside.
Yeah, sure. I think the most important thing is the relationship between the F1 score on your dev split and the one on your test split, because whether the model overfitted shows in how the two compare. As I said, if it has a very high score on the dev split and a low score on the test split, it means it overfitted. But the score itself is usually very correlated with your own data. Yeah, exactly. The absolute value depends a lot on your data. Usually, on the dev split, you say, okay, this F1 score is good enough for me, and then you run on the test split to check that you get the same F1 score there as well.
Okay. With these observability platforms, you usually have this experiments tab, where you can run experiments similar to the old-school fine-tuning experiment trackers. For example, here we have the dev and test experiments. Let's open the test experiment that we just ran. There we can see the output of the binary LLM judge and our label, and we can see that there's a perfect match between the two. If we hover over this, we can also see the judge's critique. But here we compute the score only between the labels.
The critique is more useful for us, because ultimately our goal is for us as humans to look over these results and understand what's going wrong with the system. As a final step, we also prepared an online simulation, where I simulated some online traces and the LLM judge runs on top of them, so you get the binary labels and the critique. The thing is that this takes five minutes or more to run, because it needs to run for real: it generates all the posts and runs the LLM judge on top. So I'll just show you a result here, which I bundled together as an experiment. This is the online test, and these are the results from the judge, where we find the label and the critique itself. Of course, this system is not yet perfect, and we probably need to refine it more, because on these online traces, as you can see, it kind of failed, to be honest. It gave us a pass on all the traces, while in reality I also have the labels for these simulated scenarios, and it got them wrong. I probably need to go back and expand the dev split with new samples. For example, I could take these samples, put them in the dev split, and refine the judge again and again until it actually works.
I'm kind of running out of time, but you also have this skill to run the demo end to end, from research to writing. You can, for example, write a guideline about something you want to write a post about and give it as input to this skill, which is just a file. It will know to do the research and then continue to writing the post. You can also observe that in Opik, with all the threads, traces, and all of that. And I guess the final step, if you haven't done so yet, is to open up the GitHub repository, run everything yourself, and read the code.
Without reading the code, you most probably won't really understand what's going on inside. You can do that by accessing this link or scanning the QR code. And whenever you're ready and want to go deeper into building this type of multi-agent system, we have this agentic engineering course, which was the inspiration for this workshop. There, our goal was to design and build production-ready agents, not just small local systems, across 34 lessons, with three end-to-end portfolio projects, a certificate, and a Discord community with access to us. So far it's rated five out of five by 300-plus students. And if you don't believe us, the first six lessons are free to try out. You can access it using this link or scanning the QR code. And that's it for the workshop today.
TL;DR
- Generic, shallow, or outdated AI-generated content, termed "AI Slop," highlights the need for well-structured systems and human oversight in technical content creation.
- Effective AI system design involves balancing "autonomy" — from simple prompts to complex multi-agent systems — to control cost, maintain flexibility, and handle dynamic tasks.
- The presented "Deep Research" system combines an exploratory research agent with a deterministic writing workflow to augment human technical writers, focusing on high-quality, cited content production.
Takeaways
- Combat "AI Slop" with Structured Systems: AI-generated content is often generic, meaningless, or outdated. Counter this by implementing systems that perform thorough research, enforce structured writing, and incorporate human feedback.
- Choose the Right Autonomy Level: Employ an "Autonomy Slider" mental model to decide between simple prompts, advanced workflows, or full agent systems, balancing control, cost, and the need for dynamic behavior.
- Workflows vs. Agents: Use augmented language model workflows for tasks with predetermined, sequential steps. Opt for agents when the system must take autonomous actions, react dynamically to its environment, or make complex plans (e.g., calling APIs, branching logic).
- Manage Context Budget Effectively: Be aware of "context rot," where LLM performance degrades well before the maximum context window is reached. Actively manage context by pruning, summarizing, retrieving, or delegating tasks to tools or sub-agents.
- Delegate with Tools and Multi-Agents: Leverage tools as specialist capabilities within a single agent for complex but sequential tasks. Consider multi-agent systems for very large context loads (e.g., >200k tokens) or highly distributed, autonomous decision-making.
- Augment, Don't Replace, Human Writers: AI systems should primarily augment human content creators, not replace them. The human touch remains essential for relatability, humor, and connecting with the audience in high-quality technical content.
- Iterate and Integrate Specialized APIs: Continuously test and refine agent designs based on feedback. Integrate specialized external APIs for specific functions like dynamic web scraping (Firecrawl), precise web search with sources (Perplexity/Gemini with grounding), and YouTube video transcription (Gemini).
Vocabulary
AI Slop — Generic, meaningless, shallow, or outdated content generated by artificial intelligence models.
Autonomy Slider — A conceptual model representing the spectrum of AI system complexity, from simple prompts to full agent systems, inversely correlating with control and directly with cost.
Workflow — An augmented language model system (with data, tools, memory) that executes a predefined, sequential series of steps without dynamic decision-making or reaction to an environment.
Agent — An AI system capable of taking autonomous actions, reacting dynamically to its environment, and planning which tools or steps to execute.
Context Window — The maximum amount of text (tokens) an LLM can process at once.
Context Rot — The observed degradation of an LLM's performance and ability to leverage information, often occurring well before the stated maximum context window limit is reached.
Deep Research System — An advanced AI system designed to thoroughly research a given topic by planning searches, using web access and tools, inspecting sources, synthesizing information, and iterating to produce a comprehensive report.
MCP (Model Context Protocol) — An open standard for giving AI agents access to tools and data, simplifying agent creation and communication.
Grounding — The process of ensuring an AI model's responses are based on factual information from specific, verifiable sources, often by providing those sources to the model.
Transcript
Hello everyone, not too loud? I hope it's fine. Perfect. So yeah, we'll kick off the workshop and we'll introduce ourselves shortly. But first, we just wanted to present this slide, because that's basically LinkedIn this year. It's the type of content you get when you ask ChatGPT. It's a very generic response, and there are a few things wrong with this response that we don't really like seeing. The obvious ones are the slop words, the AI slop, so all the "delve" and "intricacies" that we all know about. But there are more, and fortunately there aren't even any em dashes in there. But there are some other problems, like the lines that are a bit abrupt. But the one that is more general is the "most companies miss" or "most people miss" pattern. It does that often. These are some examples from LinkedIn posts where they all say "most teams," "most people," "most AI projects." And I'm actually guilty of that one. So we learn as we do. And there are some other problems with hallucinations or, here, outdated information. Obviously, this was generated like last week and GPT-4 is not state of the art. And likewise, there are some AI slop phrases that we see a lot: the obvious "rapidly evolving" one, but also the classic "it's not about something, but something else" that it generates a lot. And the worst of all is that typically this is really meaningless and shallow. It doesn't provide any value or anything useful. And so what we need here is proper research and proper writing around these language models. And to give more context on this workshop: at Towards AI, we create courses and tons of videos, training and technical content. And to do that, we need to create technical content around AI engineering.
So we need technical writers, which means we need AI engineers, writers and editors that iterate together, and a lot of time to create good storytelling, a good article, a good lesson or a good video in general, which in the end costs a lot of money. So what we tried to do is automate this process, which we did, or at least a part of it, because another part is really a human thing: making a good story that relates to others. And so that's what we did. We built a system for replacing this whole process of doing very thorough research and then technical writing. So what it does is: we give it a topic, like "what is harness engineering," and we have our own deep research agent that will search various websites and use tools to do lots of things. And then another system writes an actual technical article with code, images and ideally non-slop writing in there. And actually, since we build courses, we used the system to build a course that teaches how to build the system. So it was a really fun project to do in general, and it allowed us to iterate and test a lot and to get user feedback directly from our students. So it was really nice. And that's what we try to share here in this two-hour workshop, or at least a much more compact version, with a smaller, simpler deep research system and a writing agent strictly for shorter content, for a LinkedIn type of post. And we also try to share what we learned building it. You can open or get the GitHub repository. It's public. In the first 30 minutes, I will just talk with slides, so you won't be needing it. But after that, my colleagues will jump in with the code, so it might be worth having it. And we will show this QR code again later on, so you can get it in time. But basically, I will cover what we learned and some terminology, because we are an education company and that's what we like to do.
So I will cover some basics to bring everyone to the same level, or at least to our definitions of things, and share what we learned by building it. And then Samridhi will take over to talk about the actual deep research agent and follow with the rest. So who are we? On my end, I'm the CTO and co-founder of Towards AI. My name is Louis-François. I was a PhD student in AI until ChatGPT came out, when I basically switched to being an educator full time and creating Towards AI to focus on that. I've been doing videos to explain research papers for seven years now, and obviously right now I'm more around AI engineering. And I've been developing in the space since 2020, at the first startup I was hired at. So I will let Sam introduce herself.
Hi everyone. I'm Samridhi. I am a machine learning engineer. I'm also a technical writer and consultant helping Towards AI. Cool. So nice to meet you all.
Hello, everyone. I'm Paul Iusztin. I've been building software for eight years, and I've been in the AI teaching space for four-plus years. I'm also the author of the LLM Engineer's Handbook bestseller. And I'm excited to show this workshop to you guys today.
So we always start with a problem, and here it's pretty general. I'll start by introducing the AI engineering problem space, where basically all our decisions as AI engineers are governed by constraints that typically apply less to software engineers, like the cost per task, which can swing a lot based on the model you use and the architecture you use. We have latency requirements, with reasoning models and other things that we can control. We obviously have some quality to respect, and data privacy that we need to be careful of. Fortunately, we have a stack to help us, whether it is prompt engineering, context engineering, using tools and orchestration, or building our own evaluation system.
Then we like to see that as some sort of autonomy slider that goes from very simple, just prompting a model, to building agent systems, where the more complexity you add, from prompting to more advanced workflows to agent systems, the more autonomy you add, but also the less control you have over your whole system. And obviously, the higher the cost. And to us, that's very important to choose when we build for clients. So that's why we try to teach what we build in this loop. But to us, it's really important to build the right system, where we don't add too much autonomy where it can bring problems and uncertainty. And obviously most people are interested in building agents, but most of these agents that our clients want are actually super simple workflows, or at least workflows that we can come up with pretty easily. And so I want to start by defining workflows, up to multi-agent systems, finishing with this deep research example that we have made. So what is a workflow? A workflow is very simple. It's just a language model, but augmented. Because language models just take in tokens and text and generate tokens and text. So you cannot do much with text other than reading and writing emails, and some of us do more than that. So ideally you want to add stuff to it, like data that it may not have access to, whether it's at your company or some other data that it needs to best answer the user. You can give it access to tools to do other things than just generating tokens. And you can add memory so that it can remember the interaction and be much more useful. But still, this is not an agent. It's just a thing we can build easily that can be reliable. And a workflow can be even more complex than that. You can chain prompts together to reduce the latency and complexity, to do many more things based on strict conditions.
You can even do that with a router to decide automatically, based on conditions, what you want to do, whether it is to use a smaller model or just do a different task and then come back. You can do that in parallel to be more efficient, or to improve the results with majority voting and other techniques. And you can even add loops to automatically improve the results based on a judge's feedback. But all this is still not agentic. It's just a workflow that we can build, adding all these things. We can even combine them. We can build really advanced workflows. So when does that thing become an agent? And here again, I'm using an illustration from Anthropic. It's a definition we really like. It's just when it needs to take actions and, more importantly, when it can react to what happens in the environment. So very simply, it should be able to take autonomous actions and react to what happens, as I said, and ideally be able to plan, to decide which tool to use and, more importantly, which tool not to use, how to respond, and everything. So when should we use an agentic system compared to a workflow? Well, typically, we always want to use the simplest solution. If just a prompt works, that's ideal. And when we build things, we always try to start with questions and use our autonomy slider. So a list of simple questions we can answer would start with whether the model already knows enough about the task to be done, whether it is a user asking questions or some other task. If so, obviously, you can just prompt it and do it, and ideally add some few-shot examples, just because that really helps the model adapt somewhat on the fly. That would be simple prompting.
Then if you need external context, if it's not too big and you can paste it in the prompt, so somewhere under around 200,000 tokens, you can just paste it right away, and ideally have it in place before the user asks and use context caching so that it's more efficient to, for example, answer several questions about the same report, or your privacy documents, or whatever internal documents you may have. Now if the context is not known until the user asks the question to your model, this is where you may try to inject knowledge on the fly based on the question, whether it is because your data is private and you need to retrieve it to answer the question, or because it's something recent that the models haven't been trained on, or domain-specific. And if you want to do all this in different ways or depending on conditions, this is where you use a more advanced workflow with predetermined steps and sequences. And so in practice, an example of a workflow we built for a client was support ticket handling, where basically you always have the same steps in the same order: the system receives the ticket, classifies it, routes it to the right team, drafts a response, validates it against the policy of the team and then sends it. And here the key point is that these steps are always the same and the order never changes. It always needs to do these six steps in the same order. And so building this as an agent would just add overhead without adding anything extra. You don't need to dynamically react to anything, right? Here you just need to execute these steps. So a workflow definitely makes sense in this case. So for an agent, what you want to ask yourself is whether your system needs to take actions or to be able to branch dynamically, whether it is to call different APIs depending on what the user needs, write to the database or not, do reasoning and other things. This is where an agent can be useful.
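As an editorial aside, the fixed-order ticket workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ticket` fields, the step functions and the keyword rules are hypothetical stand-ins for what would be LLM calls and real policy checks in the client system, which the talk does not show.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    text: str
    category: str = ""
    team: str = ""
    draft: str = ""
    approved: bool = False
    sent: bool = False


# Each step is a hypothetical stand-in; a real workflow would prompt a model here.
def classify(ticket: Ticket) -> Ticket:
    ticket.category = "billing" if "invoice" in ticket.text.lower() else "general"
    return ticket


def route(ticket: Ticket) -> Ticket:
    ticket.team = {"billing": "finance", "general": "support"}[ticket.category]
    return ticket


def draft_response(ticket: Ticket) -> Ticket:
    ticket.draft = f"Hello, the {ticket.team} team is looking into your {ticket.category} request."
    return ticket


def validate(ticket: Ticket) -> Ticket:
    # Toy policy check: a draft must exist and must not promise guarantees.
    ticket.approved = bool(ticket.draft) and "guarantee" not in ticket.draft.lower()
    return ticket


def send(ticket: Ticket) -> Ticket:
    ticket.sent = ticket.approved
    return ticket


# The order is fixed in code; nothing decides at runtime which step comes next.
PIPELINE = [classify, route, draft_response, validate, send]


def handle(text: str) -> Ticket:
    ticket = Ticket(text=text)  # step 1: receive the ticket
    for step in PIPELINE:       # steps 2 to 6, always in the same order
        ticket = step(ticket)
    return ticket
```

Because the sequence lives in a plain list, there is no planner to go wrong, which is the point the speaker makes about workflows adding less overhead than agents for this kind of task.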
And an example we had was a CRM platform in Canada that wanted to create a chatbot for their users, their clients, to be able to generate marketing content automatically. Initially they reached out to us because they were applying for an AI grant and they wanted a multi-agent system to do a lot of different things, with agents for everything, which did sound really good for the grant because it was AI-focused. But it seemed really overkill to build many agents to do something as simple as a marketing chatbot. So in the end, what we always do is try to really understand what the client needs and what the client wants, which are oftentimes very different. And so by talking with them, we figured out that the actual workflow was always somewhat the same, always simple. The agent needed to make a plan to decide what it should do, then retrieve the data specific to the client, generate the content, validate it and fix it if needed. And so the tasks were always sequential, the same set of tasks that we wanted to do. They were tightly coupled, always the same thing about generating marketing content, just for different clients and for different formats. And the whole sequence is context dependent. So whether it is the plan, the retrieval, the generation or the fix, it all needs to have the context in mind to be able to generate the best content possible. So splitting decisions across different agents, as the client wanted, would have caused a lot of issues, whether it is lost information or errors at the handoffs between agents, and other things that would reduce reliability. So instead we just built one agent but used tools as its capabilities. And this is really great because tools can have their own system prompt, their own validation logic, even their own LLM or their own LLM calls. So you can do tons of things with tools. And in our example, we had a validation tool and one tool specific to each format, like the SMS tool or the email tool.
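To make the "tools as specialists" idea concrete, here is a minimal sketch, not the client implementation. It assumes a hypothetical `fake_llm` stand-in for a model call and two hypothetical format tools. Each tool carries its own system prompt and its own validation logic, while the single agent keeps the global context.

```python
from dataclasses import dataclass
from typing import Callable


def fake_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a canned string."""
    return f"[{system_prompt}] {user_prompt}"


@dataclass
class Tool:
    name: str
    system_prompt: str               # each specialist has its own instructions
    validate: Callable[[str], bool]  # and its own validation logic

    def run(self, brief: str) -> str:
        output = fake_llm(self.system_prompt, brief)
        if not self.validate(output):
            raise ValueError(f"{self.name} produced invalid output")
        return output


TOOLS = {
    "sms": Tool("sms", "Write a short SMS.", lambda text: len(text) <= 160),
    "email": Tool("email", "Write a marketing email.", lambda text: len(text) > 0),
}


def run_agent(request: str, fmt: str, client_context: str) -> str:
    # The single agent owns the plan and the client context;
    # it only delegates the format-specific generation to a tool.
    brief = f"{request} | context: {client_context}"
    return TOOLS[fmt].run(brief)
```

A call such as `run_agent("Spring sale", "sms", "bakery")` goes through the SMS specialist, and an over-long result is rejected by that tool's own check rather than by a separate agent.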
And the result is that we use tools as specialists, but the global context stays within our only agent, the decision maker and the planner. But if we do that, it means that everything stays within our agent. So we have a constraint, which is the context window of the model that we are using, which is made of many things: the system prompt, obviously; the instructions that we give it; the tool definitions, the different schemas and how to use these tools; and few-shot examples of the task to be done. It's really great, obviously, to teach by examples. As we've seen with models, they learn very quickly that way for most tasks. You can have retrieved data, depending on what you want to do, and even the entire conversation history, if it's a chatbot or whatever type of agent that evolves over time. So all of this can take up a lot of space, which causes a problem, because as we go through each of these steps, the context grows and the performance degrades, which we call context rot. And the problem is that this happens well before the actual context window limit of, for example, one million tokens. It worsens quite fast after about 200,000. It keeps changing, but it's around 200,000 now. And this is mostly because of the lost-in-the-middle problem: we basically teach these long-context models to handle long context by feeding them a large corpus, inserting a random fact in it and asking them to retrieve that fact. So it's basically retrieval of one specific thing, but it doesn't teach those models to leverage the whole book, the whole context, to answer the question. So it's definitely not ideal, but it would be too expensive to build that kind of large dataset. So this way of training them causes the problem of us having to manage this context budget, which means always keeping the context as lean and relevant as possible, both to reduce cost in terms of token count and to improve the metrics that we are using.
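As a rough illustration of managing a context budget, the sketch below keeps the system prompt and only the most recent messages that fit. It is an assumption-laden example: `estimate_tokens` is a crude characters-per-token heuristic rather than a real tokenizer, and the function names are hypothetical. It shows pruning only, not summarization, retrieval or delegation.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit within the budget."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break                      # older messages are dropped
        kept.append(message)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))
```

In practice the budget would be set well below the model's advertised window, since the talk notes that quality degrades long before the hard limit.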
And we can use many techniques for that. We can trim the content, we can summarize it, we can retrieve based on criteria, and there are also many interesting techniques in the Claude Code leak, like a compaction method, that I recommend looking into. Those are really powerful ones, but the one I'm most interested in for this talk is the delegation that we can do to either tools or sub-agents with their own context, which is what most harnesses that we use are doing. And so this leads us to multi-agent systems, which we basically want to use when we have too much of this context, mostly over 200k tokens, or when the context just becomes way too large. We can delegate to tools or to agents. And there can be other reasons, like if you need autonomous decision making, or you need tool variability, or just security compliance, if you need to have one agent running locally inside one hospital, for example, because I was in the healthcare field before and we used to have to keep everything local. So that can be a reason for multi-agent systems. And why am I talking about all this? Because what we've seen is that AI products are never just "you build an agent" or "you build a multi-agent thing." They basically combine all of that: they combine tools and workflows. Some parts are more specific and predefined, some parts have more flexibility. And we believe that AI engineers need to understand all of these to build complex systems. And deep research actually implements that. It's a very good example of a complete system since it integrates all of these techniques. And before we show how we built our own and allow you to use it as well, let me just cover quickly what a deep research system is, just in case. It's a reasoning system that will plan what it needs to research. It has some autonomy to decide what to research. It has tools, internet or web access. It's reliable. It will cite its sources.
It has feedback loops with itself, but also with the human, for human feedback. So this is much more agentic, as we see it. It evolves in an environment, which is typically the web, user-provided sources or just API access. And it's goal-driven. It has one objective. We don't tell it exactly how to do it. We just tell it to do research about something. So it's typically much more than just a chatbot. It basically replaces someone who would do very thorough research on a specific topic. It plans its search, it inspects, pivots, synthesizes information and can iterate. So deep research systems are really, to us, one of the best projects to learn how to build such a complex end-to-end system. So we decided to build our own. In this case, our own is basically end-to-end technical article production from a topic, because we create course lessons in our case. So it came out of a real utility, which is the most important part. So the goal here is to take a topic, research it deeply and write a very good technical article, with code that is actually runnable and with images that are useful, not just randomly generated images throughout. And there are a few challenges when doing this. First, you need really high precision and recall when doing the research, because you need as many relevant sources as possible, but you don't want too many of them because of the limited context problem. You need to reduce hallucinations and AI slop, obviously, and incorporate a lot of human feedback, especially in the writing aspect. And as we always do with projects, and I think you all should do as well, we started by asking ourselves questions to better understand how we should do it and whether we should do it at all. So the first question is: is this worthwhile? Is there a solution that already exists outside of what we could build? Our analysis is that high-quality technical content is expensive to produce.
We do that every day, and we need a lot of people, and it needs a lot of time and review, and it's just really time-consuming and really expensive, especially because these people need to be somewhat senior AI engineers to be able to explain anything in AI engineering. So they need expertise, but to teach someone, not to build a successful product, so it's expensive and not that creative. So yeah, it would be ideal to automate most of this process. And as I said, the research needs strong recall and precision, and on the other side, the writing needs to be more constrained: it needs to avoid hallucinations, be of high quality, and follow the writing tone and the structure that you want. And in terms of deep research tools, we use all of them pretty much daily, and they are way too exhaustive. They gather way too much content and they have a lot of noise. It's not ideal in our case. We do use them, but we decided to build our own just to be able to do exactly what we wanted. So the decision is yes, this was worthwhile, but specifically as writer augmentation, not to replace writers, because as we've seen, even if you build the best system possible these days, it's really hard to create a piece of content, whether it is a video or a written lesson, that connects with the reader or the viewer. So we need a human touch so that it's relatable to someone. You need to make jokes, good jokes, not the same ones all the time, and ones that are appropriate to the context. So it's really difficult to automate all this, and we want to keep the human in the loop there. Now, the first actual question we asked ourselves is what the architecture should be: whether it should be an agent, multiple agents or a workflow. And the idea here is that first we have the research part, which is really exploratory.
It needs to find lots of sources, it needs to explore the web to understand what it needs to find and whether it is missing information, so it needs to be able to iterate, to pivot, to search again. It needs a lot of flexibility. And on the other side, the writing agent is much more deterministic. It needs to follow a specific tone and a specific structure. It needs constraints much more than it needs flexibility. You don't want AI slop, you don't want specific words or specific sentences, and you do want specific structures and specific formats. So it's much more constrained. So we have a conflict here, where the research needs flexibility but the writing needs constraints. And so the decision is to split into two systems: the research agent, which is more exploratory, dynamic and agentic, and the writer system, which is more deterministic and consistent. And we have tons of these questions, and we show in our course exactly how to build these two systems, but here it's just two hours, so we focus on the research agent a bit more, especially for the next few questions. So here: how should the research agent communicate with the writer workflow, the writer step? Well, we pivoted a few times after testing it, and the most important part is to actually use your product, or have a potential user use your product, to give you feedback as early as possible. And on our end, we used it to build the course that teaches it, so it was really easy to iterate and to teach what we learned and what we did. And what we saw is that our users, which were myself and the team, either always used both agents or didn't use the system at all. So we weren't alternating back and forth between the research and the writer. If we did go back, it was just with the writer at the end, to add new suggestions, new topics or new information to the article, or to remove something from it.
But the research was already pretty much perfect and done beforehand, as you will see with Samridhi. And if a big change was needed, we would just rerun the process, so we didn't need to have proper orchestration in place. And the two agents work on the same artifacts. The research agent produces a research.md file with all its research summarized and sends it to the writer. So they need to have ways to communicate. So the decision that we made is to separate the two agents and run them sequentially without orchestration, so just a script, a very basic script. And our two agents are in the same project so that it's easier to maintain, but they also share dedicated files. So it's easy to have them communicate with each other, but also to have them communicate with us through these files. And having them together in the same project is useful for AI-assisted coding with Claude Code or whatever system you are using. The next question for the research agent is: how should it behave? So we really needed to understand what process we would want out of this research assistant before building it. And again, we refined this after testing. And in the end, we really needed a good human-written guideline, which would basically be just what the topic is about, anything I'm interested in covering in this lesson, and useful links that we think are relevant. Sometimes I send AI news around some specific topics. Sometimes it's my own videos or other topics that we covered. And obviously there's a creator bias here, but we saw that the more we defined the agent's goal in the guideline, so the less freedom it had, the better the results were. Then, if we give links in that guideline that we send to the agent, the agent needs to scrape web pages, and it needs to be able to digest YouTube videos and GitHub repositories, even if they are private to the user, to myself, to gather all the necessary context.
And then after that, it needs to do its actual deep research thing, so search the web for anything that is missing or needs more understanding, and then revise all this context and write it into a very nice final artifact that will be sent to the writer agent. So how did we do all that? How did we scrape the web content that we give links to, and do web searches? Well, as always, we simplify the problem and we don't reinvent the wheel. We use existing systems and APIs and divide the problem into simpler problems. So first, as a scraping solution, we use Firecrawl and agent scraping, and this is for many reasons. Samridhi will talk a bit more about the research agent, so I won't cover it too much, but it's really useful to have this kind of scraping for more dynamic websites and other types of websites that can block scraping. And for web searches, we used to use Perplexity and now use Gemini with grounding to get precise answers with sources directly, which we can then scrape if we need to, but otherwise we can already prompt it well enough to get the information we want from this grounded query. Now for YouTube videos, because there is a lot of great information on YouTube, we ideally want to pass just the text to the agent, because video is quite heavy. We tried a lot of things, but typically you need to download the video and extract the transcript. Fortunately, Gemini now handles this directly with a YouTube URL: you just give it the YouTube URL and ask questions about it, or even ask it to provide the transcript, so it's very useful. We decided to use that, obviously. And for the GitHub content, we just used the gitingest library to get a repository's content into Markdown, and we can provide it with a token to access our private repositories, so it's really simple, and you will see that in the repository.
For the framework, we use MCP with FastMCP so that we can have tools for the different processes, and we can have an MCP prompt that basically gives the agent a recipe for what to do and how to use the tools, which is very useful. We will talk in depth about that in the next hour and a half. Then for the setup, we basically just use Python, uv for managing dependencies, and GitHub, obviously. And for the models, we recently changed everything to Gemini, mostly because of the YouTube thing that I mentioned, but also because we saw that it works really well and we can alternate between Pro and Flash, the more expensive one and the cheaper one, depending on the complexity of the task. So we use them alternately in the code, and one thing that we really like is that it has a free tier for our students, so it's really convenient for us. But obviously we also compared it to the other models we were using, Perplexity before, and Claude models and even OpenAI's, and we've seen that Gemini is much more interesting, especially compared to OpenAI these days. So we are using Gemini, but you need to compare different models to be sure that you are using the right one. And finally, for the more theoretical part: how does the research agent communicate with the user? Inside the MCP prompt, we have specific steps to stop and interact with the user. And all these interactions are handled through files, whether it is the guideline, the research file or other files that Paul will talk about in the writer part. And by the way, everything I mentioned and much more is in a sort of cheat sheet that we also linked in the README of the repository. So you can definitely access it for free. Obviously, it's on GitHub. And now is the part where we talk about the deep research agent, so you can scan the code to get the repository if you haven't. And Samridhi will take over.
Sorry everyone, just give me one minute. Awesome.
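As a recap of the design described in the first part of the talk (two systems run one after the other by a plain script, communicating through files), here is a minimal sketch. It is an illustration only: `research.md` is the artifact named in the talk, while `guideline.md`, `post.md` and both step functions are hypothetical placeholders for the real research agent and writer workflow.

```python
import tempfile
from pathlib import Path


def run_research(guideline_path: Path, research_path: Path) -> None:
    """Hypothetical research step: reads the human guideline, writes research.md."""
    guideline = guideline_path.read_text(encoding="utf-8").strip()
    research_path.write_text(f"# Research\n\nNotes for: {guideline}\n", encoding="utf-8")


def run_writer(research_path: Path, post_path: Path) -> None:
    """Hypothetical writer step: reads research.md, writes the draft post."""
    research = research_path.read_text(encoding="utf-8")
    line_count = len(research.splitlines())
    post_path.write_text(f"Draft post based on {line_count} lines of research.\n", encoding="utf-8")


def run_pipeline(workdir: Path) -> Path:
    # No orchestrator: the two steps run one after the other,
    # and the only interface between them is a file on disk.
    research_path = workdir / "research.md"
    post_path = workdir / "post.md"
    run_research(workdir / "guideline.md", research_path)
    run_writer(research_path, post_path)
    return post_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "guideline.md").write_text("What is harness engineering?", encoding="utf-8")
        print(run_pipeline(workdir).read_text(encoding="utf-8"))
```

Because every handoff is a file, a human can open and edit `research.md` between the two steps, which matches the point about communicating with the system through files.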
So I'll be talking about the deep research agent. And what is a deep research agent? Louis already explained that it can research any topic by searching the web, it can analyze YouTube videos, and finally it will give you a cited report. How have we built it? We've used MCP for this, which is an open standard for giving agents access to tools and data. The key idea, the design choice here, is that the agent is the brain that is doing all of the reasoning. As Louis highlighted, handling any gaps in the research is done by the agent. It decides whether there's anything missing in the research, whether it needs to run multiple searches or has to search for something else. That is the part that is done by the agent, whereas the MCP server is handling all of the capabilities. So it's going to be exposing all of the tools, and it's going to be exposing the resources and the prompts. Then there's something called FastMCP, which is a library that makes creating agents very easy. It hides all of the complexity, and you do not have to write any sort of protocol or communication between agents. All of that is handled by FastMCP. And we're using Claude Code as our agent harness, but this code is independent of that. If you're more inclined towards using Cursor or GitHub Copilot, you can swap Claude Code for any agent harness that you like. So I'll be repeatedly talking about these three things: an MCP server exposes tools, prompts and resources. So what are tools? Tools are basically all of the actions an agent can take, like doing deep research, doing a Google search, analyzing a video and giving us a transcript, or compiling a report. So all of these action-oriented things are done by tools. The second thing that the MCP server exposes is prompts, which are basically instructions.
When you're talking to ChatGPT, you give it instructions. In this case, prompts are instructions the agent can follow. In our case, we have a detailed prompt that I'll show you in a minute with the code. And finally, the third thing is resources, which is the data the agent can read. This is static data: the model names we're using, the versions that we have, any feature flags. Those are all part of the resources. These are things the agent can read; there is no action being done here. In terms of tools, we have three. The first is the deep research tool. It basically does a Google search and gives us an answer to any query we have, along with all of the sources for that answer. We're using the Gemini API for this. The second tool is the analyze YouTube video tool. For this as well, we're using the Gemini API, so you can take any YouTube URL, paste it, and it's going to give you a transcript. And finally, we have the compile research tool, which compiles the output of the previous two tools. We're going to talk more about the .memory folder and its details when I show you the code. So I just talked about the parts of the agent that we have, and this is the overall architecture. We are using Claude Code as the MCP client. Just to highlight, Claude Code for us is both the MCP client and the LLM. It is the brain being used, and it is also the MCP client that talks to our MCP server. The FastMCP server that we have has tools, resources, and prompts. These are the details about each of the tools. The first two tools write to a folder called .memory.
We wanted to preserve all of this because it is easier for logging and easier for us to verify all of the results that we're getting. So we are writing all of our results to a folder. The third tool reads from this folder and creates a Markdown file for us called research.md. The first two tools make API calls to Gemini 3, and everything is read from and written to this particular folder. So now I'm going to show you the code. Okay, this is how we have structured our code. You don't have to get intimidated by this; we're going to simplify it a lot. These are the two main folders, the research folder and the writing folder. Paul is going to talk about the writing part; I'll just go in and talk about the research part. If you remember, when I started my presentation, the first thing I talked about was the MCP server. So this is the MCP server file, and this is how you set up the MCP server. We have some settings that we are loading here, which are basically the names of the models that we use. I can show that to you in a minute. And this is how you set up the server: you give it a name and a version, and then you register three things. This is something that you should remember: we're registering tools, resources, and prompts. Then we're just setting up basic logging here, and we're using something called Opik for observability. Paul is going to talk about that in more detail. And then we're finally creating the MCP server. So let's see how we have registered the tools. If we go to this file called tools, the Python tools file, this is how we have registered them.
For registering tools with FastMCP, you need three things. The first is, of course, the name of the tool. Here, the name of the tool is deep research. The second is the arguments. We're passing it two arguments: the first is the working directory, and the second is the question or query that we're going to ask it. Then you need a description of what your tool is going to do, and you have to be as precise as possible when you're writing it. Finally, we also have descriptions for the arguments. Inside the tool, we just have the code for observability, and then we return the implementation of the tool. So this is it. This is how you define a tool, and there are no complications here. But let's go into a little more detail about the tool itself and how it works. In the first step of the deep research tool, we're just validating all of our paths. We check whether we are in the current directory or not, and we ensure that the memory folder, where we write all our results, exists. We do all of this validation here, and then we run a grounded search on the query. Once we get the result, we return the status, whether it was successful or not, the query that we sent, and the answer that we got. So we get a detailed response, and this has been very helpful for our team to debug and go through all of the data so that we can refine the research results. Now, going into more detail about how we are running the research.
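A minimal sketch of the tool body described above: validate the working directory, make sure the .memory folder exists, run one search, and return a status dictionary. This is not the repository's code. The `fake_grounded_search` stub, the file naming, and the returned keys are assumptions for illustration, and the FastMCP registration is left out so the example runs with the standard library only.

```python
from pathlib import Path
from typing import Callable
import tempfile

MEMORY_DIR_NAME = ".memory"  # folder name mentioned in the talk


def fake_grounded_search(query: str) -> dict:
    """Hypothetical stand-in for the real grounded-search call."""
    return {"answer": f"Stub answer for: {query}", "sources": []}


def deep_research(
    working_dir: str,
    query: str,
    search: Callable[[str], dict] = fake_grounded_search,
) -> dict:
    """Validate paths, run one search, persist it, and return a status dict."""
    root = Path(working_dir)
    if not root.is_dir():
        return {
            "status": "error",
            "query": query,
            "message": f"working directory not found: {working_dir}",
        }

    # Create the memory folder on first use so later tools can read from it.
    memory_dir = root / MEMORY_DIR_NAME
    memory_dir.mkdir(exist_ok=True)

    result = search(query)

    # One file per query keeps every intermediate result easy to inspect.
    index = len(list(memory_dir.glob("research_*.md"))) + 1
    out_path = memory_dir / f"research_{index:03d}.md"
    lines = [f"Query: {query}", "", result["answer"], "", "Sources:"]
    lines += [f"- {source}" for source in result["sources"]]
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    return {
        "status": "success",
        "query": query,
        "answer": result["answer"],
        "sources": result["sources"],
        "saved_to": str(out_path),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(deep_research(tmp, "What are agent skills?")["status"])
```

Returning the query and answer alongside the status, rather than a bare success flag, is what makes the response useful for debugging, as the talk points out.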
If you look at the run grounded search function, we have a prompt that we've already written. If you go here, you can see the research prompt. We're basically saying: provide a detailed, comprehensive answer to the question, focus on official, authoritative references, include as many relevant details as possible, and make sure to cite your sources clearly. So this is the prompt that we're using for the deep research tool. Then we're just calling the Gemini API here, and we're getting the answer text and the sources. Every time you get sources from the Gemini API, they are not in a structured form, so in this part of the code we're just trying to structure them in a better way. When I run it, you'll see that we get the output in this structure, with the URL, the title, and the snippets. And finally, we return the research results. So those are all the details about our first tool. Moving on to our second tool, which is the YouTube video analysis. It has the same structure as the deep research tool that I showed you, so three things. The first is the name of the tool. The second is the arguments, which in this case are the working directory and the YouTube URL. Then come the description of what the tool does and the descriptions of the arguments. And then we just return the execution of the tool. Going into more detail about what the analyze YouTube video tool looks like, here is the implementation. Similar to what we did for the deep research tool, we are validating all of the paths here.
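The source-structuring step mentioned above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: it takes plain dictionaries shaped like search grounding data (a list of web chunks plus a list of text segments that point back at chunk indices) and turns them into url, title, and snippets records. The input field names are assumed for the example.

```python
def structure_sources(chunks: list[dict], supports: list[dict]) -> list[dict]:
    """Turn loosely structured grounding data into url/title/snippets records."""
    sources = [
        {
            "url": chunk.get("web", {}).get("uri", ""),
            "title": chunk.get("web", {}).get("title", ""),
            "snippets": [],
        }
        for chunk in chunks
    ]

    # Attach each supporting text segment to every source it references.
    for support in supports:
        text = support.get("segment", {}).get("text", "").strip()
        if not text:
            continue
        for index in support.get("grounding_chunk_indices", []):
            in_range = 0 <= index < len(sources)
            if in_range and text not in sources[index]["snippets"]:
                sources[index]["snippets"].append(text)

    return sources


if __name__ == "__main__":
    demo_chunks = [{"web": {"uri": "https://example.com/a", "title": "Page A"}}]
    demo_supports = [
        {"segment": {"text": "Skills load lazily."}, "grounding_chunk_indices": [0]}
    ]
    print(structure_sources(demo_chunks, demo_supports))
```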
Then we extract the YouTube video ID using this get video ID function. If we do not get a video ID from the YouTube URL that was provided, we try to sanitize the input and extract the video ID again. And this is the line where we call the Gemini API. Similar to the deep research tool, we return all of these different details so that we can keep track of things and see the output that we got. Now let's go into the analyze YouTube video implementation. Similar to the deep research tool, we have a prompt for the YouTube transcription, and here is the prompt that we have written. It basically guides the Gemini API to give us the transcript in a certain format, with instructions about the details that we do and do not want in the output. For this part, we divide the request that we send to the Gemini API into two parts. In the first part, we send the YouTube URL as a file URI. In the second part, we send the prompt that I showed you. And this is a very cool thing: Gemini is a multimodal model, and when you send a YouTube URL as a file URI, it actually sees the video. It goes through it part by part; it doesn't access any sort of transcript. That's why, when we run the YouTube video analyzer tool, you'll see that it takes two to three minutes, because it is actually going through your entire video. So it is really multimodal. Here we send the request to Gemini, and once we get the output, we clean it into a certain format.
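As a rough sketch of the two steps above, the code below extracts a video ID from common YouTube URL shapes and then builds the two-part request, one part carrying the video URL as a file URI and one carrying the prompt. The function bodies are assumptions written for this example. In particular, the dictionary keys in `build_request_parts` only illustrate the split described in the talk and are not claimed to match the SDK types the repository uses.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

_VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def get_video_id(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if the URL is not recognized."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    candidate = None

    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            parts = [p for p in parsed.path.split("/") if p]
            if len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
                candidate = parts[1]

    if candidate and _VIDEO_ID.match(candidate):
        return candidate
    return None


def build_request_parts(youtube_url: str, prompt: str) -> list[dict]:
    """Two parts: the video as a file URI, then the transcription prompt."""
    return [
        {"file_data": {"file_uri": youtube_url}},  # illustrative key names
        {"text": prompt},
    ]


if __name__ == "__main__":
    print(get_video_id("https://youtu.be/dQw4w9WgXcQ"))
```

Note that a URL without a scheme is not recognized by this sketch, which is one reason a sanitizing pass before extraction can be useful.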
Then we output it and save it in the .memory folder. Finally, coming to the last tool that we have, which is the compile research tool. What it basically does is compile and return all of the results that we got from the previous two tools. If you go here and look at the implementation, you'll see that it writes to the output research.md file and then gives us a success status or a return message. So we have seen how we registered the tools in the MCP server. Similarly, we have the code for registering the resources. Here are all the things that we are sharing with the agent as resources: the name of the server, the version that we are using, which Gemini model we are using, and which model we use for YouTube transcription. All of these details are provided to the agent through resources. And then the MCP prompt. This is how we have registered it: we are returning something called workflow instructions, and this is the detailed prompt that we have written. In this prompt, we are telling the agent which tools are available, the workflow it can follow to use them, and some additional details such as the working directory and where all of the results should be stored. So this is the detailed prompt that we are giving to the agent. Awesome. One more thing I wanted to highlight here is our .env file. The settings are all accessed as resources, but this is where you would have to add your Google API key. You can also add the Opik API key if you want observability. But this is the .env file that we have.
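The compile step described above can be sketched like this: read whatever the earlier tools saved in the .memory folder, concatenate it, write research.md, and return a status message. The file layout (one Markdown file per result), the section headings, and the returned keys are assumptions for the example and may differ from the repository.

```python
from pathlib import Path
import tempfile


def compile_research(working_dir: str) -> dict:
    """Merge every saved result in .memory into a single research.md file."""
    root = Path(working_dir)
    memory_dir = root / ".memory"
    if not memory_dir.is_dir():
        return {"status": "error", "message": "no .memory folder found"}

    sections = []
    # Sorted order keeps the compiled report stable between runs.
    for path in sorted(memory_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        if body:
            sections.append(f"## {path.stem}\n\n{body}")

    if not sections:
        return {"status": "error", "message": "nothing to compile"}

    output = root / "research.md"
    output.write_text("# Research\n\n" + "\n\n".join(sections) + "\n", encoding="utf-8")
    return {"status": "success", "output_file": str(output), "sections": len(sections)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = Path(tmp) / ".memory"
        memory.mkdir()
        (memory / "research_001.md").write_text("Some findings.", encoding="utf-8")
        print(compile_research(tmp))
```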
Apart from that, we have the scripts here, which help us with the code throughout, and these are scripts we have written additionally to manage some of the tutorials. Going back: now we have the entire architecture, so how exactly do we connect it with our agent harness, which in our case is Claude Code? This is very straightforward for Claude. You just need to create an .mcp.json file in the project root and give it a command like this. In our case, we are running the MCP server locally. We're not hosting it anywhere, and it's using standard in and standard out for communication. So this works, but in case your MCP server is hosted elsewhere, you would need a URL. Notion is a very popular example: it has its own hosted MCP server, and you can access it by getting the URL. So this part of the config would change slightly: instead of the uv command, you would have that URL here. And you can connect to any MCP server that is available, Snowflake, Notion, anything that you want. One more thing that I would like to highlight is that, as I said at the beginning of my talk, we're using Claude Code as the agent harness, but you can use anything that you want. If you like Copilot, you can use it for this; any agent harness that you want to use can work with the MCP server. Okay, so let's see where that file is. You have to create that .mcp.json file in the root directory. I can explain it to you; it's a very simple thing. Here we have the name of the MCP server, which is deep research.
And this is the command that we have, which is uv. The virtual environment that you're creating and any packages that you're downloading are all handled by uv, so you don't have to do anything. Then we have the arguments, which basically run the server file that I showed you a couple of minutes ago. Claude runs the server file as a subprocess locally, so our server is running locally, and this entire command is only doing that. It says run, fastmcp, run, and the path to the server file. And this is the environment file that I told you that you're supposed to create. Okay, now it's time to see how the code works, so I'll go to Claude. If you go to Claude and run /mcp, you can see that I have two servers here. Because of the .mcp.json file that I have, it was automatically able to detect both servers. We'll just be looking at the deep research server right now. These are all the details: you can see the command, the arguments, and all of the capabilities that the server is exposing. You can just select view tools and see all of the details about the tools: the tool name, the description that I showed you, and all of the arguments and parameters. Similarly, for all of the other tools, you can see all of the details here. Now I'm going to try to run one of the tools and see how Claude does it. I've already written out this question, with a video from a previous talk about AI agents, so I'm just going to tell it to analyze this video for me. It should automatically be able to detect the tool that's being exposed by the MCP server.
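To make the shape of that config concrete, here is a small sketch that renders a project-level MCP config as JSON. It is a guess at the layout from the description above, not a copy of the repository's file: the server key `deep-research`, the `--env-file .env` flag placement, and the `SERVER_PATH` placeholder are all assumptions, and the real path to the server file is not given in the talk.

```python
import json

# Placeholder only: the actual server file path is not stated in the talk.
SERVER_PATH = "path/to/server.py"

config = {
    "mcpServers": {
        # Hypothetical server name used for illustration.
        "deep-research": {
            # uv creates the environment and installs packages on demand.
            "command": "uv",
            # Assumed argument order: run fastmcp, which runs the server file.
            "args": ["run", "--env-file", ".env", "fastmcp", "run", SERVER_PATH],
        }
    }
}


def render_mcp_config(cfg: dict) -> str:
    """Serialize the config the way it would appear in the JSON file."""
    return json.dumps(cfg, indent=2)


if __name__ == "__main__":
    print(render_mcp_config(config))
```

Because the harness launches this command as a local subprocess, communication happens over standard input and output, so no URL or port appears in the entry.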
You can see here that it has automatically picked up the tool that we are exposing, and it is trying to use it. On the side, you can see the .memory folder that I told you about, where we are tracking all of our results. It is being created, and the tool is going to write the transcript to a file there. As I told you, Gemini is watching, or analyzing, this video in real time. That's why it's going to take a couple of minutes to produce the transcript. So basically, every time it gets a question, Claude Code looks at what it can do and which tools are available. Then it sends a request to the MCP server: can you please execute this tool? Once the MCP server executes the tool, it sends the results back to Claude Code, which then analyzes them further. So we can wait a minute for that. Yeah, so it has created this Markdown file. It says that this video is from the AI Engineer Code Summit, and it has provided me with all of the details. It is a pretty good, detailed transcript that we can see here, and it also gives its own input at the end. I also really like that, while we are storing the whole transcript, it gives you a really neat summary at the bottom with the key ideas that were shared in the video. And it says, okay, the transcript is here. So this is a short example of how we can run one of the tools and see whether Claude is able to use it or not. Now I will try to run the entire pipeline. I've already written a question for that, so I'm telling it to do an end-to-end research.
So now I'm trying to use both the deep research tool and the YouTube video tool, and trying to see how it runs them together. It says that it is trying to do the analyze YouTube video part. When you have a big question like this, the way the entire workflow works is that Claude breaks down the question into different parts and then sees which tool it needs. It has an entire reasoning process going on: okay, these are the tools that I have available; I need the analyze YouTube video tool for analyzing the YouTube video, and I need the deep research tool for the first part of the question. So it's going to divide it into two parts and then try to call both tools. Okay, let's see. Yeah, sorry, it usually takes some time. Sure. I'm actually going to talk about that; that's the next part. I was trying to remove the skill initially when I moved it to the temporary folder, so that it doesn't pick up the skill, because it tends to do that as well. Yeah, not about that, just the prompt part though. No, for us, we're using the skill and replacing the prompt part. I'll show it to you in just a minute; I think that might clarify your question. So here it analyzed the YouTube video again; it didn't give me the entire answer for the first part. Just give me one second. I have to clear Claude's context because it tends to use whatever I asked it previously, and I would also like to delete the memory folder. So while this is running and generating research for us, I could talk about the next slide that we have, which is agent skills.
I'm sure all of you must have heard a lot about agent skills; they are very popular right now. A skill is basically a compact, concise way of telling an agent about capabilities and workflows. Basically, tools do the work, and skills tell the agent how to do it. We talked about the workflow prompt that I showed you, where we were telling our agent which tools are available and how it should execute them. Instead of writing that in the prompt, we can create a skill out of it. What is the benefit of doing that? When one thing is already solving the problem, why should we create another one? The reason is that skills have something called progressive disclosure. I am going to show it to you in a minute, but when Claude just lists a skill, it doesn't load the entire information. It only loads the name and the description of the skill. Once you run a query, it loads the entire thing, and once that query is executed, it wipes everything off, so it is not going to stay in the context. Whereas when you are using a prompt, everything is going to be there in the context. It's going to clutter the context, and you would want to avoid that. So skills are a very clean way of doing this. They are also very shareable: we internally write a lot of skills for our team and share them with each other. So it's a concise, shareable way, and it's more maintainable because you can check it into GitHub. It's your go-to place, and you can do the entire thing there. So, okay.
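Progressive disclosure, as described above, can be illustrated with a toy loader: listing a skill reads only its name and description from a short header, and the full body is read only when the skill is actually used. This is a simplified model for intuition, not how any particular agent harness implements it. The sample skill text, the `---` header delimiters, and the function names are assumptions.

```python
from dataclasses import dataclass

SKILL_TEXT = """---
name: research
description: Run web research and YouTube analysis, then compile a cited report.
---
Full workflow instructions would go here.
"""


@dataclass
class SkillSummary:
    name: str
    description: str


def split_skill(text: str) -> tuple[dict, str]:
    """Split a skill file into (header fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def list_skill(text: str) -> SkillSummary:
    """Cheap step: only the name and description enter the context."""
    meta, _ = split_skill(text)
    return SkillSummary(meta.get("name", ""), meta.get("description", ""))


def load_skill_body(text: str) -> str:
    """Expensive step: the full instructions, loaded only for a matching query."""
    return split_skill(text)[1]


if __name__ == "__main__":
    print(list_skill(SKILL_TEXT))
    print(load_skill_body(SKILL_TEXT))
```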
We have the results for this, but before that, I would like to show you how we have written the skills. We've written the research skill, which replaces the prompt that we have. Here you have to write the name of the skill, and you need to write the description. The description is necessary because when your agent is trying to match a query to a skill, it is going to look at this part of the skill. So you need to have a good description here. This part is called the front matter, which is what gets loaded first, and then this is the rest of the skill. In our case, we are reusing the research workflow prompt that we had instead of writing all of the details in here. You can do that, but it's a design choice for us. And this is the rest of the body of the skill. Okay, so how can you see all of those skills, and what do I mean when I say that it only loads the front matter? If you go to Claude again and run the skills command, you'll be able to see them. These are some of the skills that I've already downloaded, and these are the project skills that we have created, particularly the research skill that I was talking about. Every time you do this, only the front matter is loaded, not the entire skill. If you want to use a skill, you can just invoke it like this, or Claude can automatically pick up the skill when you ask it a question. Now I'm going to ask it the same question I was asking with the MCP prompt: research what AI agent skills are and also analyze the YouTube video for me. Yeah, so it is loading the research workflow prompt, and it is able to do that because our server is running locally. That's why it's able to read the prompt: they're all in the same folder.
And now it has started the deep research, as you can see here. First it ran the analyze YouTube video tool, and now it is running the deep research tool. One amazing thing that I would like to highlight is that my initial query was, can you give me details about what agent skills are? Every time, our agent harness identifies the gaps in the output that it gets. So for my initial question, it is going to identify all of the gaps that exist, and then it's going to ask a different question each time to fill those gaps. It is doing all of the reasoning behind the scenes and making sure that all of those gaps are filled in the research that we're trying to do. It usually takes around two to three minutes for me to run this entire thing, so let's see. Sure. Yeah, so we built it originally with the MCP portion, and now we are transitioning, but we want to keep our original code. Not the server part, but the prompt that we have, essentially, can be replaced with skills. But we have a lot of detail and different components within the server, and that's the reason we wanted to maintain it. Yeah. It's more about the complexity in that case; that's why we want to maintain it. So here we can see that we've got the output for the video, and now we're just going to wait for it to produce the research output. Here you can see my initial question, and it has given me the answer for that. This is the first run that we've had, and now it's doing more research on that.
It gave us an answer, but it identified some gaps in it. So this is the second query it ran, and it is a different question, as you can see here. It ran two queries, and it is like, okay, this is good enough. Then it says all three research tasks are completed, let me run one more targeted query and then complete everything. That's why we wanted all of the reasoning to be done by the agent, because this is what a human would do. If you're reading an article or doing any sort of research, you try to identify the gaps and fill them. That's what it is trying to do, and that's why we needed the brain and are trying to utilize it. So this is the third query that it is running for me, and then finally we expect it to create a compiled report, which should come any minute. Sure. I think that is internal; we would have to look at the observability for that to check and identify why it is not seeing the latest code. But we have also given it a video from 2020, so I think maybe it's because of that. The video that I asked it to transcribe is about that. But to answer that better, I would go into the observability part of the code that we have and see what is going on there. So finally, it's asking me if it can create a Markdown file for me. Here is the research.md file that our final tool creates. It is really detailed: it has comparison tables and all of the details. I think this is a really nice way to compile everything that we have gotten so far. It also adds the YouTube video transcript that we have at the bottom. So, yeah. Thank you so much, everyone.
Now I'll hand it over to Paul. Awesome. Thank you, Louis and Samridhi, for those great slides. Just give me a second to set up everything over here. Okay. So now we are moving to part three of this whole system, which is the LinkedIn writing workflow. It basically transforms the research from the deep research agent into polished posts that pass slop detectors, so we can transform all of you into LinkedIn influencers. Okay, so as a high-level overview, this is our full system architecture. We already implemented the research agent, which outputs the research.md file. That file is an input to the writing workflow, which on top of that takes the guideline.md file as input, which is basically a fancy way of saying the user input. This is how we model the user input to guide the generation process, and the final output will be the LinkedIn posts. Now let's zoom into the architecture, which is split into three big pieces. The first one is where we build up the context, and we ultimately end up with a big system prompt. Because remember, this is not an agent; this is just a workflow. In reality, to write a piece of content, whether it's a LinkedIn post, a video, or an article, you don't really need agents. As Louis said in the beginning, this process is very static: you always go through the same steps, so an agent would just make everything a lot more complicated. Of course, you can put agents in some parts of it, but we won't show that in this workshop. So the first part is to load the context, which is, as I said, the guideline.md file, which is the user input, and the research. Then we have some static files, the profiles, which I will dig into in more detail a bit later. We take all of these, build a system prompt, and then pass everything to the LLM. So this is phase two.
So far, nothing fancy: we basically create this huge system prompt and call the LLM to create the first draft of the post. The last phase is to apply the evaluator-optimizer loop, which is probably the most interesting part of this, where we let the LLM review and edit the post in a loop a couple of times. This is super important because you can apply it not only to LinkedIn posts but to any type of content: video transcripts, financial reports, medical reports, articles, book chapters, or very detailed article lessons like the ones in our course, where we needed to follow text snippets, images, code snippets, references, and all these little details that the LLM sometimes gets right but most often doesn't. And even if it gets them right 80% of the time, it's annoying to go and fix the rest manually. That's the whole point: all of this is automated. We'll dig into this process in more depth later on. This is just one of our examples, the beginning of a LinkedIn post that we asked it to generate. Just for fun, I want to copy this post, the exact same post, and put it into this pretty popular slop score detector, so you can trust me that this actually works, and hit analyze. And yay, not slop. This is actually less sloppy than a human, as you can see over there at the bottom. Of course, not all the posts are this free of slop, but most of them are really good and sound human. Remember, this is LinkedIn language, so we need to sound a bit like LinkedIn. But ultimately, as you can see, it doesn't have any em dashes, weird adverbs or verbs, and things like this. It reads very nicely, and the structure is skimmable, as we would like for LinkedIn.
You can check the full post at this link in the GitHub repository and also try the slop test yourself, if you want, using that link. Okay, so now let's start to dig deeper into the first phase of this workflow, where our core focus is understanding how we can actually control the generation. As we said in the beginning, we don't want to just write, hey, I want a LinkedIn post on topic X. We want to put our own ideas, values, and thoughts into that post so that it actually sounds like us. So the first trick is to structure this guideline, which is the user input, and structure it more than just asking for a post on X. This guides what we want to write. We created a template for this, where we fill in things such as what topic we want to address, what angle we want, some key points, the narrative flow, and so on. This is the only piece that changes from post to post. So out of the whole system, this is what we need to write ourselves, as humans, as input to get a different post. And as I said, this is dynamic and changes. You can see an example here from the repo, along with many others, but I will show them a bit later. The second trick is to add these writing profiles, which basically tell the LLM how to write the post. In reality, you don't want the LLM to just do its own thing; you want to guide it. The profiles are static: with the guideline, you define what you want to write, and with the profiles, you define how you want to write it. They are basically the styling layer on top.
We mainly created three profiles, which are themselves Markdown files. The first is the structure profile, where we define things such as how many characters a LinkedIn post has on average and the core structure of a LinkedIn post: the hook, the body, the call to action. Then there is the terminology profile, where we define things such as the active voice and the AI slop words and expressions that we want to ban. And unfortunately, this is all you can do. You just need to keep track of a huge list of "delve", "tapestry", "vibrant", and all those words that we love, and just kindly ask the LLM not to use them, until you start using them as a human as well. And the last one is the character profile, which is kind of static, but you can also configure it. For example, in our writing workflow, we configured it with my own biography, right? So it knows a bit about me, how I tend to write, my style, and things like this. This is how you can add a bit of personality to it. So again, we put all of this into the system prompt, plus the guideline that we defined before. These are static, so as I said, we define them just once, and you can access them under writing profiles and look around. They are, I don't know, Markdown files of a few hundred lines each, right? The last trick is to add few-shot examples. This is nothing fancy, but in reality the hard part is actually getting them, right? This falls under the data collection part of your system. In our particular use case, we added three LinkedIn posts from my own writing. The key idea is to use high-quality, representative few-shot examples and make them as diverse as possible in things such as topic, length, structure, and so on. One question I think many people are curious about is: why three LinkedIn posts? I know three is a magic number, but in reality, I just guessed it.
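To make the three tricks concrete, here is a minimal sketch of how the guideline, the static profiles, and the few-shot examples could be combined into one system prompt. This is an illustration, not the code from the repository: the `WritingContext` class, the `build_system_prompt` function, and the XML-style section tags are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class WritingContext:
    """Inputs to the writer. All names here are hypothetical."""

    guideline: str            # dynamic: the only piece that changes per post
    profiles: dict[str, str]  # static: profile name -> Markdown text
    examples: list[str] = field(default_factory=list)  # few-shot posts


def build_system_prompt(ctx: WritingContext) -> str:
    """Concatenate profiles, few-shot examples, and the guideline."""
    parts = ["You write LinkedIn posts. Follow every profile below."]
    # Static styling layer: how to write.
    for name, text in ctx.profiles.items():
        parts.append(f"<{name}_profile>\n{text.strip()}\n</{name}_profile>")
    # Few-shot examples: keep this list as short as still works.
    for i, post in enumerate(ctx.examples, start=1):
        parts.append(f"<example_{i}>\n{post.strip()}\n</example_{i}>")
    # Dynamic user input: what to write.
    parts.append(f"<guideline>\n{ctx.guideline.strip()}\n</guideline>")
    return "\n\n".join(parts)
```

Because the profiles and examples are sent on every call, the size of this prompt is what drives cost and latency, which is why the number of examples is worth keeping small.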
And the thing is that with few-shot examples, because you always pass them in your system prompt, you want the lowest number possible that gets the job done. People usually say three, five, ten, twenty few-shot examples, but you usually want to start with a bigger number and trim it down as much as possible while it still works. And when it stops working, you put the last one back in. You want to keep your system prompt with the few-shot examples as small as possible because you always pass it to the LLM, which translates to more cost, more latency, and even degraded performance as the context grows. For that, I made a dataset available in the GitHub repository, where I extracted about twenty random LinkedIn posts from my profile that got more traction. I used that as a dataset for this writing workflow and, further down the line, to configure other things. One thing to highlight: probably the only way to reduce the load that few-shot examples put on the system prompt is fine-tuning, but that's often overkill and adds a lot of friction to your tech stack. So the last part of the system is the evaluator-optimizer pattern, which is probably the fanciest part of all of this. Basically, it contains two LLM calls, and this is really important: it contains two different context windows, the writer and the reviewer. The writer first writes the draft, and the reviewer, with a completely different context window, takes that draft and reviews it. This is super important to avoid bias, because LLMs tend to be biased toward liking what they have already written. Putting the draft into a completely new context window can remove that bias.
So basically how this works: the reviewer checks the adherence of the draft to the guideline, which is the user input; to the research, basically to remove hallucinations; and to the profiles, those structure, terminology, and character profiles. So we ensure that it adheres to them. Then the editor, which is often the writer itself, applies the reviews. In our example, we run this loop three or four times to write the post, review it, edit it, optimize it, and so on. So it's similar to a fine-tuning optimization loop. A few tricks that we observed over time: just keeping all those versions in our working directory helps, because writing is subjective, and if you apply this reviewer too aggressively, you might just not like the result. So you want to look through the last versions and pick the one that you like the most. And again, because writing is subjective: usually the evaluator-optimizer pattern works with a score. The reviewer gives a score to the input and you loop until you reach a specific threshold. But because creative work is subjective, that threshold is very hard to quantify, and the loop becomes very noisy and unreliable. We realized that it's easier to just set a fixed number of iterations and let the user run these iterations again manually if they want to. Now let's dig a bit deeper into how the reviewer works. As input, it has the current state of the post. Next, as context, it gets the guideline, the research, and the profiles, which are basically the rules the writer needs to follow. And the output is actually a list of Pydantic objects. I think this is the most interesting part here: you actually constrain the LLM to output a list of structured objects. And this is powerful when you give a Pydantic object to an LLM.
Pydantic objects have this property where you can put a Field on each attribute and actually explain what that field means. So basically this is a prompt engineering technique, which guides the LLM a lot better into understanding what each attribute of the returned object needs to contain. For example, our review model has profile, location, and comment attributes. An example output is: we violated the terminology profile in paragraph two, where the LLM used the banned term "leverage". Like this, the editor, when it gets these reviews, will understand exactly what was violated. From our tests, we realized that using structured output performs a lot better than just letting the LLM, I don't know, spit out whatever it wants. Again, you can check this code within this Python file. The code itself is pretty simple, so unfortunately I won't have time to dig too deep into it. Now, on the editor part: basically, it gets this list of Pydantic objects and the current state of the post. It also gets all the context that the writer initially had, because the editor is actually the writer itself, right? This is similar to how a team of writers and editors works. I write something, I give it to a team of editors, I get the reviews, and I apply them. Another important thing is that reviews are not created equal, right? People usually like to give reviews on everything, and this is also true for LLMs. Because we loop a couple of times, we realized that it's super important to put a priority on them. We always want to prioritize the guideline first, which is the user input, then the research, and then the profiles. This is super powerful when those reviews clash on the same paragraphs or the same sentences: the priorities let the LLM know which ones to pick and apply.
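As a rough sketch of the reviewer output described above, here is a Pydantic model with `profile`, `location`, and `comment` attributes, each documented through `Field(description=...)`, plus a helper that orders reviews by priority. The attribute names come from the talk; the allowed `profile` values, the description texts, the `PRIORITY` table, and the `prioritize` helper are assumptions made for illustration and may differ from the repository.

```python
from typing import Literal

from pydantic import BaseModel, Field

# Assumed priority table. Lower number = applied first:
# guideline, then research, then the three writing profiles.
PRIORITY = {
    "guideline": 0,
    "research": 1,
    "structure": 2,
    "terminology": 2,
    "character": 2,
}


class Review(BaseModel):
    """One issue the reviewer found in the current draft."""

    profile: Literal[
        "guideline", "research", "structure", "terminology", "character"
    ] = Field(description="Which source of rules the draft violated.")
    location: str = Field(
        description="Where the issue is, for example 'paragraph 2'."
    )
    comment: str = Field(
        description="What is wrong and how to fix it, in one or two sentences."
    )


def prioritize(reviews: list[Review]) -> list[Review]:
    """Sort reviews so higher-priority sources come first.

    sorted() is stable, so reviews with equal priority keep their order.
    """
    return sorted(reviews, key=lambda r: PRIORITY[r.profile])
```

The field descriptions end up in the schema that a structured-output call hands to the model, which is what makes them work as a prompt engineering technique rather than just as documentation.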
Okay, again, you can check the system prompt within this Python file in the repo. And here is a concrete example where we can see post v0, which is basically what the LLM spits out before the evaluator-optimizer loop. And this is how it looks after four review iterations. We can see that the text is nicely formatted for LinkedIn, and the first sentence, which is the most important one, is a lot punchier. This is valid for the first part of the post and also for the second part. Again, it modified the structure, the wording, everything, to look more like something you would expect on LinkedIn. You can check the whole post and all the versions from zero to four within the examples directory. And now let me actually show you how this directory looks. To test this code, you actually have three levels. I created a simple make command just to check that everything works, so nothing crazy here. If this works end to end, it means your code works locally. Then you can pass that to Claude Code. Similar to how the research agent works, we serve everything through an MCP server. Then we connected this MCP server with the write post skill. The thing is that this takes three or four minutes to run, so I already ran it to avoid wasting your time. But here is how you can run it. You take this write post skill and ask the LLM; we take, for example, the same example that I used before. We copy the relative path and ask it to use the guideline from this directory. Then it will know to pick up the guideline and the research associated with it and write a post. And the output is here.
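The versions v0 through v4 come from a fixed number of review and edit rounds rather than a score threshold. A minimal sketch of such a loop is below; the writer, reviewer, and editor are passed in as plain callables standing in for the real LLM calls, and the `run_loop` name and the `post_v{i}.md` file names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional

Writer = Callable[[str], str]                   # context -> first draft
Reviewer = Callable[[str, str], list[str]]      # (draft, context) -> reviews
Editor = Callable[[str, list[str], str], str]   # (draft, reviews, context) -> new draft


def run_loop(
    context: str,
    write: Writer,
    review: Reviewer,
    edit: Editor,
    iterations: int = 4,
    out_dir: Optional[Path] = None,
) -> list[str]:
    """Run a fixed number of review/edit rounds and keep every version."""
    versions = [write(context)]  # v0: the draft before any review
    for _ in range(iterations):
        # The reviewer only sees the latest draft plus the rules,
        # never the writer's own conversation history.
        reviews = review(versions[-1], context)
        versions.append(edit(versions[-1], reviews, context))
    if out_dir is not None:
        out_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(versions):
            (out_dir / f"post_v{i}.md").write_text(text, encoding="utf-8")
    return versions
```

Keeping every version on disk matters here because the last iteration is not guaranteed to be the one you like best.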
So we see we have the four versions of the post, actually five versions of the post, and also the guideline, which I think is the most interesting one. As you can see, instead of just kindly asking the LLM to write about something, you need to be very explicit about what you want from it. You need to put in an angle, the target audience, the key points that you actually want to cover, the tone, and things like this. And there is a character-count constraint that overrides the other profiles, which are static. We encoded in the system prompt that, within this guideline, you can override the default profiles if you want something different. Here you can go wild and input anything you want, but I just wanted to highlight that you actually need to put in some effort to think it through. You cannot just fully automate everything if you don't want to sound like pure slop. On top of this, we have a small image generator. We haven't put a lot of effort into it, but this prompt generates the post using the writing workflow and then asks another tool to automatically generate an image for that post. So now, to move on. Just to wrap up what we built in the writing workflow: we built this writing pipeline that generates, reviews, and edits the post, which again you can apply to any type of content. We tested it with almost every type of content and it works really well. You just need to adapt the profiles, the examples, and things like this. Then we learned how to control the generation: the guideline, the profiles, and the few-shot examples. Again, for a different content type, you will need to adapt these. And then how to serve everything as an MCP server and skills. And yes, for this local example, you could do everything through skills. You don't really need MCP servers to make it work. But from my experience, MCP servers help you distribute logic.
Because skills are like your local setup, your local hack setup that works great for you. But if you want to distribute business logic, you don't really want to ask someone, hey, download those skills, install these uv dependencies over there, plug in those CLI tools, oh, and you also need those credentials. It quickly becomes a mess to distribute this at scale. An MCP server solves that problem, and skills can help you personalize how you use it. Basically, if you want a skill to be more than a prompt and actually run code, it works, but it can quickly become a mess to set up if it's too complicated. So now let's move on to part four of the system, which is observability, more exactly monitoring and evals. We will use the writing workflow as the example for evals, and we did monitoring for both. For monitoring, we will go really quickly, because theory-wise there's not much to say. The core problem behind monitoring, which in my opinion exploded when we started to use agents and workflows, is that debugging workflows and agents purely through the logs is hard. I personally can't really understand what's going on inside the terminal, especially when you see the agent's thinking, which hides a lot of stuff. So it quickly becomes painful to debug what's going on using just the logs. Basically, you need some tool that monitors this nicely and captures all your traces: all the LLM and tool calls, your inputs and outputs, your metadata, everything about your run, and also latency and cost tracking, right? As a bonus, it also stores your traces so you can build an AI evals layer on top of everything, which I'll dig into a bit later. So now let's actually look into our monitoring logs. We used Opik to monitor both the research agent and the writing workflow. And yeah, let's just look into it, because it's a lot easier to understand like this.
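To illustrate what "capturing a trace" means, here is a small standard-library sketch of a decorator that records the name, inputs, output, and latency of each call. It is not Opik's API; the `traced` decorator, the in-memory `TRACES` list, and the `generate_post` placeholder are all hypothetical stand-ins for what a monitoring platform does for you.

```python
import functools
import time
from typing import Any, Callable

# In-memory stand-in for a monitoring backend.
TRACES: list[dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Record input, output, and latency for every call of the wrapped function."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACES.append(
                {
                    "name": name,
                    "input": {"args": args, "kwargs": kwargs},
                    "output": output,
                    "latency_s": time.perf_counter() - start,
                }
            )
            return output

        return wrapper

    return decorator


@traced("generate_post")
def generate_post(guideline: str) -> str:
    # Placeholder for the real LLM call.
    return f"Draft about: {guideline}"
```

A real platform additionally records token usage and cost per model and groups related traces together, which is hard to reconstruct from terminal logs alone.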
So for monitoring, you usually have three big concepts. You have threads, which in our use case is the whole workflow of writing a post end to end, starting with the post itself plus the image. For example, as you can see, this workflow has 44 messages, which captures all the bouncing around between the user and the LLM. You can more intuitively see it as a conversation thread, which is why it's called a thread, you know. Then you have the traces, and here you can actually dig deeper into what's going on. A thread contains multiple traces, right? And within a trace, you can actually dig into what's going on. For example, for a generate post trace, here we have the high-level overview of the generate post call, where we can see how long it took to run, how many tokens it consumed, and how much it cost to run. And here we can see all the tool calls and LLM calls that happen under the hood, right? We can see the models used, the cost per model, the latency per model, and all of that. On the right, we can also see the high-level input and output of the system, some metadata, the token usage, and basically everything that happened within that run, which helps you a lot to quickly understand what's happening, right? We have something similar for the research agent as well. Here the thread is a lot easier to understand because, for example, we have this latest thread, which captures all those tool calls, basically all the deep research tool calls, the tool call that creates the final report, and all the bouncing around. And then we have the traces, which each capture just a single tool call, where as you can see we have just one LLM call per deep research tool call. Okay, so now let's move on to AI evals. I want to start with why evals matter, because it's not necessarily a cool topic, but it can do a lot for your system.
So basically, let's take our writing workflow example. Let's say that we want to check how well it works, right? We all start with vibes, with vibe checking. When we generate just one post, we read it and say, hey, it's cool, awesome, it works. Then we have 10 posts. We read them, maybe, maybe not; maybe we read two, maybe we read three, call it a day, and say it works for the others as well. So when we want to check how well it works on 100 posts, it quickly becomes impossible. And just imagine that you do this at the beginning, but as you evolve your system, you start to plug in more features, right? And every feature can break everything. Yeah, this is standard practice in software engineering, but here you work with prompts, and just one word somewhere can randomly break all your features if you're not careful, you know. So you need this layer on top of it. Basically, evals fit into three big layers. First, we have optimization, which is very similar to training a model. We have this AI evals layer, which lets us quantify how well our writing workflow writes LinkedIn posts, right? Then when we want to improve it, we have a score that tells us whether we are moving in the right direction or not. So on every change, we run this AI evals layer, and we know whether we did better or worse. The next layer is regression testing, which is similar to testing in classic software engineering: whenever we start working on a new feature, we ensure that we don't break current functionality. And layer three is to actually run this in production on live traces, to get warnings, errors, alarms, and all of that when users actually use your system. In reality, how this differs from normal unit, integration, and regression tests is that everything starts from the initial dataset, right? So this is a data problem; we actually build a model here.
So here is how we built it for our LinkedIn posts, where we were lucky enough to already have the data. I extracted 20 real posts from my LinkedIn. Why 20? I'd say that is a big enough number for a workshop, but in reality, you'll probably need to go to at least 100. Then I reverse engineered the guideline and the research. Basically, I derived the guideline from the post, and then I ran the deep research agent on top of the guideline to find whatever I needed to support that post. Then I generated the output: I put the guideline and the research in as inputs and generated new posts. This is super important because you actually want to see results from your real system. When you want to generate synthetic data of some sort, never ask the LLM to directly generate the output. Always ask it to help you generate the input, but never the output. You want the output from your real system, because that's what you're ultimately testing. Then I looked through the outputs and labeled each one with a binary pass or fail label and a two-to-three-sentence critique on why exactly I gave it that label. I usually stop when I find the first error. It's just easier, both for me while labeling and for the LLM to understand: whenever I see the first failure, I stop there, write the fail label and the two or three sentences, and move on. Finally, I split all of this into train, dev, and test splits, because remember that here we're building a machine learning model, so we need to treat it as such and use a classic split. Only when we have this dataset and these splits can we actually build the evaluator and evaluate the evaluator. For some reason, most people who build LLM judges think they can just skip this last step, which is probably the most important one.
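The labeled dataset and its split can be sketched in a few lines. The field names follow the description above (guideline, research, generated post, pass or fail label, critique), but the `LabeledSample` class, the `split_dataset` function, the split sizes, and the seed are assumptions, not the repository's actual schema.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSample:
    """One row of the eval dataset (hypothetical schema)."""

    guideline: str       # reverse engineered from a real post
    research: str        # produced by the research agent from the guideline
    generated_post: str  # output of the real writing workflow
    label: str           # "pass" or "fail", assigned by a human
    critique: str        # two or three sentences explaining the label


def split_dataset(
    samples: list[LabeledSample],
    n_dev: int = 5,
    n_test: int = 5,
    seed: int = 42,
) -> dict[str, list[LabeledSample]]:
    """Shuffle deterministically, then cut into train/dev/test.

    The train split feeds the judge's few-shot examples; dev is used to
    tune the judge; test is touched only for the final validation.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return {
        "dev": shuffled[:n_dev],
        "test": shuffled[n_dev : n_dev + n_test],
        "train": shuffled[n_dev + n_test :],
    }
```

A fixed seed keeps the splits stable between runs, so a sample used as a few-shot example never silently ends up in the dev or test split.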
And again, the most important and interesting thing here is the dataset that we created, because creating an LLM judge out of it, once you have the dataset, is very easy. Okay, so now let's see how this LLM judge works for our current scenario. As input, it takes the generated post that it needs to label, but also the profiles, research, and guideline. This is super important because the LLM judge needs to understand the context used to generate this post. Ultimately, it outputs the pass or fail label plus the critique, basically exactly the same labels as in our dataset. Then we pass our train split from the dataset as few-shot examples. This is the most important part, because the system prompt of the LLM judge can be extremely simple if it has the right few-shot examples in place to tell it what decisions to take. The few-shot examples look like this. They have the writer input, which is the guideline and research, so the input to the writing workflow. Then the writer output, which is the post generated by the workflow. And then the labels: in the few-shot examples, we also have the pass or fail label and the two-to-three-sentence critique about the label. And this is it. You can check out our system prompt within this file, but there's nothing fancy. The most important part of this is building the dataset. Our last step is to actually measure the judge's reliability, because ultimately the LLM judge here is just a binary classifier, right? We output pass and fail labels. We use an LLM judge because, in reality, it's easy, but we could have used any other model that gets the job done and can output these pass and fail labels. Ultimately, when we measure judge reliability, our final goal is to align the LLM judge with the domain expert.
And how will we do that? We will do that by testing it against the dev and test splits of the dataset. This is a process very similar to training any other binary classifier. There's nothing new here; just the term "LLM judge" is fancier. What we basically do is test the LLM judge against the dev and test splits and use the F1 score, which combines precision and recall, to measure how well the judge performs on these splits. The process looks like this. We first run the LLM judge on the dev split, compute the F1 score, and adjust the judge's prompt and examples to maximize this F1 score, because most probably when we first start this process the F1 score will be low, and we want to get it as high as possible. We repeat until we converge; usually you need to do this a couple of times. When you think you're done and the LLM judge is ready to go, you run it on the test split as the final validation step, right? So it's basically training a machine learning model. After this step, we have an LLM judge that is ready to run on new samples of data, for which it will output the binary pass or fail label and the critique. To wire together everything we had in this section: we first build the dataset. Based on that, we build the LLM judge. Then we calibrate the judge, and only after that can we run it on real data. Usually all these steps are managed by some observability platform. We used Opik here, but you can use any other platform or just build something in-house. It doesn't really matter, but the idea is that everything should be managed by a cohesive platform. Okay, so now, as a quick demo, let's emulate this calibration step as much as possible, because the LLM judge is already calibrated. So I will just show you how the judge performs on the eval splits.
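For reference, the F1 score used in this calibration loop is the harmonic mean of precision and recall, computed over the judge's labels versus the human labels. Below is a small standard-library sketch. Treating "pass" as the positive class, the `max_gap` threshold, and both function names are assumptions for illustration.

```python
def f1_score(expected: list[str], predicted: list[str], positive: str = "pass") -> float:
    """F1 of predicted labels against human labels for one positive class."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    fn = sum(1 for e, p in pairs if e == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def looks_overfitted(dev_f1: float, test_f1: float, max_gap: float = 0.1) -> bool:
    """Flag a judge whose test score falls well below its dev score.

    The 0.1 gap is an arbitrary illustrative threshold.
    """
    return dev_f1 - test_f1 > max_gap
```

For example, with human labels `["pass", "fail", "pass", "fail"]` and judge labels `["pass", "pass", "pass", "fail"]`, precision is 2/3 and recall is 1, so the F1 score is 0.8.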
So under the hood, with Opik, we built a very simple evaluation harness that allows us to run these calibration steps and then run the judge on production traces. In reality, the code for our use case is pretty straightforward. So, just to show you around a bit until it runs. Okay, so what it does now: it takes all the generated posts, right? It runs the LLM judge on top of all those generated posts. What I wanted to show you is this F1 score, where we compute an F1 score based on the outputs from the LLM judge against our labels from the dataset. I think it's more interesting to see how our dataset looks, really quickly, to get a sense of it. Here we have a single sample from our dataset, where what's more interesting is that we have a link to the media of the post, to the guideline, to the generated post, basically links to everything that we need as input and output for both the deep research agent and the writing workflow. Then, just by setting this scope, we plug them in as few-shot examples for the post generator or for the evaluator, or, for example, put them in the dev split, and things like this. This is how we make sure that we do the splits the right way between the judge and the generator. For simplicity, we have all those files linked together in this YAML file, here locally and within GitHub. So you can check everything, and plug, play, and run everything very easily with minimal configuration. Okay, so if we go back into the terminal, we saw that it had a perfect score. So now let's actually run this on the test split. The goal here is to see that our LLM judge did not overfit, right? We ran it on the dev split and got a perfect score. Usually this is a strong signal that it overfitted. So if the F1 score on the test split is not one or around one, it means that our LLM judge overfitted.
So again, it's basically the exact same steps as before, but now we run on the test split. Let's see the final F1 score, which is again one, which is good. The scores are this good because we do this on just five samples per split. If you expand your splits, which should have at least 20 or 30 samples, the score probably will not be perfect. Okay, and let me actually show how these experiments look inside. Yeah, sure. Yeah, I think the most important thing is the dynamics between the F1 score on your dev split and the F1 score on your test split, because whether the model is overfitted is really about how the two compare. For example, as I said, if it has a very high score on the dev split and a low score on the test split, it means it overfitted. But the score itself is usually very correlated with your own data. So for example, you are pleased with the exact... Yeah, exactly. Yeah, exactly. The absolute value is usually very correlated with your data. Usually on the dev split, you say, okay, this F1 score is good enough for me. And then you run on the test split to see that you have the same F1 score there as well. Okay, so. Again, with these observability platforms, you usually have this experiments tab where you can run experiments, similar to the old-school fine-tuning experiment trackers. For example, here we have the dev and test experiments. Let's open the test experiment that we just ran. There we can see the output from the binary LLM judge and our label, and we can see that there's a perfect match between the two. If we hover over this, we can also see the critique of the judge. But in reality, here we compute the score only between the labels.
This critique is more useful for us, because ultimately our goal is for us as humans to look over these results and understand what's going wrong with the system. As a final step, we also prepared an online simulation, where I simulated some online traces and the LLM judge runs on top of them. You get the binary labels and the critique. But the thing is that this takes five minutes or even more to run, because it actually needs to run for real. It needs to generate all the posts and run the LLM judge on top of them. So I'll just show you a result over here, which I bundled together as an experiment. This is the online test, and these are the results from the judge, where we find the score and the critique itself. Of course, this system is not yet perfect, and we probably need to refine it more, because here on the online traces, as you can see, it kind of failed, to be honest. It just gave us a pass on all the traces, while in reality I also have the labels for these simulated scenarios. Since that failed, I probably need to go back to the dev split and expand it with new samples. For example, I could just take these samples, put them in the dev split, and start refining it again and again until it actually works. I'm kind of running out of time, but you also have this skill to run an end-to-end research and writing demo. You can, for example, write a guideline about something that you want to write a post about and give that as input to this skill, which is just a file. It knows to pick it up, do the research, and then continue to writing the post. You can also observe that in Opik, with all the threads, traces, and all of that. And yeah, I guess the final step, if you haven't done so yet, is to actually open up the GitHub repository, run everything yourself, and read the code.
Without reading the code, you most probably won't really understand what's going on inside. You can do that by accessing this link or scanning the QR code. And whenever you're ready and want to go deeper into building this type of multi-agent system, we have this agentic engineering course, which basically was the inspiration for this workshop. There, our goal was to design and build production-ready agents, not some small local systems, within 34 lessons, with three end-to-end portfolio projects, a certificate, and a Discord community with access to us. So far it's rated five out of five by 300-plus students. And if you don't believe us, the first six lessons are free to try out. You can access it using this link or scanning the QR code. And that's it for the workshop today.