OpenRAG: An open-source stack for RAG — Phil Nash

While large context windows exist, Retrieval Augmented Generation (RAG) remains crucial for handling vast enterprise data and controlling costs, as dumping all information into an LLM is often impractical and expensive.
RAG is more complex than often assumed, with challenges in document processing, chunking strategies, evolving embeddings, and search techniques, requiring sophisticated pipelines to optimize results.
OpenRAG offers an open-source, agentic RAG stack combining Docling for robust document parsing, OpenSearch for powerful hybrid search, and Langflow for visual orchestration and customizable agent flows.

RAG Complexity: Despite claims that RAG is "solved," it's a hard problem due to diverse document types (especially PDFs), difficulties in optimizing chunking, rapidly evolving embedding models, and advanced search techniques like re-ranking and query re-writing.
OpenRAG Baseline: IBM's OpenRAG provides a high-quality, opinionated yet extensible open-source RAG stack designed to handle modern RAG requirements as a strong starting point.
Docling for Document Processing: OpenRAG leverages Docling, an open-source project, to parse various document types including challenging PDFs, HTML, and audio/video. It offers multiple pipelines (simple, standard with optional OCR, ASR, and VLM) that extract text and structure into a hierarchical representation called "DocTags."
OpenSearch for Hybrid Search: OpenRAG uses OpenSearch for powerful vector and keyword search capabilities, including sophisticated filtering and support for multiple embedding models during migration. It uses the JVector k-NN plugin for live indexing and memory-efficient scaling.
Agentic Retrieval: Instead of a traditional RAG approach where an LLM passively consumes top-K chunks, OpenRAG employs agentic retrieval via Langflow, empowering an agent to decide which searches to perform and how to use the results based on instructions and tools.
Langflow for Orchestration: Langflow serves as a visual drag-and-drop editor for AI flows, integrating Docling, OpenSearch, various embedding models, and LLMs (e.g., OpenAI, Anthropic, Ollama) to build and customize the ingestion, indexing, and generation pipelines.
Customizability and Integrations: OpenRAG allows extensive customization of models (LLMs, embeddings) and chunking strategies, and includes features like cloud connectors (Google Drive, SharePoint, OneDrive) for document syncing, knowledge filters for targeted searches, and an API for external application integration.
Offline and Air-Gapped Capabilities: OpenRAG, including Docling, can be run entirely offline with locally hosted models like those from Ollama, making it suitable for air-gapped environments.

RAG — Retrieval Augmented Generation; an AI technique that improves the accuracy and relevance of generated responses by retrieving relevant information from a knowledge base before generating an answer.
Context Window — The maximum amount of text (tokens) that an LLM can process or "see" at once, influencing how much external information can be directly provided.
Embeddings — Numerical representations of text (or other data) in a high-dimensional vector space, capturing semantic meaning to enable similarity searches.
Vector Database — A specialized database designed to efficiently store, manage, and query vector embeddings, typically for similarity searches.
Chunking — The process of breaking down large documents into smaller, manageable segments (chunks) to optimize them for embedding and retrieval in RAG systems.
Re-ranking — A post-retrieval step in RAG that uses a more sophisticated model (like a cross-encoder) to re-evaluate the relevance of initially retrieved documents, improving the quality of context provided to the LLM.
Query Re-writing — A technique where the initial user query is modified or expanded by an LLM or another component to improve the effectiveness of the subsequent search.
OCR — Optical Character Recognition; technology that converts different types of documents, such as scanned paper documents, PDFs or images captured by a digital camera, into editable and searchable data.
VLM — Vision Language Model; an AI model capable of processing and understanding both visual information (images, videos) and textual information.
Agentic Retrieval — An advanced RAG approach where an AI agent dynamically decides what information to search for and how to use the results, rather than simply taking a fixed set of retrieved documents.
Langflow — A visual programming tool that allows users to build and orchestrate AI flows and agents using a drag-and-drop interface.
Ollama — A platform that enables users to run open-source large language models (LLMs) and embeddings locally on their own machines.

Hi there, my name's Phil Nash and I'm a developer relations engineer at IBM. I've been working on tools around AI and RAG for the last couple of years and I've got something I'd like to show you today.
Now, first things first: I've heard that RAG is dead many a time, and I'm sure you have too. Context windows are huge these days, so you might as well just dump all of your information in there. I don't take this kind of thing very seriously. If every business has less than a million tokens' worth of data, then sure, RAG is dead, and probably so are all those businesses. Of course, not everyone is happy paying for a million input tokens every time they want to ask a question, either.
Instead of "RAG is dead", the other claim you hear is that RAG is solved. We think we understand the process and we can just apply RAG when we need to. Just gather up all your unstructured data, extract the text, chunk it up, embed it, and throw it into a vector database. Then when you want to ask your agent a question, you just embed that question, search the database, pick the top-k results and pass them to a model as context. It's just a footnote in context engineering these days.
But it turns out that RAG is actually hard, and it's hard for different reasons for different projects. PDFs are a pain. Chunking strategies are a hassle, and changing them and testing them is difficult. Embeddings keep improving, which is great for the industry but not so great when you've used something from six months or a year ago. There are new search techniques all the time, and further tweaks that you can add to your pipeline to improve the results, like adding summaries to chunks, performing chunk expansion, using a cross-encoder to re-rank results, query re-writing, and so much more. RAG is quite complex. In fact, everyone's documents are different. Every system will have different users, different questions, different interaction patterns, and different expectations.
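The "RAG is solved" recipe described above (embed the chunks, embed the question, take the top-k, pass them to a model as context) can be sketched in a few lines. This is a minimal illustration, not OpenRAG's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the sample chunks are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question and return the k most similar chunks."""
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Paste the retrieved chunks into the prompt as context."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"


# Hypothetical sample chunks, standing in for an indexed document store.
CHUNKS = [
    "Docling parses PDFs into structured documents.",
    "OpenSearch stores vectors and keywords for search.",
    "Langflow is a visual editor for AI flows.",
]

question = "Which tool parses PDFs?"
prompt = build_prompt(question, top_k(question, CHUNKS, k=1))
```

Everything the talk lists as hard (parsing, chunking, embedding choice, re-ranking) sits outside this sketch, which is roughly the point being made.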
While every RAG system will ultimately be different, there are definitely some core components that are required. When building a RAG system it's useful to have a high-quality baseline to build from. So that's what we've been working on at IBM. We brought together three existing open source projects to create a RAG stack that is powerful, easy to use, and easy to extend. The project is called OpenRAG, and it uses the open source Docling for document processing, OpenSearch for search indexing, and Langflow for visual orchestration and agents. OpenRAG is an open source project that you can try out today to build your own powerful, customizable, and easy-to-use RAG system. But I just want to break down the stack for you so that you understand the components, how they work together, and how they create a stack that is flexible enough for your modern RAG requirements.
Let's start by looking at the ingestion side of RAG, where it all begins: document processing. Ingesting PDFs, HTML, Word docs, slides, and more can be a pain. But the biggest pain of all is of course PDFs. Docling is an open source project that was built out of IBM Research in Zurich, and it processes and parses all sorts of documents, from HTML, Markdown, and Word documents through to slides and spreadsheets, audio and video, and even that enemy of all RAG systems, PDFs.
Docling has a number of different pipelines that handle different file types. This allows it to be flexible in the way it takes in documents and accurate in its output. So there is a simple pipeline that handles those mostly straightforward text documents like Markdown, HTML and Word; that just extracts the text, turns it into a hierarchy and outputs a document. For audio and video there is an ASR (automatic speech recognition) pipeline, and for PDFs there are two available pipelines. The standard pipeline has a number of small, focused models that do different things, like extracting text, tables, and images from PDFs.
You can even choose an OCR backend to read text, which is particularly useful for scanned documents that don't have actual text in them. So this collection of small models in a pipeline performs things like layout analysis, table extraction, image extraction, and image descriptions. This gives you a wide array of options to get the best out of those documents. There's also a VLM (vision language model) pipeline that uses the Granite Docling 258M vision model to extract all of that in one go. This is a newer pipeline, but it is simpler, as it's all in one model that's trained specifically for this task.
Docling extracts text and then produces an intermediate representation, a Docling document, which models the structure of a document in an XML-ish format called DocTags. Those DocTags can then be converted to a number of formats including Markdown, HTML, and JSON. Docling also has a chunker that uses the hierarchy generated by the parsers and built into those DocTags to produce hierarchically aware chunks of text.
Moving on to embeddings: OpenRAG actually isn't very prescriptive about embeddings at all. It supports a number of external providers, including OpenAI and watsonx.ai, and Ollama for locally hosted embeddings. In fact, the entirety of OpenRAG can be run offline using locally hosted models. Docling itself can be run offline, so it can run in air-gapped situations. It doesn't need those external services.
Once you have embedded those chunks, they are indexed in OpenSearch. OpenSearch is of course the open source fork of Elasticsearch, and it is a powerful database for performing vector search and keyword search, with highly configurable filtering and aggregation. Out of the box, OpenRAG uses OpenSearch for hybrid vector and keyword search and exposes that sophisticated filtering for more targeted searching. It also supports vector search over multiple embedding models.
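As a brief aside on the hybrid vector and keyword search mentioned above: a hybrid search has to merge two ranked result lists into one. The sketch below uses reciprocal rank fusion, which is one common way to do that. It is an assumption chosen for illustration; the talk does not say which fusion method OpenRAG or OpenSearch applies, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here the documents found by both searches (`b` and `c`) outrank the ones found by only one of them.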
Searching over multiple embedding models will slow down your vector search in practice, but it is useful if you decide to migrate your embedding models as part of your system.
OpenRAG also sets up OpenSearch with the secret fourth open source project. The default OpenSearch nearest-neighbor plugin gives you options for HNSW or IVF vector indexes; OpenRAG uses the JVector k-NN plugin by default. JVector is an open source vector index that gives you live indexing and, because it's based on the DiskANN architecture, your whole index doesn't have to fit in memory, giving you more options for scaling the data.
All of this is then tied together with Langflow. Langflow is a drag-and-drop visual editor for AI flows, and it integrates Docling, OpenSearch and all these embedding models, as well as further data enrichment, as part of that ingestion pipeline. We'll come back and look more deeply into Langflow later.
So that's ingestion and indexing. What about the generation side of RAG? On the generation side, we don't normally have to worry about ingesting documents, and we already know that OpenSearch is handling that multi-vector hybrid search for us. But we do need to point out that OpenRAG uses agentic retrieval in order to perform the search. This is also done in Langflow, and again gives you access to all the models that Langflow makes available to you. So out of the box, that's OpenAI, Anthropic, Ollama or watsonx.ai.
What does agentic search mean? Well, a traditional RAG generation pipeline would take a user query, embed it, use that to perform a nearest-neighbor search over the chunks, and present the top-k chunks to the LLM, hoping that the answer is contained within and that the model is smart enough to extract it. With agentic retrieval, we instead give the user query to an agent along with instructions and tools that it can use to perform as many searches as required.
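The difference from the traditional pipeline can be sketched as a loop in which a policy, rather than fixed code, chooses each search. In this minimal sketch the `scripted_policy` function is a rule-based stand-in for the LLM, and the corpus and `search` tool are hypothetical; none of it reflects how OpenRAG's Langflow agent is actually built.

```python
from typing import Callable

# Hypothetical knowledge base, keyed by topic.
CORPUS = {
    "docling": "Docling parses PDFs, HTML and audio into structured documents.",
    "opensearch": "OpenSearch provides hybrid vector and keyword search.",
    "langflow": "Langflow is a visual editor for AI flows and agents.",
}


def search(query: str) -> list[str]:
    """Toy search tool: return passages whose topic key appears in the query."""
    q = query.lower()
    return [text for key, text in CORPUS.items() if key in q]


def scripted_policy(question: str, notes: list[str], asked: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: search once per topic named in the question, then answer."""
    for key in CORPUS:
        if key in question.lower() and key not in asked:
            return ("search", key)
    return ("answer", " ".join(notes) or "I could not find anything relevant.")


def run_agent(
    question: str,
    policy: Callable[[str, list[str], list[str]], tuple[str, str]] = scripted_policy,
    max_steps: int = 5,
) -> tuple[str, list[str]]:
    """Let the policy pick tool calls until it decides to answer."""
    tools = {"search": search}
    notes: list[str] = []   # results gathered so far
    asked: list[str] = []   # searches already performed
    for _ in range(max_steps):
        action, arg = policy(question, notes, asked)
        if action == "answer":
            return arg, asked
        asked.append(arg)
        notes.extend(tools[action](arg))
    return " ".join(notes), asked


answer, searches = run_agent("How do Docling and OpenSearch fit together?")
```

The number of searches is decided at run time: two for this question, none for a question that mentions no known topic.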
The model is actually responsible for deciding what searches to perform and what to do with the results.
So let's take a look at this in action. I have OpenRAG running on my laptop and we're going to have a quick look at what it can do. Once you've gone through the onboarding process with OpenRAG and set it up, you get dropped into a chat, and the first thing you get to ask is "What is OpenRAG?" As you can see here, it has got an answer, but you can also see that it's done some tool calling already. It turns out that the answer to "What is OpenRAG?" is inside the agent's prompt by default, so it doesn't actually need to do any search querying. But it did go and get the current date just in case, which is nice of it. So we get an answer out of it, but we also get these suggestions about next things to ask. Those are little nudges, and they are also powered by Langflow. If we were to pick one of those, to explore Langflow's role in AI agent construction, then the agent will go off, search the documentation and come up with an answer. And as you can see, the agent has gone and used some tools again. It has come up with an answer, and it's come up with more nudges as well.
So let's go look at the knowledge section. This is where you actually upload your data, your documents. You can do so just by adding a whole file or a whole folder. There's also a sync button here; we'll see that in a minute. You can also inspect your documents and your chunks here, so you can see that it is chunking things as you'd expect. This is also where you can create knowledge filters, which take advantage of that filtering in OpenSearch. You can create filters based on a whole bunch of different options around the data that you have in your system, and that allows you in chat to use those filters to only talk to specific documents. So that's the knowledge section.
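To make the idea of a knowledge filter concrete, here is a sketch of how a "only talk to these documents" restriction could be expressed in OpenSearch's query DSL, using a `bool` query with a `filter` clause. The field names `text` and `filename` are hypothetical (with `filename` assumed to be a keyword field), only the keyword half of the search is shown, and this is not the query OpenRAG itself generates.

```python
import json


def filtered_query(question: str, filenames: list[str], size: int = 5) -> dict:
    """Build a keyword search restricted to a set of source documents.

    `text` and `filename` are assumed index fields, not OpenRAG's schema.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Relevance scoring comes from the match clause...
                "must": [{"match": {"text": question}}],
                # ...while the filter only narrows which documents are eligible.
                "filter": [{"terms": {"filename": filenames}}],
            }
        },
    }


body = filtered_query("What is the refund policy?", ["handbook.pdf", "faq.md"])
print(json.dumps(body, indent=2))
```

Because the restriction lives in the `filter` clause, it narrows the candidate set without affecting relevance scores.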
And then in the settings, we can dig into the actual customizability of this. Right at the top there are cloud connectors, but in order to use cloud connectors you need a user model and some authentication. Right now we set that up with Google OAuth, so you need an OAuth client and secret there. Once you have that in place, you can connect to Google Drive, SharePoint, or OneDrive. This allows your users to connect to directories of their own documents and allows OpenRAG to sync them directly. I think that's really powerful. It saves you having to upload things a lot of the time. You can just sync with this external document store and it will always be up to date.
We can see our model providers. You can configure API-based ones or Ollama. Like I said, that's for running things locally. I'm running Ollama, and you can see I'm currently running Granite 4 3B, which is one of IBM's models. So this is our language model, and you can see the actual agent instructions as well, so you can set your system prompt there.
Then in the ingest section, you can see I'm running Qwen3 Embedding 0.6B for my embedding model, also on Ollama. You can set your chunk size and chunk overlap. These last bits are Docling settings, where we say: do we want to capture table structure? Yes, currently. Do we want to run OCR? Not right now, that's turned off. And do we want to extract picture descriptions? Currently that's off, but that's a useful one if you want to get the information out of images as well. Adding more models to the pipeline makes things a little slower, so they're off for now.
And then right at the bottom there are API keys. This is where you can set up access to OpenRAG as an API, so you can use your search or your agent within your own application. But let's drop under the hood even further, where we can customize things even more.
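The chunk size and chunk overlap settings mentioned above interact in a simple way: each chunk starts `chunk_size - chunk_overlap` units after the previous one. The sketch below measures both in characters, which is an assumption made for illustration; the talk does not state which unit OpenRAG uses, and Docling's hierarchical chunker works differently.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window reaches the end, so no chunk is pure overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


overlapping = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
disjoint = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of indexing more text: the ten-character example yields four chunks with an overlap of 2 and three with no overlap.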
You can hit this "Edit in Langflow" button, and that will take you into the actual implementation of your agent. So let's zoom in. We can see here is our agent. This is the chat, the generation flow. Our agent receives its information from this chat input, which goes through a quick prompt template, adding in things about knowledge filters if you've used them. And then the agent has a bunch of tools. Those tools include an MCP server for a URL ingester, which is actually just another flow within Langflow. There's a calculator, because I think that agents and models shouldn't be doing arithmetic: they're language models, not maths models. So a calculator is always useful. And then finally, the last one is the OpenSearch multi-model embedding component, and the embedding providers are all here.
We can edit this. If you go in here, unlock the flow and save it, we can do more with it. For example, we can take this chat input and we might want to put some guardrails in place. So we can grab guardrails from our set of components on the left. Then we do need to parse the results of that, so we add a parser. If the input passes, we pass it through the parser and get the text out, which is the original text that was sent in, and then hand that on to our prompt template. If it fails, we can send an error message to a chat output, and that's fine. So now we've added guardrails to our flow. We can use our Ollama models there as well. And so this is as extensible as Langflow can be for you.
There's one more thing. There is an MCP server available for OpenRAG as well as an API, so you can hand this to your other agents as well.
So is RAG solved? Well, that's kind of up to you, to your data and to your users. But OpenRAG is built to help. It is an opinionated, but agentic and open source stack for RAG.
As I said before, it combines Docling, OpenSearch and Langflow to create this powerful baseline RAG system made of open source components. And it leaves plenty of room to customize within that stack so that you can build out the best RAG for your data and provide the best context to your agents. It is currently at version 0.4.0 and it's ready for you to play with. So this link or the QR code will take you to the project, and we'd love it if you tried out OpenRAG, dropped a star on the GitHub repo and let us know what you think.
It's also open source, right? The front end is a Next.js application. Everything else is a Python app. So if you like the look of OpenRAG, we'd also appreciate your feedback and your contributions to the project itself. The components of OpenRAG are of course open as well, so you can get involved with Docling, with OpenSearch or with Langflow. Together, we can build a RAG platform that works for everyone, one that gives you the choices where you need them and makes good decisions for you where it makes sense. We can do it with open source components, out in the open. That's what I'd like to see.
So thank you very much for listening. Again, my name is Phil Nash. I'm a developer relations engineer at IBM, trying to help build OpenRAG and this open ecosystem of agentic applications. We can't wait to see what you build with OpenRAG. Thank you very much.
