📖 Lesson content
Summary
Now that we've covered the basics of RAG, text chunking, and embeddings, let's walk through the complete RAG pipeline step by step. This detailed example will show you exactly how all the pieces fit together in a real implementation.
Step 1: Chunk Your Source Text
First, we take our source document and break it into manageable chunks. For this example, we'll use two simple text sections:
- Section 1: Medical Research - "This year saw significant strides in our understanding of XDR-47, a 'bug' we have not seen before."
- Section 2: Software Engineering - "This division dedicated significant effort to studying various infection vectors in our distributed systems"
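As a rough illustration of this step, the sketch below splits a document into one chunk per section. It assumes sections are separated by a blank line, which is only one of many possible chunking strategies; the function name `chunk_by_section` is hypothetical and not part of any library.

```python
def chunk_by_section(document: str) -> list[str]:
    """Split a document into chunks, one per blank-line-separated section."""
    sections = document.split("\n\n")
    # Trim whitespace and drop any empty pieces left by extra blank lines.
    return [section.strip() for section in sections if section.strip()]


# The two example sections from this lesson, joined into one document.
document = (
    "This year saw significant strides in our understanding of XDR-47, "
    "a 'bug' we have not seen before.\n\n"
    "This division dedicated significant effort to studying various "
    "infection vectors in our distributed systems"
)

chunks = chunk_by_section(document)
print(len(chunks))
```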
Step 2: Generate Embeddings
Next, we convert each text chunk into numerical embeddings. To make this concept clear, let's imagine we have a perfect embedding model that always returns exactly two numbers, and we know what each number represents.

In our imaginary model:
- First number: How much the text talks about the medical field
- Second number: How much the text talks about software engineering
So our medical research section gets an embedding of [0.97, 0.34] - very medical, somewhat software-related due to the word "bug". The software engineering section gets [0.30, 0.97] - very software-focused, but "infection vectors" has medical connotations.
Normalization
Before storing these embeddings, they go through a normalization process that scales each vector to have a magnitude of 1.0. This is typically handled automatically by your embedding API, but it's important to understand that it happens.

After normalization, our embeddings become [0.944, 0.331] and [0.295, 0.955]. We can visualize these on a unit circle where each point lies exactly on the circle's edge.
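To make the normalization step concrete, here is a minimal sketch that divides each component by the vector's magnitude (its L2 norm). The helper name `normalize` is hypothetical; in practice your embedding API may already do this for you.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector so that its magnitude (L2 norm) is 1.0."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    if magnitude == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / magnitude for x in vector]


# Raw embeddings from the imaginary two-number model.
medical = normalize([0.97, 0.34])
software = normalize([0.30, 0.97])

print([round(x, 3) for x in medical])
print([round(x, 3) for x in software])
```

Rounded to three decimal places, the printed values should agree with the normalized embeddings quoted above.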

Step 3: Store in Vector Database
The normalized embeddings get stored in a vector database - a specialized database optimized for storing, comparing, and searching through long lists of numbers like our embeddings.

At this point, we pause. All the work so far has been preprocessing that happens ahead of time. Now we wait for a user to submit a query.
Step 4: Process User Query
When a user asks a question like "I'm curious about the company. In particular, what did the software engineering dept do this year?", we run their query through the same embedding model.

This query gets embedded as [0.1, 0.89] - low medical score, high software engineering score. After normalization, it becomes [0.112, 0.994].
Step 5: Find Similar Embeddings
Now we ask the vector database: "Find the stored embedding that's closest to this user query embedding." The database returns the software engineering section because it's the most similar.
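Steps 3 through 5 can be sketched with a toy in-memory store. This is an illustrative stand-in, not how a real vector database is implemented: the class `InMemoryVectorStore` and its methods are hypothetical, and it simply scans every stored embedding, whereas real databases use specialized indexes.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector database: a list of (embedding, text) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        """Store one normalized embedding alongside its source text."""
        self._items.append((embedding, text))

    def nearest(self, query: list[float]) -> str:
        """Return the text whose embedding scores highest against the query.

        The score is the dot product, which equals cosine similarity
        when all vectors have been normalized to length 1.
        """
        if not self._items:
            raise ValueError("the store is empty")

        def score(item: tuple[list[float], str]) -> float:
            embedding, _ = item
            return sum(q * e for q, e in zip(query, embedding))

        _, text = max(self._items, key=score)
        return text


# Preprocessing: store the normalized chunk embeddings ahead of time.
store = InMemoryVectorStore()
store.add([0.944, 0.331], "Medical Research")
store.add([0.295, 0.955], "Software Engineering")

# Query time: the normalized query embedding, rounded to three places.
match = store.nearest([0.112, 0.994])
print(match)
```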

How Similarity Works: Cosine Similarity
The vector database uses cosine similarity to determine which embeddings are most similar. This measures the cosine of the angle between two vectors.

Key points about cosine similarity:
- Results range from -1 to 1
- Values close to 1 mean very similar
- Values close to 0 mean perpendicular (unrelated)
- Values close to -1 mean completely opposite
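The calculation uses the dot product formula: cos(a) = (A · B) / (||A|| · ||B||), where a is the angle between vectors A and B. Because the stored embeddings are already normalized, both magnitudes are 1 and the cosine similarity is simply the dot product A · B.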

In our example, the user query has a cosine similarity of about 0.982 with the software engineering chunk and only about 0.434 with the medical research chunk. The software engineering chunk is clearly the better match.
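If you want to check the arithmetic yourself, the sketch below applies the full cosine similarity formula to the raw (unnormalized) example embeddings. The function name `cosine_similarity` is just an illustrative helper; dividing by both magnitudes means the inputs do not need to be normalized first.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Raw embeddings from the imaginary two-number model.
query = [0.1, 0.89]
software = [0.30, 0.97]
medical = [0.97, 0.34]

software_score = cosine_similarity(query, software)
medical_score = cosine_similarity(query, medical)

print(round(software_score, 3))
print(round(medical_score, 3))
```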
Cosine Distance
You'll often see "cosine distance" in vector database documentation. This is simply 1 - cosine similarity, so it ranges from 0 to 2 and reads like a distance, where:
- Values close to 0 mean high similarity
- Larger values mean less similarity
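The conversion is a one-liner. This small sketch, with a hypothetical helper named `cosine_distance`, shows how the three landmark similarity values map onto distances.

```python
def cosine_distance(cosine_similarity: float) -> float:
    """Convert a cosine similarity in [-1, 1] to a cosine distance in [0, 2]."""
    return 1.0 - cosine_similarity


# Identical direction, perpendicular, and opposite direction.
for similarity in (1.0, 0.0, -1.0):
    print(similarity, "->", cosine_distance(similarity))
```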
Step 6: Build the Final Prompt
Finally, we take the user's question and the most relevant text chunk we found, then combine them into a prompt for Claude.

The prompt includes both the user's question and the relevant context from our document, allowing Claude to provide an informed answer based on the specific information in our knowledge base.
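One possible way to assemble such a prompt is sketched below. The wording of the template, the `<context>` and `<question>` tags, and the function name `build_prompt` are all assumptions made for illustration; the lesson does not prescribe an exact prompt format.

```python
def build_prompt(question: str, context: str) -> str:
    """Combine the retrieved chunk and the user's question into one prompt."""
    return (
        "Answer the user's question using the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>"
    )


question = (
    "I'm curious about the company. In particular, what did the "
    "software engineering dept do this year?"
)
# The chunk returned by the similarity search in Step 5.
context = (
    "This division dedicated significant effort to studying various "
    "infection vectors in our distributed systems"
)

prompt = build_prompt(question, context)
print(prompt)
```

The resulting string would then be sent to Claude as the user message through whichever API client you are using.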
The Complete Flow
That's the entire RAG pipeline from start to finish:
- Chunk source documents
- Generate embeddings for each chunk
- Store embeddings in a vector database
- When a user asks a question, embed their query
- Find the most similar stored embeddings using cosine similarity
- Add the relevant chunks to a prompt with the user's question
- Send the enhanced prompt to Claude for a response
Understanding this process and the math behind it will help you work effectively with vector databases and debug issues when your RAG system isn't returning the results you expect.
📚 Source & attribution
- Original Anthropic Academy lesson: https://anthropic.skilljar.com/claude-in-amazon-bedrock/276774
- © 2025 Anthropic. Educational fair-use only.