Lesson 4 / 6

04. RAG — Give Your AI a Knowledge Base

TL;DR

RAG = retrieve relevant documents, stuff them into the prompt, let the LLM answer using that context. You need: an embedding model to vectorize text, a vector DB to store/search, a chunking strategy to split documents, and a prompt template that combines query + context. This is how you make LLMs useful with your own data.

LLMs know a lot, but they don’t know your data. They can’t read your company wiki, your product docs, your private codebase, or last week’s support tickets. Ask them a question about your internal systems and they’ll either hallucinate a confident-sounding wrong answer or admit they don’t know. RAG fixes this. Instead of hoping the model memorized the right training data, you retrieve the relevant documents yourself and hand them to the model as context. It’s the single most important pattern in production AI.

RAG pipeline — ingestion and query flow with embeddings and vector database

What RAG Is and Why It Exists

RAG stands for Retrieval-Augmented Generation. The idea is simple: before the LLM generates an answer, retrieve relevant documents and stuff them into the prompt.

User question: "What is our refund policy for enterprise customers?"

Without RAG:  LLM guesses (hallucination)
With RAG:     Search your docs -> find refund-policy.md -> paste it into prompt -> LLM answers accurately

Three reasons RAG dominates production AI:

  1. No fine-tuning required. Fine-tuning is expensive, slow, and hard to update. RAG uses the model as-is — you just change what context you feed it.
  2. Always up to date. Your vector database can be updated in real time. Fine-tuned models are frozen at training time.
  3. Verifiable answers. RAG can cite sources. You can show users which documents the answer came from.

The tradeoff: RAG adds latency (retrieval step), complexity (vector DB infrastructure), and cost (embedding API calls). But for most use cases, it’s the right call.

Embeddings — semantic similarity in vector space with clustered topics

Embeddings Explained

An embedding is a vector — a list of numbers — that represents the meaning of a piece of text. Similar texts get similar vectors. This is what makes semantic search possible.

from openai import OpenAI

client = OpenAI()

# Embed a single piece of text
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 values: {vector[:5]}")
# [-0.0023, 0.0145, -0.0067, 0.0312, -0.0089]

The key insight: “How do I reset my password?” and “I forgot my login credentials” produce vectors that are close together in 1536-dimensional space, even though they share almost no words. That’s the magic of embeddings — they capture meaning, not just keywords.

Embedding Models

Model                    Dimensions  Provider               Cost
text-embedding-3-small   1536        OpenAI                 $0.02/1M tokens
text-embedding-3-large   3072        OpenAI                 $0.13/1M tokens
voyage-3                 1024        Voyage AI              $0.06/1M tokens
all-MiniLM-L6-v2         384         Sentence Transformers  Free (local)

For local/free embeddings, sentence-transformers is the go-to:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "The weather is nice today"
]

embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)
print(f"password vs credentials: {sims[0][1]:.3f}")  # ~0.78 (similar)
print(f"password vs weather:     {sims[0][2]:.3f}")  # ~0.12 (unrelated)

Critical rule: the same embedding model must be used for both ingestion and querying. If you embed your documents with text-embedding-3-small, you must embed your queries with text-embedding-3-small. Mixing models produces garbage results.
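One lightweight way to enforce this rule (a sketch; the guard class is hypothetical, not part of any library) is to record the model name when a collection is built and check it at query time:

```python
class EmbeddingModelGuard:
    """Records which embedding model a collection was indexed with,
    and rejects queries embedded with a different model."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def check(self, query_model: str) -> None:
        if query_model != self.model_name:
            raise ValueError(
                f"Collection indexed with {self.model_name!r} but query "
                f"embedded with {query_model!r} — similarity scores would be garbage."
            )

guard = EmbeddingModelGuard("text-embedding-3-small")
guard.check("text-embedding-3-small")  # OK
# guard.check("all-MiniLM-L6-v2")      # raises ValueError
```

Storing the model name in the collection's metadata and running this check on every query turns a silent data-quality bug into a loud failure.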

Vector Databases

You need somewhere to store embeddings and search them by similarity. That’s what vector databases do.

ChromaDB (Local, Zero Config)

ChromaDB runs in-process with no server. Perfect for prototyping and small-to-medium datasets.

import chromadb

client = chromadb.Client()  # In-memory
# Or persistent: chromadb.PersistentClient(path="./chroma_data")

collection = client.create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}  # cosine similarity
)

# Add documents — Chroma embeds them for you using its default model
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Our refund policy allows returns within 30 days.",
        "Enterprise customers get a dedicated support channel.",
        "Password resets can be done from the settings page."
    ],
    metadatas=[
        {"source": "policy.md", "section": "refunds"},
        {"source": "enterprise.md", "section": "support"},
        {"source": "faq.md", "section": "auth"}
    ]
)

# Query
results = collection.query(
    query_texts=["How do I get a refund?"],
    n_results=2
)
print(results["documents"])
# [['Our refund policy allows returns within 30 days.',
#   'Enterprise customers get a dedicated support channel.']]

Pinecone (Hosted, Production-Ready)

Pinecone is a managed vector database. You don’t run any infrastructure.

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")

# Upsert vectors (you must embed them yourself)
index.upsert(vectors=[
    {"id": "doc1", "values": embedding_vector, "metadata": {"source": "policy.md"}},
    {"id": "doc2", "values": embedding_vector_2, "metadata": {"source": "faq.md"}}
])

# Query
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
for match in results.matches:
    print(f"{match.id}: {match.score:.3f} ({match.metadata['source']})")

pgvector (Postgres Extension)

If you already run Postgres, pgvector adds vector search without a new database.

CREATE EXTENSION vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),
    source TEXT
);

-- Insert
INSERT INTO documents (content, embedding, source)
VALUES ('Refund policy...', '[0.0023, 0.0145, ...]', 'policy.md');

-- Nearest neighbor search
SELECT content, source, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

Quick Comparison

Database   Setup                 Best For                     Scales To
ChromaDB   pip install           Prototyping, small projects  ~1M vectors
Pinecone   Managed SaaS          Production, zero-ops         Billions
pgvector   Postgres extension    Teams already on Postgres    ~10M vectors
Weaviate   Self-hosted or cloud  Multi-modal, GraphQL fans    Billions

For this lesson, we’ll use ChromaDB throughout. Everything transfers to other vector DBs — the concepts are identical.

Document chunking strategies for RAG — fixed, recursive, and semantic

Chunking Strategies

You can’t embed an entire 50-page PDF as one vector. You need to split documents into chunks. Chunking is where most RAG pipelines succeed or fail.

Fixed-Size Chunking

The simplest approach. Split text every N characters with overlap.

def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Reached the end; avoid re-emitting the tail
        start = end - overlap  # Overlap preserves context across chunk boundaries
    return chunks

text = "A very long document... " * 200
chunks = fixed_size_chunks(text, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks")

Recursive Character Splitting

Split on natural boundaries — paragraphs first, then sentences, then words. This is the most commonly used strategy.

from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try each in order
)

text = open("long_document.txt").read()
chunks = splitter.split_text(text)
print(f"Chunks: {len(chunks)}, avg size: {sum(len(c) for c in chunks) / len(chunks):.0f}")

Semantic Chunking

Group sentences that are semantically related. More expensive but produces higher-quality chunks.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75
)

chunks = chunker.split_text(text)

Why Chunk Size Matters

This is not a tuning knob you can ignore:

  • Too small (< 200 chars): Chunks lack context. “The deadline is 30 days” means nothing without knowing what it refers to.
  • Too large (> 2000 chars): Retrieval loses precision. A chunk about five topics matches queries about all five topics — poorly.
  • Sweet spot: 300-800 characters for most use cases. Overlap of 10-20% of chunk size.

Test your chunking on real queries before you move on. Bad chunking is the number one cause of bad RAG results.
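One quick way to do that review at scale (a heuristic sketch, not a standard tool) is to flag chunks that start or end mid-sentence:

```python
def audit_chunks(chunks: list[str]) -> list[int]:
    """Return indices of chunks that look like fragments:
    they start with a lowercase letter or end without sentence punctuation."""
    suspects = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if not text:
            suspects.append(i)  # Empty chunk: always a bug
            continue
        starts_mid = text[0].islower()
        ends_mid = text[-1] not in '.!?"\''
        if starts_mid or ends_mid:
            suspects.append(i)
    return suspects

print(audit_chunks([
    "Our refund policy allows returns within 30 days.",
    "nd enterprise customers get a dedicated suppo",
]))  # [1]
```

If this flags a large fraction of your chunks, fix the splitter before touching anything else in the pipeline.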

Building the Ingestion Pipeline

Time to build. This pipeline loads documents, chunks them, embeds them, and stores them in ChromaDB.

# rag_ingest.py — Full ingestion pipeline
import os
import chromadb
from pathlib import Path

# 1. Load documents from a directory
def load_documents(directory: str) -> list[dict]:
    """Load all .txt and .md files from a directory."""
    docs = []
    for path in Path(directory).rglob("*"):
        if path.suffix in (".txt", ".md"):
            content = path.read_text(encoding="utf-8")
            docs.append({
                "content": content,
                "source": str(path),
                "filename": path.name
            })
    print(f"Loaded {len(docs)} documents")
    return docs


# 2. Chunk documents
def chunk_documents(docs: list[dict], chunk_size: int = 500, overlap: int = 100) -> list[dict]:
    """Split documents into overlapping chunks, preserving metadata."""
    chunks = []
    for doc in docs:
        text = doc["content"]
        start = 0
        chunk_index = 0
        while start < len(text):
            end = start + chunk_size
            chunk_text = text[start:end]

            # Try to break at a sentence boundary
            if end < len(text):
                last_period = chunk_text.rfind(". ")
                if last_period > chunk_size * 0.5:
                    chunk_text = chunk_text[:last_period + 1]
                    end = start + last_period + 1

            chunks.append({
                "id": f"{doc['filename']}_{chunk_index}",
                "text": chunk_text.strip(),
                "metadata": {
                    "source": doc["source"],
                    "filename": doc["filename"],
                    "chunk_index": chunk_index
                }
            })
            start = end - overlap
            chunk_index += 1

    print(f"Created {len(chunks)} chunks from {len(docs)} documents")
    return chunks


# 3. Store in ChromaDB
def create_vector_store(chunks: list[dict], collection_name: str = "knowledge_base"):
    """Embed and store chunks in ChromaDB."""
    client = chromadb.PersistentClient(path="./chroma_data")

    # Delete the existing collection if present (the exception type raised
    # when it's missing varies across ChromaDB versions, so catch broadly)
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )

    # ChromaDB handles embedding internally using its default model
    # For production, you'd pass your own embedding function
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        collection.add(
            ids=[c["id"] for c in batch],
            documents=[c["text"] for c in batch],
            metadatas=[c["metadata"] for c in batch]
        )
        print(f"  Stored batch {i // batch_size + 1} ({len(batch)} chunks)")

    print(f"Vector store created: {collection.count()} chunks indexed")
    return collection


# Run the pipeline
if __name__ == "__main__":
    docs = load_documents("./my_knowledge_base")
    chunks = chunk_documents(docs, chunk_size=500, overlap=100)
    collection = create_vector_store(chunks)

Run it:

pip install chromadb
mkdir -p my_knowledge_base
# Put your .txt or .md files in my_knowledge_base/
python rag_ingest.py

Building the Query Pipeline

Now the retrieval and generation side. Embed the user’s question, find similar chunks, build a prompt, and call the LLM.

# rag_query.py — Full query pipeline
import chromadb
from openai import OpenAI

openai_client = OpenAI()


def retrieve(query: str, collection, n_results: int = 5) -> list[dict]:
    """Search the vector store for relevant chunks."""
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )

    retrieved = []
    for i in range(len(results["documents"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i],
            "distance": results["distances"][0][i]
        })

    return retrieved


def build_prompt(query: str, context_chunks: list[dict]) -> str:
    """Combine retrieved context with the user's question."""
    context = "\n\n---\n\n".join([
        f"[Source: {c['metadata']['source']}]\n{c['text']}"
        for c in context_chunks
    ])

    return f"""You are a helpful assistant. Answer the user's question using ONLY the provided context.
If the context doesn't contain enough information to answer, say "I don't have enough information to answer that."
Always cite which source document you used.

CONTEXT:
{context}

QUESTION: {query}

ANSWER:"""


def generate(prompt: str) -> str:
    """Call the LLM with the augmented prompt."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for factual answers
        max_tokens=1000
    )
    return response.choices[0].message.content


def rag_query(query: str, collection, n_results: int = 5) -> dict:
    """Full RAG pipeline: retrieve -> augment -> generate."""

    # Step 1: Retrieve
    chunks = retrieve(query, collection, n_results)
    print(f"Retrieved {len(chunks)} chunks")
    for c in chunks:
        print(f"  [{c['distance']:.3f}] {c['metadata']['source']}")

    # Step 2: Augment (build prompt with context)
    prompt = build_prompt(query, chunks)

    # Step 3: Generate
    answer = generate(prompt)

    return {
        "answer": answer,
        "sources": [c["metadata"]["source"] for c in chunks],
        "chunks_used": len(chunks)
    }

The Complete RAG Pipeline

Here’s both pipelines wired together in one runnable script:

# rag_complete.py — End-to-end RAG system
import chromadb
from openai import OpenAI
from pathlib import Path

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_data")

COLLECTION_NAME = "knowledge_base"


# ---------- Ingestion ----------

def ingest(directory: str, chunk_size: int = 500, overlap: int = 100):
    """Load, chunk, and index documents."""
    # Load
    docs = []
    for path in Path(directory).rglob("*"):
        if path.suffix in (".txt", ".md"):
            docs.append({"content": path.read_text(), "name": path.name})

    # Chunk
    chunks, ids, metadatas = [], [], []
    for doc in docs:
        text = doc["content"]
        start, idx = 0, 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append(text[start:end])
            ids.append(f"{doc['name']}_{idx}")
            metadatas.append({"source": doc["name"], "chunk": idx})
            start = end - overlap
            idx += 1

    # Store
    try:
        chroma_client.delete_collection(COLLECTION_NAME)
    except Exception:  # Collection may not exist yet; exception type varies by version
        pass

    collection = chroma_client.create_collection(COLLECTION_NAME)
    collection.add(ids=ids, documents=chunks, metadatas=metadatas)
    print(f"Indexed {len(chunks)} chunks from {len(docs)} documents")
    return collection


# ---------- Query ----------

def query(question: str, collection, top_k: int = 5) -> str:
    """Retrieve context and generate an answer."""
    results = collection.query(query_texts=[question], n_results=top_k)

    context = "\n\n".join([
        f"[{results['metadatas'][0][i]['source']}]: {results['documents'][0][i]}"
        for i in range(len(results["documents"][0]))
    ])

    prompt = f"""Answer using ONLY the context below. Cite your sources.
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}"""

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content


# ---------- Main ----------

if __name__ == "__main__":
    # Ingest your docs
    collection = ingest("./my_knowledge_base")

    # Ask questions
    while True:
        q = input("\nQuestion (or 'quit'): ")
        if q.lower() == "quit":
            break
        answer = query(q, collection)
        print(f"\n{answer}")

Install dependencies and run:

pip install chromadb openai
python rag_complete.py

That’s a working RAG system in under 100 lines. Everything else is optimization.

Advanced RAG Patterns

The basic pipeline works, but production systems need more.

Hybrid Search

Combine vector similarity with keyword matching. Vector search misses exact terms (product codes, error IDs). Keyword search misses semantic meaning. Use both.

# Pseudo-code for hybrid search
def hybrid_search(query: str, collection, n_results: int = 10):
    # Vector search (semantic)
    vector_results = collection.query(query_texts=[query], n_results=n_results)

    # Keyword search (BM25 or full-text)
    keyword_results = bm25_search(query, documents, top_k=n_results)

    # Reciprocal Rank Fusion to merge results
    fused = reciprocal_rank_fusion(vector_results, keyword_results, k=60)
    return fused[:n_results]
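The reciprocal_rank_fusion step above is easy to implement yourself. A minimal version (a sketch that operates on ranked lists of document IDs) scores each document by the sum of 1/(k + rank) across all rankings:

```python
def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs using Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion(["a", "b", "c"], ["b", "c", "d"])
print(merged)  # ['b', 'c', 'a', 'd']
```

Documents that appear high in both lists ("b", "c") beat documents that appear in only one, which is exactly the behavior you want from hybrid search. The constant k=60 is the standard choice from the original RRF paper and rarely needs tuning.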

Reranking

The retrieval step casts a wide net. A reranker scores each result more carefully.

# pip install cohere
import cohere

co = cohere.Client("your-api-key")

# First, retrieve more candidates than you need
candidates = retrieve(query, collection, n_results=20)

# Then rerank to find the best ones
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[c["text"] for c in candidates],
    top_n=5
)

top_chunks = [candidates[r.index] for r in reranked.results]

HyDE (Hypothetical Document Embeddings)

Instead of embedding the question directly, ask the LLM to generate a hypothetical answer, then embed that. The hypothetical answer is closer in embedding space to the real documents than a short question would be.

def hyde_retrieve(query: str, collection, n_results: int = 5):
    # Generate a hypothetical answer
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short paragraph answering this question: {query}"
        }],
        temperature=0.5
    )
    hypothetical_doc = response.choices[0].message.content

    # Embed the hypothetical answer (not the question)
    results = collection.query(query_texts=[hypothetical_doc], n_results=n_results)
    return results

Metadata Filtering

Filter by source, date, category, or any metadata before searching.

results = collection.query(
    query_texts=["refund policy"],
    n_results=5,
    where={"source": "enterprise_docs.md"},          # Exact match
    where_document={"$contains": "enterprise"}        # Document content filter
)

# Combine filters (note: Chroma's $gte/$lte compare numbers,
# so store dates as numeric timestamps, not strings)
results = collection.query(
    query_texts=["deployment guide"],
    n_results=5,
    where={
        "$and": [
            {"category": {"$eq": "engineering"}},
            {"updated_at": {"$gte": 1735689600}}  # 2025-01-01 as a Unix timestamp
        ]
    }
)

Evaluating RAG Quality

You can’t improve what you don’t measure. RAG evaluation has three dimensions:

Context Relevance — Did you retrieve the right documents?

def context_precision(retrieved_docs, relevant_docs):
    """What fraction of retrieved docs are actually relevant?"""
    relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
    return len(relevant_retrieved) / len(retrieved_docs) if retrieved_docs else 0

def context_recall(retrieved_docs, relevant_docs):
    """What fraction of relevant docs were retrieved?"""
    relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
    return len(relevant_retrieved) / len(relevant_docs) if relevant_docs else 0
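A quick worked example (with made-up file names) makes the two metrics concrete:

```python
retrieved = ["policy.md", "faq.md", "enterprise.md"]  # what the retriever returned
relevant = ["policy.md", "enterprise.md"]             # ground truth for this query

hits = set(retrieved) & set(relevant)
precision = len(hits) / len(retrieved)  # 2/3 — one retrieved doc was noise
recall = len(hits) / len(relevant)      # 1.0 — every relevant doc was found
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

High recall with low precision means you are retrieving the right documents plus junk (increase ranking quality or lower K); low recall means relevant documents never surface at all (fix chunking or increase K).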

Faithfulness — Is the answer grounded in the retrieved context, or did the LLM hallucinate?

def check_faithfulness(answer: str, context: str) -> str:
    """Use an LLM to judge if the answer is supported by context."""
    prompt = f"""Given the context and answer below, identify any claims in the answer
that are NOT supported by the context.

Context: {context}
Answer: {answer}

List unsupported claims, or say "All claims are supported."
"""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

Answer Correctness — Is the final answer actually right? This usually requires a ground-truth test set.

# Build an evaluation dataset
eval_set = [
    {
        "question": "What is the refund policy for enterprise?",
        "expected_answer": "Enterprise customers can request refunds within 60 days.",
        "relevant_docs": ["enterprise_policy.md"]
    },
    # ... more test cases
]

# Run evaluation
for test in eval_set:
    result = rag_query(test["question"], collection)
    precision = context_precision(result["sources"], test["relevant_docs"])
    print(f"Q: {test['question']}")
    print(f"  Precision: {precision:.2f}")
    print(f"  Answer: {result['answer'][:100]}...")

For automated evaluation at scale, look at frameworks like Ragas and DeepEval.

Common Pitfalls

These are the mistakes that waste weeks of debugging time.

Bad chunking. This is the most common problem. If your chunks split a paragraph in the middle of a key sentence, retrieval will return fragments that lack context. Always test your chunking by reading the actual chunks. If a chunk doesn’t make sense to a human, it won’t make sense to the retriever.

Wrong K value. Retrieving too few chunks (K=1 or 2) misses relevant context. Retrieving too many (K=20) floods the prompt with noise and confuses the LLM. Start with K=5, then tune based on your evaluation metrics.

Context window overflow. If you retrieve 10 chunks of 500 tokens each, that’s 5,000 tokens of context before the question and instructions. Add the system prompt and the LLM’s response, and you can blow past the context window. Always calculate your token budget:

def check_token_budget(chunks: list[str], max_context_tokens: int = 4000) -> list[str]:
    """Trim chunks to fit within token budget."""
    # Rough estimate: 1 token ~ 4 characters
    total_chars = 0
    selected = []
    for chunk in chunks:
        if total_chars + len(chunk) > max_context_tokens * 4:
            break
        selected.append(chunk)
        total_chars += len(chunk)
    return selected

Embedding model mismatch. If you indexed with text-embedding-3-small and query with all-MiniLM-L6-v2, the vectors live in different spaces. Every similarity score will be meaningless. Pin your embedding model and version in configuration.

Not handling empty results. When the vector store returns nothing relevant (all distances > 0.8), the LLM will try to answer with no context — which means hallucination. Add a distance threshold and return “I don’t know” when nothing is close enough:

def retrieve_with_threshold(query, collection, n_results=5, max_distance=0.5):
    results = collection.query(query_texts=[query], n_results=n_results)

    filtered = []
    for i in range(len(results["documents"][0])):
        if results["distances"][0][i] <= max_distance:
            filtered.append(results["documents"][0][i])

    if not filtered:
        return None  # Signal "no relevant context found"
    return filtered

Ignoring metadata. Vectors alone lose structure. A chunk from a deprecated 2019 doc and a chunk from the current 2025 doc will have similar embeddings if the content is similar. Always store and filter by metadata — date, source, version, category.

Key Takeaways

  • RAG = retrieve relevant context + stuff it into the prompt + let the LLM generate. It beats fine-tuning for most use cases.
  • Embeddings convert text to vectors that capture meaning. Use the same model for indexing and querying.
  • ChromaDB is the fastest way to prototype a vector store. Pinecone and pgvector are solid production choices.
  • Chunking is the most under-appreciated part of RAG. Bad chunks produce bad retrieval. Target 300-800 characters with 10-20% overlap.
  • The prompt template matters. Tell the LLM to answer only from context and cite sources.
  • Measure retrieval precision, recall, and faithfulness. You cannot improve RAG by guessing.
  • Advanced patterns (hybrid search, reranking, HyDE) are worth adding once the basic pipeline works and you have evaluation metrics to guide you.
  • Always set a distance threshold on retrieval. When nothing matches, say “I don’t know” instead of hallucinating.