Retrieval-Augmented Generation: retrieve context from knowledge base via vector search, include in LLM prompt. Dominant pattern for knowledge-grounded AI.

Docs chunked, embedded as vectors, stored in vector DB. Query embedded, find similar chunks, add to LLM prompt, generate response.

Retrieval-Augmented Generation (RAG)

Q: What's the RAG stack?

Embedding models (OpenAI, Cohere). Vector DBs (Pinecone, Weaviate, pgvector). Orchestration (LangChain, LlamaIndex). Often hybrid + reranking.

Q: When should I use RAG vs fine-tuning vs long context?

RAG for current/changing knowledge. Fine-tuning for behavior/style. Long context for bounded info. Production typically combines a fine-tuned model, RAG, and careful prompting.

Ryan Rutan

Retrieval-Augmented Generation (RAG) is the AI pattern of pulling relevant context from a knowledge base and including it in the LLM prompt at inference time. Retrieval typically uses vector search. RAG allows the model to answer questions or perform tasks using information it wasn't trained on, or that may be more recent than its training data. RAG has been the dominant pattern for knowledge-grounded AI applications since 2023. It's the bridge between general-purpose LLMs and specific organizational knowledge.

How RAG works (the pipeline):

Ingestion: documents (PDFs, web pages, databases, conversations) are chunked into segments.
Embedding: each chunk is converted to a vector (high-dimensional number array) using an embedding model.
Storage: vectors stored in a vector database (Pinecone, Weaviate, pgvector, Chroma, etc.).
Query time: user question is embedded into a vector.
Retrieval: vector database finds the most similar chunks (nearest neighbors).
Generation: relevant chunks are added to the LLM prompt as context; LLM generates response.

Example: a customer support RAG system has all support docs indexed. User asks a question. System retrieves the 5 most relevant doc chunks. Those chunks are included in the prompt. LLM answers using that context.
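
The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory dict stands in for a vector database, and the chunk ids and support texts are made up for the example. The final LLM call is left out; the assembled prompt is what would be sent to the model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: dict[str, str]) -> dict[str, Counter]:
    # Ingestion + embedding + storage: one vector per chunk id.
    return {cid: embed(text) for cid, text in chunks.items()}


def retrieve(query: str, index: dict[str, Counter], k: int = 2) -> list[str]:
    # Query time + retrieval: embed the question, return the k nearest chunk ids.
    q = embed(query)
    ranked = sorted(index, key=lambda cid: cosine(q, index[cid]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: dict[str, str], ids: list[str]) -> str:
    # Generation input: retrieved chunks go into the prompt, tagged with ids
    # so the answer can cite its sources.
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in ids)
    return (
        "Answer using only the context below and cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical support docs, already chunked.
CHUNKS = {
    "refunds-1": "Refunds are issued within 5 business days of approval.",
    "shipping-1": "Standard shipping takes 3 to 7 days.",
    "account-1": "Reset your password from the account settings page.",
}

index = build_index(CHUNKS)
question = "How do I reset my password?"
top_ids = retrieve(question, index, k=2)
prompt = build_prompt(question, CHUNKS, top_ids)
```

Swapping the toy pieces for a real embedding model and vector store changes the function bodies, not the shape of the flow.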

Why RAG matters:

Current information: LLMs have training data cutoffs; RAG provides current info.

Proprietary knowledge: company-specific data that LLMs were never trained on.

Cost efficiency: cheaper to retrieve specific context than to load everything into massive context windows.

Citations: RAG can cite specific source chunks, enabling verification.

Update flexibility: knowledge updates by re-indexing, not retraining.
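
The cost-efficiency point is easiest to see as input-token arithmetic. The numbers below are illustrative assumptions only: 5 retrieved chunks (as in the support example), 500 tokens per chunk, a 50-token question, and a hypothetical 200,000-token knowledge base. The sketch also assumes cost scales roughly with input tokens and ignores retrieval infrastructure, caching, and output tokens.

```python
def rag_prompt_tokens(k: int, chunk_tokens: int, question_tokens: int) -> int:
    # RAG sends only the k retrieved chunks plus the question.
    return k * chunk_tokens + question_tokens


def full_context_prompt_tokens(corpus_tokens: int, question_tokens: int) -> int:
    # Long-context approach sends the whole knowledge base on every query.
    return corpus_tokens + question_tokens


# Hypothetical sizes, chosen only to illustrate the ratio.
rag_tokens = rag_prompt_tokens(k=5, chunk_tokens=500, question_tokens=50)
full_tokens = full_context_prompt_tokens(corpus_tokens=200_000, question_tokens=50)
ratio = full_tokens / rag_tokens  # roughly 78x more input tokens per query
```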

The RAG stack:

Embedding models: OpenAI text-embedding-3, Cohere embed, open-source (BGE, E5, Nomic). Convert text to vectors.

Vector databases: Pinecone, Weaviate, pgvector (PostgreSQL extension), Chroma, Qdrant, Milvus. Store and search vectors.

Orchestration: LangChain, LlamaIndex, Haystack. Wire embedding + retrieval + LLM together.

Hybrid search: combining vector search with keyword search (BM25) often outperforms either alone.

Reranking: a second-stage model re-orders retrieved chunks for better relevance.
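
One way to combine vector and keyword results is reciprocal rank fusion (RRF), sketched below. The article does not prescribe a fusion method, so treat RRF as one common option rather than the required one. The chunk ids are hypothetical, and `k=60` is the smoothing constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of chunk ids, best first.
    # A chunk scores 1 / (k + rank) in every ranking where it appears.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical result lists from a vector search and a BM25 keyword search.
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["c1", "c3", "c9", "c7"]: chunks found by both searches rise to the top
```

RRF works on ranks only, so it avoids having to normalize vector similarity scores against BM25 scores. A reranking model, if used, would then re-order the top of `fused`.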

The chunking decision:

Chunk size: too small loses context; too large wastes context window. Typical: 200-2000 tokens per chunk.

Chunk overlap: 10-20% overlap between adjacent chunks helps boundary context.

Semantic chunking: split on natural boundaries (sections, paragraphs) rather than arbitrary token counts.

Hierarchical chunking: maintain document structure (section → paragraph → sentence).
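
Two of these strategies are simple enough to sketch. The first function does fixed-size chunking with overlap; the second packs whole paragraphs into chunks as a basic form of splitting on natural boundaries. Both count whitespace-separated words as a rough stand-in for tokens, which is an assumption; a real system would count with the tokenizer of its embedding model.

```python
import re


def chunk_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    # Fixed-size windows; each window repeats the last `overlap` words
    # of the previous one so boundary context is not lost.
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return chunks


def chunk_by_paragraph(text: str, max_words: int) -> list[str]:
    # Greedily pack whole paragraphs into chunks of at most max_words.
    # A single paragraph longer than max_words becomes its own oversized chunk.
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


windows = chunk_with_overlap([str(i) for i in range(10)], size=4, overlap=1)
# windows == [["0","1","2","3"], ["3","4","5","6"], ["6","7","8","9"]]

packed = chunk_by_paragraph("a b c\n\nd e\n\nf g h i", max_words=5)
# packed == ["a b c\n\nd e", "f g h i"]
```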

Common RAG patterns:

Naive RAG: single-step retrieval + generation.

Hybrid retrieval: vector + keyword search combined.

Multi-query RAG: generate multiple queries from original question for better retrieval.

Self-querying: LLM generates structured query (filters, metadata) before retrieval.

Iterative retrieval: retrieve, generate partial answer, retrieve again based on what's needed.

Agentic RAG: LLM decides when to retrieve, what to retrieve, when to stop.
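
As an illustration of one of these patterns, here is a rough sketch of multi-query RAG. The `rewrite` and `search` callables are hypothetical placeholders: in practice `rewrite` would be an LLM call that paraphrases the question and `search` would hit the vector store. The stubs below return canned data so the merge logic can be seen in isolation.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    rewrite: Callable[[str], list[str]],
    search: Callable[[str, int], list[str]],
    k: int = 3,
) -> list[str]:
    # Search with the original question plus its rewrites, then merge the
    # results, keeping first-seen order and dropping duplicate chunk ids.
    queries = [question] + rewrite(question)
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for chunk_id in search(query, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Stubs standing in for an LLM rewriter and a vector search (hypothetical data).
def fake_rewrite(question: str) -> list[str]:
    return ["refund processing time", "money back policy"]


FAKE_RESULTS = {
    "how long do refunds take?": ["c1", "c2"],
    "refund processing time": ["c2", "c4"],
    "money back policy": ["c5"],
}


def fake_search(query: str, k: int) -> list[str]:
    return FAKE_RESULTS.get(query.lower(), [])[:k]


merged = multi_query_retrieve("How long do refunds take?", fake_rewrite, fake_search, k=2)
# merged == ["c1", "c2", "c4", "c5"]
```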

What undermines RAG quality:

Bad chunking: arbitrary chunks lose context.

Single-stage retrieval: pure vector search can miss exact-match terms such as product names, IDs, or error codes; combine it with keyword search and reranking.

Stale embeddings: knowledge base changes; embeddings get outdated.

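Embedding model mismatch: different embedding models produce different similarity rankings, and vectors from different models are not comparable, so queries and documents must be embedded with the same model.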

Insufficient retrieval: too few chunks retrieved miss relevant info.

Excess retrieval: too many chunks dilute context with noise.
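
Stale embeddings in particular can be caught mechanically. A minimal sketch, assuming the index stores a content hash next to each document's vectors (the `indexed_hashes` mapping and document ids here are hypothetical): compare hashes of the current documents against the stored ones and re-embed only what changed.

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of a document's current text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    # Returns (ids to embed again, ids to delete from the index).
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_embed, to_delete


# Hypothetical state: "a" was edited, "c" is new, "gone" was removed upstream.
indexed_hashes = {
    "a": content_hash("old text"),
    "b": content_hash("unchanged text"),
    "gone": content_hash("deleted doc"),
}
current_docs = {"a": "new text", "b": "unchanged text", "c": "brand new doc"}

to_embed, to_delete = plan_reindex(current_docs, indexed_hashes)
# to_embed == ["a", "c"], to_delete == ["gone"]
```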

The RAG vs fine-tuning vs long-context decision:

RAG: best for current/large/changing knowledge bases. Per-query cost-efficient. Citable.

Fine-tuning: best for behavior, style, format consistency. Higher upfront cost.

Long context: best when relevant info is bounded and fits. Simpler architecture.

Combining approaches: production AI typically uses a fine-tuned model + RAG + careful prompting.

The 2024-2026 RAG evolution:

Agentic RAG: LLM as orchestrator deciding when/what to retrieve.

Graph RAG: knowledge graphs added to or replacing pure vector search.

Hybrid approaches: combining structured (SQL, knowledge graphs) and unstructured (vector) retrieval.

Evaluation tools: RAGAS, TruLens for measuring retrieval quality and answer faithfulness.
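
Retrieval quality can also be measured without any framework. The sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank, in plain Python. It does not use the RAGAS or TruLens APIs, and the test cases (retrieved ids plus hand-labeled relevant ids) are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunks that appear in the top k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant chunk, or 0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    # Average both metrics over a set of labeled questions.
    n = len(cases)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
    }


# Each case: (chunk ids the retriever returned, chunk ids a human marked relevant).
cases = [
    (["c1", "c4", "c2"], {"c2", "c9"}),  # recall@3 = 0.5, reciprocal rank = 1/3
    (["c7", "c3"], {"c7"}),              # recall@3 = 1.0, reciprocal rank = 1
]

report = evaluate(cases, k=3)
```

Scoring retrieval on its own like this keeps it separate from generation quality, so a bad answer can be traced to the stage that caused it.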

Ryan's Take

RAG is the default for any production AI working with your proprietary or current data, and most of its failures are retrieval, not the model. Invest in chunking (almost everyone underinvests), use hybrid search (vector plus keyword plus reranking), and evaluate retrieval quality separately from generation quality. When answers are bad, check retrieval before you blame the LLM. Treat it as a first-class engineering discipline with a data flywheel, not a dump-the-docs-in-a-vector-DB-and-pray project.

What founders get wrong: treating RAG as "just throw docs in a vector DB." Production RAG requires careful chunking, hybrid retrieval, evaluation, and iteration. The right discipline: invest in retrieval quality measurement, iterate on chunking and retrieval strategy, and build an evaluation harness.

FAQ

What is RAG (Retrieval-Augmented Generation)?
RAG is the AI architecture pattern of retrieving relevant context from a knowledge base (typically via vector search) and including it in LLM prompts at inference time. It allows the model to use information it wasn't trained on, and it has been the dominant pattern for knowledge-grounded AI applications since 2023.

How does RAG work?
Documents are chunked into segments, each chunk is embedded as a vector, and the vectors are stored in a vector database. At query time, the question is embedded, the vector DB finds the most similar chunks, those chunks are added to the LLM prompt as context, and the LLM generates a response using that context.

What's the RAG stack?
Embedding models (OpenAI, Cohere, open-source like BGE/E5). Vector databases (Pinecone, Weaviate, pgvector, Chroma, Qdrant). Orchestration (LangChain, LlamaIndex). Often includes hybrid search (vector + keyword) and reranking.

When should I use RAG vs fine-tuning vs long context?
RAG: current/large/changing knowledge bases. Fine-tuning: consistent behavior/style. Long context: bounded info that fits. Production AI typically combines a fine-tuned model + RAG + careful prompting.
