Module 5: Core RAG

The RAG Architecture

"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — Lewis et al., 2020 (Facebook AI Research)

RAG solves a fundamental problem: LLMs have a knowledge cutoff and can hallucinate facts. By pairing them with a retrieval system, we ground responses in real, verifiable, up-to-date documents.

The Core Problem: LLM Hallucination

LLMs generate plausible-sounding text — but plausibility ≠ accuracy. Without grounding, a model confidently fabricates:

  • Wrong citations and statistics
  • Outdated facts
  • Non-existent APIs and functions

RAG Architecture Overview

User Query
    ↓
[Embedding Model] → query vector
    ↓
[Vector DB] → top-k relevant chunks
    ↓
[Context Construction] → system prompt + retrieved docs + query
    ↓
[LLM] → grounded response
    ↓
Answer (with source citations)

Two distinct phases:

Indexing (offline, run once and re-run when documents change)

  1. Load documents (PDF, HTML, markdown, databases)
  2. Chunk documents into segments
  3. Embed each chunk with an embedding model
  4. Store vectors + metadata in a vector database

Retrieval + Generation (online, per query)

  1. Embed the user query
  2. Search the vector DB for top-k similar chunks
  3. Construct a prompt with retrieved context
  4. Call the LLM with context
  5. Return the grounded response
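The query-time path can be sketched under the same toy assumptions (a hash-based `toy_embed` in place of a real embedding model, an in-memory list as the vector store, and a placeholder where the LLM call in step 4 would go):

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hash-based stand-in for a real embedding model (unit-length output)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # dot product of unit vectors

# Pretend this index was built during the offline indexing phase.
index = [(toy_embed(t), t) for t in [
    "RAG grounds answers in retrieved documents.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases store embeddings for similarity search.",
]]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda rec: cosine(q, rec[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return ("Answer using ONLY the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

query = "How does RAG ground answers?"
prompt = build_prompt(query, retrieve(query))
# The prompt would now be sent to the LLM (step 4); here we just print it.
print(prompt)
```

The cosine function reduces to a dot product only because `toy_embed` returns unit vectors; with unnormalized embeddings you would divide by the norms.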

Why RAG Beats Pure Fine-Tuning for Knowledge

| Approach                   | Knowledge Update          | Cost   | Hallucination Risk |
|----------------------------|---------------------------|--------|--------------------|
| RAG                        | Add docs to DB (minutes)  | Low    | Low (grounded)     |
| Fine-tuning                | Retrain model (hours to days) | High | Medium           |
| In-context (no retrieval)  | Paste docs in prompt      | Medium | High               |

RAG is especially powerful for:

  • Private/proprietary knowledge (internal docs, customer data)
  • Frequently updated information (news, documentation, pricing)
  • Long-tail queries where the LLM's training data is sparse

The Faithfulness Challenge

RAG reduces but doesn't eliminate hallucination. The model may:

  • Ignore retrieved context and generate from memory
  • Contradict the retrieved context
  • Misinterpret retrieved passages
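One common (partial) mitigation for these failure modes is an explicit grounding instruction in the system prompt: answer only from the retrieved context, cite it, and refuse when the context is insufficient. A sketch of such a template (the wording is illustrative, not a standard):

```python
GROUNDED_SYSTEM_PROMPT = """\
You are a question-answering assistant.
Rules:
1. Answer ONLY from the numbered context passages below.
2. Cite the passage number, e.g. [2], after each claim.
3. If the context does not contain the answer, reply:
   "I don't know based on the provided documents."
"""

def format_context(chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it."""
    return "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))

print(GROUNDED_SYSTEM_PROMPT + "\n" + format_context([
    "RAG pairs an LLM with a retriever.",
    "Chunks are embedded and stored in a vector DB.",
]))
```

Numbered passages make citations checkable downstream: an evaluator can verify that each cited passage actually supports the claim.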

Key metrics:

  • Faithfulness: Is the answer entailed by the retrieved context? (vs. hallucinated)
  • Answer Relevance: Does the answer actually address the question?
  • Context Relevance: Are the retrieved chunks actually relevant?

The RAGAS framework provides automated metrics for all three.
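To make the faithfulness idea concrete, here is a crude lexical-overlap proxy: score each answer sentence by whether most of its content words appear in the retrieved context. RAGAS itself uses an LLM judge (decomposing the answer into claims and checking entailment against the context), so treat this purely as an illustration of the metric's shape, not as how RAGAS computes it.

```python
import re

def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of answer sentences whose content words mostly appear
    in the context. A toy stand-in for LLM-judged claim entailment."""
    ctx_words = set(re.findall(r"[a-z]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    supported = 0
    for s in sentences:
        words = [w for w in re.findall(r"[a-z]+", s.lower()) if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= 0.5:
            supported += 1
    return supported / len(sentences) if sentences else 0.0

context = "RAG retrieves documents and grounds the answer in them."
faithful = "RAG grounds the answer in retrieved documents."
hallucinated = "RAG was invented in 1987 by telephone engineers."
print(faithfulness_proxy(faithful, context))      # high
print(faithfulness_proxy(hallucinated, context))  # low
```

Lexical overlap misses paraphrases and can be fooled by keyword-stuffed hallucinations, which is exactly why RAGAS moves the support check to an LLM judge.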