Module 5: Core RAG

The RAG Architecture

"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — Lewis et al., 2020 (Facebook AI Research)

RAG solves a fundamental problem: LLMs have a knowledge cutoff and can hallucinate facts. By pairing them with a retrieval system, we ground responses in real, verifiable, up-to-date documents.

The Core Problem: LLM Hallucination

LLMs generate plausible-sounding text — but plausibility ≠ accuracy. Without grounding, a model confidently fabricates:

  • Wrong citations and statistics
  • Outdated facts
  • Non-existent APIs and functions

RAG Architecture Overview

User Query
    ↓
[Embedding Model] → query vector
    ↓
[Vector DB] → top-k relevant chunks
    ↓
[Context Construction] → system prompt + retrieved docs + query
    ↓
[LLM] → grounded response
    ↓
Answer (with source citations)

Two distinct phases:

Indexing (offline, run once and re-run when documents change)

  1. Load documents (PDF, HTML, markdown, databases)
  2. Chunk documents into segments
  3. Embed each chunk with an embedding model
  4. Store vectors + metadata in a vector database

Retrieval + Generation (online, per query)

  1. Embed the user query
  2. Search the vector DB for top-k similar chunks
  3. Construct a prompt with retrieved context
  4. Call the LLM with context
  5. Return the grounded response
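The query-time path can be sketched under the same toy assumptions (a hash-based `toy_embed` in place of a real embedding model, an in-memory list as the vector store, and a placeholder where the LLM call in step 4 would go):

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hash-based stand-in for a real embedding model (unit-length output)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # dot product of unit vectors

# Pretend this index was built during the offline indexing phase.
index = [(toy_embed(t), t) for t in [
    "RAG grounds answers in retrieved documents.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases store embeddings for similarity search.",
]]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda rec: cosine(q, rec[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return ("Answer using ONLY the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

query = "How does RAG ground answers?"
prompt = build_prompt(query, retrieve(query))
# The prompt would now be sent to the LLM (step 4); here we just print it.
print(prompt)
```

The cosine function reduces to a dot product only because `toy_embed` returns unit vectors; with unnormalized embeddings you would divide by the norms.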

Why RAG Beats Pure Fine-Tuning for Knowledge

| Approach                   | Knowledge Update          | Cost   | Hallucination Risk |
|----------------------------|---------------------------|--------|--------------------|
| RAG                        | Add docs to DB (minutes)  | Low    | Low (grounded)     |
| Fine-tuning                | Retrain model (hours to days) | High | Medium           |
| In-context (no retrieval)  | Paste docs in prompt      | Medium | High               |

RAG is especially powerful for:

  • Private/proprietary knowledge (internal docs, customer data)
  • Frequently updated information (news, documentation, pricing)
  • Long-tail queries where the LLM's training data is sparse

The Faithfulness Challenge

RAG reduces but doesn't eliminate hallucination. The model may:

  • Ignore retrieved context and generate from memory
  • Contradict the retrieved context
  • Misinterpret retrieved passages
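One common (partial) mitigation for these failure modes is an explicit grounding instruction in the system prompt: answer only from the retrieved context, cite it, and refuse when the context is insufficient. A sketch of such a template (the wording is illustrative, not a standard):

```python
GROUNDED_SYSTEM_PROMPT = """\
You are a question-answering assistant.
Rules:
1. Answer ONLY from the numbered context passages below.
2. Cite the passage number, e.g. [2], after each claim.
3. If the context does not contain the answer, reply:
   "I don't know based on the provided documents."
"""

def format_context(chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it."""
    return "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))

print(GROUNDED_SYSTEM_PROMPT + "\n" + format_context([
    "RAG pairs an LLM with a retriever.",
    "Chunks are embedded and stored in a vector DB.",
]))
```

Numbered passages make citations checkable downstream: an evaluator can verify that each cited passage actually supports the claim.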

Key metrics:

  • Faithfulness: Is the answer entailed by the retrieved context? (vs. hallucinated)
  • Answer Relevance: Does the answer actually address the question?
  • Context Relevance: Are the retrieved chunks actually relevant?

The RAGAS framework provides automated metrics for all three.
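To make the faithfulness idea concrete, here is a crude lexical-overlap proxy: score each answer sentence by whether most of its content words appear in the retrieved context. RAGAS itself uses an LLM judge (decomposing the answer into claims and checking entailment against the context), so treat this purely as an illustration of the metric's shape, not as how RAGAS computes it.

```python
import re

def faithfulness_proxy(answer: str, context: str) -> float:
    """Fraction of answer sentences whose content words mostly appear
    in the context. A toy stand-in for LLM-judged claim entailment."""
    ctx_words = set(re.findall(r"[a-z]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    supported = 0
    for s in sentences:
        words = [w for w in re.findall(r"[a-z]+", s.lower()) if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= 0.5:
            supported += 1
    return supported / len(sentences) if sentences else 0.0

context = "RAG retrieves documents and grounds the answer in them."
faithful = "RAG grounds the answer in retrieved documents."
hallucinated = "RAG was invented in 1987 by telephone engineers."
print(faithfulness_proxy(faithful, context))      # high
print(faithfulness_proxy(hallucinated, context))  # low
```

Lexical overlap misses paraphrases and can be fooled by keyword-stuffed hallucinations, which is exactly why RAGAS moves the support check to an LLM judge.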