Module 7: Fine-Tuning & Alignment

Fine-Tuning Decision Framework & Dataset Creation

The Three-Way Choice

Before writing a single line of training code, understand the full option space:

| Approach | When to Use | Cost | Latency | Data Required |
| --- | --- | --- | --- | --- |
| Prompt Engineering | General tasks, fast iteration | $ | Low | None |
| RAG | Knowledge-intensive, facts change frequently | $$ | Medium | Document corpus |
| Fine-Tuning | Style/format consistency, domain vocabulary, max performance | $$$ | Low (short prompts) | 500–50K labeled examples |
| RAG + Fine-Tuning | Complex domain with changing knowledge | $$$$ | Medium | Both |

Decision Checklist

Before fine-tuning, verify ALL of the following:

  1. Prompt engineering has plateaued — You've tried chain-of-thought, few-shot (10+ examples), system prompts, and output format specifications.
  2. Data exists — You have at least 500 high-quality labeled examples (or can create them).
  3. Task is well-defined — The input-output mapping is consistent and unambiguous.
  4. Evaluation is ready — You can measure success automatically (not just "vibes").
  5. Budget allows — GPU time, labeling cost, and ongoing maintenance are acceptable.
  6. Knowledge is stable — If the domain changes weekly, RAG may be better.
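The checklist can be encoded as a small helper. This is an illustrative sketch (the function name, the 500-example floor, and the return labels are my own, not from any library); it mirrors the six checks above:

```python
def recommend_approach(
    prompting_plateaued: bool,
    labeled_examples: int,
    task_well_defined: bool,
    has_automatic_eval: bool,
    budget_ok: bool,
    knowledge_stable: bool,
) -> str:
    """Return a coarse recommendation based on the six checks above."""
    if not knowledge_stable:
        return "rag"  # facts change weekly: retrieval beats retraining
    checks = [
        prompting_plateaued,
        labeled_examples >= 500,  # minimum data floor from the checklist
        task_well_defined,
        has_automatic_eval,
        budget_ok,
    ]
    return "fine-tune" if all(checks) else "prompt-engineering"
```

For example, `recommend_approach(True, 800, True, True, True, True)` returns `"fine-tune"`, while any failed check falls back to cheaper options.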

Fine-Tuning Objectives

| Objective | Description | Example |
| --- | --- | --- |
| Instruction Following | Teach model to follow a format | JSON output, structured reports |
| Domain Adaptation | Inject domain vocabulary/knowledge | Medical, legal, finance |
| Style Transfer | Match a specific writing style | Brand voice, tone of voice |
| Task Specialization | Optimize for one task | SQL generation, code review |
| Safety Alignment | Reduce harmful outputs | Constitutional AI, RLAIF |
| Distillation | Transfer large-model knowledge to a small model | Teacher-student training |

Catastrophic Forgetting

Full fine-tuning updates all weights — the model can forget general capabilities it learned during pre-training. Mitigations:

  • PEFT (train <1% of weights) — Almost eliminates forgetting
  • Replay — Mix original pre-training data into fine-tuning batches
  • EWC (Elastic Weight Consolidation) — Penalize changes to weights important for old tasks
  • Lower learning rate — 1e-5 vs 1e-3 reduces magnitude of weight changes
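The replay mitigation can be sketched in a few lines. This is illustrative, assuming in-memory lists of examples (real pipelines stream sharded datasets); the function name and the `replay_frac` parameter are my own:

```python
import random

def replay_batches(finetune_data, pretrain_data,
                   batch_size=8, replay_frac=0.25, seed=0):
    """Yield batches where ~replay_frac of each batch is drawn from the
    original pre-training corpus, so the model keeps seeing old data."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_new = batch_size - n_replay
    for start in range(0, len(finetune_data), n_new):
        batch = list(finetune_data[start:start + n_new])
        # mix in replayed pre-training examples
        batch += rng.sample(pretrain_data, k=min(n_replay, len(pretrain_data)))
        rng.shuffle(batch)
        yield batch
```

With `batch_size=8` and `replay_frac=0.25`, each batch carries two pre-training examples alongside six fine-tuning examples.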

Compute Requirements

| Model Size | Full FT (FP32) | Full FT (BF16) | QLoRA (4-bit) |
| --- | --- | --- | --- |
| 7B | 112 GB | 56 GB | ~12 GB |
| 13B | 208 GB | 104 GB | ~20 GB |
| 70B | ~1.1 TB | ~560 GB | ~48 GB |
| 405B | ~6.5 TB | ~3.2 TB | ~280 GB |
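The FP32 and BF16 columns follow a simple rule of thumb: weights + gradients + Adam optimizer states, ignoring activations. That works out to roughly 16 bytes per parameter in FP32 and 8 bytes in the BF16 setup the table assumes. A quick estimator, as a sketch (the per-parameter byte counts are the table's simplification, not exact measurements):

```python
def full_ft_memory_gb(n_params_billion: float, bytes_per_param: int) -> float:
    """Rough full fine-tuning memory: weights + gradients + Adam states.
    FP32: 4 (weights) + 4 (grads) + 8 (Adam m, v) = 16 bytes/param.
    BF16 setup as in the table: ~8 bytes/param. Activations excluded."""
    # 1e9 params * bytes / 1e9 bytes-per-GB cancels out
    return n_params_billion * bytes_per_param
```

For example, `full_ft_memory_gb(7, 16)` gives 112 GB, matching the FP32 column, and `full_ft_memory_gb(70, 16)` gives 1120 GB, i.e. ~1.1 TB.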

Dataset Creation: The LIMA Principle

LIMA (Less Is More for Alignment, Zhou et al., 2023): a model fine-tuned on 1,000 carefully curated examples outperformed models trained on 52,000+ examples. Quality > Quantity.

Key insight: The model learns how to respond from fine-tuning data, but what it knows comes from pre-training. Focus on format, style, and structure, not on cramming facts.

Dataset Formats

// Alpaca-style (instruction + optional input + output)
{
  "instruction": "Summarize the key findings",
  "input": "The study examined 500 patients...",
  "output": "Key findings: (1) 73% showed improvement..."
}

// ShareGPT-style (multi-turn conversation)
{
  "conversations": [
    {"from": "human", "value": "Explain gradient descent"},
    {"from": "gpt", "value": "Gradient descent is an optimization..."},
    {"from": "human", "value": "What about momentum?"},
    {"from": "gpt", "value": "Momentum builds on gradient descent by..."}
  ]
}
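The two formats are interconvertible. A minimal converter sketch, assuming the common convention of appending the optional `input` field to the instruction (the function name is my own):

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Convert one Alpaca-style record into a ShareGPT-style conversation."""
    human = record["instruction"]
    if record.get("input"):
        # common convention: fold the optional context into the user turn
        human += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```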

Data Sourcing Strategies

| Strategy | Description | Quality | Scale | Cost |
| --- | --- | --- | --- | --- |
| Human annotation | Expert contractors write examples | ★★★★★ | Low | High |
| Web scraping | Filter high-quality web text | ★★★ | High | Low |
| Self-Instruct | LLM generates instruction-response pairs | ★★★★ | High | Medium |
| Evol-Instruct | LLM rewrites simple instructions into harder ones | ★★★★ | High | Medium |
| Synthetic from teacher | GPT-4/Claude generates training data | ★★★★ | High | Medium |
| Existing datasets | Alpaca, FLAN, Dolly, OpenHermes | ★★★ | High | Free |
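The Evol-Instruct idea can be sketched as a prompt template that asks a teacher model to rewrite a seed instruction into a harder variant. The wording below is illustrative, not the paper's exact prompt:

```python
EVOLVE_TEMPLATE = (
    "Rewrite the following instruction into a more challenging version.\n"
    "Add one extra constraint or require deeper reasoning, "
    "but keep it answerable.\n\n"
    "Original instruction:\n{instruction}\n\n"
    "Rewritten instruction:"
)

def build_evolve_prompt(instruction: str) -> str:
    """Wrap a seed instruction in the evolution prompt for a teacher model."""
    return EVOLVE_TEMPLATE.format(instruction=instruction)
```

In practice you would send `build_evolve_prompt(seed)` to a strong teacher model, then keep only outputs that pass your quality filters.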

Quality Filters (Non-Negotiable)

Before training, filter your dataset:

  • Length filter — Remove too-short (<50 tokens) or too-long (>4096 tokens) examples
  • Deduplication — MinHash or embedding-based similarity (removes up to 30% of typical datasets)
  • Quality classifier — Train a classifier on human-labeled "good vs bad" examples
  • Toxicity filter — Remove harmful content (Perspective API, Llama Guard)
  • Format validation — Ensure consistent structure (check JSON validity, etc.)
  • Human review — Sample 5% randomly and manually inspect