Module 7: Fine-Tuning & Alignment

Fine-Tuning Decision Framework & Dataset Creation

The Three-Way Choice

Before writing a single line of training code, understand the full option space:

| Approach | When to Use | Cost | Latency | Data Required |
| --- | --- | --- | --- | --- |
| Prompt Engineering | General tasks, fast iteration | $ | Low | None |
| RAG | Knowledge-intensive, facts change frequently | $$ | Medium | Document corpus |
| Fine-Tuning | Style/format consistency, domain vocabulary, max performance | $$$ | Low (short prompts) | 500–50K labeled examples |
| RAG + Fine-Tuning | Complex domain with changing knowledge | $$$$ | Medium | Both |

Decision Checklist

Before fine-tuning, verify ALL of the following:

  1. Prompt engineering has plateaued — You've tried chain-of-thought, few-shot (10+ examples), system prompts, and output format specifications.
  2. Data exists — You have at least 500 high-quality labeled examples (or can create them).
  3. Task is well-defined — The input-output mapping is consistent and unambiguous.
  4. Evaluation is ready — You can measure success automatically (not just "vibes").
  5. Budget allows — GPU time, labeling cost, and ongoing maintenance are acceptable.
  6. Knowledge is stable — If the domain changes weekly, RAG may be better.
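The checklist can be encoded as a small helper. This is an illustrative sketch (the function name, the 500-example floor, and the return labels are my own, not from any library); it mirrors the six checks above:

```python
def recommend_approach(
    prompting_plateaued: bool,
    labeled_examples: int,
    task_well_defined: bool,
    has_automatic_eval: bool,
    budget_ok: bool,
    knowledge_stable: bool,
) -> str:
    """Return a coarse recommendation based on the six checks above."""
    if not knowledge_stable:
        return "rag"  # facts change weekly: retrieval beats retraining
    checks = [
        prompting_plateaued,
        labeled_examples >= 500,  # minimum data floor from the checklist
        task_well_defined,
        has_automatic_eval,
        budget_ok,
    ]
    return "fine-tune" if all(checks) else "prompt-engineering"
```

For example, `recommend_approach(True, 800, True, True, True, True)` returns `"fine-tune"`, while any failed check falls back to cheaper options.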

Fine-Tuning Objectives

| Objective | Description | Example |
| --- | --- | --- |
| Instruction Following | Teach model to follow a format | JSON output, structured reports |
| Domain Adaptation | Inject domain vocabulary/knowledge | Medical, legal, finance |
| Style Transfer | Match a specific writing style | Brand voice, tone of voice |
| Task Specialization | Optimize for one task | SQL generation, code review |
| Safety Alignment | Reduce harmful outputs | Constitutional AI, RLAIF |
| Distillation | Transfer large-model knowledge to a small model | Teacher-student training |

Catastrophic Forgetting

Full fine-tuning updates all weights — the model can forget general capabilities it learned during pre-training. Mitigations:

  • PEFT (train <1% of weights) — Almost eliminates forgetting
  • Replay — Mix original pre-training data into fine-tuning batches
  • EWC (Elastic Weight Consolidation) — Penalize changes to weights important for old tasks
  • Lower learning rate — 1e-5 vs 1e-3 reduces magnitude of weight changes
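The replay mitigation can be sketched in a few lines. This is illustrative, assuming in-memory lists of examples (real pipelines stream sharded datasets); the function name and the `replay_frac` parameter are my own:

```python
import random

def replay_batches(finetune_data, pretrain_data,
                   batch_size=8, replay_frac=0.25, seed=0):
    """Yield batches where ~replay_frac of each batch is drawn from the
    original pre-training corpus, so the model keeps seeing old data."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    n_new = batch_size - n_replay
    for start in range(0, len(finetune_data), n_new):
        batch = list(finetune_data[start:start + n_new])
        # mix in replayed pre-training examples
        batch += rng.sample(pretrain_data, k=min(n_replay, len(pretrain_data)))
        rng.shuffle(batch)
        yield batch
```

With `batch_size=8` and `replay_frac=0.25`, each batch carries two pre-training examples alongside six fine-tuning examples.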

Compute Requirements

| Model Size | Full FT (FP32) | Full FT (BF16) | QLoRA (4-bit) |
| --- | --- | --- | --- |
| 7B | 112 GB | 56 GB | ~12 GB |
| 13B | 208 GB | 104 GB | ~20 GB |
| 70B | ~1.1 TB | ~560 GB | ~48 GB |
| 405B | ~6.5 TB | ~3.2 TB | ~280 GB |
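The FP32 and BF16 columns follow a simple rule of thumb: weights + gradients + Adam optimizer states, ignoring activations. That works out to roughly 16 bytes per parameter in FP32 and 8 bytes in the BF16 setup the table assumes. A quick estimator, as a sketch (the per-parameter byte counts are the table's simplification, not exact measurements):

```python
def full_ft_memory_gb(n_params_billion: float, bytes_per_param: int) -> float:
    """Rough full fine-tuning memory: weights + gradients + Adam states.
    FP32: 4 (weights) + 4 (grads) + 8 (Adam m, v) = 16 bytes/param.
    BF16 setup as in the table: ~8 bytes/param. Activations excluded."""
    # 1e9 params * bytes / 1e9 bytes-per-GB cancels out
    return n_params_billion * bytes_per_param
```

For example, `full_ft_memory_gb(7, 16)` gives 112 GB, matching the FP32 column, and `full_ft_memory_gb(70, 16)` gives 1120 GB, i.e. ~1.1 TB.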

Dataset Creation: The LIMA Principle

LIMA (Less Is More for Alignment, Zhou et al., 2023): a model fine-tuned on 1,000 carefully curated examples outperformed models trained on 52,000+ examples. Quality > Quantity.

Key insight: The model learns how to respond from fine-tuning data, but what it knows comes from pre-training. Focus on format, style, and structure, not on cramming facts.

Dataset Formats

// Alpaca-style (instruction + optional input + output)
{
  "instruction": "Summarize the key findings",
  "input": "The study examined 500 patients...",
  "output": "Key findings: (1) 73% showed improvement..."
}

// ShareGPT-style (multi-turn conversation)
{
  "conversations": [
    {"from": "human", "value": "Explain gradient descent"},
    {"from": "gpt", "value": "Gradient descent is an optimization..."},
    {"from": "human", "value": "What about momentum?"},
    {"from": "gpt", "value": "Momentum builds on gradient descent by..."}
  ]
}
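The two formats are interconvertible. A minimal converter sketch, assuming the common convention of appending the optional `input` field to the instruction (the function name is my own):

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Convert one Alpaca-style record into a ShareGPT-style conversation."""
    human = record["instruction"]
    if record.get("input"):
        # common convention: fold the optional context into the user turn
        human += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": record["output"]},
        ]
    }
```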

Data Sourcing Strategies

| Strategy | Description | Quality | Scale | Cost |
| --- | --- | --- | --- | --- |
| Human annotation | Expert contractors write examples | ★★★★★ | Low | High |
| Web scraping | Filter high-quality web text | ★★★ | High | Low |
| Self-Instruct | LLM generates instruction-response pairs | ★★★★ | High | Medium |
| Evol-Instruct | LLM rewrites simple instructions into harder ones | ★★★★ | High | Medium |
| Synthetic from teacher | GPT-4/Claude generates training data | ★★★★ | High | Medium |
| Existing datasets | Alpaca, FLAN, Dolly, OpenHermes | ★★★ | High | Free |
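The Evol-Instruct idea can be sketched as a prompt template that asks a teacher model to rewrite a seed instruction into a harder variant. The wording below is illustrative, not the paper's exact prompt:

```python
EVOLVE_TEMPLATE = (
    "Rewrite the following instruction into a more challenging version.\n"
    "Add one extra constraint or require deeper reasoning, "
    "but keep it answerable.\n\n"
    "Original instruction:\n{instruction}\n\n"
    "Rewritten instruction:"
)

def build_evolve_prompt(instruction: str) -> str:
    """Wrap a seed instruction in the evolution prompt for a teacher model."""
    return EVOLVE_TEMPLATE.format(instruction=instruction)
```

In practice you would send `build_evolve_prompt(seed)` to a strong teacher model, then keep only outputs that pass your quality filters.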

Quality Filters (Non-Negotiable)

Before training, filter your dataset:

  • Length filter — Remove too-short (<50 tokens) or too-long (>4096 tokens) examples
  • Deduplication — MinHash or embedding-based similarity (removes up to 30% of typical datasets)
  • Quality classifier — Train a classifier on human-labeled "good vs bad" examples
  • Toxicity filter — Remove harmful content (Perspective API, Llama Guard)
  • Format validation — Ensure consistent structure (check JSON validity, etc.)
  • Human review — Sample 5% randomly and manually inspect