Module 3: Generative AI

How LLMs Generate Text

Large Language Models generate text through a deceptively simple process: predict the probability distribution over the next token, sample from it, append the token, and repeat.

The Generation Loop

prompt → [tokenize] → input_ids
loop:
  input_ids → [model forward pass] → logits (vocab_size)
  logits → [sampling] → next_token_id
  append next_token_id → repeat until <EOS> or max_length
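
Below is a minimal sketch of this loop in Python using Hugging Face transformers with greedy decoding; the model name "gpt2" and the 20-token limit are just illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# prompt → tokenize → input_ids
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                               # max_length guard
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
        next_token_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        input_ids = torch.cat([input_ids, next_token_id], dim=-1)       # append and repeat
        if next_token_id.item() == tokenizer.eos_token_id:              # stop on <EOS>
            break

print(tokenizer.decode(input_ids[0]))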

Despite the simplicity of this loop, the next-token distribution learned from trillions of tokens of text encodes enormous world knowledge.

Sampling Strategies

Greedy Search

Always pick the highest-probability token. Fast but produces repetitive, boring text.

Temperature Scaling

Divide logits by temperature τ before softmax:

p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}

  • τ < 1: Sharper distribution → more deterministic
  • τ > 1: Flatter distribution → more creative/random
  • τ = 1: Standard distribution
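
Here is a small NumPy sketch of temperature sampling; the logits vector and the τ values are made-up numbers purely for illustration.

import numpy as np

def sample_with_temperature(logits, tau=1.0):
    scaled = logits / tau                             # divide logits by temperature
    scaled = scaled - scaled.max()                    # shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()     # softmax
    return np.random.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_with_temperature(logits, tau=0.5))   # sharper: almost always token 0
print(sample_with_temperature(logits, tau=2.0))   # flatter: other tokens appear more often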

Top-k Sampling

Sample only from the k highest-probability tokens. Prevents low-probability tokens from being selected.
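
A sketch of top-k sampling in NumPy (k=3 and the logits are arbitrary example values):

import numpy as np

def top_k_sample(logits, k=3):
    top_idx = np.argsort(logits)[-k:]                 # indices of the k highest-logit tokens
    shifted = logits[top_idx] - logits[top_idx].max()
    probs = np.exp(shifted) / np.exp(shifted).sum()   # renormalize over the top k
    return top_idx[np.random.choice(k, p=probs)]

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_k_sample(logits, k=3))   # only tokens 0, 1, 2 can ever be selected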

Top-p (Nucleus) Sampling

Sample from the smallest set of tokens whose cumulative probability exceeds p.

  • More adaptive than top-k: the effective cutoff is large when the distribution is flat and small when it is peaked.
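
A corresponding NumPy sketch of nucleus sampling (p=0.9 is an illustrative choice):

import numpy as np

def top_p_sample(logits, p=0.9):
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()        # full softmax
    order = np.argsort(probs)[::-1]                        # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1            # smallest prefix with mass > p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize over the nucleus
    return np.random.choice(nucleus, p=nucleus_probs)

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(top_p_sample(logits, p=0.9))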

Beam Search

Maintain B candidate sequences in parallel: at each step, expand every candidate, keep only the B highest-scoring, and return the best complete sequence at the end. Used in translation and summarization; less common for open-ended chat.
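
A toy beam-search sketch; step_logprobs is a hypothetical stand-in for the model's per-step log-probabilities, and the beam size of 3 is arbitrary.

import numpy as np

def beam_search(step_logprobs, vocab_size, beam_size=3, max_len=5):
    # step_logprobs(seq) returns log-probabilities over the vocabulary for the next token.
    beams = [([], 0.0)]                               # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            logp = step_logprobs(seq)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + logp[tok]))
        # Keep only the B highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                # best sequence found

# Hypothetical scorer: a fixed distribution, just to make the sketch runnable.
fake_logprobs = np.log(np.array([0.5, 0.3, 0.2]))
print(beam_search(lambda seq: fake_logprobs, vocab_size=3))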

The Context Window

LLMs have a fixed context window (e.g., 200K tokens for Claude 3.5 Sonnet). Everything in the context (system prompt, conversation history, retrieved documents) counts against this limit.

KV cache: During generation, the key and value tensors for every token processed so far are cached. Each new token then needs only its own forward pass through the layers, attending to the cached keys and values, instead of reprocessing the whole sequence from scratch. Attention cost still grows with the number of cached tokens, but incremental decoding becomes far cheaper than re-running the full prompt at every step.
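
In Hugging Face transformers the cache is exposed as past_key_values; here is a rough sketch of incremental decoding with it (the model name and token count are illustrative).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids, use_cache=True)            # full pass over the prompt; K/V cached
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    for _ in range(10):
        # Feed only the newest token; previous keys/values come from the cache.
        out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))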

Emergent Capabilities

GPT-3 (175B) exhibited capabilities absent in GPT-2 (1.5B):

  • Few-shot learning: learn new tasks from 2-5 examples given in the prompt
  • Chain-of-thought: step-by-step reasoning improves accuracy on multi-step problems
  • Code generation: write and debug working code across languages
  • Instruction following: understand and execute natural-language instructions

These abilities appear to emerge discontinuously: they are absent at small scale and show up abruptly once models cross a size threshold. This observation is often cited in support of the scaling hypothesis, the idea that training larger models on more data yields qualitatively new capabilities.

Why "Generative"?

A generative model learns the joint distribution P(X) over all possible sequences, not just a discriminative mapping P(Y|X). This allows it to create new sequences that are plausible given learned patterns — not just classify or retrieve.
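
As a toy illustration of "learning P(X) and generating from it", here is a bigram model trained on a made-up two-sentence corpus; chaining its learned conditionals produces new sequences rather than classifying existing ones.

import random
from collections import defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigram transitions: a crude estimate of P(next word | current word).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start="the", max_len=10):
    word, out = start, [start]
    for _ in range(max_len):
        options = list(counts[word])
        weights = [counts[word][w] for w in options]
        word = random.choices(options, weights=weights)[0]   # sample the next word
        out.append(word)
        if word == ".":                                      # toy end-of-sequence marker
            break
    return " ".join(out)

print(generate())   # e.g. "the dog sat on the mat ."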