Module 3: Generative AI
How LLMs Generate Text
Large Language Models generate text through a deceptively simple process: predict the probability distribution over the next token, sample from it, append the token, and repeat.
The Generation Loop
```
prompt → [tokenize] → input_ids
loop:
    input_ids → [model forward pass] → logits (vocab_size)
    logits → [sampling] → next_token_id
    append next_token_id → repeat until <EOS> or max_length
```
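A minimal sketch of this loop in Python, assuming a hypothetical `model` callable that maps a list of token ids to next-token logits and a `tokenizer` with `encode`/`decode` methods (both are stand-ins, not a specific library's API):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokenizer, prompt, max_length=50, eos_id=0):
    input_ids = tokenizer.encode(prompt)          # prompt -> list of token ids
    while len(input_ids) < max_length:
        logits = model(input_ids)                 # forward pass -> (vocab_size,) logits
        probs = softmax(logits)
        next_id = int(np.random.choice(len(probs), p=probs))  # sample next token
        input_ids.append(next_id)
        if next_id == eos_id:                     # stop at end-of-sequence
            break
    return tokenizer.decode(input_ids)
```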
Despite its simplicity, the distribution learned from trillions of tokens encodes enormous world knowledge.
Sampling Strategies
Greedy Search
Always pick the highest-probability token. Fast but produces repetitive, boring text.
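A one-line sketch, assuming `logits` is a NumPy array of next-token logits:

```python
import numpy as np

def greedy_next_token(logits):
    # pick the index of the single highest-scoring token
    return int(np.argmax(logits))
```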
Temperature Scaling
Divide the logits by the temperature τ before applying softmax (a sketch follows the list below):
- τ < 1: Sharper distribution → more deterministic
- τ > 1: Flatter distribution → more creative/random
- τ = 1: Standard distribution
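A minimal sketch, assuming `logits` is a NumPy array of raw next-token logits:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature               # τ < 1 sharpens, τ > 1 flattens
    scaled = scaled - scaled.max()              # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```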
Top-k Sampling
Sample only from the k highest-probability tokens. Prevents low-probability tokens from being selected.
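A sketch of the filtering step, assuming NumPy logits; k = 50 is only an illustrative default:

```python
import numpy as np

def top_k_sample(logits, k=50):
    top_idx = np.argsort(logits)[-k:]                 # indices of the k largest logits
    top_logits = logits[top_idx] - logits[top_idx].max()
    probs = np.exp(top_logits) / np.exp(top_logits).sum()
    return int(np.random.choice(top_idx, p=probs))    # sample among the kept tokens
```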
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list).
- More adaptive than top-k: large k when the distribution is flat, small k when peaked.
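A sketch, again assuming NumPy logits; the nucleus is the prefix of the probability-sorted vocabulary whose cumulative mass reaches p:

```python
import numpy as np

def top_p_sample(logits, p=0.9):
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix covering >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))
```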
Beam Search
Maintain B candidate sequences simultaneously, selecting the globally most probable at the end. Used in translation and summarization; less common for chat.
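A compact sketch over a hypothetical `model` that returns next-token log-probabilities for a sequence of token ids; the length normalization used by production decoders is omitted for brevity:

```python
import numpy as np

def beam_search(model, input_ids, beam_size=3, max_steps=20, eos_id=0):
    beams = [(0.0, list(input_ids))]                     # (cumulative log-prob, sequence)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:                        # finished beams carry over unchanged
                candidates.append((score, seq))
                continue
            log_probs = model(seq)                       # (vocab_size,) log-probabilities
            top = np.argsort(log_probs)[-beam_size:]     # expand only the best continuations
            for tok in top:
                candidates.append((score + float(log_probs[tok]), seq + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for _, seq in beams):   # stop when every beam has ended
            break
    return beams[0][1]                                   # highest-scoring complete sequence
```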
The Context Window
LLMs have a fixed context window (e.g., 200K tokens for Claude 3.5 Sonnet). Everything in the context (system prompt, conversation history, retrieved documents) counts against this limit.
KV cache: During generation, the key and value vectors for all previously processed tokens are cached. Each new token then requires only its own forward pass through the layers: attention reuses the cached keys and values instead of re-encoding the entire prefix, so the cost of recomputing earlier tokens is eliminated (attention itself still looks over the full cached context).
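A toy illustration of the idea for a single attention head, with hypothetical projection matrices `W_q`, `W_k`, `W_v`; real implementations cache per layer and per head inside the model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, x, W_q, W_k, W_v):
        q = x @ W_q                                  # query for the new token only
        self.keys.append(x @ W_k)                    # append the new key to the cache
        self.values.append(x @ W_v)                  # append the new value to the cache
        K = np.stack(self.keys)                      # (seq_len, d) cached keys
        V = np.stack(self.values)                    # (seq_len, d) cached values
        scores = softmax(K @ q / np.sqrt(len(q)))    # attention over all cached tokens
        return scores @ V                            # context vector for the new token

# Usage: one cache per layer; call attend() once per generated token.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for token_embedding in rng.normal(size=(5, d)):      # 5 tokens processed one at a time
    out = cache.attend(token_embedding, W_q, W_k, W_v)
```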
Emergent Capabilities
GPT-3 (175B parameters) exhibited capabilities largely absent in GPT-2 (1.5B parameters):
| Capability | Description |
|---|---|
| Few-shot learning | Learn new tasks from 2-5 examples in the prompt |
| Chain-of-thought | Step-by-step reasoning improves multi-step accuracy |
| Code generation | Write and debug working code across languages |
| Instruction following | Understand and execute natural language instructions |
These capabilities appear to emerge discontinuously: they are absent at small scale and appear once the model crosses a size threshold. This observation motivates the scaling hypothesis, the idea that further increases in scale continue to unlock new capabilities.
Why "Generative"?
A generative model learns the joint distribution P(X) over all possible sequences, not just a discriminative mapping P(Y|X). This allows it to create new sequences that are plausible given learned patterns — not just classify or retrieve.
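Concretely, an autoregressive LLM factorizes this joint distribution with the chain rule, which is exactly what the generation loop above exploits:

P(x_1, …, x_T) = P(x_1) · P(x_2 | x_1) · … · P(x_T | x_1, …, x_{T−1})

Each factor is the next-token distribution that sampling draws from, so generating one token at a time produces a sample from the full joint distribution.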