Module 3: Generative AI
How LLMs Generate Text
Large Language Models generate text through a deceptively simple process: predict the probability distribution over the next token, sample from it, append the token, and repeat.
The Generation Loop
```
prompt → [tokenize] → input_ids
loop:
    input_ids → [model forward pass] → logits (vocab_size)
    logits → [sampling] → next_token_id
    append next_token_id → repeat until <EOS> or max_length
```
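A minimal sketch of this loop in Python, assuming a hypothetical `model` callable that maps a list of token ids to next-token logits and a `tokenizer` with `encode`/`decode` methods (both are stand-ins, not a specific library's API):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokenizer, prompt, max_length=50, eos_id=0):
    input_ids = tokenizer.encode(prompt)          # prompt -> list of token ids
    while len(input_ids) < max_length:
        logits = model(input_ids)                 # forward pass -> (vocab_size,) logits
        probs = softmax(logits)
        next_id = int(np.random.choice(len(probs), p=probs))  # sample next token
        input_ids.append(next_id)
        if next_id == eos_id:                     # stop at end-of-sequence
            break
    return tokenizer.decode(input_ids)
```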
Despite its simplicity, the distribution learned from trillions of tokens encodes enormous world knowledge.
Sampling Strategies
Greedy Search
Always pick the highest-probability token. Fast but produces repetitive, boring text.
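A one-line sketch, assuming `logits` is a NumPy array of next-token logits:

```python
import numpy as np

def greedy_next_token(logits):
    # pick the index of the single highest-scoring token
    return int(np.argmax(logits))
```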
Temperature Scaling
Divide the logits by the temperature τ before applying softmax (a sketch follows the list below):
- τ < 1: Sharper distribution → more deterministic
- τ > 1: Flatter distribution → more creative/random
- τ = 1: Standard distribution
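A minimal sketch, assuming `logits` is a NumPy array of raw next-token logits:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature               # τ < 1 sharpens, τ > 1 flattens
    scaled = scaled - scaled.max()              # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```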
Top-k Sampling
Sample only from the k highest-probability tokens. Prevents low-probability tokens from being selected.
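A sketch of the filtering step, assuming NumPy logits; k = 50 is only an illustrative default:

```python
import numpy as np

def top_k_sample(logits, k=50):
    top_idx = np.argsort(logits)[-k:]                 # indices of the k largest logits
    top_logits = logits[top_idx] - logits[top_idx].max()
    probs = np.exp(top_logits) / np.exp(top_logits).sum()
    return int(np.random.choice(top_idx, p=probs))    # sample among the kept tokens
```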
Top-p (Nucleus) Sampling
Sample from the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list).
- More adaptive than top-k: large k when the distribution is flat, small k when peaked.
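A sketch, again assuming NumPy logits; the nucleus is the prefix of the probability-sorted vocabulary whose cumulative mass reaches p:

```python
import numpy as np

def top_p_sample(logits, p=0.9):
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix covering >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))
```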
Beam Search
Maintain B candidate sequences simultaneously, selecting the globally most probable at the end. Used in translation and summarization; less common for chat.
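A compact sketch over a hypothetical `model` that returns next-token log-probabilities for a sequence of token ids; the length normalization used by production decoders is omitted for brevity:

```python
import numpy as np

def beam_search(model, input_ids, beam_size=3, max_steps=20, eos_id=0):
    beams = [(0.0, list(input_ids))]                     # (cumulative log-prob, sequence)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos_id:                        # finished beams carry over unchanged
                candidates.append((score, seq))
                continue
            log_probs = model(seq)                       # (vocab_size,) log-probabilities
            top = np.argsort(log_probs)[-beam_size:]     # expand only the best continuations
            for tok in top:
                candidates.append((score + float(log_probs[tok]), seq + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for _, seq in beams):   # stop when every beam has ended
            break
    return beams[0][1]                                   # highest-scoring complete sequence
```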
The Context Window
LLMs have a fixed context window (e.g., 200K tokens for Claude 3.5 Sonnet). Everything in the context (system prompt, conversation history, retrieved documents) counts against this limit.
KV cache: During generation, the key and value vectors for all previously processed tokens are cached. Each new token then requires only its own forward pass through the layers: attention reuses the cached keys and values instead of re-encoding the entire prefix, so the cost of recomputing earlier tokens is eliminated (attention itself still looks over the full cached context).
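A toy illustration of the idea for a single attention head, with hypothetical projection matrices `W_q`, `W_k`, `W_v`; real implementations cache per layer and per head inside the model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, x, W_q, W_k, W_v):
        q = x @ W_q                                  # query for the new token only
        self.keys.append(x @ W_k)                    # append the new key to the cache
        self.values.append(x @ W_v)                  # append the new value to the cache
        K = np.stack(self.keys)                      # (seq_len, d) cached keys
        V = np.stack(self.values)                    # (seq_len, d) cached values
        scores = softmax(K @ q / np.sqrt(len(q)))    # attention over all cached tokens
        return scores @ V                            # context vector for the new token

# Usage: one cache per layer; call attend() once per generated token.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for token_embedding in rng.normal(size=(5, d)):      # 5 tokens processed one at a time
    out = cache.attend(token_embedding, W_q, W_k, W_v)
```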
Emergent Capabilities
GPT-3 (175B parameters) exhibited capabilities largely absent in GPT-2 (1.5B parameters):
| Capability | Description |
|---|---|
| Few-shot learning | Learn new tasks from 2-5 examples in the prompt |
| Chain-of-thought | Step-by-step reasoning improves multi-step accuracy |
| Code generation | Write and debug working code across languages |
| Instruction following | Understand and execute natural language instructions |
These capabilities appear to emerge discontinuously: they are absent at small scale and appear once the model crosses a size threshold. This observation motivates the scaling hypothesis, the idea that further increases in scale continue to unlock new capabilities.
Why "Generative"?
A generative model learns the joint distribution P(X) over all possible sequences, not just a discriminative mapping P(Y|X). This allows it to create new sequences that are plausible given learned patterns — not just classify or retrieve.
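Concretely, an autoregressive LLM factorizes this joint distribution with the chain rule, which is exactly what the generation loop above exploits:

P(x_1, …, x_T) = P(x_1) · P(x_2 | x_1) · … · P(x_T | x_1, …, x_{T−1})

Each factor is the next-token distribution that sampling draws from, so generating one token at a time produces a sample from the full joint distribution.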