Module 2: NLP & Language Models
Text Representations & Word Embeddings
Computers operate on numbers. The history of NLP is largely the history of finding better ways to represent text numerically.
Evolution of Text Representations
Bag of Words (BoW)
A document becomes a sparse vector where each dimension is a vocabulary word and the value is its count (or TF-IDF weight).
- Pros: Simple, interpretable
- Cons: No word order, no semantics ("dog bites man" ≡ "man bites dog")
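A minimal sketch of count-based BoW vectorization; the toy corpus and the `bow_vector` helper are illustrative, not tied to any particular library:

```python
from collections import Counter

# Hypothetical mini-corpus used only to show the mechanics.
docs = ["dog bites man", "man bites dog", "man pets dog"]

# Build the vocabulary: one dimension per unique word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc: str) -> list[int]:
    """Return the count vector for a document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", bow_vector(doc))
# "dog bites man" and "man bites dog" get identical vectors:
# bag-of-words discards word order.
```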
TF-IDF
Weights terms by their frequency within a document against their rarity across the corpus:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.
Still sparse, but common words (the, a, is) are downweighted.
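A small sketch of the tf × log(N / df) weighting above, applied to a toy three-document corpus (the corpus and the unsmoothed weighting scheme are illustrative):

```python
import math
from collections import Counter

docs = [doc.split() for doc in
        ["the dog bites the man", "the man pets the dog", "the cat sleeps"]]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc: list[str]) -> dict[str, float]:
    """TF-IDF weights for one document: raw count times log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

print(tfidf(docs[0]))
# "the" appears in every document, so log(N / df) = log(1) = 0
# and its weight is driven to zero despite its high count.
```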
Word2Vec (Mikolov et al., 2013)
Dense vectors (typically 100–300 dimensions) trained with one of two objectives:
- CBOW: Predict center word from context words
- Skip-gram: Predict context words from center word
Famous property: king − man + woman ≈ queen
These linear analogies work because words used in similar contexts end up close together, and consistent relationships (gender, tense, capital-of) are encoded as roughly constant offset vectors in the embedding space.
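A toy illustration of the analogy arithmetic with hand-picked 4-dimensional vectors; the values are made up purely to show the mechanics, since real Word2Vec embeddings are learned and much higher-dimensional:

```python
import numpy as np

# Hypothetical embeddings chosen so the man -> woman offset is roughly constant.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.1]),
    "man":   np.array([0.1, 0.2, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # -> "queen"
```

With a real pretrained model, gensim's `KeyedVectors.most_similar(positive=['king', 'woman'], negative=['man'])` performs the same nearest-neighbor lookup.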
GloVe (Pennington et al., 2014)
Learns vectors from a global word co-occurrence matrix: the training objective fits dot products of word vectors to the logarithm of co-occurrence counts. This combines the strengths of count-based matrix factorization and local context-window methods like Word2Vec.
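A sketch of the global co-occurrence counting GloVe starts from, using an illustrative window size of 2 and a toy corpus; GloVe then fits word vectors so that their dot products approximate the logs of these counts:

```python
from collections import defaultdict

corpus = ["the dog bites the man", "the man pets the dog"]
window = 2

# Accumulate co-occurrence counts over the whole corpus.
cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                # GloVe weights a co-occurrence by the inverse of the
                # distance between the two words within the window.
                cooc[(center, tokens[j])] += 1.0 / abs(i - j)

print(cooc[("dog", "the")])
```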
The Contextual Embedding Revolution
Word2Vec gives each word a single vector. But "bank" has different meanings in:
- "river bank"
- "bank account"
ELMo (2018): Use a bidirectional LSTM — each word's representation depends on its context.
BERT (2018): Use a Transformer encoder — each token's representation attends to every other token simultaneously.
This shift from static to contextual embeddings was transformative for all NLP tasks.
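A short sketch of contextual embeddings with the Hugging Face `transformers` library, assuming `torch` and `transformers` are installed and the `bert-base-uncased` checkpoint can be downloaded:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

river = bank_vector("He sat on the river bank.")
money = bank_vector("She opened a bank account.")
# Unlike a static Word2Vec lookup, the two vectors differ because
# each depends on the surrounding context.
print(float(torch.cosine_similarity(river, money, dim=0)))
```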
Tokenization in Modern LLMs
Modern LLMs don't tokenize at the word level — they use subword tokenization:
| Algorithm | Used By | Approach |
|---|---|---|
| BPE | GPT series | Bottom-up byte pair merging |
| WordPiece | BERT | Maximize likelihood of training data |
| SentencePiece | T5, LLaMA | Language-agnostic framework (unigram LM or BPE) |
Why subword? Handles rare words gracefully. "tokenization" might become ["token", "ization"] instead of [UNK].
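A toy sketch of the bottom-up merging idea behind BPE; real byte-level BPE tokenizers additionally operate on bytes, store a fixed merge table learned from a large corpus, and handle special tokens:

```python
from collections import Counter

# Word frequencies in a toy "corpus", with each word split into characters.
words = Counter({("t","o","k","e","n","i","z","a","t","i","o","n"): 4,
                 ("t","o","k","e","n","s"): 3})

def most_frequent_pair(words):
    """Find the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(4):
    words = merge(words, most_frequent_pair(words))
print(list(words))
# After a few merges the frequent substring "token" becomes a single unit,
# so "tokenization" is covered by subwords instead of falling back to [UNK].
```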