Module 2: NLP & Language Models

1. Text Representations & Word Embeddings
2. The Transformer Architecture
3. BERT, GPT, and the Pre-training Paradigm

Text Representations & Word Embeddings

Computers operate on numbers. The history of NLP is largely the history of finding better ways to represent text numerically.

Evolution of Text Representations

Bag of Words (BoW)

A document becomes a sparse vector where each dimension is a vocabulary word and the value is its count (or TF-IDF weight).

  • Pros: Simple, interpretable
  • Cons: No word order, no semantics ("dog bites man" ≡ "man bites dog")
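A minimal sketch of the idea, assuming scikit-learn is available (its CountVectorizer is a bag-of-words implementation): the two sentences above produce identical count vectors, which is exactly the word-order problem.

from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings.
docs = ["dog bites man", "man bites dog"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse (2 x vocabulary) count matrix

print(vec.get_feature_names_out())   # vocabulary: ['bites' 'dog' 'man']
print(X.toarray())                   # identical rows: word order is lost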

TF-IDF

Weights terms by frequency in document vs. rarity across corpus:

\text{TF-IDF}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}

Still sparse, but common words (the, a, is) are downweighted.
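As a quick illustration, assuming scikit-learn (note its TfidfVectorizer uses a smoothed variant of the formula above and L2-normalizes each row, so the numbers differ slightly from a hand calculation):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the dog bites man",
    "the man bites dog",
    "the dog barks",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse (3 x vocabulary) matrix

print(vec.get_feature_names_out())
# Words appearing in every document ("the", "dog") get the lowest weights in each row.
print(X.toarray().round(2))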

Word2Vec (Mikolov et al., 2013)

Dense vectors (commonly 300-dimensional) trained with one of two objectives:

  • CBOW: Predict center word from context words
  • Skip-gram: Predict context words from center word

Famous property: king − man + woman ≈ queen

These linear analogies work because the training objective encodes many semantic relationships (gender, capital-of, tense) as roughly consistent offset directions in the embedding space.
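A minimal training sketch, assuming the gensim library (its Word2Vec class implements both objectives; sg=1 selects skip-gram, sg=0 CBOW). The toy corpus below is far too small to reproduce the king/queen analogy; real models are trained on billions of tokens.

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "to", "work"],
    ["the", "woman", "walks", "to", "work"],
]

model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)

# Vector arithmetic behind the analogy: king - man + woman -> ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))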

GloVe

GloVe (Pennington et al., 2014) fits word vectors directly to global word co-occurrence statistics, combining the strengths of matrix-factorization methods with those of local context-window methods like Word2Vec.
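Concretely, GloVe minimizes a weighted least-squares loss over co-occurrence counts, where X_{ij} is how often words i and j co-occur and f is a weighting function that caps very frequent pairs:

J = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2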

The Contextual Embedding Revolution

Word2Vec gives each word a single vector. But "bank" has different meanings in:

  • "river bank"
  • "bank account"

ELMo (2018): A deep bidirectional LSTM produces each word's representation from the full sentence it appears in.

BERT (2018): A Transformer encoder lets each token's representation attend to every other token in the input simultaneously.

This shift from static to contextual embeddings was transformative for all NLP tasks.
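A minimal sketch of the difference, assuming the transformers and torch packages and the public bert-base-uncased checkpoint: the contextual vector for "bank" differs between the two sentences, whereas a Word2Vec lookup would return the same vector both times.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual hidden state of the "bank" token in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v_river = bank_vector("I sat on the river bank.")
v_money = bank_vector("I opened a bank account.")
print(torch.cosine_similarity(v_river, v_money, dim=0))  # noticeably below 1.0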

Tokenization in Modern LLMs

Modern LLMs don't tokenize at the word level — they use subword tokenization:

Algorithm      | Used By     | Approach
BPE            | GPT series  | Bottom-up byte-pair merging
WordPiece      | BERT        | Merges chosen to maximize training-data likelihood
SentencePiece  | T5, LLaMA   | Framework supporting unigram LM (T5) and BPE (LLaMA)

Why subword? Handles rare words gracefully. "tokenization" might become ["token", "ization"] instead of [UNK].
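A quick way to see this, assuming the transformers package and the public gpt2 and bert-base-uncased tokenizers (the exact splits depend on each model's learned vocabulary):

from transformers import AutoTokenizer

word = "tokenization"
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    # Both tokenizers fall back to subword pieces rather than an unknown token.
    print(name, tok.tokenize(word))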