Module 2: NLP & Language Models
Text Representations & Word Embeddings
Computers operate on numbers. The history of NLP is largely the history of finding better ways to represent text numerically.
Evolution of Text Representations
Bag of Words (BoW)
A document becomes a sparse vector where each dimension is a vocabulary word and the value is its count (or TF-IDF weight).
- Pros: Simple, interpretable
- Cons: No word order, no semantics ("dog bites man" ≡ "man bites dog")
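A minimal sketch of count-based BoW vectorization; the toy corpus and the `bow_vector` helper are illustrative, not tied to any particular library:

```python
from collections import Counter

# Hypothetical mini-corpus used only to show the mechanics.
docs = ["dog bites man", "man bites dog", "man pets dog"]

# Build the vocabulary: one dimension per unique word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc: str) -> list[int]:
    """Return the count vector for a document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", bow_vector(doc))
# "dog bites man" and "man bites dog" get identical vectors:
# bag-of-words discards word order.
```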
TF-IDF
Weights terms by their frequency within a document against their rarity across the corpus:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.
Still sparse, but common words (the, a, is) are downweighted.
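A small sketch of the tf × log(N / df) weighting above, applied to a toy three-document corpus (the corpus and the unsmoothed weighting scheme are illustrative):

```python
import math
from collections import Counter

docs = [doc.split() for doc in
        ["the dog bites the man", "the man pets the dog", "the cat sleeps"]]
N = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc: list[str]) -> dict[str, float]:
    """TF-IDF weights for one document: raw count times log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

print(tfidf(docs[0]))
# "the" appears in every document, so log(N / df) = log(1) = 0
# and its weight is driven to zero despite its high count.
```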
Word2Vec (Mikolov et al., 2013)
Dense vectors (typically 100–300 dimensions) trained with one of two objectives:
- CBOW: Predict center word from context words
- Skip-gram: Predict context words from center word
Famous property: king − man + woman ≈ queen
These linear analogies work because words used in similar contexts end up close together, and consistent relationships (gender, tense, capital-of) are encoded as roughly constant offset vectors in the embedding space.
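A toy illustration of the analogy arithmetic with hand-picked 4-dimensional vectors; the values are made up purely to show the mechanics, since real Word2Vec embeddings are learned and much higher-dimensional:

```python
import numpy as np

# Hypothetical embeddings chosen so the man -> woman offset is roughly constant.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.1]),
    "man":   np.array([0.1, 0.2, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # -> "queen"
```

With a real pretrained model, gensim's `KeyedVectors.most_similar(positive=['king', 'woman'], negative=['man'])` performs the same nearest-neighbor lookup.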
GloVe (Pennington et al., 2014)
Learns vectors from a global word co-occurrence matrix: the training objective fits dot products of word vectors to the logarithm of co-occurrence counts. This combines the strengths of count-based matrix factorization and local context-window methods like Word2Vec.
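A sketch of the global co-occurrence counting GloVe starts from, using an illustrative window size of 2 and a toy corpus; GloVe then fits word vectors so that their dot products approximate the logs of these counts:

```python
from collections import defaultdict

corpus = ["the dog bites the man", "the man pets the dog"]
window = 2

# Accumulate co-occurrence counts over the whole corpus.
cooc = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                # GloVe weights a co-occurrence by the inverse of the
                # distance between the two words within the window.
                cooc[(center, tokens[j])] += 1.0 / abs(i - j)

print(cooc[("dog", "the")])
```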
The Contextual Embedding Revolution
Word2Vec gives each word a single vector. But "bank" has different meanings in:
- "river bank"
- "bank account"
ELMo (2018): Use a bidirectional LSTM — each word's representation depends on its context.
BERT (2018): Use a Transformer encoder — each token's representation attends to every other token simultaneously.
This shift from static to contextual embeddings was transformative for all NLP tasks.
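A short sketch of contextual embeddings with the Hugging Face `transformers` library, assuming `torch` and `transformers` are installed and the `bert-base-uncased` checkpoint can be downloaded:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

river = bank_vector("He sat on the river bank.")
money = bank_vector("She opened a bank account.")
# Unlike a static Word2Vec lookup, the two vectors differ because
# each depends on the surrounding context.
print(float(torch.cosine_similarity(river, money, dim=0)))
```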
Tokenization in Modern LLMs
Modern LLMs don't tokenize at the word level — they use subword tokenization:
| Algorithm | Used By | Approach |
|---|---|---|
| BPE | GPT series | Bottom-up byte pair merging |
| WordPiece | BERT | Maximize likelihood of training data |
| SentencePiece | T5, LLaMA | Language-agnostic framework (unigram LM or BPE) |
Why subword? Handles rare words gracefully. "tokenization" might become ["token", "ization"] instead of [UNK].
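A toy sketch of the bottom-up merging idea behind BPE; real byte-level BPE tokenizers additionally operate on bytes, store a fixed merge table learned from a large corpus, and handle special tokens:

```python
from collections import Counter

# Word frequencies in a toy "corpus", with each word split into characters.
words = Counter({("t","o","k","e","n","i","z","a","t","i","o","n"): 4,
                 ("t","o","k","e","n","s"): 3})

def most_frequent_pair(words):
    """Find the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(4):
    words = merge(words, most_frequent_pair(words))
print(list(words))
# After a few merges the frequent substring "token" becomes a single unit,
# so "tokenization" is covered by subwords instead of falling back to [UNK].
```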