Module 4: Vector Databases & Embeddings
Embeddings: Dense Vector Representations
An embedding is a mapping from a discrete object (token, sentence, image, user) to a point in a continuous high-dimensional vector space, such that semantic similarity corresponds to geometric proximity.
Why Dense Over Sparse?
| Representation | Dimensionality | Captures Semantics | Storage |
|---|---|---|---|
| TF-IDF | ~100K (vocab size) | No | Sparse |
| Word2Vec | 300d | Partial (static) | Dense |
| Sentence-BERT | 384–1024d | Yes | Dense |
| OpenAI text-embedding-3-large | 3072d | Yes | Dense |
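To make the storage column concrete, here is a minimal sketch contrasting the two representations. It assumes scikit-learn and sentence-transformers are installed; the example documents are placeholders:

```python
# Sparse vs. dense: TF-IDF grows with the vocabulary, SBERT is fixed-size.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "The car drove down the highway.",
    "An automobile sped along the road.",
]

# Sparse: one dimension per vocabulary term, mostly zeros at corpus scale.
sparse = TfidfVectorizer().fit_transform(docs)
print(sparse.shape, sparse.nnz)  # (2, vocab_size), only a few nonzero entries

# Dense: fixed 384 dimensions, every component populated.
dense = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
print(dense.shape)  # (2, 384)
```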
Dense embeddings allow "king − man + woman ≈ queen" style arithmetic because semantic relationships are encoded as consistent directions in the vector space; the classic demonstration uses static word vectors such as Word2Vec.
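As an illustration, gensim's pretrained Word2Vec vectors expose this arithmetic directly. A sketch (note the large one-time model download):

```python
# Word-vector arithmetic with pretrained Word2Vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # returns KeyedVectors

# king - man + woman: "queen" is typically the nearest neighbor.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```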
Sentence & Document Embeddings
Word-level embeddings don't capture sentence meaning: averaging word vectors discards word order and composition. SBERT (Sentence-BERT) trains a siamese network with a contrastive objective so that (see the sketch after this list):
- Semantically similar sentences → high cosine similarity
- Semantically different sentences → low cosine similarity
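A minimal usage sketch with the sentence-transformers library; the example sentences are placeholders:

```python
# Encode sentences and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-d sentence embeddings

emb = model.encode([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the capital of France?",
])

# Related sentences score high; unrelated ones score low.
print(util.cos_sim(emb[0], emb[1]))  # high (related)
print(util.cos_sim(emb[0], emb[2]))  # low  (unrelated)
```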
Key models:
| Model | Dims | Inference | Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, runs locally | Local dev, prototyping |
| all-mpnet-base-v2 | 768 | Medium, runs locally | High quality, general |
| text-embedding-3-small | 1536 | Hosted API | Cost-effective production |
| text-embedding-3-large | 3072 | Hosted API | Maximum quality |
Similarity Metrics
Cosine Similarity
Measures the cosine of the angle between vectors, ignoring magnitude. Robust to differences in document length. The standard choice for text.
Dot Product
Equivalent to cosine similarity when vectors are L2-normalized, and faster to compute because it skips the normalization step.
Euclidean Distance (L2)
Sensitive to vector magnitude, so it is less commonly used for text. For unit-normalized vectors it carries the same ranking information as cosine similarity, since ‖a − b‖² = 2(1 − cos θ).
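A small NumPy sketch showing all three metrics on toy vectors, including the identities that tie them together after L2 normalization:

```python
# The three similarity metrics side by side, in plain NumPy.
import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.3, 0.8, 0.5])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2 = np.linalg.norm(a - b)

# After L2 normalization: dot product == cosine similarity,
# and squared L2 distance == 2 * (1 - cosine).
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(an @ bn, cosine)
assert np.isclose(np.linalg.norm(an - bn) ** 2, 2 * (1 - cosine))
```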
Embedding Dimensions & Truncation
OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL): you can truncate the embedding to fewer dimensions (re-normalizing it afterward) without significant quality loss. This allows a cost-quality tradeoff (see the sketch after this list):
- 256d: 90% of full quality, 12× smaller
- 1024d: 97% of full quality, 3× smaller
- 3072d: 100% (full size, no truncation)
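A sketch of both routes to a shorter embedding, assuming the openai Python SDK (v1+) and an `OPENAI_API_KEY` in the environment; the input string is a placeholder:

```python
# Two ways to get shorter Matryoshka embeddings from text-embedding-3.
import numpy as np
from openai import OpenAI

client = OpenAI()

# 1) Ask the API for fewer dimensions directly; the API re-normalizes
#    shortened embeddings for you.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="vector databases store embeddings",
    dimensions=1024,
)
short = np.array(resp.data[0].embedding)

# 2) Or truncate the full 3072-d vector yourself; re-normalize before
#    using cosine or dot-product similarity.
full = np.array(
    client.embeddings.create(
        model="text-embedding-3-large",
        input="vector databases store embeddings",
    ).data[0].embedding
)
manual = full[:1024]
manual = manual / np.linalg.norm(manual)
```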
What Good Embeddings Capture
- Semantic similarity: "automobile" ≈ "car" ≈ "vehicle"
- Syntactic structure: verb forms cluster together
- Factual knowledge: country-capital relationships
- Domain-specific meaning: in medical text, "MI" ≈ "myocardial infarction"
Embedding quality is highly domain-dependent: a general-purpose embedder may underperform a domain-specific one fine-tuned on your data, so it is worth sanity-checking candidates on labeled pairs from your own corpus, as in the sketch below.
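One practical check before adopting a model: embed a few pairs you have labeled yourself and verify that positives clearly outscore negatives. A sketch, with placeholder medical-domain pairs:

```python
# Sanity-check an embedder on hand-labeled pairs (1 = similar, 0 = not).
from sentence_transformers import SentenceTransformer, util

pairs = [
    ("MI ruled out by serial troponins", "no myocardial infarction", 1),
    ("MI ruled out by serial troponins", "patient is from Michigan", 0),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
for text_a, text_b, label in pairs:
    score = util.cos_sim(model.encode(text_a), model.encode(text_b)).item()
    print(f"label={label}  cosine={score:.3f}")
# If positives don't clearly outscore negatives, consider a
# domain-specific or fine-tuned model instead.
```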