
Module 4: Vector Databases & Embeddings

1. Embeddings: Dense Vector Representations
2. HNSW and Approximate Nearest Neighbor Search
3. Production Vector Databases
Embeddings: Dense Vector Representations

An embedding is a mapping from a discrete object (token, sentence, image, user) to a point in a continuous high-dimensional vector space, such that semantic similarity corresponds to geometric proximity.

Why Dense Over Sparse?

Representation                   Dimensionality       Captures Semantics   Storage
TF-IDF                           ~100K (vocab size)   No                   Sparse
Word2Vec                         300d                 Partial (static)     Dense
Sentence-BERT                    384–1024d            Yes                  Dense
OpenAI text-embedding-3-large    3072d                Yes                  Dense

Dense embeddings allow "king − man + woman ≈ queen" style arithmetic because meaning is encoded in the geometric structure.
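As a quick illustration, here is a minimal sketch using gensim's pretrained word2vec vectors (the word2vec-google-news-300 model is a large download on first use; exact scores vary):

    # Word-vector analogy arithmetic with gensim's pretrained word2vec vectors.
    import gensim.downloader as api

    # Static 300d word2vec vectors; downloaded (~1.7 GB) and cached on first use.
    wv = api.load("word2vec-google-news-300")

    # king - man + woman: find the word whose vector is closest to the result
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # typically [('queen', ~0.71)]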

Sentence & Document Embeddings

Word-level embeddings don't capture sentence meaning. SBERT (Sentence-BERT) trains a siamese network with a contrastive objective so that:

  • Semantically similar sentences → high cosine similarity
  • Semantically different sentences → low cosine similarity

Key models:

Model                     Dims    Speed    Use Case
all-MiniLM-L6-v2          384     Fast     Local dev, prototyping
all-mpnet-base-v2         768     Medium   High quality, general
text-embedding-3-small    1536    API      Cost-effective production
text-embedding-3-large    3072    API      Maximum quality
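A minimal sketch of sentence similarity with the sentence-transformers library, using all-MiniLM-L6-v2 from the table above (example sentences and scores are illustrative):

    # Sentence embeddings + cosine similarity with sentence-transformers.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384d, fast

    sentences = [
        "A man is eating food.",
        "Someone is having a meal.",
        "The stock market fell sharply today.",
    ]
    # normalize_embeddings=True makes the dot product equal cosine similarity
    emb = model.encode(sentences, normalize_embeddings=True)

    scores = util.cos_sim(emb, emb)  # pairwise cosine similarity matrix
    print(scores)  # the first two sentences should score far higher than either vs. the third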

Similarity Metrics

Cosine Similarity

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}

Measures the angle between vectors, ignoring magnitude. Robust to document length differences. Standard choice for text.

Dot Product

\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i

Equivalent to cosine similarity when vectors are L2-normalized, and faster to compute.

Euclidean Distance (L2)

d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2

Sensitive to vector magnitude — less commonly used for text.
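All three metrics are one-liners in NumPy; this sketch (toy vectors, invented values) also checks the dot-product/cosine equivalence under L2 normalization noted above:

    # The three similarity metrics, plus the normalization equivalence.
    import numpy as np

    a = np.array([1.0, 2.0, 3.0])  # toy vectors for illustration
    b = np.array([2.0, 4.0, 5.0])

    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    dot = a @ b
    euclidean = np.linalg.norm(a - b)

    # After L2 normalization, the dot product equals cosine similarity
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    assert np.isclose(a_n @ b_n, cosine)

    print(f"cosine={cosine:.3f}  dot={dot:.3f}  euclidean={euclidean:.3f}")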

Embedding Dimensions & Truncation

OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL) — you can truncate the embedding to fewer dimensions without significant quality loss. This allows a cost-quality tradeoff:

  • 256d: 90% of full quality, 12× smaller
  • 1024d: 97% of full quality, 3× smaller
  • 3072d: 100% of full quality (full size)
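A minimal sketch of both routes with the official openai Python client (assumes OPENAI_API_KEY is set in the environment; the dimensions parameter is how the API exposes MRL truncation for the text-embedding-3 models):

    # Reduced-dimension embeddings: server-side truncation vs. manual slicing.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    text = "vector databases store embeddings"

    # Option 1: request a truncated embedding directly (returned L2-normalized)
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=text, dimensions=1024
    )
    emb_1024 = np.array(resp.data[0].embedding)

    # Option 2: take the full 3072d vector, slice it, then re-normalize yourself
    full = client.embeddings.create(model="text-embedding-3-large", input=text)
    v = np.array(full.data[0].embedding)[:1024]
    emb_manual = v / np.linalg.norm(v)  # re-normalize after slicing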

What Good Embeddings Capture

  • Semantic similarity: "automobile" ≈ "car" ≈ "vehicle"
  • Syntactic structure: verb forms cluster together
  • Factual knowledge: country-capital relationships
  • Domain-specific meaning: in medical text, "MI" ≈ "myocardial infarction"

Embedding quality is highly domain-dependent — a general-purpose embedder may underperform a domain-specific one fine-tuned on your data.