Module 4: Vector Databases & Embeddings
Embeddings: Dense Vector Representations
An embedding is a mapping from a discrete object (token, sentence, image, user) to a point in a continuous high-dimensional vector space, such that semantic similarity corresponds to geometric proximity.
Why Dense Over Sparse?
| Representation | Dimensionality | Captures Semantics | Storage |
|---|---|---|---|
| TF-IDF | ~100K (vocab size) | No | Sparse |
| Word2Vec | 300d | Partial (static) | Dense |
| Sentence-BERT | 384–1024d | Yes | Dense |
| OpenAI text-embedding-3-large | 3072d | Yes | Dense |
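To make the storage column concrete, here is a minimal sketch contrasting the two representations. It assumes scikit-learn and sentence-transformers are installed; the example documents are placeholders:

```python
# Sparse vs. dense: TF-IDF grows with the vocabulary, SBERT is fixed-size.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

docs = [
    "The car drove down the highway.",
    "An automobile sped along the road.",
]

# Sparse: one dimension per vocabulary term, mostly zeros at corpus scale.
sparse = TfidfVectorizer().fit_transform(docs)
print(sparse.shape, sparse.nnz)  # (2, vocab_size), only a few nonzero entries

# Dense: fixed 384 dimensions, every component populated.
dense = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
print(dense.shape)  # (2, 384)
```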
Dense embeddings allow "king − man + woman ≈ queen" style arithmetic because semantic relationships are encoded as consistent directions in the vector space; the classic demonstration uses static word vectors such as Word2Vec.
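As an illustration, gensim's pretrained Word2Vec vectors expose this arithmetic directly. A sketch (note the large one-time model download):

```python
# Word-vector arithmetic with pretrained Word2Vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # returns KeyedVectors

# king - man + woman: "queen" is typically the nearest neighbor.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```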
Sentence & Document Embeddings
Word-level embeddings don't capture sentence meaning: averaging word vectors discards word order and composition. SBERT (Sentence-BERT) trains a siamese network with a contrastive objective so that (see the sketch after this list):
- Semantically similar sentences → high cosine similarity
- Semantically different sentences → low cosine similarity
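A minimal usage sketch with the sentence-transformers library; the example sentences are placeholders:

```python
# Encode sentences and compare them with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-d sentence embeddings

emb = model.encode([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the capital of France?",
])

# Related sentences score high; unrelated ones score low.
print(util.cos_sim(emb[0], emb[1]))  # high (related)
print(util.cos_sim(emb[0], emb[2]))  # low  (unrelated)
```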
Key models:
| Model | Dims | Inference | Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, runs locally | Local dev, prototyping |
| all-mpnet-base-v2 | 768 | Medium, runs locally | High quality, general |
| text-embedding-3-small | 1536 | Hosted API | Cost-effective production |
| text-embedding-3-large | 3072 | Hosted API | Maximum quality |
Similarity Metrics
Cosine Similarity
Measures the cosine of the angle between vectors, ignoring magnitude. Robust to differences in document length. The standard choice for text.
Dot Product
Equivalent to cosine similarity when vectors are L2-normalized, and faster to compute because it skips the normalization step.
Euclidean Distance (L2)
Sensitive to vector magnitude, so it is less commonly used for text. For unit-normalized vectors it carries the same ranking information as cosine similarity, since ‖a − b‖² = 2(1 − cos θ).
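A small NumPy sketch showing all three metrics on toy vectors, including the identities that tie them together after L2 normalization:

```python
# The three similarity metrics side by side, in plain NumPy.
import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.3, 0.8, 0.5])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2 = np.linalg.norm(a - b)

# After L2 normalization: dot product == cosine similarity,
# and squared L2 distance == 2 * (1 - cosine).
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(an @ bn, cosine)
assert np.isclose(np.linalg.norm(an - bn) ** 2, 2 * (1 - cosine))
```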
Embedding Dimensions & Truncation
OpenAI's text-embedding-3 models support Matryoshka Representation Learning (MRL): you can truncate the embedding to fewer dimensions (re-normalizing it afterward) without significant quality loss. This allows a cost-quality tradeoff (see the sketch after this list):
- 256d: 90% of full quality, 12× smaller
- 1024d: 97% of full quality, 3× smaller
- 3072d: 100% (full size, no truncation)
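A sketch of both routes to a shorter embedding, assuming the openai Python SDK (v1+) and an `OPENAI_API_KEY` in the environment; the input string is a placeholder:

```python
# Two ways to get shorter Matryoshka embeddings from text-embedding-3.
import numpy as np
from openai import OpenAI

client = OpenAI()

# 1) Ask the API for fewer dimensions directly; the API re-normalizes
#    shortened embeddings for you.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="vector databases store embeddings",
    dimensions=1024,
)
short = np.array(resp.data[0].embedding)

# 2) Or truncate the full 3072-d vector yourself; re-normalize before
#    using cosine or dot-product similarity.
full = np.array(
    client.embeddings.create(
        model="text-embedding-3-large",
        input="vector databases store embeddings",
    ).data[0].embedding
)
manual = full[:1024]
manual = manual / np.linalg.norm(manual)
```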
What Good Embeddings Capture
- Semantic similarity: "automobile" ≈ "car" ≈ "vehicle"
- Syntactic structure: verb forms cluster together
- Factual knowledge: country-capital relationships
- Domain-specific meaning: in medical text, "MI" ≈ "myocardial infarction"
Embedding quality is highly domain-dependent: a general-purpose embedder may underperform a domain-specific one fine-tuned on your data, so it is worth sanity-checking candidates on labeled pairs from your own corpus, as in the sketch below.
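One practical check before adopting a model: embed a few pairs you have labeled yourself and verify that positives clearly outscore negatives. A sketch, with placeholder medical-domain pairs:

```python
# Sanity-check an embedder on hand-labeled pairs (1 = similar, 0 = not).
from sentence_transformers import SentenceTransformer, util

pairs = [
    ("MI ruled out by serial troponins", "no myocardial infarction", 1),
    ("MI ruled out by serial troponins", "patient is from Michigan", 0),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
for text_a, text_b, label in pairs:
    score = util.cos_sim(model.encode(text_a), model.encode(text_b)).item()
    print(f"label={label}  cosine={score:.3f}")
# If positives don't clearly outscore negatives, consider a
# domain-specific or fine-tuned model instead.
```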