Module 9: Production & MLOps
LLM Deployment Patterns
Getting an LLM to work in a notebook is the easy part. Getting it to work reliably for 10,000 concurrent users at <500ms p95 latency is the hard part.
The Latency Stack
End-to-end latency = network + queue + TTFT + TBT × output_tokens
- TTFT (Time to First Token): Latency until the first token streams. Drives perceived responsiveness.
- TBT (Time Between Tokens): Interval between successive output tokens once generation starts. Determines streaming speed and total generation time.
- Total generation time: ~TTFT + TBT × output_tokens
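For a back-of-the-envelope budget, plugging illustrative numbers into the formula makes the tradeoff concrete (the values below are assumptions, not benchmarks):

```python
# Rough latency-budget estimate from the components above (illustrative numbers).
network_ms = 50        # round-trip to the API/gateway
queue_ms = 30          # time waiting for a free batch slot
ttft_ms = 300          # prefill + first decoded token
tbt_ms = 25            # per-token decode time
output_tokens = 400

total_ms = network_ms + queue_ms + ttft_ms + tbt_ms * output_tokens
print(f"End-to-end: {total_ms / 1000:.2f} s")                          # 10.38 s
print(f"Wait until first token: {network_ms + queue_ms + ttft_ms} ms")  # 380 ms
```

Note how the perceived wait (time to first token) is a tiny fraction of the total: this is why streaming matters so much for interactive UIs.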
Typical production targets:
- Streaming UI: TTFT < 500ms
- Batch processing: latency matters less; optimize throughput and cost per token
Deployment Options
Managed APIs (OpenAI, Anthropic, Google)
- Pros: Zero ops, world-class models, automatic scaling
- Cons: Cost at scale, data privacy concerns, vendor dependency
- Best for: Startups, prototypes, and apps whose volume doesn't yet justify the ops cost of self-hosting
Self-Hosted Open Models (vLLM, Ollama, TGI)
- Pros: Full data control, cost-effective at scale, customizable
- Cons: GPU infra, ops burden, model quality gap (closing fast with Llama 3.3)
- Best for: High-volume, sensitive data, need for customization
Hybrid
Route queries by sensitivity: sensitive data → self-hosted, general queries → API.
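A minimal sketch of such a router, assuming a regex-based PII check and placeholder endpoints (a real system would use a proper classifier and policy engine):

```python
# Sensitivity-based routing sketch. The PII patterns and endpoint URLs are
# placeholders, not a production-grade detector.
import re

PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",   # US SSN
    r"\b\d{16}\b",              # bare card number
]

def contains_pii(text: str) -> bool:
    return any(re.search(p, text) for p in PII_PATTERNS)

def route(prompt: str) -> str:
    # Sensitive data stays on the self-hosted deployment;
    # everything else goes to a managed API.
    if contains_pii(prompt):
        return "http://vllm.internal:8000/v1"   # self-hosted
    return "https://api.openai.com/v1"          # managed API

print(route("Summarize this contract for SSN 123-45-6789"))  # self-hosted
```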
vLLM: Production Inference Server
vLLM is the dominant open-source LLM inference engine:
PagedAttention: Like virtual memory for the KV cache. Sequences share physical GPU memory blocks non-contiguously, dramatically improving throughput and reducing memory fragmentation.
Continuous batching: Start processing new requests immediately as earlier requests finish, rather than waiting for the full batch. Keeps GPUs fully utilized.
OpenAI-compatible API: Drop-in replacement for OpenAI API calls.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 8192 \
    --tensor-parallel-size 2   # use 2 GPUs
```
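Because the server exposes the OpenAI-compatible API, the standard `openai` client works against it unchanged; only the `base_url` (plus a dummy key) differs. This sketch assumes the server above is listening on localhost port 8000:

```python
# Calling a local vLLM server through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```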
Cost Optimization
| Strategy | Savings | Tradeoff |
|---|---|---|
| Use smaller models for simple tasks | 5–50× | Quality for complex tasks |
| Cache frequent queries | 0–90%+ | Storage, staleness |
| Reduce output length | 20–60% | May truncate content |
| Batching | 2–5× throughput | Added latency |
| Prompt optimization | 10–30% | Engineering time |
| Use spot/preemptible GPUs | 60–80% | Interruption risk |
Caching Layers
Exact cache: Hash (system_prompt + user_message) → cache response. Very high hit rate for repeated queries (FAQs, code completions).
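A minimal in-memory sketch of an exact cache; a production version would back this with Redis or memcached and add TTLs. The `generate` callable stands in for whatever LLM call you make:

```python
# Exact-match cache: key is a hash of the full prompt; any byte difference misses.
import hashlib

cache: dict[str, str] = {}   # in production: Redis/memcached with a TTL

def cache_key(system_prompt: str, user_message: str) -> str:
    return hashlib.sha256(f"{system_prompt}\x00{user_message}".encode()).hexdigest()

def get_or_generate(system_prompt: str, user_message: str, generate) -> str:
    key = cache_key(system_prompt, user_message)
    if key not in cache:
        cache[key] = generate(system_prompt, user_message)   # cache miss: call the LLM
    return cache[key]
```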
Semantic cache: Embed the query, search a vector store of cached Q&A pairs. If cosine similarity > 0.95, return cached response. Works for near-duplicate queries.
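A sketch of the lookup step, assuming query embeddings come from some embedding model and using a brute-force scan in place of a real vector store:

```python
# Semantic-cache lookup sketch. Embeddings are assumed to come from any
# encoder (e.g. a sentence-transformers model); a real system would use a vector DB.
import numpy as np

cached_queries: list[np.ndarray] = []   # embeddings of previously seen prompts
cached_answers: list[str] = []          # their generated responses

def semantic_lookup(query_vec: np.ndarray, threshold: float = 0.95) -> str | None:
    for vec, answer in zip(cached_queries, cached_answers):
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim > threshold:
            return answer        # near-duplicate query: reuse the cached response
    return None                  # cache miss: generate, then append to both lists
```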
Prompt prefix cache: Many APIs cache the KV for long system prompts. Structure prompts with stable prefix first to maximize cache hits (saves 50–90% of prefill cost).
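For example, message construction might be ordered so the stable prefix is byte-identical across requests (the prompt strings below are placeholders):

```python
# Keep the long, static parts first so every request shares a byte-identical
# prefix that the provider's prompt/KV cache can reuse.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."   # long, rarely changes
STATIC_FEW_SHOT = "Example 1: ...\nExample 2: ..."                        # also stable

def build_messages(user_query: str, retrieved_context: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT + "\n\n" + STATIC_FEW_SHOT},
        # Volatile content (retrieved docs, the user's question) comes last,
        # after the cacheable prefix.
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_query}"},
    ]
```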
Streaming Architecture
```
Client ←── SSE/WebSocket ←── API Gateway ←── LLM Service
                                  │
                            Rate Limiter
                            Auth/AuthZ
                            Logging
```
Always stream responses to users. Even if total latency is the same, streaming dramatically improves perceived performance.
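Against the vLLM server from earlier, streaming is just `stream=True` on the same OpenAI-compatible endpoint; tokens are printed as they arrive rather than after the full generation (endpoint and model name assumed as above):

```python
# Streaming from the local vLLM server: forward tokens to the user as they
# arrive, so output appears after ~TTFT instead of after the whole response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```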