Module 9: Production & MLOps
LLM Deployment Patterns
Getting an LLM to work in a notebook is the easy part. Getting it to work reliably for 10,000 concurrent users at <500ms p95 latency is the hard part.
The Latency Stack
End-to-end latency = network + queue + TTFT + TBT × output_tokens
- TTFT (Time to First Token): Latency until the first token streams. Drives perceived responsiveness.
- TBT (Time Between Tokens): Interval between successive output tokens once generation starts. Determines streaming speed and total generation time.
- Total generation time: ~TTFT + TBT × output_tokens
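For a back-of-the-envelope budget, plugging illustrative numbers into the formula makes the tradeoff concrete (the values below are assumptions, not benchmarks):

```python
# Rough latency-budget estimate from the components above (illustrative numbers).
network_ms = 50        # round-trip to the API/gateway
queue_ms = 30          # time waiting for a free batch slot
ttft_ms = 300          # prefill + first decoded token
tbt_ms = 25            # per-token decode time
output_tokens = 400

total_ms = network_ms + queue_ms + ttft_ms + tbt_ms * output_tokens
print(f"End-to-end: {total_ms / 1000:.2f} s")                          # 10.38 s
print(f"Wait until first token: {network_ms + queue_ms + ttft_ms} ms")  # 380 ms
```

Note how the perceived wait (time to first token) is a tiny fraction of the total: this is why streaming matters so much for interactive UIs.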
Typical production targets:
- Streaming UI: TTFT < 500ms
- Batch processing: latency matters less; optimize throughput and cost per token
Deployment Options
Managed APIs (OpenAI, Anthropic, Google)
- Pros: Zero ops, world-class models, automatic scaling
- Cons: Cost at scale, data privacy concerns, vendor dependency
- Best for: Startups, prototypes, and apps whose volume doesn't yet justify the ops cost of self-hosting
Self-Hosted Open Models (vLLM, Ollama, TGI)
- Pros: Full data control, cost-effective at scale, customizable
- Cons: GPU infra, ops burden, model quality gap (closing fast with Llama 3.3)
- Best for: High-volume, sensitive data, need for customization
Hybrid
Route queries by sensitivity: sensitive data → self-hosted, general queries → API.
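A minimal sketch of such a router, assuming a regex-based PII check and placeholder endpoints (a real system would use a proper classifier and policy engine):

```python
# Sensitivity-based routing sketch. The PII patterns and endpoint URLs are
# placeholders, not a production-grade detector.
import re

PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",   # US SSN
    r"\b\d{16}\b",              # bare card number
]

def contains_pii(text: str) -> bool:
    return any(re.search(p, text) for p in PII_PATTERNS)

def route(prompt: str) -> str:
    # Sensitive data stays on the self-hosted deployment;
    # everything else goes to a managed API.
    if contains_pii(prompt):
        return "http://vllm.internal:8000/v1"   # self-hosted
    return "https://api.openai.com/v1"          # managed API

print(route("Summarize this contract for SSN 123-45-6789"))  # self-hosted
```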
vLLM: Production Inference Server
vLLM is the dominant open-source LLM inference engine:
PagedAttention: Like virtual memory for the KV cache. Sequences share physical GPU memory blocks non-contiguously, dramatically improving throughput and reducing memory fragmentation.
Continuous batching: Start processing new requests immediately as earlier requests finish, rather than waiting for the full batch. Keeps GPUs fully utilized.
OpenAI-compatible API: Drop-in replacement for OpenAI API calls.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 8192 \
    --tensor-parallel-size 2   # use 2 GPUs
```
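Because the server exposes the OpenAI-compatible API, the standard `openai` client works against it unchanged; only the `base_url` (plus a dummy key) differs. This sketch assumes the server above is listening on localhost port 8000:

```python
# Calling a local vLLM server through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```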
Cost Optimization
| Strategy | Savings | Tradeoff |
|---|---|---|
| Use smaller models for simple tasks | 5–50× | Quality for complex tasks |
| Cache frequent queries | 0–90%+ | Storage, staleness |
| Reduce output length | 20–60% | May truncate content |
| Batching | 2–5× throughput | Added latency |
| Prompt optimization | 10–30% | Engineering time |
| Use spot/preemptible GPUs | 60–80% | Interruption risk |
Caching Layers
Exact cache: Hash (system_prompt + user_message) → cache response. Very high hit rate for repeated queries (FAQs, code completions).
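A minimal in-memory sketch of an exact cache; a production version would back this with Redis or memcached and add TTLs. The `generate` callable stands in for whatever LLM call you make:

```python
# Exact-match cache: key is a hash of the full prompt; any byte difference misses.
import hashlib

cache: dict[str, str] = {}   # in production: Redis/memcached with a TTL

def cache_key(system_prompt: str, user_message: str) -> str:
    return hashlib.sha256(f"{system_prompt}\x00{user_message}".encode()).hexdigest()

def get_or_generate(system_prompt: str, user_message: str, generate) -> str:
    key = cache_key(system_prompt, user_message)
    if key not in cache:
        cache[key] = generate(system_prompt, user_message)   # cache miss: call the LLM
    return cache[key]
```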
Semantic cache: Embed the query, search a vector store of cached Q&A pairs. If cosine similarity > 0.95, return cached response. Works for near-duplicate queries.
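A sketch of the lookup step, assuming query embeddings come from some embedding model and using a brute-force scan in place of a real vector store:

```python
# Semantic-cache lookup sketch. Embeddings are assumed to come from any
# encoder (e.g. a sentence-transformers model); a real system would use a vector DB.
import numpy as np

cached_queries: list[np.ndarray] = []   # embeddings of previously seen prompts
cached_answers: list[str] = []          # their generated responses

def semantic_lookup(query_vec: np.ndarray, threshold: float = 0.95) -> str | None:
    for vec, answer in zip(cached_queries, cached_answers):
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim > threshold:
            return answer        # near-duplicate query: reuse the cached response
    return None                  # cache miss: generate, then append to both lists
```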
Prompt prefix cache: Many APIs cache the KV for long system prompts. Structure prompts with stable prefix first to maximize cache hits (saves 50–90% of prefill cost).
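For example, message construction might be ordered so the stable prefix is byte-identical across requests (the prompt strings below are placeholders):

```python
# Keep the long, static parts first so every request shares a byte-identical
# prefix that the provider's prompt/KV cache can reuse.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."   # long, rarely changes
STATIC_FEW_SHOT = "Example 1: ...\nExample 2: ..."                        # also stable

def build_messages(user_query: str, retrieved_context: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT + "\n\n" + STATIC_FEW_SHOT},
        # Volatile content (retrieved docs, the user's question) comes last,
        # after the cacheable prefix.
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_query}"},
    ]
```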
Streaming Architecture
```
Client ←── SSE/WebSocket ←── API Gateway ←── LLM Service
                                  │
                            Rate Limiter
                            Auth/AuthZ
                            Logging
```
Always stream responses to users. Even if total latency is the same, streaming dramatically improves perceived performance.
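Against the vLLM server from earlier, streaming is just `stream=True` on the same OpenAI-compatible endpoint; tokens are printed as they arrive rather than after the full generation (endpoint and model name assumed as above):

```python
# Streaming from the local vLLM server: forward tokens to the user as they
# arrive, so output appears after ~TTFT instead of after the whole response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```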