Module 9: Production & MLOps · Lesson 1 of 3

LLM Deployment Patterns

Getting an LLM to work in a notebook is the easy part. Getting it to work reliably for 10,000 concurrent users at <500ms p95 latency is the hard part.

The Latency Stack

End-to-end latency ≈ network + queue + TTFT + TBT × output_tokens

  • TTFT (Time to First Token): Latency until the first token streams back, dominated by prompt prefill. Drives perceived responsiveness.
  • TBT (Time Between Tokens): Interval between consecutive output tokens once generation starts; its inverse is the decode throughput.
  • Total generation time ≈ TTFT + TBT × output_tokens

Typical production targets:

  • Streaming UI: TTFT < 500ms
  • Batch processing: prioritize throughput and cost per token over latency
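
A quick back-of-the-envelope script makes the formula concrete (the numbers below are hypothetical examples, not measured targets):

# Back-of-the-envelope latency estimate; all inputs are hypothetical example values
def estimate_latency_ms(network_ms, queue_ms, ttft_ms, tbt_ms, output_tokens):
    return network_ms + queue_ms + ttft_ms + tbt_ms * output_tokens

# e.g. 50 ms network, 20 ms queue, 300 ms TTFT, 30 ms/token, 200 output tokens
print(estimate_latency_ms(50, 20, 300, 30, 200))  # 6370 ms total; first token visible after ~370 ms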

Deployment Options

Managed APIs (OpenAI, Anthropic, Google)

  • Pros: Zero ops, world-class models, automatic scaling
  • Cons: Cost at scale, data privacy concerns, vendor dependency
  • Best for: Startups, prototypes, and apps where API spend stays below the cost of running your own infrastructure and ops

Self-Hosted Open Models (vLLM, Ollama, TGI)

  • Pros: Full data control, cost-effective at scale, customizable
  • Cons: GPU infra, ops burden, model quality gap (closing fast with LLaMA 3.3)
  • Best for: High-volume, sensitive data, need for customization

Hybrid

Route queries by sensitivity: sensitive data → self-hosted, general queries → API.
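
A minimal sketch of such a router, assuming both backends speak the OpenAI protocol; the endpoints, model names, and the is_sensitive() check are illustrative placeholders, not a real classifier:

# Hybrid routing sketch: sensitive traffic stays on the self-hosted model,
# everything else goes to a managed API.
from openai import OpenAI

self_hosted = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
managed = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_sensitive(text: str) -> bool:
    # Placeholder: in production this would be a PII detector or policy classifier
    return any(term in text.lower() for term in ("ssn", "patient", "account number"))

def chat(user_message: str):
    if is_sensitive(user_message):
        client, model = self_hosted, "meta-llama/Meta-Llama-3-8B-Instruct"
    else:
        client, model = managed, "gpt-4o-mini"
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )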

vLLM: Production Inference Server

vLLM is the dominant open-source LLM inference engine:

PagedAttention: Like virtual memory for the KV cache. The cache is stored in fixed-size blocks that need not be contiguous in GPU memory, and blocks can be shared across sequences, dramatically improving throughput and reducing memory fragmentation.

Continuous batching: Start processing new requests immediately as earlier requests finish, rather than waiting for the full batch. Keeps GPUs fully utilized.

OpenAI-compatible API: Drop-in replacement for OpenAI API calls.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192 \
  --tensor-parallel-size 2  # Use 2 GPUs
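
Because the endpoint is OpenAI-compatible, the standard openai Python client works against it once you override base_url (port 8000 is vLLM's default; the key can be any string unless you configured one on the server):

# Call the local vLLM server with the standard OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)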

Cost Optimization

Strategy                              Savings           Tradeoff
Use smaller models for simple tasks   5–50×             Quality for complex tasks
Cache frequent queries                0–90%+            Storage, staleness
Reduce output length                  20–60%            May truncate content
Batching                              2–5× throughput   Added latency
Prompt optimization                   10–30%            Engineering time
Use spot/preemptible GPUs             60–80%            Interruption risk

Caching Layers

Exact cache: Hash (system_prompt + user_message) → cache response. Very high hit rate for repeated queries (FAQs, code completions).
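
A minimal in-memory sketch, assuming a generate() callable that actually hits the model:

# Exact-match cache keyed on a hash of (system_prompt + user_message)
import hashlib

_cache: dict[str, str] = {}

def cached_generate(system_prompt: str, user_message: str, generate) -> str:
    key = hashlib.sha256(f"{system_prompt}\x00{user_message}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(system_prompt, user_message)  # only call the model on a miss
    return _cache[key]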

Semantic cache: Embed the query, search a vector store of cached Q&A pairs. If cosine similarity > 0.95, return cached response. Works for near-duplicate queries.
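
A sketch of the lookup, assuming query embeddings are unit-normalized so the dot product equals cosine similarity (a real deployment would use an embedding model plus a vector database instead of a Python list):

# Semantic cache: reuse a stored answer when a new query is close enough in embedding space
import numpy as np

_entries: list[tuple[np.ndarray, str]] = []  # (unit-normalized query embedding, cached answer)

def semantic_lookup(query_embedding: np.ndarray, threshold: float = 0.95):
    for cached_embedding, answer in _entries:
        if float(np.dot(query_embedding, cached_embedding)) > threshold:  # cosine similarity
            return answer
    return None  # cache miss: call the model, then semantic_store() the result

def semantic_store(query_embedding: np.ndarray, answer: str) -> None:
    _entries.append((query_embedding, answer))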

Prompt prefix cache: Many APIs cache the KV for long system prompts. Structure prompts with stable prefix first to maximize cache hits (saves 50–90% of prefill cost).
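
In practice that just means keeping the unchanging parts at the front of every request; a generic sketch (exact cache behavior varies by provider, and the prompt text is illustrative):

# Keep the stable prefix (system prompt, few-shot examples) identical and first,
# so the provider's prefix cache can reuse its prefill across requests.
SYSTEM_PROMPT = "You are a support assistant for ExampleCo. Follow the policies below..."
FEW_SHOT = "Q: How do I reset my password?\nA: Use the 'Forgot password' link...\n"

def build_messages(user_question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n\n" + FEW_SHOT},  # stable prefix
        {"role": "user", "content": user_question},                        # varies per request
    ]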

Streaming Architecture

Client ←─── SSE/WebSocket ←─── API Gateway ←─── LLM Service
                                    │
                               Rate Limiter
                               Auth/AuthZ
                               Logging

Always stream responses to users. Even if total latency is the same, streaming dramatically improves perceived performance.
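
On the client side, streaming from any OpenAI-compatible endpoint (including the vLLM server above) is a one-flag change; a minimal sketch:

# Stream tokens as they arrive instead of waiting for the full completion
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render each token as soon as it arrives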