21 May 2026 09:00 AM
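Request count works as a scaling and capacity signal for typical web apps, but it breaks down when you serve LLMs on Kubernetes. A request with a 200-token prompt and a request carrying 8,000 tokens of RAG context both count as one request, yet they put very different loads on the GPU. Prompt length, output length, and RAG context are all measured in tokens, and KV cache pressure, GPU capacity, latency, and observability follow from those token counts, not from the number of requests.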
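To make the gap concrete, here is a minimal sketch comparing two hypothetical workloads with the same request count. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte fp16/bf16 values, roughly a Llama-3-8B-style model with grouped-query attention) and the per-request token counts are illustrative assumptions, not measurements from any particular deployment. The KV cache estimate uses the standard per-token size, 2 (keys and values) × layers × KV heads × head dimension × bytes per value, and assumes every request is resident at once with no paging, eviction, or prefix sharing.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # user prompt plus any RAG context stuffed into it
    output_tokens: int  # tokens generated by the model


# Assumed model shape (hypothetical, roughly Llama-3-8B-like).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16 / bf16

# Keys and values (factor 2) for every layer and every KV head.
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def load(requests: list[Request]) -> dict[str, float]:
    """Summarize a workload by request count, token count, and KV cache size."""
    tokens = sum(r.prompt_tokens + r.output_tokens for r in requests)
    return {
        "requests": len(requests),
        "tokens": tokens,
        # KV cache needed if all requests are resident at the same time.
        "kv_cache_gib": tokens * KV_BYTES_PER_TOKEN / 2**30,
    }


# Two hypothetical workloads with identical request counts.
chat = [Request(prompt_tokens=200, output_tokens=300) for _ in range(10)]
rag = [Request(prompt_tokens=8_000, output_tokens=500) for _ in range(10)]

if __name__ == "__main__":
    for name, workload in (("chat", chat), ("rag", rag)):
        stats = load(workload)
        print(
            f"{name}: {stats['requests']} requests, {stats['tokens']} tokens, "
            f"{stats['kv_cache_gib']:.2f} GiB KV cache"
        )
```

Both workloads report 10 requests, but under these assumptions the RAG workload carries 17 times as many tokens (85,000 vs 5,000) and needs about 10.4 GiB of KV cache versus about 0.6 GiB for the chat workload. A request-count autoscaler or dashboard would treat them as identical.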