Apr 23, 2026
I Watched Our AI Pipeline Silently Fail While Kubernetes Said Everything Was Fine
CPU is the wrong signal for LLM workloads. When inference requests queue up, GPU workers saturate and latency spikes — but CPU stays low, so Kubernetes never scales. The fix: use KEDA to scale on queue depth and P95 latency instead. Two triggers, one ScaledObject, and your autoscaler finally responds to real demand instead of misleading metrics.
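The two-trigger setup described above can be sketched as a single KEDA ScaledObject. Everything here is illustrative, not taken from the article: the deployment name, Prometheus address, metric names, and thresholds are all assumptions you would replace with your own.

```yaml
# Hypothetical KEDA ScaledObject — names, queries, and thresholds are
# illustrative assumptions, not values from the original post.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-inference            # assumed Deployment running the GPU workers
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    # Trigger 1: scale on inference queue depth
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(inference_queue_depth)   # assumed metric name
        threshold: "50"                     # target queued requests per replica
    # Trigger 2: scale on P95 request latency (seconds)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: histogram_quantile(0.95, sum(rate(inference_request_duration_seconds_bucket[2m])) by (le))
        threshold: "2"                      # target P95 latency in seconds
```

KEDA evaluates each trigger independently and scales to the largest replica count any of them demands, so either a saturated queue or rising latency can drive scale-out even while CPU stays flat.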
Source: HackerNoon