Tagged with

Kubernetes

4 June 2026 05:29 PM

Before Kubernetes can schedule LLM workloads onto GPUs, the node has to expose the right device resources, labels, runtime support, metrics, sharing mode, and operational boundaries. This article explains the GPU node setup behind serious LLM deployments on Kubernetes.

28 May 2026 09:00 AM

A practical look at why giant LLMs do not simply run inside one pod: weight memory math, FP16 and BF16 cost, tensor parallelism, pipeline parallelism, expert parallelism, MoE active parameters, and what Kubernetes is actually scheduling.

21 May 2026 09:00 AM

Request count works for normal web apps, but it breaks down when you serve LLMs on Kubernetes. Prompt length, output length, RAG context, KV cache pressure, GPU capacity, latency, and observability are all driven by tokens, not requests.

14 May 2026 09:00 AM

A practical introduction to why LLM serving breaks the usual web-app scaling playbook: requests become token streams, latency splits into TTFT and TPOT, replicas may span GPUs or nodes, memory becomes KV cache, and autoscaling needs workload-aware signals instead of CPU alone.

26 April 2026 07:49 AM

In the previous part, we set up cert-manager on a Kubernetes cluster and issued SSL certif...

24 October 2025 07:49 AM

Learn how to enable Basic auth for a Prometheus ingress behind an ALB in Kubernetes by adding an NGINX proxy sidecar to the kube-prometheus-stack Helm chart. Full guide with YAML examples, annotations, and best practices.