How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?

Series links
A giant model does not "run in a pod."
That sentence sounds wrong if you have spent years thinking in Kubernetes objects. We package software into containers. We run containers in pods. We schedule pods onto nodes. We put Services in front of them. That model works well when the thing inside the container is a web server, worker, queue consumer, or API process.
Then someone says, "Can we host a trillion-parameter model on Kubernetes?"
The honest answer is: yes, but not in the way your brain first pictures it. A trillion-parameter model is not one neat process sitting inside one neat pod, waiting for the kubelet to give it enough CPU and memory. It is a pile of weights, communication patterns, parallel workers, GPU memory limits, interconnect assumptions, and serving-engine decisions. Kubernetes can coordinate the outer shape, but the model itself has to be split.
Part 1 of this series argued that LLM serving is not normal web serving. Part 2 argued that requests are the wrong unit of scale because tokens are the work. Part 3 is about the next uncomfortable step: once the model is large enough, a replica is no longer a pod. A replica may be a group of GPUs, a node, several nodes, or a slice of a much larger GPU cluster working together.
The pod is just the envelope. The model is the distributed system.
Start with the boring math
Before tensor parallelism, pipeline parallelism, expert parallelism, Ray, vLLM, KServe, MPI, NCCL, or any Kubernetes YAML, there is a dumb memory question:
How many bytes do the weights need?
The rough formula is simple:
weight memory = number of parameters x bytes per parameter
That is only the model weights. It does not include KV cache, runtime overhead, CUDA graphs, activations during training, optimizer states, communication buffers, fragmentation, or the serving engine's own memory reservations. But for the first pass, it is enough to kill bad assumptions.
A 1 trillion parameter dense model stored in FP16 or BF16 needs about:
1,000,000,000,000 parameters x 2 bytes = 2,000,000,000,000 bytes
That is roughly 2 TB of raw weight memory.
Not disk. Not object storage. GPU-addressable memory.
If you had 80 GB GPUs, 2 TB of raw weights already needs 25 GPUs before overhead. If you had 141 GB H200-class GPUs, it still needs about 15 GPUs just for the weights. That does not mean the model is usable with exactly that many GPUs. It means this is the floor before the real serving problem begins.
This is where normal Kubernetes thinking starts to mislead people. A pod can request memory. A pod can request nvidia.com/gpu: 8. Kubernetes can place that pod on a node with enough advertised GPU devices. But Kubernetes does not magically make one process treat 25 separate GPUs as one giant GPU. The serving engine and distributed runtime have to do that work.
Kubernetes schedules access to hardware. It does not shard the model for you.
FP16 and BF16 are not small
FP16 and BF16 are often discussed as if they are memory optimizations, and compared with FP32, they are. FP32 uses 4 bytes per parameter. FP16 and BF16 use 2 bytes. Cutting weight memory in half is a big deal.
But at trillion-parameter scale, half of enormous is still enormous.
A 175B parameter model in FP16 or BF16 is about 350 GB of raw weights. That already does not fit on a single common GPU. A 671B parameter model is about 1.34 TB in FP16 or BF16. A 1T dense model is about 2 TB. A 1.8T dense model would be about 3.6 TB.
Quantization changes the math. FP8 brings 1T parameters down to about 1 TB of raw weights. FP4 brings it down to about 500 GB. NVIDIA's trillion-parameter inference write-up uses an example GPT MoE model with 1.8T parameters stored with FP4, where the raw weights are about 900 GB. On 192 GB GPUs, the theoretical minimum just to hold those weights is five GPUs.
That sounds surprisingly small until you remember the word "minimum."
Five GPUs may hold the weights. They may not generate tokens fast enough. They may not leave enough room for KV cache. For long-context models, KV cache can consume tens of gigabytes per busy replica, and at real concurrency it competes directly with weights for GPU memory. The same five GPUs may also communicate too much. They may deliver terrible time to first token. They may support one beautiful demo and then fall apart under real traffic.
The memory floor tells you whether a deployment is possible. It does not tell you whether it is good.
The model gets split because memory is only one problem
When a model does not fit on one GPU, there are two broad things you can do.
You can make the model smaller: quantize it, distill it, pick a smaller checkpoint, reduce context length, use adapters, or route some traffic to a cheaper model. Those are valid and often the right production choices, but that is not the focus of this part.
Or you can split the model across GPUs.
Splitting is where the word "replica" becomes slippery. In a normal web app, one replica is usually one pod. In LLM serving, one model replica may require multiple GPU workers that must cooperate for every token. If one worker is slow, missing, placed badly, or stuck behind a bad network path, the whole replica suffers.
That is why large-model serving feels less like "run N pods" and more like "assemble a tiny supercomputer for each serving replica."
There are three important forms of model splitting to understand:
- Tensor parallelism
- Pipeline parallelism
- Expert parallelism
They are often combined with data parallelism, where you run multiple independent replicas of the sharded model to serve more traffic. Data parallelism is easy to understand once the model replica fits somewhere. The hard part is making one replica exist in the first place.
Tensor parallelism splits the inside of a layer
Tensor parallelism splits individual tensors inside a model layer across GPUs. Instead of putting the full matrix multiplication for a transformer layer on one GPU, the serving engine divides the layer's work across several GPUs and combines the result.
This is useful because transformers have large matrix operations that can be partitioned. Megatron-LM popularized this style of tensor model parallelism for GPT-like models, and vLLM's distributed serving documentation still points to Megatron-LM's tensor parallel algorithm as the implementation basis.
A simple mental model:
One transformer layer
GPU 0: owns one slice of the weight matrix
GPU 1: owns another slice
GPU 2: owns another slice
GPU 3: owns another slice
For every token step, the GPUs compute their slices and then exchange partial results. This is powerful, but it is not free. Tensor parallelism depends heavily on fast GPU-to-GPU communication. Inside a node with NVLink or another high-bandwidth interconnect, it can work well. Across nodes, the communication cost can get ugly.
That is why many practical guides recommend keeping tensor parallelism inside a node when possible. vLLM's scaling guidance gives the same shape: if the model is too large for one GPU but fits on one multi-GPU machine, use tensor parallelism. If you have 4 GPUs in the node, tensor_parallel_size=4 is the obvious starting point.
Tensor parallelism makes one layer wider across GPUs. It helps with memory and per-token compute, but it ties those GPUs together tightly. They are not independent pods anymore. They are pieces of one inference machine.
Pipeline parallelism splits the stack of layers
Pipeline parallelism cuts the model vertically by layers.
Instead of every GPU participating in every layer, one GPU or group of GPUs owns the early layers, another owns the middle layers, and another owns the later layers. A request moves through those stages like work moving through an assembly line.
A rough picture:
Stage 1: layers 1-20 -> GPU group A
Stage 2: layers 21-40 -> GPU group B
Stage 3: layers 41-60 -> GPU group C
Stage 4: layers 61-80 -> GPU group D
Pipeline parallelism is attractive when the model cannot fit within one node. Instead of stretching tensor parallelism across a slow boundary, you can keep tensor parallelism inside each node and use pipeline parallelism across nodes. NVIDIA's Megatron work describes exactly this pattern: tensor parallelism works well within a DGX A100 node, while pipeline parallelism helps scale across nodes because it uses a different communication pattern.
vLLM's current docs give a practical serving version of the same idea. For 2 nodes with 8 GPUs each, set tensor parallelism to 8 and pipeline parallelism to 2. In plain English: split each layer across the 8 GPUs inside a node, then split the model's layers across the 2 nodes.
tensor_parallel_size = GPUs per node
pipeline_parallel_size = number of nodes
That rule is not sacred, but it is a good first mental model.
Pipeline parallelism also introduces its own pain. Pipelines can have bubbles, where some stages sit idle while waiting for work. Training systems fight this with microbatches and scheduling tricks. In inference, the serving engine may keep stages busier by feeding different requests through the pipeline continuously, but that depends on batching, traffic shape, and implementation. The operational point is simple: the more stages you add, the more the model starts behaving like a distributed workflow instead of a containerized API.
Kubernetes can keep the pods alive. The serving engine has to keep the pipeline full.
In that shape, a "replica" is already bigger than the mental model most platform teams start with. The serving replica is not one container. It is a coordinated set of ranks, workers, and devices.
Expert parallelism exists because MoE models are weird
Mixture-of-Experts models add another twist.
A dense model usually uses the same parameters for every token. If the model has 70B parameters, each token flows through that dense stack. If the model has 1T dense parameters, the serving system must deal with a terrifying amount of memory and compute.
MoE models are different. They contain many expert feed-forward networks, and a router chooses which experts handle each token. This creates a model with a huge total parameter count, but only a fraction of those parameters are active for any one token.
This is the line that saves people from a lot of confusion:
Trillion parameters does not always mean trillion-parameter compute per token.
The Switch Transformer paper made this idea famous at large scale. It describes MoE models as sparsely activated: huge parameter counts, but roughly constant computation because each token is routed to a small number of experts. Switch simplified the routing further by sending each token to one expert.
Modern public models show the same idea in a more familiar form. DeepSeek-V3 is reported as a 671B parameter MoE model, but only 37B parameters are activated for each token. That does not make the model "really 37B." The inactive experts still exist. Their weights still need to live somewhere. But the compute path for one token is much smaller than the total parameter count suggests.
This distinction matters for capacity planning. Total parameters drive storage and placement. Active parameters drive per-token compute. Both matter, but they are not the same number.
Expert parallelism is the systems trick that places different experts on different GPUs or nodes. When tokens are routed to experts, the serving system sends token representations to the devices that own those experts, runs the expert computation, and combines the results back into the model flow.
That creates a new bottleneck: token routing and all-to-all communication. If the router sends too much traffic to one expert, that expert becomes hot. If experts are spread across nodes, the network starts carrying token activations around the cluster. If the serving engine does not overlap communication and compute well, the GPUs wait.
MoE is not magic. It trades dense compute for routing, load balancing, memory placement, and communication.
A trillion-parameter MoE is not the same as a trillion-parameter dense model
This is worth slowing down on because marketing numbers blur it.
Imagine two models:
Model A: 1T dense parameters
Model B: 1T total MoE parameters, 50B active per token
Both can be called trillion-parameter models. They are not the same infrastructure problem.
Model A needs the serving system to carry the memory and compute burden of the full dense stack. Every token touches the model in a much more uniform way.
Model B still needs the cluster to store the full set of experts, but each token activates only a subset. The challenge shifts toward routing, expert placement, load balancing, and making sure the right GPUs communicate quickly enough.
This is why a model card's parameter count is only the beginning of the conversation. For serving, you also want to know:
- Is it dense or MoE?
- How many total parameters are there?
- How many parameters are active per token?
- What precision are the weights stored in?
- How long is the context window?
- How large is the KV cache at your expected concurrency?
- Does the serving engine support the model's parallelism pattern well?
- Can your node and network topology support the communication pattern?
If you only ask, "How many parameters?" you will size the cluster badly.
What Kubernetes actually sees
Kubernetes does not see tensor slices, pipeline stages, or experts. It sees pods, containers, resources, nodes, labels, taints, tolerations, Services, volumes, and health checks.
For GPUs, Kubernetes normally depends on the device plugin framework. A vendor device plugin registers resources like nvidia.com/gpu with the kubelet. The kubelet advertises those resources on the node. A pod requests them through resource limits. The scheduler places the pod on a node that can satisfy the request.
That is useful, but it is a lower-level contract than many people assume.
Kubernetes can say:
resources:
limits:
nvidia.com/gpu: 8
It cannot infer that those 8 GPUs should form tensor parallel group 0, that another node should form pipeline stage 1, that experts 0-63 should live on one rank group, or that the network path between two stages is now your latency bottleneck.
Those choices happen in the serving layer: vLLM, TensorRT-LLM, Triton or Dynamo Triton, SGLang, TGI, Ray, KServe, llm-d, custom launch scripts, or whatever stack your team chooses. Kubernetes is still important, but its job is orchestration around the model, not inside the model.
This is the split I find useful:
Kubernetes decides where the workers run.
The serving engine decides how the model is split.
The interconnect decides whether the split is fast enough.
Miss any one of those, and the deployment becomes fragile.
One pod, many pods, or one distributed replica?
There are a few common deployment shapes.
The first is a single pod requesting multiple GPUs on one node. This is the simplest shape for models that fit within one machine. A vLLM server might run with --tensor-parallel-size 4 or --tensor-parallel-size 8, and the pod requests the same number of GPUs. Kubernetes schedules one pod. Inside that pod, the serving engine starts multiple GPU workers.
This is operationally pleasant because the failure domain is clean. The pod is up or down. The node has the GPUs or it does not. The model weights are local or mounted. You still have complexity, but it is contained.
The second shape is one distributed replica spread across multiple pods or nodes. This is what you need when the model or desired serving shape exceeds one node. Now you need coordinated startup, rank assignment, service discovery, identical images, shared model paths or download behavior, and careful placement. If one part of the replica is missing, the replica is not healthy.
This is where Kubernetes starts to need help from higher-level controllers or conventions. StatefulSets, headless Services, Ray clusters, KServe runtimes, LeaderWorkerSet, job-style launchers, or custom operators can all appear depending on the stack. The exact tool matters less than the invariant: the workers are not independent replicas. They are shards of one replica.
The third shape is data parallel replicas of sharded replicas. For example, you may run four independent model replicas, and each replica uses 16 GPUs internally. That gives you 64 GPUs total, but the scheduling unit is not "64 independent pods." It is four coordinated groups.
This is where platform teams need to be very careful with autoscaling language. Scaling from 4 replicas to 5 may mean adding 16 GPUs and starting a full distributed group. It may require model weight loading, rank coordination, cache warmup, and traffic shifting. It is not the same as adding one more stateless web pod.
The network is part of the model now
With normal web services, the network matters. With distributed LLM inference, the network is part of the model's execution path.
Tensor parallelism needs frequent GPU-to-GPU communication. Pipeline parallelism moves activations between stages. Expert parallelism can create all-to-all traffic between tokens and expert owners. NCCL, or the equivalent communication layer in your stack, becomes part of the serving path. Multi-node serving depends on bandwidth, latency, topology, and how well the serving engine overlaps communication with compute.
This is why "we have 32 GPUs in the cluster" is not enough information. Are they eight GPUs in four nodes? Four GPUs in eight nodes? Do they have NVLink inside the node? What is the NIC? Is RDMA available? Are the nodes in the same placement group or rack? Are you crossing noisy network boundaries? Is the storage path going to make every cold start painful?
A cluster with the same GPU count can behave like a different machine depending on topology.
For smaller models, Kubernetes scheduling may feel like bin packing. For giant models, it becomes topology-aware placement. Any 16 GPUs will not do. You need 16 GPUs arranged in a way that matches the communication pattern of the model.
That is one reason large AI clusters often feel more rigid than normal Kubernetes clusters. The workload cares about where things are, not only whether resources exist.
Why loading the model is its own event
A 2 TB model is painful while serving. It is also painful while starting.
The weights have to come from somewhere: container image layers, persistent volumes, object storage, local NVMe, a model cache, or a preloaded node image. The file format matters too. A memory-mappable format like safetensors behaves differently from formats that need heavier deserialization before the model is usable. Pulling, reading, mapping, transferring to GPU memory, initializing kernels, building CUDA graphs, and warming the serving engine can dominate startup time.
This changes how you think about pod restarts and autoscaling.
In a web app, a new pod might become useful in seconds. For a giant LLM, a new replica may take minutes. If the weights are remote and the cache is cold, it can be worse. If the replica spans nodes, all workers need to agree on their ranks and become ready together. One slow worker can delay the group.
Kubernetes readiness probes are necessary, but they are not the whole story. A pod can exist before the model is loaded. A container can be running before the GPU workers are ready. A distributed group can have seven healthy workers and still be unusable because the eighth worker failed.
That is why production LLM serving often needs warm pools, minimum replicas, local weight caches, careful rollout strategy, and boring operational patience. The deployment is not ready when the pod starts. It is ready when the model can actually generate tokens at the latency you promised.
A practical sizing walkthrough
Suppose someone asks for a 1T parameter model on Kubernetes.
The first question is not YAML. It is model shape.
If it is a dense 1T model in BF16, you start with about 2 TB of raw weights. On 80 GB GPUs, that is 25 GPUs before overhead. In reality, you probably need more. You also need room for KV cache, which grows with context length and concurrency. If the model supports lower precision weight formats without unacceptable quality loss, quantization may reduce the floor, but it does not remove the distributed serving problem.
If it is a 1T MoE model, the next questions change. How many experts exist? How many are active per token? Are the attention layers dense while the feed-forward layers are sparse? What does the serving engine support: tensor parallel attention, expert parallel MoE layers, data parallel attention, or some hybrid? How much all-to-all traffic appears when real prompts arrive?
Then you map it to topology.
If one node has 8 GPUs with fast intra-node links, tensor parallelism across 8 GPUs is a natural first shape. If the model needs more than one node, you may use tensor parallelism inside each node and pipeline parallelism across nodes. If it is MoE, you may place experts across devices using expert parallelism and still combine that with tensor or data parallel attention.
Only after that does Kubernetes enter the center of the conversation.
You need nodes labeled by GPU type and topology. You need the NVIDIA device plugin or GPU Operator equivalent exposing devices. You need pod specs or higher-level controllers that request the right GPU count. You need placement rules so the workers land together. You need model storage that does not turn every restart into a download storm. You need readiness that understands distributed health. You need metrics that report tokens, KV cache, queueing, and per-worker failures, not just pod CPU.
The YAML is the last mile. The architecture decision happened before it.
Where teams usually get surprised
The first surprise is that GPU count is not capacity. It is potential capacity. Capacity appears only when the GPUs are arranged, connected, loaded, and driven correctly.
The second surprise is that one model replica can be a group. Platform teams like replicas because replicas sound independent. With large LLMs, the word can hide coordination. A "replica" might be 8 pods. Or 16 GPUs. Or 2 nodes. Or a combination of tensor, pipeline, and expert parallel ranks that must agree before anything works.
The third surprise is that MoE parameter counts are easy to misread. A 671B total parameter model with 37B active parameters per token is not lying. It is telling you two different infrastructure facts at once. You need enough memory and placement for the large total model, but the per-token compute path is sparse.
The fourth surprise is that the scheduler is not the serving system. Kubernetes can place pods on GPU nodes. It cannot decide the model parallel strategy. It cannot make a bad tensor-parallel layout fast. It cannot fix a network topology that does not match the workload.
This is why serious LLM-on-Kubernetes work ends up crossing boundaries that platform teams used to keep separate: scheduler behavior, GPU topology, model architecture, serving-engine internals, storage layout, rollout strategy, and latency SLOs.
The Kubernetes mental model that works better
Do not picture a trillion-parameter model as a container image.
Picture it as a distributed runtime that Kubernetes happens to host.
The model weights are split. The computation is split. The memory pressure is split. The failure modes are split. The serving engine owns the model-parallel details. Kubernetes owns the outer lifecycle: placement, resources, health, rollout, identity, networking, storage, and integration with the rest of the platform.
That does not make Kubernetes the wrong tool. It makes Kubernetes the substrate, not the magic trick.
For smaller LLMs, you can get away with thinking "one pod equals one model server." For larger models, use a different sentence:
One serving replica is a coordinated GPU group.
That group may live in one pod on one node. It may live across several pods and nodes. It may combine tensor parallelism, pipeline parallelism, expert parallelism, and data parallelism. But if it takes all of those workers to generate one token stream, treat the group as the unit you operate.
That shift makes the rest of the architecture less surprising.
Autoscaling becomes group scaling. Rollouts become distributed rollouts. Readiness becomes model readiness, not process readiness. Capacity planning starts with bytes and tokens, not pod count. Scheduling starts caring about topology. Observability moves from CPU and memory to TTFT, TPOT, tokens per second, KV cache, queue depth, per-rank health, and GPU utilization.
A giant model does not run in a pod. It runs across a shape.
Kubernetes can manage that shape, but only if you tell it what the shape is.
Continue the series
The next part goes one layer lower into the machines themselves: GPU nodes, device plugins, feature discovery, MIG, MPS, time slicing, labels, taints, and what Kubernetes must know before it schedules an LLM.
If you are working through LLM serving on Kubernetes, subscribe to get the next part. I am also putting together a free LLM Serving on Kubernetes Production Readiness Checklist that turns these ideas into a practical review path for teams.
And if your team is already trying to serve large models on Kubernetes, this is the kind of architecture decision worth reviewing before the cloud bill becomes the incident report.
Sources worth reading
- NVIDIA: Demystifying AI Inference Deployments for Trillion Parameter Large Language Models
- NVIDIA: Scaling Language Model Training to a Trillion Parameters Using Megatron
- vLLM documentation: Parallelism and Scaling
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- DeepSeek-V3 Technical Report
- Kubernetes documentation: Device Plugins
Enjoyed this post?
Get AI + DevOps insights delivered to your inbox. No spam, unsubscribe anytime.