Your First LLM API on Kubernetes: From Model to Curl Request

Series links
Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM
Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes
Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?
Part 4: Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes
Part 5: OpenAI Already Told Us the Kubernetes Scaling Story, Most People Just Did Not Read It Closely

So far in this series, we have covered the mental model, tokens, model size, GPU node readiness, and OpenAI's Kubernetes scaling lessons.

Now we should run something.

In this part, we will deploy an actual model on a Kubernetes GPU node, expose it as an OpenAI-compatible API, and call it with curl. The model is:

Qwen/Qwen2.5-1.5B-Instruct

That model is small enough for a first single-GPU walkthrough, but still behaves like a real chat model. If your GPU is very small, try Qwen/Qwen2.5-0.5B-Instruct. If you have more memory and want a bigger test, try Qwen/Qwen2.5-7B-Instruct.

Do not start with the biggest model you can name. Start with a model your node can actually load. The goal here is not benchmark glory. The goal is to get from Kubernetes GPU capacity to a working LLM API request.
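
A rough weight-size estimate helps decide what "a model your node can actually load" means. The sketch below assumes 16-bit weights (2 bytes per parameter) and reads the parameter counts off the model names. Both are assumptions rather than figures from the model cards, and the function name is made up for this example. It covers weights only, not the KV cache or runtime overhead, so treat the result as a floor.

```shell
# Rough estimate of GPU memory needed for model weights alone.
# Assumptions: parameter count given in billions, 2 bytes per parameter
# (FP16/BF16) unless a second argument says otherwise.
# KV cache, activations, and CUDA overhead are NOT included.
estimate_weights_gib() {
  params_billion=$1
  bytes_per_param=${2:-2}
  awk -v p="$params_billion" -v b="$bytes_per_param" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / (1024 * 1024 * 1024) }'
}

for size in 0.5 1.5 7; do
  echo "Qwen2.5-${size}B-Instruct: ~$(estimate_weights_gib "$size") GiB of weights"
done
```

Under those assumptions the three models come out at roughly 0.9, 2.8, and 13.0 GiB of weights. The serving engine needs headroom on top of that.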

What vLLM is doing in this setup

Kubernetes is not serving the model by itself. Kubernetes schedules the pod, gives it networking, injects the Secret as environment variables, and has the NVIDIA device plugin allocate a GPU to the container. After that, the model server inside the container has to do the LLM-specific work.

vLLM is that model server in this walkthrough. It downloads the model weights, loads them into GPU memory, starts an HTTP server, accepts OpenAI-compatible requests, batches work internally, runs the model, and streams or returns generated tokens.

That distinction matters. The Kubernetes Deployment does not magically become an LLM API because it has nvidia.com/gpu: 1. It becomes an LLM API because the container starts a serving engine that knows how to load a Hugging Face model and expose routes like /v1/chat/completions.

vLLM is a good first serving engine because it hides a lot of ugly details without hiding the shape from you. You still see the model name, GPU request, port, token Secret, logs, Service, and curl request. But you do not have to write your own batching loop, tokenizer path, HTTP server, or OpenAI-compatible API wrapper just to prove the deployment works.

vLLM is the engine. The thing we care about is the model API it serves.

Prerequisites

I am assuming you already completed the GPU node setup from Part 4. That means the NVIDIA driver stack, container runtime, GPU Operator or NVIDIA device plugin, labels, and basic GPU checks are already working.

We are not reinstalling the GPU Operator here. Before deploying the model, confirm Kubernetes can see GPU capacity:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

A useful output looks like this:

NAME            GPU
gpu-worker-01   1

If the GPU column is empty, <none>, or missing, stop here. Kubernetes cannot schedule this workload until the node advertises nvidia.com/gpu.
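
If you want this check to fail loudly in a script instead of relying on eyeballing the table, a small filter is enough. This is a minimal sketch: gpu_nodes is a hypothetical helper name, and it assumes the two-column NAME/GPU layout shown above.

```shell
# Reads the two-column NAME/GPU table on stdin and prints the nodes that
# advertise at least one allocatable GPU. Exits non-zero if none do.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1; found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Usage against a live cluster (the quotes keep the backslash for kubectl):
#   kubectl get nodes \
#     "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
#     | gpu_nodes || echo "No node advertises nvidia.com/gpu yet" >&2
```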

Create a Hugging Face token first

Even though Qwen/Qwen2.5-1.5B-Instruct is public, we will still use a Hugging Face token. That is intentional.

Real teams often start with a public model and later swap to a gated model, private model, licensed model, or organization repository. If the token path is already part of the Deployment, that swap is much less annoying.

Create a token first:

Open the official Hugging Face token docs: https://huggingface.co/docs/hub/security-tokens
Create a token with read access.
Copy the token value and keep it ready.

From this point onward, I will assume you have the token value. Do not paste it into Git. Do not put it directly in a Deployment manifest. Put it in a Kubernetes Secret.

Create the namespace and Secret

Keep the first LLM workload out of the default namespace:

kubectl create namespace llm-demo

Set the token in your shell:

export HF_TOKEN="hf_your_token_here"

Create the Secret:

kubectl create secret generic hf-token \
  -n llm-demo \
  --from-literal=HF_TOKEN="${HF_TOKEN}"

Check that it exists:

kubectl get secret hf-token -n llm-demo

Expected shape:

NAME       TYPE     DATA   AGE
hf-token   Opaque   1      10s

Existence is enough. Do not print the token back unless you have a specific reason.
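
A Secret that exists can still hold a bad value: an empty variable, the placeholder, or a token pasted with a stray space. A pre-flight check catches those before the pod ever tries to download anything. This is a sketch under assumptions: check_hf_token is a hypothetical helper, and it assumes tokens start with hf_, as the placeholder above suggests.

```shell
# Hypothetical pre-flight check for the token before it goes into a Secret.
# Assumption: Hugging Face user access tokens start with "hf_".
check_hf_token() {
  token=$1
  case "$token" in
    "")                 echo "HF_TOKEN is empty" >&2; return 1 ;;
    hf_your_token_here) echo "HF_TOKEN is still the placeholder" >&2; return 1 ;;
    *[[:space:]]*)      echo "HF_TOKEN contains whitespace" >&2; return 1 ;;
    hf_?*)              return 0 ;;
    *)                  echo "HF_TOKEN does not start with hf_" >&2; return 1 ;;
  esac
}

# Usage:
#   check_hf_token "${HF_TOKEN:-}" && kubectl create secret generic hf-token \
#     -n llm-demo --from-literal=HF_TOKEN="${HF_TOKEN}"
```

The check never prints the token itself, only the reason it was rejected.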

Deploy the model API

vLLM gives us the model server and the OpenAI-compatible HTTP API. The Kubernetes pattern is documented in the vLLM Kubernetes docs, and the API shape is documented in the vLLM OpenAI-compatible server docs.

Create qwen-vllm.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          command:
            - vllm
            - serve
            - Qwen/Qwen2.5-1.5B-Instruct
          args:
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  selector:
    app: qwen-vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000

A few details matter.

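The pod asks for one GPU with nvidia.com/gpu: 1 under limits. For extended resources such as GPUs, Kubernetes treats the limit as the request, so setting it once is enough. That is what makes this schedulable as a GPU workload. The token appears as both HF_TOKEN and HUGGING_FACE_HUB_TOKEN because different libraries and examples use different names: HF_TOKEN is the current name in huggingface_hub, and HUGGING_FACE_HUB_TOKEN is the older one that many examples still use. Both point to the same Secret value.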

The /dev/shm mount is there because model servers often use shared memory heavily. The default shared memory limit inside a container is small (commonly 64 MiB), and running out of it can produce strange failures. A memory-backed emptyDir keeps the first deployment boring.

When this pod starts, vLLM does roughly five things. It reads the model name from the command, uses the Hugging Face token to access the repository, downloads or reuses the model files, initializes the tokenizer and model runtime, then starts the API server on port 8000. Only after that finishes is the API useful.

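For production, pin the vllm/vllm-openai image to a specific version tag instead of using latest. With imagePullPolicy: IfNotPresent, each node keeps running whichever latest image it cached first, so two nodes can end up on different vLLM versions. For this walkthrough, latest keeps the example readable.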
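
Before this manifest moves anywhere near production, it is worth checking for leftover latest tags mechanically. The helper below is a minimal sketch with a made-up name, find_latest_tags. It is a plain text match, so it only catches unquoted image lines of the form used in this article.

```shell
# Lists image references in a manifest that end in ":latest", with line numbers.
# Returns 0 if any are found, 1 if no such line exists.
# Limitation: plain text match; quoted image values are not detected.
find_latest_tags() {
  grep -nE 'image:[[:space:]]*[^[:space:]]+:latest[[:space:]]*$' "$1"
}

# Usage:
#   find_latest_tags qwen-vllm.yaml && echo "Pin these images before production" >&2
```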

Apply it:

kubectl apply -f qwen-vllm.yaml

Expected output:

deployment.apps/qwen-vllm created
service/qwen-vllm created

Watch startup properly

Watch the pod:

kubectl get pods -n llm-demo -w

You may see:

NAME                         READY   STATUS              RESTARTS   AGE
qwen-vllm-6c9f7d8c9d-x9v2m   0/1     Pending             0          3s
qwen-vllm-6c9f7d8c9d-x9v2m   0/1     ContainerCreating   0          15s
qwen-vllm-6c9f7d8c9d-x9v2m   1/1     Running             0          2m

Do not celebrate too early.

Running is not the same as ready to serve. This manifest has no readiness probe, so Kubernetes reports 1/1 as soon as the container process starts. At that point the model may still be downloading, CUDA may still be initializing, weights may still be loading, or vLLM may still be preparing the serving engine. The first start is usually the slowest because the weights have to be downloaded.

Follow the logs:

kubectl logs -n llm-demo -f deployment/qwen-vllm

You are looking for the server to finish loading the model and listen on port 8000. The exact log lines vary by vLLM version. If logs are still busy, wait. If they show a clear error, jump to the troubleshooting table below.
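
If you would rather have a script tell you when the server is up than watch logs, you can poll an HTTP route until it answers. The sketch below assumes the server exposes a health route. vLLM's OpenAI-compatible server provides GET /health as far as I know, but check the docs for your version. The function name and defaults are my own.

```shell
# Polls a URL until it returns HTTP 200 or the attempts run out.
# Arguments: URL, max attempts (default 60), seconds between attempts (default 5).
# Assumption: the target exposes a route that returns 200 once it can serve.
wait_for_http() {
  url=$1
  attempts=${2:-60}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints 000 and exits non-zero while nothing is listening yet.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Usage, once the port-forward from the next section is running:
#   wait_for_http http://127.0.0.1:8000/health
```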

Port-forward the Service

For the first test, do not create public ingress. Do not add DNS. Do not put it behind an internet-facing load balancer.

Use port-forward:

kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000

Keep that command running. You should see:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

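Now local port 8000 reaches the vLLM pod. When you port-forward to a Service, kubectl picks one pod behind it and tunnels directly to that pod, so this path confirms the Service selector matches but does not exercise Service load balancing.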

Send the first curl request

In another terminal, call the OpenAI-compatible chat endpoint:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise Kubernetes assistant."
      },
      {
        "role": "user",
        "content": "Explain what a Kubernetes Service does in two sentences."
      }
    ],
    "max_tokens": 120,
    "temperature": 0.2
  }'

Why does the curl request include the model name again?

This part looks redundant at first:

"model": "Qwen/Qwen2.5-1.5B-Instruct"

We already gave the model name to vllm serve in the Deployment. That tells the server which model to load into memory. The model field in the curl request is part of the OpenAI-compatible API contract. Clients send it so the server knows which served model the request is targeting.

In this article, the server has only one model, so the value feels repetitive. In real systems, the same API style may sit behind routers, gateways, aliases, multiple deployments, or clients that can switch between models. Keeping the field means curl, OpenAI SDK code, and later gateway setup all follow the same shape.

For the first run, keep the value identical to the model passed to vllm serve. Later, vLLM can expose a different client-facing name through its --served-model-name flag, but that is extra complexity we do not need yet.

A successful response will be JSON. The exact wording will differ, but the shape should look familiar:

{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "A Kubernetes Service provides a stable network endpoint for a set of Pods, even as those Pods are created, deleted, or replaced. It selects Pods using labels and forwards traffic to the matching backends."
      }
    }
  ]
}

That is the moment the deployment becomes real. The request reached your model server, vLLM handled the OpenAI-compatible route, the model generated text, and the response came back through Kubernetes. Not a diagram, not a promise. A model answered through an API running inside the cluster.
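
To turn "the shape should look familiar" into something a script can check, a text match on the object field is enough for a smoke test. This is a sketch, not a JSON parser, and is_chat_completion is a hypothetical helper name. The body.json file in the usage line is also hypothetical and stands for the request body shown earlier.

```shell
# Reads a response body on stdin and succeeds only if it contains
# "object": "chat.completion". Plain text match, no JSON parsing.
is_chat_completion() {
  grep -Eq '"object"[[:space:]]*:[[:space:]]*"chat\.completion"'
}

# Usage:
#   curl -s http://127.0.0.1:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @body.json \
#     | is_chat_completion && echo "model answered"
```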

Swapping the model

To try the smaller model, change the served model:

command:
  - vllm
  - serve
  - Qwen/Qwen2.5-0.5B-Instruct

Then change the curl body too:

"model": "Qwen/Qwen2.5-0.5B-Instruct"

For a larger test, use Qwen/Qwen2.5-7B-Instruct in both places.

Whichever size you pick, change the served model and the request body together. Today, remove avoidable debugging.
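
One way to keep the two names from drifting apart is to build the request body from a single variable. The helper below is a minimal sketch with a made-up name, chat_request_body. It does no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
# Builds a chat request body from one model name and one user prompt.
# Limitation: no JSON escaping; keep quotes and backslashes out of the prompt.
chat_request_body() {
  model=$1
  prompt=$2
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 120, "temperature": 0.2}\n' \
    "$model" "$prompt"
}

MODEL="Qwen/Qwen2.5-0.5B-Instruct"
chat_request_body "$MODEL" "Explain what a Kubernetes Service does in two sentences."

# Usage with the port-forward running:
#   chat_request_body "$MODEL" "..." \
#     | curl -s http://127.0.0.1:8000/v1/chat/completions \
#         -H "Content-Type: application/json" -d @-
```

Set MODEL to the same string you passed to vllm serve and the request cannot disagree with the server.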

What happened

Kubernetes scheduled a pod onto a node that advertises nvidia.com/gpu. The NVIDIA device plugin made the GPU available to the container. The Hugging Face token let the container pull the model. vLLM loaded the model onto the GPU and started an HTTP server on port 8000. The Service gave the pod a stable in-cluster endpoint. Port-forward gave us a safe local path. Curl proved the API could answer through /v1/chat/completions.

That is the basic loop every LLM platform needs before it becomes fancy:

Can Kubernetes schedule the workload onto a GPU?
Can the container see the GPU?
Can the model server download and load the model?
Can the API route accept a request?
Can the model generate a response?
Can you observe failures when any of those steps break?

If this loop is unreliable, autoscaling and gateways will not save you. They will only hide the problem for a while.

Troubleshooting

| Symptom | What it usually means | What to check |
| --- | --- | --- |
| Pod stuck in `Pending` | Kubernetes cannot find a node that satisfies the pod's requests | Run `kubectl describe pod -n llm-demo <pod-name>` and read the scheduler events. Confirm GPU capacity exists. |
| `nvidia.com/gpu` missing from the node | GPU Operator or device plugin path is broken | Re-run the GPU visibility command and go back to Part 4 before continuing. |
| Hugging Face download fails | Token is missing, wrong, expired, or lacks model access | Recreate the token, update the Secret, then run `kubectl rollout restart deployment/qwen-vllm -n llm-demo`. |
| CUDA initialization error | Driver, runtime, image, or node stack mismatch | Check pod logs, GPU Operator status, driver version, and a simple CUDA test pod. |
| Pod crashes with OOM (CUDA out-of-memory error in the logs, or `OOMKilled` status) | Model or runtime needs more GPU or host memory | Try `Qwen/Qwen2.5-0.5B-Instruct`, use a larger GPU, or tune model/runtime settings later. |
| `curl: connection refused` | Server is not ready or port-forward is not running | Check logs, keep port-forward running, and verify `kubectl get svc -n llm-demo`. |
| Model name mismatch | Request model differs from served model | Make the curl `model` value match the `vllm serve` model. |

The most common mistake is treating Running as the finish line. It is not. For model serving, readiness is tied to download, GPU initialization, model loading, and server startup. Watch logs, not just pod phase.
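
Most rows in the table start with the same three views: pod status, recent events, and container logs. The sketch below bundles them. The triage name is my own, and it assumes the Deployment name and the app label value are the same string, as they are in this article's manifest.

```shell
# Collects the three views that usually explain a stuck model pod:
# pod status, recent namespace events, and the last container log lines.
# Assumption: the Deployment name equals the value of the "app" label.
triage() {
  ns=$1
  app=$2
  echo "== pods =="
  kubectl get pods -n "$ns" -l "app=$app" -o wide
  echo "== events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 20
  echo "== logs =="
  kubectl logs -n "$ns" "deployment/$app" --tail=50
}

# Usage: triage llm-demo qwen-vllm
```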

Clean up

If this was only a test, delete the namespace:

kubectl delete namespace llm-demo

That removes the Deployment, Service, and Secret. If you keep experimenting, remember that a GPU pod can hold expensive capacity even when nobody is sending requests.
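
If you want to release the GPU between experiments without deleting everything, scaling the Deployment to zero is a lighter option. This is a minimal sketch. The function name is hypothetical, and the Deployment and namespace names are the ones used in this article.

```shell
# Scales the Deployment to the given replica count.
# 0 releases the GPU; the Service and Secret stay in place.
set_model_replicas() {
  replicas=$1
  case "$replicas" in
    ''|*[!0-9]*) echo "replicas must be a non-negative integer" >&2; return 1 ;;
  esac
  kubectl scale deployment/qwen-vllm -n llm-demo --replicas="$replicas"
}

# Usage:
#   set_model_replicas 0   # release the GPU
#   set_model_replicas 1   # bring the API back
```

Scaling back to one starts a fresh pod. With no persistent cache volume in this manifest, that pod downloads and loads the weights again, so expect the same startup wait as the first run.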

What we are not covering yet

This article stops at the first working API call. We are not covering public ingress, authentication, autoscaling, multi-GPU serving, quantization, production monitoring, or cost optimization yet.

Those are not tiny details. Public ingress brings TLS, routing, limits, and abuse controls. Authentication decides who can call the model. Autoscaling needs LLM-specific signals, not only CPU. Multi-GPU serving changes scheduling and failure behavior. Quantization changes memory and quality tradeoffs. Monitoring needs token, latency, GPU, queue, and model-server metrics.

But all of that comes after this basic path works.

A Kubernetes LLM platform starts becoming real when a model can load, serve, and answer through an API that other systems can call. Today we got there with one Deployment, one Service, one Secret, and one curl request.

In the next parts, we can make this less like a demo and more like a platform: readiness, observability, routing, auth, scaling, and the failure paths that show up once real users start sending prompts.

If you are following the series, subscribe and keep the manifest from this article handy. It is a good checklist for the first LLM-on-Kubernetes question: can we actually serve a model and call it?