GPU Cluster Orchestration for LLM Inference
A comprehensive guide to architecting and deploying GPU-accelerated Kubernetes clusters for large language model inference, from resource scheduling to cost optimization.
I’ve been seeing a lot of discourse lately about GPUs being “for graphics” and not general compute. While I appreciate the sentiment about naming conventions, the reality is that GPUs have become the backbone of modern AI infrastructure. Over the past two years, I’ve architected and deployed GPU clusters serving billions of LLM inference requests, and I can tell you: getting this right is far more complex than just spinning up some A100s and calling it a day.
The challenge isn’t just about having GPUs—it’s about orchestrating them efficiently at scale while keeping costs under control. In this post, I’ll share the battle-tested patterns I’ve developed for building production-ready GPU infrastructure for LLM inference.
The GPU Infrastructure Challenge
When you’re serving LLMs in production, you’re dealing with unique constraints that traditional Kubernetes workloads don’t face:
- Resource intensity: A single 70B parameter model can require 140GB+ of VRAM
- Cost pressure: A100 80GB instances cost $3-5/hour; H100s can hit $8-12/hour
- Utilization gaps: GPUs sitting idle cost the same as GPUs under load
- Scheduling complexity: Not all GPUs are equal (A100 vs H100 vs L40S)
- Multi-tenancy: Serving multiple models efficiently without interference
I learned these lessons the hard way when our first GPU cluster hit 40% utilization despite running 24/7. We were burning $50K/month on idle compute. That’s when I realized we needed a fundamentally different approach.
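To make the resource-intensity point concrete, here's the back-of-envelope sizing I reach for first (a rough sketch: it assumes bf16/fp16 weights at 2 bytes per parameter and treats KV-cache and runtime overhead as a flat margin, which in reality depends on context length and batch size):

# Back-of-envelope VRAM sizing for a dense model served in bf16/fp16 (2 bytes/param).
# The overhead margin is a placeholder for KV cache, activations, and runtime buffers;
# the real figure depends on context length, batch size, and the serving engine.
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead_fraction: float = 0.2) -> float:
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1.0 + overhead_fraction)

print(f"70B model: ~{estimate_vram_gb(70):.0f} GB VRAM")  # ~168 GB -> more than one 80GB GPU

That ~168 GB figure is why the 70B deployment later in this post shards the model across four 80GB GPUs with tensor parallelism.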
Architecture Overview
Here’s the reference architecture I’ve refined across multiple production deployments:
┌──────────────────────────────────────────────────────────┐
│                   Load Balancer Layer                    │
│              (Cloudflare → AWS ALB → Istio)              │
└──────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                 Inference Gateway Layer                  │
│      (FastAPI + Request Routing + Queue Management)      │
└──────────────────────────────────────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    ▼                   ▼
          ┌──────────────────┐ ┌──────────────────┐
          │    vLLM Pods     │ │    vLLM Pods     │
          │   (A100 80GB)    │ │   (H100 80GB)    │
          │   GPT-4 class    │ │   Llama-3 70B    │
          └──────────────────┘ └──────────────────┘
                    │                   │
                    └─────────┬─────────┘
                              ▼
                ┌──────────────────────────┐
                │   Model Storage Layer    │
                │    (S3 + Redis Cache)    │
                └──────────────────────────┘
The key insight here is separation of concerns. The inference gateway handles routing, queueing, and batching, while the GPU pods focus purely on model execution.
Infrastructure as Code: The Foundation
I use Terraform to manage the entire GPU infrastructure stack. Here’s the core EKS cluster configuration with GPU node groups:
# terraform/eks-gpu-cluster.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "llm-inference-${var.environment}"
  cluster_version = "1.28"

  # Enable IRSA for fine-grained IAM
  enable_irsa = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # GPU node groups with different instance types
  eks_managed_node_groups = {
    # High-memory inference (70B+ models)
    gpu_a100_80gb = {
      instance_types = ["p4de.24xlarge"] # 8x A100 80GB
      capacity_type  = "ON_DEMAND"

      min_size     = 1
      max_size     = 10
      desired_size = 2

      labels = {
        workload-type = "gpu-inference"
        gpu-type      = "a100-80gb"
        gpu-count     = "8"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]

      # Install NVIDIA drivers before the node joins the cluster
      pre_bootstrap_user_data = <<-EOT
        #!/bin/bash
        # Install NVIDIA drivers
        aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
        chmod +x NVIDIA-Linux-x86_64*.run
        /bin/sh ./NVIDIA-Linux-x86_64*.run -s

        # Configure containerd for GPU
        nvidia-ctk runtime configure --runtime=containerd
        systemctl restart containerd
      EOT
    }

    # Cost-optimized inference (smaller models)
    gpu_l40s = {
      instance_types = ["g6e.12xlarge"] # 4x L40S 48GB
      capacity_type  = "SPOT"           # up to ~70% cost savings

      min_size     = 2
      max_size     = 20
      desired_size = 4

      labels = {
        workload-type = "gpu-inference"
        gpu-type      = "l40s"
        gpu-count     = "4"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

# NVIDIA GPU Operator for device management
resource "helm_release" "gpu_operator" {
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia"
  chart            = "gpu-operator"
  namespace        = "gpu-operator-resources"
  version          = "v23.9.0"
  create_namespace = true

  values = [
    yamlencode({
      operator = {
        defaultRuntime = "containerd"
      }
      driver = {
        enabled = false # Pre-installed in user data
      }
      toolkit = {
        enabled = true
      }
      devicePlugin = {
        enabled = true
        config = {
          name    = "time-slicing-config"
          default = "any"
        }
      }
      dcgmExporter = {
        enabled = true # GPU metrics for Prometheus
      }
      gfd = {
        enabled = true # GPU feature discovery
      }
    })
  ]
}
The critical decisions here:
- Mixed instance types: A100s for large models, L40S for cost-sensitive workloads
- Spot instances where possible: up to ~70% cost savings for fault-tolerant inference
- GPU operator: Automates driver management and device plugin deployment (the time-slicing ConfigMap it references is sketched after this list)
- Taints: Prevents non-GPU workloads from landing on expensive nodes
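One gap worth flagging: the devicePlugin block above points at a ConfigMap named time-slicing-config that the Terraform itself never creates. Here's a minimal sketch of what that ConfigMap can look like, following the GPU Operator's time-slicing format (the replicas value is an illustrative choice, not something taken from this cluster):

# k8s/time-slicing-config.yaml (sketch -- pairs with the devicePlugin.config above)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator-resources
data:
  any: |-
    # "any" matches devicePlugin.config.default = "any"
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

Time-slicing lets you pack several small models onto one card, but it shares GPU memory without isolation, so it only makes sense for the smaller, cost-optimized pool, not the 70B deployments below.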
vLLM: The Inference Engine
For LLM serving, I’ve standardized on vLLM. It’s the most mature solution I’ve found for production GPU inference, with PagedAttention for memory efficiency and continuous batching for throughput.
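Before wrapping it in Kubernetes, it's worth seeing how little code vLLM needs on its own. A minimal offline sketch (the model name and sampling settings here are illustrative, not the production config):

# Minimal vLLM usage sketch -- PagedAttention and continuous batching are handled
# by the engine itself; these prompts are batched internally without extra code.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
          dtype="bfloat16", gpu_memory_utilization=0.95, max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)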
Here’s the Kubernetes deployment configuration:
# k8s/vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      # GPU node affinity
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu-type
                    operator: In
                    values:
                      - a100-80gb
      # Tolerate GPU taints
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      # Init container: Download model from S3
      initContainers:
        - name: model-downloader
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              aws s3 sync s3://llm-models/llama-3-70b-instruct /models/llama-3-70b
          volumeMounts:
            - name: model-storage
              mountPath: /models
          env:
            - name: AWS_REGION
              value: us-east-1
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:v0.5.4
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-70b
            - --tensor-parallel-size=4   # Use 4 GPUs
            - --dtype=bfloat16
            - --max-model-len=8192
            - --gpu-memory-utilization=0.95
            - --disable-log-requests
            - --served-model-name=llama-3-70b
          resources:
            limits:
              nvidia.com/gpu: 4 # Request 4 GPUs
              memory: 320Gi
            requests:
              nvidia.com/gpu: 4
              memory: 320Gi
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-storage
              mountPath: /models
            - name: shm
              mountPath: /dev/shm # Shared memory for tensor parallelism
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-storage
          emptyDir:
            sizeLimit: 200Gi
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi # Increased for tensor parallelism
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key optimizations:
- Tensor parallelism: Shards the model across 4 GPUs; the 70B weights (~140GB in bf16) don't fit on a single 80GB card, and sharding also reduces per-token latency
- GPU memory utilization: 0.95 pushes the limit for maximum batch sizes
- Shared memory: Critical for multi-GPU communication
- Init container: Pre-loads models to avoid cold starts
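Once the Deployment and Service are healthy, each pod speaks the OpenAI-compatible API on port 8000. A quick smoke test might look like this (a sketch: it assumes you're calling the vllm-llama3-70b Service from inside the cluster, or through a kubectl -n inference port-forward with the host swapped for localhost):

# Smoke-test the vLLM OpenAI-compatible endpoint behind the ClusterIP Service.
import httpx

BASE_URL = "http://vllm-llama3-70b.inference.svc.cluster.local:8000"

resp = httpx.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "llama-3-70b",  # must match --served-model-name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=120.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])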
Intelligent Request Routing
The real magic happens in the inference gateway. This is where we handle request queuing, model routing, and batch optimization:
# gateway/main.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import asyncio
from collections import defaultdict
import time

app = FastAPI()

# Model routing configuration
MODEL_ENDPOINTS = {
    "gpt-4": "http://vllm-gpt4-equivalent.inference.svc.cluster.local:8000",
    "llama-3-70b": "http://vllm-llama3-70b.inference.svc.cluster.local:8000",
    "llama-3-8b": "http://vllm-llama3-8b.inference.svc.cluster.local:8000",
}

# Request queue for batching
request_queues = defaultdict(asyncio.Queue)
batch_processors = {}


class InferenceRequest(BaseModel):
    model: str
    messages: list
    temperature: float = 0.7
    max_tokens: int = 2048
    stream: bool = False


class BatchProcessor:
    def __init__(self, model_name: str, endpoint: str, batch_size: int = 8, batch_timeout: float = 0.1):
        self.model_name = model_name
        self.endpoint = endpoint
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self.queue = request_queues[model_name]

    async def process_batches(self):
        """Continuously process batched requests"""
        while True:
            batch = []
            deadline = time.time() + self.batch_timeout

            # Collect requests until batch_size or timeout
            while len(batch) < self.batch_size and time.time() < deadline:
                try:
                    timeout = max(0.01, deadline - time.time())
                    request, response_queue = await asyncio.wait_for(
                        self.queue.get(), timeout=timeout
                    )
                    batch.append((request, response_queue))
                except asyncio.TimeoutError:
                    break

            if batch:
                await self._execute_batch(batch)

    async def _execute_batch(self, batch):
        """Execute a batch of requests in parallel"""
        async with httpx.AsyncClient(timeout=120.0) as client:
            tasks = []
            for request, response_queue in batch:
                task = self._single_inference(client, request, response_queue)
                tasks.append(task)
            await asyncio.gather(*tasks)

    async def _single_inference(self, client, request, response_queue):
        """Execute single inference request"""
        try:
            payload = {
                "model": self.model_name,
                "messages": request.messages,
                "temperature": request.temperature,
                "max_tokens": request.max_tokens,
                "stream": request.stream
            }
            response = await client.post(
                f"{self.endpoint}/v1/chat/completions",
                json=payload
            )
            await response_queue.put(("success", response.json()))
        except Exception as e:
            await response_queue.put(("error", str(e)))


@app.on_event("startup")
async def startup_event():
    """Initialize batch processors for each model"""
    for model_name, endpoint in MODEL_ENDPOINTS.items():
        processor = BatchProcessor(model_name, endpoint)
        batch_processors[model_name] = processor
        asyncio.create_task(processor.process_batches())


@app.post("/v1/chat/completions")
async def chat_completion(request: InferenceRequest):
    """Inference endpoint with automatic batching"""
    if request.model not in MODEL_ENDPOINTS:
        raise HTTPException(status_code=404, detail=f"Model {request.model} not found")

    # Create response queue for this request