GPU Cluster Orchestration for LLM Inference
A comprehensive guide to architecting and deploying GPU-accelerated Kubernetes clusters for large language model inference, from resource scheduling to cost optimization.
I’ve been seeing a lot of discourse lately about GPUs being “for graphics” and not general compute. While I appreciate the sentiment about naming conventions, the reality is that GPUs have become the backbone of modern AI infrastructure. Over the past two years, I’ve architected and deployed GPU clusters serving billions of LLM inference requests, and I can tell you: getting this right is far more complex than just spinning up some A100s and calling it a day.
The challenge isn’t just about having GPUs—it’s about orchestrating them efficiently at scale while keeping costs under control. In this post, I’ll share the battle-tested patterns I’ve developed for building production-ready GPU infrastructure for LLM inference.
The GPU Infrastructure Challenge
When you’re serving LLMs in production, you’re dealing with unique constraints that traditional Kubernetes workloads don’t face:
- Resource intensity: A single 70B parameter model can require 140GB+ of VRAM
- Cost pressure: A100 80GB instances cost $3-5/hour; H100s can hit $8-12/hour
- Utilization gaps: GPUs sitting idle cost the same as GPUs under load
- Scheduling complexity: Not all GPUs are equal (A100 vs H100 vs L40S)
- Multi-tenancy: Serving multiple models efficiently without interference
I learned these lessons the hard way when our first GPU cluster hit 40% utilization despite running 24/7. We were burning $50K/month on idle compute. That’s when I realized we needed a fundamentally different approach.
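To make the resource-intensity point concrete, here's the back-of-envelope sizing I reach for first (a rough sketch: it assumes bf16/fp16 weights at 2 bytes per parameter and treats KV-cache and runtime overhead as a flat margin, which in reality depends on context length and batch size):

# Back-of-envelope VRAM sizing for a dense model served in bf16/fp16 (2 bytes/param).
# The overhead margin is a placeholder for KV cache, activations, and runtime buffers;
# the real figure depends on context length, batch size, and the serving engine.
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead_fraction: float = 0.2) -> float:
    weights_gb = params_billion * bytes_per_param
    return weights_gb * (1.0 + overhead_fraction)

print(f"70B model: ~{estimate_vram_gb(70):.0f} GB VRAM")  # ~168 GB -> more than one 80GB GPU

That ~168 GB figure is why the 70B deployment later in this post shards the model across four 80GB GPUs with tensor parallelism.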
Architecture Overview
Here’s the reference architecture I’ve refined across multiple production deployments:
┌──────────────────────────────────────────────────────────┐
│                   Load Balancer Layer                    │
│              (Cloudflare → AWS ALB → Istio)              │
└──────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────┐
│                 Inference Gateway Layer                  │
│      (FastAPI + Request Routing + Queue Management)      │
└──────────────────────────────────────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    ▼                   ▼
          ┌──────────────────┐ ┌──────────────────┐
          │    vLLM Pods     │ │    vLLM Pods     │
          │   (A100 80GB)    │ │   (H100 80GB)    │
          │   GPT-4 class    │ │   Llama-3 70B    │
          └──────────────────┘ └──────────────────┘
                    │                   │
                    └─────────┬─────────┘
                              ▼
                ┌──────────────────────────┐
                │   Model Storage Layer    │
                │    (S3 + Redis Cache)    │
                └──────────────────────────┘
The key insight here is separation of concerns. The inference gateway handles routing, queueing, and batching, while the GPU pods focus purely on model execution.
Infrastructure as Code: The Foundation
I use Terraform to manage the entire GPU infrastructure stack. Here’s the core EKS cluster configuration with GPU node groups:
# terraform/eks-gpu-cluster.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "llm-inference-${var.environment}"
  cluster_version = "1.28"

  # Enable IRSA for fine-grained IAM
  enable_irsa = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # GPU node groups with different instance types
  eks_managed_node_groups = {
    # High-memory inference (70B+ models)
    gpu_a100_80gb = {
      instance_types = ["p4de.24xlarge"] # 8x A100 80GB
      capacity_type  = "ON_DEMAND"

      min_size     = 1
      max_size     = 10
      desired_size = 2

      labels = {
        workload-type = "gpu-inference"
        gpu-type      = "a100-80gb"
        gpu-count     = "8"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]

      # Install NVIDIA drivers before the node joins the cluster
      pre_bootstrap_user_data = <<-EOT
        #!/bin/bash
        # Install NVIDIA drivers
        aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
        chmod +x NVIDIA-Linux-x86_64*.run
        /bin/sh ./NVIDIA-Linux-x86_64*.run -s

        # Configure containerd for GPU
        nvidia-ctk runtime configure --runtime=containerd
        systemctl restart containerd
      EOT
    }

    # Cost-optimized inference (smaller models)
    gpu_l40s = {
      instance_types = ["g6e.12xlarge"] # 4x L40S 48GB
      capacity_type  = "SPOT"           # up to ~70% cost savings

      min_size     = 2
      max_size     = 20
      desired_size = 4

      labels = {
        workload-type = "gpu-inference"
        gpu-type      = "l40s"
        gpu-count     = "4"
      }

      taints = [{
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }
}

# NVIDIA GPU Operator for device management
resource "helm_release" "gpu_operator" {
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia"
  chart            = "gpu-operator"
  namespace        = "gpu-operator-resources"
  version          = "v23.9.0"
  create_namespace = true

  values = [
    yamlencode({
      operator = {
        defaultRuntime = "containerd"
      }
      driver = {
        enabled = false # Pre-installed in user data
      }
      toolkit = {
        enabled = true
      }
      devicePlugin = {
        enabled = true
        config = {
          name    = "time-slicing-config"
          default = "any"
        }
      }
      dcgmExporter = {
        enabled = true # GPU metrics for Prometheus
      }
      gfd = {
        enabled = true # GPU feature discovery
      }
    })
  ]
}
The critical decisions here:
- Mixed instance types: A100s for large models, L40S for cost-sensitive workloads
- Spot instances where possible: up to ~70% cost savings for fault-tolerant inference
- GPU operator: Automates driver management and device plugin deployment (the time-slicing ConfigMap it references is sketched after this list)
- Taints: Prevents non-GPU workloads from landing on expensive nodes
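One gap worth flagging: the devicePlugin block above points at a ConfigMap named time-slicing-config that the Terraform itself never creates. Here's a minimal sketch of what that ConfigMap can look like, following the GPU Operator's time-slicing format (the replicas value is an illustrative choice, not something taken from this cluster):

# k8s/time-slicing-config.yaml (sketch -- pairs with the devicePlugin.config above)
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator-resources
data:
  any: |-
    # "any" matches devicePlugin.config.default = "any"
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

Time-slicing lets you pack several small models onto one card, but it shares GPU memory without isolation, so it only makes sense for the smaller, cost-optimized pool, not the 70B deployments below.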
vLLM: The Inference Engine
For LLM serving, I’ve standardized on vLLM. It’s the most mature solution I’ve found for production GPU inference, with PagedAttention for memory efficiency and continuous batching for throughput.
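Before wrapping it in Kubernetes, it's worth seeing how little code vLLM needs on its own. A minimal offline sketch (the model name and sampling settings here are illustrative, not the production config):

# Minimal vLLM usage sketch -- PagedAttention and continuous batching are handled
# by the engine itself; these prompts are batched internally without extra code.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
          dtype="bfloat16", gpu_memory_utilization=0.95, max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)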
Here’s the Kubernetes deployment configuration:
# k8s/vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama3-70b
  template:
    metadata:
      labels:
        app: vllm-llama3-70b
    spec:
      # GPU node affinity
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu-type
                    operator: In
                    values:
                      - a100-80gb
      # Tolerate GPU taints
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      # Init container: Download model from S3
      initContainers:
        - name: model-downloader
          image: amazon/aws-cli:latest
          command:
            - sh
            - -c
            - |
              aws s3 sync s3://llm-models/llama-3-70b-instruct /models/llama-3-70b
          volumeMounts:
            - name: model-storage
              mountPath: /models
          env:
            - name: AWS_REGION
              value: us-east-1
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:v0.5.4
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model=/models/llama-3-70b
            - --tensor-parallel-size=4   # Use 4 GPUs
            - --dtype=bfloat16
            - --max-model-len=8192
            - --gpu-memory-utilization=0.95
            - --disable-log-requests
            - --served-model-name=llama-3-70b
          resources:
            limits:
              nvidia.com/gpu: 4 # Request 4 GPUs
              memory: 320Gi
            requests:
              nvidia.com/gpu: 4
              memory: 320Gi
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: model-storage
              mountPath: /models
            - name: shm
              mountPath: /dev/shm # Shared memory for tensor parallelism
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      volumes:
        - name: model-storage
          emptyDir:
            sizeLimit: 200Gi
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi # Increased for tensor parallelism
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-70b
  namespace: inference
spec:
  selector:
    app: vllm-llama3-70b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key optimizations:
- Tensor parallelism: Shards the model across 4 GPUs; the 70B weights (~140GB in bf16) don't fit on a single 80GB card, and sharding also reduces per-token latency
- GPU memory utilization: 0.95 pushes the limit for maximum batch sizes
- Shared memory: Critical for multi-GPU communication
- Init container: Pre-loads models to avoid cold starts
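Once the Deployment and Service are healthy, each pod speaks the OpenAI-compatible API on port 8000. A quick smoke test might look like this (a sketch: it assumes you're calling the vllm-llama3-70b Service from inside the cluster, or through a kubectl -n inference port-forward with the host swapped for localhost):

# Smoke-test the vLLM OpenAI-compatible endpoint behind the ClusterIP Service.
import httpx

BASE_URL = "http://vllm-llama3-70b.inference.svc.cluster.local:8000"

resp = httpx.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "llama-3-70b",  # must match --served-model-name
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=120.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])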
Intelligent Request Routing
The real magic happens in the inference gateway. This is where we handle request queuing, model routing, and batch optimization:
# gateway/main.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import httpx
import asyncio
from collections import defaultdict
import time

app = FastAPI()

# Model routing configuration
MODEL_ENDPOINTS = {
    "gpt-4": "http://vllm-gpt4-equivalent.inference.svc.cluster.local:8000",
    "llama-3-70b": "http://vllm-llama3-70b.inference.svc.cluster.local:8000",
    "llama-3-8b": "http://vllm-llama3-8b.inference.svc.cluster.local:8000",
}

# Request queue for batching
request_queues = defaultdict(asyncio.Queue)
batch_processors = {}


class InferenceRequest(BaseModel):
    model: str
    messages: list
    temperature: float = 0.7
    max_tokens: int = 2048
    stream: bool = False


class BatchProcessor:
    def __init__(self, model_name: str, endpoint: str, batch_size: int = 8, batch_timeout: float = 0.1):
        self.model_name = model_name
        self.endpoint = endpoint
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout
        self.queue = request_queues[model_name]

    async def process_batches(self):
        """Continuously process batched requests"""
        while True:
            batch = []
            deadline = time.time() + self.batch_timeout

            # Collect requests until batch_size or timeout
            while len(batch) < self.batch_size and time.time() < deadline:
                try:
                    timeout = max(0.01, deadline - time.time())
                    request, response_queue = await asyncio.wait_for(
                        self.queue.get(), timeout=timeout
                    )
                    batch.append((request, response_queue))
                except asyncio.TimeoutError:
                    break

            if batch:
                await self._execute_batch(batch)

    async def _execute_batch(self, batch):
        """Execute a batch of requests in parallel"""
        async with httpx.AsyncClient(timeout=120.0) as client:
            tasks = []
            for request, response_queue in batch:
                task = self._single_inference(client, request, response_queue)
                tasks.append(task)
            await asyncio.gather(*tasks)

    async def _single_inference(self, client, request, response_queue):
        """Execute single inference request"""
        try:
            payload = {
                "model": self.model_name,
                "messages": request.messages,
                "temperature": request.temperature,
                "max_tokens": request.max_tokens,
                "stream": request.stream
            }
            response = await client.post(
                f"{self.endpoint}/v1/chat/completions",
                json=payload
            )
            await response_queue.put(("success", response.json()))
        except Exception as e:
            await response_queue.put(("error", str(e)))


@app.on_event("startup")
async def startup_event():
    """Initialize batch processors for each model"""
    for model_name, endpoint in MODEL_ENDPOINTS.items():
        processor = BatchProcessor(model_name, endpoint)
        batch_processors[model_name] = processor
        asyncio.create_task(processor.process_batches())


@app.post("/v1/chat/completions")
async def chat_completion(request: InferenceRequest):
    """Inference endpoint with automatic batching"""
    if request.model not in MODEL_ENDPOINTS:
        raise HTTPException(status_code=404, detail=f"Model {request.model} not found")

    # Create response queue for this request