12 min read
Dillon Browne

AWS Service Limits: Lab Infrastructure Rethink

Hit AWS Lightsail's limits? Learn how our team migrated 40+ lab environments to self-hosted Kubernetes, cut costs by 60%, and removed the instance-count ceiling. Discover when it makes sense to move off managed services for better control and efficiency.

AWS, Kubernetes, Infrastructure as Code, DevOps, Cloud Architecture, Cost Optimization, Terraform, Docker, Container Orchestration, Multi-Cloud, Platform Engineering, Lightsail, EKS, Self-Hosted, Lab Infrastructure, Developer Experience, Resource Management

AWS service limits aren’t just arbitrary numbers—they’re forcing functions that reveal when you’ve outgrown a platform. When our lab infrastructure hit Lightsail’s 20-instance limit, we faced a choice: fragment across multiple AWS accounts or fundamentally rethink our approach. This post details our journey from AWS Lightsail to a self-hosted Kubernetes cluster for lab environments.

We chose the latter, migrating 40+ lab environments from managed Lightsail instances to a self-hosted Kubernetes cluster. The result: a 60% cost reduction, horizontal scaling limited only by the hardware we add, and infrastructure that better mirrors production patterns, which significantly improved our developer experience.

The AWS Lightsail Limit Wall

AWS Lightsail is fantastic for simple workloads—fixed pricing, predictable costs, easy management. But it has limits that can hinder growing lab environments:

  • 20 instances per account (a soft limit, requiring support tickets to increase)
  • Limited instance types (no GPU, restricted CPU/memory configurations)
  • Regional constraints (not available in all AWS regions)
  • Basic networking (VPC peering exists, but lacks advanced routing and network policies)

For a single application or small team, these constraints are fine. For a lab environment serving 15+ engineers running ephemeral test environments, AI model experiments, and CI/CD runners, we hit the ceiling fast and had to consider alternatives.
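Before deciding anything, it helps to know how close you are to the cap. Here's a minimal sketch that counts Lightsail instances against the default quota (it assumes boto3 credentials are configured; the script name and the 20-instance constant are illustrative):

#!/usr/bin/env python3
# scripts/check-lightsail-quota.py  (illustrative)
import boto3

# Default Lightsail instance quota; adjust if you've had it raised via support
INSTANCE_QUOTA = 20

def count_instances(region: str = "us-east-1") -> int:
    client = boto3.client("lightsail", region_name=region)
    count, token = 0, None
    while True:
        page = client.get_instances(pageToken=token) if token else client.get_instances()
        count += len(page["instances"])
        token = page.get("nextPageToken")
        if not token:
            return count

if __name__ == "__main__":
    print(f"{count_instances()}/{INSTANCE_QUOTA} Lightsail instances in use")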

Cost Analysis: Lightsail vs. Self-Hosted Kubernetes

Before migrating, I ran the numbers for three months of actual usage, comparing Lightsail costs to a self-hosted Kubernetes solution. The cost optimization potential was clear.

Lightsail Approach (20 instances):

20 instances × $40/month (4GB RAM, 2 vCPUs) = $800/month
- Average utilization: 35%
- Wasted capacity: $520/month
- Scaling: Blocked by instance limits

Self-Hosted Kubernetes (3 bare metal nodes):

3 × Hetzner AX41 (64GB RAM, AMD Ryzen 7) = $180/month
+ 1TB block storage = $50/month
Total: $230/month
- Average utilization: 75%
- Pod density: 60+ concurrent workloads
- Scaling: Add nodes as needed
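
To sanity-check the comparison, it helps to normalize on cost per gigabyte of RAM actually consumed rather than sticker price. A back-of-envelope sketch using the figures above (treating the average utilization numbers as RAM utilization, which is an approximation):

# Back-of-envelope comparison using the numbers above (illustrative only)
def cost_per_utilized_gb(monthly_cost: float, total_ram_gb: float, utilization: float) -> float:
    """Monthly cost divided by the RAM you actually use."""
    return monthly_cost / (total_ram_gb * utilization)

lightsail   = cost_per_utilized_gb(800, 20 * 4, 0.35)  # 20 x 4GB instances at 35% utilization
self_hosted = cost_per_utilized_gb(230, 3 * 64, 0.75)  # 3 x 64GB nodes at 75% utilization

print(f"Lightsail:   ${lightsail:.2f} per utilized GB/month")    # ~$28.57
print(f"Self-hosted: ${self_hosted:.2f} per utilized GB/month")  # ~$1.60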

The economics were obvious. But cost wasn't the only driver—we also needed better resource utilization, namespace isolation, and the ability to run GPU workloads for LLM experiments.

Cloud Architecture: Lightsail to Kubernetes Migration

The migration required rethinking how we provision and manage lab environments, moving from a single-instance model to a multi-tenant Kubernetes cluster.

Old Pattern (Lightsail):

# terraform/lightsail.tf
resource "aws_lightsail_instance" "lab_env" {
  count             = 20
  name              = "lab-${count.index}"
  availability_zone = "us-east-1a"
  blueprint_id      = "ubuntu_22_04"
  bundle_id         = "medium_2_0"  # $40/month
  
  user_data = file("init-script.sh")
}

New Pattern (Kubernetes):

# kubernetes/lab-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: lab-${USER}
  labels:
    environment: lab
    owner: ${USER}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: lab-${USER}
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "5"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: namespace-isolation
  namespace: lab-${USER}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: lab-${USER}

This shift from instance-per-environment to namespace-per-environment unlocked true multi-tenancy: engineers could spin up isolated environments in seconds rather than minutes.
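
One note on the manifest above: the ${USER} placeholders aren't native Kubernetes syntax; they get rendered at provisioning time. A minimal sketch of that rendering step (the full self-service flow comes later; the file path and username are illustrative):

#!/usr/bin/env python3
# Render lab-namespace.yaml for a given user and apply it (illustrative)
import subprocess
from pathlib import Path

def apply_lab_manifest(username: str, template_path: str = "kubernetes/lab-namespace.yaml"):
    manifest = Path(template_path).read_text().replace("${USER}", username)
    # Pipe the rendered manifest straight into kubectl
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)

if __name__ == "__main__":
    apply_lab_manifest("alice")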

Infrastructure as Code Migration with Terraform

The Terraform migration itself was straightforward, but it required careful state management when retiring the Lightsail resources in favor of the self-hosted cluster (a sketch of that state cleanup follows the configuration below).

# terraform/k8s-cluster.tf
module "k3s_cluster" {
  source = "./modules/k3s"
  
  nodes = [
    {
      name     = "k3s-master-01"
      role     = "control-plane"
      provider = "hetzner"
      size     = "ax41"
    },
    {
      name     = "k3s-worker-01"
      role     = "worker"
      provider = "hetzner"
      size     = "ax41"
    },
    {
      name     = "k3s-worker-02"
      role     = "worker"
      provider = "hetzner"
      size     = "ax41"
    }
  ]
  
  features = {
    traefik_ingress  = true
    cert_manager     = true
    longhorn_storage = true
    metrics_server   = true
  }
}

# Automated lab provisioning
resource "kubernetes_namespace" "lab_envs" {
  for_each = toset(var.lab_users)
  
  metadata {
    name = "lab-${each.key}"
    labels = {
      environment = "lab"
      owner       = each.key
      auto-delete = "7d"  # Cleanup after 7 days
    }
  }
}

resource "kubernetes_limit_range" "lab_defaults" {
  for_each = kubernetes_namespace.lab_envs
  
  metadata {
    name      = "default-limits"
    namespace = each.value.metadata[0].name
  }
  
  spec {
    limit {
      type = "Container"
      default = {
        cpu    = "1"
        memory = "2Gi"
      }
      default_request = {
        cpu    = "500m"
        memory = "1Gi"
      }
    }
  }
}
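
As for the state management mentioned earlier: once the old environments were drained, the retired Lightsail resources had to come out of Terraform state before their configuration could be deleted cleanly. A hedged sketch of that cleanup pass, wrapped in Python to match the rest of our tooling (the script name and dry-run default are illustrative):

#!/usr/bin/env python3
# scripts/prune-lightsail-state.py  (illustrative)
import subprocess

def prune_lightsail_state(dry_run: bool = True):
    """Drop retired aws_lightsail_instance entries from Terraform state."""
    state = subprocess.check_output(["terraform", "state", "list"]).decode().splitlines()
    for addr in (a for a in state if a.startswith("aws_lightsail_instance.")):
        if dry_run:
            print(f"would remove: {addr}")
        else:
            # Removes the entry from state only; the instance itself is already gone
            subprocess.run(["terraform", "state", "rm", addr], check=True)

if __name__ == "__main__":
    prune_lightsail_state(dry_run=True)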

Self-Service Lab Provisioning with Kubernetes

The real win was enabling engineers to self-service their environments via a simple CLI script. This significantly improved developer experience and accelerated our experimentation cycles.

#!/usr/bin/env python3
# scripts/provision-lab.py
import subprocess
import sys
from pathlib import Path

def get_ca_cert() -> str:
    """Base64-encoded cluster CA, read from the current (admin) kubeconfig context."""
    return subprocess.check_output([
        "kubectl", "config", "view", "--raw", "--minify",
        "-o", "jsonpath={.clusters[0].cluster.certificate-authority-data}"
    ]).decode().strip()

def provision_lab(username: str, template: str = "default"):
    """Provision isolated lab environment"""
    
    namespace = f"lab-{username}"
    
    # Apply namespace and quotas
    subprocess.run([
        "kubectl", "apply", "-f", "-"
    ], input=f"""
apiVersion: v1
kind: Namespace
metadata:
  name: {namespace}
  labels:
    owner: {username}
    template: {template}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {username}
  namespace: {namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {username}-admin
  namespace: {namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
- kind: ServiceAccount
  name: {username}
  namespace: {namespace}
    """.encode(), check=True)
    
    # Generate kubeconfig
    token = subprocess.check_output([
        "kubectl", "create", "token", username,
        "-n", namespace, "--duration=168h"
    ]).decode().strip()
    
    kubeconfig = Path.home() / f".kube/lab-{username}.yaml"
    kubeconfig.write_text(f"""
apiVersion: v1
kind: Config
clusters:
- cluster:
    server: https://lab.internal:6443
    certificate-authority-data: {get_ca_cert()}
  name: lab-cluster
contexts:
- context:
    cluster: lab-cluster
    namespace: {namespace}
    user: {username}
  name: lab-{username}
current-context: lab-{username}
users:
- name: {username}
  user:
    token: {token}
    """)
    
    print(f"✅ Lab environment provisioned: {namespace}")
    print(f"📝 Kubeconfig: {kubeconfig}")
    print(f"🚀 Usage: export KUBECONFIG={kubeconfig}")

if __name__ == "__main__":
    provision_lab(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "default")

Engineers run ./provision-lab.py john-doe ai-experiment and get a fully isolated environment with pre-configured resource limits, network policies, and credentials. This self-service model is a cornerstone of effective platform engineering.

Key Lessons from Our Infrastructure Migration

Migrating from AWS Lightsail to self-hosted Kubernetes provided valuable insights into cloud architecture, cost optimization, and resource management.

1. Service Limits Are Design Signals for Cloud Architecture

When you hit platform limits repeatedly, it’s time to evaluate whether you’re using the right tool. Lightsail is perfect for 5-10 simple workloads. Beyond that, Kubernetes offers better economics and flexibility for complex, multi-tenant lab infrastructure.

2. Multi-Tenancy Requires Discipline and Robust Resource Management

Namespace isolation sounds simple until you deal with shared storage, network policies, and resource contention. We implemented:

  • Resource quotas on every namespace to prevent resource hogging.
  • Network policies for robust traffic isolation between lab environments.
  • Pod security standards (restricted by default) to enhance security.
  • Automated cleanup (namespaces older than 7 days get flagged for deletion); a sketch of that cleanup pass follows this list.
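
A minimal sketch of that cleanup pass, assuming a scheduled runner (a nightly CI job, say) with cluster credentials; it keys off the environment=lab label and each namespace's creationTimestamp (the script name is illustrative):

#!/usr/bin/env python3
# scripts/flag-stale-labs.py  (illustrative)
import json
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)

def stale_lab_namespaces():
    """Yield lab namespaces older than MAX_AGE."""
    out = subprocess.check_output([
        "kubectl", "get", "namespaces", "-l", "environment=lab", "-o", "json"
    ])
    for item in json.loads(out)["items"]:
        created = datetime.fromisoformat(
            item["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        )
        if datetime.now(timezone.utc) - created > MAX_AGE:
            yield item["metadata"]["name"]

if __name__ == "__main__":
    for ns in stale_lab_namespaces():
        # Flag only; actual deletion still goes through a human (or a follow-up job)
        print(f"stale lab namespace: {ns}")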

3. Cost Optimization Through High Resource Utilization

The Lightsail instances sat at 35% average CPU utilization because we couldn’t bin-pack workloads efficiently. Kubernetes lets us achieve 75%+ utilization through intelligent scheduling and resource requests/limits, leading to significant cost savings.
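
One simple way to keep an eye on that number: with metrics-server already on the cluster, averaging the CPU% column from kubectl top nodes gives a rough read. A sketch (it parses the human-readable output, so the column layout is an assumption):

#!/usr/bin/env python3
# Rough cluster CPU utilization from kubectl top nodes (requires metrics-server)
import subprocess

def average_cpu_percent() -> float:
    out = subprocess.check_output(["kubectl", "top", "nodes", "--no-headers"]).decode()
    # Expected columns: NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
    values = [float(line.split()[2].rstrip("%")) for line in out.splitlines() if line.strip()]
    return sum(values) / len(values)

if __name__ == "__main__":
    print(f"average node CPU utilization: {average_cpu_percent():.0f}%")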

4. Bare Metal for Predictable Cloud Costs

Hetzner’s dedicated servers provide fixed monthly costs with no egress fees, no per-hour charges, and no surprise bills. For lab environments with unpredictable usage patterns, this predictability is invaluable for budgeting and cost control.

When to Stay on AWS Lightsail

Despite our migration to Kubernetes, Lightsail remains the right choice for specific use cases:

  • Simple production workloads (e.g., WordPress, static sites, small APIs).
  • Predictable traffic patterns (fixed resource needs and scaling requirements).
  • Teams without extensive Kubernetes expertise (lower operational overhead and easier management).
  • Small scale operations (typically 1-10 instances).

The moment you need dynamic scaling, multi-tenancy, or hit service limits, start planning your Kubernetes migration for advanced container orchestration.

Our Modern Tech Stack for Lab Infrastructure

Our new lab infrastructure is built on a robust and modern tech stack, enabling high performance and flexibility:

Infrastructure:

  • Hetzner bare metal (AX41 servers) for cost-effective, high-performance compute.
  • K3s (lightweight Kubernetes) for efficient container orchestration.
  • Longhorn (distributed block storage) for persistent data.
  • Traefik (ingress controller) for managing external access to services.

Automation:

  • Terraform (infrastructure provisioning) for declarative infrastructure as code.
  • Helm (application deployment) for packaging and deploying Kubernetes applications.
  • Python (CLI tooling) for custom automation and developer experience improvements.
  • GitHub Actions (CI/CD) for continuous integration and deployment workflows.

Observability:

  • Prometheus (metrics) for robust monitoring.
  • Grafana (dashboards) for visualizing system performance and health.
  • Loki (log aggregation) for centralized log management.

The migration took two weeks of planning and one weekend of execution. Three months later, we’re running 60+ concurrent lab environments at a fraction of the cost, with zero scaling constraints, and a greatly improved developer experience.
