AWS Service Limits: Lab Infrastructure Rethink
Hit AWS Lightsail's limits? Learn how our team migrated 40+ lab environments to self-hosted Kubernetes, cut costs by 60%, and removed the instance-count ceiling. Discover when moving off managed services buys you control and efficiency.
AWS service limits aren’t just arbitrary numbers—they’re forcing functions that reveal when you’ve outgrown a platform. When our lab infrastructure hit Lightsail’s 20-instance limit, we faced a choice: fragment across multiple AWS accounts or fundamentally rethink our approach. This post details our journey from AWS Lightsail to a self-hosted Kubernetes cluster for lab environments.
We chose the latter, migrating 40+ lab environments from managed Lightsail instances to a self-hosted Kubernetes cluster. The result: a 60% cost reduction, horizontal scaling bounded only by how many nodes we add, and infrastructure that better mirrors production patterns, which significantly improved our developer experience.
The AWS Lightsail Limit Wall
AWS Lightsail is fantastic for simple workloads—fixed pricing, predictable costs, easy management. But it comes with limits that a growing lab environment will hit:
- 20 instances per account (a soft limit, requiring support tickets to increase)
- Limited instance types (no GPU, restricted CPU/memory configurations)
- Regional constraints (not available in all AWS regions)
- Basic networking (VPC peering exists, but lacks advanced routing and network policies)
For a single application or small team, these constraints are fine. For a lab environment serving 15+ engineers running ephemeral test environments, AI model experiments, and CI/CD runners, we hit the ceiling fast, and that forced us to look at alternatives.
Cost Analysis: Lightsail vs. Self-Hosted Kubernetes
Before migrating, I ran the numbers for three months of actual usage, comparing Lightsail costs to a self-hosted Kubernetes solution. The cost optimization potential was clear.
Lightsail Approach (20 instances):
- 20 instances × $40/month (4GB RAM, 2 vCPUs) = $800/month
- Average utilization: 35%
- Wasted capacity: $520/month
- Scaling: blocked by instance limits

Self-Hosted Kubernetes (3 bare metal nodes):
- 3 × Hetzner AX41 (64GB RAM, AMD Ryzen 7) = $180/month
- 1TB block storage = $50/month
- Total: $230/month
- Average utilization: 75%
- Pod density: 60+ concurrent workloads
- Scaling: add nodes as needed
The economics were obvious. But cost wasn't the only driver: we needed better resource utilization, namespace isolation, and the ability to run GPU workloads for LLM experiments.
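The same arithmetic works as a quick sanity check, runnable as a short script (figures taken from the breakdown above; "effective cost" here means monthly spend divided by average utilization):

# cost-check.py (illustrative only; numbers from this post)
lightsail_spend, lightsail_util = 20 * 40, 0.35  # $800/month at 35% utilization
k8s_spend, k8s_util = 180 + 50, 0.75             # $230/month at 75% utilization

print(f"Wasted Lightsail capacity: ${lightsail_spend * (1 - lightsail_util):.0f}/month")
print(f"Effective cost of fully used capacity: "
      f"${lightsail_spend / lightsail_util:.0f} vs ${k8s_spend / k8s_util:.0f} per month")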
Cloud Architecture: Lightsail to Kubernetes Migration
The migration required rethinking how we provision and manage lab environments, moving from a single-instance model to a multi-tenant Kubernetes cluster.
Old Pattern (Lightsail):
# terraform/lightsail.tf
resource "aws_lightsail_instance" "lab_env" {
  count             = 20
  name              = "lab-${count.index}"
  availability_zone = "us-east-1a"
  blueprint_id      = "ubuntu_22_04"
  bundle_id         = "medium_2_0" # $40/month
  user_data         = file("init-script.sh")
}
New Pattern (Kubernetes):
# kubernetes/lab-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: lab-${USER}
  labels:
    environment: lab
    owner: ${USER}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: lab-${USER}
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "5"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: namespace-isolation
  namespace: lab-${USER}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only traffic from pods in this same namespace is allowed in; the
    # kubernetes.io/metadata.name label is set automatically on namespaces.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: lab-${USER}
  # No egress rules are listed, so all outbound traffic (including DNS)
  # is denied by default; add egress rules if workloads need to reach out.
This shift from instance-per-environment to namespace-per-environment unlocked true multi-tenancy: engineers could spin up isolated environments in seconds, not minutes.
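The ${USER} placeholders make this template trivially scriptable. A minimal sketch of how it can be rendered and applied, assuming the manifest lives at the path in its header comment (Python's string.Template uses the same ${VAR} syntax):

import os
import subprocess
from string import Template

# Fill in ${USER} and pipe the rendered manifest straight to kubectl
manifest = Template(open("kubernetes/lab-namespace.yaml").read())
rendered = manifest.substitute(USER=os.environ["USER"])
subprocess.run(["kubectl", "apply", "-f", "-"], input=rendered.encode(), check=True)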
Infrastructure as Code Migration with Terraform
The Terraform migration was straightforward but required careful state management when retiring the AWS Lightsail resources in favor of the self-hosted cluster.
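Concretely, "careful state management" meant pulling the old Lightsail instances out of Terraform's state so a later apply wouldn't destroy them mid-cutover. A minimal sketch of that step (the resource address matches the lightsail.tf block above; adapt to your own cutover plan):

import subprocess

# Stop managing each Lightsail instance without destroying it; the
# instances keep running until they are retired out-of-band.
for i in range(20):
    subprocess.run(
        ["terraform", "state", "rm", f"aws_lightsail_instance.lab_env[{i}]"],
        check=True,
    )

With the old resources out of state, the replacement cluster is defined declaratively: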
# terraform/k8s-cluster.tf
module "k3s_cluster" {
  source = "./modules/k3s"

  nodes = [
    {
      name     = "k3s-master-01"
      role     = "control-plane"
      provider = "hetzner"
      size     = "ax41"
    },
    {
      name     = "k3s-worker-01"
      role     = "worker"
      provider = "hetzner"
      size     = "ax41"
    },
    {
      name     = "k3s-worker-02"
      role     = "worker"
      provider = "hetzner"
      size     = "ax41"
    }
  ]

  features = {
    traefik_ingress  = true
    cert_manager     = true
    longhorn_storage = true
    metrics_server   = true
  }
}

# Automated lab provisioning
resource "kubernetes_namespace" "lab_envs" {
  for_each = toset(var.lab_users)

  metadata {
    name = "lab-${each.key}"
    labels = {
      environment   = "lab"
      owner         = each.key
      "auto-delete" = "7d" # cleanup after 7 days
    }
  }
}

resource "kubernetes_limit_range" "lab_defaults" {
  for_each = kubernetes_namespace.lab_envs

  metadata {
    name      = "default-limits"
    namespace = each.value.metadata[0].name
  }

  spec {
    limit {
      type = "Container"
      default = {
        cpu    = "1"
        memory = "2Gi"
      }
      default_request = {
        cpu    = "500m"
        memory = "1Gi"
      }
    }
  }
}
Self-Service Lab Provisioning with Kubernetes
The real win was enabling engineers to self-service their environments via a simple CLI script. This significantly improved developer experience and accelerated our experimentation cycles.
#!/usr/bin/env python3
# scripts/provision-lab.py
import subprocess
import sys
from pathlib import Path


def get_ca_cert() -> str:
    """Read the cluster's base64-encoded CA bundle from the current kubeconfig."""
    return subprocess.check_output([
        "kubectl", "config", "view", "--raw", "--minify",
        "-o", "jsonpath={.clusters[0].cluster.certificate-authority-data}",
    ]).decode().strip()


def provision_lab(username: str, template: str = "default"):
    """Provision an isolated lab environment."""
    namespace = f"lab-{username}"

    # Apply namespace, service account, and admin role binding in one shot
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=f"""
apiVersion: v1
kind: Namespace
metadata:
  name: {namespace}
  labels:
    owner: {username}
    template: {template}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {username}
  namespace: {namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {username}-admin
  namespace: {namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
  - kind: ServiceAccount
    name: {username}
    namespace: {namespace}
""".encode(),
        check=True,
    )

    # Mint a week-long service account token for the kubeconfig
    token = subprocess.check_output([
        "kubectl", "create", "token", username,
        "-n", namespace, "--duration=168h",
    ]).decode().strip()

    kubeconfig = Path.home() / f".kube/lab-{username}.yaml"
    kubeconfig.write_text(f"""
apiVersion: v1
kind: Config
clusters:
  - cluster:
      server: https://lab.internal:6443
      certificate-authority-data: {get_ca_cert()}
    name: lab-cluster
contexts:
  - context:
      cluster: lab-cluster
      namespace: {namespace}
      user: {username}
    name: lab-{username}
current-context: lab-{username}
users:
  - name: {username}
    user:
      token: {token}
""")

    print(f"✅ Lab environment provisioned: {namespace}")
    print(f"📝 Kubeconfig: {kubeconfig}")
    print(f"🚀 Usage: export KUBECONFIG={kubeconfig}")


if __name__ == "__main__":
    provision_lab(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "default")
Engineers run ./provision-lab.py john-doe ai-experiment and get a fully isolated environment with pre-configured resource limits, network policies, and credentials. This self-service model is a cornerstone of effective platform engineering.
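The script stops at provisioning, but teardown can be just as small. A sketch that reuses the same imports (deprovision_lab is a name for illustration, not part of our actual CLI):

def deprovision_lab(username: str):
    """Delete the lab namespace, everything in it, and the local kubeconfig."""
    subprocess.run(["kubectl", "delete", "namespace", f"lab-{username}"], check=True)
    (Path.home() / f".kube/lab-{username}.yaml").unlink(missing_ok=True)

Deleting the namespace cascades to every pod, service, and PVC inside it, which is exactly what makes namespaces such a clean unit of tenancy.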
Key Lessons from Our Infrastructure Migration
Migrating from AWS Lightsail to self-hosted Kubernetes provided valuable insights into cloud architecture, cost optimization, and resource management.
1. Service Limits Are Design Signals for Cloud Architecture
When you hit platform limits repeatedly, it’s time to evaluate whether you’re using the right tool. Lightsail is perfect for 5-10 simple workloads. Beyond that, Kubernetes offers better economics and flexibility for complex, multi-tenant lab infrastructure.
2. Multi-Tenancy Requires Discipline and Robust Resource Management
Namespace isolation sounds simple until you deal with shared storage, network policies, and resource contention. We implemented:
- Resource quotas on every namespace to prevent resource hogging.
- Network policies for robust traffic isolation between lab environments.
- Pod security standards (restricted by default) to enhance security.
- Automated cleanup (namespaces older than 7 days get flagged; a sketch follows this list) for efficient resource management.
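The cleanup pass referenced above is easy to sketch. This version assumes the environment=lab label from the Terraform earlier and only prints candidates instead of deleting them:

import json
import subprocess
from datetime import datetime, timedelta, timezone

# Flag lab namespaces older than the 7-day cutoff
out = subprocess.check_output(
    ["kubectl", "get", "namespaces", "-l", "environment=lab", "-o", "json"]
)
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
for ns in json.loads(out)["items"]:
    # creationTimestamp is RFC 3339 with a trailing Z; swap in an offset
    created = datetime.fromisoformat(
        ns["metadata"]["creationTimestamp"].replace("Z", "+00:00")
    )
    if created < cutoff:
        print(f"stale: {ns['metadata']['name']} (created {created:%Y-%m-%d})")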
3. Cost Optimization Through High Resource Utilization
The Lightsail instances sat at 35% average CPU utilization because we couldn’t bin-pack workloads efficiently. Kubernetes lets us achieve 75%+ utilization through intelligent scheduling and resource requests/limits, leading to significant cost savings.
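The gap falls straight out of bin-packing arithmetic. A rough, memory-only estimate with the figures from this post (node overhead and CPU deliberately ignored):

# Memory-only scheduling capacity, figures from this post
node_ram_gi, nodes = 64, 3    # three AX41 nodes
default_request_gi = 1        # default memory request from the LimitRange above
lightsail_workloads = 20      # one workload per Lightsail instance

k8s_slots = nodes * node_ram_gi // default_request_gi
print(f"Schedulable 1Gi workloads: {k8s_slots} vs {lightsail_workloads} instances")

That prints 192 vs 20; even after reserving headroom for system pods and bursts, the scheduler has far more packing freedom than one-workload-per-instance ever allowed.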
4. Bare Metal for Predictable Cloud Costs
Hetzner’s dedicated servers provide fixed monthly costs with no egress fees, no per-hour charges, and no surprise bills. For lab environments with unpredictable usage patterns, this predictability is invaluable for budgeting and cost control.
When to Stay on AWS Lightsail
Despite our migration to Kubernetes, Lightsail remains the right choice for specific use cases:
- Simple production workloads (e.g., WordPress, static sites, small APIs).
- Predictable traffic patterns (fixed resource needs and scaling requirements).
- Teams without extensive Kubernetes expertise (lower operational overhead and easier management).
- Small scale operations (typically 1-10 instances).
The moment you need dynamic scaling or multi-tenancy, or you keep hitting service limits, start planning your Kubernetes migration.
Our Modern Tech Stack for Lab Infrastructure
Our new lab infrastructure is built on the following stack:
Infrastructure:
- Hetzner bare metal (AX41 servers) for cost-effective, high-performance compute.
- K3s (lightweight Kubernetes) for efficient container orchestration.
- Longhorn (distributed block storage) for persistent data.
- Traefik (ingress controller) for managing external access to services.
Automation:
- Terraform (infrastructure provisioning) for declarative infrastructure as code.
- Helm (application deployment) for packaging and deploying Kubernetes applications.
- Python (CLI tooling) for custom automation and developer experience improvements.
- GitHub Actions (CI/CD) for continuous integration and deployment workflows.
Observability:
- Prometheus (metrics) for robust monitoring.
- Grafana (dashboards) for visualizing system performance and health.
- Loki (log aggregation) for centralized log management.
The migration took two weeks of planning and one weekend of execution. Three months later, we’re running 60+ concurrent lab environments at a fraction of the cost, with zero scaling constraints, and a greatly improved developer experience.