Dillon Browne

Multi-Region Kubernetes with GitOps

A comprehensive guide to architecting and implementing a production-grade, multi-region Kubernetes platform using GitOps principles and Infrastructure as Code.


As organizations scale their container workloads across multiple regions and cloud providers, managing Kubernetes infrastructure becomes significantly more complex: every additional cluster multiplies the configuration, access controls, and deployments that must be kept consistent. In this post, I’ll share my battle-tested approach to building a production-grade, multi-region Kubernetes platform using GitOps principles and Infrastructure as Code (IaC).

The Challenge

Recently, I led the development of a global platform that needed to:

  • Support applications across North America, Europe, and Asia
  • Maintain consistent security and compliance controls
  • Enable rapid deployment with minimal human intervention
  • Provide disaster recovery with RPO < 15 minutes
  • Scale to handle 1000+ microservices

Architecture Overview

Here’s the high-level architecture we implemented:

[Git Repositories]
     │
     ▼
[ArgoCD/Flux]──────────────[Terraform Cloud]
     │                            │
     ▼                            ▼
[Platform Components]        [Infrastructure]
 - Cert Manager             - VPC/Networking
 - External DNS             - EKS Clusters
 - Ingress Controller       - IAM Roles
 - Monitoring Stack         - Security Groups
     │
     ▼
[Regional EKS Clusters]
 ├── us-east-1
 ├── eu-west-1
 └── ap-southeast-1

Infrastructure as Code Foundation

We used Terraform to define our infrastructure, organizing it into reusable modules:

module "eks_cluster" {
  source = "./modules/eks"

  # One module instance per region; local.regions must be a map or a set of strings
  for_each = local.regions

  # region is only an input here; the module itself must create its resources in that region
  region       = each.key
  cluster_name = "${var.environment}-${each.key}"
  node_groups  = local.node_group_config[each.key]
  vpc_id       = module.vpc[each.key].vpc_id
  subnet_ids   = module.vpc[each.key].private_subnet_ids

  tags = {
    Environment = var.environment
    Region      = each.key
    ManagedBy   = "terraform"
  }
}

GitOps Implementation

We chose ArgoCD for GitOps, configuring it to manage both infrastructure and applications:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-services
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:org/platform-services.git
    targetRevision: HEAD
    path: manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
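
The Application above targets only the cluster that Argo CD itself runs in. One common way to roll the same manifests out to every regional cluster is an ApplicationSet with a cluster generator. The following is a minimal sketch, not the configuration from this project: it assumes the regional clusters are already registered with Argo CD, and the `platform/managed` label and `platform` namespace are hypothetical names chosen for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
    # Emits one set of parameters per cluster registered in Argo CD
    # whose cluster Secret carries the (hypothetical) label below.
    - clusters:
        selector:
          matchLabels:
            platform/managed: "true"
  template:
    metadata:
      # {{name}} and {{server}} are filled in by the cluster generator
      name: 'platform-services-{{name}}'
    spec:
      project: default
      source:
        repoURL: git@github.com:org/platform-services.git
        targetRevision: HEAD
        path: manifests
      destination:
        server: '{{server}}'
        namespace: platform   # hypothetical target namespace
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

With this shape, adding a region means registering the new cluster with the matching label; no per-region Application has to be written by hand.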

Platform Components

Security and Access Control

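We implemented a zero-trust security model using AWS IAM roles and Kubernetes RBAC. The example below binds a platform-admins group to a role with unrestricted access; because the role is this broad, membership in that group should be tightly limited: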

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-admin
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-admin-binding
subjects:
- kind: Group
  name: platform-admins
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: platform-admin
  apiGroup: rbac.authorization.k8s.io
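
For contrast with the unrestricted admin role, here is a sketch of what a least-privilege grant for an application team could look like. It is illustrative only: the `team-a` namespace, the `app-deployer` role, and the `team-a-developers` group are hypothetical names, and the verb list is an assumption about what a deploying team typically needs.

```yaml
# Namespaced role: can manage Deployments and read Pods and their logs,
# but only inside the team-a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer
  namespace: team-a
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
# Grants the role to a group rather than to individual users.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a-developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io
```

Using a Role and RoleBinding instead of their cluster-wide counterparts keeps the grant scoped to a single namespace.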

Monitoring and Observability

We deployed a comprehensive monitoring stack:

  1. Prometheus for metrics collection
  2. Grafana for visualization
  3. Loki for log aggregation
  4. Tempo for distributed tracing

Example Prometheus configuration, expressed as a Prometheus Operator custom resource:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
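
With the Prometheus Operator, scrape targets are usually declared through ServiceMonitor resources. The sketch below shows the general shape; the `example-app` service, its `metrics` port name, and the `team: platform` label are hypothetical. Note that a Prometheus resource only picks up ServiceMonitors matched by its `serviceMonitorSelector`, which the example above does not set.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    team: platform        # hypothetical label for a serviceMonitorSelector to match
spec:
  # Selects Services (in the ServiceMonitor's own namespace by default) with this label
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics       # name of the Service port that exposes metrics
      path: /metrics
      interval: 30s
```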

Performance Optimizations

Some key optimizations we implemented:

  1. Cluster Autoscaling
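resource "aws_autoscaling_group" "nodes" {
  # The autoscaler moves capacity between min_size and max_size
  desired_capacity    = 3
  max_size            = 10
  min_size            = 1
  vpc_zone_identifier = var.private_subnet_ids # hypothetical variable: subnets for the nodes

  mixed_instances_policy {
    instances_distribution {
      # Above the on-demand base, split capacity 50/50 between On-Demand and Spot
      on_demand_percentage_above_base_capacity = 50
    }
    launch_template {
      # Required block; aws_launch_template.nodes is assumed to be defined elsewhere
      launch_template_specification {
        launch_template_id = aws_launch_template.nodes.id
        version            = "$Latest"
      }
      override {
        instance_type = "m6i.2xlarge"
      }
    }
  }
}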
  2. Network Policy Optimization
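# Denies all ingress and egress for every pod in the namespace where it is applied.
# Traffic must then be re-enabled with explicit allow policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress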
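
A default-deny policy that covers egress also blocks DNS lookups, so it is normally paired with at least one allow rule. The following sketch permits DNS traffic to the cluster DNS pods. It assumes they run in `kube-system` with the `k8s-app: kube-dns` label, which is the usual setup for CoreDNS on EKS but should be verified for your cluster; the policy name is hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # Both selectors sit in one list item, so they are ANDed:
        # DNS pods inside the kube-system namespace only.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Network policies are additive, so this rule opens only port 53 to the DNS pods while everything else stays denied.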

Lessons Learned

  1. State Management: Keep Terraform state in a centralized location (we used S3 + DynamoDB) and implement proper locking:
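terraform {
  backend "s3" {
    bucket = "terraform-state" # placeholder; S3 bucket names must be globally unique
    key    = "platform/terraform.tfstate"
    region = "us-east-1"
    # DynamoDB table used for state locking; newer Terraform releases also
    # support S3-native locking via use_lockfile
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}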
  2. Disaster Recovery: Regular testing of DR procedures is crucial. We automated this with chaos engineering, using Chaos Mesh to inject failures:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
spec:
  action: pod-failure
  mode: one
  duration: "10m"
  selector:
    namespaces:
      - default
  3. Cost Management: Implement proper tagging and use tools like Kubecost for visibility:
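resource "aws_eks_node_group" "main" {
  # Required arguments (cluster_name, node_role_arn, subnet_ids, scaling_config) omitted for brevity

  # Tags used to attribute spend by environment, team, and cost center
  tags = {
    Environment = var.environment
    Team        = var.team
    CostCenter  = var.cost_center
  }
}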
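
The PodChaos experiment in the disaster recovery lesson runs once when it is applied. To make such tests recurring, Chaos Mesh 2.x provides a Schedule resource that wraps an experiment in a cron expression. This is a sketch under assumptions, not the setup from this project: the schedule name, the weekly cron expression, and the history limit are all illustrative choices.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-failure-weekly
spec:
  schedule: "0 3 * * 1"        # every Monday at 03:00
  type: PodChaos
  historyLimit: 5              # keep the last five runs for inspection
  concurrencyPolicy: Forbid    # skip a run if the previous one is still active
  podChaos:
    # Same experiment as the one-off PodChaos resource shown earlier
    action: pod-failure
    mode: one
    duration: "10m"
    selector:
      namespaces:
        - default
```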

Performance Results

After implementation, we achieved:

  • 99.99% platform availability
  • 45% reduction in deployment time
  • 30% cost savings through optimized resource utilization
  • Zero production incidents during regional failovers

Tech Stack Summary

  • Infrastructure: AWS (EKS, VPC, Route53)
  • IaC: Terraform
  • GitOps: ArgoCD
  • Monitoring: Prometheus, Grafana, Loki, Tempo
  • Security: AWS IAM, cert-manager, external-dns
  • CI/CD: GitHub Actions, ArgoCD
  • Storage: AWS EBS, S3
  • Networking: AWS VPC CNI, Calico

This architecture has been running in production for over 6 months, serving millions of requests daily across three continents. The combination of GitOps and IaC has dramatically reduced our operational overhead while improving reliability and security.

Remember, there’s no one-size-fits-all solution. The key is understanding your specific requirements and constraints, then designing a platform that balances complexity with maintainability.

Feel free to reach out if you have questions about implementing similar architectures in your organization!
