Build Internal Developer Platforms
Stop chasing every DevOps tool. Build an internal developer platform that reduces cognitive load and accelerates delivery. Real examples included.
Escape Tool Fatigue with Platform Engineering
I’ve watched countless engineers burn out trying to master Kubernetes, Terraform, ArgoCD, Prometheus, and a dozen other tools simultaneously. In my experience working with fast-growing startups, the teams that succeed aren’t the ones with the most tool expertise—they’re the ones that build internal developer platforms.
The shift from “DevOps engineer who knows all the tools” to “platform engineer who builds abstractions” is one of the most important career transitions I’ve made. Instead of firefighting infrastructure issues and context-switching between tools, I now build systems that let developers ship code without needing to understand the underlying complexity.
Design Platform Components That Matter
An internal developer platform (IDP) isn’t just a collection of scripts or a fancy dashboard. It’s a thoughtfully designed abstraction layer that handles the operational complexity your organization actually faces.
Here’s what I’ve learned platforms need:
Self-service capabilities - Developers shouldn’t need to file tickets to provision infrastructure. They should push code and get running services.
Opinionated workflows - Unlimited flexibility creates unlimited cognitive load. Good platforms make the right choices obvious and the wrong ones difficult.
Observable by default - Metrics, logs, and traces should flow automatically. No developer should manually configure Prometheus scrape configs.
Security guardrails - Compliance and security shouldn’t be optional add-ons. They should be impossible to bypass.
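These properties can be enforced mechanically. As a rough sketch, here is the kind of guardrail check a platform CLI might run before provisioning anything; the required fields, the replica cap, and the `validate_service_spec` helper are illustrative assumptions, not an actual standard:

```python
# Hypothetical guardrail check run before any provisioning happens.
# The required keys and limits below are illustrative assumptions.
REQUIRED_KEYS = {"name", "language", "replicas"}
MAX_REPLICAS = 10  # assumed organizational cap

def validate_service_spec(spec: dict) -> list[str]:
    """Return a list of guardrail violations; an empty list means the spec passes."""
    errors = []
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    if spec.get("replicas", 0) > MAX_REPLICAS:
        errors.append(f"replicas capped at {MAX_REPLICAS}")
    if not spec.get("monitoring", True):
        # Observable by default: monitoring is opt-out-proof
        errors.append("monitoring cannot be disabled")
    return errors
```

Because the check runs inside the platform rather than in each team's pipeline, there is no way to route around it, which is the point.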
Deploy Your First Platform Service
I recently helped a team reduce their deployment complexity from seven tools and twelve manual steps to a single command. Here’s the before and after.
Before: Developers needed to manually create Kubernetes namespaces, configure service meshes, set up monitoring, manage secrets, configure ingress, update DNS, and configure CI/CD pipelines. Each step required understanding a different tool.
After: Developers run a single command that handles everything:
# Create a new service with production-grade infrastructure
platform service create api-gateway \
  --language go \
  --replicas 3 \
  --database postgres \
  --cache redis
This abstraction wasn’t magic. It was a Python CLI that orchestrated Terraform, Kubernetes manifests, and CI/CD configurations. The platform made decisions based on organizational standards so developers didn’t have to.
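As a rough sketch of what the entry point of such a CLI might look like, using only `argparse` from the standard library (the subcommand layout and flag names mirror the command above; `build_parser` is a hypothetical helper, not the team's actual code):

```python
# Hypothetical CLI entry point mirroring `platform service create ...`.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="platform")
    sub = parser.add_subparsers(dest="command", required=True)
    # `platform service <action>` subcommand tree
    service = sub.add_parser("service").add_subparsers(dest="action", required=True)
    create = service.add_parser("create")
    create.add_argument("name")
    create.add_argument("--language", required=True)
    create.add_argument("--replicas", type=int, default=2)
    create.add_argument("--database", default=None)
    create.add_argument("--cache", default=None)
    return parser

args = build_parser().parse_args(
    ["service", "create", "api-gateway", "--language", "go", "--replicas", "3"]
)
print(args.name, args.language, args.replicas)  # api-gateway go 3
```

The parsed arguments would then be handed to an orchestration layer like the `PlatformService` class shown in the next section.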
Implement Platform Orchestration with Python
The core of our platform is a Python-based orchestration layer that wraps multiple infrastructure tools. Here’s a simplified version of how we handle service creation:
import subprocess
import re
from pathlib import Path

# Compile regex at module level for performance.
# Kubernetes namespace names must be DNS-1123 labels: lowercase alphanumeric
# characters or hyphens, starting and ending with an alphanumeric character.
SERVICE_NAME_PATTERN = re.compile(r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')


class PlatformService:
    def __init__(self, name, language, replicas=2, database=None):
        # Validate input to prevent injection into shell commands and manifests
        if not SERVICE_NAME_PATTERN.match(name):
            raise ValueError(
                "Service name must be lowercase alphanumeric (a-z, 0-9) "
                "with optional hyphens"
            )
        self.name = name
        self.language = language
        self.replicas = replicas
        self.database = database
        self.namespace = f"app-{name}"

    def create(self):
        """Orchestrate all infrastructure provisioning.

        In a real platform, each step should be part of a transaction with
        compensating actions (rollback) to avoid leaving partial state if
        something fails mid-way.
        """
        created_namespace = False
        try:
            self._create_namespace()
            created_namespace = True
            self._provision_database()
            self._generate_manifests()
            self._setup_monitoring()
            self._configure_cicd()
            self._apply_kubernetes()
        except Exception:
            if created_namespace:
                # Best-effort rollback to avoid inconsistent state
                self._rollback()
            raise

    def _rollback(self):
        """Best-effort rollback of resources created during provisioning."""
        # In this simplified example, we only clean up the namespace. A real
        # implementation would also undo database provisioning (note: databases
        # with deletion_protection=true require manual Terraform destroy or
        # removing the protection flag first), CI/CD config, etc.
        subprocess.run(
            ["kubectl", "delete", "namespace", self.namespace, "--ignore-not-found"],
            check=False,
            capture_output=True,
        )

    def _create_namespace(self):
        """Create an isolated Kubernetes namespace with network policies."""
        manifest = f"""\
apiVersion: v1
kind: Namespace
metadata:
  name: {self.namespace}
  labels:
    platform.company.com/managed: "true"
    platform.company.com/service: {self.name}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: {self.namespace}
spec:
  podSelector: {{}}
  policyTypes:
    - Ingress
    - Egress
  # Default-deny with minimal required access for demonstration.
  # In production, define specific ingress rules (e.g., from the ingress
  # controller) and egress rules (e.g., to specific services, external APIs).
  ingress: []
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53  # Allow DNS
"""
        subprocess.run(
            ["kubectl", "apply", "-f", "-"],
            input=manifest.encode("utf-8"),
            check=True,
        )

    def _provision_database(self):
        """Use Terraform to provision a managed database."""
        if not self.database:
            return
        terraform_config = f"""\
resource "google_sql_database_instance" "{self.name}_db" {{
  name                = "{self.name}-db"
  database_version    = "POSTGRES_15"
  region              = "us-central1"
  deletion_protection = true

  settings {{
    # In production, make tier configurable via the Application spec.
    # Example tiers: db-custom-1-3840 (~$50/mo), db-custom-2-7680 (~$100/mo).
    # For dev/test, use db-f1-micro or db-g1-small to reduce costs.
    tier = "db-custom-1-3840"

    backup_configuration {{
      enabled                        = true
      point_in_time_recovery_enabled = true
    }}

    ip_configuration {{
      ipv4_enabled = true
      require_ssl  = true
    }}

    database_flags {{
      name  = "cloudsql.iam_authentication"
      value = "on"
    }}
  }}
}}

resource "google_sql_database" "{self.name}" {{
  name     = "{self.name}"
  instance = google_sql_database_instance.{self.name}_db.name
}}
"""
        terraform_dir = Path("terraform")
        terraform_dir.mkdir(exist_ok=True)
        (terraform_dir / "database.tf").write_text(terraform_config)
        # WARNING: In production, keep .tf files under version control and use
        # a remote state backend (S3, GCS, Terraform Cloud). Never delete state
        # files - they track real infrastructure.
        subprocess.run(["terraform", "init", "-input=false"], cwd=terraform_dir, check=True)
        subprocess.run(["terraform", "apply", "-auto-approve"], cwd=terraform_dir, check=True)

    def _generate_manifests(self):
        """Render Deployment/Service manifests from Jinja2 templates (omitted for brevity)."""

    def _configure_cicd(self):
        """Generate the CI/CD pipeline configuration (omitted for brevity)."""

    def _apply_kubernetes(self):
        """kubectl-apply the rendered manifests (omitted for brevity)."""

    def _setup_monitoring(self):
        """Configure a Prometheus ServiceMonitor automatically."""
        monitor = f"""\
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {self.name}
  namespace: {self.namespace}
spec:
  selector:
    matchLabels:
      app: {self.name}
  endpoints:
    - port: metrics
      interval: 30s
"""
        subprocess.run(
            ["kubectl", "apply", "-f", "-"],
            input=monitor.encode("utf-8"),
            check=True,
        )
This abstraction handles Kubernetes, Terraform, and monitoring configuration with a single interface. Developers never see the complexity underneath.
Abstract Kubernetes Complexity with CRDs
One of my favorite patterns is using Kubernetes Custom Resource Definitions (CRDs) to create higher-level abstractions. Instead of making developers write Deployments, Services, and Ingresses, we built an Application CRD:
apiVersion: platform.company.com/v1
kind: Application
metadata:
  name: api-gateway
  namespace: production
spec:
  image: gcr.io/company/api-gateway:v1.2.3
  replicas: 3
  language: go
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
  # In a real platform, also set namespace-level ResourceQuotas to prevent
  # any single application from consuming all cluster resources
  database:
    type: postgres
    size: db-custom-2-8192
    user: api_gateway_user  # Must start with a letter/underscore, then alphanumerics/underscores
  cache:
    type: redis
    version: "7.0"
  monitoring:
    enabled: true
    alerts:
      - high-error-rate
      - high-latency
A Kubernetes operator watches these Application resources and generates the dozens of underlying Kubernetes objects needed. Developers describe what they want, not how to build it.
The operator implementation is surprisingly straightforward in Go:
package main

import (
	"context"
	"fmt"
	"regexp"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Pre-compiled regex at package level for performance.
// Validates database usernames: must start with a letter or underscore,
// followed by alphanumerics or underscores.
var dbUsernameRegex = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// Application represents the CRD type defined earlier in the blog post.
// Example structure:
//
//	type Application struct {
//		Name      string
//		Namespace string
//		Spec      ApplicationSpec
//	}
//
//	type ApplicationSpec struct {
//		Image     string
//		Replicas  int32
//		Resources corev1.ResourceRequirements
//		Database  *DatabaseConfig
//	}
//
//	type DatabaseConfig struct {
//		Type string // e.g., "postgres"
//		Size string // e.g., "db-custom-2-8192"
//		User string // Database username (validated)
//	}
type ApplicationReconciler struct {
	Client kubernetes.Interface
}

func (r *ApplicationReconciler) Reconcile(ctx context.Context, app *Application) error {
	// Generate a Deployment from the Application spec
	deployment := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      app.Name,
			Namespace: app.Namespace,
			Labels: map[string]string{
				"app":              app.Name,
				"platform.managed": "true",
			},
		},
		Spec: appsv1.DeploymentSpec{
			Replicas: &app.Spec.Replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": app.Name},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{"app": app.Name},
					Annotations: map[string]string{
						"prometheus.io/scrape": "true",
						"prometheus.io/port":   "8080",
					},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{
							Name:  app.Name,
							Image: app.Spec.Image,
							Resources: corev1.ResourceRequirements{
								Requests: app.Spec.Resources.Requests,
								Limits:   app.Spec.Resources.Limits,
							},
						},
					},
				},
			},
		},
	}

	// Create or update the Deployment using the create-or-update pattern
	deploymentsClient := r.Client.AppsV1().Deployments(app.Namespace)
	existing, err := deploymentsClient.Get(ctx, deployment.Name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			if _, err := deploymentsClient.Create(ctx, deployment, metav1.CreateOptions{}); err != nil {
				return fmt.Errorf("failed to create deployment: %w", err)
			}
		} else {
			return fmt.Errorf("failed to get deployment: %w", err)
		}
	} else {
		deployment.ResourceVersion = existing.ResourceVersion
		if _, err := deploymentsClient.Update(ctx, deployment, metav1.UpdateOptions{}); err != nil {
			return fmt.Errorf("failed to update deployment: %w", err)
		}
	}

	// Provision a database if specified
	if app.Spec.Database != nil {
		if err := r.provisionDatabase(ctx, app); err != nil {
			return err
		}
	}
	return nil
}

func (r *ApplicationReconciler) provisionDatabase(ctx context.Context, app *Application) error {
	// Validate the database username to prevent SQL injection.
	// In production, also validate against database-specific constraints.
	if !isValidDatabaseUsername(app.Spec.Database.User) {
		return fmt.Errorf("invalid database username: must start with a letter or underscore and contain only alphanumerics and underscores")
	}

	// Call Terraform or a cloud provider API to provision the database,
	// then inject connection details as a Kubernetes Secret
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-db", app.Name),
			Namespace: app.Namespace,
		},
		StringData: map[string]string{
			"host":     "postgres.example.com",
			"database": app.Name,
			"username": app.Spec.Database.User,
			"password": generateSecurePassword(),
		},
	}

	secretsClient := r.Client.CoreV1().Secrets(app.Namespace)
	existing, err := secretsClient.Get(ctx, secret.Name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			if _, err := secretsClient.Create(ctx, secret, metav1.CreateOptions{}); err != nil {
				return fmt.Errorf("creating database secret %q: %w", secret.Name, err)
			}
			return nil
		}
		return fmt.Errorf("getting database secret %q: %w", secret.Name, err)
	}
	secret.ResourceVersion = existing.ResourceVersion
	if _, err := secretsClient.Update(ctx, secret, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("updating database secret %q: %w", secret.Name, err)
	}
	return nil
}

// generateSecurePassword creates a cryptographically secure random password.
// In production, use a secrets management system like HashiCorp Vault or a
// cloud provider secret manager (AWS Secrets Manager, GCP Secret Manager, etc.)
// rather than generating passwords in-cluster.
//
// NOTE: This function intentionally panics to prevent copy-paste usage without
// proper secrets manager integration. In a real controller, you would return
// an error and let the reconciler handle it gracefully via status conditions.
func generateSecurePassword() string {
	// Example implementation using crypto/rand (simplified):
	//
	//	b := make([]byte, 32)
	//	if _, err := rand.Read(b); err != nil {
	//		panic(err) // or handle the error appropriately in production
	//	}
	//	return base64.URLEncoding.EncodeToString(b)
	//
	// IMPORTANT: This placeholder will not work in production!
	// Replace with actual secrets manager integration before deploying.
	panic("generateSecurePassword must be implemented with secrets manager integration")
}

// isValidDatabaseUsername validates the database username to prevent SQL injection.
// Must start with a letter or underscore, then alphanumerics or underscores.
func isValidDatabaseUsername(username string) bool {
	return dbUsernameRegex.MatchString(username) && len(username) <= 63 // PostgreSQL max identifier length
}
This operator pattern lets us evolve platform capabilities without changing developer workflows. We add features by updating the operator, not by teaching everyone new tools.
Migrate to Platform Engineering Gradually
You can’t build a platform overnight. The teams I’ve worked with that succeeded followed a gradual migration:
Month 1-2: Start with a single team and one use case. For us, it was new service creation. We built just enough platform to handle that workflow.
Month 3-4: Migrate existing services one at a time. Each migration taught us what the platform was missing. We added database migration support, secret management, and monitoring integration.
Month 5-6: Expand to more teams. We documented patterns, created runbooks, and built self-service dashboards. The platform became the default way to deploy.
Month 7+: Iterate based on feedback. We added cost attribution, compliance reporting, and disaster recovery automation.
The key insight: build for the organization you have, not the one you wish you had. Start small, prove value, then expand.
Measure Platform Engineering ROI
I track platform effectiveness with four metrics:
Time to first deployment: How long does it take a new developer to ship code to production? We went from three days to thirty minutes.
Mean time to recovery (MTTR): How quickly can we recover from incidents? Platform abstractions made rollbacks instant.
Cognitive load reduction: How many tools does a developer need to learn? We reduced it from twelve to two (Git and the platform CLI).
Developer satisfaction: Would developers recommend the platform? We survey quarterly and iterate based on feedback.
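Most of these reduce to simple arithmetic over event data the platform already collects. As a rough sketch, here is how MTTR could be computed from incident records; the record format with `started`/`resolved` timestamps is a hypothetical example, not a format from the platform above:

```python
# Hypothetical MTTR calculation from incident records.
from datetime import datetime

incidents = [
    {"started": "2024-03-01T10:00", "resolved": "2024-03-01T10:12"},  # 12 min
    {"started": "2024-03-05T22:30", "resolved": "2024-03-05T23:06"},  # 36 min
]

def mttr_minutes(records: list[dict]) -> float:
    """Mean time to recovery in minutes across resolved incidents."""
    durations = [
        (datetime.fromisoformat(r["resolved"]) - datetime.fromisoformat(r["started"]))
        .total_seconds() / 60
        for r in records
    ]
    return sum(durations) / len(durations)

print(mttr_minutes(incidents))  # 24.0
```

The other metrics follow the same pattern: instrument the platform's own events, then report trends rather than one-off snapshots.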
The Bottom Line
Building an internal developer platform isn’t about creating more complexity—it’s about hiding it. The best platforms feel invisible. Developers focus on building features, not debugging infrastructure.
In my experience, the ROI appears within three months. Faster deployments, fewer incidents, and happier developers make the investment worthwhile. You don’t need to master every DevOps tool. You need to build abstractions that make the tools irrelevant.
If you’re drowning in tool sprawl, consider building a platform instead of learning another framework. Your future self will thank you.