Debugging Kubernetes Kernel Memory
Diagnose Kubernetes kernel memory issues with slab allocators and page cache analysis. Production-tested strategies to prevent OOMKills and memory pressure.
I woke up to Gitaly pods hitting memory limits at 3 AM. The alerts showed OOMKilled containers, evicted pods, and degraded Git operations across our self-hosted GitLab infrastructure. Standard Kubernetes memory pressure—except the container metrics showed normal usage. The kernel was consuming 4GB of unreclaimable memory per node, and none of our monitoring caught it.
This wasn’t a memory leak in application code. It was kernel-level memory consumption from filesystem operations, page cache pressure, and slab allocator fragmentation. The kind of invisible memory usage that doesn’t appear in container metrics, evades Prometheus exporters, and only surfaces when nodes start thrashing.
After three nights of debugging production incidents, I learned that Kubernetes memory accounting tells you what containers are doing, but kernel memory reveals what the infrastructure is really doing.
The Memory That Doesn’t Show Up
Kubernetes resource limits define container memory boundaries—RSS, cache, swap usage. But kernels maintain their own memory for filesystem caches, network buffers, slab allocations, and inode structures. This kernel memory doesn’t count against container limits until it becomes critical.
In my Gitaly case, every Git operation involved hundreds of small file reads. The kernel cached inode structures, directory entries, and file metadata in slab allocators. Over weeks, this kernel memory grew to consume 40% of node capacity—memory that kubectl top nodes reported as “available.”
The core issue: Kubernetes sees kernel memory as reclaimable until the system proves otherwise through OOM conditions.
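You can see the gap directly on a cgroup v2 node: /proc/meminfo reports node-wide slab, while the kubepods cgroup's memory.stat shows how much of that is actually charged to pods. The difference is kernel overhead that no container owns. A quick comparison, run directly on a node (the kubepods.slice path assumes cgroup v2 with the systemd cgroup driver, so adjust for your setup):

# Node-wide slab counters vs. slab charged to pod cgroups (cgroup v2).
# The kubepods.slice path assumes the systemd cgroup driver.
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
grep -E '^(slab|slab_reclaimable|slab_unreclaimable|kernel_stack) ' \
  /sys/fs/cgroup/kubepods.slice/memory.stat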
Identify Kubernetes Kernel Memory Consumers
Standard monitoring tools miss kernel memory patterns. I needed direct access to kernel internals to understand what was consuming resources.
The /proc/meminfo interface reveals kernel memory breakdowns that container metrics never expose:
#!/bin/bash
# Query kernel memory statistics on each node
for node in $(kubectl get nodes -o name); do
  echo "=== $node ==="
  kubectl debug "$node" -it --image=busybox -- cat /proc/meminfo | \
    grep -E 'Slab|SReclaimable|SUnreclaim|KernelStack|PageTables'
done
This script showed me that SUnreclaim (unreclaimable slab memory) was growing 200MB per day on Gitaly nodes. Page tables and kernel stacks were normal, which ruled out process leaks or connection explosions.
The slab allocator was the culprit—specifically, filesystem metadata caching from millions of small Git objects.
Track Slab Allocator Usage in Production
The slab allocator manages kernel memory for frequently allocated objects: inodes, dentries, file descriptors, network buffers. High-churn workloads fragment these slabs into unreclaimable memory.
To identify which slab caches were consuming memory, I used /proc/slabinfo:
# Identify top slab caches by memory used (active objects x object size, in MB)
kubectl debug node/worker-03 -it --image=ubuntu -- sh -c \
"apt update && apt install -y procps && \
cat /proc/slabinfo | tail -n +3 | \
awk '{print \$2*\$4/1024/1024, \$1}' | sort -rn | head -20"
Output showed ext4_inode_cache and dentry consuming 1.8GB and 1.2GB respectively. These caches grow when filesystems handle massive small-file workloads—exactly what Git repositories do.
The kernel was correctly caching filesystem metadata, but Kubernetes scheduling didn’t account for this memory usage when placing new pods.
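If you would rather skip the awk arithmetic, slabtop (shipped in the procps package the command above installs) reads /proc/slabinfo for you. A one-shot dump sorted by cache size shows the same ext4_inode_cache and dentry entries at the top:

# One-shot listing of slab caches, sorted by total cache size (largest first)
slabtop -o -s c | head -25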
The Git Workload Memory Pattern
Git operations are uniquely demanding on filesystem caches. A single git fetch might access thousands of objects, each requiring inode lookups, directory traversals, and metadata reads.
In my GitLab setup, Gitaly serves hundreds of concurrent repository operations. Each operation generates filesystem cache entries that persist in kernel memory for performance. Over time, this accumulates into gigabytes of slab allocations.
I verified this by monitoring slab growth during high Git activity:
#!/usr/bin/env python3
import time
import subprocess


def get_slab_usage():
    """Extract total slab memory from /proc/meminfo"""
    result = subprocess.run(
        ["cat", "/proc/meminfo"],
        capture_output=True, text=True
    )
    for line in result.stdout.split('\n'):
        if line.startswith('Slab:'):
            return int(line.split()[1])  # KB
    return 0


# Monitor slab growth during Git operations
print("Timestamp,SlabKB,SlabMB")
while True:
    slab_kb = get_slab_usage()
    print(f"{time.time()},{slab_kb},{slab_kb/1024:.2f}")
    time.sleep(60)
During peak hours (8 AM - 6 PM), slab memory grew 150MB/hour. During off-hours, it remained stable. The correlation with Git activity was unmistakable.
Why Kernel Memory Pressure Causes OOMKills
When nodes hit memory pressure, the kernel tries to reclaim memory by evicting page caches and shrinking slab allocators. But if slab memory is unreclaimable (actively in use by filesystem operations), the kernel has fewer options.
Kubernetes sees total node memory as full, triggers pod evictions, and eventually OOMKills processes to free resources. But the kernel memory causing pressure isn’t tied to specific containers—it’s infrastructure overhead from shared filesystem operations.
This creates a vicious cycle: evicting pods frees container memory but not kernel memory, so the node continues experiencing pressure, leading to more evictions.
I confirmed this pattern in kernel logs:
# Check for memory pressure events in kernel logs
kubectl debug node/worker-03 -it --image=ubuntu -- \
dmesg | grep -E 'Out of memory|Memory cgroup|oom_reaper'
Logs showed the OOM killer targeting high-memory processes (Gitaly workers) while kernel slab allocations remained protected.
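On kernels with pressure stall information (4.20 and later, with CONFIG_PSI enabled), /proc/pressure/memory is a faster signal than digging through dmesg. Sustained non-zero "full" averages mean tasks are stalling on reclaim instead of doing work:

# PSI memory pressure: "some" = at least one task stalled on memory,
# "full" = all non-idle tasks stalled. Rising full avg10/avg60 precede evictions.
kubectl debug node/worker-03 -it --image=busybox -- cat /proc/pressure/memory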
Configure Kubernetes Memory Reservations
Kubernetes supports system-reserved and kube-reserved memory, which excludes kernel and system overhead from schedulable capacity. I wasn’t using these settings, so Kubernetes assumed all node memory was available for containers.
Updating kubelet configuration to reserve memory for kernel operations:
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "2Gi"
kubeReserved:
  memory: "1Gi"
evictionHard:
  memory.available: "500Mi"
This reserves 3GB for kernel and kubelet operations, reducing schedulable capacity but preventing OOMKills from kernel memory pressure.
After applying these settings and restarting the kubelet on each node (kubelet config changes only take effect after a restart), memory pressure incidents stopped. Kernel slab memory continued growing during Git operations, but stayed within the reserved boundaries.
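It's worth confirming the reservations actually landed. The node's Allocatable should now read roughly capacity minus the 3GiB reserved minus the eviction threshold:

# Compare Capacity vs. Allocatable after reserving memory for system and kubelet
kubectl describe node worker-03 | grep -E -A 6 '^(Capacity|Allocatable):'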
Optimize Filesystem Cache Settings
The kernel’s default behavior is to cache aggressively and reclaim only under pressure. For workloads with predictable memory patterns, I tuned cache retention with vm.vfs_cache_pressure.
# Increase cache eviction pressure (default: 100)
sysctl -w vm.vfs_cache_pressure=200
# Persist across reboots
echo "vm.vfs_cache_pressure=200" >> /etc/sysctl.d/99-kubernetes.conf
Higher values make the kernel reclaim dentry and inode caches more aggressively, reducing long-term slab accumulation. I tested values from 150-250 and found 200 balanced Git performance with memory stability.
For Kubernetes, I deployed this as a DaemonSet with init containers applying sysctl settings to all nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuning
  template:
    metadata:
      labels:
        app: sysctl-tuning
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: sysctl-init
        image: busybox
        command:
        - sh
        - -c
        - |
          sysctl -w vm.vfs_cache_pressure=200
          sysctl -w vm.min_free_kbytes=67584
        securityContext:
          privileged: true
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
This approach ensures consistent kernel tuning across all cluster nodes without manual SSH configuration.
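A quick spot check confirms the settings took effect, since /proc/sys/vm is not namespaced and any pod on the node sees the host values:

# Verify the tuned values on one node
kubectl debug node/worker-03 -it --image=busybox -- \
  sh -c 'cat /proc/sys/vm/vfs_cache_pressure /proc/sys/vm/min_free_kbytes'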
Monitor Kubernetes Kernel Memory Metrics
Once I understood kernel memory patterns, I added Prometheus metrics to track slab and page cache usage. Our node-exporter setup wasn't surfacing the slab breakdown I wanted (node-exporter's per-cache slabinfo collector is disabled by default), so I created a custom exporter.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	slabTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "node_memory_slab_bytes",
		Help: "Total slab memory in bytes",
	})
	slabReclaimable = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "node_memory_slab_reclaimable_bytes",
		Help: "Reclaimable slab memory in bytes",
	})
	slabUnreclaim = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "node_memory_slab_unreclaim_bytes",
		Help: "Unreclaimable slab memory in bytes",
	})
)

func init() {
	prometheus.MustRegister(slabTotal)
	prometheus.MustRegister(slabReclaimable)
	prometheus.MustRegister(slabUnreclaim)
}

func updateMetrics() error {
	file, err := os.Open("/proc/meminfo")
	if err != nil {
		return err
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		value, _ := strconv.ParseFloat(fields[1], 64)
		value *= 1024 // /proc/meminfo reports kB; convert to bytes

		switch key {
		case "Slab":
			slabTotal.Set(value)
		case "SReclaimable":
			slabReclaimable.Set(value)
		case "SUnreclaim":
			slabUnreclaim.Set(value)
		}
	}
	return scanner.Err()
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		updateMetrics()
		promhttp.Handler().ServeHTTP(w, r)
	})
	fmt.Println("Slab exporter listening on :9101")
	http.ListenAndServe(":9101", nil)
}
Deployed as a DaemonSet, this exporter provided real-time visibility into kernel memory trends. I created Grafana dashboards showing slab growth correlated with Git operation rates.
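With the metric in place, alerting on trend rather than absolute value caught problem nodes early. As a starting point, a query like the one below flags nodes whose unreclaimable slab is on track to cross roughly 6GB within a day; the Prometheus address and threshold are illustrative, not taken from our setup:

# Project unreclaimable slab 24h ahead from the last 6h of growth and
# flag nodes expected to exceed ~6GB. Address and threshold are examples.
curl -s 'http://prometheus:9090/api/v1/query' --data-urlencode \
  'query=predict_linear(node_memory_slab_unreclaim_bytes[6h], 86400) > 6e9'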
When Kernel Memory Growth Is Normal
Not all kernel memory growth indicates problems. Filesystem caches improve performance by keeping frequently accessed metadata in memory. The kernel is designed to use available RAM for caching.
The issue arises when Kubernetes scheduling doesn’t account for kernel memory overhead. A node reporting 8GB “available” might have 4GB in kernel caches that won’t be reclaimed easily.
I learned to distinguish between healthy caching and problematic accumulation:
- Healthy: Slab memory grows during activity, shrinks during idle periods
- Problematic: Slab memory grows continuously, never reclaims, causes OOMKills
For Git workloads, some kernel memory overhead is expected and beneficial. The solution isn’t eliminating kernel caches but reserving appropriate node capacity for them.
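One way to tell the two cases apart is to force a reclaim pass and watch whether the slab counters actually shrink; whatever survives is genuinely pinned. This throws away warm caches and will slow Git operations temporarily, so run it as root on a node you can afford to degrade:

# drop_caches=2 asks the kernel to reclaim dentries and inodes only.
# Compare slab counters before and after; unreclaimable memory won't move.
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
sync && echo 2 > /proc/sys/vm/drop_caches
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo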
Lessons from Three Nights of Debugging
- Container metrics lie by omission - Kubernetes reports container memory accurately but ignores kernel overhead
- Kernel memory is infrastructure tax - High-churn workloads like Git operations create unavoidable kernel memory costs
- System reservations aren’t optional - Production clusters need explicit kernel memory reservations
- Slab allocators are invisible until they’re not - Filesystem caches grow silently until they trigger OOMKills
- Tuning kernel parameters requires testing - vfs_cache_pressure changes performance characteristics, test thoroughly
The hardest part wasn’t fixing the issue—it was realizing container metrics didn’t show the full picture. Once I started monitoring kernel memory directly, the patterns became obvious.
If your Kubernetes cluster experiences unexplained memory pressure, check kernel memory before blaming applications. The real consumer might be the infrastructure, not the workload.