Debug Hidden Linux Kernel Bugs
Master kernel debugging with eBPF, ftrace, and perf. Identify latent bugs hiding in production infrastructure and fix them before system outages occur.
Linux kernel debugging in production environments is fundamentally different from application debugging. Most kernel bugs don’t announce themselves with kernel panics or obvious stack traces. They hide in production for months or years, manifesting as inexplicable performance degradation, mysterious memory leaks, or rare race conditions that only trigger under specific workload patterns. I’ve spent hundreds of hours tracking down these ghosts in the machine, and the hardest lesson I learned was this: by the time you notice the symptom, the root cause is often buried under layers of system behavior that look perfectly normal.
The real challenge isn’t finding bugs that crash systems—those get fixed quickly. It’s the subtle ones that degrade performance by 5%, cause occasional connection timeouts, or create memory pressure that only appears after days of uptime. These are the bugs that hide in production infrastructure for years, slowly eroding reliability until someone finally connects the dots.
Identify Hidden Kernel Bug Patterns
Latent kernel bugs have distinct signatures that differ from application-level problems. In my experience debugging production infrastructure across cloud platforms and bare metal deployments, I’ve learned to identify several categories of kernel-related issues that traditional monitoring often misses.
Memory subsystem anomalies are among the most insidious. I once tracked down a bug in the kernel’s slab allocator that caused gradual memory fragmentation over weeks of uptime. The symptoms were subtle: occasional allocation failures in completely unrelated subsystems, increased page fault rates, and degraded network throughput. Traditional memory monitoring showed plenty of free memory, but the kernel couldn’t allocate contiguous pages when it needed them.
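When I suspect this failure mode now, the first thing I check is the buddy allocator's per-order free lists, which expose the fragmentation that aggregate free-memory numbers hide. A quick sketch of the checks I run:
# Free block counts per allocation order (4 KB, 8 KB, ...): plenty of order-0
# pages but empty high-order columns means contiguous memory is scarce
cat /proc/buddyinfo
# Compaction and high-order allocation activity over time
grep -E 'compact_stall|compact_fail|pgalloc' /proc/vmstat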
Filesystem and I/O bugs often hide behind application behavior. I worked with a team experiencing random database checkpoint timeouts that only occurred after 10+ days of continuous operation. The culprit was a kernel bug in the ext4 journal that caused write stalls under specific metadata workload patterns. The bug had existed for three years before we identified it, affecting thousands of production systems without anyone connecting the symptoms to a kernel issue.
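In hindsight, a latency histogram of the ext4 sync path would have exposed the stalls long before anyone blamed the database. A minimal bpftrace sketch of that measurement, probing ext4_sync_file (the ext4 fsync entry point):
# Histogram of ext4 fsync latency in microseconds; watch for a long tail
bpftrace -e '
kprobe:ext4_sync_file { @start[tid] = nsecs; }
kretprobe:ext4_sync_file /@start[tid]/ {
    @fsync_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'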
Network stack race conditions are particularly difficult to debug because they’re timing-dependent. I encountered a bug in the TCP congestion control algorithm that only manifested when specific network latency patterns coincided with high connection churn rates. The symptoms—occasional connection hangs lasting exactly 200ms—looked like network issues to our monitoring systems, but the root cause was entirely in the kernel’s network stack.
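Hangs of exactly 200ms smell like a kernel timer rather than the network, so my first check in cases like this is whether the kernel itself is retransmitting. A bpftrace sketch of that check, assuming a kernel recent enough to expose the tcp tracepoints:
# Count kernel TCP retransmissions per process over 5-second windows
bpftrace -e '
tracepoint:tcp:tcp_retransmit_skb { @retransmits[comm] = count(); }
interval:s:5 { print(@retransmits); clear(@retransmits); }'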
Deploy Production Kernel Debugging Tools
Effective kernel debugging requires purpose-built tools that expose kernel internals without disrupting production workloads. Here’s the debugging stack I rely on for production kernel investigations:
eBPF and bpftrace provide the foundation for low-overhead kernel instrumentation. I use bpftrace for exploratory debugging and custom eBPF programs when I need sustained monitoring. This Python script demonstrates how I use eBPF to track kernel memory allocation patterns:
#!/usr/bin/env python3
from bcc import BPF
import time

# eBPF program to track kmalloc allocations
bpf_program = """
#include <uapi/linux/ptrace.h>
#include <linux/slab.h>

struct alloc_info_t {
    u64 timestamp;
    u64 size;
    u64 address;   // allocation address (filled in when tracking outstanding allocations)
    u32 pid;
    char comm[16];
};

BPF_PERF_OUTPUT(events);
// Keyed by allocation address; pairing this with a kretprobe on __kmalloc
// (to capture the returned pointer) turns the script into a leak tracker.
BPF_HASH(allocations, u64, struct alloc_info_t);

int trace_kmalloc(struct pt_regs *ctx, size_t size) {
    struct alloc_info_t info = {};
    info.timestamp = bpf_ktime_get_ns();
    info.size = size;
    info.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&info.comm, sizeof(info.comm));
    events.perf_submit(ctx, &info, sizeof(info));
    return 0;
}

int trace_kfree(struct pt_regs *ctx, void *ptr) {
    u64 addr = (u64)ptr;
    allocations.delete(&addr);
    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="__kmalloc", fn_name="trace_kmalloc")
b.attach_kprobe(event="kfree", fn_name="trace_kfree")

def print_event(cpu, data, size):
    event = b["events"].event(data)
    print(f"[{event.timestamp}] PID {event.pid} ({event.comm.decode('utf-8', 'replace')}): "
          f"allocated {event.size} bytes")

b["events"].open_perf_buffer(print_event)
print("Tracing kernel memory allocations... Hit Ctrl-C to end.")
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        break
This script instruments kmalloc and kfree calls with minimal overhead, allowing me to observe kernel memory allocation patterns in real time. I’ve used variations of this approach to identify memory leaks in kernel modules, track allocation hotspots during performance degradation, and correlate allocation patterns with application-level behavior.
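For the exploratory pass that usually precedes a script like this, bpftrace one-liners over the kmem tracepoints are faster to iterate on. A couple I reach for, with tracepoint field names as exposed on recent kernels:
# Size distribution of kmalloc requests across the whole kernel
bpftrace -e 'tracepoint:kmem:kmalloc { @alloc_bytes = hist(args->bytes_alloc); }'
# Allocation counts per calling function, to find the hotspot
bpftrace -e 'tracepoint:kmem:kmalloc { @[ksym(args->call_site)] = count(); }'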
ftrace remains essential for function-level kernel tracing. When eBPF isn’t sufficient, I use ftrace to capture detailed execution paths through kernel subsystems. This bash script shows my typical ftrace workflow for investigating filesystem performance issues:
#!/bin/bash
# Capture filesystem write path with function graph tracing
TRACE_DIR="/sys/kernel/debug/tracing"
TRACE_FILE="/tmp/kernel_trace_$(date +%Y%m%d_%H%M%S).txt"
# Enable function graph tracer
echo function_graph > "${TRACE_DIR}/current_tracer"
# Filter to filesystem functions
echo 'vfs_write' > "${TRACE_DIR}/set_graph_function"
echo 'ext4_*' >> "${TRACE_DIR}/set_graph_function"
# Set maximum graph depth
echo 10 > "${TRACE_DIR}/max_graph_depth"
# Clear existing trace
echo > "${TRACE_DIR}/trace"
# Enable tracing
echo 1 > "${TRACE_DIR}/tracing_on"
# Let it run for 30 seconds
sleep 30
# Disable tracing
echo 0 > "${TRACE_DIR}/tracing_on"
# Dump trace output
cat "${TRACE_DIR}/trace" > "${TRACE_FILE}"
# Reset tracer
echo nop > "${TRACE_DIR}/current_tracer"
echo "Trace saved to ${TRACE_FILE}"
I use this script when investigating write latency issues, filesystem lock contention, or unexpected I/O patterns. The function graph tracer provides precise execution time for each function in the call chain, making it possible to identify exactly where time is being spent in the kernel.
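On a busy box I also scope the trace to the process showing the stalls so unrelated writers don't flood the ring buffer. A small, hypothetical addition to the script above (the PID is a placeholder):
# Restrict function graph tracing to one process (add before enabling tracing)
TRACE_DIR="/sys/kernel/debug/tracing"
TARGET_PID=1234   # hypothetical: PID of the process showing write stalls
echo "${TARGET_PID}" > "${TRACE_DIR}/set_ftrace_pid"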
perf provides statistical sampling that’s safe for production use. I combine perf with flame graphs to visualize where the kernel spends time during performance degradation events:
#!/bin/bash
# Capture kernel CPU profile with call stacks
SVG_FILE="kernel_profile_$(date +%Y%m%d_%H%M%S).svg"
# Record kernel samples for 60 seconds
perf record -F 99 -a -g --call-graph dwarf -- sleep 60
# Generate flame graph data
perf script | ~/FlameGraph/stackcollapse-perf.pl | \
    ~/FlameGraph/flamegraph.pl --title "Kernel CPU Profile" > "${SVG_FILE}"
echo "Flame graph saved to ${SVG_FILE}"
# Report top kernel functions
perf report --stdio --sort symbol -n --percent-limit 1
This approach helped me identify a kernel bug where a lock contention issue in the network stack caused CPU time to spike in the RCU (Read-Copy-Update) code path. The flame graph made it obvious that the kernel was spending 30% of CPU time in RCU callbacks, which led me to investigate recent changes in the network driver that were triggering excessive RCU synchronization.
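To put a number on that kind of observation without rendering a flame graph, I filter the same profile down to the suspect symbols. A quick follow-up against the perf.data from the capture above:
# Quantify how much CPU time lands in RCU-related kernel symbols
perf report --stdio --sort symbol -n | grep -i rcu | head -20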
Isolate Kernel Bugs from Application Issues
The hardest part of kernel debugging is distinguishing kernel issues from application-level problems that happen to trigger kernel code paths. I’ve developed a systematic approach to isolate true kernel bugs from application behavior that merely looks like a kernel issue.
Reproduce across different workloads. When I suspect a kernel bug, I first try to reproduce the symptoms with completely different applications. If the issue only manifests with one specific application, it’s more likely an application bug or a kernel API misuse. True kernel bugs affect multiple workloads that exercise the same kernel subsystem.
I once investigated a “kernel networking bug” that only affected a specific microservice. After reproducing the connection timeout pattern with a simple socket test program, I confirmed it was actually a kernel bug in TCP keep-alive handling, not an application issue. The key was isolating the minimal kernel code path that triggered the problem.
Compare kernel versions systematically. Bisecting kernel versions is tedious but often the fastest path to identifying when a bug was introduced. I maintain a collection of kernel builds spanning multiple stable releases specifically for bisection testing. This Terraform configuration shows how I automate kernel bisection testing in AWS:
# Terraform configuration for automated kernel bisection testing
variable "kernel_versions" {
  type = list(string)
  default = [
    "5.10.0",
    "5.15.0",
    "6.1.0",
    "6.6.0"
  ]
}

resource "aws_launch_template" "kernel_test" {
  for_each = toset(var.kernel_versions)

  name_prefix   = "kernel-bisect-${each.value}-"
  image_id      = data.aws_ami.ubuntu_base.id
  instance_type = "c5.xlarge"

  user_data = base64encode(templatefile("${path.module}/install_kernel.sh", {
    kernel_version = each.value
  }))

  block_device_mappings {
    device_name = "/dev/sda1"
    ebs {
      volume_size = 20
      volume_type = "gp3"
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name          = "kernel-test-${each.value}"
      KernelVersion = each.value
      Purpose       = "bisection-testing"
    }
  }
}

resource "aws_instance" "kernel_test" {
  for_each = aws_launch_template.kernel_test

  launch_template {
    id      = each.value.id
    version = "$Latest"
  }

  vpc_security_group_ids = [aws_security_group.kernel_test.id]
  subnet_id              = aws_subnet.test.id

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "cd /opt/workload-test",
      "./run_test_suite.sh",
      "tar czf /tmp/kernel-test-results.tar.gz /var/log/test-results/"
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"
      private_key = file(var.ssh_private_key_path)
      host        = self.public_ip
    }
  }
}
output "test_instances" {
value = {
for k, v in aws_instance.kernel_test : k => {
kernel_version = var.kernel_versions[index(keys(aws_instance.kernel_test), k)]
public_ip = v.public_ip
instance_id = v.id
}
}
}
This infrastructure spins up parallel test environments with different kernel versions, runs identical workloads, and collects metrics that help identify exactly which kernel version introduced the regression. I’ve used this approach to narrow down bugs to specific kernel release ranges, which dramatically reduces the scope of investigation.
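Once the version matrix narrows the regression to a range, I switch from whole-release testing to source-level bisection on the stable tree. A rough sketch of that step, with the good and bad tags standing in for whatever range the tests identified:
# Bisect the kernel source between the last-good and first-bad releases
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect bad v6.1     # first release where the workload regressed
git bisect good v5.15   # last release that behaved correctly
# At each step: build and boot the candidate kernel, run the reproducer,
# then report the result so git picks the next commit to test
git bisect good         # or: git bisect bad
# git eventually prints the first bad commit; finish with:
git bisect reset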
Analyze kernel data structures directly. When symptoms point to kernel state corruption, I use crash dumps and live kernel debugging to inspect data structures. The SystemTap language provides safe access to kernel internals for this purpose:
#!/usr/bin/env stap
# Inspect TCP connection state to debug connection hang issues

probe begin {
    printf("Monitoring TCP connection states...\n")
}

probe kernel.function("tcp_v4_do_rcv") {
    sk = $sk
    state = @cast(sk, "sock_common", "kernel")->skc_state
    if (state == 8) { # TCP_CLOSE_WAIT
        saddr = format_ipaddr(@cast(sk, "inet_sock", "kernel")->inet_saddr, 2) # 2 == AF_INET
        daddr = format_ipaddr(@cast(sk, "inet_sock", "kernel")->inet_daddr, 2)
        # Port fields are stored in network byte order
        sport = ntohs(@cast(sk, "inet_sock", "kernel")->inet_sport)
        dport = ntohs(@cast(sk, "inet_sock", "kernel")->inet_dport)
        printf("CLOSE_WAIT: %s:%d -> %s:%d\n",
               saddr, sport, daddr, dport)
        # Check if socket has pending data
        rcv_queue = @cast(sk, "sock", "kernel")->sk_receive_queue->qlen
        if (rcv_queue > 0) {
            printf("  WARNING: %d packets in receive queue\n", rcv_queue)
        }
    }
}

probe timer.s(5) {
    printf("--- %s ---\n", ctime(gettimeofday_s()))
}
This SystemTap script helped me debug a subtle kernel bug where TCP connections in CLOSE_WAIT state weren’t properly cleaning up their receive queues, eventually exhausting socket buffers and causing connection establishment failures. The ability to inspect live kernel data structures without crashing the system was essential for understanding the bug’s behavior.
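Cheaper userspace checks are worth running alongside the SystemTap script to confirm the leak is really accumulating. The ones I pair with it:
# CLOSE_WAIT sockets with a non-zero Recv-Q are holding unread data
ss -tan state close-wait
# Kernel-wide socket memory accounting; watch the TCP "mem" figure creep upward
cat /proc/net/sockstat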
Report Kernel Bugs Effectively
Once I’ve isolated a kernel bug, the next challenge is reporting it effectively to the kernel development community. I’ve learned that kernel developers need specific information to reproduce and fix bugs, and providing incomplete reports usually results in no response.
Create minimal reproducers. The single most important thing you can provide is a reliable reproducer. I spend as much time creating minimal reproduction cases as I do investigating the initial problem. Here’s a Go program I created to reproduce a kernel networking bug:
// Minimal reproducer for TCP connection hang bug
package main

import (
    "fmt"
    "net"
    "sync"
    "time"
)

const (
    targetHost     = "127.0.0.1"
    targetPort     = "8080"
    numConnections = 1000
    requestPattern = "GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n"
)

func main() {
    var wg sync.WaitGroup
    errorCount := 0
    var errorMutex sync.Mutex

    // Spawn concurrent connections
    for i := 0; i < numConnections; i++ {
        wg.Add(1)
        go func(connID int) {
            defer wg.Done()
            conn, err := net.DialTimeout("tcp",
                fmt.Sprintf("%s:%s", targetHost, targetPort),
                5*time.Second)
            if err != nil {
                errorMutex.Lock()
                errorCount++
                fmt.Printf("[%d] Connection failed: %v\n", connID, err)
                errorMutex.Unlock()
                return
            }
            defer conn.Close()

            // Set aggressive timeouts to trigger kernel bug
            conn.SetReadDeadline(time.Now().Add(200 * time.Millisecond))
            conn.SetWriteDeadline(time.Now().Add(200 * time.Millisecond))

            // Send request
            _, err = conn.Write([]byte(requestPattern))
            if err != nil {
                errorMutex.Lock()
                errorCount++
                fmt.Printf("[%d] Write failed: %v\n", connID, err)
                errorMutex.Unlock()
                return
            }

            // Read response
            buf := make([]byte, 4096)
            _, err = conn.Read(buf)
            if err != nil {
                if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
                    errorMutex.Lock()
                    errorCount++
                    fmt.Printf("[%d] Read timeout - possible kernel bug\n", connID)
                    errorMutex.Unlock()
                }
                return
            }

            // Keep connection open briefly
            time.Sleep(100 * time.Millisecond)
        }(i)

        // Stagger connection creation
        time.Sleep(10 * time.Millisecond)
    }

    wg.Wait()
    fmt.Printf("\nCompleted: %d errors out of %d connections (%.2f%% failure rate)\n",
        errorCount, numConnections, float64(errorCount)/float64(numConnections)*100)
}
This reproducer demonstrates the exact connection pattern that triggered a kernel bug in TCP keep-alive handling. It’s self-contained, clearly documents the expected versus actual behavior, and runs in under a minute. Kernel developers appreciated having a reproducer that didn’t require understanding our entire production architecture.
Provide complete kernel context. When reporting bugs, I include kernel version, configuration options, hardware details, and relevant kernel logs. This script automates collecting the necessary information:
#!/bin/bash
# Collect kernel debugging context for bug reports
OUTPUT_DIR="/tmp/kernel-bug-report-$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUTPUT_DIR"
# Kernel version and config
uname -a > "$OUTPUT_DIR/kernel_version.txt"
cat /proc/version >> "$OUTPUT_DIR/kernel_version.txt"
cat /boot/config-$(uname -r) > "$OUTPUT_DIR/kernel_config.txt"
# Hardware information
lscpu > "$OUTPUT_DIR/cpu_info.txt"
lspci -vvv > "$OUTPUT_DIR/pci_devices.txt"
lsusb -v > "$OUTPUT_DIR/usb_devices.txt"
# Kernel logs
journalctl -k -b > "$OUTPUT_DIR/kernel_log.txt"
dmesg -T > "$OUTPUT_DIR/dmesg.txt"
# Network configuration (for network-related bugs)
ip addr show > "$OUTPUT_DIR/network_interfaces.txt"
ip route show > "$OUTPUT_DIR/routing_table.txt"
ss -tunap > "$OUTPUT_DIR/network_connections.txt"
# Memory and system state
cat /proc/meminfo > "$OUTPUT_DIR/meminfo.txt"
cat /proc/slabinfo > "$OUTPUT_DIR/slabinfo.txt"
cat /proc/vmstat > "$OUTPUT_DIR/vmstat.txt"
# Create archive
tar czf "$OUTPUT_DIR.tar.gz" "$OUTPUT_DIR"
rm -rf "$OUTPUT_DIR"
echo "Kernel debug context saved to $OUTPUT_DIR.tar.gz"
echo "Include this archive when reporting the kernel bug."
This comprehensive context has helped kernel developers quickly understand my environment and reproduce bugs without endless back-and-forth clarification questions.
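Getting the report in front of the right people matters as much as its contents. The kernel tree ships scripts/get_maintainer.pl, which maps affected source files to maintainers and mailing lists; a sketch, assuming the bug was isolated to the TCP input path:
# From a kernel source checkout, list maintainers and lists for the affected file
./scripts/get_maintainer.pl -f net/ipv4/tcp_input.c
# CC the subsystem mailing list it prints (netdev, for this example) on the
# report, together with the reproducer and the context archive above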
Mitigate Production Kernel Bugs
Kernel bugs often take months to fix and even longer to land in stable distributions. In production environments, I can’t wait that long. I’ve developed several strategies for mitigating kernel bugs while permanent fixes are in progress.
Tuning kernel parameters can sometimes work around bugs without requiring kernel patches. I once mitigated a kernel memory allocator bug by adjusting vm.min_free_kbytes to ensure the kernel maintained a larger pool of free memory for emergency allocations:
# Increase minimum free memory to mitigate allocation failures
echo 1048576 > /proc/sys/vm/min_free_kbytes # 1GB
# Bias reclaim toward page cache rather than swapping anonymous memory
echo 1 > /proc/sys/vm/swappiness
# Make changes persistent
cat >> /etc/sysctl.conf <<EOF
vm.min_free_kbytes = 1048576
vm.swappiness = 1
EOF
This configuration increased memory overhead slightly but eliminated the allocation failures we were experiencing every few days.
Workload scheduling can avoid triggering race conditions. For a network stack bug that only occurred under high connection churn, I implemented connection pooling and rate limiting to reduce the frequency of connection establishment/teardown cycles:
import time
from queue import Queue, Empty, Full
from threading import Lock, Thread
import socket

class ConnectionPool:
    def __init__(self, host, port, pool_size=50, max_connections_per_second=100):
        self.host = host
        self.port = port
        self.pool_size = pool_size
        self.max_cps = max_connections_per_second
        self.pool = Queue(maxsize=pool_size)
        self.rate_limiter = RateLimiter(self.max_cps)
        # Pre-populate pool
        for _ in range(pool_size):
            self.pool.put(self._create_connection())

    def _create_connection(self):
        self.rate_limiter.wait()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((self.host, self.port))
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return sock

    def get_connection(self, timeout=5):
        try:
            return self.pool.get(timeout=timeout)
        except Empty:
            # Pool exhausted, create new connection
            return self._create_connection()

    def return_connection(self, conn):
        try:
            self.pool.put_nowait(conn)
        except Full:
            # Pool full, close connection
            conn.close()

class RateLimiter:
    def __init__(self, max_rate):
        self.max_rate = max_rate
        self.tokens = max_rate
        self.last_update = time.time()
        self.lock = Lock()

    def wait(self):
        with self.lock:
            now = time.time()
            elapsed = now - self.last_update
            self.tokens = min(self.max_rate, self.tokens + elapsed * self.max_rate)
            self.last_update = now
            if self.tokens < 1:
                sleep_time = (1 - self.tokens) / self.max_rate
                time.sleep(sleep_time)
                self.tokens = 0
            else:
                self.tokens -= 1
This connection pool reduced kernel network stack activity by 80% while maintaining application throughput, completely avoiding the race condition window.
Selective feature disabling can bypass buggy code paths. When I identified a bug in the kernel’s TCP Fast Open implementation, I disabled the feature system-wide until a fix was available:
# Disable TCP Fast Open to avoid kernel bug
echo 0 > /proc/sys/net/ipv4/tcp_fastopen
# Make permanent
echo "net.ipv4.tcp_fastopen = 0" >> /etc/sysctl.conf
The performance impact was minimal compared to the random connection failures the bug was causing.
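A quick verification confirms the workaround took effect and the feature is no longer being negotiated (counter names as reported by nstat on recent kernels):
# Confirm the sysctl took effect
sysctl net.ipv4.tcp_fastopen
# Fast Open counters should stop increasing once the feature is disabled
nstat -az | grep -i fastopen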
Master Kernel Debugging for Infrastructure Reliability
Debugging latent kernel bugs has taught me humility about what I assume “can’t be a kernel issue.” Some of my most challenging investigations turned out to be kernel bugs I initially dismissed as application problems. The symptoms were too intermittent, too specific, or too seemingly unrelated to low-level kernel behavior.
The most important lesson is that kernel bugs in production infrastructure require a fundamentally different debugging mindset than application bugs. You’re working without the safety net of reproducible test environments, dealing with timing-dependent issues that disappear when you try to observe them directly, and operating in a domain where most engineers defer to “must be the hardware” explanations.
Building expertise in kernel debugging is a long-term investment. I spent years developing the tooling, automation, and investigation patterns that let me isolate kernel bugs efficiently. But this expertise has paid dividends: I can now debug issues that previously required weeks of back-and-forth with hardware vendors or kernel maintainers, often identifying and mitigating bugs within days.
If you’re running production infrastructure at scale, kernel debugging skills aren’t optional expertise—they’re essential operational capabilities. The kernel is the foundation everything else builds on, and when that foundation has subtle bugs, no amount of application-level monitoring will save you.