Dillon Browne

Build eBPF Observability with Rust

Master eBPF and Rust for production observability without instrumentation. Gain real-time kernel-level insights without any code changes. Deploy today.

ebpf rust observability monitoring devops

Why eBPF Observability Beats Traditional Monitoring

In my work deploying AI and cloud infrastructure at scale, I’ve consistently hit the same wall with traditional observability: you can’t instrument what you don’t anticipate. Application-level logging and metrics require you to predict failure modes before they happen. But the most critical production issues are often the ones you didn’t see coming.

When a customer reported mysterious latency spikes in their ML inference pipeline, our application logs showed nothing unusual. The problem was invisible at the application layer—hidden in syscalls, network retries, and kernel scheduling decisions we had no visibility into.

This is where eBPF (extended Berkeley Packet Filter) changes everything.

Deploy eBPF for Kernel-Level Visibility

eBPF lets you run sandboxed programs directly in the Linux kernel without modifying kernel source code or loading kernel modules. Think of it as having read-only superpowers over every syscall, network packet, and kernel event in your system.

The key breakthrough: you get observability without instrumentation. No code changes. No redeployments. No guessing what to log ahead of time.

For production environments, this is transformative. You can diagnose issues in real-time by attaching eBPF programs to running processes, capturing exactly what’s happening at the kernel level.

Write eBPF Programs in Rust

While you can write eBPF programs in C, I’ve found Rust to be vastly superior for production use. Here’s why:

Memory Safety Without Runtime Overhead

eBPF programs run in a constrained kernel environment with strict verification rules. The eBPF verifier rejects any program that might crash the kernel. Rust’s compile-time memory safety guarantees align perfectly with these constraints.

use aya::programs::TracePoint;
use aya::{include_bytes_aligned, Bpf};
use aya::maps::PerfEventArray;
use aya::util::online_cpus;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut bpf = Bpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/syscall_tracer"
    ))?;

    let program: &mut TracePoint = bpf.program_mut("trace_enter_open").unwrap().try_into()?;
    program.load()?;
    program.attach("syscalls", "sys_enter_open")?;

    let mut perf_array = PerfEventArray::try_from(bpf.map_mut("events").unwrap())?;

    // Open one per-CPU buffer so we receive events from every core
    let mut buffers = Vec::new();
    for cpu_id in online_cpus()? {
        buffers.push(perf_array.open(cpu_id, None)?);
    }
    // Poll `buffers` to process events arriving from kernel space

    Ok(())
}

This code attaches an eBPF program to the sys_enter_open syscall tracepoint. Every time any process on the system calls open(), our program captures it, with negligible per-event overhead and no cost at all imposed on code paths we don't trace.

Type Safety for Kernel-Userspace Communication

The real complexity in eBPF isn’t the kernel-side program—it’s safely passing data from kernel space to userspace. Rust’s type system prevents the entire class of serialization bugs that plague C-based eBPF tools.

#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
    pub pid: u32,
    pub uid: u32,
    pub filename: [u8; 256],
    pub flags: u32,
    pub timestamp_ns: u64,
}

unsafe impl aya::Pod for SyscallEvent {}

The #[repr(C)] attribute ensures the in-memory layout matches between kernel and userspace. The Pod (Plain Old Data) marker trait tells Aya (the Rust eBPF library) this struct is safe to transmit across the boundary.
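On the receiving side, events arrive from the buffer as raw bytes, and the same layout guarantees let you decode them safely. Here's a minimal std-only sketch of that decoding step; the `parse_event` helper is ours for illustration, not part of Aya:

```rust
use std::mem::size_of;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
    pub pid: u32,
    pub uid: u32,
    pub filename: [u8; 256],
    pub flags: u32,
    pub timestamp_ns: u64,
}

/// Decode one event from a raw byte buffer, checking the length first.
/// Correctness relies on the #[repr(C)] layout matching the kernel side.
fn parse_event(buf: &[u8]) -> Option<SyscallEvent> {
    if buf.len() < size_of::<SyscallEvent>() {
        return None;
    }
    // read_unaligned avoids undefined behavior if the buffer
    // happens not to be 8-byte aligned.
    Some(unsafe { (buf.as_ptr() as *const SyscallEvent).read_unaligned() })
}

fn main() {
    // Simulate a kernel event: a zeroed buffer with pid = 1234 in the
    // first four bytes (native byte order, matching #[repr(C)]).
    let mut raw = vec![0u8; size_of::<SyscallEvent>()];
    raw[0..4].copy_from_slice(&1234u32.to_ne_bytes());
    let event = parse_event(&raw).expect("buffer too short");
    println!("pid={}", event.pid); // prints "pid=1234"
}
```

The length check is the important part: a truncated read from the buffer silently produces garbage fields in C tools, while here it becomes an explicit `None`.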

Diagnose Production Issues with eBPF

Here’s how we used eBPF and Rust to diagnose that ML inference latency issue I mentioned:

We suspected the problem was I/O related, but couldn’t pinpoint which files or processes. I wrote an eBPF program that captured all read() syscalls with latency over 10ms:

use aya_bpf::{
    macros::{map, tracepoint},
    maps::{HashMap, PerfEventArray},
    helpers::bpf_ktime_get_ns,
    programs::TracePointContext,
    BpfContext,
};

#[repr(C)]
#[derive(Clone, Copy)]
pub struct IoEvent {
    pub pid: u32,
    pub latency_ms: u64,
    pub bytes: u64,
    pub timestamp_ns: u64,
}

// Start timestamps recorded at sys_enter_read, keyed by PID
#[map]
static mut START_TS: HashMap<u32, u64> = HashMap::with_max_entries(10240, 0);

#[map]
static mut EVENTS: PerfEventArray<IoEvent> = PerfEventArray::new(0);

#[tracepoint]
pub fn trace_enter_read(ctx: TracePointContext) -> u32 {
    let now = unsafe { bpf_ktime_get_ns() };
    let _ = unsafe { START_TS.insert(&ctx.pid(), &now, 0) };
    0
}

#[tracepoint]
pub fn trace_exit_read(ctx: TracePointContext) -> u32 {
    match try_trace_exit_read(ctx) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_trace_exit_read(ctx: TracePointContext) -> Result<u32, i64> {
    let pid = ctx.pid();
    // Pair this exit with the timestamp stored at enter
    let start_ns = unsafe { *START_TS.get(&pid).ok_or(0i64)? };
    let _ = unsafe { START_TS.remove(&pid) };
    let now_ns = unsafe { bpf_ktime_get_ns() };
    let latency_ms = (now_ns - start_ns) / 1_000_000;

    if latency_ms > 10 {
        // sys_exit_read format: the return value (bytes read) is at offset 16
        let bytes_read: i64 = ctx.read_at(16)?;

        // Log slow read to the perf buffer
        let event = IoEvent {
            pid,
            latency_ms,
            bytes: bytes_read as u64,
            timestamp_ns: now_ns,
        };

        unsafe {
            EVENTS.output(&ctx, &event, 0);
        }
    }

    Ok(0)
}

Within minutes of deploying this, we identified the culprit: a Python data preprocessing script was making thousands of tiny reads to a network-mounted filesystem. The application logs showed nothing because the Python code was doing exactly what it was designed to do—the problem was architectural.

We fixed it by caching the preprocessing data locally, dropping P99 latency from 800ms to 45ms.
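For context, the P99 figures here are nearest-rank percentiles over a window of request latencies. A quick std-only sketch of how we sanity-check a number like that when validating a fix (the `percentile` helper is ours):

```rust
/// Nearest-rank percentile: sort the samples and index at ceil(p * n) - 1.
fn percentile(samples: &mut [u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let rank = ((p * samples.len() as f64).ceil() as usize).max(1) - 1;
    samples.get(rank).copied()
}

fn main() {
    // 99 fast requests and one slow outlier: P99 sits just below the outlier.
    let mut latencies_ms: Vec<u64> = (1..=99).collect();
    latencies_ms.push(800);
    println!("p99 = {}ms", percentile(&mut latencies_ms, 0.99).unwrap()); // prints "p99 = 99ms"
}
```

The example also shows why P99 is the right lens for this class of bug: a single slow outlier barely moves the average but dominates tail latency.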

Optimize eBPF for Production Systems

Start with Syscall Tracing

The highest ROI eBPF programs trace syscalls. These give you a complete view of what processes are actually doing:

  • File I/O patterns (open, read, write, close)
  • Network behavior (connect, sendto, recvfrom)
  • Process lifecycle (fork, exec, exit)

You can deploy these programs to production safely because eBPF programs are verified to never crash the kernel.

Use Ring Buffers Over Perf Buffers

Modern kernels (5.8+) support ring buffers, which have better performance and simpler semantics than the older perf event arrays:

let mut ring_buf = aya::maps::RingBuf::try_from(bpf.map_mut("events").unwrap())?;

while let Some(item) = ring_buf.next() {
    // The ring buffer hands us raw bytes; reinterpret them as our event struct
    let event = unsafe { &*(item.as_ptr() as *const SyscallEvent) };
    println!("PID {} opened {}", event.pid,
             String::from_utf8_lossy(&event.filename));
}
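One practical wrinkle: fixed-size fields like the filename arrive NUL-padded from the kernel, so trim them before display or you'll print hundreds of NUL bytes. A small std-only helper (the `trim_nul` name is ours):

```rust
/// Return the portion of a NUL-padded byte array before the first NUL.
fn trim_nul(buf: &[u8]) -> &[u8] {
    let end = buf.iter().position(|&b| b == 0).unwrap_or(buf.len());
    &buf[..end]
}

fn main() {
    // Simulate a kernel-filled fixed-size field: data followed by NUL padding.
    let mut filename = [0u8; 256];
    filename[..10].copy_from_slice(b"/etc/hosts");
    let name = String::from_utf8_lossy(trim_nul(&filename));
    println!("opened {}", name); // prints "opened /etc/hosts"
}
```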

Filter in the Kernel, Not Userspace

eBPF’s power is filtering events before they reach userspace. Sending every syscall to userspace would drown your system. Instead, only send events that match your criteria:

// Only trace processes in this cgroup
let cgroup_id = unsafe { bpf_get_current_cgroup_id() };
if cgroup_id != TARGET_CGROUP {
    return Ok(0);  // Drop the event without sending it to userspace
}

Deploy eBPF on Kubernetes

CO-RE: Compile Once, Run Everywhere

Modern eBPF uses CO-RE (Compile Once - Run Everywhere) via BTF (BPF Type Format). This lets you compile your eBPF program once and run it on any kernel 5.2+, regardless of kernel configuration differences.
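CO-RE depends on the running kernel exposing BTF, conventionally at /sys/kernel/btf/vmlinux (kernels built with CONFIG_DEBUG_INFO_BTF=y). A std-only preflight check, with the path parameterized so it can be exercised off-target (the `btf_available` helper is ours):

```rust
use std::path::Path;

/// Returns true if the kernel exposes BTF type information at `path`.
fn btf_available(path: &str) -> bool {
    Path::new(path).exists()
}

fn main() {
    // On most modern distributions this file is present.
    if btf_available("/sys/kernel/btf/vmlinux") {
        println!("BTF available: CO-RE programs can relocate");
    } else {
        println!("no BTF: fall back to kernel-specific builds");
    }
}
```

Failing fast here gives a much clearer error than a cryptic relocation failure at program load time.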

The Rust ecosystem handles this beautifully with the aya crate:

cargo install bpf-linker
cargo +nightly build --release --target bpfel-unknown-none -Z build-std=core

This produces a single eBPF ELF file that works across kernel versions.

Sidecar Deployment for Kubernetes

In Kubernetes, I deploy eBPF observers as privileged DaemonSets:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-observer
spec:
  template:
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: observer
        image: my-ebpf-observer:latest
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN", "SYS_RESOURCE", "NET_ADMIN"]
        volumeMounts:
        - name: sys
          mountPath: /sys
          readOnly: true
      volumes:
      - name: sys
        hostPath:
          path: /sys

This gives every node the ability to observe all pods without modifying their configurations.

Alerting on Kernel-Level Anomalies

eBPF observability truly shines when you alert on patterns invisible to application metrics:

// Userspace collector: alert if any process makes >1000 syscalls/second
// (alert! stands in for whatever alerting hook you use)
if syscall_count > 1000 {
    alert!("High syscall rate from PID {}: {} calls/sec",
           pid, syscall_count);
}

// Alert on unusual file access patterns
if path.starts_with("/etc/shadow") || path.starts_with("/root/.ssh") {
    alert!("Sensitive file access by PID {}: {}", pid, path);
}
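The rate check can live entirely in the userspace collector, since the kernel side already filters down to the events you care about. A minimal std-only sketch that counts events per PID over a one-second window (the threshold and `noisy_pids` helper are ours):

```rust
use std::collections::HashMap;

const THRESHOLD_PER_SEC: u64 = 1000;

/// Count syscall events per PID over a one-second window and return
/// the PIDs whose rate exceeds the threshold, sorted for stable output.
fn noisy_pids(events: &[(u32, u64)]) -> Vec<u32> {
    let mut counts: HashMap<u32, u64> = HashMap::new();
    for &(pid, _ts_ns) in events {
        *counts.entry(pid).or_insert(0) += 1;
    }
    let mut offenders: Vec<u32> = counts
        .into_iter()
        .filter(|&(_, n)| n > THRESHOLD_PER_SEC)
        .map(|(pid, _)| pid)
        .collect();
    offenders.sort_unstable();
    offenders
}

fn main() {
    // PID 42 fires 1500 events in the window, PID 7 only 3.
    let mut events = vec![(7u32, 0u64); 3];
    events.extend(std::iter::repeat((42u32, 0u64)).take(1500));
    for pid in noisy_pids(&events) {
        println!("High syscall rate from PID {}", pid); // prints for PID 42 only
    }
}
```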

These signals helped us catch a misconfigured service that was hammering the kernel with redundant syscalls, causing CPU throttling that never showed up in application metrics.

Replace Legacy Monitoring with eBPF

Before eBPF, deep system observability meant one of three bad options:

  1. Kernel modules: Require exact kernel version matching, can crash the system, difficult to deploy
  2. Instrumentation: Modify application code, requires redeployment, only captures what you predict
  3. Sampling tools: High overhead, can’t run continuously in production

eBPF eliminates all three constraints. You get kernel-level visibility with application-level safety.

When Not to Use eBPF

eBPF isn’t always the answer. Here’s when I don’t reach for it:

  • Application-level business metrics: Use your APM tool. eBPF sees syscalls, not semantic application events.
  • Kernel < 4.18: Older kernels lack critical eBPF features. Consider upgrading or using traditional tools.
  • Windows systems: eBPF is Linux-only. The eBPF for Windows project is under active development but not yet production-ready.

Start Building eBPF Observability

The barrier to entry for eBPF is lower than ever. Here’s my recommended learning path:

  1. Read the BPF CO-RE documentation: Understanding BTF and CO-RE is foundational
  2. Start with aya-rs: The Rust eBPF ecosystem is mature and well-documented
  3. Use libbpf-bootstrap examples: Study existing programs before writing from scratch
  4. Deploy to a dev cluster first: Even with verification, test thoroughly before production

The observability gains are worth the learning curve. In my experience, eBPF has cut root cause analysis time by 60-70% for kernel-level issues.

Conclusion

eBPF represents a fundamental shift in how we observe production systems. Instead of instrumenting what we think will fail, we can observe everything that actually happens—syscalls, network packets, kernel events—without changing a single line of application code.

Combining eBPF with Rust gives you memory safety, type safety, and excellent developer ergonomics. The result is observability tooling that’s both powerful and safe to run in production.

If you’re dealing with complex distributed systems, AI workloads, or any infrastructure where black-box behavior causes production issues, eBPF is worth serious consideration. The ability to diagnose problems in real-time without instrumentation has been transformative for our operations.

Start small—trace a single syscall in your dev environment. You’ll quickly see why this technology is reshaping how we build observable systems.
