Build eBPF Observability with Rust
Master eBPF and Rust for production observability without instrumentation. Gain real-time kernel-level insights without any code changes. Deploy today.
Why eBPF Observability Beats Traditional Monitoring
In my work deploying AI and cloud infrastructure at scale, I’ve consistently hit the same wall with traditional observability: you can’t instrument what you don’t anticipate. Application-level logging and metrics require you to predict failure modes before they happen. But the most critical production issues are often the ones you didn’t see coming.
When a customer reported mysterious latency spikes in their ML inference pipeline, our application logs showed nothing unusual. The problem was invisible at the application layer—hidden in syscalls, network retries, and kernel scheduling decisions we had no visibility into.
This is where eBPF (extended Berkeley Packet Filter) changes everything.
Deploy eBPF for Kernel-Level Visibility
eBPF lets you run sandboxed programs directly in the Linux kernel without modifying kernel source code or loading kernel modules. Think of it as having read-only superpowers over every syscall, network packet, and kernel event in your system.
The key breakthrough: you get observability without instrumentation. No code changes. No redeployments. No guessing what to log ahead of time.
For production environments, this is transformative. You can diagnose issues in real-time by attaching eBPF programs to running processes, capturing exactly what’s happening at the kernel level.
Write eBPF Programs in Rust
While you can write eBPF programs in C, I’ve found Rust to be vastly superior for production use. Here’s why:
Memory Safety Without Runtime Overhead
eBPF programs run in a constrained kernel environment with strict verification rules. The eBPF verifier rejects any program that might crash the kernel. Rust’s compile-time memory safety guarantees align perfectly with these constraints.
use aya::maps::PerfEventArray;
use aya::programs::TracePoint;
use aya::{include_bytes_aligned, Bpf};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the compiled eBPF object embedded at build time
    let mut bpf = Bpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/syscall_tracer"
    ))?;

    // Load the tracepoint program and attach it to sys_enter_open
    let program: &mut TracePoint = bpf.program_mut("trace_enter_open").unwrap().try_into()?;
    program.load()?;
    program.attach("syscalls", "sys_enter_open")?;

    // Open a perf buffer (one per CPU in practice) to process events from kernel space
    let mut perf_array = PerfEventArray::try_from(bpf.map_mut("events").unwrap())?;
    let _cpu0_buf = perf_array.open(0, None)?;

    Ok(())
}
This code attaches an eBPF program to the sys_enter_open syscall tracepoint. Every time any process on the system calls open(), our program captures it, adding only a few in-kernel instructions per call and sending nothing to userspace for processes we're not interested in.
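The loader above expects a kernel-side program called trace_enter_open. Here is a minimal sketch of what that side can look like with the aya-bpf crate; the map name, the zeroed filename, and the shared-types crate are placeholders, and a real program would copy the filename out of the tracepoint arguments:

// (kernel-side crate: #![no_std], #![no_main], and a panic handler omitted for brevity)
// use syscall_tracer_common::SyscallEvent; // shared types crate, name hypothetical
use aya_bpf::{
    macros::{map, tracepoint},
    maps::PerfEventArray,
    programs::TracePointContext,
    BpfContext,
};

// Shared with userspace; see the SyscallEvent definition in the next snippet
#[map(name = "events")]
static mut EVENTS: PerfEventArray<SyscallEvent> =
    PerfEventArray::<SyscallEvent>::with_max_entries(1024, 0);

#[tracepoint]
pub fn trace_enter_open(ctx: TracePointContext) -> u32 {
    // Record which process is opening a file and push the event to userspace
    let event = SyscallEvent {
        pid: ctx.pid(),
        uid: ctx.uid(),
        // A real program would read the filename pointer from the tracepoint
        // arguments with bpf_probe_read_user_str; left zeroed in this sketch
        filename: [0u8; 256],
        flags: 0,
        timestamp_ns: unsafe { aya_bpf::helpers::bpf_ktime_get_ns() },
    };
    unsafe { EVENTS.output(&ctx, &event, 0) };
    0
}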
Type Safety for Kernel-Userspace Communication
The real complexity in eBPF isn’t the kernel-side program—it’s safely passing data from kernel space to userspace. Rust’s type system prevents the entire class of serialization bugs that plague C-based eBPF tools.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
    pub pid: u32,
    pub uid: u32,
    pub filename: [u8; 256],
    pub flags: u32,
    pub timestamp_ns: u64,
}
unsafe impl aya::Pod for SyscallEvent {}
The #[repr(C)] attribute ensures the in-memory layout matches between kernel and userspace. The Pod (Plain Old Data) marker trait tells Aya (the Rust eBPF library) this struct is safe to transmit across the boundary.
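As a small illustration of what those guarantees buy you in userspace, here is a hypothetical helper (not part of aya) that turns a raw byte record pulled off a perf or ring buffer back into a typed SyscallEvent:

/// Reinterpret one raw record copied out of a perf or ring buffer as a SyscallEvent.
/// This is sound only because SyscallEvent is #[repr(C)], Copy, and marked Pod.
fn parse_event(buf: &[u8]) -> Option<SyscallEvent> {
    if buf.len() < core::mem::size_of::<SyscallEvent>() {
        return None; // truncated record
    }
    // read_unaligned copies the bytes out, so the alignment of `buf` does not matter
    Some(unsafe { core::ptr::read_unaligned(buf.as_ptr() as *const SyscallEvent) })
}

In C you would do the same cast with no pushback from the compiler; in Rust the unsafe block and the Pod bound make the one risky step explicit and reviewable.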
Diagnose Production Issues with eBPF
Here’s how we used eBPF and Rust to diagnose that ML inference latency issue I mentioned:
We suspected the problem was I/O related, but couldn’t pinpoint which files or processes. I wrote an eBPF program that captured all read() syscalls with latency over 10ms:
use aya_bpf::{
    helpers::bpf_ktime_get_ns,
    macros::{map, tracepoint},
    maps::PerfEventArray,
    programs::TracePointContext,
    BpfContext,
};

// Event record sent to userspace for every slow read
#[repr(C)]
#[derive(Clone, Copy)]
pub struct IoEvent {
    pub pid: u32,
    pub latency_ms: u64,
    pub bytes: u64,
    pub timestamp_ns: u64,
}

#[map]
static mut EVENTS: PerfEventArray<IoEvent> = PerfEventArray::<IoEvent>::with_max_entries(1024, 0);

#[tracepoint]
pub fn trace_exit_read(ctx: TracePointContext) -> u32 {
    match try_trace_exit_read(ctx) {
        Ok(ret) => ret,
        Err(_) => 1,
    }
}

fn try_trace_exit_read(ctx: TracePointContext) -> Result<u32, i64> {
    let start_ns = unsafe { ctx.read_at::<u64>(16)? }; // Timestamp from the enter event
    let now_ns = unsafe { bpf_ktime_get_ns() };
    let latency_ms = (now_ns - start_ns) / 1_000_000;

    if latency_ms > 10 {
        let pid = ctx.pid();
        let bytes_read: i64 = unsafe { ctx.read_at(24)? };

        // Log the slow read to the perf buffer for the userspace collector
        let event = IoEvent {
            pid,
            latency_ms,
            bytes: bytes_read as u64,
            timestamp_ns: now_ns,
        };
        unsafe {
            EVENTS.output(&ctx, &event, 0);
        }
    }
    Ok(0)
}
Within minutes of deploying this, we identified the culprit: a Python data preprocessing script was making thousands of tiny reads to a network-mounted filesystem. The application logs showed nothing because the Python code was doing exactly what it was designed to do—the problem was architectural.
We fixed it by caching the preprocessing data locally, dropping P99 latency from 800ms to 45ms.
Optimize eBPF for Production Systems
Start with Syscall Tracing
The highest ROI eBPF programs trace syscalls. These give you a complete view of what processes are actually doing:
- File I/O patterns (open, read, write, close)
- Network behavior (connect, sendto, recvfrom)
- Process lifecycle (fork, exec, exit)
You can deploy these programs to production safely because eBPF programs are verified to never crash the kernel.
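As a rough sketch, the loader can attach one kernel-side program per tracepoint from a single table; the program names below are placeholders that need to match whatever you called them in your eBPF crate:

use aya::programs::TracePoint;
use aya::Bpf;

fn attach_syscall_tracers(bpf: &mut Bpf) -> Result<(), Box<dyn std::error::Error>> {
    // (program name in the eBPF object, tracepoint category, tracepoint name)
    let tracepoints = [
        ("trace_enter_openat", "syscalls", "sys_enter_openat"),
        ("trace_enter_connect", "syscalls", "sys_enter_connect"),
        ("trace_enter_execve", "syscalls", "sys_enter_execve"),
    ];

    for (prog_name, category, name) in tracepoints {
        let program: &mut TracePoint = bpf.program_mut(prog_name).unwrap().try_into()?;
        program.load()?;
        program.attach(category, name)?;
    }
    Ok(())
}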
Use Ring Buffers Over Perf Buffers
Modern kernels (5.8+) support ring buffers, which have better performance and simpler semantics than the older perf event arrays:
let mut ring_buf = aya::maps::RingBuf::try_from(bpf.map_mut("events").unwrap())?;

while let Some(data) = ring_buf.next() {
    let event = unsafe { &*(data.as_ptr() as *const SyscallEvent) };
    // filename is NUL-padded, so only print the bytes before the first NUL
    let len = event.filename.iter().position(|&b| b == 0).unwrap_or(event.filename.len());
    println!("PID {} opened {}", event.pid, String::from_utf8_lossy(&event.filename[..len]));
}
Filter in the Kernel, Not Userspace
eBPF’s power is filtering events before they reach userspace. Sending every syscall to userspace would drown your system. Instead, only send events that match your criteria:
use aya_bpf::helpers::bpf_get_current_cgroup_id;

// Only trace processes in the target cgroup
let cgroup_id = unsafe { bpf_get_current_cgroup_id() };
if cgroup_id != TARGET_CGROUP {
    return Ok(0); // Drop the event without ever sending it to userspace
}
Deploy eBPF on Kubernetes
CO-RE: Compile Once, Run Everywhere
Modern eBPF uses CO-RE (Compile Once - Run Everywhere) via BTF (BPF Type Format). This lets you compile your eBPF program once and run it on any kernel 5.2+ that exposes BTF type information; relocations handle struct layout differences between kernel builds, so you don't rebuild for every kernel configuration.
The Rust ecosystem handles this beautifully with the aya crate:
cargo install bpf-linker
cargo +nightly build --release --target bpfel-unknown-none -Z build-std=core
This produces a single eBPF ELF file that works across kernel versions.
DaemonSet Deployment for Kubernetes
In Kubernetes, I deploy eBPF observers as privileged DaemonSets:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-observer
spec:
  selector:
    matchLabels:
      app: ebpf-observer
  template:
    metadata:
      labels:
        app: ebpf-observer
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: observer
          image: my-ebpf-observer:latest
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN", "SYS_RESOURCE", "NET_ADMIN"]
          volumeMounts:
            - name: sys
              mountPath: /sys
              readOnly: true
      volumes:
        - name: sys
          hostPath:
            path: /sys
This gives every node the ability to observe all pods without modifying their configurations.
Alerting on Kernel-Level Anomalies
eBPF observability truly shines when you alert on patterns invisible to application metrics:
// Alert if any process makes >1000 syscalls/second
if syscall_count > 1000 {
    alert!("High syscall rate from PID {}: {} calls/sec", pid, syscall_count);
}

// Alert on unusual file access patterns
if path.starts_with("/etc/shadow") || path.starts_with("/root/.ssh") {
    alert!("Sensitive file access by PID {}: {}", pid, path);
}
These signals helped us catch a misconfigured service that was hammering the kernel with redundant syscalls, causing CPU throttling that never showed up in application metrics.
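For the syscall-rate check, the counting itself is easiest to do in the userspace consumer. A minimal sketch, assuming events arrive as the SyscallEvent values defined earlier and that a plain log line stands in for your real alerting pipeline:

use std::collections::HashMap;
use std::time::{Duration, Instant};

struct RateTracker {
    window_start: Instant,
    counts: HashMap<u32, u64>, // syscalls observed per PID in the current window
}

impl RateTracker {
    fn new() -> Self {
        Self { window_start: Instant::now(), counts: HashMap::new() }
    }

    // Call once per event pulled off the perf or ring buffer
    fn record(&mut self, pid: u32) {
        *self.counts.entry(pid).or_insert(0) += 1;

        // Evaluate the threshold once per one-second window, then reset
        if self.window_start.elapsed() >= Duration::from_secs(1) {
            for (pid, count) in &self.counts {
                if *count > 1000 {
                    eprintln!("ALERT: high syscall rate from PID {pid}: {count} calls/sec");
                }
            }
            self.counts.clear();
            self.window_start = Instant::now();
        }
    }
}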
Replace Legacy Monitoring with eBPF
Before eBPF, deep system observability meant one of three bad options:
- Kernel modules: Require exact kernel version matching, can crash the system, difficult to deploy
- Instrumentation: Requires modifying application code and redeploying, and only captures what you predict
- Sampling tools: High overhead, can’t run continuously in production
eBPF eliminates all three constraints. You get kernel-level visibility with application-level safety.
When Not to Use eBPF
eBPF isn’t always the answer. Here’s when I don’t reach for it:
- Application-level business metrics: Use your APM tool. eBPF sees syscalls, not semantic application events.
- Kernel < 4.18: Older kernels lack critical eBPF features. Consider upgrading or using traditional tools.
- Windows systems: eBPF is Linux-only. eBPF for Windows is in development, but it is not yet production-ready.
Start Building eBPF Observability
The barrier to entry for eBPF is lower than ever. Here’s my recommended learning path:
- Read the BPF CO-RE documentation: Understanding BTF and CO-RE is foundational
- Start with aya-rs: The Rust eBPF ecosystem is mature and well-documented
- Use libbpf-bootstrap examples: Study existing programs before writing from scratch
- Deploy to a dev cluster first: Even with verification, test thoroughly before production
The observability gains are worth the learning curve. In my experience, eBPF has cut root cause analysis time by 60-70% for kernel-level issues.
Conclusion
eBPF represents a fundamental shift in how we observe production systems. Instead of instrumenting what we think will fail, we can observe everything that actually happens—syscalls, network packets, kernel events—without changing a single line of application code.
Combining eBPF with Rust gives you memory safety, type safety, and excellent developer ergonomics. The result is observability tooling that’s both powerful and safe to run in production.
If you’re dealing with complex distributed systems, AI workloads, or any infrastructure where black-box behavior causes production issues, eBPF is worth serious consideration. The ability to diagnose problems in real-time without instrumentation has been transformative for our operations.
Start small—trace a single syscall in your dev environment. You’ll quickly see why this technology is reshaping how we build observable systems.