Monitor BGP Routing in Production
Detect BGP routing anomalies, prevent network outages, and maintain infrastructure reliability with practical monitoring strategies for DevOps teams.
Border Gateway Protocol failures don’t just affect ISPs—they impact every production system relying on internet connectivity. Effective BGP monitoring helps detect routing anomalies before they cause outages. I’ve watched traffic vanish during BGP route leaks, debugged mysterious latency spikes caused by suboptimal AS paths, and responded to incidents where entire cloud regions became unreachable due to routing table corruption. These aren’t theoretical problems; they’re operational realities that demand proactive network monitoring.
The challenge with BGP monitoring is that most DevOps teams treat it as someone else’s problem. We monitor application metrics, database performance, and container health obsessively—but the routing layer that underpins everything remains a black box. This gap becomes painfully obvious during incidents when you’re trying to explain to executives why your multi-region failover didn’t work because of an upstream routing issue you couldn’t see coming.
Deploy BGP Monitoring for Cloud Infrastructure
BGP is the routing protocol that makes the internet work. It determines how traffic flows between autonomous systems (ASes), which upstream networks carry your traffic to and from cloud providers, and which paths your packets take to reach users. When BGP behaves unexpectedly, your production systems suffer in ways that traditional monitoring can’t detect.
I’ve encountered several categories of BGP-related incidents in production environments:
Route hijacking - Malicious or accidental announcement of IP prefixes by unauthorized networks, redirecting traffic to the wrong destination. This can cause complete service outages or security breaches.
Route leaks - Networks accidentally propagating routes they shouldn’t, creating inefficient paths or overwhelming routing tables. These cause latency spikes and partial connectivity loss.
AS path manipulation - Intentional or unintentional changes to AS paths that affect traffic engineering and failover behavior. Your carefully planned multi-cloud strategy fails because traffic takes unexpected routes.
BGP convergence delays - Slow propagation of routing updates during incidents, extending outage windows. You think your failover is instant, but it takes 15 minutes for routes to converge.
The most frustrating part is that these issues are invisible to standard monitoring. Your application thinks everything is fine—the database is responsive, the load balancer is healthy, the CDN is caching correctly. Meanwhile, 30% of your users can’t reach your service because of a routing problem three autonomous systems away.
Implement Practical BGP Monitoring Solutions
Implementing BGP monitoring doesn’t require running your own AS or becoming a network engineer. Modern cloud architectures provide several pragmatic entry points for visibility.
1. Detect Unauthorized IP Prefix Announcements
If you announce IP prefixes (common for on-premises infrastructure or BGP-enabled cloud environments), you need to monitor who’s announcing your routes and where they’re visible.
# bgp_monitor.py
import requests
from datetime import datetime


def check_prefix_announcements(prefix, expected_asn):
    """
    Query RIPEstat or a similar public BGP data service to verify
    which origin ASNs are announcing a prefix.
    """
    url = "https://stat.ripe.net/data/announced-prefixes/data.json"
    params = {
        "resource": prefix,
        "min_peers_seeing": 5
    }

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    announcements = []
    if "data" in data and "prefixes" in data["data"]:
        for item in data["data"]["prefixes"]:
            announcements.append({
                "prefix": item.get("prefix"),
                "origin_asn": item.get("origin"),
                "seen_by": item.get("peers_seeing", 0),
                "timestamp": datetime.now().isoformat()
            })

    # Alert if an unexpected ASN is announcing your prefix
    for announcement in announcements:
        if announcement["origin_asn"] != expected_asn:
            send_alert(
                f"Unexpected BGP announcement for {prefix}",
                f"ASN {announcement['origin_asn']} is announcing your prefix"
            )

    return announcements


def send_alert(title, message):
    """Send alert to your monitoring system"""
    # Integrate with PagerDuty, Slack, etc.
    print(f"ALERT: {title} - {message}")


# Monitor your prefixes every 5 minutes
prefixes_to_monitor = [
    {"prefix": "203.0.113.0/24", "expected_asn": "AS64500"},
    {"prefix": "198.51.100.0/24", "expected_asn": "AS64500"}
]

for config in prefixes_to_monitor:
    check_prefix_announcements(config["prefix"], config["expected_asn"])
This script queries public BGP data feeds to verify your IP prefixes are only announced by your authorized networks. I run similar checks every 5 minutes in production, with alerts routing to PagerDuty for immediate response.
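If your alerts route to PagerDuty, the stubbed send_alert() above can post to the Events API v2 instead of printing. Here is a minimal sketch, assuming a PAGERDUTY_ROUTING_KEY environment variable holds the integration key for your service (the variable name is an arbitrary choice):

# pagerduty_alert.py - replace the print-based send_alert() above with a real integration
import os
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def send_alert(title, message):
    """Trigger a PagerDuty incident via the Events API v2."""
    routing_key = os.environ.get("PAGERDUTY_ROUTING_KEY")  # integration key for your service
    if not routing_key:
        # Fall back to stdout if PagerDuty isn't configured
        print(f"ALERT: {title} - {message}")
        return

    payload = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{title}: {message}",
            "source": "bgp_monitor",
            "severity": "critical",
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    response.raise_for_status()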
2. Track AS Path Changes Affecting Service Reliability
For services where you care about routing paths (multi-cloud architectures, latency-sensitive applications), monitoring AS path changes provides early warning of routing instability.
#!/bin/bash
# as_path_monitor.sh

TARGET_IP="8.8.8.8"
EXPECTED_ASN="AS15169"   # Google's ASN (expected origin of the target)
ALERT_THRESHOLD=3

# Use traceroute with AS number lookup and extract the set of ASNs on the path
traceroute -A "$TARGET_IP" 2>&1 | \
    grep -oP '\[AS\K[0-9]+' | \
    sort -u > /tmp/current_path.txt

# Warn if the expected origin ASN no longer appears on the path
if ! grep -q "^${EXPECTED_ASN#AS}$" /tmp/current_path.txt; then
    echo "WARNING: expected origin $EXPECTED_ASN not seen on path to $TARGET_IP"
fi

# Compare with the previously recorded path
if [ -f /tmp/previous_path.txt ]; then
    DIFF_COUNT=$(diff /tmp/previous_path.txt /tmp/current_path.txt | wc -l)

    if [ "$DIFF_COUNT" -gt "$ALERT_THRESHOLD" ]; then
        echo "ALERT: AS path changed significantly ($DIFF_COUNT changed lines)"
        echo "Previous path:"
        cat /tmp/previous_path.txt
        echo "Current path:"
        cat /tmp/current_path.txt

        # Send alert to monitoring system
        curl -X POST https://your-monitoring-endpoint.com/alert \
            -H "Content-Type: application/json" \
            -d '{"message": "BGP AS path change detected", "severity": "warning"}'
    fi
fi

# Save current path for next comparison
cp /tmp/current_path.txt /tmp/previous_path.txt
This approach is particularly valuable for understanding why latency suddenly increased or why your multi-region failover didn’t behave as expected. In my experience, unexpected AS path changes often precede larger routing incidents by 10-30 minutes, giving you a window to prepare.
3. Analyze Routing Table Growth and Convergence Issues
Even if you don’t control BGP directly, monitoring global routing table metrics helps predict infrastructure instability.
# routing_table_monitor.py
import requests
from datetime import datetime, timedelta


def check_routing_table_size():
    """
    Monitor global BGP routing visibility via public APIs (RIPEstat here)
    """
    url = "https://stat.ripe.net/data/routing-status/data.json"
    params = {"resource": "0.0.0.0/0"}

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    data = response.json()

    if "data" not in data:
        return None

    status = data["data"]
    announced = status.get("announced", False)
    visible_peers = status.get("observed_neighbours", 0)

    metrics = {
        "timestamp": datetime.now().isoformat(),
        "announced": announced,
        "visible_peers": visible_peers
    }

    # Track table size growth rate
    store_metrics(metrics)

    # Alert on rapid growth (possible route leak)
    if check_growth_rate_anomaly(metrics):
        send_alert(
            "BGP Routing Table Anomaly",
            f"Unusual routing table growth detected: {visible_peers} peers"
        )

    return metrics


def check_growth_rate_anomaly(current_metrics):
    """
    Check if routing table visibility is growing abnormally fast.
    Alerts if growth exceeds 10% over the last 15 minutes.
    """
    historical = get_historical_metrics(lookback=timedelta(minutes=15))
    if not historical:
        return False

    baseline = historical[0].get("visible_peers", 0)
    current = current_metrics.get("visible_peers", 0)
    if baseline == 0:
        return False

    growth_rate = ((current - baseline) / baseline) * 100
    return growth_rate > 10


def store_metrics(metrics):
    """Store metrics for historical analysis"""
    # Write to a time-series database (Prometheus, InfluxDB, etc.)
    pass


def get_historical_metrics(lookback):
    """Retrieve historical metrics"""
    # Query the time-series database
    return []


def send_alert(title, message):
    """Send alert to monitoring system"""
    print(f"ALERT: {title} - {message}")


# Run every 5 minutes
check_routing_table_size()
Rapid routing table growth often indicates route leaks in progress. By correlating this data with application metrics, you can distinguish between application-level issues and network-level routing problems.
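The store_metrics() and get_historical_metrics() stubs above are left open on purpose: wire them to whatever time-series store you already run. As a minimal sketch, here is a SQLite-backed version (the bgp_metrics.db path is an arbitrary placeholder; in production you would point these at Prometheus, InfluxDB, or similar):

# metrics_store.py - SQLite-backed implementations of the stubs above
import sqlite3
from datetime import datetime

DB_PATH = "bgp_metrics.db"  # arbitrary local path; swap for your TSDB in production


def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bgp_metrics ("
        "timestamp TEXT, announced INTEGER, visible_peers INTEGER)"
    )
    conn.commit()
    return conn


def store_metrics(metrics):
    """Append one sample to the local metrics table."""
    conn = _connect()
    with conn:
        conn.execute(
            "INSERT INTO bgp_metrics VALUES (?, ?, ?)",
            (metrics["timestamp"], int(metrics["announced"]), metrics["visible_peers"]),
        )
    conn.close()


def get_historical_metrics(lookback):
    """Return samples newer than now minus lookback, oldest first."""
    cutoff = (datetime.now() - lookback).isoformat()
    conn = _connect()
    rows = conn.execute(
        "SELECT timestamp, announced, visible_peers FROM bgp_metrics "
        "WHERE timestamp >= ? ORDER BY timestamp ASC",
        (cutoff,),
    ).fetchall()
    conn.close()
    return [
        {"timestamp": ts, "announced": bool(a), "visible_peers": peers}
        for ts, a, peers in rows
    ]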
Integrate BGP Data with Observability Platforms
BGP monitoring is most valuable when integrated into your existing observability stack. I’ve found several integration patterns that work well in production:
Correlation with latency metrics - When API latency increases, check for AS path changes. If paths changed at the same time latency spiked, you’ve found your root cause.
Multi-region health checks - Run HTTP health checks from multiple geographic locations. If a subset of locations fails while others succeed, investigate BGP routing for those failing regions (a sketch of this pattern follows this list).
CDN and DNS monitoring - CDN providers and DNS services often have better BGP visibility. Monitor their health dashboards and correlate with your application metrics.
Cloud provider status pages - AWS, Azure, and GCP publish network status. Automate scraping these pages and correlate with your own monitoring data.
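To make the multi-region health check pattern concrete, here is a rough sketch of the probe logic. It assumes you deploy the check to a probe host in each region (the SERVICE_URL, /healthz path, and region names are placeholders); running it from a single machine, as written, only simulates the aggregation step:

# regional_health_check.py - run this on a probe host in each region
import requests

SERVICE_URL = "https://your-service.example.com/healthz"  # placeholder endpoint
REGIONS = ["us-east", "eu-west", "ap-southeast"]          # placeholder probe locations


def check_from_probe(region, url=SERVICE_URL):
    """Return (region, ok) for a single HTTP health check."""
    try:
        response = requests.get(url, timeout=5)
        return region, response.status_code == 200
    except requests.RequestException:
        return region, False


def detect_regional_divergence(results):
    """
    If some regions succeed while others fail, the service itself is probably
    healthy and the failing regions are worth checking for routing problems.
    """
    failing = [region for region, ok in results if not ok]
    if failing and len(failing) < len(results):
        print(f"Possible routing issue: healthy globally except {failing}")
    return failing


# In practice each probe reports into a central store; here we aggregate locally.
results = [check_from_probe(region) for region in REGIONS]
detect_regional_divergence(results)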
In practice, I’ve built dashboards that overlay BGP routing changes on top of standard application metrics. This visualization makes it immediately obvious when network-level issues are affecting application performance, reducing mean time to resolution (MTTR) from hours to minutes.
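One way to build that overlay is to push each detected BGP event into Grafana as an annotation, so it shows up as a marker on the dashboards you already have. A minimal sketch against Grafana's annotations HTTP API, assuming GRAFANA_URL and GRAFANA_API_TOKEN environment variables (both names are placeholders):

# grafana_annotation.py - overlay BGP events on existing dashboards
import os
import time
import requests


def annotate_bgp_event(text, tags=("bgp", "routing")):
    """Create a Grafana annotation at the current time for a BGP routing event."""
    grafana_url = os.environ["GRAFANA_URL"]          # e.g. https://grafana.internal
    api_token = os.environ["GRAFANA_API_TOKEN"]      # service account / API token

    payload = {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "tags": list(tags),
        "text": text,
    }
    response = requests.post(
        f"{grafana_url}/api/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


# Example: call this from the prefix or AS path monitors when they detect a change
# annotate_bgp_event("AS path to 8.8.8.8 changed: AS64500 -> AS64496 -> AS15169")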
Respond to BGP Routing Incidents Effectively
When BGP monitoring alerts fire, your response depends on whether you control the routing or not.
If you announce your own prefixes:
- Verify ROA (Route Origin Authorization) records are correct
- Check for unauthorized announcements and contact your upstream provider
- Prepare to withdraw announcements if hijacking is confirmed
- Document the incident for post-mortem analysis
If you’re dependent on cloud providers:
- Verify the issue spans multiple cloud providers (rules out provider-specific problems)
- Engage your cloud provider’s support with specific BGP data
- Consider activating multi-cloud failover if available
- Monitor provider status pages for updates
For all scenarios:
- Capture BGP routing data before it converges (use looking glasses and route collectors; a capture sketch follows this list)
- Document AS paths, peer counts, and timeline of changes
- Correlate with application-level impact (which users, which regions, which services)
- Update runbooks based on lessons learned
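For the capture step, a small script that snapshots a public looking glass view and writes the raw JSON to disk is usually enough; the exact response format matters less than having a timestamped copy from before convergence. A sketch using RIPEstat's looking glass endpoint (the output path and filename scheme are arbitrary):

# capture_bgp_snapshot.py - save the current looking glass view before routes converge
import json
from datetime import datetime, timezone

import requests


def capture_looking_glass(prefix, out_dir="."):
    """
    Fetch RIPEstat's looking glass view for a prefix and write the raw JSON
    to a timestamped file for later incident analysis.
    """
    url = "https://stat.ripe.net/data/looking-glass/data.json"
    response = requests.get(url, params={"resource": prefix}, timeout=15)
    response.raise_for_status()

    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    filename = f"{out_dir}/looking-glass_{prefix.replace('/', '_')}_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(response.json(), f, indent=2)
    return filename


# Example: snapshot the affected prefix as soon as the incident is declared
# capture_looking_glass("203.0.113.0/24")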
I’ve learned the hard way that BGP incidents require different playbooks than application incidents. The debugging tools are different, the escalation paths are different, and the resolution timelines are often outside your direct control.
Practical Lessons from Production BGP Incidents
Over the years, I’ve developed some rules of thumb for BGP monitoring in production environments:
False positives are expensive - BGP routing changes constantly. Tune your alerting thresholds to focus on changes that actually impact your services, not every minor AS path variation.
Latency is your canary - Increased latency from specific geographic regions often precedes complete routing failures. Monitor P95 and P99 latency by region, not just global averages.
Multi-cloud isn’t automatic failover - Even with infrastructure in multiple clouds, BGP routing can fail in ways that make your failover useless. Test your failover scenarios with simulated routing failures.
BGP data sources matter - Different looking glasses and route collectors see different views of the internet. Use multiple data sources for comprehensive visibility.
Document your prefixes and ASNs - When incidents happen, you need this information immediately. Keep it in your runbooks and incident response documentation.
The most important lesson: BGP monitoring isn’t about becoming a networking expert. It’s about having enough visibility into the routing layer to distinguish between “our code is broken” and “the internet is broken.” That distinction saves hours of debugging time and prevents unnecessary escalations.
Build a Sustainable Network Monitoring Strategy
Start small and expand as you gain confidence. A minimal viable BGP monitoring strategy includes:
- Prefix monitoring - Alert if your IP prefixes are announced by unexpected ASNs
- Regional health checks - HTTP checks from diverse geographic locations
- AS path baselines - Track normal AS paths to critical services and alert on significant deviations
- Cloud provider status integration - Automate monitoring of provider network status
As your monitoring matures, add:
- BGP route collector integration for historical analysis
- Automated failover testing with simulated routing failures
- Correlation between BGP events and application-level impact
- Integration with incident management workflows
The goal isn’t to predict every BGP incident—that’s impossible. The goal is to reduce the time between “something’s wrong” and “we know it’s a routing issue” from hours to minutes.
Implementing BGP monitoring transforms how you respond to production incidents. When your monitoring shows that routing changed at the exact moment your application started having problems, you skip straight to the right escalation path. Start with prefix monitoring and regional health checks, then expand to AS path tracking and route collector integration. The distinction between application failures and network routing issues saves hours of debugging time and improves infrastructure reliability across your entire stack.