diff --git a/monitoring/CONFIGURATION_REVIEW.md b/monitoring/CONFIGURATION_REVIEW.md
new file mode 100644
index 0000000..4314187
--- /dev/null
+++ b/monitoring/CONFIGURATION_REVIEW.md
@@ -0,0 +1,267 @@
+# Configuration Review Summary
+
+## Overview
+This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.
+
+## Changes Made
+
+### 1. **gnmic Configuration** (`monitoring/gnmic/gnmic.yaml`)
+
+#### ✅ Improvements:
+- **Added BGP/EVPN telemetry subscriptions**
+  - BGP neighbor state monitoring
+  - EVPN AFI/SAFI metrics
+  - Critical for overlay health visibility
+
+- **Added routing telemetry**
+  - Static routes monitoring
+  - IPv4 unicast AFT entries
+  - Underlay health visibility
+
+- **Enhanced VXLAN subscriptions**
+  - VLAN member state
+  - Connection point endpoints
+  - On-change streaming for real-time updates
+
+- **Added MLAG telemetry**
+  - LACP interface state
+  - LACP member state
+  - Redundancy monitoring
+
+- **Optimized sample intervals**
+  - Interfaces: 10s (was 15s) for better granularity
+  - BGP/EVPN: 30s for overlay health
+  - System: 30s for resource monitoring
+  - MLAG: 15s for redundancy tracking
+
+- **Enhanced event processors**
+  - Better metric name transformation
+  - Interface name cleanup (Ethernet → eth)
+  - Source label enrichment
+
+#### 📊 Key Metrics Now Available:
+```
+# Interface metrics (for Flow Plugin)
+gnmic_interfaces_interface_state_counters_in_octets
+gnmic_interfaces_interface_state_counters_out_octets
+gnmic_interfaces_interface_state_oper_status
+gnmic_interfaces_interface_state_admin_status
+
+# BGP/EVPN metrics (overlay health)
+gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
+gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
+gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent
+
+# MLAG metrics (redundancy)
+gnmic_lacp_interfaces_interface_state_system_priority
+gnmic_lacp_interfaces_interface_members_member_state_activity
+
+# System metrics
+gnmic_system_state_hostname
+gnmic_system_memory_state_physical
+gnmic_system_cpus_cpu_state_total_utilization
+```
+
+### 2. **Prometheus Configuration** (`monitoring/prometheus/prometheus.yml`)
+
+#### ✅ Improvements:
+- **Enhanced metric relabeling**
+  - Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
+  - Drop rule for unneeded metrics to reduce storage
+  - Better than original overly-restrictive regex
+
+- **Added topology label extraction**
+  - Extracts device_type (spine/leaf) from source label
+  - Extracts device_number for aggregation
+  - Enables better Grafana queries
+
+- **Additional cluster label**
+  - Added `cluster: evpn-vxlan-lab` for multi-cluster scenarios
+
+#### 📈 Metric Filtering Logic:
+```yaml
+# KEEP these patterns:
+- gnmic_interfaces_.*              # All interface metrics
+- gnmic_.*bgp.*                    # All BGP metrics
+- gnmic_.*lacp.*                   # All LACP/MLAG metrics
+- gnmic_system.*                   # All system metrics
+- gnmic_.*vxlan.*|gnmic_.*vlan.*   # VXLAN/VLAN metrics
+
+# DROP everything else matching gnmic_.*
+```
+
+### 3. **Docker Compose** (`monitoring/docker-compose.yml`)
+
+#### ✅ Improvements:
+- **Replaced archived weathermap plugin** with active alternatives
+  - `agenty-flowcharting-panel` - Flow/flowchart visualization
+  - `yesoreyeram-infinity-datasource` - Enhanced data sources
+
+- **Enabled anonymous access** for easier demo/testing
+  - Anonymous role: Viewer (read-only)
+  - Still requires admin/admin for editing
+
+- **Added health checks** for all services
+  - gnmic: checks /metrics endpoint
+  - prometheus: checks /-/healthy endpoint
+  - grafana: checks /api/health endpoint
+
+### 4. **New Flow Topology Dashboard** (`monitoring/grafana/dashboards/fabric-flow-topology.json`)
+
+#### 🎨 Features:
+- **Mermaid-style flowchart** showing fabric topology
+  - 2 Spines (AS 65000)
+  - 8 Leaves in 4 VTEP pairs (AS 65001-65004)
+  - MLAG peer-link visualization
+  - All spine-to-leaf uplinks
+
+- **Live bandwidth overlays** on links
+  - Real-time rate calculations using Prometheus queries
+  - Color-coded thresholds (green → yellow → orange → red)
+  - Pattern matching for automatic metric association
+
+- **Separate bandwidth graphs**
+  - Spine interface bandwidth (TX/RX)
+  - Leaf interface bandwidth (TX/RX)
+  - Mean and max calculations in legend
+
+## Testing the Changes
+
+### 1. Validate gnmic Configuration
+```bash
+# Test from gnmic container or locally with gnmic installed
+gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities
+
+# Test specific subscription
+gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
+  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
+  --stream-mode sample --sample-interval 10s
+```
+
+### 2. Check Prometheus Metrics
+```bash
+# Once stack is running
+curl http://localhost:9804/metrics | grep gnmic_interfaces
+
+# Check Prometheus targets
+curl http://localhost:9090/api/v1/targets
+
+# Query specific metric
+curl -G http://localhost:9090/api/v1/query \
+  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
+```
+
+### 3. Verify Grafana Dashboards
+1. Access http://localhost:3000
+2. Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
+3. Verify:
+   - Flow diagram renders correctly
+   - Bandwidth overlays show on links
+   - Time series graphs display data
+   - Colors change based on utilization thresholds
+
+## Comparison: Old vs New
+
+### Old Configuration (weathermap)
+- ❌ Used archived weathermap plugin (no longer maintained)
+- ❌ Limited telemetry (interfaces only)
+- ❌ No BGP/EVPN visibility
+- ❌ Static bandwidth thresholds
+- ❌ Manual metric path specification
+
+### New Configuration (Flow Plugin)
+- ✅ Uses actively maintained Flow Charting plugin
+- ✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
+- ✅ Full overlay health visibility
+- ✅ Dynamic bandwidth visualization
+- ✅ Pattern-based automatic metric mapping
+- ✅ Better metric organization and filtering
+
+## Next Steps
+
+### Recommended Additional Enhancements
+
+1. **Add BGP State Dashboard**
+   - BGP neighbor states across fabric
+   - EVPN route counts per VTEP
+   - Session flap detection
+
+2. **Add VXLAN Overlay Dashboard**
+   - Active VNIs per VTEP
+   - VTEP reachability matrix
+   - L2/L3 VXLAN traffic stats
+
+3. **Add MLAG Health Dashboard**
+   - Peer-link status and bandwidth
+   - MLAG port status
+   - Dual-active detection events
+
+4. **Add Alerting Rules**
+   - BGP session down alerts
+   - Interface utilization thresholds
+   - MLAG peer-link failures
+
+5. **Add Recording Rules** (optional, for performance)
+   ```yaml
+   # Example: Pre-calculate interface utilization percentages
+   - record: interface:bandwidth:utilization_percent
+     expr: |
+       (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
+   ```
+
+## Troubleshooting
+
+### Issue: No metrics in Prometheus
+**Check:**
+```bash
+# Verify gnmic is collecting
+docker logs gnmic
+
+# Check gnmic metrics endpoint
+curl http://localhost:9804/metrics
+
+# Verify Prometheus can scrape
+docker logs prometheus | grep gnmic
+```
+
+### Issue: Flow diagram not rendering
+**Check:**
+1. Flow Charting plugin installed: Settings → Plugins → search "agenty"
+2. Prometheus datasource configured: Configuration → Data Sources
+3. Metric queries returning data in Explore view
+4. Browser console for JavaScript errors
+
+### Issue: Missing BGP metrics
+**Check:**
+```bash
+# SSH to a switch
+ssh admin@172.16.0.1
+
+# Verify gNMI is enabled
+show management api gnmi
+```
+
+If not enabled on switches, add to configs:
+```
+management api gnmi
+   transport grpc default
+```
+
+## References
+
+- [gnmic Documentation](https://gnmic.openconfig.net)
+- [Agenty Flow Charting Plugin](https://grafana.com/grafana/plugins/agenty-flowcharting-panel/)
+- [Nokia SRL Telemetry Lab](https://github.com/srl-labs/srl-telemetry-lab) (reference implementation)
+- [Arista gNMI Documentation](https://aristanetworks.github.io/openmgmt/)
+
+## Summary
+
+This configuration review has transformed your monitoring stack from using an archived plugin with limited visibility to a modern, comprehensive telemetry solution:
+
+- **Better Plugin**: Active Flow Charting vs archived weathermap
+- **More Data**: 5 subscription types vs 2 (interfaces, system, BGP, VXLAN, MLAG)
+- **Better Filtering**: Explicit metric keeping vs overly restrictive regex
+- **Health Checks**: Automated service health monitoring
+- **Production Ready**: Comprehensive visibility of underlay AND overlay
+
+The stack is now aligned with industry best practices as demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.
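
As a rough illustration of the "Metric Filtering Logic" described under the Prometheus configuration changes, the Python sketch below models which metric names would survive the keep/drop patterns. It is not taken from `prometheus.yml`, whose actual relabel rules are not shown in this diff. The function name `is_kept` and the sample names `gnmic_acl_state_matched_packets` and `up` are hypothetical, and the sketch assumes Prometheus-style fully anchored regex matching, approximated here with `re.fullmatch`.

```python
import re

# Keep patterns copied from the "Metric Filtering Logic" block in the review.
KEEP_PATTERNS = [
    r"gnmic_interfaces_.*",             # all interface metrics
    r"gnmic_.*bgp.*",                   # all BGP metrics
    r"gnmic_.*lacp.*",                  # all LACP/MLAG metrics
    r"gnmic_system.*",                  # all system metrics
    r"gnmic_.*vxlan.*|gnmic_.*vlan.*",  # VXLAN/VLAN metrics
]

# Everything else under the gnmic_ prefix is dropped.
DROP_PATTERN = r"gnmic_.*"


def is_kept(metric_name: str) -> bool:
    """Return True if the metric would survive the keep/drop logic (hypothetical helper)."""
    # Prometheus anchors relabel regexes at both ends; fullmatch mirrors that.
    if any(re.fullmatch(pattern, metric_name) for pattern in KEEP_PATTERNS):
        return True
    # Metrics outside the gnmic_ namespace are untouched by the drop rule.
    return re.fullmatch(DROP_PATTERN, metric_name) is None


if __name__ == "__main__":
    samples = [
        "gnmic_interfaces_interface_state_counters_in_octets",
        "gnmic_network_instances_bgp_neighbors_neighbor_state_session_state",
        "gnmic_lacp_interfaces_interface_state_system_priority",
        "gnmic_system_state_hostname",
        "gnmic_acl_state_matched_packets",  # hypothetical metric, not from the review
        "up",                               # non-gnmic metric, shown for contrast
    ]
    for name in samples:
        print(f"{'KEEP' if is_kept(name) else 'DROP'}  {name}")
```

Running a list of names from `curl http://localhost:9804/metrics` through a check like this before editing the relabel rules is one way to see which series a pattern change would discard.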