# Configuration Review Summary

## Overview

This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.

## Changes Made

### 1. **gnmic Configuration** (`monitoring/gnmic/gnmic.yaml`)

#### ✅ Improvements:

- **Added BGP/EVPN telemetry subscriptions**
  - BGP neighbor state monitoring
  - EVPN AFI/SAFI metrics
  - Critical for overlay health visibility

- **Added routing telemetry**
  - Static routes monitoring
  - IPv4 unicast AFT entries
  - Underlay health visibility

- **Enhanced VXLAN subscriptions**
  - VLAN member state
  - Connection point endpoints
  - On-change streaming for real-time updates

- **Added MLAG telemetry**
  - LACP interface state
  - LACP member state
  - Redundancy monitoring

- **Optimized sample intervals**
  - Interfaces: 10s (was 15s) for better granularity
  - BGP/EVPN: 30s for overlay health
  - System: 30s for resource monitoring
  - MLAG: 15s for redundancy tracking

- **Enhanced event processors**
  - Better metric name transformation
  - Interface name cleanup (Ethernet → eth)
  - Source label enrichment

#### 📊 Key Metrics Now Available:

```
# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status

# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent

# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity

# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization
```

### 2. **Prometheus Configuration** (`monitoring/prometheus/prometheus.yml`)

#### ✅ Improvements:

- **Enhanced metric relabeling**
  - Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
  - Drop rule for unneeded metrics to reduce storage
  - Better than original overly-restrictive regex

- **Added topology label extraction**
  - Extracts device_type (spine/leaf) from source label
  - Extracts device_number for aggregation
  - Enables better Grafana queries

- **Additional cluster label**
  - Added `cluster: evpn-vxlan-lab` for multi-cluster scenarios

#### 📈 Metric Filtering Logic:

```yaml
# KEEP these patterns:
- gnmic_interfaces_.*              # All interface metrics
- gnmic_.*bgp.*                    # All BGP metrics
- gnmic_.*lacp.*                   # All LACP/MLAG metrics
- gnmic_system.*                   # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.*   # VXLAN/VLAN metrics

# DROP everything else matching gnmic_.*
```

### 3. **Docker Compose** (`monitoring/docker-compose.yml`)

#### ✅ Improvements:

- **Replaced archived weathermap plugin** with active alternatives
  - `agenty-flowcharting-panel` - Flow/flowchart visualization
  - `yesoreyeram-infinity-datasource` - Enhanced data sources

- **Enabled anonymous access** for easier demo/testing
  - Anonymous role: Viewer (read-only)
  - Still requires admin/admin for editing

- **Added health checks** for all services
  - gnmic: checks /metrics endpoint
  - prometheus: checks /-/healthy endpoint
  - grafana: checks /api/health endpoint

### 4. **New Flow Topology Dashboard** (`monitoring/grafana/dashboards/fabric-flow-topology.json`)

#### 🎨 Features:

- **Mermaid-style flowchart** showing fabric topology
  - 2 Spines (AS 65000)
  - 8 Leaves in 4 VTEP pairs (AS 65001-65004)
  - MLAG peer-link visualization
  - All spine-to-leaf uplinks

- **Live bandwidth overlays** on links
  - Real-time rate calculations using Prometheus queries
  - Color-coded thresholds (green → yellow → orange → red)
  - Pattern matching for automatic metric association

- **Separate bandwidth graphs**
  - Spine interface bandwidth (TX/RX)
  - Leaf interface bandwidth (TX/RX)
  - Mean and max calculations in legend

## Testing the Changes

### 1. Validate gnmic Configuration

```bash
# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities

# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s
```

### 2. Check Prometheus Metrics

```bash
# Once stack is running
curl http://localhost:9804/metrics | grep gnmic_interfaces

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query specific metric
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
```

### 3. Verify Grafana Dashboards

1. Access http://localhost:3000
2. Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
3. Verify:
   - Flow diagram renders correctly
   - Bandwidth overlays show on links
   - Time series graphs display data
   - Colors change based on utilization thresholds

## Comparison: Old vs New

### Old Configuration (weathermap)

- ❌ Used archived weathermap plugin (no longer maintained)
- ❌ Limited telemetry (interfaces only)
- ❌ No BGP/EVPN visibility
- ❌ Static bandwidth thresholds
- ❌ Manual metric path specification

### New Configuration (Flow Plugin)

- ✅ Uses actively maintained Flow Charting plugin
- ✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
- ✅ Full overlay health visibility
- ✅ Dynamic bandwidth visualization
- ✅ Pattern-based automatic metric mapping
- ✅ Better metric organization and filtering

## Next Steps

### Recommended Additional Enhancements

1. **Add BGP State Dashboard**
   - BGP neighbor states across fabric
   - EVPN route counts per VTEP
   - Session flap detection

2. **Add VXLAN Overlay Dashboard**
   - Active VNIs per VTEP
   - VTEP reachability matrix
   - L2/L3 VXLAN traffic stats

3. **Add MLAG Health Dashboard**
   - Peer-link status and bandwidth
   - MLAG port status
   - Dual-active detection events

4. **Add Alerting Rules**
   - BGP session down alerts
   - Interface utilization thresholds
   - MLAG peer-link failures

5. **Add Recording Rules** (optional, for performance)

   ```yaml
   # Example: Pre-calculate interface utilization percentages
   - record: interface:bandwidth:utilization_percent
     expr: |
       (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
   ```

## Troubleshooting

### Issue: No metrics in Prometheus

**Check:**

```bash
# Verify gnmic is collecting
docker logs gnmic

# Check gnmic metrics endpoint
curl http://localhost:9804/metrics

# Verify Prometheus can scrape
docker logs prometheus | grep gnmic
```

### Issue: Flow diagram not rendering

**Check:**

1. Flow Charting plugin installed: Settings → Plugins → search "agenty"
2. Prometheus datasource configured: Configuration → Data Sources
3. Metric queries returning data in Explore view
4. Browser console for JavaScript errors

### Issue: Missing BGP metrics

**Check:**

```bash
# SSH to a switch
ssh admin@172.16.0.1

# Verify gNMI is enabled
show management api gnmi
```

If not enabled on switches, add to configs:

```
management api gnmi
   transport grpc default
```

## References

- [gnmic Documentation](https://gnmic.openconfig.net)
- [Agenty Flow Charting Plugin](https://grafana.com/grafana/plugins/agenty-flowcharting-panel/)
- [Nokia SRL Telemetry Lab](https://github.com/srl-labs/srl-telemetry-lab) (reference implementation)
- [Arista gNMI Documentation](https://aristanetworks.github.io/openmgmt/)

## Summary

This configuration review has transformed your monitoring stack from using an archived plugin with limited visibility to a modern, comprehensive telemetry solution:

- **Better Plugin**: Active Flow Charting vs archived weathermap
- **More Data**: 5 subscription types vs 2 (interfaces, system, BGP, VXLAN, MLAG)
- **Better Filtering**: Explicit metric keeping vs overly restrictive regex
- **Health Checks**: Automated service health monitoring
- **Production Ready**: Comprehensive visibility of underlay AND overlay

The stack is now aligned with industry best practices as demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.
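As a quick offline check of the metric filtering logic described for the Prometheus configuration, the keep and drop patterns can be tried against metric names before touching the running stack. This is only a sketch: `classify_metric` and `KEEP_RE` are hypothetical names, and the actual relabel rules in `prometheus.yml` may be structured differently. Prometheus anchors relabel regexes at both ends, which the explicit `^...$` mimics here.

```bash
#!/bin/sh
# Hypothetical helper, not part of the stack: mimics the keep/drop patterns
# listed in the metric filtering logic. Anchors are explicit because grep,
# unlike Prometheus relabeling, does not anchor regexes by default.
KEEP_RE='^(gnmic_interfaces_.*|gnmic_.*bgp.*|gnmic_.*lacp.*|gnmic_system.*|gnmic_.*vxlan.*|gnmic_.*vlan.*)$'

classify_metric() {
    if printf '%s\n' "$1" | grep -Eq "$KEEP_RE"; then
        echo keep        # matches one of the explicit keep patterns
    elif printf '%s\n' "$1" | grep -Eq '^gnmic_.*$'; then
        echo drop        # any other gnmic_ series is dropped
    else
        echo untouched   # non-gnmic series are outside these rules
    fi
}

# Example metric names; the second one is made up for illustration.
classify_metric gnmic_interfaces_interface_state_counters_in_octets
classify_metric gnmic_acl_state_counter
```

Running candidate names through a check like this makes it easier to spot a metric that a dashboard depends on but that the drop rule would discard.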
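The topology label extraction mentioned for the Prometheus configuration can be illustrated the same way. The sketch below assumes that the `source` label looks like `spine1` or `leaf4`. That format is an assumption, not something stated in this review, and `extract_topology` is a hypothetical helper rather than part of the stack.

```bash
#!/bin/sh
# Hypothetical helper: derives device_type and device_number from a source
# label, assuming the label has the form <spine|leaf><number>.
extract_topology() {
    printf '%s\n' "$1" |
        sed -nE 's/^(spine|leaf)([0-9]+)$/device_type=\1 device_number=\2/p'
}

extract_topology spine1
extract_topology leaf4
```

A label that does not match the assumed format produces no output, which corresponds to a relabel rule leaving the series without the extra labels.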
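The recording rule example under Next Steps divides by 10000000000, which corresponds to a 10 Gb/s link. The arithmetic behind it can be checked by hand with two octet counter samples. This is a simplified sketch: `utilization_percent` is a hypothetical helper, and unlike Prometheus `rate()` it does not handle counter resets.

```bash
#!/bin/sh
# Hypothetical helper: percent utilization from two octet counter samples.
#   $1 = earlier counter value   $2 = later counter value
#   $3 = seconds between samples $4 = link speed in bit/s
utilization_percent() {
    awk -v a="$1" -v b="$2" -v t="$3" -v s="$4" \
        'BEGIN { printf "%.2f\n", (b - a) / t * 8 / s * 100 }'
}

# 3,750,000,000 octets in 10 s on a 10 Gb/s link is 3 Gb/s, i.e. 30.00
utilization_percent 0 3750000000 10 10000000000
```

Links with a different speed need a different divisor, so a single hard-coded value in the recording rule is only accurate if all monitored interfaces run at 10 Gb/s.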