Configuration Review Summary
Overview
This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.
Changes Made
1. gnmic Configuration (monitoring/gnmic/gnmic.yaml)
✅ Improvements:
- Added BGP/EVPN telemetry subscriptions
  - BGP neighbor state monitoring
  - EVPN AFI/SAFI metrics
  - Critical for overlay health visibility
- Added routing telemetry
  - Static routes monitoring
  - IPv4 unicast AFT entries
  - Underlay health visibility
- Enhanced VXLAN subscriptions
  - VLAN member state
  - Connection point endpoints
  - On-change streaming for real-time updates
- Added MLAG telemetry
  - LACP interface state
  - LACP member state
  - Redundancy monitoring
- Optimized sample intervals
  - Interfaces: 10s (was 15s) for better granularity
  - BGP/EVPN: 30s for overlay health
  - System: 30s for resource monitoring
  - MLAG: 15s for redundancy tracking
- Enhanced event processors
  - Better metric name transformation
  - Interface name cleanup (Ethernet → eth)
  - Source label enrichment
📊 Key Metrics Now Available:
# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status
# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent
# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity
# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization
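The interface metric names listed above mirror the gNMI path of each leaf. As a rough illustration only (this is not the gnmic processor configuration itself), the Python sketch below shows how a path could be flattened into such a name and how the Ethernet → eth cleanup might behave. The function names metric_name and short_interface are hypothetical, and the real processors may trim path elements further, as the shorter BGP metric names suggest.

```python
import re


def metric_name(path: str, prefix: str = "gnmic") -> str:
    """Flatten a gNMI path into a Prometheus-style metric name (illustrative)."""
    # Drop key selectors such as [name=Ethernet1]; keys end up as labels, not name parts.
    without_keys = re.sub(r"\[[^\]]*\]", "", path)
    # Join the remaining path elements with underscores.
    parts = [p for p in without_keys.split("/") if p]
    name = "_".join([prefix] + parts)
    # Replace characters that are not valid in a Prometheus metric name.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


def short_interface(name: str) -> str:
    """Apply the Ethernet -> eth cleanup; other interface names pass through."""
    return re.sub(r"^Ethernet", "eth", name)


print(metric_name("/interfaces/interface[name=Ethernet1]/state/counters/in-octets"))
print(short_interface("Ethernet1"))
```

Key selectors are removed before the path is split so that interface names containing a slash do not leak into the metric name.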
2. Prometheus Configuration (monitoring/prometheus/prometheus.yml)
✅ Improvements:
- Enhanced metric relabeling
  - Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
  - Drop rule for unneeded metrics to reduce storage
  - Replaces the original, overly restrictive regex
- Added topology label extraction
  - Extracts device_type (spine/leaf) from source label
  - Extracts device_number for aggregation
  - Enables better Grafana queries
- Additional cluster label
  - Added cluster: evpn-vxlan-lab for multi-cluster scenarios
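To make the topology label extraction concrete, here is a minimal Python sketch of the idea. It assumes, purely for illustration, that source label values start with a role name followed by a number, such as spine1 or leaf4:6030; the actual label format and the relabel rules in prometheus.yml may differ. The names SOURCE_RE and topology_labels are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed source format: "<role><number>", optionally followed by ":port".
SOURCE_RE = re.compile(r"^(spine|leaf)-?(\d+)")


def topology_labels(source: str) -> Optional[Dict[str, str]]:
    """Derive device_type and device_number from a source label value."""
    match = SOURCE_RE.match(source)
    if match is None:
        # Values that do not follow the assumed naming get no extra labels.
        return None
    return {"device_type": match.group(1), "device_number": match.group(2)}


print(topology_labels("leaf4:6030"))
```

With labels like these, a Grafana query can aggregate by device_type instead of matching each source value by hand.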
📈 Metric Filtering Logic:
# KEEP these patterns:
- gnmic_interfaces_.* # All interface metrics
- gnmic_.*bgp.* # All BGP metrics
- gnmic_.*lacp.* # All LACP/MLAG metrics
- gnmic_system.* # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.* # VXLAN/VLAN metrics
# DROP everything else matching gnmic_.*
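The keep/drop logic above can be checked offline before touching Prometheus. The following Python sketch applies the same patterns in the same order; it is an approximation of the relabeling behavior, not the relabel configuration itself. Prometheus anchors relabel regexes at both ends, which re.fullmatch imitates here. The helper name is_kept and the metric gnmic_acl_state_counter are made up for the example.

```python
import re

# Patterns copied from the KEEP list above.
KEEP_PATTERNS = [
    r"gnmic_interfaces_.*",
    r"gnmic_.*bgp.*",
    r"gnmic_.*lacp.*",
    r"gnmic_system.*",
    r"gnmic_.*vxlan.*|gnmic_.*vlan.*",
]
KEEP_RE = re.compile("|".join(f"(?:{p})" for p in KEEP_PATTERNS))
DROP_RE = re.compile(r"gnmic_.*")


def is_kept(metric: str) -> bool:
    """Return True if a metric name would survive the keep/drop rules."""
    if KEEP_RE.fullmatch(metric):
        return True
    if DROP_RE.fullmatch(metric):
        # Any other gnmic_ metric is dropped.
        return False
    # Metrics outside the gnmic_ namespace are not affected by these rules.
    return True


for name in (
    "gnmic_interfaces_interface_state_counters_in_octets",
    "gnmic_acl_state_counter",  # hypothetical metric that matches no keep rule
):
    print(name, is_kept(name))
```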
3. Docker Compose (monitoring/docker-compose.yml)
✅ Improvements:
- Replaced the archived weathermap plugin with actively maintained alternatives
  - agenty-flowcharting-panel: Flow/flowchart visualization
  - yesoreyeram-infinity-datasource: Enhanced data sources
- Enabled anonymous access for easier demo/testing
  - Anonymous role: Viewer (read-only)
  - Still requires admin/admin for editing
- Added health checks for all services
  - gnmic: checks /metrics endpoint
  - prometheus: checks /-/healthy endpoint
  - grafana: checks /api/health endpoint
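The same three endpoints can also be probed from the host. The Python sketch below is one way to do that, assuming the services are published on localhost ports 9804, 9090, and 3000 as used elsewhere in this document. The names HEALTH_ENDPOINTS, http_status, and check_services are hypothetical, and the Compose health checks themselves may be implemented differently.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Endpoint paths from the list above; host and ports are assumptions.
HEALTH_ENDPOINTS = {
    "gnmic": "http://localhost:9804/metrics",
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_services(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True if its endpoint answered with 200."""
    results = {}
    for name, url in HEALTH_ENDPOINTS.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:
            # Connection failures and HTTP errors both count as unhealthy.
            results[name] = False
    return results
```

Calling check_services() with no arguments performs real requests; passing a custom fetch function keeps the logic testable without a running stack.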
4. New Flow Topology Dashboard (monitoring/grafana/dashboards/fabric-flow-topology.json)
🎨 Features:
- Mermaid-style flowchart showing fabric topology
  - 2 Spines (AS 65000)
  - 8 Leaves in 4 VTEP pairs (AS 65001-65004)
  - MLAG peer-link visualization
  - All spine-to-leaf uplinks
- Live bandwidth overlays on links
  - Real-time rate calculations using Prometheus queries
  - Color-coded thresholds (green → yellow → orange → red)
  - Pattern matching for automatic metric association
- Separate bandwidth graphs
  - Spine interface bandwidth (TX/RX)
  - Leaf interface bandwidth (TX/RX)
  - Mean and max calculations in legend
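The bandwidth overlay boils down to two steps: turning octet counter samples into a bit rate, and mapping that rate onto a color. The Python sketch below illustrates both under stated assumptions. The threshold values are invented for the example and are not taken from the dashboard JSON, and the reset handling is a simplification of what the PromQL rate() function does.

```python
from typing import Optional


def counter_rate_bps(prev_octets: int, curr_octets: int, interval_s: float) -> Optional[float]:
    """Bits per second between two octet-counter samples; None on a counter reset."""
    if interval_s <= 0 or curr_octets < prev_octets:
        return None
    return (curr_octets - prev_octets) * 8 / interval_s


# Hypothetical upper bounds in bit/s for each color band.
THRESHOLDS = [
    (1e9, "green"),
    (5e9, "yellow"),
    (8e9, "orange"),
]


def link_color(bps: float) -> str:
    """Pick the first band whose upper bound exceeds the rate, else red."""
    for upper, color in THRESHOLDS:
        if bps < upper:
            return color
    return "red"


rate = counter_rate_bps(0, 2_500_000_000, 10)  # 2.5 GB transferred in 10 s
print(rate, link_color(rate))
```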
Testing the Changes
1. Validate gnmic Configuration
# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities
# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s
2. Check Prometheus Metrics
# Once stack is running
curl http://localhost:9804/metrics | grep gnmic_interfaces
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Query specific metric
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
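The query endpoint returns JSON, which is easier to inspect with a few lines of code than by eye. The Python sketch below parses an instant-query response of the kind the curl command above returns. The sample payload is fabricated for illustration, and the interface_name label is an assumption about how the metrics are labeled; only the source label is mentioned in this document.

```python
import json
from typing import Dict, Tuple


def instant_values(payload: str) -> Dict[Tuple[str, str], float]:
    """Map (source, interface_name) to the sample value of an instant query."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    values = {}
    for sample in body["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("source", ""), labels.get("interface_name", ""))
        # Each instant-vector sample is a [timestamp, "value-as-string"] pair.
        _timestamp, raw = sample["value"]
        values[key] = float(raw)
    return values


# Fabricated response body, shaped like the Prometheus HTTP API output.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"source": "spine1", "interface_name": "eth1"},
                "value": [1700000000.0, "123456"],
            }
        ],
    },
})

print(instant_values(SAMPLE))
```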
3. Verify Grafana Dashboards
- Access http://localhost:3000
- Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
- Verify:
- Flow diagram renders correctly
- Bandwidth overlays show on links
- Time series graphs display data
- Colors change based on utilization thresholds
Comparison: Old vs New
Old Configuration (weathermap)
- ❌ Used archived weathermap plugin (no longer maintained)
- ❌ Limited telemetry (interfaces only)
- ❌ No BGP/EVPN visibility
- ❌ Static bandwidth thresholds
- ❌ Manual metric path specification
New Configuration (Flow Plugin)
- ✅ Uses actively maintained Flow Charting plugin
- ✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
- ✅ Full overlay health visibility
- ✅ Dynamic bandwidth visualization
- ✅ Pattern-based automatic metric mapping
- ✅ Better metric organization and filtering
Next Steps
Recommended Additional Enhancements
- Add BGP State Dashboard
  - BGP neighbor states across fabric
  - EVPN route counts per VTEP
  - Session flap detection
- Add VXLAN Overlay Dashboard
  - Active VNIs per VTEP
  - VTEP reachability matrix
  - L2/L3 VXLAN traffic stats
- Add MLAG Health Dashboard
  - Peer-link status and bandwidth
  - MLAG port status
  - Dual-active detection events
- Add Alerting Rules
  - BGP session down alerts
  - Interface utilization thresholds
  - MLAG peer-link failures
- Add Recording Rules (optional, for performance)
  # Example: Pre-calculate interface utilization percentages
  # The divisor 10000000000 is the link capacity in bit/s (10 Gb/s)
  - record: interface:bandwidth:utilization_percent
    expr: |
      (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
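The arithmetic in the recording rule is worth spelling out: the counter rate is in bytes per second, multiplying by 8 gives bits per second, and dividing by the link capacity gives the utilized fraction. The Python sketch below reproduces that calculation. The function name utilization_percent is hypothetical, and the 10 Gb/s default is inferred from the constant in the rule, so links of other speeds would need a different value.

```python
def utilization_percent(bytes_per_second: float, link_bps: float = 10_000_000_000) -> float:
    """Convert a byte rate into percent utilization of a link of link_bps bit/s."""
    return (bytes_per_second * 8 / link_bps) * 100


# 125 MB/s is 1 Gb/s, which is a tenth of a 10 Gb/s link.
print(utilization_percent(125_000_000))
```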
Troubleshooting
Issue: No metrics in Prometheus
Check:
# Verify gnmic is collecting
docker logs gnmic
# Check gnmic metrics endpoint
curl http://localhost:9804/metrics
# Verify Prometheus can scrape
docker logs prometheus | grep gnmic
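If the logs are inconclusive, the Prometheus targets API states directly why a scrape fails. The Python sketch below lists targets that are not healthy from a /api/v1/targets response. The sample payload, including the gnmic:9804 scrape URL and the error text, is fabricated for illustration, and the helper name unhealthy_targets is hypothetical.

```python
import json
from typing import List, Tuple


def unhealthy_targets(payload: str) -> List[Tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target that is not up."""
    body = json.loads(payload)
    problems = []
    for target in body["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((target.get("scrapeUrl", ""), target.get("lastError", "")))
    return problems


# Fabricated response body, shaped like the Prometheus targets API output.
SAMPLE_TARGETS = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://gnmic:9804/metrics", "health": "down",
             "lastError": "connection refused"},
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up",
             "lastError": ""},
        ]
    },
})

print(unhealthy_targets(SAMPLE_TARGETS))
```

An empty list means Prometheus is scraping every target successfully, which points the investigation back at gnmic or the switches.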
Issue: Flow diagram not rendering
Check:
- Flow Charting plugin installed: Settings → Plugins → search "agenty"
- Prometheus datasource configured: Configuration → Data Sources
- Metric queries returning data in Explore view
- Browser console for JavaScript errors
Issue: Missing BGP metrics
Check:
# SSH to a switch
ssh admin@172.16.0.1
# Verify gNMI is enabled
show management api gnmi
If gNMI is not enabled on the switches, add the following to their configurations:
management api gnmi
   transport grpc default
References
- gnmic Documentation
- Agenty Flow Charting Plugin
- Nokia SRL Telemetry Lab (reference implementation)
- Arista gNMI Documentation
Summary
This configuration review has transformed your monitoring stack from using an archived plugin with limited visibility to a modern, comprehensive telemetry solution:
- Better Plugin: Active Flow Charting vs archived weathermap
- More Data: 5 subscription types vs 2 (interfaces, system, BGP, VXLAN, MLAG)
- Better Filtering: Explicit metric keeping vs overly restrictive regex
- Health Checks: Automated service health monitoring
- Production Ready: Comprehensive visibility of underlay AND overlay
The stack is now aligned with industry best practices as demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.