Add Grafana monitoring stack with gNMI telemetry and Network Weathermap #17
monitoring/CONFIGURATION_REVIEW.md
# Configuration Review Summary

## Overview
This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.

## Changes Made

### 1. **gnmic Configuration** (`monitoring/gnmic/gnmic.yaml`)

#### ✅ Improvements:
- **Added BGP/EVPN telemetry subscriptions**
  - BGP neighbor state monitoring
  - EVPN AFI/SAFI metrics
  - Critical for overlay health visibility

- **Added routing telemetry**
  - Static routes monitoring
  - IPv4 unicast AFT entries
  - Underlay health visibility

- **Enhanced VXLAN subscriptions**
  - VLAN member state
  - Connection point endpoints
  - On-change streaming for real-time updates

- **Added MLAG telemetry**
  - LACP interface state
  - LACP member state
  - Redundancy monitoring

- **Optimized sample intervals**
  - Interfaces: 10s (was 15s) for better granularity
  - BGP/EVPN: 30s for overlay health
  - System: 30s for resource monitoring
  - MLAG: 15s for redundancy tracking

- **Enhanced event processors**
  - Better metric name transformation
  - Interface name cleanup (Ethernet → eth)
  - Source label enrichment

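The subscription and processor changes listed above could look roughly like the following `gnmic.yaml` fragment. It is a sketch under stated assumptions, not the file from this PR: the subscription names, the processor name, the `interface_name` tag, and every path except the BGP neighbors path (which also appears in the testing section) are illustrative.

```yaml
# Hypothetical fragment of monitoring/gnmic/gnmic.yaml.
# Names and most paths are assumptions; intervals follow the list above.
subscriptions:
  interfaces:
    paths:
      - /interfaces/interface/state/counters   # assumed path
    mode: stream
    stream-mode: sample
    sample-interval: 10s                        # was 15s
  bgp-evpn:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  system:
    paths:
      - /system/state                           # assumed path
      - /system/memory/state                    # assumed path
      - /system/cpus/cpu/state                  # assumed path
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  mlag:
    paths:
      - /lacp/interfaces/interface              # assumed path
    mode: stream
    stream-mode: sample
    sample-interval: 15s
  vxlan:
    paths:
      - /network-instances/network-instance/connection-points   # assumed path
    mode: stream
    stream-mode: on-change                      # push updates as they happen

processors:
  # Hypothetical processor: shortens "Ethernet1" to "eth1" in the interface tag.
  shorten-interface-names:
    event-strings:
      tag-names:
        - "^interface_name$"                    # assumed tag name
      transforms:
        - replace:
            apply-on: "value"
            old: "Ethernet"
            new: "eth"

outputs:
  prom:
    type: prometheus
    listen: ":9804"
    event-processors:
      - shorten-interface-names
```

Whether a given path is actually served depends on the switch's OpenConfig support, so each one is worth checking with a one-off `gnmic subscribe` before adding it to the file.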
#### 📊 Key Metrics Now Available:
```
# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status

# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent

# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity

# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization
```

### 2. **Prometheus Configuration** (`monitoring/prometheus/prometheus.yml`)

#### ✅ Improvements:
- **Enhanced metric relabeling**
  - Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
  - Drop rule for unneeded metrics to reduce storage
  - Replaces the original, overly restrictive regex

- **Added topology label extraction**
  - Extracts device_type (spine/leaf) from source label
  - Extracts device_number for aggregation
  - Enables better Grafana queries

- **Additional cluster label**
  - Added `cluster: evpn-vxlan-lab` for multi-cluster scenarios

#### 📈 Metric Filtering Logic:
```yaml
# KEEP these patterns:
- gnmic_interfaces_.*             # All interface metrics
- gnmic_.*bgp.*                   # All BGP metrics
- gnmic_.*lacp.*                  # All LACP/MLAG metrics
- gnmic_system.*                  # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.*  # VXLAN/VLAN metrics

# DROP everything else matching gnmic_.*
```

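One way to express the filtering logic and label extraction above in `metric_relabel_configs` is sketched below. It is not taken from the PR: the job name, the `gnmic:9804` target, the `__tmp_keep` helper label, and the assumption that the `source` label looks like `spine1` or `leaf3` are all illustrative. Because Prometheus regexes have no negative lookahead, the sketch marks wanted series with a temporary label and then drops unmarked `gnmic_` series.

```yaml
# Hypothetical fragment of monitoring/prometheus/prometheus.yml.
scrape_configs:
  - job_name: gnmic                       # assumed job name
    static_configs:
      - targets: ["gnmic:9804"]           # assumed service name and port
        labels:
          cluster: evpn-vxlan-lab
    metric_relabel_configs:
      # 1. Mark every series whose name matches a KEEP pattern.
      - source_labels: [__name__]
        regex: 'gnmic_interfaces_.*|gnmic_.*bgp.*|gnmic_.*lacp.*|gnmic_system.*|gnmic_.*vxlan.*|gnmic_.*vlan.*'
        target_label: __tmp_keep
        replacement: "1"
      # 2. Drop gnmic_ series that were not marked. The joined value is
      #    "<name>;" when __tmp_keep is unset and "<name>;1" when it is set.
      - source_labels: [__name__, __tmp_keep]
        regex: 'gnmic_.*;'
        action: drop
      # 3. Remove the helper label again.
      - regex: __tmp_keep
        action: labeldrop
      # 4. Topology labels, assuming source values such as "spine1" or "leaf3".
      - source_labels: [source]
        regex: '(spine|leaf)(\d+).*'
        target_label: device_type
        replacement: "$1"
      - source_labels: [source]
        regex: '(spine|leaf)(\d+).*'
        target_label: device_number
        replacement: "$2"
```

A single `action: keep` rule with the same alternation would be shorter, but it would also drop series that do not start with `gnmic_`, which is stricter than the logic described above.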
### 3. **Docker Compose** (`monitoring/docker-compose.yml`)

#### ✅ Improvements:
- **Replaced archived weathermap plugin** with active alternatives
  - `agenty-flowcharting-panel` - Flow/flowchart visualization
  - `yesoreyeram-infinity-datasource` - Enhanced data sources

- **Enabled anonymous access** for easier demo/testing
  - Anonymous role: Viewer (read-only)
  - Still requires admin/admin for editing

- **Added health checks** for all services
  - gnmic: checks /metrics endpoint
  - prometheus: checks /-/healthy endpoint
  - grafana: checks /api/health endpoint

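The Compose changes above might translate into service definitions along these lines. This is a sketch, not the file from the PR: the image tags, the health check timings, and the assumption that `wget` is available inside each image are illustrative and should be verified against the images actually used.

```yaml
# Hypothetical fragment of monitoring/docker-compose.yml.
services:
  gnmic:
    image: ghcr.io/openconfig/gnmic:latest      # assumed image tag
    healthcheck:
      # Assumes wget exists in the image.
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9804/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3

  prometheus:
    image: prom/prometheus:latest               # assumed image tag
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3

  grafana:
    image: grafana/grafana:latest               # assumed image tag
    environment:
      # Plugins named in the list above.
      GF_INSTALL_PLUGINS: agenty-flowcharting-panel,yesoreyeram-infinity-datasource
      # Read-only anonymous access; editing still needs a login.
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```

With health checks defined, `docker compose ps` reports each container as `healthy` or `unhealthy`, which gives a quick first signal before digging into logs.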
### 4. **New Flow Topology Dashboard** (`monitoring/grafana/dashboards/fabric-flow-topology.json`)

#### 🎨 Features:
- **Mermaid-style flowchart** showing fabric topology
  - 2 Spines (AS 65000)
  - 8 Leaves in 4 VTEP pairs (AS 65001-65004)
  - MLAG peer-link visualization
  - All spine-to-leaf uplinks

- **Live bandwidth overlays** on links
  - Real-time rate calculations using Prometheus queries
  - Color-coded thresholds (green → yellow → orange → red)
  - Pattern matching for automatic metric association

- **Separate bandwidth graphs**
  - Spine interface bandwidth (TX/RX)
  - Leaf interface bandwidth (TX/RX)
  - Mean and max calculations in legend

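The rate calculations behind the bandwidth overlays come down to converting octet counters into bits per second. A minimal sketch as a Prometheus rule group is shown here; the group name, the rule names, and the 1m window are assumptions, and the dashboard in the PR may compute the same values inline in its panel queries instead.

```yaml
# Hypothetical rule file, e.g. loaded through rule_files in prometheus.yml.
groups:
  - name: fabric_bandwidth                # assumed group name
    rules:
      # Transmit rate in bits per second: octets/s * 8.
      - record: interface:out_bits:rate1m
        expr: rate(gnmic_interfaces_interface_state_counters_out_octets[1m]) * 8
      # Receive rate in bits per second.
      - record: interface:in_bits:rate1m
        expr: rate(gnmic_interfaces_interface_state_counters_in_octets[1m]) * 8
```

The window should span several scrape intervals so that `rate()` always has at least two samples to work with.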
## Testing the Changes

### 1. Validate gnmic Configuration
```bash
# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities

# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s
```

### 2. Check Prometheus Metrics
```bash
# Once stack is running (-s hides curl's progress meter)
curl -s http://localhost:9804/metrics | grep gnmic_interfaces

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets

# Query specific metric
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
```

### 3. Verify Grafana Dashboards
1. Access http://localhost:3000
2. Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
3. Verify:
   - Flow diagram renders correctly
   - Bandwidth overlays show on links
   - Time series graphs display data
   - Colors change based on utilization thresholds

## Comparison: Old vs New

### Old Configuration (weathermap)
- ❌ Used archived weathermap plugin (no longer maintained)
- ❌ Limited telemetry (interfaces only)
- ❌ No BGP/EVPN visibility
- ❌ Static bandwidth thresholds
- ❌ Manual metric path specification

### New Configuration (Flow Plugin)
- ✅ Uses actively maintained Flow Charting plugin
- ✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
- ✅ Full overlay health visibility
- ✅ Dynamic bandwidth visualization
- ✅ Pattern-based automatic metric mapping
- ✅ Better metric organization and filtering

## Next Steps

### Recommended Additional Enhancements

1. **Add BGP State Dashboard**
   - BGP neighbor states across fabric
   - EVPN route counts per VTEP
   - Session flap detection

2. **Add VXLAN Overlay Dashboard**
   - Active VNIs per VTEP
   - VTEP reachability matrix
   - L2/L3 VXLAN traffic stats

3. **Add MLAG Health Dashboard**
   - Peer-link status and bandwidth
   - MLAG port status
   - Dual-active detection events

4. **Add Alerting Rules**
   - BGP session down alerts
   - Interface utilization thresholds
   - MLAG peer-link failures

5. **Add Recording Rules** (optional, for performance)
   ```yaml
   # Example: Pre-calculate interface utilization percentages
   # The divisor 10000000000 assumes 10 Gbps links.
   - record: interface:bandwidth:utilization_percent
     expr: |
       (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
   ```

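As a starting point for the alerting rules suggested above, a rule file could look like the sketch below. The group name, alert names, thresholds, `for` durations, the `gnmic` job name, and the 10 Gbps link speed are all assumptions. BGP session and MLAG alerts are left out because their state values are string enums, and how those reach Prometheus depends on the gnmic output settings.

```yaml
# Hypothetical alert rule file; every name and threshold is illustrative.
groups:
  - name: fabric_alerts
    rules:
      - alert: InterfaceHighUtilization
        # Transmit utilization above 80% of an assumed 10 Gbps link.
        expr: |
          (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TX utilization on {{ $labels.source }} {{ $labels.interface_name }}"

      - alert: GnmicCollectorDown
        # Prometheus cannot scrape the gnmic exporter (job name assumed).
        expr: up{job="gnmic"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "gnmic metrics endpoint is not being scraped"
```

The file would be referenced from `rule_files` in `prometheus.yml`, and routing the alerts anywhere requires an Alertmanager, which is outside the scope of this stack as described.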
## Troubleshooting

### Issue: No metrics in Prometheus
**Check:**
```bash
# Verify gnmic is collecting
docker logs gnmic

# Check gnmic metrics endpoint
curl http://localhost:9804/metrics

# Verify Prometheus can scrape (Prometheus logs to stderr, so merge it into stdout)
docker logs prometheus 2>&1 | grep gnmic
```

### Issue: Flow diagram not rendering
**Check:**
1. Flow Charting plugin installed: Settings → Plugins → search "agenty"
2. Prometheus datasource configured: Configuration → Data Sources
3. Metric queries returning data in Explore view
4. Browser console for JavaScript errors

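If the Prometheus datasource from step 2 is missing, it can be provisioned from a file rather than created by hand. The sketch below assumes a provisioning file such as `grafana/provisioning/datasources/prometheus.yml` mounted into the Grafana container, and a Compose service named `prometheus`; both are hypothetical and not confirmed by this review.

```yaml
# Hypothetical Grafana datasource provisioning file.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                     # Grafana backend proxies the queries
    url: http://prometheus:9090       # assumed Compose service name
    isDefault: true
```

Grafana reads provisioning files at startup, so the container needs a restart after the file is added or changed.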
### Issue: Missing BGP metrics
**Check:**
```bash
# SSH to a switch
ssh admin@172.16.0.1

# Verify gNMI is enabled
show management api gnmi
```

If not enabled on switches, add to configs:
```
management api gnmi
   transport grpc default
```

## References

- [gnmic Documentation](https://gnmic.openconfig.net)
- [Agenty Flow Charting Plugin](https://grafana.com/grafana/plugins/agenty-flowcharting-panel/)
- [Nokia SRL Telemetry Lab](https://github.com/srl-labs/srl-telemetry-lab) (reference implementation)
- [Arista gNMI Documentation](https://aristanetworks.github.io/openmgmt/)

## Summary

This configuration review has moved the monitoring stack from an archived plugin with limited visibility to a modern, comprehensive telemetry solution:

- **Better Plugin**: Active Flow Charting vs archived weathermap
- **More Data**: 5 subscription types (interfaces, system, BGP, VXLAN, MLAG) vs 2
- **Better Filtering**: Explicit metric keeping vs overly restrictive regex
- **Health Checks**: Automated service health monitoring
- **Production Ready**: Comprehensive visibility of underlay AND overlay

The stack is now aligned with industry best practices as demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.