Add comprehensive configuration review documentation
monitoring/CONFIGURATION_REVIEW.md
# Configuration Review Summary

## Overview

This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.

## Changes Made

### 1. **gnmic Configuration** (`monitoring/gnmic/gnmic.yaml`)

#### ✅ Improvements:
- **Added BGP/EVPN telemetry subscriptions**
  - BGP neighbor state monitoring
  - EVPN AFI/SAFI metrics
  - Critical for overlay health visibility

- **Added routing telemetry**
  - Static routes monitoring
  - IPv4 unicast AFT entries
  - Underlay health visibility

- **Enhanced VXLAN subscriptions**
  - VLAN member state
  - Connection point endpoints
  - On-change streaming for real-time updates

- **Added MLAG telemetry**
  - LACP interface state
  - LACP member state
  - Redundancy monitoring

- **Optimized sample intervals**
  - Interfaces: 10s (was 15s) for better granularity
  - BGP/EVPN: 30s for overlay health
  - System: 30s for resource monitoring
  - MLAG: 15s for redundancy tracking

- **Enhanced event processors**
  - Better metric name transformation
  - Interface name cleanup (Ethernet → eth)
  - Source label enrichment

#### 📊 Key Metrics Now Available:
```
# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status

# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent

# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity

# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization
```
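
As a rough illustration of the subscriptions and sample intervals described in this section, a `gnmic.yaml` fragment might look like the sketch below. The subscription names are hypothetical. Only the BGP neighbors path is taken from this document (it appears in the test command under "Validate gnmic Configuration"); the other paths are assumptions inferred from the metric names and from OpenConfig models, so check them against the actual file and against what cEOS exposes.

```yaml
# Sketch only: subscription names are hypothetical and most paths are assumptions.
subscriptions:
  interfaces:
    paths:
      - /interfaces/interface/state/counters
      - /interfaces/interface/state/oper-status
      - /interfaces/interface/state/admin-status
    mode: stream
    stream-mode: sample
    sample-interval: 10s        # was 15s
  bgp-evpn:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  system:
    paths:
      - /system/state
      - /system/memory/state
      - /system/cpus/cpu/state
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  mlag-lacp:
    paths:
      - /lacp/interfaces/interface
    mode: stream
    stream-mode: sample
    sample-interval: 15s
  vxlan:
    paths:
      # Assumed OpenConfig paths for VLAN members and connection point endpoints
      - /network-instances/network-instance/vlans/vlan/members
      - /network-instances/network-instance/connection-points/connection-point/endpoints
    mode: stream
    stream-mode: on-change      # push updates only when the value changes
```

Sampled subscriptions suit counters that change constantly, while on-change streaming fits state that changes rarely, which is why the VXLAN paths are shown with `on-change` here.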

### 2. **Prometheus Configuration** (`monitoring/prometheus/prometheus.yml`)

#### ✅ Improvements:
- **Enhanced metric relabeling**
  - Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
  - Drop rule for unneeded metrics to reduce storage
  - Replaces the original, overly restrictive regex

- **Added topology label extraction**
  - Extracts `device_type` (spine/leaf) from the `source` label
  - Extracts `device_number` for aggregation
  - Enables simpler Grafana queries per device role

- **Additional cluster label**
  - Added `cluster: evpn-vxlan-lab` for multi-cluster scenarios

#### 📈 Metric Filtering Logic:
```yaml
# KEEP these patterns:
- gnmic_interfaces_.*             # All interface metrics
- gnmic_.*bgp.*                   # All BGP metrics
- gnmic_.*lacp.*                  # All LACP/MLAG metrics
- gnmic_system.*                  # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.*  # VXLAN/VLAN metrics

# DROP everything else matching gnmic_.*
```
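
One way the filtering and label extraction described in this section could be written in `prometheus.yml` is sketched below. The job name, the `gnmic:9804` target address, and the assumed format of the `source` label (for example `spine1` or `leaf3:6030`) are illustrative assumptions, not values taken from the actual file. Note that consecutive `keep` rules are applied one after another, so a series has to match all of them; the sketch therefore folds every pattern into a single alternation.

```yaml
# Sketch only: job name, target address, and source-label format are assumptions.
scrape_configs:
  - job_name: gnmic
    static_configs:
      - targets: ['gnmic:9804']
        labels:
          cluster: evpn-vxlan-lab
    metric_relabel_configs:
      # Keep only the metric families listed above. Prometheus anchors the
      # regex at both ends, so every other scraped series is dropped.
      - source_labels: [__name__]
        regex: 'gnmic_(interfaces_.*|.*bgp.*|.*lacp.*|system.*|.*vxlan.*|.*vlan.*)'
        action: keep
      # Derive device_type (spine or leaf) from the source label.
      - source_labels: [source]
        regex: '(spine|leaf)(\d+).*'
        target_label: device_type
        replacement: '$1'
      # Derive device_number from the same label for aggregation.
      - source_labels: [source]
        regex: '(spine|leaf)(\d+).*'
        target_label: device_number
        replacement: '$2'
```

A single `keep` rule like this also removes any non-`gnmic_` series exposed by the same endpoint, which is slightly broader than "drop everything else matching `gnmic_.*`"; whether that matters depends on what else the exporter publishes.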

### 3. **Docker Compose** (`monitoring/docker-compose.yml`)

#### ✅ Improvements:
- **Replaced archived weathermap plugin** with active alternatives
  - `agenty-flowcharting-panel` - Flow/flowchart visualization
  - `yesoreyeram-infinity-datasource` - Enhanced data sources

- **Enabled anonymous access** for easier demo/testing
  - Anonymous role: Viewer (read-only)
  - Editing still requires logging in as admin (admin/admin)

- **Added health checks** for all services
  - gnmic: checks `/metrics` endpoint
  - prometheus: checks `/-/healthy` endpoint
  - grafana: checks `/api/health` endpoint
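
The Compose changes listed above could look roughly like the fragment below. It is a sketch to merge into existing service definitions rather than a complete file: the intervals and retry counts are assumptions, and the health check commands assume `wget` and a shell are available inside each image, which should be verified per image.

```yaml
# Sketch only: timings are assumptions, and wget availability must be checked per image.
services:
  gnmic:
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:9804/metrics > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
  prometheus:
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:9090/-/healthy > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
  grafana:
    environment:
      # Plugins replacing the archived weathermap panel
      GF_INSTALL_PLUGINS: agenty-flowcharting-panel,yesoreyeram-infinity-datasource
      # Read-only anonymous access for demos
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health > /dev/null || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
```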

### 4. **New Flow Topology Dashboard** (`monitoring/grafana/dashboards/fabric-flow-topology.json`)

#### 🎨 Features:
- **Mermaid-style flowchart** showing fabric topology
  - 2 Spines (AS 65000)
  - 8 Leaves in 4 VTEP pairs (AS 65001-65004)
  - MLAG peer-link visualization
  - All spine-to-leaf uplinks

- **Live bandwidth overlays** on links
  - Real-time rate calculations using Prometheus queries
  - Color-coded thresholds (green → yellow → orange → red)
  - Pattern matching for automatic metric association

- **Separate bandwidth graphs**
  - Spine interface bandwidth (TX/RX)
  - Leaf interface bandwidth (TX/RX)
  - Mean and max calculations in legend

## Testing the Changes

### 1. Validate gnmic Configuration
```bash
# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities

# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s
```
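
The connection flags used in these commands map onto settings in `gnmic.yaml`. The sketch below shows one plausible arrangement; the target name and the output name are hypothetical, and the Prometheus output port is assumed from the `localhost:9804` endpoint queried later in this document.

```yaml
# Sketch only: mirrors the CLI flags above; names are hypothetical.
username: admin
password: admin
insecure: true              # same effect as --insecure (no TLS)

targets:
  172.16.0.1:6030:
    name: spine1            # hypothetical display name for this target

outputs:
  prometheus:
    type: prometheus
    listen: ":9804"         # endpoint scraped by Prometheus
    path: /metrics
    metric-prefix: gnmic    # yields names such as gnmic_interfaces_...
```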

### 2. Check Prometheus Metrics
```bash
# Once the stack is running, check the gnmic exporter endpoint
curl -s http://localhost:9804/metrics | grep gnmic_interfaces

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets

# Query specific metric
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
```

### 3. Verify Grafana Dashboards
1. Open http://localhost:3000
2. Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
3. Verify:
   - Flow diagram renders correctly
   - Bandwidth overlays show on links
   - Time series graphs display data
   - Colors change based on utilization thresholds

## Comparison: Old vs New

### Old Configuration (weathermap)
- ❌ Used archived weathermap plugin (no longer maintained)
- ❌ Limited telemetry (interfaces only)
- ❌ No BGP/EVPN visibility
- ❌ Static bandwidth thresholds
- ❌ Manual metric path specification

### New Configuration (Flow Plugin)
- ✅ Uses actively maintained Flow Charting plugin
- ✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
- ✅ Full overlay health visibility
- ✅ Dynamic bandwidth visualization
- ✅ Pattern-based automatic metric mapping
- ✅ Better metric organization and filtering

## Next Steps

### Recommended Additional Enhancements

1. **Add BGP State Dashboard**
   - BGP neighbor states across fabric
   - EVPN route counts per VTEP
   - Session flap detection

2. **Add VXLAN Overlay Dashboard**
   - Active VNIs per VTEP
   - VTEP reachability matrix
   - L2/L3 VXLAN traffic stats

3. **Add MLAG Health Dashboard**
   - Peer-link status and bandwidth
   - MLAG port status
   - Dual-active detection events

4. **Add Alerting Rules**
   - BGP session down alerts
   - Interface utilization thresholds
   - MLAG peer-link failures

5. **Add Recording Rules** (optional, for performance)
   ```yaml
   # Example: Pre-calculate interface utilization percentages
   # (the 10000000000 divisor assumes 10 Gbps links)
   groups:
     - name: interface-utilization  # hypothetical group name
       rules:
         - record: interface:bandwidth:utilization_percent
           expr: |
             (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
   ```
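
The alerting rules suggested in item 4 could start from a sketch like the one below. Everything here is an assumption to adapt: the group and alert names, the `gnmic` job name, the 80% threshold, and the 10 Gbps link speed. Because it is unclear how the BGP session state is exported as a Prometheus value, the second rule uses "no prefixes received" as a stand-in signal rather than matching the session state directly.

```yaml
# Sketch only: names, thresholds, job label, and link speed are assumptions.
groups:
  - name: fabric-alerts
    rules:
      # The gnmic exporter itself cannot be scraped.
      - alert: GnmicTargetDown
        expr: up{job="gnmic"} == 0
        for: 2m
        labels:
          severity: critical
      # Proxy for a BGP problem: a neighbor that sends no prefixes.
      - alert: BGPNoPrefixesReceived
        expr: gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received == 0
        for: 2m
        labels:
          severity: warning
      # Egress utilization above 80% of an assumed 10 Gbps link.
      - alert: InterfaceHighUtilization
        expr: (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100 > 80
        for: 5m
        labels:
          severity: warning
```

The `for` clauses keep short spikes or a single missed scrape from firing an alert.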

## Troubleshooting

### Issue: No metrics in Prometheus
**Check:**
```bash
# Verify gnmic is collecting
docker logs gnmic

# Check gnmic metrics endpoint
curl -s http://localhost:9804/metrics

# Verify Prometheus can scrape (Prometheus logs to stderr, so redirect it)
docker logs prometheus 2>&1 | grep gnmic
```

### Issue: Flow diagram not rendering
**Check:**
1. Flow Charting plugin installed: Settings → Plugins → search "agenty"
2. Prometheus datasource configured: Configuration → Data Sources
3. Metric queries returning data in Explore view
4. Browser console for JavaScript errors

### Issue: Missing BGP metrics
**Check:**
```bash
# SSH to a switch
ssh admin@172.16.0.1

# Verify gNMI is enabled (run in the EOS CLI)
show management api gnmi
```

If gNMI is not enabled on the switches, add the following to their configs:
```
management api gnmi
   transport grpc default
```

## References

- [gnmic Documentation](https://gnmic.openconfig.net)
- [Agenty Flow Charting Plugin](https://grafana.com/grafana/plugins/agenty-flowcharting-panel/)
- [Nokia SRL Telemetry Lab](https://github.com/srl-labs/srl-telemetry-lab) (reference implementation)
- [Arista gNMI Documentation](https://aristanetworks.github.io/openmgmt/)

## Summary

This configuration review has moved your monitoring stack from an archived plugin with limited visibility to a modern, comprehensive telemetry solution:

- **Better Plugin**: Active Flow Charting vs archived weathermap
- **More Data**: 5 subscription types (interfaces, system, BGP, VXLAN, MLAG) vs 2 previously
- **Better Filtering**: Explicit metric keeping vs overly restrictive regex
- **Health Checks**: Automated service health monitoring
- **Production Ready**: Comprehensive visibility of underlay AND overlay

The stack now follows the approach demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.