Damien Arnodo 4b657a4e1e Add comprehensive configuration review documentation

2025-12-16 18:53:40 +00:00

Configuration Review Summary

Overview

This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.

Changes Made

1. gnmic Configuration (`monitoring/gnmic/gnmic.yaml`)

✅ Improvements:

Added BGP/EVPN telemetry subscriptions
- BGP neighbor state monitoring
- EVPN AFI/SAFI metrics
- Critical for overlay health visibility
Added routing telemetry
- Static routes monitoring
- IPv4 unicast AFT entries
- Underlay health visibility
Enhanced VXLAN subscriptions
- VLAN member state
- Connection point endpoints
- On-change streaming for real-time updates
Added MLAG telemetry
- LACP interface state
- LACP member state
- Redundancy monitoring
Optimized sample intervals
- Interfaces: 10s (was 15s) for better granularity
- BGP/EVPN: 30s for overlay health
- System: 30s for resource monitoring
- MLAG: 15s for redundancy tracking
Enhanced event processors
- Better metric name transformation
- Interface name cleanup (Ethernet → eth)
- Source label enrichment
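
The interface name cleanup can be illustrated outside gnmic with a small shell helper. This is only a sketch of the string transformation; `short_ifname` is a hypothetical name, and the actual processor definition in `gnmic.yaml` may be written differently.

```shell
# Hypothetical helper: mimics the "Ethernet -> eth" cleanup on one name.
# Only a leading "Ethernet" is rewritten; other names pass through.
short_ifname() {
  printf '%s\n' "$1" | sed 's/^Ethernet/eth/'
}

short_ifname "Ethernet1"      # eth1
short_ifname "Ethernet49/1"   # eth49/1
short_ifname "Management0"    # Management0 (unchanged)
```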

📊 Key Metrics Now Available:

# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status

# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent

# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity

# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization
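
To confirm that these names are actually being exported, a small filter over the exposition text can help. The sketch below assumes the gnmic Prometheus output is reachable as plain text; `missing_metrics` is a hypothetical helper, and the list it checks is only a subset of the names above.

```shell
# Reads Prometheus exposition text on stdin and prints every expected
# metric name that has no sample line in it.
missing_metrics() {
  input=$(cat)
  for name in \
    gnmic_interfaces_interface_state_counters_in_octets \
    gnmic_interfaces_interface_state_counters_out_octets \
    gnmic_interfaces_interface_state_oper_status \
    gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
  do
    # A sample line starts with the metric name followed by "{" or a space.
    printf '%s\n' "$input" | grep -Eq "^${name}(\{| )" || printf '%s\n' "$name"
  done
}
```

With the stack running, `curl -s http://localhost:9804/metrics | missing_metrics` should print nothing when all four metrics are present.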

2. Prometheus Configuration (`monitoring/prometheus/prometheus.yml`)

✅ Improvements:

Enhanced metric relabeling
- Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
- Drop rule for unneeded metrics to reduce storage
- Replaces the original, overly restrictive regex
Added topology label extraction
- Extracts device_type (spine/leaf) from source label
- Extracts device_number for aggregation
- Enables better Grafana queries
Additional cluster label
- Added cluster: evpn-vxlan-lab for multi-cluster scenarios
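
The topology label extraction amounts to a regex capture on the source label. The sketch below reproduces that capture in shell, assuming source labels of the form `spine1` or `leaf3`; that naming, and the helper name `topology_labels`, are assumptions rather than something taken from `prometheus.yml`.

```shell
# Splits a source label such as "leaf3" into device_type and
# device_number. Prints nothing when the label has another format.
topology_labels() {
  printf '%s\n' "$1" |
    sed -nE 's/^(spine|leaf)([0-9]+)$/device_type=\1 device_number=\2/p'
}

topology_labels spine1            # device_type=spine device_number=1
topology_labels leaf12            # device_type=leaf device_number=12
topology_labels 172.16.0.1:6030   # no output: not a spine/leaf name
```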

📈 Metric Filtering Logic:

# KEEP these patterns:
- gnmic_interfaces_.*          # All interface metrics
- gnmic_.*bgp.*                # All BGP metrics  
- gnmic_.*lacp.*               # All LACP/MLAG metrics
- gnmic_system.*               # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.* # VXLAN/VLAN metrics

# DROP everything else matching gnmic_.*
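
The keep/drop logic can be checked against individual metric names before touching Prometheus. This is a sketch that combines the patterns listed above into one fully anchored regex, as Prometheus relabeling anchors its regexes; `metric_action` and the metric `gnmic_acl_state_counter` are hypothetical and used only for illustration.

```shell
# Prints "keep", "drop", or "untouched" for a metric name, following
# the filtering logic above. Names outside gnmic_* are left alone.
metric_action() {
  case "$1" in
    gnmic_*) ;;
    *) echo "untouched"; return ;;
  esac
  if printf '%s\n' "$1" |
    grep -Eq '^gnmic_(interfaces_.*|.*bgp.*|.*lacp.*|system.*|.*vxlan.*|.*vlan.*)$'
  then
    echo "keep"
  else
    echo "drop"
  fi
}

for m in gnmic_interfaces_interface_state_oper_status \
         gnmic_system_state_hostname \
         gnmic_acl_state_counter \
         up; do
  echo "$m: $(metric_action "$m")"
done
```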

3. Docker Compose (`monitoring/docker-compose.yml`)

✅ Improvements:

Replaced archived weathermap plugin with active alternatives
- agenty-flowcharting-panel - Flow/flowchart visualization
- yesoreyeram-infinity-datasource - Enhanced data sources
Enabled anonymous access for easier demo/testing
- Anonymous role: Viewer (read-only)
- Still requires admin/admin for editing
Added health checks for all services
- gnmic: checks /metrics endpoint
- prometheus: checks /-/healthy endpoint
- grafana: checks /api/health endpoint
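
The same endpoints can be probed from the host when a container reports unhealthy. The sketch below assumes the ports used elsewhere in this document (9804, 9090, 3000) are published on localhost; `check_endpoint` and `check_stack` are hypothetical helpers, not part of the compose file.

```shell
# Probes one HTTP endpoint and reports the result under a service name.
check_endpoint() {
  # -f: fail on HTTP errors, -s: silent, --max-time: do not hang forever
  if curl -fs --max-time 5 -o /dev/null "$2"; then
    echo "$1 ok"
  else
    echo "$1 FAILED"
    return 1
  fi
}

# Probes all three services; returns non-zero if any probe failed.
check_stack() {
  rc=0
  check_endpoint gnmic      http://localhost:9804/metrics    || rc=1
  check_endpoint prometheus http://localhost:9090/-/healthy  || rc=1
  check_endpoint grafana    http://localhost:3000/api/health || rc=1
  return $rc
}
```

Running `check_stack` prints one line per service and can be used as a quick gate in a lab bring-up script.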

4. New Flow Topology Dashboard (`monitoring/grafana/dashboards/fabric-flow-topology.json`)

🎨 Features:

Mermaid-style flowchart showing fabric topology
- 2 Spines (AS 65000)
- 8 Leaves in 4 VTEP pairs (AS 65001-65004)
- MLAG peer-link visualization
- All spine-to-leaf uplinks
Live bandwidth overlays on links
- Real-time rate calculations using Prometheus queries
- Color-coded thresholds (green → yellow → orange → red)
- Pattern matching for automatic metric association
Separate bandwidth graphs
- Spine interface bandwidth (TX/RX)
- Leaf interface bandwidth (TX/RX)
- Mean and max calculations in legend

Testing the Changes

1. Validate gnmic Configuration

# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities

# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s

2. Check Prometheus Metrics

# Once stack is running
curl -s http://localhost:9804/metrics | grep gnmic_interfaces

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query specific metric
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'

3. Verify Grafana Dashboards

Access http://localhost:3000
Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
Verify:
- Flow diagram renders correctly
- Bandwidth overlays show on links
- Time series graphs display data
- Colors change based on utilization thresholds

Comparison: Old vs New

Old Configuration (weathermap)

❌ Used archived weathermap plugin (no longer maintained)
❌ Limited telemetry (interfaces only)
❌ No BGP/EVPN visibility
❌ Static bandwidth thresholds
❌ Manual metric path specification

New Configuration (Flow Plugin)

✅ Uses actively maintained Flow Charting plugin
✅ Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
✅ Full overlay health visibility
✅ Dynamic bandwidth visualization
✅ Pattern-based automatic metric mapping
✅ Better metric organization and filtering

Next Steps

Recommended Additional Enhancements

Add BGP State Dashboard
- BGP neighbor states across fabric
- EVPN route counts per VTEP
- Session flap detection
Add VXLAN Overlay Dashboard
- Active VNIs per VTEP
- VTEP reachability matrix
- L2/L3 VXLAN traffic stats
Add MLAG Health Dashboard
- Peer-link status and bandwidth
- MLAG port status
- Dual-active detection events
Add Alerting Rules
- BGP session down alerts
- Interface utilization thresholds
- MLAG peer-link failures

Add Recording Rules (optional, for performance)

# Example: Pre-calculate interface utilization percentages
# (the 10000000000 divisor assumes 10 Gbps links)
- record: interface:bandwidth:utilization_percent
  expr: |
    (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
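
The arithmetic in that expression can be verified by hand. The sketch below applies the same formula (bytes per second, times 8, divided by link capacity in bits per second, times 100) with the capacity passed as a parameter; `utilization_pct` is a hypothetical helper name.

```shell
# utilization_pct <bytes_per_second> <link_capacity_bps>
# Prints link utilization as a percentage with two decimals.
utilization_pct() {
  awk -v r="$1" -v cap="$2" \
    'BEGIN { printf "%.2f\n", r * 8 / cap * 100 }'
}

utilization_pct 125000000 10000000000    # 10.00 (1 Gbit/s on a 10G link)
utilization_pct 625000000 10000000000    # 50.00 (5 Gbit/s on a 10G link)
```

Passing the capacity explicitly makes it easy to see how the result changes on links that are not 10 Gbps.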

Troubleshooting

Issue: No metrics in Prometheus

Check:

# Verify gnmic is collecting
docker logs gnmic

# Check gnmic metrics endpoint
curl http://localhost:9804/metrics

# Verify Prometheus can scrape (Prometheus logs to stderr, so merge it into the pipe)
docker logs prometheus 2>&1 | grep gnmic

Issue: Flow diagram not rendering

Check:

Flow Charting plugin installed: Settings → Plugins → search "agenty"
Prometheus datasource configured: Configuration → Data Sources
Metric queries returning data in Explore view
Browser console for JavaScript errors

Issue: Missing BGP metrics

Check:

# SSH to a switch
ssh admin@172.16.0.1

# Verify gNMI is enabled
show management api gnmi

If not enabled on switches, add to configs:

management api gnmi
   transport grpc default

References

gnmic Documentation
Agenty Flow Charting Plugin
Nokia SRL Telemetry Lab (reference implementation)
Arista gNMI Documentation

Summary

This configuration review has transformed your monitoring stack from using an archived plugin with limited visibility to a modern, comprehensive telemetry solution:

Better Plugin: Active Flow Charting vs archived weathermap
More Data: 5 subscription types vs 2 (interfaces, system, BGP, VXLAN, MLAG)
Better Filtering: Explicit metric keeping vs overly restrictive regex
Health Checks: Automated service health monitoring
Production Ready: Comprehensive visibility of underlay AND overlay

The stack is now aligned with industry best practices as demonstrated in the Nokia SRL telemetry lab, adapted specifically for Arista cEOS switches.
