Configuration Review Summary

Overview

This document summarizes the configuration review and enhancements made to the EVPN-VXLAN monitoring stack to support Flow Plugin visualization.

Changes Made

1. gnmic Configuration (monitoring/gnmic/gnmic.yaml)

Improvements:

  • Added BGP/EVPN telemetry subscriptions

    • BGP neighbor state monitoring
    • EVPN AFI/SAFI metrics
    • Critical for overlay health visibility
  • Added routing telemetry

    • Static routes monitoring
    • IPv4 unicast AFT entries
    • Underlay health visibility
  • Enhanced VXLAN subscriptions

    • VLAN member state
    • Connection point endpoints
    • On-change streaming for real-time updates
  • Added MLAG telemetry

    • LACP interface state
    • LACP member state
    • Redundancy monitoring
  • Optimized sample intervals

    • Interfaces: 10s (was 15s) for better granularity
    • BGP/EVPN: 30s for overlay health
    • System: 30s for resource monitoring
    • MLAG: 15s for redundancy tracking
  • Enhanced event processors

    • Better metric name transformation
    • Interface name cleanup (Ethernet → eth)
    • Source label enrichment
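
The interface name cleanup can be illustrated outside gnmic. The processor definition itself is not reproduced in this review, so the Python below is only a sketch of the rewrite rule, assuming interface names follow the `Ethernet<port>` form; the function name is hypothetical.

```python
import re

# Assumption: Arista-style names such as "Ethernet1" or "Ethernet3/1".
# Only a leading "Ethernet" followed by a digit is rewritten.
_ETHERNET_PREFIX = re.compile(r"^Ethernet(?=\d)")


def shorten_interface_name(name: str) -> str:
    """Rewrite a leading 'Ethernet' to 'eth'; leave every other name untouched."""
    return _ETHERNET_PREFIX.sub("eth", name)


if __name__ == "__main__":
    for name in ("Ethernet1", "Ethernet3/1", "Management0", "Port-Channel10"):
        print(name, "->", shorten_interface_name(name))
```

Anchoring the pattern at the start of the name keeps names such as `Management0` and `Port-Channel10` unchanged.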

📊 Key Metrics Now Available:

# Interface metrics (for Flow Plugin)
gnmic_interfaces_interface_state_counters_in_octets
gnmic_interfaces_interface_state_counters_out_octets
gnmic_interfaces_interface_state_oper_status
gnmic_interfaces_interface_state_admin_status

# BGP/EVPN metrics (overlay health)
gnmic_network_instances_bgp_neighbors_neighbor_state_session_state
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_received
gnmic_network_instances_bgp_neighbors_neighbor_afi_safis_state_prefixes_sent

# MLAG metrics (redundancy)
gnmic_lacp_interfaces_interface_state_system_priority
gnmic_lacp_interfaces_interface_members_member_state_activity

# System metrics
gnmic_system_state_hostname
gnmic_system_memory_state_physical
gnmic_system_cpus_cpu_state_total_utilization

2. Prometheus Configuration (monitoring/prometheus/prometheus.yml)

Improvements:

  • Enhanced metric relabeling

    • Explicit keep rules for interface, BGP, MLAG, system, and VXLAN metrics
    • Drop rule for unneeded metrics to reduce storage
    • Replaces the original regex, which was overly restrictive
  • Added topology label extraction

    • Extracts device_type (spine/leaf) from source label
    • Extracts device_number for aggregation
    • Enables better Grafana queries
  • Additional cluster label

    • Added cluster: evpn-vxlan-lab for multi-cluster scenarios
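
The topology label extraction is done by Prometheus relabeling, and the actual rules are not shown in this review. As a rough illustration of the idea, the Python sketch below assumes the `source` label contains a device name such as `spine1` or `leaf8` (possibly followed by a port); the function name and the sample values are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: the source label embeds "spine<N>" or "leaf<N>".
_DEVICE = re.compile(r"(spine|leaf)(\d+)")


def topology_labels(source: str) -> Optional[Dict[str, str]]:
    """Derive device_type and device_number from a source label.

    Returns None when the label does not look like a fabric device.
    Values stay strings, as Prometheus label values always are.
    """
    match = _DEVICE.search(source)
    if match is None:
        return None
    return {"device_type": match.group(1), "device_number": match.group(2)}


if __name__ == "__main__":
    for source in ("spine1", "leaf8:6030", "mgmt-host"):
        print(source, "->", topology_labels(source))
```

With labels like these, a Grafana query can aggregate by `device_type` instead of matching on hostnames.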

📈 Metric Filtering Logic:

# KEEP these patterns:
- gnmic_interfaces_.*          # All interface metrics
- gnmic_.*bgp.*                # All BGP metrics  
- gnmic_.*lacp.*               # All LACP/MLAG metrics
- gnmic_system.*               # All system metrics
- gnmic_.*vxlan.*|gnmic_.*vlan.* # VXLAN/VLAN metrics

# DROP everything else matching gnmic_.*
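
To sanity-check the filtering logic before editing prometheus.yml, the keep/drop decision can be emulated in a few lines of Python. This is a sketch of the logic listed above, not the relabel configuration itself. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mirrors. The metric name `gnmic_components_component_state_temperature` is a hypothetical example of a metric that would be dropped.

```python
import re

# The KEEP patterns from the list above, combined into one alternation.
KEEP = re.compile(
    r"gnmic_interfaces_.*"
    r"|gnmic_.*bgp.*"
    r"|gnmic_.*lacp.*"
    r"|gnmic_system.*"
    r"|gnmic_.*vxlan.*"
    r"|gnmic_.*vlan.*"
)


def is_kept(metric_name: str) -> bool:
    """Return True if the metric survives the keep/drop logic."""
    if not metric_name.startswith("gnmic_"):
        # The drop rule only targets gnmic_.*; other metrics pass through.
        return True
    return KEEP.fullmatch(metric_name) is not None


if __name__ == "__main__":
    for name in (
        "gnmic_interfaces_interface_state_counters_in_octets",
        "gnmic_network_instances_bgp_neighbors_neighbor_state_session_state",
        "gnmic_components_component_state_temperature",  # hypothetical name
        "up",
    ):
        print(f"{name}: {'keep' if is_kept(name) else 'drop'}")
```

One design point worth noting: consecutive `keep` relabel rules are applied in sequence, so a metric must match all of them. Combining the patterns into a single alternation, as here, avoids dropping everything by accident.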

3. Docker Compose (monitoring/docker-compose.yml)

Improvements:

  • Replaced archived weathermap plugin with active alternatives

    • agenty-flowcharting-panel - Flow/flowchart visualization
    • yesoreyeram-infinity-datasource - Enhanced data sources
  • Enabled anonymous access for easier demo/testing

    • Anonymous role: Viewer (read-only)
    • Editing still requires logging in as admin/admin
  • Added health checks for all services

    • gnmic: checks /metrics endpoint
    • prometheus: checks /-/healthy endpoint
    • grafana: checks /api/health endpoint
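
The same three endpoints can be probed from the host when debugging a failing health check. The sketch below assumes the host port mappings used elsewhere in this document (9804, 9090, 3000) and treats HTTP 200 as healthy; the function names are hypothetical, and the probes in docker-compose.yml may be implemented differently.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Assumption: services are published on localhost with these ports.
HEALTH_URLS = {
    "gnmic": "http://localhost:9804/metrics",
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def check_services(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers with 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_URLS.items()}
```

Calling `check_services()` with the stack running returns a dictionary keyed by service name. The `fetch` parameter exists so the logic can be exercised without a network.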

4. New Flow Topology Dashboard (monitoring/grafana/dashboards/fabric-flow-topology.json)

🎨 Features:

  • Mermaid-style flowchart showing fabric topology

    • 2 Spines (AS 65000)
    • 8 Leaves in 4 VTEP pairs (AS 65001-65004)
    • MLAG peer-link visualization
    • All spine-to-leaf uplinks
  • Live bandwidth overlays on links

    • Real-time rate calculations using Prometheus queries
    • Color-coded thresholds (green → yellow → orange → red)
    • Pattern matching for automatic metric association
  • Separate bandwidth graphs

    • Spine interface bandwidth (TX/RX)
    • Leaf interface bandwidth (TX/RX)
    • Mean and max calculations in legend
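
The bandwidth overlays boil down to two steps: turn an octet counter into a bit rate, then map utilization to a color. The Python sketch below illustrates both. The threshold values are hypothetical (the dashboard JSON is not reproduced here), and the rate function ignores counter resets, which the Prometheus `rate()` function handles on its own.

```python
from typing import Sequence, Tuple

# Hypothetical thresholds in percent of link capacity, checked from high to low.
THRESHOLDS: Sequence[Tuple[float, str]] = (
    (90.0, "red"),
    (75.0, "orange"),
    (50.0, "yellow"),
)


def bits_per_second(octets_start: int, octets_end: int, seconds: float) -> float:
    """Average bit rate between two samples of an octet counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (octets_end - octets_start) * 8 / seconds


def link_color(utilization_percent: float) -> str:
    """Pick the overlay color for a link; below every threshold is green."""
    for floor, color in THRESHOLDS:
        if utilization_percent >= floor:
            return color
    return "green"
```

For example, 1,250,000 octets transferred over 10 seconds works out to 1 Mbit/s, which on any realistic fabric link stays in the green band.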

Testing the Changes

1. Validate gnmic Configuration

# Test from gnmic container or locally with gnmic installed
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure capabilities

# Test specific subscription
gnmic -a 172.16.0.1:6030 -u admin -p admin --insecure \
  subscribe --path /network-instances/network-instance/protocols/protocol/bgp/neighbors \
  --stream-mode sample --sample-interval 10s

2. Check Prometheus Metrics

# Once the stack is running, check the gnmic metrics endpoint
curl -s http://localhost:9804/metrics | grep gnmic_interfaces

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Query specific metric
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets'
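
The query endpoint returns JSON, so the result can be checked in a script rather than by eye. The sketch below parses the instant-vector response format of the Prometheus HTTP API. The example body is made up, including its label names (`source`, `interface_name`) and values.

```python
import json
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]


def parse_instant_vector(body: str) -> List[Sample]:
    """Extract (labels, value) pairs from a /api/v1/query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample's "value" is [unix_timestamp, "<value as a string>"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical response body, for illustration only.
EXAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "gnmic_interfaces_interface_state_counters_out_octets",
                    "source": "spine1",
                    "interface_name": "eth1",
                },
                "value": [1700000000.0, "123456"],
            }
        ],
    },
})

if __name__ == "__main__":
    for labels, value in parse_instant_vector(EXAMPLE_BODY):
        print(labels.get("source"), labels.get("interface_name"), value)
```

An empty list from this function means the query succeeded but matched no series, which usually points at the relabeling rules or the gnmic subscriptions rather than at Prometheus itself.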

3. Verify Grafana Dashboards

  1. Access http://localhost:3000
  2. Navigate to Dashboards → EVPN-VXLAN Fabric Flow Topology
  3. Verify:
    • Flow diagram renders correctly
    • Bandwidth overlays show on links
    • Time series graphs display data
    • Colors change based on utilization thresholds

Comparison: Old vs New

Old Configuration (weathermap)

  • Used archived weathermap plugin (no longer maintained)
  • Limited telemetry (interfaces only)
  • No BGP/EVPN visibility
  • Static bandwidth thresholds
  • Manual metric path specification

New Configuration (Flow Plugin)

  • Uses actively maintained Flow Charting plugin
  • Comprehensive telemetry (interfaces, BGP, EVPN, MLAG, system)
  • Full overlay health visibility
  • Dynamic bandwidth visualization
  • Pattern-based automatic metric mapping
  • Better metric organization and filtering

Next Steps

  1. Add BGP State Dashboard

    • BGP neighbor states across fabric
    • EVPN route counts per VTEP
    • Session flap detection
  2. Add VXLAN Overlay Dashboard

    • Active VNIs per VTEP
    • VTEP reachability matrix
    • L2/L3 VXLAN traffic stats
  3. Add MLAG Health Dashboard

    • Peer-link status and bandwidth
    • MLAG port status
    • Dual-active detection events
  4. Add Alerting Rules

    • BGP session down alerts
    • Interface utilization thresholds
    • MLAG peer-link failures
  5. Add Recording Rules (optional, for performance)

    # Example: Pre-calculate interface utilization percentages
    # Note: 10000000000 is the link speed in bit/s (10 Gbit/s); adjust it for other link speeds
    - record: interface:bandwidth:utilization_percent
      expr: |
        (rate(gnmic_interfaces_interface_state_counters_out_octets[5m]) * 8 / 10000000000) * 100
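
The arithmetic in the recording rule can be checked by hand: `rate()` yields bytes per second, multiplying by 8 converts to bits per second, and dividing by the link speed gives the fraction of capacity in use. A minimal Python sketch of the same calculation follows, assuming the 10 Gbit/s divisor used in the rule; the function name is hypothetical.

```python
# Same constant as the recording rule: link speed in bit/s (10 Gbit/s).
LINK_SPEED_BPS = 10_000_000_000


def utilization_percent(bytes_per_second: float,
                        link_speed_bps: float = LINK_SPEED_BPS) -> float:
    """Convert a byte rate into percent of link capacity."""
    return bytes_per_second * 8 / link_speed_bps * 100


if __name__ == "__main__":
    # 125,000,000 bytes/s is 1 Gbit/s, i.e. a tenth of a 10 Gbit/s link.
    print(utilization_percent(125_000_000))
```

If the lab links run at a different speed, the hard-coded divisor overstates or understates utilization, so it is worth confirming the interface speed before relying on the percentage.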
    

Troubleshooting

Issue: No metrics in Prometheus

Check:

# Verify gnmic is collecting
docker logs gnmic

# Check gnmic metrics endpoint
curl http://localhost:9804/metrics

# Verify Prometheus can scrape (Prometheus logs to stderr, so redirect it for grep)
docker logs prometheus 2>&1 | grep gnmic

Issue: Flow diagram not rendering

Check:

  1. Flow Charting plugin installed: Settings → Plugins → search "agenty"
  2. Prometheus datasource configured: Configuration → Data Sources
  3. Metric queries returning data in Explore view
  4. Browser console for JavaScript errors

Issue: Missing BGP metrics

Check:

# SSH to a switch
ssh admin@172.16.0.1

# Verify gNMI is enabled
show management api gnmi

If gNMI is not enabled on the switches, add the following to their configurations:

management api gnmi
   transport grpc default

Summary

This configuration review has transformed your monitoring stack from using an archived plugin with limited visibility to a modern, comprehensive telemetry solution:

  • Better Plugin: actively maintained Flow Charting instead of the archived weathermap
  • More Data: 5 subscription types (interfaces, system, BGP, VXLAN, MLAG) instead of 2
  • Better Filtering: explicit keep rules instead of an overly restrictive regex
  • Health Checks: automated health checks for gnmic, Prometheus, and Grafana
  • Production Ready: Comprehensive visibility of underlay AND overlay

The stack now follows the approach demonstrated in the Nokia SRL telemetry lab, adapted for Arista cEOS switches.