Add Grafana monitoring stack with gNMI telemetry and Network Weathermap #17

Closed
Damien wants to merge 28 commits from feature/grafana-monitoring into main
Showing only changes of commit bcb3160c9b

monitoring/QUICKSTART.md Normal file

@@ -0,0 +1,246 @@
# Quick Start Guide - EVPN-VXLAN Monitoring Stack
## Prerequisites
1. **ContainerLab topology deployed** with management network named `evpn-mgmt`
2. **Docker and Docker Compose** installed
3. **gNMI enabled on all switches** (the lab switch configs should already include this; see Troubleshooting if not)
## Deployment Steps
### 1. Deploy the Monitoring Stack
```bash
# Navigate to monitoring directory
cd monitoring
# Start all services
docker-compose up -d
# Verify all services are running
docker-compose ps
# Expected output:
# NAME STATUS PORTS
# gnmic Up (healthy) 0.0.0.0:9804->9804/tcp
# prometheus Up (healthy) 0.0.0.0:9090->9090/tcp
# grafana Up (healthy) 0.0.0.0:3000->3000/tcp
```
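Containers can take a few seconds to report healthy after `docker-compose up -d`. A minimal sketch of a polling helper follows. `wait_until` is a hypothetical name and not part of this stack; the URLs in the usage comments assume the default ports listed above and the standard Prometheus `/-/ready` and Grafana `/api/health` endpoints.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  _wu_attempts=$1
  _wu_delay=$2
  shift 2
  _wu_i=1
  while [ "$_wu_i" -le "$_wu_attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$_wu_delay"
    _wu_i=$((_wu_i + 1))
  done
  return 1
}

# Example usage (run manually once the stack is starting):
#   wait_until 30 2 curl -sf http://localhost:9090/-/ready
#   wait_until 30 2 curl -sf http://localhost:3000/api/health
#   wait_until 30 2 curl -sf http://localhost:9804/metrics
```

The helper returns non-zero when every attempt fails, so it can gate later steps in a script with `&&`.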
### 2. Verify gnmic is Collecting Metrics
```bash
# Check gnmic logs
docker logs gnmic
# Should see successful subscription messages like:
# "starting connection to target 'spine1'"
# "target 'spine1' gNMI connection established"
# Check metrics endpoint
curl -s http://localhost:9804/metrics | grep gnmic_interfaces | head -5
# Should see interface metrics:
# gnmic_interfaces_interface_state_counters_in_octets{...} 12345
# gnmic_interfaces_interface_state_counters_out_octets{...} 67890
```
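To confirm that every switch is reporting, not just one, it can help to count series per target. The sketch below assumes each gnmic series carries a `source="<target>"` label, as the Prometheus query in the next step suggests. `count_series_by_source` is a hypothetical helper name.

```bash
# Hypothetical helper: read Prometheus exposition text on stdin and print
# "<source> <series-count>" for each distinct value of the source label.
count_series_by_source() {
  grep -v '^#' \
    | sed -n 's/.*source="\([^"]*\)".*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Example usage (run manually against the live endpoint):
#   curl -s http://localhost:9804/metrics | count_series_by_source
```

A switch that is missing from the output, or that has far fewer series than its peers, is a good first place to look in the gnmic logs.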
### 3. Verify Prometheus is Scraping
```bash
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job, health}'
# Should show gnmic target as "up":
# {
# "job": "gnmic",
# "health": "up"
# }
# Query a specific metric
curl -sG http://localhost:9090/api/v1/query \
--data-urlencode 'query=gnmic_interfaces_interface_state_counters_out_octets{source="spine1"}' \
| jq '.data.result[0]'
```
### 4. Access Grafana
1. **Open browser**: http://localhost:3000
2. **Log in** (optional): admin/admin
- Or use anonymous access (Viewer role)
3. **Navigate to dashboards**:
- Dashboards → Browse
- Select "EVPN-VXLAN Fabric Flow Topology"
### 5. Generate Traffic (Optional)
To see bandwidth visualization in action:
```bash
# From your lab directory (not monitoring/)
cd ..
# Generate traffic between clients
# (Assumes you have traffic generation scripts)
bash scripts/generate-traffic.sh
```
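The octet counters collected above are cumulative, so bandwidth has to be derived from the difference between two samples. The sketch below shows that arithmetic. `octets_to_bps` is a hypothetical helper, and it treats a decreasing counter as an error rather than handling wrap or reset. In PromQL the equivalent would typically be `rate(gnmic_interfaces_interface_state_counters_out_octets[1m]) * 8`, though the queries used by the dashboards in this repository may differ.

```bash
# Hypothetical helper: convert two cumulative octet samples to bits per second.
# $1: earlier counter value, $2: later counter value, $3: seconds between samples
octets_to_bps() {
  if [ "$3" -le 0 ] || [ "$2" -lt "$1" ]; then
    echo "invalid interval or counter reset" >&2
    return 1
  fi
  # Octets to bits (x8), divided by the sampling interval (integer division).
  echo $(( ($2 - $1) * 8 / $3 ))
}

# Example usage:
#   octets_to_bps 12345 137345 10
```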
## Accessing the Stack
### Service URLs
| Service | URL | Credentials |
|---------|-----|-------------|
| Grafana | http://localhost:3000 | admin/admin or anonymous |
| Prometheus | http://localhost:9090 | None |
| gnmic metrics | http://localhost:9804/metrics | None |
### Available Dashboards
1. **EVPN-VXLAN Fabric Flow Topology** (`fabric-flow-topology.json`)
- Interactive flowchart of fabric topology
- Real-time bandwidth overlays on links
- Spine and leaf interface graphs
2. **Fabric Overview** (`fabric-overview.json`)
- General fabric statistics
- Device health overview
## Troubleshooting
### Problem: gnmic not collecting data
**Check switch gNMI configuration:**
```bash
# SSH to any switch
ssh admin@172.16.0.1
# Verify gNMI is enabled
show management api gnmi
# Should show:
# Enabled: yes
# Transport: GRPC
```
**If it is not enabled, add this to each switch config:**
```
management api gnmi
   transport grpc default
```
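The manual check above can be scripted when there are several switches to verify. This is a sketch only: `gnmi_enabled` is a hypothetical helper, and it assumes the CLI output contains an `Enabled: yes` line as shown above (matched case-insensitively, since the exact capitalization may vary by software version).

```bash
# Hypothetical helper: read "show management api gnmi" output on stdin and
# succeed only if it reports that gNMI is enabled.
gnmi_enabled() {
  grep -qi '^[[:space:]]*enabled:[[:space:]]*yes'
}

# Example usage (run manually; assumes SSH command execution is permitted):
#   ssh admin@172.16.0.1 'show management api gnmi' | gnmi_enabled \
#     && echo "gNMI enabled" || echo "gNMI NOT enabled"
```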
### Problem: Prometheus shows no data
**Check:**
```bash
# 1. Verify gnmic is exposing metrics
curl http://localhost:9804/metrics | grep gnmic
# 2. Check Prometheus logs
docker logs --tail 20 prometheus
# 3. Check Prometheus config is valid
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
```
### Problem: Grafana dashboard shows "No Data"
**Check:**
1. **Prometheus datasource**: Configuration → Data Sources → Prometheus
- URL should be: http://prometheus:9090
- Click "Save & Test" - should show green "Data source is working"
2. **Query in Explore**:
- Menu → Explore
- Select "Prometheus" datasource
- Run query: `gnmic_interfaces_interface_state_counters_out_octets`
- Should return results
3. **Time range**: Ensure the dashboard time range covers recent data (for example, "Last 1 hour")
### Problem: Flow diagram not rendering
**Check:**
1. **Plugin installed**:
```bash
docker exec grafana grafana-cli plugins ls | grep agenty
```
Should show: agenty-flowcharting-panel
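2. **If missing, recreate the containers** (this assumes the plugin is installed at container startup, for example via `GF_INSTALL_PLUGINS`):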
```bash
docker-compose down
docker-compose up -d
```
## Stopping the Stack
```bash
# Stop all services
docker-compose down
# Stop and remove volumes (fresh start)
docker-compose down -v
```
## Updating Configuration
### Update gnmic subscriptions
1. Edit `gnmic/gnmic.yaml`
2. Restart gnmic:
```bash
docker-compose restart gnmic
```
### Update Prometheus scrape config
1. Edit `prometheus/prometheus.yml`
2. Reload Prometheus (no restart needed; the reload endpoint only works if Prometheus was started with `--web.enable-lifecycle`):
```bash
curl -X POST http://localhost:9090/-/reload
```
### Update Grafana dashboards
1. Edit JSON files in `grafana/dashboards/`
2. Restart Grafana:
```bash
docker-compose restart grafana
```
Alternatively, edit the dashboard in the Grafana UI and export the updated JSON.
## Next Steps
1. **Explore metrics**: Use Grafana Explore or the Prometheus expression browser to see all available metrics
2. **Create custom dashboards**: Build specific views for your use cases
3. **Add alerting**: Configure Prometheus alerting rules
4. **Add more visualizations**: Enhanced BGP, VXLAN, and MLAG dashboards
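As a starting point for the alerting step, the sketch below writes a single Prometheus rule that fires when the `gnmic` scrape job is down. The function name, rule name, output path, and the two-minute `for` duration are all illustrative assumptions. The file would also need to be listed under `rule_files` in `prometheus/prometheus.yml` and mounted into the container, neither of which is shown here.

```bash
# Hypothetical helper: write an example alerting rule file to the given path.
# $1: destination path for the rule file (location is an assumption)
write_gnmic_alert_rule() {
  cat > "$1" <<'EOF'
groups:
  - name: gnmic-exporter
    rules:
      - alert: GnmicExporterDown
        expr: up{job="gnmic"} == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus cannot scrape the gnmic exporter"
EOF
}

# Example usage (run manually from the monitoring directory):
#   write_gnmic_alert_rule prometheus/gnmic-alerts.yml
```

Once the file is visible inside the container, `promtool check rules <path>` can validate it before reloading Prometheus.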
## Useful Commands
```bash
# View logs for all services
docker-compose logs -f
# View logs for specific service
docker-compose logs -f gnmic
# Restart specific service
docker-compose restart prometheus
# Check resource usage
docker stats gnmic prometheus grafana
# Execute command in container
docker exec -it gnmic sh
```
## Support
- **gnmic**: https://gnmic.openconfig.net
- **Prometheus**: https://prometheus.io/docs
- **Grafana**: https://grafana.com/docs
- **Flow Plugin**: https://grafana.com/grafana/plugins/agenty-flowcharting-panel/
For issues specific to this lab, check the main repository documentation.