Complete Lab Fixes - L2 and L3 VXLAN Fully Operational (#14)

## Summary

This PR merges all fixes and improvements from the troubleshooting journey to make the Arista EVPN-VXLAN lab fully operational with both L2 and L3 VXLAN connectivity.

## What's Changed

### 🎯 Major Achievements
- **L2 VXLAN fully operational** - host1 ↔ host3 connectivity verified
- **L3 VXLAN fully operational** - host2 ↔ host4 connectivity verified (VRF gold)
- **LACP bonding working** - dual-homed hosts with proper Port-Channel negotiation
- **All BGP/EVPN sessions established** - complete underlay and overlay working

### 🔧 Infrastructure Fixes

#### BGP & Routing
- Added `ip routing` command to all spine and leaf switches
- Fixed duplicate BGP network statements on leaf3, leaf4, leaf7, leaf8
- Activated EVPN neighbors on spine switches
- Added loopback network advertisements to BGP

#### MLAG Configuration
- Configured MLAG peer-link in trunk mode (not access) for VLAN 4090/4091
- Added dual-active detection via management interface
- Configured virtual router MAC for MLAG pairs

#### Switch Port Configuration
- Port-Channel1 configured in **trunk mode** on all leaf switches
- Added `switchport trunk allowed vlan` for host VLANs (34, 40, 78)
- Removed `no shutdown` from Port-Channel interfaces

### 🖥️ Host Networking - Complete Redesign

#### Image Change
- **Old:** `alpine:latest` (had bonding syntax issues)
- **New:** `ghcr.io/hellt/network-multitool` (networking tools pre-installed)

#### LACP Bonding Configuration
Proper LACP setup following network-multitool best practices:
```yaml
- ip link add bond0 type bond mode 802.3ad
- ip link set dev bond0 type bond xmit_hash_policy layer3+4
- ip link set dev eth1 down
- ip link set dev eth2 down
- ip link set eth1 master bond0
- ip link set eth2 master bond0
- ip link set dev eth1 up
- ip link set dev eth2 up
- ip link set dev bond0 type bond lacp_rate fast
- ip link set dev bond0 up
```
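
One way to confirm the result of the commands above is to read the kernel's bonding status file on the host. The sketch below assumes the usual `/proc/net/bonding/bond0` layout (the exact wording can vary by kernel version); `check_bond` is a hypothetical helper name, not something shipped in this repo.

```bash
#!/usr/bin/env bash
# Sketch: confirm a bond is in 802.3ad mode with the fast LACP rate.
# check_bond is a hypothetical helper; it reads the kernel bonding status
# text, which normally lives at /proc/net/bonding/<bond-name>.
check_bond() {
  local status_file="${1:-/proc/net/bonding/bond0}"
  # The mode line is expected to start with "Bonding Mode: IEEE 802.3ad".
  grep -q '^Bonding Mode: IEEE 802.3ad' "$status_file" \
    || { echo "not in 802.3ad mode"; return 1; }
  # The rate line is expected to read "LACP rate: fast".
  grep -q '^LACP rate: fast' "$status_file" \
    || { echo "LACP rate is not fast"; return 1; }
  echo "bond OK"
}

# On a lab host this would be called without arguments:
#   check_bond
```

The function only reads text, so it can also be pointed at a saved copy of the status file when debugging offline.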

#### VLAN Configuration
- **L2 VXLAN hosts (host1, host3):** VLAN 40 tagged on bond0
- **L3 VXLAN hosts (host2, host4):** VLANs 34 and 78 tagged on bond0
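
As a rough illustration of host-side tagging, the sketch below prints the iproute2 commands that would create a tagged sub-interface on `bond0`. The sub-interface naming (`bond0.40`), the `/24` prefix length, and the `vlan_cmds` helper are assumptions for illustration; the actual exec entries in `evpn-lab.clab.yml` may differ.

```bash
#!/usr/bin/env bash
# Sketch: print (dry run) the commands for a VLAN sub-interface on a bond.
# vlan_cmds is a hypothetical helper; nothing is applied to the system.
vlan_cmds() {
  local parent="$1" vid="$2" cidr="$3"
  local sub="${parent}.${vid}"
  echo "ip link add link ${parent} name ${sub} type vlan id ${vid}"
  echo "ip link set dev ${sub} up"
  echo "ip addr add ${cidr} dev ${sub}"
}

# host1 in VLAN 40; the /24 prefix length is an assumption.
vlan_cmds bond0 40 10.40.40.101/24
```

Printing the commands first makes it easy to review them before copying them into the topology file.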

#### Routing Strategy
- Kept management default route (172.16.0.254 via eth0)
- Added **specific routes** for L3 VXLAN networks instead of default routes:
  - host2: `ip route add 10.78.78.0/24 via 10.34.34.1`
  - host4: `ip route add 10.34.34.0/24 via 10.78.78.1`
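
A quick sanity check for this strategy is to confirm that both the management default route and the specific route are present. The sketch below reads `ip route show` text on stdin; `routes_ok` is a hypothetical helper, and the sample interface name `bond0.34` is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if the routing table text on stdin contains both
# a default route via eth0 and the given specific route.
# routes_ok is a hypothetical helper name.
routes_ok() {
  local prefix="$1" gateway="$2"
  awk -v p="$prefix" -v g="$gateway" '
    $1 == "default" && $2 == "via" && $5 == "eth0" { has_default = 1 }
    $1 == p && $2 == "via" && $3 == g              { has_specific = 1 }
    END { exit((has_default && has_specific) ? 0 : 1) }
  '
}

# Sample table shaped like `ip route show` output on host2
# (the bond0.34 interface name is an assumption).
routes_ok 10.78.78.0/24 10.34.34.1 <<'EOF' && echo "host2 routes OK"
default via 172.16.0.254 dev eth0
10.78.78.0/24 via 10.34.34.1 dev bond0.34
EOF
```

On a running host the same check would be `ip route show | routes_ok 10.78.78.0/24 10.34.34.1`.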

### 📁 Files Changed

#### Switch Configurations (Updated)
- `configs/spine1.cfg` - Added ip routing, EVPN activation
- `configs/spine2.cfg` - Added ip routing, EVPN activation
- `configs/leaf1.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf2.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf3.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf4.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf5.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf6.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf7.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf8.cfg` - Added ip routing, loopback ads, Port-Channel config

#### Topology (Updated)
- `evpn-lab.clab.yml` - Updated all host configurations with network-multitool image and proper LACP/VLAN setup

#### Documentation (New)
- `hosts/README.md` - Host interface configuration guide
- `hosts/host1_interfaces` - Interface file for host1 (not currently used, kept for reference)
- `hosts/host2_interfaces` - Interface file for host2 (not currently used, kept for reference)
- `hosts/host3_interfaces` - Interface file for host3 (not currently used, kept for reference)
- `hosts/host4_interfaces` - Interface file for host4 (not currently used, kept for reference)

## Testing & Verification

### L2 VXLAN (VLAN 40)
```
host1 (10.40.40.101) → host3 (10.40.40.103)
- Connectivity: VERIFIED ✓
- VXLAN tunnel: VTEP1 ↔ VTEP3
- MAC learning: Working via EVPN Type-2
```

### L3 VXLAN (VRF gold)
```
host2 (10.34.34.102) → host4 (10.78.78.104)
- Connectivity: VERIFIED ✓
- Ping results: 0% packet loss, TTL=62
- Routing: Via EVPN Type-5 through fabric
```
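
The TTL value is a useful cross-check. Assuming the hosts send with an initial TTL of 64 (the Linux default), a received TTL of 62 corresponds to two routed hops, which is what one would expect if the packet is routed once on the ingress leaf and once on the egress leaf. The arithmetic, with a hypothetical `routed_hops` helper:

```bash
#!/usr/bin/env bash
# Sketch: number of routed hops = initial TTL - TTL seen by the receiver.
# routed_hops is a hypothetical helper name.
routed_hops() {
  local initial_ttl="$1" received_ttl="$2"
  echo $(( initial_ttl - received_ttl ))
}

# Assumes an initial TTL of 64; prints 2 for the TTL=62 result above.
routed_hops 64 62
```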

### Infrastructure Status
- BGP Underlay: All sessions ESTAB
- EVPN Overlay: All neighbors ESTAB
- MLAG: All 4 pairs operational
- Port-Channels: LACP negotiated on all hosts

## Related Issues

Fixes #1 - Lab deployment and configuration fixes
Fixes #2 - BGP EVPN neighbors stuck in Connect state
Fixes #3 - Ready for deployment with EVPN activation
Fixes #4 - Lab convergence in progress
Fixes #5 - BGP EVPN neighbors stuck in Active state
Fixes #11 - Host LACP bonding configuration
Fixes #13 - L3 VXLAN default route issue

## Key Technical Learnings

1. **Arista EOS requires explicit `ip routing`** (IPv4 routing is disabled by default) before BGP can function
2. **MLAG peer-link must be trunk mode** to allow VLAN 4090/4091 traversal
3. **VLAN tagging location matters** - hosts tag, switches use trunk mode
4. **network-multitool image** superior to Alpine for LACP bonding
5. **Specific routes better than default routes** when management network present
6. **LACP rate fast** requests LACPDUs every second instead of every 30 seconds, so the bundle negotiates and detects failures quickly with Arista switches

## Deployment

After merging, deploy with:
```bash
cd ~/arista-evpn-vxlan-clab
sudo containerlab destroy -t evpn-lab.clab.yml --cleanup
sudo containerlab deploy -t evpn-lab.clab.yml
```

No manual post-deployment configuration needed - everything works from initial deployment!

## Breaking Changes

⚠️ **Host image changed** from `alpine:latest` to `ghcr.io/hellt/network-multitool`
⚠️ **Host configuration completely redesigned** - old exec commands replaced

## Reviewers

@Damien - Please review and merge when ready

---

**This PR represents the complete troubleshooting journey and brings the lab to production-ready status with full L2 and L3 VXLAN functionality.** 🚀

Reviewed-on: #14
Co-authored-by: Damien <damien@arnodo.fr>
Co-committed-by: Damien <damien@arnodo.fr>
This commit was merged in pull request #14.
2025-11-30 10:24:29 +00:00
committed by Damien Arnodo
parent 9502302b76
commit 1080bf07bb
23 changed files with 2632 additions and 74 deletions

BUGFIX_EVPN_ACTIVATION.md Normal file

@@ -0,0 +1,114 @@
# BGP EVPN Activation Bug - Critical Fix
## Issue Description
All BGP EVPN neighbors on the leaves were stuck in **Active** state instead of **Established** state, with **0 messages sent/received**.
```
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.0.250.1 4 65000 0 0 0 0 00:02:05 Active
10.0.250.2 4 65000 0 0 0 0 00:02:05 Active
```
Active state with 0 messages means the TCP handshake was **never completed**.
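
To spot this condition quickly, the summary output can be filtered for neighbors that are not established. This is a minimal sketch that assumes the column layout shown above (state in the ninth column); real `show bgp evpn summary` output may carry extra header lines or columns, and `not_established` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: print "neighbor state" for every peer whose state is not "Estab".
# Assumes whitespace-separated columns:
#   Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State [PfxRcd PfxAcc]
# not_established is a hypothetical helper; it reads the summary on stdin.
not_established() {
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $9 != "Estab" { print $1, $9 }'
}

# Sample input taken from the output above.
not_established <<'EOF'
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.0.250.1 4 65000 0 0 0 0 00:02:05 Active
10.0.250.2 4 65000 0 0 0 0 00:02:05 Active
EOF
```

An empty result means every listed neighbor is in the `Estab` state.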
## Root Cause
The **spine BGP configurations were missing the EVPN address family activation**.
In both `configs/spine1.cfg` and `configs/spine2.cfg`:
```
   address-family evpn
      neighbor evpn activate ← This line was MISSING!
```
With the `evpn` peer group not activated under `address-family evpn` on the spines, the sequence was:
1. The spines accepted the EVPN neighbor definitions
2. The spines did not listen for or respond to connections from those neighbors
3. The leaves kept trying to establish sessions, but the spines did not respond
4. Each connection attempt timed out, leaving the leaves in Active state
This is **different from the IPv4 underlay**, which was working because the IPv4 address family **was activated** on the spines.
## Solution Applied
### Before (Broken)
```
router bgp 65000
   ...
   address-family evpn
      ! Missing activation line!
```
### After (Fixed)
```
router bgp 65000
   ...
   address-family evpn
      neighbor evpn activate
```
## Files Modified
- `configs/spine1.cfg` - Added `neighbor evpn activate` in EVPN address family
- `configs/spine2.cfg` - Added `neighbor evpn activate` in EVPN address family
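
A regression check for this fix could scan a config file for an activation line inside the EVPN address family. The sketch below assumes the usual EOS running-config layout, where `!` lines separate blocks; `has_evpn_activation` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash
# Sketch: succeed if the file contains a "neighbor <peer> activate" line
# inside an "address-family evpn" block. Lines such as
# "no neighbor <peer> activate" do not count, since they start with "no".
# has_evpn_activation is a hypothetical helper name.
has_evpn_activation() {
  awk '
    $1 == "address-family" { in_evpn = ($2 == "evpn"); next }  # enter/leave AF
    $1 == "!"              { in_evpn = 0; next }               # block separator
    in_evpn && $1 == "neighbor" && $NF == "activate" { found = 1 }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# Possible use against the files modified in this fix:
#   for f in configs/spine1.cfg configs/spine2.cfg; do
#     has_evpn_activation "$f" || echo "$f: EVPN activation missing"
#   done
```

Because it only inspects text, the check can run before deployment, for example as a pre-commit step.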
## Technical Explanation
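In Arista EOS BGP, a neighbor defined in the global BGP context does not exchange routes for an address family such as EVPN **until it is explicitly activated in that address family block**. IPv4 unicast is the exception: EOS activates it for every neighbor by default unless `no bgp default ipv4-unicast` is configured.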
### Address Family Activation Rules
```
router bgp 65000
   neighbor 10.0.250.1 peer group evpn
   neighbor 10.0.250.1 remote-as 65000
   address-family evpn
      neighbor evpn activate ← REQUIRED for EVPN sessions to work
   address-family ipv4
      neighbor 10.0.250.1 activate ← Separate activation for IPv4
```
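BGP uses a single TCP session per peer on port 179 and negotiates address families as capabilities in the OPEN message; there is no separate listener per address family. Without activation in the EVPN address family, the behavior observed in this lab was:
- The spines define the neighbor parameters ✓
- The spines accept the BGP configuration ✓
- The spines do NOT complete a BGP session with the leaves for the EVPN peer group ✗
- The leaves attempt a TCP connection to the spine loopbacks on port 179, but it never completes ✗
- The attempts time out → Active state ✗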
## Testing the Fix
After deploying with the fix, the EVPN neighbors should immediately transition to **Established**:
```
# Before fix
10.0.250.1 4 65000 0 0 0 0 00:02:05 Active
# After fix
10.0.250.1 4 65000 8 8 0 0 00:00:15 Estab
```
## Impact
This was a **critical bug** that:
- Prevented any EVPN overlay from functioning
- Made L2 VXLAN testing impossible
- Made L3 VXLAN testing impossible
- Prevented MAC learning via VXLAN
- Prevented EVPN route distribution
Once fixed, the entire EVPN overlay becomes operational immediately.
## Lesson Learned
In BGP multi-address-family configurations, **each neighbor must be activated in every address family it is meant to carry**. Depending on platform defaults, this includes:
- IPv4 unicast
- IPv6 unicast
- EVPN
- Route target filtering
- Any other address families being used
A common mistake is to define a neighbor globally but forget to activate it in all address families where it should be used.