Complete Lab Fixes - L2 and L3 VXLAN Fully Operational (#14)
## Summary

This PR merges all fixes and improvements from the troubleshooting journey to make the Arista EVPN-VXLAN lab fully operational with both L2 and L3 VXLAN connectivity.

## What's Changed

### 🎯 Major Achievements

- ✅ **L2 VXLAN fully operational** - host1 ↔ host3 connectivity verified
- ✅ **L3 VXLAN fully operational** - host2 ↔ host4 connectivity verified (VRF gold)
- ✅ **LACP bonding working** - dual-homed hosts with proper Port-Channel negotiation
- ✅ **All BGP/EVPN sessions established** - complete underlay and overlay working

### 🔧 Infrastructure Fixes

#### BGP & Routing

- Added `ip routing` command to all spine and leaf switches
- Fixed duplicate BGP network statements on leaf3, leaf4, leaf7, leaf8
- Activated EVPN neighbors on spine switches
- Added loopback network advertisements to BGP

#### MLAG Configuration

- Configured MLAG peer-link in trunk mode (not access) for VLAN 4090/4091
- Added dual-active detection via management interface
- Configured virtual router MAC for MLAG pairs

#### Switch Port Configuration

- Port-Channel1 configured in **trunk mode** on all leaf switches
- Added `switchport trunk allowed vlan` for host VLANs (34, 40, 78)
- Removed `no shutdown` from Port-Channel interfaces

### 🖥️ Host Networking - Complete Redesign

#### Image Change

- **Old:** `alpine:latest` (had bonding syntax issues)
- **New:** `ghcr.io/hellt/network-multitool` (networking tools pre-installed)

#### LACP Bonding Configuration

Proper LACP setup following network-multitool best practices:

```yaml
- ip link add bond0 type bond mode 802.3ad
- ip link set dev bond0 type bond xmit_hash_policy layer3+4
- ip link set dev eth1 down
- ip link set dev eth2 down
- ip link set eth1 master bond0
- ip link set eth2 master bond0
- ip link set dev eth1 up
- ip link set dev eth2 up
- ip link set dev bond0 type bond lacp_rate fast
- ip link set dev bond0 up
```

#### VLAN Configuration

- **L2 VXLAN hosts (host1, host3):** VLAN 40 tagged on bond0
- **L3 VXLAN hosts (host2, host4):** VLANs 34 and 78 tagged on bond0

#### Routing Strategy

- Kept management default route (172.16.0.254 via eth0)
- Added **specific routes** for L3 VXLAN networks instead of default routes:
  - host2: `ip route add 10.78.78.0/24 via 10.34.34.1`
  - host4: `ip route add 10.34.34.0/24 via 10.78.78.1`

### 📁 Files Changed

#### Switch Configurations (Updated)

- `configs/spine1.cfg` - Added ip routing, EVPN activation
- `configs/spine2.cfg` - Added ip routing, EVPN activation
- `configs/leaf1.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf2.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf3.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf4.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf5.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf6.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf7.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf8.cfg` - Added ip routing, loopback ads, Port-Channel config

#### Topology (Updated)

- `evpn-lab.clab.yml` - Updated all host configurations with network-multitool image and proper LACP/VLAN setup

#### Documentation (New)

- `hosts/README.md` - Host interface configuration guide
- `hosts/host1_interfaces` - Interface file for host1 (not currently used, kept for reference)
- `hosts/host2_interfaces` - Interface file for host2 (not currently used, kept for reference)
- `hosts/host3_interfaces` - Interface file for host3 (not currently used, kept for reference)
- `hosts/host4_interfaces` - Interface file for host4 (not currently used, kept for reference)

## Testing & Verification

### ✅ L2 VXLAN (VLAN 40)

```
host1 (10.40.40.101) → host3 (10.40.40.103)
- Connectivity: VERIFIED ✓
- VXLAN tunnel: VTEP1 ↔ VTEP3
- MAC learning: Working via EVPN Type-2
```

### ✅ L3 VXLAN (VRF gold)

```
host2 (10.34.34.102) → host4 (10.78.78.104)
- Connectivity: VERIFIED ✓
- Ping results: 0% packet loss, TTL=62
- Routing: Via EVPN Type-5 through fabric
```

### ✅ Infrastructure Status

- BGP Underlay: All sessions ESTAB
- EVPN Overlay: All neighbors ESTAB
- MLAG: All 4 pairs operational
- Port-Channels: LACP negotiated on all hosts

## Related Issues

Fixes #1 - Lab deployment and configuration fixes
Fixes #2 - BGP EVPN neighbors stuck in Connect state
Fixes #3 - Ready for deployment with EVPN activation
Fixes #4 - Lab convergence in progress
Fixes #5 - BGP EVPN neighbors stuck in Active state
Fixes #11 - Host LACP bonding configuration
Fixes #13 - L3 VXLAN default route issue

## Key Technical Learnings

1. **Arista EOS requires explicit `ip routing`** before BGP can function
2. **MLAG peer-link must be trunk mode** to allow VLAN 4090/4091 traversal
3. **VLAN tagging location matters** - hosts tag, switches use trunk mode
4. **network-multitool image** superior to Alpine for LACP bonding
5. **Specific routes better than default routes** when management network present
6. **LACP rate fast** ensures quick negotiation with Arista switches

## Deployment

After merging, deploy with:

```bash
cd ~/arista-evpn-vxlan-clab
sudo containerlab destroy -t evpn-lab.clab.yml --cleanup
sudo containerlab deploy -t evpn-lab.clab.yml
```

No manual post-deployment configuration needed - everything works from initial deployment!

## Breaking Changes

⚠️ **Host image changed** from `alpine:latest` to `ghcr.io/hellt/network-multitool`
⚠️ **Host configuration completely redesigned** - old exec commands replaced

## Reviewers

@Damien - Please review and merge when ready

---

**This PR represents the complete troubleshooting journey and brings the lab to production-ready status with full L2 and L3 VXLAN functionality.** 🚀

Reviewed-on: #14
Co-authored-by: Damien <damien@arnodo.fr>
Co-committed-by: Damien <damien@arnodo.fr>
This commit was merged in pull request #14.
114
BUGFIX_EVPN_ACTIVATION.md
Normal file
@@ -0,0 +1,114 @@
# BGP EVPN Activation Bug - Critical Fix

## Issue Description

All BGP EVPN neighbors on the leaves were stuck in **Active** state instead of **Established**, with **0 messages sent/received**.

```
Neighbor     V  AS     MsgRcvd  MsgSent  InQ  OutQ  Up/Down   State   PfxRcd  PfxAcc
10.0.250.1   4  65000  0        0        0    0     00:02:05  Active
10.0.250.2   4  65000  0        0        0    0     00:02:05  Active
```

Active state with 0 messages means that no BGP session ever formed: the leaf keeps trying to connect, but the TCP handshake was **never completed**, so not even an OPEN message was exchanged.
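
The symptom above can be checked mechanically. The following is a minimal sketch, assuming the summary output has the same column order as the table shown in this document; the names `Neighbor`, `parse_summary`, and `never_connected` are hypothetical helpers, not part of any Arista tooling.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    address: str
    msg_rcvd: int
    msg_sent: int
    state: str


def parse_summary(text):
    """Parse neighbor rows from a BGP summary table (column order assumed)."""
    neighbors = []
    for line in text.splitlines():
        fields = line.split()
        # Neighbor rows have at least 9 columns and start with an IPv4 address;
        # this skips the header row and blank lines.
        if len(fields) < 9 or fields[0].count(".") != 3:
            continue
        neighbors.append(
            Neighbor(fields[0], int(fields[3]), int(fields[4]), fields[8])
        )
    return neighbors


def never_connected(neighbors):
    """Return neighbors that are not established and exchanged no messages."""
    return [
        n for n in neighbors
        if not n.state.startswith("Estab") and n.msg_rcvd == 0 and n.msg_sent == 0
    ]


SAMPLE = """\
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.0.250.1 4 65000 0 0 0 0 00:02:05 Active
10.0.250.2 4 65000 0 0 0 0 00:02:05 Active
"""

if __name__ == "__main__":
    for n in never_connected(parse_summary(SAMPLE)):
        print(f"{n.address}: {n.state}, no BGP messages exchanged")
```

Feeding it the output captured from a leaf gives a quick list of peers whose sessions never got past connection setup.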

## Root Cause

The **spine BGP configurations were missing the EVPN address family activation**.

In both `configs/spine1.cfg` and `configs/spine2.cfg`:

```
address-family evpn
   neighbor evpn activate ← This line was MISSING!
```

Without activating the EVPN address family on the spines:

1. The spines accept the EVPN neighbor definitions
2. But they don't actively listen for or respond to EVPN connections
3. The leaves try to establish sessions, but the spines don't respond
4. The connection attempts time out → Active state

This is **different from the IPv4 underlay**, which was working because the IPv4 address family **was activated** on the spines.
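
A small lint check can catch this class of omission before deployment. The sketch below is an assumption-laden illustration: it treats the config as indentation-structured text, looks only at `address-family evpn`, and uses a hypothetical `underlay` peer group in the sample. It is not a full EOS config parser.

```python
import re

# Matches "neighbor <peer-or-group> activate" but not the "no neighbor ..." form.
ACTIVATE = re.compile(r"^neighbor (\S+) activate$")


def evpn_activations(config):
    """Return peers or peer groups activated under `address-family evpn`."""
    activated = []
    in_evpn = False
    for raw in config.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("address-family "):
            # Entering a new address-family block ends the previous one.
            in_evpn = line == "address-family evpn"
        elif not raw[0].isspace():
            # A top-level line (e.g. "router bgp", "interface") ends the block.
            in_evpn = False
        elif in_evpn:
            match = ACTIVATE.match(line)
            if match:
                activated.append(match.group(1))
    return activated


BROKEN = """\
router bgp 65000
   neighbor evpn peer group
   address-family evpn
   address-family ipv4
      neighbor underlay activate
"""

FIXED = """\
router bgp 65000
   neighbor evpn peer group
   address-family evpn
      neighbor evpn activate
   address-family ipv4
      neighbor underlay activate
"""

if __name__ == "__main__":
    for name, cfg in (("broken", BROKEN), ("fixed", FIXED)):
        peers = evpn_activations(cfg)
        print(name, "->", peers if peers else "no EVPN activation found")
```

Running it against `configs/spine1.cfg` and `configs/spine2.cfg` before deploying would have flagged the empty EVPN block.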

## Solution Applied

### Before (Broken)

```
router bgp 65000
   ...
   address-family evpn
      ! Missing activation line!
```

### After (Fixed)

```
router bgp 65000
   ...
   address-family evpn
      neighbor evpn activate
```

## Files Modified

- `configs/spine1.cfg` - Added `neighbor evpn activate` in EVPN address family
- `configs/spine2.cfg` - Added `neighbor evpn activate` in EVPN address family

## Technical Explanation

In Arista EOS BGP, neighbors defined in the global BGP context don't actively participate in any address family **until explicitly activated in that address family block**.

### Address Family Activation Rules

```
router bgp 65000
   neighbor 10.0.250.1 peer group evpn
   neighbor 10.0.250.1 remote-as 65000

   address-family evpn
      neighbor evpn activate ← REQUIRED for EVPN sessions to work

   address-family ipv4
      neighbor 10.0.250.1 activate ← Separate activation for IPv4
```

Without activation in the EVPN address family:

- The spines define the neighbor parameters ✓
- The spines accept the BGP configuration ✓
- The spines do NOT accept BGP connections (TCP 179) from the EVPN peers ✗
- Leaf attempts to connect to the spine loopback on TCP port 179 go unanswered ✗
- Timeout occurs → Active state ✗
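
In standard multiprotocol BGP terms, an address family is only usable on a session when both peers advertise it. The sketch below models just that rule with the IANA-assigned (AFI, SAFI) code points; it is an illustration of the general principle, not a reproduction of how EOS behaves when a neighbor has no address family activated at all. The names `usable_families`, `leaf`, and `spine_*` are hypothetical.

```python
# (AFI, SAFI) code points as assigned by IANA.
IPV4_UNICAST = (1, 1)
L2VPN_EVPN = (25, 70)

NAMES = {IPV4_UNICAST: "ipv4-unicast", L2VPN_EVPN: "l2vpn-evpn"}


def usable_families(local, remote):
    """Families usable on a session: those advertised by both peers."""
    return sorted(NAMES[family] for family in local & remote)


# Hypothetical overlay peering: the leaf activates EVPN toward the spine.
leaf = {L2VPN_EVPN}
spine_before_fix = set()        # no `neighbor evpn activate` on the spine
spine_after_fix = {L2VPN_EVPN}  # activation added

print(usable_families(leaf, spine_before_fix))  # []
print(usable_families(leaf, spine_after_fix))   # ['l2vpn-evpn']
```

With an empty set on the spine side, there is nothing in common to exchange, which matches the observation that the overlay stayed down until the activation line was added.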

## Testing the Fix

After deploying with the fix, the EVPN neighbors should immediately transition to **Established**:

```
# Before fix
10.0.250.1   4  65000  0  0  0  0  00:02:05  Active

# After fix
10.0.250.1   4  65000  8  8  0  0  00:00:15  Estab
```

## Impact

This was a **critical bug** that:

- Prevented any EVPN overlay from functioning
- Made L2 VXLAN testing impossible
- Made L3 VXLAN testing impossible
- Prevented MAC learning via VXLAN
- Prevented EVPN route distribution

Once fixed, the entire EVPN overlay becomes operational immediately.

## Lesson Learned

In BGP multi-address-family configurations, a neighbor must be **explicitly activated in every address family** it is meant to carry. This includes:

- IPv4 unicast
- IPv6 unicast
- EVPN
- Route target filtering
- Any other address families being used

A common mistake is to define a neighbor globally but forget to activate it in all address families where it should be used.
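
To generalize the lesson, a report of which globally defined peers are activated in which address families makes the gap visible at a glance. This is a rough sketch under several assumptions: it relies on indentation and on neighbor definitions appearing before the address-family blocks, it treats peer-group members as inheriting from their group, and it does not model defaults (EOS activates IPv4 unicast implicitly unless `no bgp default ipv4-unicast` is configured). The `underlay` peer group and the member address in the sample are hypothetical.

```python
import re

MEMBER = re.compile(r"^neighbor \S+ peer group \S+$")  # member of a peer group
NEIGHBOR = re.compile(r"^neighbor (\S+) ")             # any other neighbor line
ACTIVATE = re.compile(r"^neighbor (\S+) activate$")


def activation_report(config):
    """Map each globally defined peer or peer group to its activated families."""
    report = {}
    family = None
    for raw in config.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("address-family "):
            family = line.split(maxsplit=1)[1]
        elif not raw[0].isspace():
            family = None  # a top-level line ends any address-family block
        elif family is None:
            if MEMBER.match(line):
                continue  # members inherit activation from their peer group
            match = NEIGHBOR.match(line)
            if match:
                report.setdefault(match.group(1), [])
        else:
            match = ACTIVATE.match(line)
            if match:
                report.setdefault(match.group(1), []).append(family)
    return report


def never_activated(report):
    """Peers defined globally but not activated in any address family."""
    return sorted(peer for peer, families in report.items() if not families)


SAMPLE = """\
router bgp 65000
   neighbor evpn peer group
   neighbor evpn remote-as 65000
   neighbor underlay peer group
   neighbor 10.0.250.11 peer group evpn
   address-family evpn
   address-family ipv4
      neighbor underlay activate
"""

if __name__ == "__main__":
    report = activation_report(SAMPLE)
    print(report)
    print("never activated:", never_activated(report))
```

Any peer listed as never activated deserves a second look before the lab is deployed.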