## Summary

This PR merges all fixes and improvements from the troubleshooting journey to make the Arista EVPN-VXLAN lab fully operational with both L2 and L3 VXLAN connectivity.

## What's Changed

### 🎯 Major Achievements

- ✅ **L2 VXLAN fully operational** - host1 ↔ host3 connectivity verified
- ✅ **L3 VXLAN fully operational** - host2 ↔ host4 connectivity verified (VRF gold)
- ✅ **LACP bonding working** - dual-homed hosts with proper Port-Channel negotiation
- ✅ **All BGP/EVPN sessions established** - complete underlay and overlay working

### 🔧 Infrastructure Fixes

#### BGP & Routing

- Added `ip routing` command to all spine and leaf switches
- Fixed duplicate BGP network statements on leaf3, leaf4, leaf7, leaf8
- Activated EVPN neighbors on spine switches
- Added loopback network advertisements to BGP

#### MLAG Configuration

- Configured MLAG peer-link in trunk mode (not access) for VLAN 4090/4091
- Added dual-active detection via management interface
- Configured virtual router MAC for MLAG pairs

#### Switch Port Configuration

- Port-Channel1 configured in **trunk mode** on all leaf switches
- Added `switchport trunk allowed vlan` for host VLANs (34, 40, 78)
- Removed `no shutdown` from Port-Channel interfaces

### 🖥️ Host Networking - Complete Redesign

#### Image Change

- **Old:** `alpine:latest` (had bonding syntax issues)
- **New:** `ghcr.io/hellt/network-multitool` (networking tools pre-installed)

#### LACP Bonding Configuration

Proper LACP setup following network-multitool best practices:

```yaml
- ip link add bond0 type bond mode 802.3ad
- ip link set dev bond0 type bond xmit_hash_policy layer3+4
- ip link set dev eth1 down
- ip link set dev eth2 down
- ip link set eth1 master bond0
- ip link set eth2 master bond0
- ip link set dev eth1 up
- ip link set dev eth2 up
- ip link set dev bond0 type bond lacp_rate fast
- ip link set dev bond0 up
```

#### VLAN Configuration

- **L2 VXLAN hosts (host1, host3):** VLAN 40 tagged on bond0
- **L3 VXLAN hosts (host2, host4):** VLANs 34 and 78 tagged on bond0

#### Routing Strategy

- Kept management default route (172.16.0.254 via eth0)
- Added **specific routes** for L3 VXLAN networks instead of default routes:
  - host2: `ip route add 10.78.78.0/24 via 10.34.34.1`
  - host4: `ip route add 10.34.34.0/24 via 10.78.78.1`

### 📁 Files Changed

#### Switch Configurations (Updated)

- `configs/spine1.cfg` - Added ip routing, EVPN activation
- `configs/spine2.cfg` - Added ip routing, EVPN activation
- `configs/leaf1.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf2.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf3.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf4.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf5.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf6.cfg` - Port-Channel trunk mode, VLAN config
- `configs/leaf7.cfg` - Added ip routing, loopback ads, Port-Channel config
- `configs/leaf8.cfg` - Added ip routing, loopback ads, Port-Channel config

#### Topology (Updated)

- `evpn-lab.clab.yml` - Updated all host configurations with network-multitool image and proper LACP/VLAN setup

#### Documentation (New)

- `hosts/README.md` - Host interface configuration guide
- `hosts/host1_interfaces` - Interface file for host1 (not currently used, kept for reference)
- `hosts/host2_interfaces` - Interface file for host2 (not currently used, kept for reference)
- `hosts/host3_interfaces` - Interface file for host3 (not currently used, kept for reference)
- `hosts/host4_interfaces` - Interface file for host4 (not currently used, kept for reference)

## Testing & Verification

### ✅ L2 VXLAN (VLAN 40)

```
host1 (10.40.40.101) → host3 (10.40.40.103)
- Connectivity: VERIFIED ✓
- VXLAN tunnel: VTEP1 ↔ VTEP3
- MAC learning: Working via EVPN Type-2
```

### ✅ L3 VXLAN (VRF gold)

```
host2 (10.34.34.102) → host4 (10.78.78.104)
- Connectivity: VERIFIED ✓
- Ping results: 0% packet loss, TTL=62
- Routing: Via EVPN Type-5 through fabric
```

### ✅ Infrastructure Status

- BGP Underlay: All sessions ESTAB
- EVPN Overlay: All neighbors ESTAB
- MLAG: All 4 pairs operational
- Port-Channels: LACP negotiated on all hosts

## Related Issues

Fixes #1 - Lab deployment and configuration fixes
Fixes #2 - BGP EVPN neighbors stuck in Connect state
Fixes #3 - Ready for deployment with EVPN activation
Fixes #4 - Lab convergence in progress
Fixes #5 - BGP EVPN neighbors stuck in Active state
Fixes #11 - Host LACP bonding configuration
Fixes #13 - L3 VXLAN default route issue

## Key Technical Learnings

1. **Arista EOS requires explicit `ip routing`** before BGP can function
2. **MLAG peer-link must be trunk mode** to allow VLAN 4090/4091 traversal
3. **VLAN tagging location matters** - hosts tag, switches use trunk mode
4. **network-multitool image** superior to Alpine for LACP bonding
5. **Specific routes better than default routes** when management network present
6. **LACP rate fast** ensures quick negotiation with Arista switches

## Deployment

After merging, deploy with:

```bash
cd ~/arista-evpn-vxlan-clab
sudo containerlab destroy -t evpn-lab.clab.yml --cleanup
sudo containerlab deploy -t evpn-lab.clab.yml
```

No manual post-deployment configuration needed - everything works from initial deployment!

## Breaking Changes

⚠️ **Host image changed** from `alpine:latest` to `ghcr.io/hellt/network-multitool`
⚠️ **Host configuration completely redesigned** - old exec commands replaced

## Reviewers

@Damien - Please review and merge when ready

---

**This PR represents the complete troubleshooting journey and brings the lab to production-ready status with full L2 and L3 VXLAN functionality.** 🚀

Reviewed-on: #14
Co-authored-by: Damien <damien@arnodo.fr>
Co-committed-by: Damien <damien@arnodo.fr>

# BGP EVPN Activation Bug - Critical Fix

## Issue Description

All BGP EVPN neighbors on the leaves were stuck in the **Active** state instead of **Established**, with **0 messages sent/received**.

```
Neighbor    V  AS     MsgRcvd  MsgSent  InQ  OutQ  Up/Down   State   PfxRcd  PfxAcc
10.0.250.1  4  65000  0        0        0    0     00:02:05  Active
10.0.250.2  4  65000  0        0        0    0     00:02:05  Active
```

An Active state with 0 messages means the TCP handshake was **never completed**, so not even a BGP OPEN message was exchanged.

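The symptom above (not Established, zero messages in either direction) is easy to check mechanically. The following is a minimal sketch, assuming the summary output has been captured as plain text with the column order shown in the table above; the `Neighbor` tuple, the function names, and the `SAMPLE` text (which mixes a stuck row with a healthy one for illustration) are hypothetical and not part of the lab repository.

```python
from typing import NamedTuple


class Neighbor(NamedTuple):
    address: str
    msg_rcvd: int
    msg_sent: int
    state: str


def parse_summary(text: str) -> list[Neighbor]:
    """Parse neighbor rows from BGP summary text into Neighbor tuples."""
    neighbors = []
    for line in text.splitlines():
        fields = line.split()
        # Neighbor rows start with an IPv4 address and have at least the
        # nine columns up to State; skip headers and anything else.
        if len(fields) < 9 or fields[0].count(".") != 3:
            continue
        neighbors.append(
            Neighbor(fields[0], int(fields[3]), int(fields[4]), fields[8])
        )
    return neighbors


def stuck_neighbors(neighbors: list[Neighbor]) -> list[Neighbor]:
    """Return neighbors that are not up and never exchanged a message."""
    return [
        n for n in neighbors
        if n.state != "Estab" and n.msg_rcvd == 0 and n.msg_sent == 0
    ]


# Illustrative input only: one stuck neighbor and one healthy neighbor.
SAMPLE = """\
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.0.250.1 4 65000 0 0 0 0 00:02:05 Active
10.0.250.2 4 65000 8 8 0 0 00:00:15 Estab
"""

if __name__ == "__main__":
    for n in stuck_neighbors(parse_summary(SAMPLE)):
        print(f"{n.address} is stuck in {n.state} with no messages exchanged")
```

Matching on whitespace-separated fields rather than fixed column positions keeps the check tolerant of the spacing differences between EOS versions and copy-pasted output.
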
## Root Cause

The **spine BGP configurations were missing the EVPN address family activation**.

In both `configs/spine1.cfg` and `configs/spine2.cfg`:

```
address-family evpn
   ! The line below was MISSING
   neighbor evpn activate
```

Without the EVPN address family activated on the spines:

1. The spines accept the EVPN neighbor definitions.
2. But they don't actively listen for or respond to EVPN connections.
3. The leaves try to establish sessions, but the spines don't respond.
4. The connection attempts time out and the neighbors stay in the Active state.

This is **different from the IPv4 underlay**, which was working because the IPv4 address family **was activated** on the spines.

## Solution Applied

### Before (Broken)

```
router bgp 65000
   ...
   address-family evpn
      ! Missing activation line!
```

### After (Fixed)

```
router bgp 65000
   ...
   address-family evpn
      neighbor evpn activate
```

## Files Modified

- `configs/spine1.cfg` - Added `neighbor evpn activate` in EVPN address family
- `configs/spine2.cfg` - Added `neighbor evpn activate` in EVPN address family

## Technical Explanation

In Arista EOS BGP, neighbors defined in the global BGP context don't participate in the EVPN address family **until explicitly activated in that address family block**.

### Address Family Activation Rules

```
router bgp 65000
   neighbor 10.0.250.1 peer group evpn
   neighbor 10.0.250.1 remote-as 65000
   !
   address-family evpn
      ! REQUIRED for EVPN sessions to work
      neighbor evpn activate
   !
   address-family ipv4
      ! Separate activation for IPv4
      neighbor 10.0.250.1 activate
```

Without activation in the EVPN address family:

- The spines define the neighbor parameters ✓
- The spines enter BGP configuration ✓
- The spines do NOT bring up EVPN sessions with the leaves ✗
- Each leaf's TCP connection attempt to the spine loopback on port 179 fails ✗
- The attempt times out → Active state ✗

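The activation rule can also be checked against a config file before deployment. This is a minimal sketch, assuming the config keeps the indentation that EOS uses in `show running-config` (top-level stanzas unindented, nested lines indented); the function names and the `FIXED`/`BROKEN` sample texts are hypothetical and only illustrate the idea.

```python
import re

# Matches "neighbor <address-or-peer-group> activate".
ACTIVATE = re.compile(r"^neighbor\s+(\S+)\s+activate$")


def activations_by_family(config: str) -> dict[str, list[str]]:
    """Map each address family to the neighbors or peer groups activated in it."""
    families: dict[str, list[str]] = {}
    current = None
    for raw in config.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # blank lines and EOS comment/separator lines
        if not raw[0].isspace():
            current = None  # a new top-level stanza ends any address family
            continue
        if line.startswith("address-family "):
            current = line.split(maxsplit=1)[1]
            families.setdefault(current, [])
            continue
        match = ACTIVATE.match(line)
        if match and current is not None:
            families[current].append(match.group(1))
    return families


def missing_activation(config: str, family: str) -> bool:
    """True when the family is absent or has no activated neighbor."""
    return not activations_by_family(config).get(family)


# Illustrative configs only, modeled on the snippet in this section.
FIXED = """\
router bgp 65000
   neighbor 10.0.250.1 peer group evpn
   neighbor 10.0.250.1 remote-as 65000
   !
   address-family evpn
      neighbor evpn activate
   !
   address-family ipv4
      neighbor 10.0.250.1 activate
"""
BROKEN = FIXED.replace("      neighbor evpn activate\n", "")

if __name__ == "__main__":
    print("fixed config missing EVPN activation:", missing_activation(FIXED, "evpn"))
    print("broken config missing EVPN activation:", missing_activation(BROKEN, "evpn"))
```

Running a check like this over `configs/spine1.cfg` and `configs/spine2.cfg` before `containerlab deploy` would have flagged the empty `address-family evpn` block described above.
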
## Testing the Fix

After deploying with the fix, the EVPN neighbors should transition to **Established** within seconds:

```
# Before fix
10.0.250.1  4  65000  0  0  0  0  00:02:05  Active

# After fix
10.0.250.1  4  65000  8  8  0  0  00:00:15  Estab
```

## Impact

This was a **critical bug** that:

- Prevented any EVPN overlay from functioning
- Made L2 VXLAN testing impossible
- Made L3 VXLAN testing impossible
- Prevented MAC learning via VXLAN
- Prevented EVPN route distribution

Once fixed, the entire EVPN overlay becomes operational immediately.

## Lesson Learned

In BGP multi-address-family configurations, **each neighbor must be explicitly activated in every address family it should use**. This includes:

- IPv4 unicast
- IPv6 unicast
- EVPN
- Route target filtering
- Any other address families being used

A common mistake is to define a neighbor globally but forget to activate it in all address families where it should be used.