2025-08-17 Adventures in Linux bonding

by Vasil Kolev

All this comes from our work on getting a stable and redundant link for servers without resorting to things like routing protocols, and is the product of some years of “research” (hitting the next problem and solving it).

The Linux kernel has a (deceptively simple) mechanism for “bonding” two interfaces as one, to use for load-balancing and fail-over. It is really well documented in https://www.kernel.org/doc/Documentation/networking/bonding.txt and if some of my explanations are lacking, that’s the proper document to look into.

There are all kinds of modes, but only two make sense in Ethernet networks – active-backup and 802.3ad. 802.3ad requires support from the switch side, and for proper redundancy when connecting to two switches it requires them to be stacked or clustered. Stacked switches crash together; clustered ones too, just a bit less often. So, for proper redundancy you want the two switches you connect to to be as independent as possible, which rules out 802.3ad.
(802.3ad is also known as LACP, the Link Aggregation Control Protocol. When connecting to two different switches, it’s known as MLAG, multi-chassis link aggregation.)

For the sake of illustration, let’s say we have a network with two switches (SW1 and SW2) and two hosts, A and B. A is connected to SW1 and SW2, so is B.

By default the active-backup bond just monitors the state of the link (“miimon”). This would work in a better world, where switches die completely when they can’t forward packets, but that’s not the case in the real world. Even without the “fun” failure modes where a switch stops forwarding without killing the links, there’s also the possibility that someone misconfigures a VLAN and blackholes the traffic you care about. As miimon can’t detect any such issue and fail over to the other interface, it’s not very useful in Ethernet networks.
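For reference, a minimal active-backup bond with just miimon looks roughly like this (iproute2 syntax; the interface names are placeholders, and the same options can also be set as module parameters or via sysfs):

    # create the bond with link-state monitoring every 100 ms
    ip link add bond0 type bond mode active-backup miimon 100
    # slaves have to be down before they can be enslaved
    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    ip link set bond0 up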

There is another option, something called “ARP monitor” – the bond sends ARP requests to a set of pre-configured targets and listens for the replies; if the replies stop, the interface is deemed dead and the bond fails over. This is a step in the right direction.

To have this properly configured, you need more than one target (so the death of that one target does not make the node think it has no connectivity), and to configure it so that reaching any one of the targets is enough to consider the link active (the arp_all_targets option). All of this is somewhat obvious and easy to set up, together with a suitable arp_interval to control how often the probes are sent.
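Put together, an ARP-monitored active-backup bond could be set up along these lines (iproute2 syntax again; the target addresses and the one-second interval are placeholders to adjust for your network):

    # send ARP probes every 1000 ms to two independent targets;
    # "any" means reaching any one of them is enough to keep a link up
    ip link add bond0 type bond mode active-backup \
        arp_interval 1000 \
        arp_ip_target 192.0.2.1,192.0.2.2 \
        arp_all_targets any
    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    ip link set bond0 up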

And all this would’ve been fine if not for the weird default behavior of the Linux bonding driver, which takes any received ARP traffic as an indication that the link is OK. So if you get disconnected from the targets, but there are some other hosts in the partition created on that “side” of the network, you’ll never switch to the other interface, never mind that you can’t reach the targets via the currently active interface.

The next option is called “arp_validate”. It tells the bond to actually check whether the received ARP traffic comes from the targets (well, almost, but we’ll get to that). It can be told to validate no interface, only the active one, or all of them. I have no idea why the validation is not on by default, and I haven’t dug into the history enough to check, but I’m guessing it’s related to the weak host model and the issues I’ve had with ARP.

“arp_validate=active” takes care of checking what’s received on the currently active interface, so it should resolve most issues. However, it doesn’t help when the backup interface is not correctly connected but still receives some ARP traffic: you can’t look at /proc/net/bonding/bondX and know whether it’s safe to switch the active interface for some other reason (so you switch, and hopefully bounce back pretty quickly). To also monitor the state of the backup interface, you should use “arp_validate=all”.
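In practice that just means adding one more option to the bond from the previous sketch, and then watching the per-slave state in /proc:

    # same bond as before, with validation on every slave
    ip link add bond0 type bond mode active-backup \
        arp_interval 1000 arp_ip_target 192.0.2.1,192.0.2.2 \
        arp_all_targets any arp_validate all
    # per-slave state, including the MII status the ARP monitor drives
    cat /proc/net/bonding/bond0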

Which would be great if that worked. This part took me some reading of the sources to get right; see this part of bond_arp_rcv() in drivers/net/bonding/bond_main.c, especially the three calls to bond_validate_arp() on lines 3284, 3288 and 3290. Is there something weird there?

That’s correct: in the check on line 3288 the “sip” and “tip” arguments (source and target IP) are swapped. A closer reading of the documentation and the source shows that if you validate the traffic on the backup interface, the ONLY traffic that will make this interface “up” is our own broadcast traffic sent to the targets, not, for example, broadcasts received from the targets themselves (which is what I expected). I’m thinking about submitting a patch to the documentation in that regard.

So, in general this should work, but we had a lot of cases where the backup interface refused to hear the broadcasts sent via the active interface. After some time we were able to trace these cases to two specific NIC drivers – i40e and ice, both Intel NICs.

Small bit of theory before explaining what the issue was – the Linux bond sets the same MAC address on all Ethernet interfaces in the same bond.
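This is easy to see on a running system – the bond and both of its slaves report the same link-layer address (interface names here are placeholders):

    # the bond and both enslaved NICs show the same MAC address
    ip -br link show | grep -E 'bond0|eth0|eth1'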

And there’s this small option called source pruning, which roughly translates to “drop all received packets that have my MAC address as source”. That might make sense in some situations, but definitely not in a bond with “arp_validate=all”. For i40e this is just an ethtool option, but for ice NICs it looks to be hardcoded, and I’ve opened an issue for this to be configurable, so hopefully in some years it’ll get fixed.
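For i40e the knob is an ethtool private flag. A hedge here: the flag name below is what I’ve seen on recent i40e drivers, so check --show-priv-flags on your specific card and driver version first (and ethtool -i tells you which driver you’re dealing with):

    # which driver is behind this port?
    ethtool -i eth0
    # list the driver's private flags and their current state
    ethtool --show-priv-flags eth0
    # on i40e, source pruning is exposed as a private flag (name may vary by driver version)
    ethtool --set-priv-flags eth0 disable-source-pruning on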

postscript 1:

bpftrace is a wonderful tool that helped a lot here – attaching to bond_arp_rcv() and other functions to see what’s going on. For any issue which strace can’t see (like “something returns EINVAL and you have no idea why”), this looks to be the next step.
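For example, just counting how often bond_arp_rcv() gets called and what it returns already tells you whether your ARP probes are being seen and accepted at all (a minimal sketch; the probe only resolves while the bonding module is loaded):

    # count calls and group return values of bond_arp_rcv()
    bpftrace -e '
        kprobe:bond_arp_rcv { @calls = count(); }
        kretprobe:bond_arp_rcv { @ret[retval] = count(); }
    '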

postscript 2:

I noted that two bonding mechanisms make sense in Ethernet networks. There are some more modes, which either “broadcast” (send all packets via all links) or balance the traffic based on some function. Both types of modes will weird out the switches, because they’ll see traffic from the same MAC address coming from two different places and will constantly be updating their tables with where that MAC is, with very fun consequences.

These modes made a lot more sense for serial links back in the day: adding two SLIP or PPP links to a bond with the default mode (balance-rr, round-robin) was the cheapest way to get twice the bandwidth without any extra complexity.

The most resilient (but hardest to implement and more complex) solution for anything like this would be full-mesh BFD and BGP, which would make sure you only use paths that you can communicate on (doesn’t help with MTU blackholes, but pretty much nothing does). I’ve never seen it done, though :)

Update: OF COURSE it couldn’t be so easy. As soon as we deployed the disabling of source pruning at a place with LACP, we got:

    spbond0: (slave lacp0): An illegal loopback occurred on slave
    Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

Turns out that if you have an i40e NIC, Virtual Functions (VFs) and a LACP bond, and disable source pruning, the card starts sending back a copy of all broadcast, LLDP, LACP and fsck knows what other packets. This has at least one fun side effect: it kills DHCP for VMs connected via the Linux bridge. The bridge is a learning switch, and as soon as a VM sends a broadcast packet, the bridge learns (because of the copy sent back) that the VM’s MAC is on the external port, not the internal one, thus effectively blackholing all VM traffic. And this is very hard to see for already running VMs, as their traffic is almost entirely unicast, and any unicast packet from the VM “fixes” the problem.
So, for example, it’s impossible for DHCP to work, or, even better, for the VM to initially get the ARP reply for the MAC address of the default gateway.
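If you suspect you’re hitting this, the bridge’s forwarding database shows on which port a given MAC was learned (the bridge name and the MAC below are placeholders):

    # after the VM sends a broadcast, its MAC should NOT show up on the uplink port
    bridge fdb show br br0 | grep 52:54:00:12:34:56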

Update 2: Actually, the LACP bond has no implications whatsoever; you just need VFs and source pruning disabled for the broadcasts-sent-back bug to trigger. So, this is not fixable until Intel decides to fix it…

Argh.


2 Responses to “2025-08-17 Adventures in Linux bonding”

  1. Иван Says:

    BFD + iBGP, probably with route reflectors, and there’s no need to reinvent the wheel; besides, it’s many times easier for maintenance & troubleshooting even in a multi-datacenter setup (imho, of course)

  2. Vasil Kolev Says:

    A route reflector doesn’t solve partial connectivity problems and similar nastiness; that’s why I wrote about full mesh (or just speaking it with the switches). It also means moving to layer 3, which for everyone providing L2 services (“an internal network for the customer”) means VXLANs and all sorts of more complex things.

    And which one is easier depends a lot on what people have already worked with, and unfortunately BGP only gets touched at ISPs and the like; service providers in general rarely do it themselves…
