2025-08-17 Adventures in Linux bonding
by Vasil Kolev

All this comes from our work on getting a stable and redundant link for servers without resorting to things like routing protocols, and is the product of some years of “research” (hitting the next problem and solving it).
The Linux kernel has a (deceptively simple) mechanism for “bonding” two or more interfaces into one, to be used for load balancing and fail-over. It is really well documented in https://www.kernel.org/doc/Documentation/networking/bonding.txt and if some of my explanations are lacking, that’s the proper document to look into.
There are all kinds of modes, but only two of them make sense in Ethernet networks: active-backup and 802.3ad. 802.3ad requires support from the switch side, and for proper redundancy when you’re connecting to two switches, it requires them to be stacked or clustered. Stacked switches crash together; clustered ones do too, just a bit less often. So, for proper redundancy you want the two switches you connect to to be as independent as possible, which rules out 802.3ad.
(802.3ad is the mode that uses LACP, the Link Aggregation Control Protocol. When the aggregated links go to two different switches, the setup is known as MLAG, multi-chassis link aggregation.)
For the sake of illustration, let’s say we have a network with two switches (SW1 and SW2) and two hosts, A and B. A is connected to SW1 and SW2, so is B.
By default the active-backup bond just monitors the state of the link (“miimon”). This would work in a better world, where switches die completely when they can no longer forward packets, but that’s not the case in the real one. Even without the “fun” failure modes where a switch stops forwarding without killing the links, there’s also the option that someone mis-configures a VLAN and blackholes the traffic you care about. As miimon can’t detect any such issue and fail over to the other interface, it’s not very useful in Ethernet networks.
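For reference, a minimal active-backup setup with only link-state monitoring looks roughly like the sketch below (iproute2 syntax; the interface names and the 100 ms miimon interval are placeholders, not something from our setup):

    # create the bond in active-backup mode, checking link state every 100 ms
    ip link add bond0 type bond mode active-backup miimon 100

    # enslave the two physical interfaces (they have to be down to be added)
    ip link set eth0 down
    ip link set eth1 down
    ip link set eth0 master bond0
    ip link set eth1 master bond0
    ip link set bond0 up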
There is another option, something called the “ARP monitor”: the bond sends ARP requests to a set of pre-configured targets and listens for the replies; if the replies stop, the interface is deemed dead and the bond fails over. This is a step in the right direction.
To have this properly configured, you need more than one target (so the death of that one target does not make the node think it has lost all connectivity), and to tell the bond that reaching any one of the targets is enough to consider the link active (the arp_all_targets option). All of this is fairly obvious and easy to set up, together with a suitable arp_interval that says how often to send the probes.
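Put together, the ARP-monitored version of the bond looks roughly like this (again a sketch in iproute2 syntax; the target addresses and the one-second interval are made up for the example):

    # active-backup bond using the ARP monitor instead of miimon:
    # probe two targets every 1000 ms; "arp_all_targets any" means reaching
    # any one of them is enough (this matters once arp_validate is enabled,
    # which comes up below)
    ip link add bond0 type bond mode active-backup \
        arp_interval 1000 \
        arp_ip_target 192.0.2.1,192.0.2.2 \
        arp_all_targets any

    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    ip link set bond0 up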
And all this would’ve been fine if not for the weird default behavior of the Linux bonding driver: it takes any received ARP traffic as an indication that the link is OK. So if you get disconnected from the targets, but there are some other hosts in the network partition created on that “side”, you’ll never switch to the other interface, even though you can’t reach the targets via the currently active one.
The next option is called “arp_validate”. It tells the bond to actually check whether the received ARP traffic comes from the targets (well, almost, but we’ll get to that). It can be told to validate no interface, only the active one, or all of them. I have no idea why the validation is not on by default, and I haven’t dug into the history enough to check, but I’m guessing it’s something related to the weak host model and the issues I’ve had with ARP.
“arp_validate=active” takes care of checking what’s received on the currently active interface, so it should resolve most issues. However, it doesn’t help when a backup interface is not correctly connected but still receives some ARP traffic: it will be reported as up, and you can’t look at /proc/net/bonding/bondX and know whether it’s safe to switch the active interface to it if you ever need to (so you switch, and hopefully bounce back pretty quickly). To be able to monitor the state of the backup interface as well, you should use “arp_validate=all”.
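A minimal sketch of turning this on for an existing bond and then checking what the kernel thinks of each slave (it assumes the ARP monitor from above is already configured; bond0 and the paths are the standard ones):

    # validate ARP on all slaves; this can also be given at creation time
    # as "arp_validate all" in the iproute2 command above
    echo all > /sys/class/net/bond0/bonding/arp_validate

    # per-slave state, including an "MII Status" line for every slave,
    # which now reflects whether the validated ARP traffic is being seen
    cat /proc/net/bonding/bond0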
Which would be great if it worked. This part took me some reading of the sources to get right; see this part of bond_arp_rcv() in drivers/net/bonding/bond_main.c, especially the three calls to bond_validate_arp() on lines 3284, 3288 and 3290. Is there something weird there?
That’s correct: in the check on line 3288 the “tip” and “sip” arguments (target and source IP) are passed in swapped order. A closer reading of the documentation and the source shows that if you validate the traffic on the backup interface, the ONLY traffic that will make that interface “up” is our own broadcast ARP requests sent to the targets, not, for example, broadcasts received from the targets themselves (which is what I expected). I’m thinking about submitting a patch to the documentation in that regard.
So, in general this should work, but we had a lot of cases where the backup interface refused to hear the broadcasts sent via the active interface. After some time we were able to trace these cases to two specific drivers – i40e and ice, both for Intel NICs.
Small bit of theory before explaining what the issue was – the Linux bond sets the same MAC address on all Ethernet interfaces in the same bond.
And there’s this small feature called source pruning, which roughly translates to “drop all received packets that have my own MAC address as source”. That might make sense in some situations, but definitely not in a bond with “arp_validate=all”, where the backup interface needs to see exactly those packets (our own ARP broadcasts, sent from the bond’s MAC) to be considered up. For i40e this is just an ethtool option, but for ice NICs it looks to be hardcoded; I’ve opened an issue for this to be configurable, so hopefully in some years it’ll get fixed.
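For i40e the knob is an ethtool private flag; as far as I can tell it is called “disable-source-pruning” on that driver, but private flag names are driver-specific, so check what --show-priv-flags reports first (eth0 below is just a placeholder for the slave interface):

    # both slaves carry the bond's MAC address, which is exactly what
    # source pruning then drops
    ip -brief link show master bond0

    # list the driver's private flags; on i40e the relevant one should be
    # "disable-source-pruning" (the name may differ between driver versions)
    ethtool --show-priv-flags eth0
    ethtool --set-priv-flags eth0 disable-source-pruning on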
postscript 1:
bpftrace is a wonderful tool that helped a lot here: attaching to bond_arp_rcv() and other functions and seeing what’s going on. For any issue that strace can’t see (like “something returns EINVAL and you have no idea why”), this looks to be the next step.
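As an example of the kind of poking around that helped, something along these lines counts how often bond_arp_rcv() fires and what it returns (a sketch; it assumes the symbol is visible to kprobes, and since the function is static it may be inlined or renamed on some kernel builds):

    # count calls to bond_arp_rcv() and group its return values
    bpftrace -e '
        kprobe:bond_arp_rcv    { @calls = count(); }
        kretprobe:bond_arp_rcv { @retvals[retval] = count(); }
    '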
postscript 2:
I noted above that only two bonding modes make sense in Ethernet networks. There are some more, which either broadcast (send all packets via all links) or balance the traffic based on some function. Both of these types will weird out the switches, because they’ll see the same source MAC address coming from two different places and will constantly be updating their tables about where that MAC lives, with very fun consequences.
These modes made a lot more sense for serial links back in the day: adding two SLIP or PPP links to a bond with the default mode (balance-rr, round-robin) was the cheapest way to get twice the bandwidth without any extra complexity.
The most resilient (but also the hardest to implement and operate) solution for anything like this would be full-mesh BFD and BGP, which would make sure you only use paths you can actually communicate over (it doesn’t help with MTU blackholes, but pretty much nothing does). I’ve never seen it done, though :)
Tags: work