2008-12-19 update protocol

by Vasil Kolev

At Venski’s birthday, Bachiyski and I got into an interesting conversation about mass-updating already deployed software (written in a scripting language, though that doesn’t matter much). He wanted to cut his deployment time from the current 45 seconds, so I’m describing one of the ideas I came up with, so that Pentchev (who joined the discussion too) can find flaws in it.

The main principle is as follows: one server sends packets with the update (in diff form, for example) over multicast, and the machines receive and apply it. It should work like this:

1. The main server sends one hello packet to the group, to say that an update follows.
2. It waits for a timeout T1 and, if some machine hasn’t answered (it has a list of the machines), resends the hello.
3. Step 2 is repeated up to three times; any machine that still hasn’t answered is removed from the list and a notification is sent to the admins.
4. The update is sent, split into packets, and every packet contains:
4.1. The ID of the update (IDs are consecutive, and each client keeps a list of received updates).
4.2. A sequence number of the packet (to track packet loss).
4.3. A signature of the packet (md5sum of the content + shared secret).
4.4. A diff in some format.
5. Every client that has received the update:
5.1. Checks whether it has already been applied and, if so, doesn’t apply it again, just acknowledges it.
5.2. If there are gaps in it, doesn’t send an acknowledgement.
5.3. If the update does not directly follow the last applied one, doesn’t acknowledge it and sends a notification to the admins.
5.4. If the whole update was received, checks whether it can be applied. If not, sends a notification and doesn’t acknowledge.
5.5. If the whole update was received and is applicable, applies it, records it and sends an acknowledgement.
6. If the central server doesn’t receive an acknowledgement from everyone within timeout T2, it repeats step 4, up to three times. The admins are notified about the machines from which no acknowledgement was received.
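
To make the steps above a bit more concrete, here is a rough Python sketch of the packet format from step 4, the client-side checks from step 5 and the retry loop from steps 2-3 and 6. It is only an illustration under assumptions that are not part of the idea itself: the header layout, the packet count field (added so that a client can notice gaps at the end), the use of a single `last_applied` number instead of a full list of received updates, and all function names are made up. The actual multicast sending and receiving is left out and passed in as callables.

```python
import hashlib
import struct

# Hypothetical wire layout: update ID, packet sequence number, packet count.
# The packet count is an assumption, added so gaps at the end are detectable.
HEADER = struct.Struct("!IHH")
DIGEST_LEN = 16  # size of an MD5 digest in bytes


def make_packet(update_id, seq, total, chunk, secret):
    """Step 4: header + piece of the diff + md5(content + shared secret)."""
    body = HEADER.pack(update_id, seq, total) + chunk
    return body + hashlib.md5(body + secret).digest()


def parse_packet(packet, secret):
    """Return (update_id, seq, total, chunk), or None if the signature is bad."""
    if len(packet) < HEADER.size + DIGEST_LEN:
        return None
    body, signature = packet[:-DIGEST_LEN], packet[-DIGEST_LEN:]
    if hashlib.md5(body + secret).digest() != signature:
        return None
    update_id, seq, total = HEADER.unpack_from(body)
    return update_id, seq, total, body[HEADER.size:]


def client_decision(update_id, last_applied, received_seqs, total, applicable):
    """Step 5, assuming update IDs are consecutive integers."""
    if update_id <= last_applied:                # 5.1 already applied
        return "ack"
    if set(received_seqs) != set(range(total)):  # 5.2 gaps, stay silent
        return "silent"
    if update_id != last_applied + 1:            # 5.3 not the next update
        return "notify"
    if not applicable:                           # 5.4 diff cannot be applied
        return "notify"
    return "apply_and_ack"                       # 5.5


def send_with_retries(send, wait_for_answers, machines, attempts=3):
    """Steps 2-3 and 6: send, wait, resend while somebody is still silent.

    `send` transmits the hello or the update; `wait_for_answers(pending)`
    blocks for the timeout and returns the set of machines that answered.
    Both are hypothetical callables. Returns the machines to report.
    """
    pending = set(machines)
    for _ in range(attempts):
        if not pending:
            break
        send()
        pending -= wait_for_answers(pending)
    return pending
```

Whatever `send_with_retries` returns is the set of machines the admins would be notified about. The sketch counts three transmissions in total; whether "three times" means three sends or one send plus three resends is a detail the description above leaves open.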

The question is: does this protocol have any serious flaws, and can it be implemented in a few days by a normal programmer (not a lazy bastard like me)? Also, has something like this already been implemented? :)
