High Availability

Thu Apr 15 15:12:38 IDT 2010

On 15 April 2010 17:50, guy keren <choo at actcom.co.il> wrote:
>
> and - test test test.....
>
> many people fail to test their "highly-available" setup, and as a result,
> think they are 'highly available" when they are not.

Great point!

Earlier in my current job we failed to test that fail-over still
worked on one of our servers after an upgrade, and got bitten the next
time the primary went south.

Since then it's our standard practice to:
1. Update the current stand-by.
2. Switch over (this effectively moves the service to the new version).
3. Wait a while and see that everything is fine (the current stand-by,
i.e. the former primary, is still ready with the old proven version).
4. Update the current stand-by.
5. Switch back to make sure HA works right.
(6. For good measure, switch again.)
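
The sequence above can be sketched roughly as follows. This is only an
illustrative Python model, not the actual scripts: Node, HAPair,
rolling_upgrade and the healthy callback are made-up names, and
switching back to the old version when a check fails is an assumption
based on the remark in step 3.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    version: str


@dataclass
class HAPair:
    """Hypothetical model of one primary/stand-by pair."""
    primary: Node
    standby: Node
    log: List[str] = field(default_factory=list)

    def switch_over(self) -> None:
        # Swap roles, the same way a real fail-over would.
        self.primary, self.standby = self.standby, self.primary
        self.log.append(f"switch: {self.primary.name} is now primary")


def rolling_upgrade(pair: HAPair, new_version: str,
                    healthy: Callable[[Node], bool],
                    extra_switch: bool = True) -> bool:
    """Run steps 1-6; return False as soon as a health check fails."""
    # 1. Update the current stand-by.
    pair.standby.version = new_version
    # 2. Switch over: the service now runs the new version.
    pair.switch_over()
    # 3. Wait and check. The stand-by still has the old version, so
    #    (assumption) a failure here is handled by switching back.
    if not healthy(pair.primary):
        pair.switch_over()
        return False
    # 4. Update the current stand-by (the former primary).
    pair.standby.version = new_version
    # 5. Switch back to prove that fail-over itself still works.
    pair.switch_over()
    if not healthy(pair.primary):
        return False
    # 6. For good measure, switch again.
    if extra_switch:
        pair.switch_over()
        if not healthy(pair.primary):
            return False
    return True


if __name__ == "__main__":
    pair = HAPair(Node("a", "1.0"), Node("b", "1.0"))
    ok = rolling_upgrade(pair, "2.0", healthy=lambda node: True)
    print(ok, pair.primary, pair.standby)
    print("\n".join(pair.log))
```

In a real setup healthy would be whatever monitoring check you trust,
and the "wait a while" in step 3 would be a proper soak period rather
than a single call.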

We also do all of this through an elaborate set of scripts around Xen,
Kickstart and Puppet (keeping the Xen guest images in LVs came in handy
when we had to roll back). The whole deployment procedure is completely
automated and repeatable, so we exercise not just the in-house
software and its configuration but also the deployment procedure
itself.

We reached a stage where a single operations engineer upgrades our
entire production system of about 50 virtual servers across 18
physical servers in three mornings of running automatic scripts (i.e.
no need for manual configuration changes).

(We stick to mornings because of a rule not to schedule production
changes in the afternoons or before a weekend, unless absolutely
necessary.)

--Amos