High Availability

Thu Apr 15 22:33:26 IDT 2010

The answer to the original post is "It's not HA, it's MA, however..."

Middle-Availability is not high-availability, of course, but it's a good
start.
HA is a very nice buzzword, but what should HA protect you against? Software
failure? OS failure? Hardware failure? Network failure? Storage failure?
Site failure? Internet failure?

Each of these adds some small (or huge) amount of money to the cost of the
total solution. You can protect yourself against everything you could ever
think of, but I presume a fully continent-distributed HA solution with zero
loss is not cheap, and is probably more than you need.

To give an example you can relate to: if you're going to buy a car, you
will (probably) not buy yourself a Ferrari. Not because it's not a good car.
It's an excellent car, probably far better than anything you imagined
yourself driving, but what you need is "good enough".

I tend to divide HA solutions into several groups, each with its own price
tag and level of protection. The solution should match the requirements, and
more than that, the customer should be aware of the protection he gets.
Don't expect it to do miracles, just know what's expected of it. HA is much
like backup, or like insurance: you want it there, but you don't want to use
it, and again, you would not buy the policy that insures your teeth for $4M.

The setup in the original question protects against these:
Application failure
OS failure
Application misconfiguration
It does not protect against network failure (maybe against link failure, if
it runs over bonding) or storage failure (I presume it's local storage).
However, it's a cheap solution and easy to understand.
The advantage of this specific solution is that the infrastructure is
already there. By investing the required amount of money, he can split it
across two separate machines with a shared storage device and gain some
protection against hardware failure, without any logical change to the
infrastructure or to the design of the current systems.
This is a solution he can grow with, as far as he wants. It is a good
solution given the expected expenses and the fact that most companies
cannot invest the capital required for enterprise-class HA solutions. Not
everyone can afford an EMC DMX4 with SRDF to another EMC, the leased line
between them, the geo-cluster software and the distributed, fully
redundant network devices involved. Compromises are common, and as long as
the cluster functionality is tested, and the behavior is expected and
documented, this can be called MA, which is poor people's HA.
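
To make "tested, expected and documented" a bit more concrete, here is a
minimal sketch of a fail-over drill in Python. It is a toy model only: Node,
Cluster and failover_drill are hypothetical names I made up for illustration,
not any real cluster software's API. The idea is to write down what you
expect before pulling the plug and compare it with what you observe.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Toy active/stand-by pair (hypothetical model, not a real cluster API)."""
    active: Node
    standby: Node
    log: list = field(default_factory=list)

    def fail_active(self):
        """Simulate the active node dying; a healthy stand-by takes over."""
        self.active.healthy = False
        self.log.append(f"{self.active.name} failed")
        if self.standby.healthy:
            self.active, self.standby = self.standby, self.active
            self.log.append(f"{self.active.name} took over")

    def service_up(self):
        return self.active.healthy


def failover_drill(cluster):
    """Run one drill and return (expected, observed) active node names."""
    # Write down the expectation before breaking anything.
    expected = cluster.standby.name if cluster.standby.healthy else None
    cluster.fail_active()
    observed = cluster.active.name if cluster.service_up() else None
    return expected, observed


cluster = Cluster(active=Node("node-a"), standby=Node("node-b"))
expected, observed = failover_drill(cluster)
print(f"expected active: {expected}, observed active: {observed}")
print(cluster.log)
```

In a real drill, fail_active would be replaced by whatever you actually do
to the machine (stopping the service, pulling a cable), and the log is the
part you keep as documentation.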

I say - if you know what you're doing - this is good enough.

Ez

P.S. Amos, sorry for sending this to you directly. On this list, "reply"
goes to the person who sent the mail, and not to the list.

Ez

On Thu, Apr 15, 2010 at 3:12 PM, Amos Shapira <amos.shapira at gmail.com> wrote:

> On 15 April 2010 17:50, guy keren <choo at actcom.co.il> wrote:
> >
> > and - test test test.....
> >
> > many people fail to test their "highly-available" setup, and as a result,
> > think they are "highly available" when they are not.
>
> Great point!
>
> Earlier in my current position we failed to test that fail-over
> worked on one of our servers after an upgrade, and got bitten the next
> time the primary went south.
>
> Since then it's our standard practice to:
> 1. Update the current stand-by.
> 2. Switch over (this effectively updates the service to the new version).
> 3. Wait a while and see that everything is fine (the current stand-by
> is still ready with the old, proven version).
> 4. Update the current secondary.
> 5. Switch back to make sure HA works right.
> (6. For good measure, switch again.)
>
> We also do all of this through an elaborate set of scripts around Xen,
> kickstart and Puppet (keeping the Xen guest images in LVs came in handy
> when we had to roll back), so the whole deployment procedure is
> completely automated and repeatable. That way we exercise not just the
> in-house built software and its configuration setup but also the
> deployment procedure itself.
>
> We reached a stage where a single operations engineer upgrades our
> entire production system of about 50 virtual servers across 18
> physical servers in three mornings of running automatic scripts (i.e.
> no need for manual configuration changes).
>
> (We stick to mornings because of a rule not to schedule production
> changes in the afternoons or before a weekend, unless absolutely
> necessary).
>
> --Amos
>
> _______________________________________________
> Linux-il mailing list
> Linux-il at cs.huji.ac.il
> http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
>
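
The stand-by-first upgrade procedure quoted above can be sketched roughly
like this in Python. This is only an illustration of the order of the steps,
not the actual scripts: rolling_upgrade, the versions dict and check_ok are
hypothetical names, and switching back to the untouched node when the check
fails is my assumption about how the roll-back would go.

```python
def rolling_upgrade(versions, active, standby, new_version, check_ok):
    """Stand-by-first upgrade of an active/stand-by pair.

    versions: dict mapping node name -> installed version (updated in place)
    check_ok: callable(active_name) -> bool, stands in for "wait a while"
    Returns the (active, standby) names at the end of the procedure.
    """
    # 1. Update the current stand-by.
    versions[standby] = new_version
    # 2. Switch over: the service now runs the new version.
    active, standby = standby, active
    # 3. Wait and watch; the stand-by still holds the old, proven version.
    if not check_ok(active):
        # Assumed roll-back: switch to the node that was never touched.
        active, standby = standby, active
        return active, standby
    # 4. Update the other node.
    versions[standby] = new_version
    # 5. Switch back to make sure fail-over still works.
    active, standby = standby, active
    # 6. For good measure, switch again.
    active, standby = standby, active
    return active, standby


versions = {"srv1": "1.0", "srv2": "1.0"}
result = rolling_upgrade(versions, "srv1", "srv2", "1.1", lambda node: True)
print(result, versions)
```

The useful property is that until step 4 there is always one node running
the old version, so a bad upgrade costs one switch-over and nothing more.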