High Availability
guy keren
choo at actcom.co.il
Thu Apr 15 10:50:29 IDT 2010
Marc Volovic wrote:
> A number of issues:
>
> First - the what: you need to replicate (a) links, (b) storage, (c) service
> machines.
>
> Links are internal and external. Multipath internet connexions.
> Multipath LAN connexions. Multipath storage links. Redundant network
> infrastructure (switches, routers, firewalls, IDS/IPS).
>
> Replicate storage. If you use a SAN with dedicated links, multipath both the
> links and the storage. Use redundant storage hardware and add storage
> replication. Add auto-promotion, takeover, and (if possible) partition
> prevention mechanisms. Use STONITH.
>
> Service machines are the easiest to replicate. Simple heartbeat will
> provide a significant level of failover and/or failback. Here, likewise,
> use STONITH or other partition prevention mechanisms.
>
> Under-utilize. 70% duty cycle is good.
>
> Expect cost hikes.
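
(for concreteness, a minimal two-node heartbeat setup along the lines marc
describes might look like the sketch below - the node names, interfaces,
addresses and paths are made-up examples, and the stonith line depends
entirely on your fencing hardware:)

    # /etc/ha.d/ha.cf - minimal two-node configuration
    keepalive 2              # send a heartbeat every 2 seconds
    deadtime 30              # declare the peer dead after 30s of silence
    bcast eth1               # dedicated heartbeat LAN
    serial /dev/ttyS0        # second, independent heartbeat path
    auto_failback off
    node node1
    node node2
    # stonith_host * <plugin> <params...>   # e.g. a power-switch plugin

    # /etc/ha.d/haresources - resources owned by node1 by default
    node1 IPaddr::192.168.1.100 Filesystem::/dev/sda1::/data::ext3 httpd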
and - test, test, test.....
many people fail to test their "highly available" setup, and as a
result think they are "highly available" when they are not.
testing should include various types of scenarios, to expose bugs in the
various tools as well as configuration errors.
examples: you set up multipath to the storage, but the default I/O
timeouts are too long -> in some scenarios this easily causes a multipath
failover to take several minutes.
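
(for instance, the per-device SCSI command timeout often defaults to 30
seconds, and with several retries a path failure can stall I/O for minutes
before multipath reacts. the device names and values below are examples
only:)

    # check / shorten the SCSI command timeout on a path device:
    cat /sys/block/sdb/device/timeout
    echo 10 > /sys/block/sdb/device/timeout

    # /etc/multipath.conf - bound how long I/O queues when all paths
    # are down, instead of queueing forever:
    defaults {
        polling_interval 5      # path checker runs every 5 seconds
        no_path_retry    12     # give up after ~12*5s=60s with no paths
    }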
you set up heartbeat and think everything is ok - but then you find that
it doesn't really notice a failure in access to the storage system, and
when there's a connectivity problem to your SAN from the active node
only, it doesn't fail over to the passive node.
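
plain heartbeat only watches node liveness, not resource health - if you
want it to react to losing the SAN, you have to add that check yourself.
a crude sketch (the device name and script path are made up; hb_standby
ships with heartbeat, and timeout is from coreutils):

    #!/bin/sh
    # storage-check.sh - run periodically (e.g. from cron) on the active
    # node; if the shared storage stops answering, hand over to the peer.
    DEV=/dev/mapper/sanvol                  # example multipath device
    # a read on a dead SAN may hang rather than fail, so bound it:
    if ! timeout 10 dd if=$DEV of=/dev/null bs=512 count=1 iflag=direct \
            2>/dev/null; then
        logger "storage check failed - initiating failover"
        /usr/share/heartbeat/hb_standby
    fi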
only with rigorous testing will you find these issues - and usually not
the first time you test (because this testing is tedious, and because
some problems are not easy to simulate - e.g. try simulating a hard-disk
failure - plus, sometimes there are races, and a given test will fail
only once every few attempts...)
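
that said, some failures are easier to inject than people think - a few
examples of the kind of tests i mean (device names and addresses are
placeholders):

    # "pull" a disk without touching the hardware:
    echo offline > /sys/block/sdb/device/state

    # cut connectivity to an iscsi SAN from the active node only:
    iptables -A OUTPUT -d 10.0.0.50 -p tcp --dport 3260 -j DROP

    # kill the cluster stack abruptly instead of shutting it down cleanly:
    pkill -9 heartbeat

and repeat each test several times - the racy failures only show up on
some of the runs.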
--guy