High Availability
guy keren
choo at actcom.co.il
Thu Apr 15 10:50:29 IDT 2010
Marc Volovic wrote:
> A number of issues:
>
> First - the what: you need to replicate (a) links, (b) storage, (c) service
> machines.
>
> Links are internal and external. Multipath internet connexions.
> Multipath LAN connexions. Multipath storage links. Redundant network
> infrastructure (switches, routers, firewalls, IDS/IPS).
>
> Replicate storage. If you use a SAN with dedicated links, multipath both the
> links and the storage. Use redundant storage hardware and add storage
> replication. Add auto-promotion, takeover, and (if possible) partition
> prevention mechanisms. Use STONITH.
>
> Service machines are the easiest to replicate. Simple heartbeat will
> provide a significant level of failover and/or failback. Here, likewise,
> use STONITH or other partition prevention mechanisms.
>
> Under-utilize. 70% duty cycle is good.
>
> Expect cost hikes.
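
(for concreteness, a minimal two-node heartbeat setup along the lines marc
describes might look like the sketch below - the node names, interfaces,
addresses and paths are made-up examples, and the stonith line depends
entirely on your fencing hardware:)

    # /etc/ha.d/ha.cf - minimal two-node configuration
    keepalive 2              # send a heartbeat every 2 seconds
    deadtime 30              # declare the peer dead after 30s of silence
    bcast eth1               # dedicated heartbeat LAN
    serial /dev/ttyS0        # second, independent heartbeat path
    auto_failback off
    node node1
    node node2
    # stonith_host * <plugin> <params...>   # e.g. a power-switch plugin

    # /etc/ha.d/haresources - resources owned by node1 by default
    node1 IPaddr::192.168.1.100 Filesystem::/dev/sda1::/data::ext3 httpd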
and - test, test, test.....
many people fail to test their "highly available" setup, and as a
result think they are "highly available" when they are not.
testing should include various types of scenarios, to expose bugs in the
various tools as well as configuration errors.
examples: you set up multipath to the storage, but the default I/O
timeouts are too long -> in some scenarios this easily causes a multipath
failover to take several minutes.
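
(for instance, the per-device SCSI command timeout often defaults to 30
seconds, and with several retries a path failure can stall I/O for minutes
before multipath reacts. the device names and values below are examples
only:)

    # check / shorten the SCSI command timeout on a path device:
    cat /sys/block/sdb/device/timeout
    echo 10 > /sys/block/sdb/device/timeout

    # /etc/multipath.conf - bound how long I/O queues when all paths
    # are down, instead of queueing forever:
    defaults {
        polling_interval 5      # path checker runs every 5 seconds
        no_path_retry    12     # give up after ~12*5s=60s with no paths
    }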
you set up heartbeat and think everything is ok - but then you find that
it doesn't really notice a failure in access to the storage system, and
when there's a connectivity problem to your SAN from the active node
only, it doesn't fail over to the passive node.
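
plain heartbeat only watches node liveness, not resource health - if you
want it to react to losing the SAN, you have to add that check yourself.
a crude sketch (the device name and script path are made up; hb_standby
ships with heartbeat, and timeout is from coreutils):

    #!/bin/sh
    # storage-check.sh - run periodically (e.g. from cron) on the active
    # node; if the shared storage stops answering, hand over to the peer.
    DEV=/dev/mapper/sanvol                  # example multipath device
    # a read on a dead SAN may hang rather than fail, so bound it:
    if ! timeout 10 dd if=$DEV of=/dev/null bs=512 count=1 iflag=direct \
            2>/dev/null; then
        logger "storage check failed - initiating failover"
        /usr/share/heartbeat/hb_standby
    fi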
only with rigorous testing will you find these issues - and usually not
the first time you test (because this testing is tedious, and because
some problems are not easy to simulate - e.g. try simulating a hard-disk
failure - plus, sometimes there are races, and a given test will fail
only once every few attempts...)
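
that said, some failures are easier to inject than people think - a few
examples of the kind of tests i mean (device names and addresses are
placeholders):

    # "pull" a disk without touching the hardware:
    echo offline > /sys/block/sdb/device/state

    # cut connectivity to an iscsi SAN from the active node only:
    iptables -A OUTPUT -d 10.0.0.50 -p tcp --dport 3260 -j DROP

    # kill the cluster stack abruptly instead of shutting it down cleanly:
    pkill -9 heartbeat

and repeat each test several times - the racy failures only show up on
some of the runs.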
--guy