<html style="direction: ltr;">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<style type="text/css">body p { margin-bottom: 0.2cm; margin-top: 0pt; } </style>
</head>
<body style="direction: ltr;"
bidimailui-detected-decoding-type="UTF-8" bgcolor="#FFFFFF"
text="#000000">
<div class="moz-cite-prefix">On 17/05/13 10:13, Ghiora Drori wrote:<br>
</div>
<blockquote
cite="mid:CAFR4KAni1mCg48z=cbitBderoMVTmUJfmGyrf0wkJTNJh_weww@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div>As to reliability: (This is effectively a contract):<br>
</div>
</div>
</blockquote>
No, it isn't (see below).<br>
<blockquote
cite="mid:CAFR4KAni1mCg48z=cbitBderoMVTmUJfmGyrf0wkJTNJh_weww@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><a moz-do-not-send="true"
href="https://aws.amazon.com/glacier/#highlights">https://aws.amazon.com/glacier/#highlights</a><br>
Quote: "Amazon Glacier is designed to provide average annual
durability of 99.999999999% " <br>
</div>
<div>If this is not good enough for you too bad.<br>
<br>
</div>
</div>
</blockquote>
When you see someone, anyone, saying such a thing, run. As fast and
as far as you can.<br>
<br>
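This level of assurance is usually called "nine nines" (henceforth
9*9), although the quoted figure actually has eleven of them. Read as
an availability figure, it amounts to less than one thousandth of a
second of downtime a year. Amazon is talking out of their asses in
offering it.<br>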
<br>
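For anyone who wants to check the arithmetic, here is a minimal
sketch. It assumes the quoted percentage is read as availability and
that a year is 365.25 days; the function name is made up for
illustration.<br>

```python
# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_budget_seconds(unavailability):
    """Allowed downtime per year for a given unavailable fraction."""
    return unavailability * SECONDS_PER_YEAR


# 99.999999999% leaves an unavailable fraction of 1e-11.
budget = downtime_budget_seconds(1e-11)
print(f"{budget * 1000:.3f} ms of downtime per year")
```

The result is roughly a third of a millisecond, so rounding it up to
one millisecond, as the rest of this mail does, is being generous.<br>
<br>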
First, even if their service is 100% reliable, you will not get 9*9
of service. Your home internet connection is not that reliable. The
fiber connecting Israel to the world is not that reliable. The BGP
protocol that is meant to keep the internet alive should a link go
down is not that reliable. No matter what Amazon are doing, nine
nines is not the SLA you will be getting.<br>
<br>
Now, you might claim that that is not Amazon's fault. THEY are
providing 9*9, and it is the rest of the internet that is not
reliable enough. This claim is bullshit. They are not.<br>
<br>
No single server can provide 9*9. Servers fail. Hard disks fail.
Memory fails. NICs fail. Network switches fail. In order to provide
a 9*9 SLA, you must be able to detect each and every one of those
failures + provide an alternative path <b>in less than 1 millisecond</b>,
plus assure that only one such failure happens in a year for every
customer. It is not impossible to build such a system, but it will
not be affordable. The very fact that Amazon is affordable means
that they are not providing 9*9, nor anything even close.<br>
<br>
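To make the "one failure a year" point concrete, here is a rough
sketch. The helper name and the 30 second reboot figure are
hypothetical, and the one millisecond budget is the rounded number
used in this mail.<br>

```python
def allowed_failovers_per_year(budget_ms, failover_ms):
    """How many failovers of a given duration fit into the yearly budget."""
    return int(budget_ms // failover_ms)


BUDGET_MS = 1.0  # rounded yearly downtime budget used in this mail

# A failover that completes in exactly 1 ms uses up the whole year.
print(allowed_failovers_per_year(BUDGET_MS, 1.0))

# Hypothetical: a failover that takes 30 seconds does not fit even once.
print(allowed_failovers_per_year(BUDGET_MS, 30000.0))
```

Anything slower than the budget itself cannot happen even once
without breaking the promise.<br>
<br>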
Just to give you a taste of how expensive such a system might be,
take heed of the following interesting fact. I just ran a ping
between two computers connected via a crossover Ethernet cable over a
1Gb/s link. The average ping time was 0.431ms. In other words, just
the round-trip time (including kernel wakeup and related activities)
between two computers connected over a 3 meter cable is half the
time you have at your disposal to react to a downtime <b>per year</b>.
At this rate, you cannot afford to ping a second time in the hope
that the machine was just slightly busy, or that the packet was
lost. If you do not get a reply within half a millisecond, you must
act. You only have half a millisecond to set up the actual
diversion.<br>
<br>
What about further away computers? From my home, pinging a server
located at the server farm of the same ISP I'm connected to takes
17ms. This means I cannot react to a server downtime in less time
than half that no matter what. If the server is down, it will take
me no less than 8ms to even find out about it. That is, by the time
I find out that the server is down, I am already violating my SLA by a
factor of 8. The only way to have redundancy is to be on the same
segment and use specialized low-latency equipment. Since the ISP's
link itself might go down, and since BGP is nowhere near fast enough to
recover, <b>the only way to provide a 9*9 service is to build a
duplicate of the internet in order to do so</b>.<br>
<br>
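The comparison above can be sketched as follows. It takes half the
round-trip time as the lower bound on noticing a failure, and uses the
rounded one millisecond yearly budget; the helper name is made up, and
the two ping figures are the ones I measured above.<br>

```python
def min_detection_ms(round_trip_ms):
    """Lower bound on noticing a failure: the one-way latency."""
    return round_trip_ms / 2


BUDGET_MS = 1.0  # rounded yearly downtime budget used in this mail

measurements = [
    ("crossover cable", 0.431),   # measured round trip, in ms
    ("same-ISP server", 17.0),    # measured round trip, in ms
]

for label, rtt_ms in measurements:
    detect_ms = min_detection_ms(rtt_ms)
    ratio = detect_ms / BUDGET_MS
    print(f"{label}: at least {detect_ms} ms to notice, {ratio:.1f}x the budget")
```

Even this lower bound is optimistic, since in practice you only know
a reply is missing after waiting at least a full round trip.<br>
<br>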
I think we can all agree that Amazon did not do that, or their
service would have been, by several orders of magnitude, more
expensive than it is. However, supposing that money was no object,
would that work? The answer is "no".<br>
<br>
The reason the answer is no is that external factors were not taken
into account. A 9*9 SLA means that the chances of a problem are less
than 1:10^11. The chances of a Richter 8+ earthquake, tsunami,
volcanic eruption or meteorite strike are all higher than that.<br>
<br>
TLDR version:<br>
The SLA is not a contractual question. Especially when counting
nines, it is a technological infrastructure question. Amazon is not
providing the nine nines it seems to be promising, and is therefore
lying on its SLA.<br>
<blockquote
cite="mid:CAFR4KAni1mCg48z=cbitBderoMVTmUJfmGyrf0wkJTNJh_weww@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>( I do not work for Amazon)<br>
</div>
</div>
</blockquote>
I do not work for Amazon either. I did use to run a service that was
a (very humble) competitor to this one (in which we did not offer
SLA for service availability at all, only for the actual data). I
currently work for Akamai, for which Amazon is a competitor (though
not this particular service).<br>
<br>
It should be clear that I do not speak on behalf of my employer. All
opinions are my own, and only my own.<br>
<br>
Shachar
</body>
</html>