<div dir="ltr">Interesting thread about ZFS and large disks bit-rot...<div><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Zenaan Harkness</b> <span dir="ltr"><<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>></span><br>Date: 10 June 2015 at 11:52<br>Subject: [SLUG] Fwd: 8TiB HDD, 10^14 bit error rate, approaching certainty of error for each "drive of data" read<br>To: <a href="mailto:slug@slug.org.au">slug@slug.org.au</a><br><br><br>FYI<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 10 Jun 2015 11:50:48 +1000<br>
Subject: 8TiB HDD, 10^14 bit error rate, approaching certainty of<br>
error for each "drive of data" read<br>
To: <a href="mailto:d-community-offtopic@lists.alioth.debian.org">d-community-offtopic@lists.alioth.debian.org</a><br>
<br>
Seems ZFS' and BTRFS' time has come. ZFS on Linux (ZFSoL) seems more<br>
stable to me, and has 10 years of deployment under its belt too.<br>
<br>
Any news on Debian GNU/Linux distributing ZFSoL? We see ZFS on Debian<br>
GNU/kFreeBSD being distributed by Debian...<br>
<br>
FYI<br>
Zenaan<br>
<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Tue, 26 May 2015 20:31:41 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
On 5/25/15, Michael wrote:<br>
> The LVM volumes on the external drives are ok.<br>
<br>
Reminds me: I've also been reading heaps about ZFS over the last<br>
couple of days. HDD error rates are close to biting us with current-gen<br>
filesystems (like ext4). Armour-plate your arse with some ZFS (or<br>
possibly the less battle-tested BTRFS) armour.<br>
<br>
At a URE (unrecoverable read error) rate of one per 10^14 bits read<br>
from a drive (most consumer drives are 10^14 - one advertises 2^15,<br>
and enterprise drives are usually 2^16), we're talking 1 bit flip, on<br>
average, in 10^14 bits read, whilst:<br>
<br>
8TiB drive =<br>
8 * 1024^4 bytes * 8 bits/byte =<br>
70368744177664 bits<br>
<br>
So if we read each bit once, say in a mirror recovery / disk rebuild<br>
situation, where the mirror disk has failed and a new one has been<br>
connected and is being refilled with the data of the sole surviving<br>
disk, we expect (8 * 1024^4 * 8) / 10^14, or ~0.70, unrecoverable<br>
errors from that "whole disk read" (of the "good" disk) - roughly a<br>
coin flip (about 50%, if errors are independent) that the rebuild hits<br>
at least one unrecoverable bit-flip error. And if you're using RAID<br>
hardware, you're now officially rooted - you can't rebuild your mirror<br>
(RAID1) disk array.<br>
<br>
Now think about a 4-disk (8TiB disks) RAID5 array (one disk's worth of<br>
parity). A rebuild means reading all three surviving disks in full -<br>
about 2.1 expected UREs - so when (not if) one disk fails in that<br>
array, the odds are heavily against you ever recovering/ rebuilding<br>
the array (under the same assumptions, only about a 1-in-8 chance of<br>
getting through the rebuild without one of the remaining disks<br>
producing its own error) - and at the point the first drive fails, the<br>
remaining drives are quite likely closer to failure anyway...<br>
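<br>
A quick back-of-the-envelope check of both numbers above (just a<br>
sketch - it assumes UREs are independent and occur at exactly the<br>
advertised rate, which real drives won't honour precisely):<br>
<br>
import math<br>
<br>
ure_rate = 1e-14              # 1 unrecoverable error per 10^14 bits read<br>
disk_bits = 8 * 1024**4 * 8   # one 8 TiB disk, in bits<br>
<br>
for disks_read in (1, 3):     # 1 = mirror rebuild, 3 = 4-disk RAID5 rebuild<br>
    expected = disks_read * disk_bits * ure_rate<br>
    p_hit = 1 - math.exp(-expected)   # Poisson approx. for P(at least one URE)<br>
    print(disks_read, round(expected, 2), round(p_hit, 2))<br>
<br>
# prints roughly: 1 0.7 0.51  and then  3 2.11 0.88<br>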
<br>
Concerning stuff for data junkies like myself.<br>
<br>
Thus RAID6, RAID7, or better yet the ZFS solutions to this problem -<br>
RAIDZ2 and RAIDZ3 - where you have 2 or 3 disks' worth of parity<br>
respectively, plus funky ZFS magic built in (disk scrubbing, hot spare<br>
disks and more, all on commodity consumer disks and dumb controllers).<br>
Any 2 (or 3) disks in your "raid" set can fail and the set can still<br>
rebuild itself - or if it's just sectors failing (random bit flips),<br>
ZFS will automatically detect those bad sectors via its checksums,<br>
repair them from the redundant copies, warn you in the logs that this<br>
is happening, and otherwise keep using a drive that's on the way out<br>
until you replace it.<br>
<br>
See here to wake us all up:<br>
<a href="http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/" target="_blank">http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/</a><br>
<br>
<a href="http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/" target="_blank">http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/</a><br>
<br>
(That second article slags ZFS with (what seems to me) a claim that<br>
ZFS COW (copy-on-write) functionality is per-file, not per-block,<br>
which AIUI is total bollocks - ZFS most certainly is a per-block COW<br>
filesystem, not per-file - but that's just a reflection of the bold<br>
assumptions and lack of fact checking by that article's author.<br>
Otherwise I think the article is useful!)<br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Tue, 26 May 2015 22:34:50 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 26 May 2015 12:31, "Zenaan Harkness" wrote:<br>
>> Reminds me, also that I've been reading heaps about zfs over the last<br>
>> couple days, HDD error rates are close to biting us with current gen<br>
>> filesystems (like ext4). Armour plate your arse with some ZFS- or<br>
>> possibly the less battle tested BTRFS- armour.<br>
>><br>
>> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive<br>
>> (most consumer drives are 10^14 - one advertises 2^15, and enterprise<br>
>> drives are usually 2^16), we're talking 1 bit flip, on average, in<br>
>> 10^14 bits read, whilst:<br>
>><br>
><br>
> Base 10 or base 2? It's an order of magnitude of difference here, or one<br>
> thousand more errors, so kinda a big deal...<br>
<br>
Base 10. (The 2^15 and 2^16 above were typos for 10^15 and 10^16.)<br>
And the difference between the two bases is much more than an order<br>
of magnitude - closer to ten orders:<br>
2^14 = 16384<br>
10^14 = 100000000000000<br>
<br>
Unless I'm not understanding what you're asking...<br>
<br>
For current HDDs:<br>
a 10^15 URE rate means an order of magnitude less likely to have a problem;<br>
10^16 is another order of magnitude better again.<br>
<br>
The problem is that 10^14, with a 10TB drive, is now at near-certainty<br>
- on average you can expect roughly one random unrecoverable read<br>
error every time you read a drive's worth of data off that drive,<br>
which could be "quite a bit worse in practice" depending on your usage<br>
environment for the drive.<br>
<br>
I believe the URE rate's been roughly the same since forever - the<br>
only "problem" is that we've gone from 10MB drives to (very soon)<br>
10TB drives - i.e. a six orders of magnitude increase in storage<br>
capacity, with no corresponding improvement in the read error rate,<br>
or nothing in that ballpark anyway.<br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 27 May 2015 00:34:44 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 05/26/2015 08:45 AM, Zenaan Harkness wrote:<br>
>> ZFS is f*ing awesome! Even for a single drive that's large enough to<br>
>> guarantee errors, ZFS makes the pain go away. I think BTRFS is<br>
>> designed to have similar functionality - but it's got a ways to go yet<br>
>> on various fronts, even though ultimately it may end up a "better"<br>
>> filesystem than ZFS (but who knows).<br>
>><br>
>> Z<br>
>> I guess that's Z for ZFS then ehj? :)<br>
><br>
> What about XFS?? It's being recommended on the Proxmox list as requiring<br>
> less memory. I know next to nothing about this. Ric<br>
<br>
Yesterday I read that that's a long-standing falsity about ZFS - the<br>
only situation in ZFS where RAM becomes significant (for performance)<br>
is data deduplication - which is different again from COW and its<br>
benefits. See here:<br>
<a href="http://en.wikipedia.org/wiki/ZFS#Deduplication" target="_blank">http://en.wikipedia.org/wiki/ZFS#Deduplication</a><br>
<br>
These days an SSD for storing the deduplication tables is an easy way<br>
to handle this situation if memory (and performance) is precious in<br>
your deployment [[and you want to enable deduplication]].<br>
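<br>
As a hypothetical ballpark only (the ~320 bytes per unique block<br>
figure below is a commonly quoted rule of thumb, not a spec, and the<br>
real dedup table size depends on recordsize and how dedupable the<br>
data is):<br>
<br>
data_bytes = 1 * 1024**4           # say, 1 TiB of data on a dedup-enabled dataset<br>
recordsize = 128 * 1024            # default ZFS recordsize of 128K<br>
ddt_bytes_per_block = 320          # rough rule-of-thumb dedup-table entry size<br>
<br>
blocks = data_bytes // recordsize<br>
ddt_gib = blocks * ddt_bytes_per_block / 1024**3<br>
print(round(ddt_gib, 1), "GiB of dedup table per TiB of unique data")   # ~2.5 GiB<br>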
<br>
Either way, it appears just about everything including memory use is<br>
configurable - so it would make sense to get at least a little<br>
familiar with it if you made your root filesystem ZFS.<br>
<br>
I can't speak to XFS - it may be better for a single-user workstation<br>
root drive, I don't know, sorry. I do know that for large disks (by<br>
today's standards), ZFS nails the "certainty of bitrot" problem -<br>
which, if one's data or photos or whatever is precious, is probably<br>
significant no matter how small the storage is. With a small dataset<br>
it's easy to duplicate manually, but even then, automatic protection<br>
(e.g. periodic ZFS scrubbing [[when combined with some form of<br>
RAIDZ]]) is less error prone than manual backups, of course.<br>
<br>
These pages seemed quite useful yesterday:<br>
<a href="http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/" target="_blank">http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/</a><br>
<a href="https://calomel.org/zfs_raid_speed_capacity.html" target="_blank">https://calomel.org/zfs_raid_speed_capacity.html</a><br>
<a href="http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide" target="_blank">http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide</a><br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 27 May 2015 00:46:29 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 26 May 2015 14:34, "Zenaan Harkness" <<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>> wrote:<br>
>> > On 26 May 2015 12:31, "Zenaan Harkness" <<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>> wrote:<br>
>> >> Reminds me, also that I've been reading heaps about zfs over the last<br>
>> >> couple days, HDD error rates are close to biting us with current gen<br>
>> >> filesystems (like ext4). Armour plate your arse with some ZFS- or<br>
>> >> possibly the less battle tested BTRFS- armour.<br>
>> >><br>
>> >> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive<br>
>> >> (most consumer drives are 10^14 - one advertises 2^15, and enterprise<br>
>> >> drives are usually 2^16), we're talking 1 bit flip, on average, in<br>
>> >> 10^14 bits read, whilst:<br>
>> ><br>
>> > Base 10 or base 2? It's an order of magnitude of difference here, or<br>
>> > one<br>
>> > thousand more errors, so kinda a big deal...<br>
>><br>
>> Base 10. And the difference is much more than an order of magnitude:<br>
>> 2^14 = 16384<br>
>> 10^14 = 100000000000000<br>
>><br>
>> Unless I'm not understanding what you're asking...<br>
><br>
> You've used both bases in your post, and it's not clear whether you meant<br>
> that or it was a typo.<br>
<br>
Indeed. The numbers are staggering. And the fact that we can now buy<br>
consumer 8TB drives, which essentially guarantee the buyer a bit flip<br>
on reading (and/or bit rot as stored) for every drive's worth of data,<br>
is really mind blowing - as is the fact that such error guarantees are<br>
not yet widely discussed or realised. I guess the "average home user"<br>
just dumps photos, music and movies on their drives and relatively<br>
rarely reads them back off, so the awareness is just not there.<br>
<br>
And up until yesterday I'd been an average home user from a drive URE<br>
rate perspective - all but oblivious. It's sorta been like "oh yeah, I<br>
know they include error rates if you look at the specs, but this is,<br>
like, you know, an engineered product, and products have, you know, at<br>
least one-year warranties, and it's all engineering tolerances and<br>
stuff and those engineers know what they're doing, so I don't have to<br>
worry. Right?" Well, it turns out we do need to worry, and in fact<br>
these bit flips are now all but a certainty.<br>
<br>
There's the odd web page around where a fastidious individual has kept<br>
a record over the years of corrupt files. Those error rates are real -<br>
neither optimistic nor pessimistic, it seems. Of course they're<br>
averages and they're rates, but from everything I've read in the last<br>
two days, they're relatively accurate engineering guarantees. It used<br>
to be that, on average, you would get no bit flips unless you'd<br>
read/written simply enormous amounts. Now that amount is equal to<br>
about one (large) drive of data!<br>
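<br>
A quick sanity check on that last sentence - at a 10^14 URE rate, the<br>
expected gap between unrecoverable errors is:<br>
<br>
print(10**14 / 8 / 10**12)   # 12.5 TB of data read - roughly one large consumer drive<br>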
<br>
I just keep shaking my head, having never realized the significance of<br>
all this prior to, oh idk, roughly say, yesterday. Might have been<br>
about 11pm. Although it's now tomorrow, so if my engineering<br>
calculations are right, that may have actually been the day before. I<br>
think I need sleep.<br>
<br>
:)<br>
<span class="HOEnZb"><font color="#888888">Z<br>
--<br>
SLUG - Sydney Linux User's Group Mailing List - <a href="http://slug.org.au/" target="_blank">http://slug.org.au/</a><br>
Subscription info and FAQs: <a href="http://slug.org.au/faq/mailinglists.html" target="_blank">http://slug.org.au/faq/mailinglists.html</a><br>
</font></span></div><div class="gmail_signature"><div dir="ltr"><br></div></div>
</div></div>