<div dir="ltr">Interesting thread about ZFS and large disks bit-rot...<div><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Zenaan Harkness</b> <span dir="ltr"><<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>></span><br>Date: 10 June 2015 at 11:52<br>Subject: [SLUG] Fwd: 8TiB HDD, 10^14 bit error rate, approaching certainty of error for each "drive of data" read<br>To: <a href="mailto:slug@slug.org.au">slug@slug.org.au</a><br><br><br>FYI<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 10 Jun 2015 11:50:48 +1000<br>
Subject: 8TiB HDD, 10^14 bit error rate, approaching certainty of<br>
error for each "drive of data" read<br>
To: <a href="mailto:d-community-offtopic@lists.alioth.debian.org">d-community-offtopic@lists.alioth.debian.org</a><br>
<br>
Seems ZFS' and BTRFS' time has come. ZFS on Linux (ZFSoL) seems more<br>
stable to me, and has 10 years of deployment under its belt too.<br>
<br>
Any news on Debian GNU/Linux distributing ZFSoL? We see ZFS on Debian<br>
GNU/kFreeBSD being distributed by Debian...<br>
<br>
FYI<br>
Zenaan<br>
<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Tue, 26 May 2015 20:31:41 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
On 5/25/15, Michael wrote:<br>
> The LVM volumes on the external drives are ok.<br>
<br>
Reminds me: I've also been reading heaps about ZFS over the last<br>
couple of days. HDD error rates are close to biting us with current-gen<br>
filesystems (like ext4). Armour-plate your arse with some ZFS (or<br>
possibly the less battle-tested BTRFS) armour.<br>
<br>
At a URE (unrecoverable read error) rate of one per 10^14 bits read<br>
from a drive (most consumer drives are 10^14 - one advertises 2^15,<br>
and enterprise drives are usually 2^16), we're talking 1 bit flip, on<br>
average, in 10^14 bits read, whilst:<br>
<br>
8TiB drive =<br>
8 * 1024^4 bytes * 8 bits/byte =<br>
70368744177664 bits<br>
<br>
So if we read each bit once, say in a mirror recovery / disk rebuild<br>
situation, where the mirror disk has failed and a new one has been<br>
connected and is being refilled with the data of the sole surviving<br>
disk, we expect (8 * 1024^4 * 8) / 10^14, or ~0.70, unrecoverable<br>
errors from that "whole disk read" (of the "good" disk) - roughly a<br>
coin flip (about 50%, if errors are independent) that the rebuild hits<br>
at least one unrecoverable bit-flip error. And if you're using RAID<br>
hardware, you're now officially rooted - you can't rebuild your mirror<br>
(RAID1) disk array.<br>
<br>
Now think about a 4-disk (8TiB disks) RAID5 array (one disk's worth of<br>
parity). A rebuild means reading all three surviving disks in full -<br>
about 2.1 expected UREs - so when (not if) one disk fails in that<br>
array, the odds are heavily against you ever recovering/ rebuilding<br>
the array (under the same assumptions, only about a 1-in-8 chance of<br>
getting through the rebuild without one of the remaining disks<br>
producing its own error) - and at the point the first drive fails, the<br>
remaining drives are quite likely closer to failure anyway...<br>
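<br>
A quick back-of-the-envelope check of both numbers above (just a<br>
sketch - it assumes UREs are independent and occur at exactly the<br>
advertised rate, which real drives won't honour precisely):<br>
<br>
import math<br>
<br>
ure_rate = 1e-14              # 1 unrecoverable error per 10^14 bits read<br>
disk_bits = 8 * 1024**4 * 8   # one 8 TiB disk, in bits<br>
<br>
for disks_read in (1, 3):     # 1 = mirror rebuild, 3 = 4-disk RAID5 rebuild<br>
    expected = disks_read * disk_bits * ure_rate<br>
    p_hit = 1 - math.exp(-expected)   # Poisson approx. for P(at least one URE)<br>
    print(disks_read, round(expected, 2), round(p_hit, 2))<br>
<br>
# prints roughly: 1 0.7 0.51  and then  3 2.11 0.88<br>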
<br>
Concerning stuff for data junkies like myself.<br>
<br>
Thus RAID6, RAID7, or better yet the ZFS solutions to this problem -<br>
RAIDZ2 and RAIDZ3 - where you have 2 or 3 disks' worth of parity<br>
respectively, plus funky ZFS magic built in (disk scrubbing, hot spare<br>
disks and more, all on commodity consumer disks and dumb controllers).<br>
Any 2 (or 3) disks in your "raid" set can fail and the set can still<br>
rebuild itself - or if it's just sectors failing (random bit flips),<br>
ZFS will automatically detect those bad sectors via its checksums,<br>
repair them from the redundant copies, warn you in the logs that this<br>
is happening, and otherwise keep using a drive that's on the way out<br>
until you replace it.<br>
<br>
See here to wake us all up:<br>
<a href="http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/" target="_blank">http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/</a><br>
<br>
<a href="http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/" target="_blank">http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/1/</a><br>
<br>
(That second article slags ZFS with (what seems to me) a claim that<br>
ZFS COW (copy-on-write) functionality is per-file, not per-block,<br>
which AIUI is total bollocks - ZFS most certainly is a per-block COW<br>
filesystem, not per-file - but that's just a reflection of the bold<br>
assumptions and lack of fact checking by that article's author.<br>
Otherwise I think the article is useful!)<br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Tue, 26 May 2015 22:34:50 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 26 May 2015 12:31, "Zenaan Harkness" wrote:<br>
>> Reminds me, also that I've been reading heaps about zfs over the last<br>
>> couple days, HDD error rates are close to biting us with current gen<br>
>> filesystems (like ext4). Armour plate your arse with some ZFS- or<br>
>> possibly the less battle tested BTRFS- armour.<br>
>><br>
>> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive<br>
>> (most consumer drives are 10^14 - one advertises 2^15, and enterprise<br>
>> drives are usually 2^16), we're talking 1 bit flip, on average, in<br>
>> 10^14 bits read, whilst:<br>
>><br>
><br>
> Base 10 or base 2? It's an order of magnitude of difference here, or one<br>
> thousand more errors, so kinda a big deal...<br>
<br>
Base 10. (The 2^15 and 2^16 above were typos for 10^15 and 10^16.)<br>
And the difference between the two bases is much more than an order<br>
of magnitude - closer to ten orders:<br>
2^14 = 16384<br>
10^14 = 100000000000000<br>
<br>
Unless I'm not understanding what you're asking...<br>
<br>
For current HDDs:<br>
a 10^15 URE rate means an order of magnitude less likely to have a problem;<br>
10^16 is another order of magnitude better again.<br>
<br>
The problem is that 10^14, with a 10TB drive, is now at near-certainty<br>
- on average you can expect roughly one random unrecoverable read<br>
error every time you read a drive's worth of data off that drive,<br>
which could be "quite a bit worse in practice" depending on your usage<br>
environment for the drive.<br>
<br>
I believe the URE rate's been roughly the same since forever - the<br>
only "problem" is that we've gone from 10MB drives to (very soon)<br>
10TB drives - i.e. a six orders of magnitude increase in storage<br>
capacity, with no corresponding improvement in the read error rate,<br>
or nothing in that ballpark anyway.<br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 27 May 2015 00:34:44 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 05/26/2015 08:45 AM, Zenaan Harkness wrote:<br>
>> ZFS is f*ing awesome! Even for a single drive that's large enough to<br>
>> guarantee errors, ZFS makes the pain go away. I think BTRFS is<br>
>> designed to have similar functionality - but it's got a ways to go yet<br>
>> on various fronts, even though ultimately it may end up a "better"<br>
>> filesystem than ZFS (but who knows).<br>
>><br>
>> Z<br>
>> I guess that's Z for ZFS then ehj? :)<br>
><br>
> What about XFS?? It's being recommended on the Proxmox list as requiring<br>
> less memory. I know next to nothing about this. Ric<br>
<br>
Yesterday I read that that's a long-standing falsity about ZFS - the<br>
only situation in ZFS where RAM becomes significant (for performance)<br>
is data deduplication - which is different again from COW and its<br>
benefits. See here:<br>
<a href="http://en.wikipedia.org/wiki/ZFS#Deduplication" target="_blank">http://en.wikipedia.org/wiki/ZFS#Deduplication</a><br>
<br>
These days an SSD for storing the deduplication tables is an easy way<br>
to handle this situation if memory (and performance) is precious in<br>
your deployment [[and you want to enable deduplication]].<br>
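<br>
As a hypothetical ballpark only (the ~320 bytes per unique block<br>
figure below is a commonly quoted rule of thumb, not a spec, and the<br>
real dedup table size depends on recordsize and how dedupable the<br>
data is):<br>
<br>
data_bytes = 1 * 1024**4           # say, 1 TiB of data on a dedup-enabled dataset<br>
recordsize = 128 * 1024            # default ZFS recordsize of 128K<br>
ddt_bytes_per_block = 320          # rough rule-of-thumb dedup-table entry size<br>
<br>
blocks = data_bytes // recordsize<br>
ddt_gib = blocks * ddt_bytes_per_block / 1024**3<br>
print(round(ddt_gib, 1), "GiB of dedup table per TiB of unique data")   # ~2.5 GiB<br>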
<br>
Either way, it appears just about everything including memory use is<br>
configurable - so it would make sense to get at least a little<br>
familiar with it if you made your root filesystem ZFS.<br>
<br>
I can't speak to XFS - it may be better for a single-user workstation<br>
root drive, I don't know, sorry. I do know that for large disks (by<br>
today's standards), ZFS nails the "certainty of bitrot" problem -<br>
which, if one's data or photos or whatever is precious, is probably<br>
significant no matter how small the storage is. With a small dataset<br>
it's easy to duplicate manually, but even then, automatic protection<br>
(e.g. periodic ZFS scrubbing [[when combined with some form of<br>
RAIDZ]]) is less error prone than manual backups, of course.<br>
<br>
These pages seemed quite useful yesterday:<br>
<a href="http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/" target="_blank">http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/</a><br>
<a href="https://calomel.org/zfs_raid_speed_capacity.html" target="_blank">https://calomel.org/zfs_raid_speed_capacity.html</a><br>
<a href="http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide" target="_blank">http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide</a><br>
<br>
Z<br>
<br>
---------- Forwarded message ----------<br>
From: Zenaan Harkness<br>
Date: Wed, 27 May 2015 00:46:29 +1000<br>
Subject: Re: Thank Ramen for ddrescue!!!<br>
<br>
> On 26 May 2015 14:34, "Zenaan Harkness" <<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>> wrote:<br>
>> > On 26 May 2015 12:31, "Zenaan Harkness" <<a href="mailto:zen@freedbms.net">zen@freedbms.net</a>> wrote:<br>
>> >> Reminds me, also that I've been reading heaps about zfs over the last<br>
>> >> couple days, HDD error rates are close to biting us with current gen<br>
>> >> filesystems (like ext4). Armour plate your arse with some ZFS- or<br>
>> >> possibly the less battle tested BTRFS- armour.<br>
>> >><br>
>> >> At one URE (UnRecoverable Errors) rate in 10^14 bits read from a drive<br>
>> >> (most consumer drives are 10^14 - one advertises 2^15, and enterprise<br>
>> >> drives are usually 2^16), we're talking 1 bit flip, on average, in<br>
>> >> 10^14 bits read, whilst:<br>
>> ><br>
>> > Base 10 or base 2? It's an order of magnitude of difference here, or<br>
>> > one<br>
>> > thousand more errors, so kinda a big deal...<br>
>><br>
>> Base 10. And the difference is much more than an order of magnitude:<br>
>> 2^14 = 16384<br>
>> 10^14 = 100000000000000<br>
>><br>
>> Unless I'm not understanding what you're asking...<br>
><br>
> You've used both bases in your post, and it's not clear whether you meant<br>
> that or it was a typo.<br>
<br>
Indeed. The numbers are staggering. And the fact that we can now buy<br>
consumer 8TB drives, which essentially guarantee the buyer a bit flip<br>
on reading (and/or bit rot as stored) for every drive's worth of data,<br>
is really mind blowing - as is the fact that such error guarantees are<br>
not yet widely discussed or realised. I guess the "average home user"<br>
just dumps photos, music and movies on their drives and relatively<br>
rarely reads them back off, so the awareness is just not there.<br>
<br>
And up until yesterday I'd been an average home user from a drive URE<br>
rate perspective - all but oblivious. It's sorta been like "oh yeah, I<br>
know they include error rates if you look at the specs, but this is,<br>
like, you know, an engineered product, and products have, you know, at<br>
least one-year warranties, and it's all engineering tolerances and<br>
stuff and those engineers know what they're doing, so I don't have to<br>
worry. Right?" Well, it turns out we do need to worry, and in fact<br>
these bit flips are now all but a certainty.<br>
<br>
There's the odd web page around where a fastidious individual has kept<br>
a record over the years of corrupt files. Those error rates are real -<br>
neither optimistic nor pessimistic, it seems. Of course they're<br>
averages and they're rates, but from everything I've read in the last<br>
two days, they're relatively accurate engineering guarantees. It used<br>
to be that, on average, you would get no bit flips unless you'd<br>
read/written simply enormous amounts. Now that amount is equal to<br>
about one (large) drive of data!<br>
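<br>
A quick sanity check on that last sentence - at a 10^14 URE rate, the<br>
expected gap between unrecoverable errors is:<br>
<br>
print(10**14 / 8 / 10**12)   # 12.5 TB of data read - roughly one large consumer drive<br>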
<br>
I just keep shaking my head, having never realized the significance of<br>
all this prior to, oh idk, roughly say, yesterday. Might have been<br>
about 11pm. Although it's now tomorrow, so if my engineering<br>
calculations are right, that may have actually been the day before. I<br>
think I need sleep.<br>
<br>
:)<br>
<span class="HOEnZb"><font color="#888888">Z<br>
--<br>
SLUG - Sydney Linux User's Group Mailing List - <a href="http://slug.org.au/" target="_blank">http://slug.org.au/</a><br>
Subscription info and FAQs: <a href="http://slug.org.au/faq/mailinglists.html" target="_blank">http://slug.org.au/faq/mailinglists.html</a><br>
</font></span></div><div class="gmail_signature"><div dir="ltr"><br></div></div>
</div></div>