Corrupt file system: Replace hard disk or not?

Eli Billauer eli at billauer.co.il
Fri Sep 22 12:11:00 IDT 2017


Hello all,

TL;DR: My hard disk's filesystem was corrupt, but the SMART statistics 
are perfect. Should I replace the hard disk?

Full version:

It seems like one of my hard disks has passed its own premature Yom 
Kippur verdict. When I rebooted my computer this morning, the disk 
failed to mount with "Group descriptor 32768 checksum is invalid", and 
I was forced into a shell.

I made the mistake (?) of running fsck and then aborting it with a 
(proper CTRL-ALT-DEL) reboot, as it took ages. This is a 3 TB disk, 
which isn't necessary for booting, so I removed it from /etc/fstab, and 
brought up the computer fine.

Then I ran fsck on that disk, which generated a log of 125 MB, and 
basically threw everything into /lost+found, leaving nothing in the root 
directory. Hurray.

It's a Western Digital WDC WD30EZRX-00DC0B0, with one big ext4 over LUKS 
over LVM, 4 years in service, containing stuff that doesn't deserve a 
backup. So the damage is limited, but I wonder if I should replace the disk.

Despite its age, this disk's SMART status is perfect: No bad sectors, no 
reallocated sectors, nothing. No parameter could be better. I know the 
word going around is "don't trust SMART", but had a sector failed, I 
would expect that to appear in the statistics. I mean, I do understand 
that SMART can't predict a failure, but doesn't it mean anything?

And there's another thing: The reason I rebooted the computer was that I 
found the screen frozen, but the mouse pointer moved. The time stood 
still at 3:01 (AM). This is highly unusual on my computer, which usually 
runs for months with zero issues.

So I connected with ssh, and saw nothing suspicious: Not in 
/var/log/messages, not in dmesg, not in .xsession-errors. No process was 
busy in particular. From the remote terminal, I couldn't have guessed 
something was wrong. So I issued a reboot from remote, which failed as I 
mentioned above.

Bottom line: The panic instinct is to replace the disk, even though the 
whole computer is due for replacement within a year or so. Money aside, 
it's a bit of an effort, and involves a lot of scary commands as 
root, which are a risk factor by themselves. I'm not implying that I'm 
stupid enough to mke2fs the wrong disk. Not me. I never err. ;)

Insights are welcome.

Shana Tova,
    Eli

-- 
Web: http://www.billauer.co.il
