Corrupt file system: Replace hard disk or not?

Omer Zak w1 at zak.co.il
Fri Sep 22 12:38:53 IDT 2017


Once in a while, a rare Linux kernel bug turns up that corrupts
filesystems.
Maybe that is what has bitten you?

How to check:
1. Which version of the kernel is running on the PC?
2. Are there any reports of filesystem corruption for this version of
the kernel?
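
For step 1, a minimal sketch (the variable name kernel_release is only
for illustration); the release string it prints is what you would then
search for in bug trackers and mailing list archives for step 2:

```shell
# Ask the running kernel for its release string.
kernel_release="$(uname -r)"
echo "Running kernel: ${kernel_release}"

# The full version line (build date, compiler) is also available here
# on Linux systems; skipped silently if the file is missing.
[ -r /proc/version ] && cat /proc/version
```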

Since you are going to replace the computer in a year anyway, and the
data on the hard disk in question is not essential, my advice would be
to put the hard disk back into service.

Also, configure your system to run fsck frequently on the hard disk (say,
once a week or every 5 boots, whichever comes first, instead of every 30
boots).
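
On ext4 this schedule is usually set with tune2fs. A rough sketch,
assuming the filesystem sits on a device I am calling
/dev/mapper/example-volume (a made-up name, substitute the real one).
The snippet only prints the command instead of running it, so nothing
happens to any disk until you have checked the device name yourself:

```shell
# Hypothetical device path: replace with the actual ext4 volume
# (with LUKS this is the opened /dev/mapper/... device).
DEV="${DEV:-/dev/mapper/example-volume}"

# -c 5  : force a check after at most 5 mounts
# -i 1w : force a check once a week has passed since the last one
CMD="tune2fs -c 5 -i 1w ${DEV}"

# Printed rather than executed; run it as root once DEV is correct.
echo "${CMD}"

# Afterwards the settings can be inspected with:
#   tune2fs -l "${DEV}" | grep -Ei 'mount count|check'
```

Note that tune2fs counts mounts, not boots, and that the boot-time check
is normally only triggered for filesystems listed in /etc/fstab with a
non-zero value in the last (pass number) field.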

--- Omer Zak


On Fri, 2017-09-22 at 12:11 +0300, Eli Billauer wrote:
> Hello all,
> 
> TL;DR: My hard disk's filesystem was corrupt, but the SMART statistics 
> are perfect. Should I replace the hard disk?
> 
> Full version:
> 
> It seems like one of my hard disks has passed its own premature Yom 
> Kippur verdict. Rebooting my computer this morning, it failed to mount, 
> saying "Group descriptor 32768 checksum is invalid" and forced me into a 
> shell.
> 
> I made the mistake (?) of running fsck and then aborting it with a 
> (proper CTRL-ALT-DEL) reboot, as it took ages. This is a 3 TB disk, 
> which isn't necessary for booting, so I removed it from /etc/fstab, and 
> brought up the computer fine.
> 
> Then I ran fsck on that disk, which generated a log of 125 MB, and 
> basically threw everything into /lost+found, leaving nothing in the root 
> directory. Hurray.
> 
> It's a Western Digital WDC WD30EZRX-00DC0B0, with one big ext4 over LUKS 
> over LVM, 4 years in service, containing stuff that doesn't deserve a 
> backup. So the damage is limited, but I wonder if I should replace the disk.
> 
> Despite its age, this disk's SMART status is perfect: No bad sectors, no 
> reallocated sectors, nothing. No parameter can be better. I know there's 
> a "don't trust SMART" word around, but had a sector failed, I would 
> expect that to appear in the statistics. I mean, I do understand that 
> SMART can't predict a failure, but doesn't it mean anything?
> 
> And there's another thing: The reason I rebooted the computer was that I 
> found the screen frozen, but the mouse pointer moved. The time stood 
> still at 3:01 (AM). This is highly unusual on my computer, which usually 
> runs for months with zero issues.
> 
> So I connected with ssh, and saw nothing suspicious: Not in 
> /var/log/messages, not in dmesg, not in .xsession-errors. No process was 
> busy in particular. From the remote terminal, I couldn't have guessed 
> something was wrong. So I issued a reboot from remote, which failed as I 
> mentioned above.
> 
> Bottom line: The panic instinct is to replace the disk, even though the 
> whole computer is due for replacement within a year or so. Money left 
> aside, it's a bit of an effort, and involves a lot of scary commands as 
> root, which are a risk factor by themselves. I'm not implying that I'm 
> stupid enough to mke2fs the wrong disk. Not me. I never err. ;)
