Corrupt file system & SMART: Replace hard disk or not?

Eli Billauer eli at billauer.co.il
Fri Sep 22 14:47:59 IDT 2017


Hello Omer,

The computer in question is always up, and has quite a lot of disk 
activity for a domestic desktop. The kernel is 3.12.20, and I've never 
had any filesystem issue with it. Actually, I can't recall experiencing 
a filesystem fault ever (except for problems between chair and keyboard).

And checking on boot doesn't help much. I can and probably will check
this specific filesystem every now and then, since there's no issue
unmounting it for a few hours.
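
Such an occasional check could be wrapped in a small script. What follows
is only a sketch in Python, assuming the filesystem has already been
unmounted. It runs e2fsck in forced, read-only mode (-f and -n) and
decodes the exit status, which the fsck man page documents as a bitmask.
The function names are made up for illustration.

```python
import subprocess

# Exit status bits as documented in fsck(8) and e2fsck(8).
FSCK_STATUS_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def describe_fsck_status(status):
    """Translate an fsck exit status (a bitmask) into readable messages."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_STATUS_BITS.items() if status & bit]


def check_filesystem(device):
    """Run a forced, read-only e2fsck on an unmounted device."""
    # -f: check even if the filesystem is marked clean
    # -n: open the filesystem read-only and answer "no" to all questions
    result = subprocess.run(["e2fsck", "-f", "-n", device])
    return describe_fsck_status(result.returncode)
```

Calling check_filesystem("/dev/mapper/example-volume") as root would
return the decoded status; that device path is a placeholder, not the
real name of the volume discussed here.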

Thanks,
    Eli

On 22/09/17 12:38, Omer Zak wrote:
> Once in a while, there is some rare Linux kernel bug which has the
> effect of corrupting filesystems.
> Maybe that is what has bitten you?
>
> How to check:
> 1. Which version of the kernel is running on the PC?
> 2. Are there any reports of filesystem corruption for this version of
> the kernel?
>
> Since you are going to replace the computer in a year anyway, and the
> data on the hard disk in question is not essential, my advice would be
> to put the hard disk back into service.
>
> Also, configure your system to run fsck frequently on the hard disk (say,
> every week or every 5 boots, whichever comes first, instead of every 30
> boots).
>
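
On ext2/3/4, limits like these are normally set with tune2fs: -c takes
the maximum mount count and -i the maximum interval between checks. A
minimal sketch in Python, assuming the 5-mount and one-week figures
suggested above; the function names and the device path are
hypothetical.

```python
import subprocess


def tune2fs_check_policy(device, max_mounts=5, interval="1w"):
    """Build the tune2fs command that sets the periodic-check limits."""
    # -c: maximum number of mounts between two checks
    # -i: maximum time between two checks (suffix d, w or m)
    return ["tune2fs", "-c", str(max_mounts), "-i", interval, device]


def apply_check_policy(device, max_mounts=5, interval="1w"):
    """Apply the limits; needs root privileges and a real device."""
    subprocess.run(tune2fs_check_policy(device, max_mounts, interval),
                   check=True)
```

For example, tune2fs_check_policy("/dev/mapper/example-volume") yields
the argument list for "tune2fs -c 5 -i 1w" on that placeholder device.
A check becomes due as soon as either limit is exceeded, but it only
runs at boot if the filesystem is listed in /etc/fstab with a non-zero
fsck pass number.
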
> --- Omer Zak
>
>
> On Fri, 2017-09-22 at 12:11 +0300, Eli Billauer wrote:
>    
>> Hello all,
>>
>> TL;DR: My hard disk's filesystem was corrupt, but the SMART statistics
>> are perfect. Should I replace the hard disk?
>>
>> Full version:
>>
>> It seems like one of my hard disks has passed its own premature Yom
>> Kippur verdict. Rebooting my computer this morning, it failed to mount,
>> saying "Group descriptor 32768 checksum is invalid" and forced me into a
>> shell.
>>
>> I made the mistake (?) of running fsck and then aborting it with a
>> (proper CTRL-ALT-DEL) reboot, as it took ages. This is a 3 TB disk,
>> which isn't necessary for booting, so I removed it from /etc/fstab, and
>> brought up the computer fine.
>>
>> Then I ran fsck on that disk, which generated a log of 125 MB, and
>> basically threw everything into /lost+found, leaving nothing in the root
>> directory. Hurray.
>>
>> It's a Western Digital WDC WD30EZRX-00DC0B0, with one big ext4 over LUKS
>> over LVM, 4 years in service, containing stuff that doesn't deserve a
>> backup. So the damage is limited, but I wonder if I should replace the disk.
>>
>> Despite its age, this disk's SMART status is perfect: No bad sectors, no
>> reallocated sectors, nothing. No parameter can be better. I know there's
>> a "don't trust SMART" word around, but had a sector failed, I would
>> expect that to appear in the statistics. I mean, I do understand that
>> SMART can't predict a failure, but doesn't it mean anything?
>>
>> And there's another thing: The reason I rebooted the computer was that I
>> found the screen frozen, but the mouse pointer moved. The time stood
>> still at 3:01 (AM). This is highly unusual on my computer, which usually
>> runs for months with zero issues.
>>
>> So I connected with ssh, and saw nothing suspicious: Not in
>> /var/log/messages, not in dmesg, not in .xsession-errors. No process was
>> busy in particular. From the remote terminal, I couldn't have guessed
>> something was wrong. So I issued a reboot from remote, which failed as I
>> mentioned above.
>>
>> Bottom line: The panic instinct is to replace the disk, even though the
>> whole computer is due for replacement within a year or so. Money left
>> aside, it's a bit of an effort, and involves a lot of scary commands as
>> root, which are a risk factor by themselves. I'm not implying that I'm
>> stupid enough to mke2fs the wrong disk. Not me. I never err. ;)
>>      
>
> _______________________________________________
> Linux-il mailing list
> Linux-il at cs.huji.ac.il
> http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
>
>    


-- 
Web: http://www.billauer.co.il



