Date: Mon, 23 May 2022 00:31:56 -0500
Device: /dev/sdg [SAT], 2 Currently unreadable (pending) sectors
Hitachi HDS723020BLA642, S/N:MN1220F317TJTD, WWN:5-000cca-369d1a23a, FW:MN6OA580, 2.00 TB

Date: Mon, 23 May 2022 00:23:32 -0500
Device: /dev/sdh [SAT], 2 Currently unreadable (pending) sectors
Hitachi HDS723020BLA642, S/N:MN1220F317TJTD, WWN:5-000cca-369d1a23a, FW:MN6OA580, 2.00 TB
No reboot in between... Note the SNs are identical. That's impossible for two distinct physical drives. The drive models are probably truly the same, but the SNs certainly should not be. I think I found a bug in smartd, unless someone can think of a reason why I'd be seeing this.
The problem is, I just relied on a smart email like this to decide which drive to pull and replace. Now I'm not so sure the smart reporting is telling me anything correct.
This is on a fairly new Fedora with smartmontools-7.2-11.
From today forward I would recommend double-checking smart reports before acting on them (smartctl -a, hdparm -i, etc.).
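One way to do that double-check, sketched here against canned text rather than a live drive (the device output below is a sample, not captured from the box in question), is to pull the serial out of `smartctl -i` output for each sdX node and compare:

```shell
#!/bin/sh
# Sketch: extract the serial number from `smartctl -i`-style output so
# two device nodes can be compared. The here-doc sample stands in for
# real output; on a live system you would pipe `smartctl -i /dev/sdX`
# (and cross-check against `hdparm -i /dev/sdX`) instead.
get_serial() {
    awk -F': *' '/^Serial Number/ { print $2 }'
}

sample_sdg='Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F317TJTD
Firmware Version: MN6OA580'

printf '%s\n' "$sample_sdg" | get_serial
```

Running `get_serial` on the output of both /dev/sdg and /dev/sdh and comparing the two strings would confirm (or refute) what the smartd emails claim.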
One scenario where you can see this is e.g. on the muug.ca server, where the drives are multipathed - i.e. two physical SAS channels reaching each drive. Linux handles this by having two sdX nodes, then multipathd creates a single /dev/mapper/XXX device for you to use.
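On such a box you can spot the pairing because both sdX nodes report the same WWN. A rough sketch of grouping nodes by WWN (the here-doc mimics `lsblk -d -n -o NAME,WWN` output; the WWN values are illustrative):

```shell
#!/bin/sh
# Sketch: group sdX nodes by WWN to spot two paths to one physical
# drive. On a real multipath box you would run
# `lsblk -d -n -o NAME,WWN` (or `multipath -ll`) instead of the sample.
lsblk_sample='sdg 0x5000cca369d1a23a
sdh 0x5000cca369d1a23a
sdi 0x5000cca369d1b111'

printf '%s\n' "$lsblk_sample" |
    awk '{ paths[$2] = paths[$2] " " $1 }
         END { for (w in paths) print w ":" paths[w] }' |
    sort
```

Two names after the same WWN means two kernel nodes for one physical spindle, which would make identical serials expected rather than impossible.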
On a non-multipath box, this could happen if the drive went offline and then recovered. I've seen it happen, but I don't know how to reproduce it.
My guess is it's the same drive, and the kernel decided it needed a new device name for some reason. "dmesg|grep sd[gh]" might show you something useful?
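Something along these lines, shown here against made-up example lines (the log text is illustrative, not from the box in question), would surface attach/detach events for the two suspect names:

```shell
#!/bin/sh
# Sketch: filter kernel messages for the two suspect drives. The
# here-doc holds made-up illustrative lines; on the live box you would
# run `dmesg | grep -E 'sd[gh]'` (or `journalctl -k`) instead.
dmesg_sample='[    5.1] sd 6:0:0:0: [sdg] Attached SCSI disk
[    5.2] sd 7:0:0:0: [sdh] Attached SCSI disk
[90000.3] sd 7:0:0:0: [sdh] Synchronizing SCSI cache'

printf '%s\n' "$dmesg_sample" | grep -E 'sd\[?[gh]\]?'
```

A detach followed by a reattach under a new name would support the "same drive, renamed" theory.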
-Adam
-----Original Message----- From: Roundtable roundtable-bounces@muug.ca On Behalf Of Trevor Cordes Sent: Monday, May 23, 2022 1:43 PM To: MUUG RndTbl roundtable@muug.ca Subject: [RndTbl] Bug in smart reporting?
On 2022-05-23 Adam Thompson wrote:
One scenario where you can see this is e.g. on the muug.ca server, where the drives are multipathed - i.e. two physical SAS channels reaching each drive. Linux handles this by having two sdX nodes, then multipathd creates a single /dev/mapper/XXX device for you to use.
It is the old muug server, but I don't see anything in /dev/mapper and I don't think it's multipathed(?). Each drive gets its own SATA cable direct to the board.
On a non-multipath box, this could happen if the drive went offline and then recovered. I've seen it happen, but I don't know how to reproduce it.
Almost certainly not the case in this instance. The drives are very stable, with just these semi-bad smart errors happening off and on for months. The array never went degraded nor resynced. I get panic phone alerts if that happens. :-)
My guess is it's the same drive, and the kernel decided it needed a new device name for some reason. "dmesg|grep sd[gh]" might show you something useful?
I'll try that next time it happens, as I've since rebooted and /var/log/messages doesn't seem to be capturing all the kernel logs on this box for some reason (even after trying to defeat all the journald stuff).
I'm sure I won't have to wait long... I'm just miffed I may have replaced the wrong drive in my RAID6 last night... but the resync was 100% ok, so no lasting harm done.
Yeah, systemd really messes up logging. You just have to rely on "journalctl" instead of /var/log/messages, at some point, no matter what you've done to make it look like the old way. ☹
...or run Devuan, I suppose? I think you can also still build Gentoo without systemd, and there's always *BSD. OpenBSD has a partial systemd-compatibility layer now, not sure about the others, but they all still use honest-to-god dmesg & syslog. Actually, since *BSD all implement SMART slightly differently (and all VERY differently from Linux), you could make a bootable OpenBSD USB stick and use its SMART utilities to cross-check what Linux's smartctl says if you wanted?
I think systemd separates kernel stuff into /var/log/dmesg.log, at least on the system I'm looking at right now. Fedora could be different. And you've customized things anyway, so YMMV here.
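If the goal is to get kernel messages flowing back into /var/log/messages, one commonly used setup (a sketch; the exact result depends on the rsyslog config Fedora ships) is to make the journal persistent and forward it to syslog in /etc/systemd/journald.conf:

```ini
; Sketch of /etc/systemd/journald.conf settings (assumes rsyslog is
; installed and reading from the journal socket); restart
; systemd-journald afterwards.
[Journal]
Storage=persistent
ForwardToSyslog=yes
```

Otherwise `journalctl -k` is the journald-native equivalent of dmesg, and it survives reboots once Storage=persistent is set.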
Good luck, anyway.
-Adam
-----Original Message----- From: Trevor Cordes trevor@tecnopolis.ca Sent: Monday, May 23, 2022 2:29 PM To: Adam Thompson athompso@athompso.net Cc: Continuation of Round Table discussion roundtable@muug.ca Subject: Re: [RndTbl] Bug in smart reporting?