Date: Mon, 23 May 2022 00:31:56 -0500
Device: /dev/sdg [SAT], 2 Currently unreadable (pending) sectors
Hitachi HDS723020BLA642, S/N:MN1220F317TJTD, WWN:5-000cca-369d1a23a, FW:MN6OA580, 2.00 TB

Date: Mon, 23 May 2022 00:23:32 -0500
Device: /dev/sdh [SAT], 2 Currently unreadable (pending) sectors
Hitachi HDS723020BLA642, S/N:MN1220F317TJTD, WWN:5-000cca-369d1a23a, FW:MN6OA580, 2.00 TB
No reboot in between... Note the SNs are identical. That's impossible for two distinct physical drives. The drive models are probably truly the same, but the SNs certainly should not be. I think I found a bug in smartd, unless someone can think of a reason why I'd be seeing this.
The problem is, I just relied on a smart email like this to decide which drive to pull and replace. Now I'm not so sure the smart reporting is telling me anything correct.
This is on a fairly new Fedora with smartmontools-7.2-11.
From today forward I would recommend double-checking smart reports before acting on them (smartctl -a, hdparm -i, etc.).
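One way to do that double-check, sketched here against canned text rather than a live drive (the device output below is a sample, not captured from the box in question), is to pull the serial out of `smartctl -i` output for each sdX node and compare:

```shell
#!/bin/sh
# Sketch: extract the serial number from `smartctl -i`-style output so
# two device nodes can be compared. The here-doc sample stands in for
# real output; on a live system you would pipe `smartctl -i /dev/sdX`
# (and cross-check against `hdparm -i /dev/sdX`) instead.
get_serial() {
    awk -F': *' '/^Serial Number/ { print $2 }'
}

sample_sdg='Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F317TJTD
Firmware Version: MN6OA580'

printf '%s\n' "$sample_sdg" | get_serial
```

Running `get_serial` on the output of both /dev/sdg and /dev/sdh and comparing the two strings would confirm (or refute) what the smartd emails claim.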
One scenario where you can see this is e.g. on the muug.ca server, where the drives are multipathed - i.e. two physical SAS channels reaching each drive. Linux handles this by having two sdX nodes, then multipathd creates a single /dev/mapper/XXX device for you to use.
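On such a box you can spot the pairing because both sdX nodes report the same WWN. A rough sketch of grouping nodes by WWN (the here-doc mimics `lsblk -d -n -o NAME,WWN` output; the WWN values are illustrative):

```shell
#!/bin/sh
# Sketch: group sdX nodes by WWN to spot two paths to one physical
# drive. On a real multipath box you would run
# `lsblk -d -n -o NAME,WWN` (or `multipath -ll`) instead of the sample.
lsblk_sample='sdg 0x5000cca369d1a23a
sdh 0x5000cca369d1a23a
sdi 0x5000cca369d1b111'

printf '%s\n' "$lsblk_sample" |
    awk '{ paths[$2] = paths[$2] " " $1 }
         END { for (w in paths) print w ":" paths[w] }' |
    sort
```

Two names after the same WWN means two kernel nodes for one physical spindle, which would make identical serials expected rather than impossible.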
On a non-multipath box, this could happen if the drive went offline and then recovered. I've seen it happen, but I don't know how to reproduce it.
My guess is it's the same drive, and the kernel decided it needed a new device name for some reason. "dmesg|grep sd[gh]" might show you something useful?
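Something along these lines, shown here against made-up example lines (the log text is illustrative, not from the box in question), would surface attach/detach events for the two suspect names:

```shell
#!/bin/sh
# Sketch: filter kernel messages for the two suspect drives. The
# here-doc holds made-up illustrative lines; on the live box you would
# run `dmesg | grep -E 'sd[gh]'` (or `journalctl -k`) instead.
dmesg_sample='[    5.1] sd 6:0:0:0: [sdg] Attached SCSI disk
[    5.2] sd 7:0:0:0: [sdh] Attached SCSI disk
[90000.3] sd 7:0:0:0: [sdh] Synchronizing SCSI cache'

printf '%s\n' "$dmesg_sample" | grep -E 'sd\[?[gh]\]?'
```

A detach followed by a reattach under a new name would support the "same drive, renamed" theory.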
-Adam
-----Original Message----- From: Roundtable roundtable-bounces@muug.ca On Behalf Of Trevor Cordes Sent: Monday, May 23, 2022 1:43 PM To: MUUG RndTbl roundtable@muug.ca Subject: [RndTbl] Bug in smart reporting?
On 2022-05-23 Adam Thompson wrote:
One scenario where you can see this is e.g. on the muug.ca server, where the drives are multipathed - i.e. two physical SAS channels reaching each drive. Linux handles this by having two sdX nodes, then multipathd creates a single /dev/mapper/XXX device for you to use.
It is the old muug server, but I don't see anything in /dev/mapper and I don't think it's multipathed(?). Each drive gets its own SATA cable direct to the board.
On a non-multipath box, this could happen if the drive went offline and then recovered. I've seen it happen, but I don't know how to reproduce it.
Almost certainly not the case in this instance. The drives are very stable, with just these semi-bad smart errors happening off and on for months. The array never went degraded nor resynced. I get panic phone alerts if that happens. :-)
My guess is it's the same drive, and the kernel decided it needed a new device name for some reason. "dmesg|grep sd[gh]" might show you something useful?
I'll try that next time it happens, as I've since rebooted and /var/log/messages doesn't seem to be capturing all the kernel logs on this box for some reason (even after trying to defeat all the journald stuff).
I'm sure I won't have to wait long... I'm just miffed I may have replaced the wrong drive in my RAID6 last night... but the resync was 100% ok, so no lasting harm done.
Yeah, systemd really messes up logging. You just have to rely on "journalctl" instead of /var/log/messages, at some point, no matter what you've done to make it look like the old way. ☹
...or run Devuan, I suppose? I think you can also still build Gentoo without systemd, and there's always *BSD. OpenBSD has a partial systemd-compatibility layer now, not sure about the others, but they all still use honest-to-god dmesg & syslog. Actually, since *BSD all implement SMART slightly differently (and all VERY differently from Linux), you could make a bootable OpenBSD USB stick and use its SMART utilities to cross-check what Linux's smartctl says if you wanted?
I think systemd separates kernel stuff into /var/log/dmesg.log, at least on the system I'm looking at right now. Fedora could be different. And you've customized things anyway, so YMMV here.
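If the goal is to get kernel messages flowing back into /var/log/messages, one commonly used setup (a sketch; the exact result depends on the rsyslog config Fedora ships) is to make the journal persistent and forward it to syslog in /etc/systemd/journald.conf:

```ini
; Sketch of /etc/systemd/journald.conf settings (assumes rsyslog is
; installed and reading from the journal socket); restart
; systemd-journald afterwards.
[Journal]
Storage=persistent
ForwardToSyslog=yes
```

Otherwise `journalctl -k` is the journald-native equivalent of dmesg, and it survives reboots once Storage=persistent is set.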
Good luck, anyway.
-Adam
-----Original Message----- From: Trevor Cordes trevor@tecnopolis.ca Sent: Monday, May 23, 2022 2:29 PM To: Adam Thompson athompso@athompso.net Cc: Continuation of Round Table discussion roundtable@muug.ca Subject: Re: [RndTbl] Bug in smart reporting?