I've been getting errors such as the following on one of my systems recently...
Jul 29 11:15:33 localhost sshd[26456]: Corrupted MAC on input.
Jul 29 11:15:33 localhost sshd[26456]: ssh_dispatch_run_fatal: Connection from 123.123.12.34 port 61436: message authentication code incorrect
These tend to happen during large rsync runs, and they cause the rsync to abort.
Following some advice I found on The Google, I tried setting various values for the "MACs" keyword in sshd_config, but it doesn't seem to change the behaviour. (And, yes, I did remember to restart the sshd service after each config change. Running "ssh -vvv" to the host shows that it's negotiating the newly configured MAC algorithm.)
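For example, something along these lines (the algorithm list here is just an illustration, not a recommendation, and "thehost" is a placeholder):

# in /etc/ssh/sshd_config:
MACs hmac-sha2-512,hmac-sha2-256
# then restart sshd, and on the client check what actually gets negotiated:
ssh -vvv thehost 2>&1 | grep -i mac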
What's the likely cause of this? A bad NIC? Bad RAM? (I'm guessing something is corrupting the packets once in a while, but I'm not sure what. If so, it seems to get past TCP's error correcting.)
Has anyone else come across this before, and have a suggestion?
Thanks, Gilbert
On 2020-07-29 Gilbert E. Detillieux wrote:
What's the likely cause of this? A bad NIC? Bad RAM? (I'm guessing something is corrupting the packets once in a while, but I'm not sure what. If so, it seems to get past TCP's error correcting.)
I would try the same type of transfer using a different client to the same server. Then try a different server for the same client. If you can get the same behavior with a different server, that would be extremely useful.
You could also try piping /dev/zero through nc from the server to the client, into a file, then use a script (or something) to check if the file is all zeros. It would be neat to see the actual corruption that occurs. Make sure nc is using TCP (though UDP would be an interesting test as well, but not critical or required).
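Something like this, say (hostname and port are placeholders, and the listen syntax varies between netcat flavours):

# on the client (receiving end):
nc -l 9999 > zeros.bin        # some netcats want: nc -l -p 9999
# on the server (sending end), push a few GB of zeros over TCP:
head -c 2G /dev/zero | nc client.example.com 9999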
You're right that TCP shouldn't really allow such (line) errors to get through to the ssh layer.
If your NIC has TCP checksum offloading, try turning it off (ethtool is what I used to use for that, not sure if it's still "the way"). That will eliminate the NIC and bus from the equation, leaving you with RAM/CPU and/or mobo between the two (but not out to the cards/bridge).
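Something like this, assuming ethtool is still the way and eth0 is your interface:

# lowercase -k shows the current offload settings:
ethtool -k eth0 | grep -i checksum
# uppercase -K changes them; turn checksum offload off both ways:
ethtool -K eth0 rx off tx off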
If you turn off offloading and the problem goes away, your transfer performance should tank because it'll be doing TCP retries each time.
My guess, as always, is... wait for it... bad caps on the board, likely near the NIC slot, or, if onboard, near the NIC onboard chip. I've had weird NIC behavior before and it's always turned out to be the caps near the card slot, usually 1000uf little jobbers.
I just decommissioned my main workstation I used since 2008(!) that was starting to get occasional VGA lockups, and lo and behold, the caps near the slots were just starting to get puffy (on a very high end Intel board). I'll be repairing them soon to repurpose the system.
P.S. If a repair or replacement isn't possible for a while, sometimes moving the NIC as far away from the puffiest caps can help for a while until more caps go bad. Each 1 or 2 slots usually gets its own cap(s). Also, putting in a junkier NIC might help if it draws less power. These cap problems are always exacerbated by higher (transient/peak) power draws.
Keep us posted!
On 2020-07-29 8:31 p.m., Trevor Cordes wrote:
On 2020-07-29 Gilbert E. Detillieux wrote:
What's the likely cause of this? A bad NIC? Bad RAM? (I'm guessing something is corrupting the packets once in a while, but I'm not sure what. If so, it seems to get past TCP's error correcting.)
I would try the same type of transfer using a different client to the same server. Then try a different server for the same client. If you can get the same behavior with a different server, that would be extremely useful.
This is from a local backup server to an off-site backup. I can easily try a different local server, but won't be able to exactly replicate the rsync, though I can try with other large file(s). As for a different remote destination, that's not easily replicated, but I'd at least know if the problem is limited to the off-site data path and/or server.
You could also try piping /dev/zero through nc from the server to the client, into a file, then use a script (or something) to check if the file is all zeros.
A script? Just using "od" would tell me that. :)
It would be neat to see the actual corruption that occurs. Make sure nc is using TCP (though UDP would be an interesting test as well, but not critical or required).
You're right that TCP shouldn't really allow such (line) errors to get through to the ssh layer.
TCP checksums aren't perfect, and with very large transfers, there is a statistically significant probability of errors getting through, if the underlying layers aren't doing their job. (Normally, Ethernet frame errors are more likely to weed out the bad packets than TCP checksums, but I remember in the days of PPP over dial-up, that TCP checksums were often inadequate. If we've got problems with something in the Ethernet data path letting through bad packets, sshd could be seeing errors that TCP misses.)
If your NIC has TCP checksum offloading, try turning it off (ethtool is what I used to use for that, not sure if it's still "the way"). That will eliminate the NIC and bus from the equation, leaving you with RAM/CPU and/or mobo between the two (but not out to the cards/bridge).
If you turn off offloading and the problem goes away, your transfer performance should tank because it'll be doing TCP retries each time.
Good suggestion. This is an onboard Intel NIC, and on another server, I had to do this...
# Prevent Intel e1000e hangs/resets due to buggy GSO, GRO and TSO.
# As suggested here...
# https://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-d...
ethtool -K em1 gso off gro off tso off
It's a different chipset here, and I'm not seeing this specific error, but it could be something chipset-related anyway.
My guess, as always, is... wait for it... bad caps on the board, likely near the NIC slot, or, if onboard, near the NIC onboard chip. I've had weird NIC behavior before and it's always turned out to be the caps near the card slot, usually 1000uf little jobbers.
I just decommissioned my main workstation I used since 2008(!) that was starting to get occasional VGA lockups, and lo and behold, the caps near the slots were just starting to get puffy (on a very high end Intel board). I'll be repairing them soon to repurpose the system.
P.S. If a repair or replacement isn't possible for a while, sometimes moving the NIC as far away from the puffiest caps can help for a while until more caps go bad. Each 1 or 2 slots usually gets its own cap(s). Also, putting in a junkier NIC might help if it draws less power. These cap problems are always exacerbated by higher (transient/peak) power draws.
I had thought of just putting in a network card, and disabling the onboard NIC, but I didn't want to do that until I was sure it was the NIC and not something software related or MB related. And since this is an off-site system (albeit still on campus), I have to coordinate with someone else who's normally working from home these days.
So, looking for things I can test remotely, at the moment...
Keep us posted!
Will do.
Gilbert
On 2020-07-30 Gilbert E. Detillieux wrote:
You could also try piping /dev/zero through nc from the server to the client, into a file, then use a script (or something) to check if the file is all zeros.
A script? Just using "od" would tell me that. :)
Ya but not if your test nc/file is several GB or TB! :-) You need something that will tell you if there are non-zero bytes in the final file. Well, maybe od can do that if you output in lines and can then use grep to match non-all-zero lines. I'm pretty sure I could make a solution faster in a perl -e one-liner than with od and grep! :-)
Hmm... or maybe cmp -l file /dev/zero .....
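Any of these should do the trick (a rough sketch; cmp against the infinite /dev/zero complains at EOF, so limiting it with -n to the file's size is cleaner, assuming GNU cmp and stat):

# cmp, limited to the file's size:
cmp -n "$(stat -c%s zeros.bin)" zeros.bin /dev/zero && echo all zeros
# od folds runs of identical bytes into a "*", so an all-zero file of any
# size prints just a header line, a "*", and the final offset:
od -A x -t x1 zeros.bin
# or a perl one-liner: read 1MB blocks and bail on the first non-NUL byte,
# so memory stays flat no matter how big the file is:
perl -e 'while (read STDIN, $b, 1<<20) { exit 1 if $b =~ tr/\0//c }' \
  < zeros.bin && echo all zeros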
TCP checksums aren't perfect, and with very large transfers, there is a statistically significant probability of errors getting through, if the underlying layers aren't doing their job. (Normally, Ethernet frame errors are more likely to weed out the bad packets than TCP checksums, but I remember in the days of PPP over dial-up, that TCP checksums were often inadequate. If we've got problems with something in the Ethernet data path letting through bad packets, sshd could be seeing errors that TCP misses.)
Oh, they should be "perfect enough" so you don't get what you're seeing on a regular basis. Maybe someone can whip off the spec and we can do the math. I think they are 32-bit checksums in TCP? Yes, between the lower layer checksums and TCP my gut says errors should be rare. Maybe the math will spell differently though on really junky connections... However short of wireless, no one should really have that junky a connection anymore.
So, looking for things I can test remotely, at the moment...
Computer manufacturers should start including a camera and light *inside* every case pointing down to the mobo so one can inspect the caps at will remotely! :-)
Either that or spend the extra 5c per cap and not use the no-name Chinese caps in the first place!
TCP checksums aren't perfect, and with very large transfers, there is a statistically significant probability of errors getting through,
I think they are 32-bit checksums in TCP? Yes, between the lower layer checksums and TCP my gut says errors should be rare. Maybe the math will spell differently though on really junky connections... However short of wireless, no one should really have that junky a connection anymore.
If those "32-bit checksums" are truly just checksums, i.e., sums mod 2^32, then my math knowledge definitely agrees with Trevor. A simple bit-flip of two bits in the same bit position throughout the entire file would cancel each other and go undetected. And for a large file, that probability is very significant. As for the unlikelihood of "really junky connections" nowadays, well, that's what checksums, CRCs, etc., are for, as a "just in case" the unlikely event does occur, in addition to the occurrence of the more likely events.
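A quick toy illustration of that cancellation, using plain mod-2^32 sums (an assumed algorithm for the sake of argument, not the real TCP one):

# two buffers differing by a pair of compensating bit flips in the same
# bit position produce identical plain mod-2^32 sums:
perl -e '
  my @orig    = (0x80000001, 0x00000000);
  my @flipped = (0x80000000, 0x00000001);   # bit 0 moved between words
  my ($s1, $s2) = (0, 0);
  $s1 = ($s1 + $_) % 2**32 for @orig;
  $s2 = ($s2 + $_) % 2**32 for @flipped;
  printf "orig %08x  flipped %08x\n", $s1, $s2;   # identical
'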
Hartmut W Sager - Tel +1-204-339-8331
Oops, *sorry Gilbert*, I looked at this thread again, and it was *your* position on checksums that I'm supporting. Maybe I have some bad/failing Chinese capacitors in my head. :)
Hartmut W Sager - Tel +1-204-339-8331
On 2020-07-31 Hartmut W Sager wrote:
Oops, *sorry Gilbert*, I looked at this thread again, and it was *your* position on checksums that I'm supporting. Maybe I have some bad/failing Chinese capacitors in my head. :)
Your creator should have sprung for the 5c better caps! ;-)
Looks like you guys are right. TCP only has a 16-bit checksum, and it's a simple sum then 1's complement over (most of) the whole packet.
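For the record, this is roughly the RFC 1071 recipe (a sketch from memory, not lifted from any real stack):

# ones-complement sum of 16-bit words, carries folded back in, then
# complemented: only 16 bits of protection, and compensating flips pass:
perl -e '
  sub cksum {
    my $sum = 0;
    $sum += $_ for unpack "n*", shift;        # 16-bit big-endian words
    $sum = ($sum & 0xffff) + ($sum >> 16) while $sum >> 16;  # fold carries
    return ~$sum & 0xffff;
  }
  my $good = pack "n4", 0x4500, 0x0034, 0xc0a8, 0x0001;
  my $bad  = pack "n4", 0x4501, 0x0034, 0xc0a8, 0x0000;  # two flips, same bit
  printf "good %04x  bad %04x\n", cksum($good), cksum($bad);  # identical
'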
Some post says Microsoft says (paraphrased): "Basically transmit 100MB+ over a typical Internet connection and you are very likely to see a silent failure."
I don't know about that! But, yes, even if you get 1 error through every 1GB TCP, that's pretty awful to contemplate.
I guess rsync detects/corrects for this automatically, but unfortunately Gilbert is seeing errors in the ssh wrapper layer, in a place where ssh is sensitive to errors and wants to barf instead of retry. It almost would be better if ssh would just pass up junk to rsync and let it deal with it.
Gilbert, to confirm, your bug hits after you have transferred lots of data, right? It's not giving this error right at the beginning upon connection, right? Do you have any stats on approx how much data goes across each time before the error hits? Is it consistent or all over the map?
On 2020-07-31 11:42 p.m., Trevor Cordes wrote:
On 2020-07-31 Hartmut W Sager wrote:
Oops, *sorry Gilbert*, I looked at this thread again, and it was *your* position on checksums that I'm supporting. Maybe I have some bad/failing Chinese capacitors in my head. :)
Your creator should have sprung for the 5c better caps! ;-)
Looks like you guys are right. TCP only has a 16-bit checksum, and it's a simple sum then 1's complement over (most of) the whole packet.
That's what I thought, but it's been a while since I looked at TCP headers in detail.
Some post says Microsoft says (paraphrased): "Basically transmit 100MB+ over a typical Internet connection and you are very likely to see a silent failure."
I don't know about that! But, yes, even if you get 1 error through every 1GB TCP, that's pretty awful to contemplate.
As I said earlier, the Ethernet frame layer catches most bit errors for you in the typical network setup. I think there's a 32-bit CRC there...
https://en.wikipedia.org/wiki/Frame_check_sequence
We were often seeing undetected TCP bit errors in PPP over serial (modem) connections (ages ago), where there was no Ethernet frame layer to do the heavy lifting.
But if the NIC isn't doing its work correctly (either at the Ethernet frame level or TCP checksum offloads), this could result in bad data getting up the food chain.
I guess rsync detects/corrects for this automatically, but unfortunately Gilbert is seeing errors in the ssh wrapper layer, in a place where ssh is sensitive to errors and wants to barf instead of retry. It almost would be better if ssh would just pass up junk to rsync and let it deal with it.
I haven't found a way to disable the MAC support in ssh, only a way to select which MAC algorithms are used, at both the client and server end.
Gilbert, to confirm, your bug hits after you have transferred lots of data, right? It's not giving this error right at the beginning upon connection, right? Do you have any stats on approx how much data goes across each time before the error hits? Is it consistent or all over the map?
Yes, these are on large file transfers. Yes, they are occasional, and random. But I had to restart a large (41-ish GB?) file transfer 3 times last week due to repeated errors. A typical nightly backup results in 160 GB or so to transfer, and lately, the rsync fails (somewhere along the way) more often than not.
Gilbert
Just FYI, I had an rsync fail just a while ago about 17 seconds (and 253MB) into a file transfer. Same error on the remote side.
After restarting, it died again, this time after 3m07s (and 2.6GB).
So, it is fairly random, yet consistent! :P
Gilbert
It's bizarre that you're getting this so consistently - the FCS and the TCP checksum between them should trigger TCP retries long before the errors percolate up to the application layer. I can't remember, did you try disabling HW offload on both sending and receiving ends already? (Either end could trigger the SSH abort.) -Adam
It may be worth having a look at the switch interface statistics for that server. Such repeatable errors must be showing up somewhere other than the servers. Also, can you replicate the problem locally?
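On the host side, the NIC's own counters are worth a look too; something like this (interface name em1 is a placeholder, and the counter names vary by driver):

# per-interface error/drop counters as the kernel sees them:
ip -s link show dev em1
# driver-level stats, often including CRC/checksum error counts:
ethtool -S em1 | grep -iE 'err|drop|crc'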
Sent from my phone.
On 2020-08-04 1:04 p.m., David Milton wrote:
It may be worth having a look at the switch interface statistics for that server. Such repeatable errors must be showing up somewhere other than the servers.
I'll have to check with the person who maintains those, as this is the "off-site" backup, and I don't have access to the switch(es).
Also, can you replicate the problem locally?
Haven't been able to so far. Of course, I don't have other systems at either end that are identical or even similar enough.
Gilbert
On 2020-08-04 12:55 p.m., Adam Thompson wrote:
It's bizarre that you're getting this so consistently - the FCS and the TCP checksum between them should trigger TCP retries long before the errors percolate up to the application layer.
Which is why I was leaning toward either a NIC problem, or bad RAM. (There's no ECC on this box.)
I can't remember, did you try disabling HW offload on both sending and receiving ends already? (Either end could trigger the SSH abort.)
I hadn't yet. (I was trying a few other things first, such as changing MAC algorithms, and rebooting with the older kernel, neither of which seemed to affect things.)
I've now disabled both rx and tx checksum offloading. We'll see if that makes a difference.
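Concretely, something along these lines (interface name as an example; note that ethtool settings don't survive a reboot, so it'll need re-applying from a boot script if it helps):

ethtool -K em1 rx off tx off
ethtool -k em1 | grep checksumming   # confirm rx/tx-checksumming: off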
Gilbert
So, after almost 6 days running with rx and tx checksum offloading disabled, not a single "Corrupted MAC on input" error! My overnight rsync now runs to completion.
I hope this isn't premature, but I think we found the problem! (Who would have thought it could make such a difference?!)
It also doesn't seem to have caused a noticeable performance hit. I'm thinking we were disk I/O bound on the remote (receiving) end, anyway, so if the network I/O is a bit slower, we wouldn't see it.
Thanks, everyone, for your suggestions!
Gilbert
I don't run any heavy workload such as this one, but I would not be surprised if that fixes your problem for good. I had quite a few issues with Intel NICs on Linux - especially with the 3.x kernel - that were solved by disabling GSO/GRO/TSO.
Kind regards,
Alberto Abrao 204-202-1778 204-558-6886 www.abrao.net
I have one system where that was an issue. It resulted in a more distinct error message, which I was able to search for, and find the solution you mention.
This one is a different Intel chipset, and a different error. (I don't think this NIC even supports TSO, and the GSO/GRO options don't seem to be the problem.)
Thanks, Gilbert
On 2020-08-10 Gilbert E. Detillieux wrote:
It also doesn't seem to have caused a noticeable performance hit. I'm thinking we were disk I/O bound on the remote (receiving) end, anyway, so if the network I/O is a bit slower, we wouldn't see it.
The loss of the offload is barely noticeable with modern-ish CPUs with a zillion cores. I guess if you're maxing out 10Gb links maybe you'd notice it a touch.
As for the problem being solved... I can now virtually guarantee you that somewhere on your board is bad caps. If the system used to run this load without problem, then started having this problem (as opposed to the system always having this problem), then it's caps. You may have solved this weird issue for now, but the caps will slowly get worse and other weird things will occur. If you ever take a peek inside, let us know!
I haven't found a way to disable the MAC support in ssh, only a way to select which MAC algorithms are used, at both the client and server end.
I think the MAC is integral(?) so I doubt you'll be able to "turn it off". IIRC in encryption you have the privacy (encryption) but you need the MAC to ensure integrity (not tampered with). Something like that, anyhow. It's integral to how ssh works as it promises both (I do believe). Besides, without MAC you'd have written corrupt backups to disk! :-)
And, no ECC? For shame!! ;-)