On 2020-07-29 Gilbert E. Detillieux wrote:
What's the likely cause of this? A bad NIC? Bad RAM? (I'm guessing something is corrupting the packets once in a while, but I'm not sure what. If so, it seems to get past TCP's error correcting.)
I would try the same type of transfer from a different client to the same server. Then try a different server with the same client. If you can get the same behavior with a different server, that would be extremely useful, since it would point at the client side rather than the server.
You could also try using nc to send /dev/zero from the server to the client, saving the received stream into a file, then use a script (or something) to check whether the file is all zeros. It would be neat to see the actual corruption that occurs. Make sure nc is using TCP (UDP would be an interesting test as well, but it's not critical or required).
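For the "check if the file is all zeros" step, here is a minimal Python sketch. It assumes the stream received from nc has already been saved to a file; the function name find_nonzero and the file name zeros.bin are made up for illustration. It reports the offset and value of each non-zero byte, which should show what the corruption actually looks like (a single flipped bit, a whole run of garbage, and so on).

```python
import sys


def find_nonzero(path, chunk_size=1 << 20, limit=20):
    """Return up to `limit` (offset, byte_value) pairs for non-zero bytes in `path`."""
    hits = []
    offset = 0
    with open(path, "rb") as f:
        while len(hits) < limit:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Fast path: a chunk made up only of zero bytes needs no per-byte scan.
            if chunk.count(0) != len(chunk):
                for i, value in enumerate(chunk):
                    if value:
                        hits.append((offset + i, value))
                        if len(hits) >= limit:
                            break
            offset += len(chunk)
    return hits


if __name__ == "__main__":
    # Usage (hypothetical file name): python3 checkzeros.py zeros.bin
    for name in sys.argv[1:]:
        found = find_nonzero(name)
        if not found:
            print(f"{name}: all zeros")
        for pos, value in found:
            # Binary output makes single-bit flips easy to spot.
            print(f"{name}: offset {pos:#x}: {value:#04x} ({value:08b})")
```

If the offsets cluster at regular intervals or the bad bytes always have the same bit set, that would be a useful hint about where the corruption is happening.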
You're right that TCP shouldn't really allow such (line) errors to get through to the ssh layer.
If your NIC has TCP checksum offloading, try turning it off (ethtool is what I used to use for that, not sure if it's still "the way"). The checksum is then verified in software by the CPU, which takes the NIC and bus out of the equation and leaves you with RAM/CPU and/or the mobo between the two (but not the path out to the cards/bridge).
If you turn off offloading and the corruption goes away, your transfer performance should tank, because the kernel will now catch each bad segment and TCP will have to retransmit it every time.
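One way to confirm that retransmits are what is eating the throughput is to compare the kernel's TCP counters before and after a transfer. The sketch below is Linux-only and reads /proc/net/snmp. The helper names read_tcp_counters and counter_delta are made up for illustration, and the InCsumErrors column is assumed to be present (it may be missing on older kernels, so the code treats it as optional).

```python
import os


def read_tcp_counters(path="/proc/net/snmp"):
    """Parse the two 'Tcp:' lines (column names, then values) into a dict."""
    rows = []
    with open(path) as f:
        for line in f:
            if line.startswith("Tcp:"):
                rows.append(line.split()[1:])
    if len(rows) < 2:
        raise ValueError(f"no Tcp header/value pair found in {path}")
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def counter_delta(before, after, keys=("RetransSegs", "InCsumErrors")):
    """Difference between two snapshots; counters missing from a snapshot count as 0."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


if __name__ == "__main__" and os.path.exists("/proc/net/snmp"):
    counters = read_tcp_counters()
    for key in ("InSegs", "RetransSegs", "InCsumErrors"):
        print(key, counters.get(key, "n/a"))
```

Take one snapshot before the transfer and one after, then pass both to counter_delta. Retransmits are counted on the sending host and checksum errors on the receiving host, so it is worth looking at both ends.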
My guess, as always, is... wait for it... bad caps on the board, likely near the NIC slot, or, if onboard, near the onboard NIC chip. I've had weird NIC behavior before and it's always turned out to be the caps near the card slot, usually 1000 µF little jobbers.
I just decommissioned the main workstation I'd used since 2008(!), which was starting to get occasional VGA lockups, and lo and behold, the caps near the slots were just starting to get puffy (on a very high end Intel board). I'll be repairing them soon to repurpose the system.
P.S. If a repair or replacement isn't possible for a while, sometimes moving the NIC as far away as possible from the puffiest caps can help for a while, until more caps go bad. Each 1 or 2 slots usually get their own cap(s). Also, putting in a junkier NIC might help if it draws less power. These cap problems are always exacerbated by higher (transient/peak) power draws.
Keep us posted!