On 2020-07-31 11:42 p.m., Trevor Cordes wrote:
On 2020-07-31 Hartmut W Sager wrote:
Oops, *sorry Gilbert*, I looked at this thread again, and it was *your* position on checksums that I'm supporting. Maybe I have some bad/failing Chinese capacitors in my head. :)
Your creator should have sprung for the 5c better caps! ;-)
Looks like you guys are right. TCP only has a 16-bit checksum, and it's a simple sum then 1's complement over (most of) the whole packet.
That's what I thought, but it's been a while since I looked at TCP headers in detail.
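For anyone who wants to see how weak that sum is, here is a minimal sketch of the 16-bit one's-complement checksum in the style of RFC 1071. It is an illustration only: the real TCP checksum also covers a pseudo-header with the IP addresses, which is left out here, and the function name and sample bytes are made up for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    # Fold any carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc"
swapped = b"\x56\x78\x12\x34\x9a\xbc"  # first two 16-bit words exchanged

# Addition does not care about order, so the reordered data checks out fine.
print(internet_checksum(original) == internet_checksum(swapped))  # True
```

Reordered words, or two bit flips that cancel each other in the sum, pass straight through this kind of check.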
Some post says Microsoft says (paraphrased): "Basically transmit 100MB+ over a typical Internet connection and you are very likely to see a silent failure."
I don't know about that! But, yes, even if you get 1 error through every 1 GB of TCP, that's pretty awful to contemplate.
As I said earlier, the Ethernet frame layer catches most bit errors for you in the typical network setup. I think there's a 32-bit CRC there...
https://en.wikipedia.org/wiki/Frame_check_sequence
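As a rough illustration of why the frame-level CRC does the heavy lifting, the sketch below uses Python's zlib.crc32, which uses the same CRC-32 polynomial as the Ethernet FCS. It only shows the arithmetic, not how a NIC actually frames data, and the sample bytes are made up. Exchanging two 16-bit words leaves a plain 16-bit sum unchanged but changes the CRC.

```python
import zlib

original = b"\x12\x34\x56\x78\x9a\xbc"
swapped = b"\x56\x78\x12\x34\x9a\xbc"  # first two 16-bit words exchanged


def sum16(data: bytes) -> int:
    """Plain 16-bit word sum, truncated to 16 bits (order-insensitive)."""
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return sum(words) & 0xFFFF


same_sum = sum16(original) == sum16(swapped)
same_crc = zlib.crc32(original) == zlib.crc32(swapped)

print("16-bit sum notices the swap:", not same_sum)  # False
print("CRC-32 notices the swap:", not same_crc)      # True
```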
We were often seeing undetected TCP bit errors in PPP over serial (modem) connections (ages ago), where there was no Ethernet frame layer to do the heavy lifting.
But if the NIC isn't doing its work correctly (either at the Ethernet frame level or TCP checksum offloads), this could result in bad data getting up the food chain.
I guess rsync detects/corrects for this automatically, but unfortunately Gilbert is seeing errors in the ssh wrapper layer, in a place where ssh is sensitive to errors and wants to barf instead of retry. It would almost be better if ssh would just pass up junk to rsync and let it deal with it.
I haven't found a way to disable the MAC support in ssh, only a way to select protocols at both the client and server end.
Gilbert, to confirm, your bug hits after you have transferred lots of data, right? It's not giving this error right at the beginning upon connection, right? Do you have any stats on approx how much data goes across each time before the error hits? Is it consistent or all over the map?
Yes, these are on large file transfers. Yes, they are occasional, and random. But I had to restart a large (41-ish GB?) file transfer 3 times last week due to repeated errors. A typical nightly backup results in 160 GB or so to transfer, and lately, the rsync fails (somewhere along the way) more often than not.
Gilbert
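One possible stopgap while the underlying corruption is being tracked down is to wrap the transfer in a retry loop, so a dropped ssh session does not mean restarting by hand. The sketch below is only that, a sketch: the function name, the host and the paths in the comment are hypothetical, and it assumes rsync's --partial option (keep partially transferred files) is acceptable for the backup in question.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, delay_seconds=30,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0; return the attempt number that succeeded."""
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # brief pause before trying again
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")


# Hypothetical usage (host and paths are placeholders, not from this thread):
# run_with_retries(["rsync", "-a", "--partial", "-e", "ssh",
#                   "/data/backup/", "backuphost:/srv/backup/"])
```

This only papers over the symptom; it does nothing about whatever is corrupting the data underneath.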