On 2016-06-17 Trevor Cordes wrote:
Whoa! Indeed we have a cache issue. Disks are near full speed when writing direct to disk, but slow when going through the cache.
Finally, I believe the end is in sight. This must be the actual solution...
Thanks to the club's roundtable discussion I was able to refine my google queries and discovered that the problem is certainly... drum roll please... wait for it... shock of shockers... PAE!
For posterity, for those in a hurry, the solution is one of:
- reinstall with an x86_64 (64-bit) kernel (and in most distros that means userland as well)
- add mem=8G to the GRUB_CMDLINE_LINUX option in /etc/default/grub, regenerate your grub config (in Fedora that's grub2-mkconfig -o /boot/grub2/grub.cfg) and reboot (see the sketch just below)
- set /proc/sys/vm/highmem_is_dirtyable to 1 (I didn't test this)
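To spell out the mem=8G route on Fedora (the options already inside the quotes are just a placeholder for whatever your box has now):

  # /etc/default/grub -- append mem=8G inside the existing GRUB_CMDLINE_LINUX line:
  GRUB_CMDLINE_LINUX="rhgb quiet mem=8G"

  # then regenerate the grub config and reboot:
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot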
(Excruciating) Details:
It looks like there's a magic line at 8GB of RAM: cross it with PAE and the kernel starts making very strange VM/cache tuning choices. This would explain why all my other nearly identical boxes with 4 and 8GB of RAM don't have this problem, but this newest box with 16G does. (I didn't want 16GB, but at the time that was the only cost-effective ECC RAM choice.)
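As an aside, if you want to see how lopsided the highmem/lowmem split gets on a box like this, /proc/meminfo spells it out (these fields only exist on a 32-bit highmem kernel, so this is only meaningful before switching to 64-bit):

  # nearly everything above ~900MB ends up as highmem under PAE:
  grep -iE 'hightotal|lowtotal' /proc/meminfo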
When using 16GB RAM, the kernel tunes thusly:

  # cat /proc/vmstat | grep -P "dirty.*thresh"
  nr_dirty_threshold 0
  nr_dirty_background_threshold 0
A great workaround/test was to add mem=8G to the kernel boot command line. I did that and rebooted. Now the box thinks it has only 8G, not 16G. That doesn't matter, as the box would be fine with only 4G. It's now been over 4 days of uptime and my dd tests are still showing 100% full speed, with none of the usual slowdown that used to hit after 8-20 hours of uptime.
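For the record, confirming that the cap actually took effect is just the usual suspects (nothing box-specific here):

  cat /proc/cmdline             # should now include mem=8G
  grep MemTotal /proc/meminfo   # roughly 8G now, not 16G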
Set at 8G, the kernel auto-tuning shows:

  # cat /proc/vmstat | grep -P "dirty.*thresh"
  nr_dirty_threshold 19280
  nr_dirty_background_threshold 9640
(Thanks for the idea go to http://stackoverflow.com/questions/30519417/why-linux-disables-disk-write-bu...)
I don't fully understand these tunables or precisely what's going on here, but the fact that they are getting set to 0 is screwing something up in the kernel once you've written enough data to fill up something or other, resulting in abysmal buffered write speeds as low as 1MB/s for some of the other guys reporting this bug (I only ever saw a low of 5MB/s).
Apparently you can also work around the problem with:

  echo 1 > /proc/sys/vm/highmem_is_dirtyable

But I didn't test that, as I'm quite happy with the 8G solution for now. Not sure if that tunable will further tune the dirty*threshold parms above.
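If anyone does try that route, it would presumably need to be made persistent across reboots as well; a sketch via the usual sysctl machinery (untested here, and the .conf file name is just my own pick):

  # one-off, lost on reboot:
  echo 1 > /proc/sys/vm/highmem_is_dirtyable

  # persistent:
  echo "vm.highmem_is_dirtyable = 1" > /etc/sysctl.d/99-highmem.conf
  sysctl -p /etc/sysctl.d/99-highmem.conf

  # related knobs the thresholds are (as far as I can tell) derived from:
  sysctl vm.dirty_ratio vm.dirty_background_ratio vm.highmem_is_dirtyable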
Another useful hit was: http://www.linuxquestions.org/questions/slackware-14/slow-disk-write-in-slac...
The real solution (once again) is to ditch PAE and use full 64-bit on large-RAM boxes, with "large-RAM" being a moving target. This new PAE bug (and I do believe it is a bug) would seem to put large-RAM at 8G. But finally I find some other people who "get it" in some of these google hits... they are sticking with 32-bit either because their box can't do 64, their weird apps won't support 64, or it's otherwise a major pain to switch to 64. In my case, the agreement I have with this one customer means that I would have to upgrade the box for free, a not insignificant fact considering a complete Fedora reinstall and resetup would take around 0.5 to 1 day and possibly introduce some issues if I forget little somethings. Besides not having to deal with these PAE kernel bugs, there is literally no other reason to upgrade the OS in this case. Of course, these "PAE bugs" are starting to get frequent enough to almost push me to just bite the bullet...
At least in Fedora (the distro I use everywhere), there is no way to "upgrade" to 64-bit without wiping and starting clean (there used to be a trick way, but it was cut off a few Fedoras ago). Also, Fedora does not support the interesting solution of using a 64-bit kernel with a 32-bit userland (though I'm sure some other distros do). Since I don't want to fight with yum/dnf every time I update the kernel (which is often every week), that's not an option.
Since I have maybe 2-3 boxes that are good candidates for upgrading 32->64, I'm thinking of writing some scripts to semi-automate the process and hopefully catch any "oops, forgot" things. I think making this process "easy" is possible, and I would post my scripts/results (a rough sketch of the idea follows below). And surely in the future I'll h/w upgrade more boxes to the state where 64 makes sense for them also.
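I haven't written those scripts yet, but the gist is a pre-reinstall snapshot so there's something to check against after the wipe; a rough, untested sketch (the paths and file names are just my own picks):

  #!/bin/bash
  # Grab the stuff I'd hate to forget before wiping a box for x86_64.
  OUT=/root/pre64-$(hostname)-$(date +%F)
  mkdir -p "$OUT"
  rpm -qa --qf '%{NAME}\n' | sort > "$OUT/packages.txt"            # what to reinstall
  systemctl list-unit-files --state=enabled > "$OUT/services.txt"  # what to re-enable
  tar czf "$OUT/etc.tar.gz" /etc                                   # configs to diff later
  crontab -l > "$OUT/root-crontab.txt" 2>/dev/null || true         # root's crontab, if any
  cp /etc/fstab "$OUT/fstab.txt"                                   # mounts/UUIDs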
Lastly, I find it interesting, and disconcerting, that nowhere is it really stated whether or not you *can* use PAE on any particular setup. Many distros still default to PAE if you use the 32-bit version. The kernel doesn't warn you on boot ("hey, idiot, you shouldn't use PAE with xGB of RAM"), and I've never read a magazine article or book that says "don't do it". Just anecdotes and old wives' tales. The impression given by PAE is (still) that it's a first-class citizen, and there are many google hits with people defending that notion. I'm sure PAE adds a heck of a lot of IFDEFs to the kernel source, so the fact it's still maintained means that it's intended to be used.
But, on the other hand, I once again came across Torvalds bashing PAE (https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/), summarized as "PAE really really sucks". In it he basically says never use PAE above 4G, and that 32-bit really was never designed to be used above 1G! Great little trove of info though, comparing PAE to all the HIMEM issues we all loved so much when fighting with DOS boot disks just to run a stupid game on a 386 back in the day. I also read somewhere else that Linus refuses bugs/fixes against PAE in large-RAM setups, though I'm not sure if that's true.
In any case, I'm not in the mood to battle for this bug to get fixed (and a battle it would be) so I leave it to this email (along with my zillion previous ones working up to this revelation) and google to help people who hit this bug in the future, as surely there will be more of them.
For google: original problem search keys: 32-bit PAE kernel with large amounts of RAM, like 10GB, 12GB, 16GB, etc, runs fine after reboot, but after a certain amount of writes (or simply uptime, like 1 or 2 days or 6-20 hours, it's variable) starts having extremely poor disk write performance. Disk read remains unaffected. It doesn't matter if you are writing to SSD or spinning-rust HDD. Write speed stays high (say 100MB/s) for most of the time before the problem hits, then it drops off along a curve quite quickly (80MB/s to 40MB/s in an hour) down to lows (10-20MB/s) where it hovers, but still gets worse very slowly. I've seen as low as 5MB/s. I've read reports of 1MB/s. The speed seems to be related to the max write speed of your disk, so the SSDs still write faster than the rust disks when the bug hits. If you reboot, the problem goes away for another few hours/days until it hits again. I'm sure if you had a way to measure total system writes since boot you'd see that the bug hits deterministically every time after X MB of writes, and the curve is probably predictable. You can test with dd's conv=fdatasync option and watch the MB/s drop off. Interestingly, with the oflag=direct dd option there is no slowdown when the bug hits, so the problem must be in the kernel buffer/cache. Lastly, no data corruption ever occurs; things just slow down.
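For concreteness, the sort of dd test I mean looks like this (output file name, block size and count are just placeholders; run it in a directory on the disk you're testing):

  # buffered write through the page cache -- the one that collapses once the bug hits:
  dd if=/dev/zero of=ddtest bs=1M count=2048 conv=fdatasync

  # direct write bypassing the page cache -- stays at full speed even with the bug present:
  dd if=/dev/zero of=ddtest bs=1M count=2048 oflag=direct

  rm -f ddtest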