[RndTbl] Trevor's wacky slow 10MB/s disk I/O update [SOLVED]

Trevor Cordes trevor at tecnopolis.ca
Thu Jun 23 02:21:31 CDT 2016


On 2016-06-17 Trevor Cordes wrote:
> 
> Whoa!  Indeed we have a cache issue.  Disks are near full speed when 
> writing direct to disk, but slow when going through the cache.

Finally, I believe the end is in sight.  This must be the actual
solution...

Thanks to the club's roundtable discussion I was able to refine my
google queries and discovered that the problem is certainly... drum
roll please... wait for it... shock of shockers... PAE!

For posterity, for those in a hurry, the solution is one of:
- reinstall to an x86_64 (64-bit) kernel (and in most distros that means
  userland as well)
- add mem=8G to the GRUB_CMDLINE_LINUX option in /etc/default/grub,
  regen your grub.cfg (in Fedora that's grub2-mkconfig -o
  /boot/grub2/grub.cfg) and reboot (see the sketch after this list)
- set /proc/sys/vm/highmem_is_dirtyable to 1 (I didn't test this)
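For the mem=8G route, the edit looks roughly like this (Fedora paths;
adjust for your distro, and keep whatever options you already have on
that line -- the ones shown are just Fedora-ish defaults):

GRUB_CMDLINE_LINUX="rhgb quiet mem=8G"    <- in /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot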

(Excruciating) Details:

It looks like there's a magic line at 8GB of RAM: cross it with PAE and
the kernel starts making very strange VM/cache tuning choices.  This
would explain why all my other nearly identical boxes with 4 and 8GB of
RAM don't have this problem, but this newest box with 16GB does.  (I
didn't want 16GB, but at the time that was the only cost-effective ECC
RAM choice.)

When using 16GB RAM, the kernel tunes thusly:
cat /proc/vmstat | grep -P "dirty.*thresh"
nr_dirty_threshold 0
nr_dirty_background_threshold 0

A great workaround/test was to add to the kernel boot cmd line:
mem=8G
I did that and rebooted.  Now the box thinks it has only 8G, not 16G.
Doesn't matter as the box would be fine with only 4G.  It's now been
over 4 days of uptime and my dd tests are still showing 100% full speed
with none of the usual slowdown after 8-20 hours of uptime.

Set at 8G, the kernel auto-tuning shows:
# cat /proc/vmstat | grep -P "dirty.*thresh"
nr_dirty_threshold 19280
nr_dirty_background_threshold 9640

(Credit for the idea goes to
http://stackoverflow.com/questions/30519417/why-linux-disables-disk-write-buffer-when-system-ram-is-greater-than-8gb)

I don't fully understand this tunable or precisely what's going on
here, but these thresholds getting set to 0 screws something up in the
kernel once you've written enough data to fill up something or other,
resulting in abysmal buffered write speeds: as low as 1MB/s for some of
the other guys reporting this bug (I only ever saw a low of 5MB/s).
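My rough understanding (so take this as hedged): the kernel derives
those thresholds from the vm.dirty_ratio / vm.dirty_background_ratio
sysctls applied to "dirtyable" memory, and on a PAE kernel highmem
doesn't count as dirtyable unless highmem_is_dirtyable is set, so with
enough highmem the computed thresholds can collapse to 0.  You can
eyeball the relevant knobs with:

sysctl vm.dirty_ratio vm.dirty_background_ratio vm.highmem_is_dirtyable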

Apparently you can also work around the problem with:
echo 1 > /proc/sys/vm/highmem_is_dirtyable
But I didn't test that as I'm quite happy with the 8G solution for
now.  Not sure if that tunable will further tune the dirty*threshold
parms above.
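
If you wanted that to stick across reboots, the usual sysctl.d route
should do it (untested here, same caveat as above):

echo 'vm.highmem_is_dirtyable = 1' > /etc/sysctl.d/90-highmem.conf
sysctl --system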

Another useful hit was:
http://www.linuxquestions.org/questions/slackware-14/slow-disk-write-in-slackware-14-1-a-4175489580/page2.html

The real solution (once again) is to ditch PAE and use full 64-bit on
large-RAM boxes, with "large-RAM" being a moving target.  This new
PAE bug (and I do believe it is a bug) would seem to indicate
largeRAM=8G.  But finally I find some other people who "get it" in some
of these google hits... they are sticking with 32-bit either because
their box can't do 64, their weird apps won't support 64, or it's
otherwise a major pain to switch to 64.  In my case, the agreement I
have with this one customer means I would have to upgrade the box for
free, a not-insignificant cost considering a complete Fedora reinstall
and re-setup would take around half a day to a full day and possibly
introduce some issues if I forget little somethings.  Besides not
having to deal with these PAE kernel bugs, there is literally no other
reason to upgrade the OS in this case.  Of course, these "PAE bugs" are
starting to get frequent enough to almost push me to just bite the
bullet...

At least in Fedora (the distro I use everywhere), there is no way to
"upgrade" to 64 bit without wiping and starting clean (there used to
be a trick way, but it was cut off a few Fedoras ago).  Also, Fedora
does not support the interesting solution of using a 64-bit kernel with
a 32-bit userland (though I'm sure some other distros do).  Since I
don't want to fight with yum/dnf every time I update the kernel (which
is often every week), that's not an option.

Since I have maybe 2-3 boxes that are good candidates for upgrading
32->64, I'm thinking of writing some scripts to semi-automate the
process and hopefully catch any "oops, forgot" things (rough sketch
below).  I think making this process "easy" is possible, and I would
post my scripts/results.  And surely in the future I'll h/w upgrade
more boxes to the state where 64 makes sense for them also.
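
Something like this is what I have in mind for the inventory half --
a rough sketch only, not the finished scripts (paths and what to
capture would need tweaking per box):

#!/bin/sh
# Dump enough state off the 32-bit box to rebuild it as 64-bit.
D=/root/migrate-$(date +%Y%m%d)
mkdir -p $D
rpm -qa --qf '%{NAME}\n' | sort > $D/packages    # feed back to dnf install
systemctl list-unit-files --state=enabled > $D/services
rpm -Va > $D/changed-files 2>&1                  # configs that differ from RPM
tar czf $D/etc.tgz /etc                          # belt and suspenders
for u in $(cut -d: -f1 /etc/passwd); do
  crontab -l -u $u > $D/cron.$u 2>/dev/null
done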

Lastly, I find it interesting, and disconcerting, that nowhere is it
really stated whether or not you *can* use PAE on any particular
setup.  Many distros still default to PAE if you use the 32-bit
version.  The kernel doesn't warn you on boot ("hey, idiot, you
shouldn't use PAE with xGB of RAM"), and I've never read a magazine
article or book that says "don't do it".  Just anecdotes and old
wives' tales.
The impression given by PAE is (still) that it's a first class citizen,
and there are many google hits with people defending that notion.  I'm
sure PAE adds a heck of a lot of IFDEFs to the kernel source, so the
fact it's still maintained means that it's intended to be used.

But, on the other hand, I once again came across Torvalds bashing PAE
https://cl4ssic4l.wordpress.com/2011/05/24/linus-torvalds-about-pae/
summarized as "PAE really really sucks".  In it he basically says never
use PAE above 4G, and that 32-bit really was never designed to be used
above 1G!  Great little trove of info though, comparing PAE to all the
HIMEM issues we all loved so much when fighting with DOS boot disks to
just run a stupid game on a 386 back in the day.  I also read somewhere
else that Linus refuses bugs/fixes against PAE in large RAM setups,
though I'm not sure if that's true.

In any case, I'm not in the mood to battle for this bug to get fixed
(and a battle it would be) so I leave it to this email (along with my
zillion previous ones working up to this revelation) and google to help
people who hit this bug in the future, as surely there will be more of
them.

For google: original problem search keys: 32-bit PAE kernel with large
amounts of RAM, like 10GB, 12GB, 16GB, etc, runs fine after reboot but
after a certain amount of writes (or simply uptime, like 1 or 2 days or
6-20 hours, it's variable) starts having extremely poor disk write
performance.  Disk read remains unaffected.  Doesn't matter if you are
writing to SSD or spinning rust HDD.  Write speed stays high (say
100MB/s) for most of the time before the problem hits, then it drops
off along a curve quite quickly (80MB/s to 40MB/s in an hour) down to
lows (10-20MB/s) where it hovers, but still gets worse very slowly.  I've
seen as low as 5MB/s.  I've read reports of 1MB/s.  The speed seems to
be related to the max write speed of your disk, so the SSDs still write
faster than the rust disks when the bug hits.  If you reboot the
problem goes away for another few hours/days until it hits again.  I'm
sure if you had a way to measure total system writes since boot you'd
see that the bug hits deterministically every time after X MB of
writes, and the curve is probably predictable.  You can test with dd's
conv=fdatasync option and watch the MB/s drop off.  Interestingly, with
dd's oflag=direct option there is no slowdown when the bug hits, so the
problem must be in the kernel buffer/cache.  Lastly, no data corruption
ever occurs, things just slow down.
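
For concreteness, a test along these lines shows it (file name and
sizes here are just examples):

dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
rm -f /tmp/ddtest

The first (buffered, synced at the end) is the one that tanks when the
bug hits; the second (direct I/O, bypassing the cache) stays at full
speed.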

