On 2012-04-11 17:00, I wrote:
After upgrading many of our systems, both workstations and servers, from CentOS 5.x to Scientific Linux 6.x, I'm seeing higher load averages on idle systems than I used to. Under EL5, loads would drop to zero and pretty much stay there most of the time on idle systems. Under EL6, the load might drop down to 0.1, but doesn't stay there for very long, and even on seemingly idle systems, I see loads at or near 1 (sometimes even higher than 1 on some of our servers). It's also intermittent, with load averages dropping and climbing at fairly short intervals (of a few minutes or so).
Problem solved (at long last)!...
It turns out the problem was "hald" polling the CD/DVD-ROM drive every two seconds. I had previously dismissed that as a potential cause, given that this seemed to be no different from the way hald worked under EL5.
Running top, iotop, ftop, iftop, etc. doesn't really point to any major culprits. I've even run PowerTop, and implemented some of its suggested improvements, but that didn't make a difference on load.
My bad... PowerTop had indeed recommended I disable polling in hald, but I wasn't sure I wanted to disable that feature, particularly on the workstations (not really needed on the servers, though). Also, as I said above, I didn't think this was any different than in EL5, but apparently it is.
Also, hald-addon-storage (the sub-process that does the polling) wasn't sticking around long enough to show a big CPU load in "top", particularly with the default 3-second update delay, but when I dropped the delay to half a second, I saw it show up briefly every once in a while. (I was also seeing the irqbalance process show up, and mistakenly thought it might be the culprit. This seemed to make sense at the time, since I was seeing higher loads on our 16-core servers than on the dual-core workstations, but that was a red herring.)
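For anyone wanting to catch a short-lived process the same way, here is a rough sketch rather than exactly what I ran. It assumes the procps version of top (batch mode with -b, delay with -d, sample count with -n), and it assumes top truncates the command name to 15 characters, hence "hald-addon-stor". The count_hits helper is just a name made up for this example.

```shell
#!/bin/sh
# Count how many lines of sampled output mention a given command name.
# Reads "top -b" style output on stdin; the name is matched as a fixed string.
count_hits() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function from reporting a failure.
    grep -c -F -- "$1" || true
}

# Example (commented out so that sourcing this file runs nothing):
# take 40 samples half a second apart and count sightings of the poller.
# top -b -d 0.5 -n 40 | count_hits hald-addon-stor
```

A non-zero count with a half-second delay, against zero with the default delay, would be consistent with a process that wakes up only briefly.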
Just wondering if anyone else has seen similar behaviour with hosts running Red Hat and/or Fedora distributions? Would moving to the "tickless" kernel have anything to do with it? (I.e. does it somehow affect the way load averages are calculated?)
Still not sure if the new kernel makes a difference or not, but there must be something different about the way hald-addon-storage interacts with it to do the polling in EL6, compared to EL5. (Or have they just made the polling more aggressive, by reducing the interval?)
Or is it some system service that can be shut down? (If it is, it's not creating an obvious load on its own, that top or ftop would show, but it may be affecting something in the kernel...)
As you can see by the attached graph of the load average, disabling polling on the CD-ROM drive yesterday afternoon seems to have made all the difference. Here's the command PowerTop recommended:
hal-disable-polling --device /dev/cdrom
(Device name may vary.) The beauty of this, compared to disabling polling for all storage devices, is that you can disable it on a per-device basis and keep polling enabled elsewhere, e.g. for USB devices that might get inserted.
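If you have several optical drives, or want to push this out to many hosts, a small wrapper along these lines might help. It is only a sketch: the disable_polling function and the DRY_RUN variable are names invented for this example, and the only real command it uses is the hal-disable-polling invocation shown above.

```shell
#!/bin/sh
# Disable hald polling for each device named on the command line.
# With DRY_RUN=1 the commands are only printed, so they can be reviewed first.
disable_polling() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "hal-disable-polling --device $dev"
        else
            hal-disable-polling --device "$dev"
        fi
    done
}

# Example (commented out so that sourcing this file changes nothing):
# DRY_RUN=1 disable_polling /dev/cdrom   # print what would be run
# disable_polling /dev/cdrom             # actually run it, as root
```

As far as I know, the same tool also accepts --enable-polling to reverse the change, but check the man page on your system before relying on that.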