strange NTP problem on one of 3 peers


Gilles Detillieux

25 Apr 2012

3:53 p.m.

I have a weird problem with clock drift that just started to happen today on one of my Linux systems. I was wondering if someone on the list has some NTP experience and could help me solve this puzzle.

I have a group of 3 systems operating as peers, and they've been keeping time well for years. Yesterday I upgraded them from Scientific Linux 5.7 to 5.8 (an RHEL 5.8 clone like CentOS 5.8), and rebooted them to the latest kernel on SL 5.8, 2.6.18-308.4.1.el5. I rebooted 2 of them yesterday evening, and the last one I set an at job to reboot at 2:30 am. (It's our mail server so I didn't want to reboot it earlier.) This morning, I noticed this last system's clock was 4-5 minutes behind the others. I've stopped ntpd, reset the clock to the correct time, and restarted ntpd. I've done this twice already this morning, and each time, the clock starts slowly drifting backwards.
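For a sense of scale, the drift described above can be expressed in parts per million. This is only a rough sketch: the post says "this morning" without giving a time, so the 8 hours of elapsed time since the 2:30 am reboot is an assumption, and the 4.5 minutes is just the midpoint of "4-5 minutes". ntpd's frequency discipline can correct at most 500 ppm, so a drift anywhere near this size is well outside what it can slew away.

```python
# Rough drift-rate estimate. ELAPSED_HOURS is an assumption, not a figure
# from the post; LOST_SECONDS is the midpoint of "4-5 minutes behind".
NTPD_MAX_SLEW_PPM = 500  # ntpd's maximum frequency correction


def drift_ppm(lost_seconds: float, elapsed_seconds: float) -> float:
    """Return clock drift in parts per million."""
    return lost_seconds / elapsed_seconds * 1_000_000


ELAPSED_HOURS = 8        # hypothetical: 2:30 am reboot to mid-morning
LOST_SECONDS = 4.5 * 60  # "4-5 minutes behind"

ppm = drift_ppm(LOST_SECONDS, ELAPSED_HOURS * 3600)
print(f"approx. {ppm:.0f} ppm; beyond ntpd's limit: {ppm > NTPD_MAX_SLEW_PPM}")
```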

The syslog entries from ntpd in /var/log/messages on the 2 other systems show fairly frequent occurrences of "synchronized to <IP>, stratum <n>", where n is usually 2 or 3. But for the mail server with the drifting clock, the only ntp sync logged this week was at 21:03:03 yesterday. The last ones before that were April 10 & April 4, i.e. very irregularly. The oldest log entries I have in /var/log/messages.4 show more regular syncs (at least 1-2 a day) up to March 31. So it's possible this problem existed for a while and had nothing to do with the updates yesterday, but this is the first time the drift got so bad it drew attention to itself (some file modification times got out of sync between this server and another system).
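Counting the "synchronized to" entries per day makes the irregularity easier to see than reading the log by eye. The sketch below assumes the usual syslog line layout (date, time, host, `ntpd[pid]:`); the sample lines are illustrative only. The dates and the 21:03:03 time come from the post, but the PIDs, the other times, and the peer addresses (taken from the 192.0.2.0/24 documentation range) are made up.

```python
import re
from collections import Counter

# Matches ntpd syslog lines such as:
#   Apr 24 21:03:03 cliff ntpd[2345]: synchronized to 192.0.2.10, stratum 2
SYNC_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\S+\s+\S+\s+ntpd\[\d+\]:\s+"
    r"synchronized to (?P<peer>[^,\s]+), stratum (?P<stratum>\d+)"
)


def syncs_per_day(lines):
    """Count 'synchronized to' events per (month, day) in syslog lines."""
    counts = Counter()
    for line in lines:
        m = SYNC_RE.match(line)
        if m:
            counts[(m.group("month"), int(m.group("day")))] += 1
    return counts


# Illustrative sample; PIDs, peers and most times are hypothetical.
sample = [
    "Apr  4 08:12:41 cliff ntpd[2345]: synchronized to 192.0.2.10, stratum 2",
    "Apr 10 13:40:02 cliff ntpd[2345]: synchronized to 192.0.2.11, stratum 3",
    "Apr 24 21:03:03 cliff ntpd[2345]: synchronized to 192.0.2.10, stratum 2",
    "Apr 24 21:05:00 cliff sendmail[999]: unrelated line",
]

counts = syncs_per_day(sample)
for (month, day), n in sorted(counts.items(), key=lambda kv: kv[0][1]):
    print(f"{month} {day:2d}: {n} sync(s)")
```

On a real system the function can be fed an open file, for example `syncs_per_day(open("/var/log/messages"))`, and run once per rotated log.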

I'd appreciate any ideas on how to tackle this problem.

Gilles

-- Gilles R. Detillieux E-mail: grdetil@scrc.umanitoba.ca Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)


Gilbert E. Detillieux

25 Apr

4:03 p.m.

On 2012-04-25 10:53, Gilles Detillieux wrote:

...


Are you sure ntpd is running on all systems? Try running the following command, on each of your systems:

/usr/sbin/ntpq -p

This will tell you not only whether ntpd is running, but also where each one is getting its clock settings from, what the drift is, etc.

Note that if the initial clock setting is too far out of whack, ntpd may not even start properly. It's usually useful to run ntpdate first, to at least start off with a close-to-synchronized clock. For some reason, RHEL systems don't do that by default even when you enable ntpd.

-- Gilbert E. Detillieux E-mail: gedetil@muug.mb.ca Manitoba UNIX User Group Web: http://www.muug.mb.ca/ PO Box 130 St-Boniface Phone: (204)474-8161 Winnipeg MB CANADA R2H 3B4 Fax: (204)474-7609

Gilles Detillieux

4:56 p.m.

On 04/25/2012 11:03 AM, Gilbert E. Detillieux wrote:

...

Are you sure ntpd is running on all systems? Try running the following command, on each of your systems:

/usr/sbin/ntpq -p

This will tell you not only whether ntpd is running, but also where each one is getting its clock settings from, what the drift is, etc.

Note that if the initial clock setting is too far out of whack, ntpd may not even start properly. It's usually useful to run ntpdate first, to at least start off with a close-to-synchronized clock. For some reason, RHEL systems don't do that by default even when you enable ntpd.

I ran that on all 3 systems, and it shows ntpd is indeed running on all. SL 5 does seem to run ntpdate first, before starting ntpd, to get the clock sync'ed up beforehand, as long as you have systems defined in /etc/ntp/step-tickers or you put a -x in OPTIONS in /etc/sysconfig/ntpd.

But I wonder if there are some NTP servers on the net that are out of whack. When I run ntpq, the system that has the drift (cliff) shows different results than the other two:

On cliff:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 caustique.anox. 209.51.161.238   2 u    5   64   73   35.302   32.315 4564.06
 tb.mircx.com    64.90.182.55     2 u    2   64   77   46.530   72.637 4517.32
 cliff.scrc.uman .INIT.          16 u    -   64    0    0.000    0.000   0.000
 larry.scrc.uman 208.80.96.70     3 u   16   64   76    0.001  6019.84 3675.39
 dave2.scrc.uman 209.167.68.100   3 u    4   64   42    0.001  6140.53 3858.45
*LOCAL(0)        .LOCL.          10 l    2   64   77    0.000    0.000   0.001

On larry:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+zeus.yocum.org  131.188.3.220    2 u  148  256  377   35.330   -2.723   3.170
*ellen.linuxgene 142.3.100.2      2 u  214  256  377   31.665   -1.240   2.296
 cliff.scrc.uman .INIT.          16 u   17   64    0   27.461  -87882.   0.000
 larry.scrc.uman .INIT.          16 u    - 1024    0    0.000    0.000   0.000
+dave2.scrc.uman 209.167.68.100   3 u  204  256  377    0.400    2.594   0.890
 LOCAL(0)        .LOCL.          10 l   44   64  377    0.000    0.000   0.001

dave2's results are similar to larry's.
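Billboards like these are easier to compare across machines when parsed. The following is a minimal sketch, assuming the standard `ntpq -p` column layout with the tally character in the first column. The function names and the two "suspicious" criteria are my own choices, not anything ntpq provides: a reach register of 0 (no recent replies), or an offset larger than ntpd's default 128 ms step threshold. The sample is cliff's output from this message.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    tally: str        # first column: '*' system peer, '+' candidate, ' ' none
    remote: str
    refid: str
    stratum: int
    reach: int        # reachability shift register (printed in octal)
    offset_ms: float
    jitter_ms: float


def parse_ntpq(text):
    """Parse the peer rows of `ntpq -p` output into Peer tuples."""
    peers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line[1:].split()  # column 0 is the tally character
        if len(fields) != 10 or fields[0] == "remote":
            continue               # skip the header and the ===== rule
        peers.append(Peer(
            tally=line[0],
            remote=fields[0],
            refid=fields[1],
            stratum=int(fields[2]),
            reach=int(fields[6], 8),
            offset_ms=float(fields[8]),
            jitter_ms=float(fields[9]),
        ))
    return peers


def suspicious(peers, max_offset_ms=128.0):
    """Names of peers that are unreachable or beyond the step threshold."""
    return [p.remote for p in peers
            if p.reach == 0 or abs(p.offset_ms) > max_offset_ms]


CLIFF_OUTPUT = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 caustique.anox. 209.51.161.238   2 u    5   64   73   35.302   32.315 4564.06
 tb.mircx.com    64.90.182.55     2 u    2   64   77   46.530   72.637 4517.32
 cliff.scrc.uman .INIT.          16 u    -   64    0    0.000    0.000   0.000
 larry.scrc.uman 208.80.96.70     3 u   16   64   76    0.001  6019.84 3675.39
 dave2.scrc.uman 209.167.68.100   3 u    4   64   42    0.001  6140.53 3858.45
*LOCAL(0)        .LOCL.          10 l    2   64   77    0.000    0.000   0.001
"""

peers = parse_ntpq(CLIFF_OUTPUT)
print(suspicious(peers))
```

On cliff's data this flags cliff's own entry (never reached) and both local peers (offsets of about 6 seconds), while the only peer marked `*` is the local clock, `LOCAL(0)`.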

A few other things I thought I should point out: All 3 systems have ports 123/tcp and 123/udp open in iptables. The clock on cliff seems to drift whether or not ntpd is running, though that could be because the calculated drift compensation is out of whack. The /var/lib/ntp/drift file on cliff hasn't been modified since 1:58 this morning, before the reboot, while it has been on the other 2 systems. All 3 systems have an identical configuration, using 0.pool.ntp.org as the step-ticker, and 0.pool.ntp.org and 1.pool.ntp.org as stratum 1 servers.


Mike Pfaiffer

10:49 p.m.

On 12-04-25 11:56 AM, Gilles Detillieux wrote:

...


I'm somewhat hesitant to reply to this thread since, compared to you guys, I'm an amateur. However the folks at the CLL lab in Winnipeg and I have noticed similar behaviour in standalone machines not connected to an ntp server. Clearly this isn't the same problem you're experiencing but it looks close. The reason it looks this way in the standalone machines is because the internal battery used to maintain the settings is running low on power. I phrased it this way because the same behaviour occasionally appears in Macs as well as PCs. The solution is to replace the batteries. However we can put off replacing the battery if we connect the machine to an ntp server. Eventually it gets to the point where the battery can't maintain ANY settings. As the charge goes down the results are similar to what you are seeing.

Considering we deal with OLD (but mostly useful) machines at the CLL I am inclined to look at hardware rather than software as the major source of problems.

I doubt this will be useful to you but it is best to check out all possibilities starting with the simple stuff first.

Later Mike

Gilles Detillieux

26 Apr

2:14 a.m.

On 25/04/2012 5:49 PM, Mike Pfaiffer wrote:

...


Thanks, Mike. Some of the comments I found when researching this online suggested a weak battery as a possible cause as well, so it's a possibility I'll explore tomorrow. I would have thought that NTP would compensate for this, but I guess it will only compensate so much. I had also been under the impression that the kernel maintained the system clock internally, separately from the hardware clock, but I'm now getting the impression that current Linux systems tend to synchronize these dynamically. It's a 4.5 year old machine, so it could well be due for a new battery.

Other possible causes that have been implicated in clock drift are Xen related issues and NIC problems. I had installed Xen on the server that's giving me problems now, and I had switched to a non-xen kernel in March, but hadn't removed all the xen-related packages and configurations, so it's possible that some of that was causing problems. I've removed all the xen and virtualisation packages and libraries, and I'm going to reboot again in the wee hours to see if that helps. If not, I'll see about replacing the battery. Failing all that, I'll see about putting in a new network card to replace the onboard NIC, as I've read about someone who solved an NTP clock drift problem doing just that. The trouble with the hardware solutions is they'll require downtime on our mail and web server during the day when I'm there.

Gilles


Gilles Detillieux

6:19 p.m.

On 04/25/2012 09:14 PM, I wrote:

...

Other possible causes that have been implicated in clock drift are Xen related issues and NIC problems. I had installed Xen on the server that's giving me problems now, and I had switched to a non-xen kernel in March, but hadn't removed all the xen-related packages and configurations, so it's possible that some of that was causing problems. I've removed all the xen and virtualisation packages and libraries, and I'm going to reboot again in the wee hours to see if that helps.

Well, that seems to have done it! After cleaning up all the remaining xen cruft from the system, and rebooting last night, the clock seems to be rock-solid today. I guess the moral is if you have any of the xen virtualisation libraries or services enabled, you need to be running the xen kernel. I expect the reverse may also be true.


Dan Martin

2:23 a.m.

New subject: store changes made to web page

I have designed a web page to edit information for my own use. Using a link to a page of javascript, I can select radio buttons, change text directly in text areas, etc. The html is a useful GUI interface.

There are many inputs, but no forms. This is strictly client side processing. I may have to run this from machines where I have no server access, and no administrative control over the machine.

It appears to work well visually, but when I save the altered web page (using File : Save Page As in Firefox) the altered DOM is not saved. Not even simple text changes. It appears that the original web page is saved instead. Also, the changes do not show in Firebug.

Interestingly, there seems to be an exception with a section that adds new nodetrees rather than modifying old ones. The new nodes show in Firebug, and I can save the html by right clicking the html in Firebug and choosing "Save HTML". For the modified nodes, the original version is saved.

Does anyone have ideas on how I can save the modified DOM tree, the one that displays on the web page?

Dan Martin GP Hospital Practitioner Computer Scientist ummar143@shaw.ca (204) 831-1746 answering machine always on

Trevor Cordes

30 Apr

3:14 a.m.

New subject: store changes made to web page

On 2012-04-25 Dan Martin wrote:

...

Does anyone have ideas on how I can save the modified DOM tree, the one that displays on the web page?

Welcome to modern browser hell :-)

If you are saving it manually, use the "Web Developer 1.1.9" add-on in firefox and use its View Source -> View Generated Source option and then save from there.

Also, perl cpan WWW::Mechanize::Firefox should be able to save the post-DOM-changes version if you pause long enough or wait for the correct events (or trigger it manually). I use it extensively.

Dan Martin

4:28 a.m.

New subject: store changes made to web page

On 2012-04-29, at 10:14 PM, Trevor Cordes wrote:

...


I will definitely look into Web Developer. I may review mechanize - I think ruby has a mechanize which is simply hooks to the perl mechanize, but I could be wrong.

In the meantime, I have found that if I add elements and text nodes directly, these changes seem to show in firebug and persist through a "save" operation. Since attribute changes do not show, including the value attributes of inputs, I have attached text nodes which my javascript scripts modify in addition to the values.

I cannot attach text nodes to textbox inputs. I make the skeleton DOM in ruby's REXML library. Displaying the html in text form shows that the correct tags/content are generated, but firefox rearranges things so that the text belongs to the node containing the input. I changed all of my textbox inputs to textareas, which can contain text nodes, only to find I had no way to store the results of my checkboxes.

I finally created an XML tree inside a div which is invisible, for the sole purpose of storing results. It has the added benefit of sensible tag names and is easier to scrape afterward.
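Scraping such a hidden results container from the saved page could look roughly like the sketch below. Everything specific here is hypothetical, since the post does not show the markup: the `id="results"` attribute, the `diagnosis` and `followup` tag names, and the sample page. It also assumes the stored elements are simple, properly closed tags that each hold one piece of text.

```python
from html.parser import HTMLParser


class ResultsScraper(HTMLParser):
    """Collect {tag: text} pairs found inside <div id="results">."""

    def __init__(self, container_id="results"):
        super().__init__()
        self.container_id = container_id
        self.depth = 0        # > 0 while inside the container div
        self.current = None   # tag whose text is being collected
        self.results = {}

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            self.current = tag
        elif tag == "div" and dict(attrs).get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            self.current = None

    def handle_data(self, data):
        if self.depth and self.current and data.strip():
            self.results[self.current] = data.strip()


# Hypothetical saved page: a visible input plus the invisible results div.
saved_page = (
    "<html><body><textarea>notes</textarea>"
    '<div id="results" style="display:none">'
    "<diagnosis>stable</diagnosis><followup>yes</followup>"
    "</div></body></html>"
)

scraper = ResultsScraper()
scraper.feed(saved_page)
print(scraper.results)
```

Note that `html.parser` lowercases tag names, so mixed-case XML tag names would come back in lower case.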

Using html as my GUI turned out to be a lot harder than I anticipated, though I think I am close to getting it done now.


Kevin McGregor

26 Apr

2:27 a.m.

If one NTP server in your list is significantly off, the NTP algorithm should notice and eliminate it from its calculations automatically. I wouldn't worry about that point. None of these are virtual, are they?

On Wed, Apr 25, 2012 at 11:56 AM, Gilles Detillieux < grdetil@scrc.umanitoba.ca> wrote:

...


Gilles Detillieux

3:11 a.m.

Yeah, I suspect it's my own server that's getting eliminated because it's off by so much, which is why it's not logging any syncs. All my servers are separate physical machines, no VMs.

On 25/04/2012 9:27 PM, Kevin McGregor wrote:

...

If one NTP server in your list is significantly off, the NTP algorithm should notice and eliminate it from its calculations automatically. I wouldn't worry about that point. None of these are virtual, are they?


John Lange

1:40 p.m.

Just to confirm an earlier point, ntp will stop trying to sync the clock if it gets too far out.

This was a very common problem on VMs and a lot of people resorted to running ntpdate (or more recently ntpd -q which is much "gentler") on a cron at frequent intervals just to keep it close. (I haven't noticed this happening so much recently so this must be a problem that has been solved).

In your case you say you're not virtualized... wait... whoa... You're not virtualized?!?

There are settings for ntp which tell it never to give up no matter how bad the clock skews.

Take a look at the ntp man page ( http://linux.die.net/man/8/ntpd ) and search for the word "panic". Then read the sections "POLL INTERVAL CONTROL", and "THE HUFF-N'-PUFF FILTER".

Then read this: http://www.eecis.udel.edu/~mills/ntp/html/miscopt.html#tinker

Specifically, you put "tinker panic 0" in the config file.

Also, if your bios clock is way out, look at starting ntp with the "-g" option (but this only applies at startup).

And finally, run ntpd in debug mode ( -d ) and watch it in real time on the console.
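The thresholds behind those options can be summarised in a few lines. This is a deliberately simplified model of ntpd's default behaviour, not its actual code: offsets up to the 128 ms step threshold are slewed, larger ones are stepped, and anything beyond the 1000 s panic threshold makes ntpd log a message and exit, unless "-g" allows it on the first adjustment or "tinker panic 0" disables the check. It ignores the stepout interval and the "-x" option, and the function and parameter names are made up for illustration.

```python
STEP_THRESHOLD_S = 0.128    # default: larger offsets are stepped, not slewed
PANIC_THRESHOLD_S = 1000.0  # default: larger offsets make ntpd give up


def ntpd_reaction(offset_s, panic_threshold_s=PANIC_THRESHOLD_S,
                  first_update_with_g=False):
    """Simplified model of how ntpd treats a measured clock offset.

    panic_threshold_s=0 models "tinker panic 0" (check disabled);
    first_update_with_g=True models the first adjustment after "ntpd -g".
    """
    magnitude = abs(offset_s)
    panic_enabled = panic_threshold_s > 0 and not first_update_with_g
    if panic_enabled and magnitude > panic_threshold_s:
        return "panic: log and exit"
    if magnitude > STEP_THRESHOLD_S:
        return "step"
    return "slew"


for offset in (0.05, 270.0, 5000.0):
    print(f"{offset:>8.2f} s -> {ntpd_reaction(offset)}")
print("5000 s with tinker panic 0 ->",
      ntpd_reaction(5000.0, panic_threshold_s=0))
```

In this model, the 4-5 minute error reported earlier in the thread (about 270 s) is in the "step" range, not the "panic" range.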

ntp wins my award for the software that seemingly does the most simple task but is in fact doing something massively complex.

John
