Pasky’s Log

Fixing NTP Refusing to Sync

April 26th, 2010 9 comments

I have just been confronted by NTP absolutely refusing to touch my system’s clock. The trouble with NTP is that it is absolute PITA to debug it at all since when it does not get in sync with its peers, it goes at great lengths to make its reasons as incomprehensible as possible.

For some reason, my system had absolutely massive drift – something in the order of half a second per minute, making the clock drift by several tens of minutes per day. So I installed NTP and hoped that it would magically fix up the issue, but it turns out that NTP by itself is absolutely unhelpful not only in cases of big offset, but also in cases of big drift – it will fix your clock when it is slightly inaccurate, but not when it is inaccurate a lot (…that is, when you would want to use it all the more).

First thing I did was check the hardware’s opinion. Comparing date and hwclock --show has shown that the hardware clock is doing fine, only kernel’s idea of time is drifting off. Next, it’s time to see what NTP thinks about its peers:

ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 tik.cesnet.cz   .GPS.            1 u   12   64  377    0.641  8494.05 2911.29
 tak.cesnet.cz   .GPS.            1 u    2   64  377    0.636  8594.86 2945.05

NTP polls each peer every “poll” seconds, “when” is relative time of last poll; “reach” keeps track of last successful polls, 377 is best. “Delay” is network delay, this is fine. “Offset” is the offset between local and peer clock, it’s at 8.5s now – not so good, but trouble is it gets bigger quickly. But what’s the real culprit is “jitter” – it’s huge! This means that the variance of offsets is huge – to put it simply, the offset is very different each time it is measured. Since no symbols are printed in the first column of the output, there is no peer synchronization going on.

So if we know a lot about NTP already, the high jitter should hint us that the offset measurements are unreliable. But the network connection of our server is very good, it would be nice to look at the actual measurements. Instead of peers, let’s look at their associations:

ntpq> as

ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 55713  9014   yes   yes  none    reject   reachable  1
  2 55714  9014   yes   yes  none    reject   reachable  1

NTP is not liking our peers. No surprise, with the big jitter. But what we are after are the assID numbers:

ntpq> rv 55713
assID=55713 status=9014 reach, conf, 1 event, event_reach,
srcadr=tik.cesnet.cz, srcport=123, dstadr=195.113.20.142, dstport=123,
leap=00, stratum=1, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=GPS, reach=377, unreach=0, hmode=3, pmode=4,
hpoll=6, ppoll=6, flash=400 peer_dist, keyid=0, ttl=0, offset=13041.231,
delay=0.602, dispersion=0.944, jitter=2918.331,
reftime=cf803b51.ddd3e70e  Mon, Apr 26 2010 18:18:25.866,
org=cf803b83.e9b29181  Mon, Apr 26 2010 18:19:15.912,
rec=cf803b76.df382c7c  Mon, Apr 26 2010 18:19:02.871,
xmt=cf803b76.df0d40c7  Mon, Apr 26 2010 18:19:02.871,
filtdelay=     0.60    0.64    0.60    0.51    0.82    0.67    0.69    0.64,
filtoffset= 13041.2 12385.8 11720.4 11075.2 10409.6 9774.54 9129.22 8494.06,
filtdisp=      0.00    0.98    1.97    2.93    3.92    4.86    5.82    6.77

Looking at the last three lines, the reason for the huge jitter finally seems clear! Our clock drifts so fast that the offset will go up by several seconds through our few measurements.

Unfortunately, NTP does not seem to be giving us the actual estimated drift value between local clock and the peer. This would be very useful since that’s actually what makes NTP decide whether go ahead and sync or keep its hands away from the clock; it is said that 500ppm is the max. drift value for possible synchronization, but I don’t know how to connect that to any of the other numbers I see; when the clock is already in sync, it is probably the ‘frequency’ value in ‘rv’ (and it is stored in the drift file), but this value stays untouched before synchronization. Too bad.

So, now we know the issue is that kernel clock is going too slow and that NTP is not going to fix it for ourselves. So, we must resort to manual tinkering using adjtimex:

# adjtimex -p
         mode: 0
       offset: 0
    frequency: 0
     maxerror: 0
     esterror: 0
       status: 64
time_constant: 4
    precision: 1
    tolerance: 32768000
         tick: 9900
     raw time:  1272299204s 17444us = 1272299204.017444
 return value = 5

Wow, a lot of numbers. But the one that tells how fast the clock is going is the ‘tick’ value, and you can adjust it using adjtimex -t 10000 – that will make the clock go a lot faster, and is also sort-of canonical value. Let’s just do that and restart ntpd:

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 tik.cesnet.cz   .GPS.            1 u    1   64    7    0.659  16852.5   1.840
 tak.cesnet.cz   .GPS.            1 u    2   64    7    0.665  16852.5   1.863

This is MUCH better! In fact, after few minutes NTP will decide to step the clock to compensate the offset, and after another while it will finally get in sync with the peers. If the jitter is still too big (but different), keep tweaking the tick value.

EDIT: It seems that alternatively, you can try to change your clock source – this might help especially in case of virtualization:

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

Hope this helps if your NTP also refuses to fix your clock.

Open questions remain:

Why was my tick value so off? I guess I will never know. Maybe a reboot would fix it too, but I wasn’t keen to do that.
How to determine drift-per-peer value to see how much out of bounds it is?
How to make NTP automatically fix even huge drifts?
Why is NTP crafted to be so hard to debug without spending tens of minutes googling, staring at bunches of floats and decoding bitmasks manually?

Thanks to prema and otis for ideas and help.

Categories: linux Tags: clock, debian, ntp

Archive

Fixing NTP Refusing to Sync

Recent Comments

Categories

Blogroll

Licence