Fixing NTP Refusing to Sync
I have just been confronted by NTP absolutely refusing to touch my system’s clock. The trouble with NTP is that it is an absolute PITA to debug, since when it does not get in sync with its peers, it goes to great lengths to make its reasons as incomprehensible as possible.
For some reason, my system had absolutely massive drift, something on the order of half a second per minute, which adds up to more than ten minutes per day. So I installed NTP and hoped that it would magically fix the issue, but it turns out that NTP by itself is absolutely unhelpful not only in cases of big offset, but also in cases of big drift: it will fix your clock when it is slightly inaccurate, but not when it is inaccurate a lot (…that is, when you would want to use it all the more).
The first thing I did was check the hardware’s opinion. Comparing the system time with the output of hwclock --show showed that the hardware clock is doing fine; only the kernel’s idea of time is drifting off. Next, it’s time to see what NTP thinks about its peers:
ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 tik.cesnet.cz   .GPS.            1 u   12   64  377    0.641  8494.05 2911.29
 tak.cesnet.cz   .GPS.            1 u    2   64  377    0.636  8594.86 2945.05
NTP polls each peer every “poll” seconds; “when” is the number of seconds since the last poll; “reach” is an octal bitmask of the last eight polls, and 377 (all eight succeeded) is best. “Delay” is the network delay in milliseconds, and this is fine. “Offset” is the offset between the local and the peer clock, also in milliseconds; it’s at 8.5s now, which is not so good, but the trouble is that it gets bigger quickly. The real culprit, though, is “jitter”: it’s huge! This means that the variance of the offsets is huge; to put it simply, the offset is very different each time it is measured. Since no tally symbol (such as *) is printed in the first column of the output, no peer has been selected and there is no synchronization going on.
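To make the “reach” column less cryptic, here is a small Python sketch that unpacks the octal value into the individual poll results. The function name decode_reach is made up for this illustration; the only thing taken from ntpq is that the value is printed in octal and that the lowest bit is the most recent poll.

```python
def decode_reach(reach_octal):
    """Return the last eight poll results as booleans, most recent first."""
    value = int(reach_octal, 8)  # ntpq prints the reach register in octal
    # Bit 0 is the most recent poll, bit 7 the oldest of the eight.
    return [bool((value >> bit) & 1) for bit in range(8)]


for reach in ("377", "7"):
    results = decode_reach(reach)
    print(reach, "->", sum(results), "of the last 8 polls answered")
```

A value of 377 therefore means all eight recent polls were answered, while a small value like 7 shortly after a restart only means that three polls have happened so far.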
So if we know a lot about NTP already, the high jitter should hint to us that the offset measurements are unreliable. But the network connection of our server is very good, so it would be nice to look at the actual measurements. Instead of peers, let’s look at their associations:
ntpq> as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 55713  9014   yes   yes  none    reject   reachable  1
  2 55714  9014   yes   yes  none    reject   reachable  1
NTP is not liking our peers. No surprise, with the big jitter. But what we are after are the assID numbers:
ntpq> rv 55713
assID=55713 status=9014 reach, conf, 1 event, event_reach,
srcadr=tik.cesnet.cz, srcport=123, dstadr=188.8.131.52, dstport=123,
leap=00, stratum=1, precision=-20, rootdelay=0.000,
rootdispersion=0.000, refid=GPS, reach=377, unreach=0, hmode=3, pmode=4,
hpoll=6, ppoll=6, flash=400 peer_dist, keyid=0, ttl=0, offset=13041.231,
delay=0.602, dispersion=0.944, jitter=2918.331,
reftime=cf803b51.ddd3e70e  Mon, Apr 26 2010 18:18:25.866,
org=cf803b83.e9b29181  Mon, Apr 26 2010 18:19:15.912,
rec=cf803b76.df382c7c  Mon, Apr 26 2010 18:19:02.871,
xmt=cf803b76.df0d40c7  Mon, Apr 26 2010 18:19:02.871,
filtdelay=     0.60    0.64    0.60    0.51    0.82    0.67    0.69    0.64,
filtoffset= 13041.2 12385.8 11720.4 11075.2 10409.6 9774.54 9129.22 8494.06,
filtdisp=      0.00    0.98    1.97    2.93    3.92    4.86    5.82    6.77
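Looking at the filtdelay, filtoffset and filtdisp values at the end of the output, the reason for the huge jitter finally seems clear! Our clock drifts so fast that the offset goes up by several seconds over the eight measurements NTP keeps (filtoffset lists them newest first, in milliseconds).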
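A rough drift figure can be worked out by hand from the filtoffset samples. The Python sketch below is only a back-of-the-envelope estimate under two assumptions of mine: that the eight samples were taken exactly one poll interval apart (hpoll=6, so 64 seconds), and that nothing adjusted the clock in between. The function name estimate_drift_ppm is made up for this illustration.

```python
# filtoffset values from the rv output, newest first, in milliseconds.
filtoffset_ms = [13041.2, 12385.8, 11720.4, 11075.2,
                 10409.6, 9774.54, 9129.22, 8494.06]
poll_interval_s = 2 ** 6  # hpoll=6 -> 64 s; assumes one sample per poll


def estimate_drift_ppm(offsets_ms, interval_s):
    """Average growth of the offset between the oldest and newest sample, in ppm."""
    span_ms = offsets_ms[0] - offsets_ms[-1]        # newest minus oldest
    elapsed_s = interval_s * (len(offsets_ms) - 1)  # 7 intervals for 8 samples
    return span_ms / 1000.0 / elapsed_s * 1e6


drift_ppm = estimate_drift_ppm(filtoffset_ms, poll_interval_s)
print(f"estimated drift: {drift_ppm:.0f} ppm "
      f"(about {drift_ppm / 1e4:.2f} % of real time, "
      f"{drift_ppm / 500:.0f} times the 500 ppm limit)")
```

Under those assumptions the estimate lands at roughly 10,000 ppm, that is, the local clock loses about one percent against the peer, far beyond anything NTP is willing to correct by itself.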
Unfortunately, NTP does not seem to give us the actual estimated drift value between the local clock and the peer. This would be very useful, since that is what makes NTP decide whether to go ahead and sync or keep its hands off the clock. It is said that 500 ppm is the maximum drift for which synchronization is possible, but I don’t know how to connect that to any of the other numbers I see. When the clock is already in sync, it is probably the ‘frequency’ value in ‘rv’ (which is also stored in the drift file), but this value stays untouched before synchronization. Too bad.
So, now we know the issue is that the kernel clock is going too slow and that NTP is not going to fix it for us. So we must resort to manual tinkering using adjtimex:
# adjtimex -p
         mode: 0
       offset: 0
    frequency: 0
     maxerror: 0
     esterror: 0
       status: 64
time_constant: 4
    precision: 1
    tolerance: 32768000
         tick: 9900
     raw time:  1272299204s 17444us = 1272299204.017444
 return value = 5
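Wow, a lot of numbers. But the one that tells how fast the clock is going is the ‘tick’ value, the number of microseconds added to the system time on every kernel tick. You can adjust it using adjtimex -t 10000. That will make the clock go a lot faster (one percent faster than with 9900), and 10000 is also the sort-of canonical value, since it corresponds to exactly 100 ticks per second. Let’s just do that and restart ntpd: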
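To see how much a given tick value throws the clock off, the arithmetic can be sketched in a few lines of Python. This assumes a nominal tick of 10000 microseconds (USER_HZ=100), and the helper names tick_to_ppm and seconds_lost_per_day are invented for this example.

```python
NOMINAL_TICK_US = 10000  # assumes USER_HZ=100, i.e. 100 ticks per second


def tick_to_ppm(tick_us, nominal_us=NOMINAL_TICK_US):
    """Rate error in ppm introduced by a tick value (negative means slow)."""
    return (tick_us - nominal_us) / nominal_us * 1e6


def seconds_lost_per_day(tick_us):
    """Seconds the clock falls behind per day (negative means it runs ahead)."""
    return -tick_to_ppm(tick_us) * 1e-6 * 86400


print("rate error at tick 9900:", tick_to_ppm(9900), "ppm")
print("seconds lost per day:   ", seconds_lost_per_day(9900))
print("one tick step equals:   ", tick_to_ppm(10001), "ppm")
```

A tick of 9900 works out to a clock that runs about one percent slow, on the order of a quarter of an hour per day. Each single step of the tick value changes the rate by about 100 ppm, so it is a fairly coarse knob.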
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 tik.cesnet.cz   .GPS.            1 u    1   64    7    0.659  16852.5   1.840
 tak.cesnet.cz   .GPS.            1 u    2   64    7    0.665  16852.5   1.863
This is MUCH better! In fact, after a few minutes NTP will decide to step the clock to compensate for the offset, and after another while it will finally get in sync with the peers. If the jitter is still too big (but different), keep tweaking the tick value.
EDIT: It seems that alternatively, you can try to change your clock source – this might help especially in case of virtualization:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
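The same two sysfs files can be read from a script, for example to list which clock sources could be tried instead of the current one. This is only a sketch: the helper names read_clocksources and alternatives are made up here, and the directory is the one from the cat commands above, which is Linux-specific and may be absent on other systems.

```python
from pathlib import Path

# Directory taken from the cat commands above; Linux-specific.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported by sysfs."""
    current = (Path(base) / "current_clocksource").read_text().strip()
    available = (Path(base) / "available_clocksource").read_text().split()
    return current, available


def alternatives(current, available):
    """Clock sources that could be tried instead of the current one."""
    return [name for name in available if name != current]


if __name__ == "__main__":
    if CLOCKSOURCE_DIR.exists():
        current, available = read_clocksources()
        print("current:  ", current)
        print("could try:", ", ".join(alternatives(current, available)))
    else:
        print("no clocksource sysfs directory on this system")
```

Switching is then a matter of writing one of the listed names into current_clocksource as root; I have not checked how long that takes to show up in the offsets.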
Hope this helps if your NTP also refuses to fix your clock.
Open questions remain:
- Why was my tick value so off? I guess I will never know. Maybe a reboot would fix it too, but I wasn’t keen to do that.
- How to determine the drift-per-peer value to see how far out of bounds it is?
- How to make NTP automatically fix even huge drifts?
- Why is NTP crafted to be so hard to debug without spending tens of minutes googling, staring at bunches of floats and decoding bitmasks manually?
Thanks to prema and otis for ideas and help.