Tuesday, July 2, 2024

Internal clock issues, TSC falling back to hpet in Linux kernel 4.X

Recently I was having some issues at work with internal clocks.

For high performance systems, you might prefer using TSC (time stamp counter) instead of HPET (high precision event timer). There's a good explanation of these two in this Red Hat documentation page.
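A quick way to check which timer a machine is actually using (and which ones the kernel considers usable) is through sysfs; these paths should be present on any reasonably modern kernel:

# clocksource currently in use (tsc, hpet, acpi_pm, ...)
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# clocksources the kernel is willing to use
cat /sys/devices/system/clocksource/clocksource0/available_clocksource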

In my case, running kernel 4.19.0-XX-amd64, I was occasionally seeing this error in the logs:

kernel: [517300.909751] clocksource: timekeeping watchdog on CPU15: hpet retried 2 times before success
This means the watchdog's read-back of hpet (the watchdog clocksource that tsc gets compared against) took longer than allowed, and had to be retried 2 times before a clean reading was obtained. It escalates further if a reading needs more attempts than allowed (3 by default, or whatever the clocksource.max_cswd_read_retries kernel parameter is set to). The maximum delay allowed between the two watchdog readouts is defined in the kernel by WATCHDOG_MAX_SKEW:
(extract of kernel 4.19 source kernel/time/clocksource.c)
/*
 * Interval: 0.5sec Threshold: 0.0625s
 */
#define WATCHDOG_INTERVAL (HZ >> 1)
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)

/*
 * Maximum permissible delay between two readouts of the watchdog
 * clocksource surrounding a read of the clocksource being validated.
 * This delay could be due to SMIs, NMIs, or to VCPU preemptions.
 */
#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)
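If your kernel carries the retry logic (it was backported to some stable series, which is why a 4.19 kernel prints these messages at all), the retry limit is exposed as a module parameter and can also be set on the kernel command line. A quick way to check it, assuming the parameter exists on your build:

# current retry limit before the clocksource is marked unstable
cat /sys/module/clocksource/parameters/max_cswd_read_retries
# or raise it at boot time, e.g. clocksource.max_cswd_read_retries=5 on the kernel command line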
If the read-back delay keeps exceeding the limit past the maximum number of attempts, you may see something like this:
kernel: [4027805.681972] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 113166ns, attempt 4, marking unstable
The clocksource will then switch to hpet, bringing down the performance of the server. In this case the read-back delay was about 13 microseconds over the 100 microsecond limit.
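When this happens you can confirm the switch in the logs and in sysfs. Note that, as far as I can tell, once tsc is marked unstable it drops off the available list, so getting it back needs a reboot:

# the watchdog complaints and the "Switched to clocksource hpet" line
dmesg | grep -i clocksource
# hpet will now show up as the current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource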

I couldn't find the reason why this random event would happen, even on an idle server. I tried changing CPU isolation settings, idle states, CPU frequency governor settings, etc. Moreover, being an unpredictable event, it was difficult to tell whether the new settings had any real effect on the problem.
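For reference, these are the kind of knobs I mean; the values below are purely illustrative, not a recommendation:

# example boot parameters (GRUB_CMDLINE_LINUX), illustrative values only:
#   isolcpus=2-15 nohz_full=2-15 intel_idle.max_cstate=1 processor.max_cstate=1
# pin the CPU frequency governor to performance on all cores
sudo cpupower frequency-set -g performance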

Eventually I was able to manually reproduce this with the following stress-ng command:

sudo taskset -c 0-15 stress-ng --timeout 180 --times --verify --metrics-brief --ioport 32 --sysinfo 32 --aggressive --schedpolicy 40 --cpu-load-slice 100

This was on a server with a single 16-core processor. It would fall back to HPET within a minute, but the command has to be pinned to all available cores, otherwise the clocksource won't flip.
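While stress-ng is running, a loop in another terminal makes the flip easy to catch; any way of watching the logs works, this is just what I find convenient (run it as root if dmesg is restricted on your distro):

# watch the current clocksource and the latest watchdog messages
watch -n1 'cat /sys/devices/system/clocksource/clocksource0/current_clocksource; dmesg | grep -i "timekeeping watchdog" | tail -n3'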

Eventually the solution was upgrading to kernel 5.10. It has new kernel options to interact with the clock watchdog, like clocksource.verify_n_cpus (more on those after the source extract). However, just updating the kernel made the issue go away for me. The source looks fairly similar on the new kernel:
(extract of kernel 5.10 source kernel/time/clocksource.c)
/*
 * Threshold: 0.0312s, when doubled: 0.0625s.
 * Also a default for cs->uncertainty_margin when registering clocks.
 */
#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 5)

/*
 * Maximum permissible delay between two readouts of the watchdog
 * clocksource surrounding a read of the clocksource being validated.
 * This delay could be due to SMIs, NMIs, or to VCPU preemptions.  Used as
 * a lower bound for cs->uncertainty_margin values when registering clocks.
 */
#define WATCHDOG_MAX_SKEW (100 * NSEC_PER_USEC)

The threshold constant is actually halved rather than increased (though, as the comment notes, it is doubled in use, so the effective 0.0625s stays the same), and the max skew is still 100 microseconds. In kernel 6 we can see this value increased to 125 microseconds. An alternative, if the issue ever came back, could be building a custom kernel with the max skew raised to something more accommodating of what my servers are reporting, although that would complicate kernel upgrades later on and might have an impact on overall performance.
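Before resorting to a custom kernel, the newer watchdog options can be tried from the boot command line. This is only a sketch of what I would try first; I haven't needed it since the upgrade, and whether each option exists depends on the exact kernel build:

# /etc/default/grub (then update-grub and reboot); pick the options your kernel supports:
GRUB_CMDLINE_LINUX="... clocksource.max_cswd_read_retries=5 clocksource.verify_n_cpus=8"
# or, if the TSC on the hardware is trusted, skip runtime verification for it entirely:
# GRUB_CMDLINE_LINUX="... tsc=reliable"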

In the end, besides making my clocksource issue go away, I'm also observing more user time in the CPU usage breakdown, and less I/O wait. A good reminder of the benefits of trying new kernels.
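For the record, I'm just eyeballing this with the usual sysstat tooling, nothing fancy:

# per-CPU breakdown of %usr, %sys, %iowait, refreshed every second
mpstat -P ALL 1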