Yesterday’s leap second killed half the Internet, including Pirate Bay, Reddit, LinkedIn, Gawker Media and a host of other sites. Even an airline.
Any Linux user process that depends on kernel threads had
a high chance of failing. That includes MySQL and many Java servers like
webapps, Hadoop, Cassandra, etc. The symptom was the user process spinning
at 100% CPU even after being restarted. A quick fix seems to be setting
the system clock, which apparently resets the bad state in the kernel
(we hope).
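For what it's worth, "setting the system clock" here can mean nothing more than writing the current time back to the clock. Below is a minimal Python sketch of that idea. The function name `reset_wall_clock` and its injectable `set_clock` and `now` parameters are my own invention for illustration. It assumes that any write to the realtime clock is enough to clear the bad state, which is only what the workaround appears to rely on, not something I've verified.

```python
import time


def reset_wall_clock(set_clock=time.clock_settime, now=time.time):
    """Write the current wall-clock time back to CLOCK_REALTIME.

    Hypothetical helper: the assumption is that the act of setting the
    clock, not the value written, is what clears the stuck state.
    Returns the timestamp that was written.
    """
    current = now()
    # Setting CLOCK_REALTIME requires root (CAP_SYS_TIME) on Linux.
    set_clock(time.CLOCK_REALTIME, current)
    return current
```

Run as root, a bare `reset_wall_clock()` call does the real thing. The injectable arguments are only there so the logic can be exercised without touching a live clock.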
The underlying cause seems to be that the way the kernel handled the extra second broke the futex locks used by threaded processes. Here’s a very detailed analysis of the failing code, but I’m not sure it’s correct. According to this analysis the bug was introduced in 2008, then fixed in March 2012. But it may be that the March fix is part of the problem. OTOH most of the systems that failed will be running kernels older than March, so the problem must go further back. There's a kernel fix and also a detailed analysis.

Time is hard, let’s go shopping. It’s frustrating that these bugs keep popping up; the theory is not so difficult. The NTP daemon tells the kernel a leap second is coming via adjtimex(), the kernel handles it by slewing or holding the clock, and all is well. But it didn’t work in 2012. Didn’t work in 2009 either; a logging bug caused kernels to crash on the leap second. 2005 was better. Google’s solution of giving up on the kernel entirely and having the NTP daemon lie about what time it is seems more clever now.

I got hit by this bug myself: the CrashPlan backup daemon runs Java and got caught in a spin. And none of my machines really kept time right, because POSIX does not account for leap seconds. Both Ubuntu boxes just ran 23:59:59 twice, so time went backwards on a subsecond basis. My Mac was even worse: it actually flipped over to 00:00:00 before going backwards to 23:59:59 briefly.
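To see why a POSIX clock has nowhere to put the extra second, here is a small Python sketch. The helper name `posix_seconds` is mine. It relies on `calendar.timegm`, which does plain calendar arithmetic with every day counted as exactly 86400 seconds, the same assumption POSIX time makes.

```python
import calendar


def posix_seconds(year, month, day, hour, minute, second):
    """Convert a UTC calendar time to POSIX seconds since the epoch."""
    return calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))


before_leap = posix_seconds(2012, 6, 30, 23, 59, 59)
leap_second = posix_seconds(2012, 6, 30, 23, 59, 60)  # the inserted second
after_leap = posix_seconds(2012, 7, 1, 0, 0, 0)

# 23:59:60 and the following 00:00:00 collapse onto the same timestamp,
# so a clock counting POSIX seconds has to repeat or stretch a second.
print(before_leap, leap_second, after_leap)
```

The leap second and the midnight after it get the same number, which is why a machine ends up either replaying 23:59:59 or stepping backwards after reaching 00:00:00.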