[Xen-devel] Massive Instant Clock Jump & Freeze domU Issue (NOT Related to Drift, Live Migration or Saving/Restoring)

Hello Xen Developers! After researching this thoroughly ourselves and talking to many Xen consultants, we were advised to ask here about a rare Xen bug we may be experiencing. Any help or advice would be much appreciated; thanks in advance! We're also open to offering some financial support to solve this problem.
Here is a summary of the problem:
-very infrequently the domU clock is instantly jumping ahead a massive amount of time and then appearing to lock on the new time (i.e. time stops)
-this has only happened 3 times since Jan. 2013 for us on two different physical rack mounted machines that are still running today with very similar parts and configuration
-the clock jumped ahead to the year 2264 in the first two occurrences and only 3 days ahead in the third

Here are more specific details:
-when the incident occurred there was no heavy load, no high temperatures, no hardware/memory/EDAC errors, no swapping, no errors reported anywhere
-the dom0 and other concurrently running domUs had no clock issues, the hardware BIOS clock remained OK as well
-the clock did not slowly skew/drift ahead nor have we ever had any skewing/drifting clock problems, it appears to have simply jumped to the new date and stopped
-the hardware has only had CentOS dom0s and domUs (PV) running for multiple years without incident, domUs have slowly been added with time
-we have many additional nearly identical production servers with multiple domUs on each with very similar setups (same motherboard, CPU, RAM, OS, etc.) that have had no clock issues yet
-the jumped domU clock can be corrected by running a "date -s" command with any value which then syncs the domU clock back up with the dom0
-we don't use live migration, no saving/restoring and no maintenance was taking place anywhere near or during time of jumps
-for all dom0s & domUs: independent_wallclock=0, ntpd is running, clocksource=jiffies, Xen version 3.1.2
-incident #1: ~Sun Jan 13 13:31:01 CST 2013 to Sun Mar  6 04:39:20 CST 2264 | dom0=Centos 5.8, Linux 2.6.18-308.20.1.el5xen | domU=Centos 5.5, Linux 2.6.18-194.8.1.el5xen
-incident #2: ~Thu Mar 28 11:54:22 CDT 2013 to Thu May 19 07:32:28 CST 2264 | dom0=Centos 5.5, Linux 2.6.18-194.11.3.el5xen | domU=Centos 5.8, Linux 2.6.18-308.24.1.el5xen
-incident #3: ~Sun Mar 31 10:42:14 CDT 2013 to Wed Apr  3 14:28:31 CDT 2013 | dom versions same as #2
-dom0 specs: TYAN S5397 w/ latest BIOS v1.07, guest count=6/3, DDR ECC RAM=48/64GB, 2 x Xeon E5420, LSI/Adaptec RAID, ~4 years old
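
For reference, the size of each jump can be worked out from the approximate timestamps listed above. The sketch below is only arithmetic on those reported values. It assumes CST is UTC-6 and CDT is UTC-5, and the names `INCIDENTS` and `jump_size` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for the zone abbreviations in the incident list.
CST = timezone(timedelta(hours=-6), "CST")
CDT = timezone(timedelta(hours=-5), "CDT")

# (label, approximate time just before the jump, time the clock landed on)
INCIDENTS = [
    ("#1", datetime(2013, 1, 13, 13, 31, 1, tzinfo=CST),
           datetime(2264, 3, 6, 4, 39, 20, tzinfo=CST)),
    ("#2", datetime(2013, 3, 28, 11, 54, 22, tzinfo=CDT),
           datetime(2264, 5, 19, 7, 32, 28, tzinfo=CST)),
    ("#3", datetime(2013, 3, 31, 10, 42, 14, tzinfo=CDT),
           datetime(2013, 4, 3, 14, 28, 31, tzinfo=CDT)),
]


def jump_size(before, after):
    """Return the forward jump as a timedelta (timezone-aware subtraction)."""
    return after - before


for label, before, after in INCIDENTS:
    delta = jump_size(before, after)
    print(f"incident {label}: {delta.days} days + {delta.seconds} s "
          f"({delta.total_seconds():.0f} s total)")
```

On these inputs the first two jumps both come out a little over 91,727 days (about 251 years) and differ from each other by only a few hours, while the third is just over 3 days. The "before" times are approximate, so the totals are approximate too.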

We already do or have now done the following:
-full monitoring/logging for memory, disk, RAID, CPU, temperature, clock, log watch etc. (nothing bad to report)
-enabled XEND and XENSTORED debugging (since last failure to provide more info for potential future jumps)
-ran MemTest for hours under increased heat conditions, plus a minor "stresstest" run; no errors reported, and fsck passed as well
-visual inspection of the hardware (no corrosion, matched CPUs, identical properly slotted RAM, etc.)
-full dom0 & domU updates to CentOS 5.9, disabled ntpd on domU, kept domU independent_wallclock=0
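
As an illustration of the kind of clock monitoring that could catch a jump as it happens, here is a minimal sketch that compares how far the wall clock moved against how far a monotonic clock moved over the same interval. This is not our actual monitoring. The function names and thresholds are hypothetical, it needs Python 3.3+ for `time.monotonic()`, and whether the monotonic clock inside a PV domU stays sane during such a jump is an open assumption.

```python
import time


def clock_discrepancy(wall_prev, wall_now, mono_prev, mono_now):
    """Seconds the wall clock moved beyond what the monotonic clock moved.

    Large positive: wall clock jumped ahead. Large negative: wall clock
    stalled or went backwards while monotonic time kept advancing.
    """
    return (wall_now - wall_prev) - (mono_now - mono_prev)


def watch(interval=1.0, threshold=5.0, samples=3):
    """Sample both clocks and return a list of detected discrepancies.

    Each entry is (wall_before, wall_after, discrepancy_seconds).
    """
    wall_prev, mono_prev = time.time(), time.monotonic()
    jumps = []
    for _ in range(samples):
        time.sleep(interval)
        wall_now, mono_now = time.time(), time.monotonic()
        diff = clock_discrepancy(wall_prev, wall_now, mono_prev, mono_now)
        if abs(diff) > threshold:
            jumps.append((wall_prev, wall_now, diff))
        wall_prev, mono_prev = wall_now, mono_now
    return jumps
```

Calling `watch(interval=60, threshold=5, samples=1440)` would, under these assumptions, check once a minute for a day and return any interval in which the two clocks disagreed by more than 5 seconds. Checking the absolute value means a frozen clock is flagged as well as a forward jump.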

We have found no references to the same jump & stop clock issue on a domU given our circumstances. From other clock issue discussions, it appears that our root issue is probably with the jump itself and the clock stopping behavior is probably just the domU waiting for the dom0 time to catch up.
We initially thought, and were advised, that bad hardware could be to blame. That may not be true, given that the exact same issue surfaced on very similar but separate hardware and that the dom0 and other resident domUs were totally unaffected clock-wise.
With independent_wallclock=0 everywhere (i.e. dependent), we know NTP does not need to be running in the domU, because the domU gets its clock from the dom0. We run NTP in the domU anyway to aid our monitoring of the domU clock; it should not matter, because nothing on the domU can set the clock when independent_wallclock=0.
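
For anyone wanting to compare settings across guests, the following is a small sketch of how the two values mentioned above could be read from inside a dom0 or domU. The paths are assumptions: `/proc/sys/xen/independent_wallclock` is where the sysctl is usually exposed on Xen PV kernels, and the sysfs clocksource path may not exist on every kernel. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed locations; adjust if your kernel exposes them elsewhere.
WALLCLOCK_PATH = "/proc/sys/xen/independent_wallclock"
CLOCKSOURCE_PATH = (
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def read_setting(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def clock_settings(wallclock_path=WALLCLOCK_PATH,
                   clocksource_path=CLOCKSOURCE_PATH):
    """Collect the wallclock mode and current clocksource into a dict."""
    return {
        "independent_wallclock": read_setting(wallclock_path),
        "clocksource": read_setting(clocksource_path),
    }
```

On a guest configured as described, `clock_settings()` would be expected to report `"0"` and `"jiffies"`; a value of `None` just means the file was not readable at the assumed path.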