Xen project Mailing List

Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

To: <Philippe.Simonet@xxxxxxxxxxxx>,"Keir Fraser" <keir@xxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Thu, 08 Nov 2012 09:39:52 +0000

Cc: 599161@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxx, mrsanna1@xxxxxxxxx, Ian Campbell <ijc@xxxxxxxxxxxxxx>

Delivery-date: Thu, 08 Nov 2012 09:40:26 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

>>> On 07.11.12 at 18:40, Keir Fraser <keir@xxxxxxx> wrote:
> On 07/11/2012 13:22, "Ian Campbell" <ijc@xxxxxxxxxxxxxx> wrote:
>
>>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
>>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
>>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
>>>> tsc_stamp=e3839fcb0273
>>>
>>> (below is the complete xm dmesg output)
>>>
>>> did that help you ? do you need more info ?
>>
>> I'll leave this to Keir (who wrote the debugging patch) to answer but it
>> looks to me like it should be useful!
>
> I'm scratching my head. plt_wrap is earlier than plt_now, which should be
> impossible. plt_stamp64 oddly has low 32 bits identical to new_stamp. That
> seems very very improbable!

Is it? My understanding was that plt_stamp64 is just a software
extension to the more narrow HW counter, and hence the low
plt_mask bits would always be expected to be identical.

The plt_wrap < plt_now thing of course is entirely unexplainable
to me too: Considering that plt_scale doesn't change at all
post-boot, apart from memory corruption I could only see a
memory access ordering problem to be the reason
(platform_timer_stamp and/or stime_platform_stamp changing
despite platform_timer_lock being held). So maybe taking a
snapshot of all three static values involved in the calculation in
__read_platform_stime() between acquiring the lock and the first
call to __read_platform_stime(), and printing them together with
the "live" values in a second printk() after the one your original
patch added could rule that out. But the box doesn't even seem
to be NUMA (of course it also doesn't help that the log level was
kept restricted - hint, hint, Philippe), nor does there appear to be
any S3 cycle or pCPU bring-up/-down in between...
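The relationship described above between plt_stamp64 and the narrower hardware counter can be sketched as follows. This is a minimal illustration under stated assumptions, not the Xen source: the function name `extend_counter` and its parameters are hypothetical, and it assumes the hardware counter wrapped less than once between two readings.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of extending a narrow hardware counter in software.
 *
 * stamp64: 64-bit software value before the update
 * old_raw: previous raw hardware reading
 * new_raw: current raw hardware reading
 * mask:    valid bits of the hardware counter (0xffffffff for 32 bits)
 *
 * The masked difference is the number of ticks elapsed, and it stays
 * correct across a single wrap of the hardware counter.
 */
static uint64_t extend_counter(uint64_t stamp64, uint64_t old_raw,
                               uint64_t new_raw, uint64_t mask)
{
    return stamp64 + ((new_raw - old_raw) & mask);
}
```

If the low `mask` bits of `stamp64` equal `old_raw` before the call, the low `mask` bits of the result equal `new_raw`. With old_stamp=35c7c and new_stamp=800366a5 from the log, and an assumed previous 64-bit value of 15b00035c7c (the log does not print it), the result is 15b800366a5, so identical low 32 bits are what such a scheme produces on every update.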

Philippe, could you clarify again what CPU model(s) this is being
observed on (the long times between individual steps forward with
this problem perhaps warrant repeating the basics each time, as
it's otherwise quite cumbersome to always look up old pieces of
information).

> I wonder whether the overflow handling should just be removed, or made
> conditional on a command-line parameter, or on the 32-bit platform counter
> being at least somewhat likely to overflow before a softirq occurs -- it
> seems lots of systems are using 14MHz HPET, and that gives us a couple of
> minutes for the plt_overflow softirq to do its work before overflow occurs.
> I think we would notice that outage in other ways. :)

Iirc we added this for a good reason - to cover the, however
unlikely, event of Xen running for very long without preemption.
Presumably most of the cases got fixed meanwhile, and indeed a
wraparound time on the order of minutes should make this
superfluous, but as the case here shows that code did spot a
severe anomaly (whatever that may turn out to be).

Also recall that there are HPET implementations around that
tick at a much higher frequency than 14MHz.

So unless we finally reach the understanding that the code is
flawed, I would rather want to keep it.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
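
For reference, the wraparound times discussed in the message above can be worked out with a short calculation. This is only a sketch: the function name `counter_wrap_seconds` is hypothetical, 14.31818 MHz is assumed as the nominal rate behind "14MHz", and 100 MHz is an assumed example of a faster HPET rather than a figure from the thread.

```c
#include <stdint.h>

/*
 * Seconds until a free-running counter of the given width wraps,
 * assuming it ticks at freq_hz.  Valid for widths of 1 to 63 bits.
 */
static double counter_wrap_seconds(unsigned int bits, double freq_hz)
{
    /* Number of distinct counter values, i.e. 2^bits. */
    double span = (double)((uint64_t)1 << bits);

    return span / freq_hz;
}
```

Under these assumptions, `counter_wrap_seconds(32, 14318180.0)` comes to just under 300 seconds (about five minutes), while `counter_wrap_seconds(32, 100000000.0)` comes to roughly 43 seconds, which shows how a faster counter shrinks the window in which the overflow softirq has to run.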
