Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
Hi Mauro, that's a question for you:

> Philippe, could you clarify again what CPU model(s) this is being observed
> on (the long times between individual steps forward with this problem
> perhaps warrant repeating the basics each time, as it's otherwise quite
> cumbersome to always look up old pieces of information).

Can you provide this information?

    cat /proc/cpuinfo
    cat /proc/meminfo
    hardware information (manufacturer, model, URLs, ...)

Thanks,
Philippe

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Sent: Thursday, November 08, 2012 10:40 AM
> To: Simonet Philippe, ITS-OUS-OP-IFM-NW-IPE; Keir Fraser
> Cc: 599161@xxxxxxxxxxxxxxx; mrsanna1@xxxxxxxxx; Ian Campbell;
> xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts
> by 50 minutes" bug.
>
> >>> On 07.11.12 at 18:40, Keir Fraser <keir@xxxxxxx> wrote:
> > On 07/11/2012 13:22, "Ian Campbell" <ijc@xxxxxxxxxxxxxx> wrote:
> >
> >>>> (XEN) XXX plt_overflow: plt_now=5ece12d34128 plt_wrap=5ece12d09306
> >>>> now=5ece12d16292 old_stamp=35c7c new_stamp=800366a5
> >>>> plt_stamp64=15b800366a5 plt_mask=ffffffff tsc=e3839fd23854
> >>>> tsc_stamp=e3839fcb0273
> >>>
> >>> (below is the complete xm dmesg output)
> >>>
> >>> Did that help you? Do you need more info?
> >>
> >> I'll leave this to Keir (who wrote the debugging patch) to answer, but
> >> it looks to me like it should be useful!
> >
> > I'm scratching my head. plt_wrap is earlier than plt_now, which should
> > be impossible. plt_stamp64 oddly has its low 32 bits identical to
> > new_stamp. That seems very, very improbable!
>
> Is it? My understanding was that plt_stamp64 is just a software extension
> of the narrower HW counter, and hence the low plt_mask bits would always
> be expected to be identical.
>
> The plt_wrap < plt_now thing is of course entirely unexplainable to me
> too: considering that plt_scale doesn't change at all post-boot, apart
> from memory corruption I could only see a memory access ordering problem
> as the cause (platform_timer_stamp and/or stime_platform_stamp changing
> despite platform_timer_lock being held). So maybe taking a snapshot of
> all three static values involved in the calculation in
> __read_platform_stime() between acquiring the lock and the first call to
> __read_platform_stime(), and printing them together with the "live"
> values in a second printk() after the one your original patch added,
> could rule that out.
>
> But the box doesn't even seem to be NUMA (of course it also doesn't help
> that the log level was kept restricted - hint, hint, Philippe), nor does
> there appear to be any S3 cycle or pCPU bring-up/-down in between...
>
> Philippe, could you clarify again what CPU model(s) this is being
> observed on (the long times between individual steps forward with this
> problem perhaps warrant repeating the basics each time, as it's otherwise
> quite cumbersome to always look up old pieces of information).
>
> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform
> > counter being at least somewhat likely to overflow before a softirq
> > occurs -- it seems lots of systems are using a 14MHz HPET, and that
> > gives us a couple of minutes for the plt_overflow softirq to do its
> > work before overflow occurs.
>
> I think we would notice that outage in other ways. :)
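To make the "software extension" point concrete: widening a narrow hardware counter in software is conventionally done by accumulating masked deltas. The sketch below is illustrative only -- the names mirror the thread's plt_* variables, but this is not Xen's actual time.c code; the demo values are taken from the log excerpt above.

    #include <stdio.h>
    #include <stdint.h>

    static uint64_t plt_mask = 0xffffffffULL; /* HW counter is 32 bits wide */
    static uint64_t plt_stamp64;              /* software-widened counter   */
    static uint64_t plt_stamp;                /* raw HW value at last fold  */

    /* Fold a fresh HW reading into the widened counter.  This must run at
     * least once per HW wrap period or elapsed ticks are lost -- the job
     * of the plt_overflow softirq discussed in this thread. */
    static void fold_stamp(uint64_t hw_count)
    {
        plt_stamp64 += (hw_count - plt_stamp) & plt_mask;
        plt_stamp = hw_count;
    }

    int main(void)
    {
        plt_stamp = 0x35c7cULL;      /* old_stamp from the log        */
        plt_stamp64 = plt_stamp;     /* low 32 bits identical at init */
        fold_stamp(0x800366a5ULL);   /* new_stamp from the log        */
        /* Low 32 bits of plt_stamp64 now equal the raw reading.      */
        printf("plt_stamp64 = %llx\n", (unsigned long long)plt_stamp64);
        return 0;
    }

Since each fold adds exactly the raw delta, low bits that start out identical stay identical forever -- which is why new_stamp=800366a5 matching the low 32 bits of plt_stamp64=15b800366a5 is expected, as Jan says, rather than improbable. On the wrap margin: a 32-bit counter at the typical ~14.318 MHz HPET rate (the thread's "14MHz") wraps every 2^32 / 14318180 ≈ 300 s, i.e. about five minutes, consistent with Keir's estimate.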
> Iirc we added this for a good reason - to cover the, however unlikely,
> event of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed a
> wraparound time on the order of minutes should make this superfluous,
> but as the case here shows, that code did spot a severe anomaly
> (whatever that may turn out to be).
>
> Also recall that there are HPET implementations around that tick at a
> much higher frequency than 14MHz.
>
> So unless we finally reach the understanding that the code is flawed, I
> would rather keep it.
>
> Jan
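A footnote on Jan's closing point about faster HPETs: the wrap period of the 32-bit counter scales inversely with the tick rate, so the margin available to the plt_overflow softirq shrinks accordingly. A quick illustrative calculation (the 25 MHz and 100 MHz rates are examples, not figures from the thread):

    #include <stdio.h>

    int main(void)
    {
        const double freqs[] = { 14.318e6, 25e6, 100e6 }; /* example tick rates */
        for (int i = 0; i < 3; i++)
            printf("%7.3f MHz -> 32-bit counter wraps every %6.1f s\n",
                   freqs[i] / 1e6, 4294967296.0 / freqs[i]);
        return 0;
    }

At 14.318 MHz the wrap takes roughly 300 s; at 100 MHz it drops to about 43 s, which is why removing the overflow handling outright would be risky on such hardware.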