Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.
At 09:39 +0000 on 08 Nov (1352367592), Jan Beulich wrote:
> The plt_wrap < plt_now thing of course is entirely unexplainable
> to me too: Considering that plt_scale doesn't change at all
> post-boot, apart from memory corruption I could only see a memory
> access ordering problem to be the reason (platform_timer_stamp
> and/or stime_platform_stamp changing despite platform_timer_lock
> being held). So maybe taking a snapshot of all three static values
> involved in the calculation in __read_platform_stime() between
> acquiring the lock and the first call to __read_platform_stime(),
> and printing them together with the "live" values in a second
> printk() after the one your original patch added could rule that
> out.
>
> But the box doesn't even seem to be NUMA (of course it also
> doesn't help that the log level was kept restricted - hint, hint,
> Philippe), nor does there appear to be any S3 cycle or pCPU
> bring-up/-down in between...

S3 looks like it might be a culprit, since resume_platform_timer()
clobbers plt_stamp64 without taking the platform_timer_lock. But both
the S3 resume code and the plt_overflow timer should only ever run on
CPU 0, so even that should be safe (unless continue_hypercall_on_cpu()
is broken...)

Definitely having loglvl=all would have helped here, to eliminate S3
from our enquiries.

> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform counter
> > being at least somewhat likely to overflow before a softirq occurs -- it
> > seems lots of systems are using 14MHz HPET, and that gives us a couple of
> > minutes for the plt_overflow softirq to do its work before overflow occurs.
> > I think we would notice that outage in other ways. :)
>
> Iirc we added this for a good reason - to cover the, however
> unlikely, event of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed
> a wraparound time on the order of minutes should make this
> superfluous, but as the case here shows that code did spot a
> severe anomaly (whatever that may turn out to be).

ISTR when this code went in we were dealing with a timer that had a
period of about 4 seconds (ACPI PMTIMER?). It might well be OTT for
the HPET, but if there's something weird going on I'd like to track
it down while we have some sort of a handle on it.

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
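
For a rough sense of the wraparound times mentioned in this thread, the
arithmetic can be sketched as below. This is not Xen code: wrap_seconds()
is a hypothetical helper, the 14.31818 MHz and 3.579545 MHz rates are the
nominal HPET and ACPI PM timer frequencies taken as assumptions, and the
24-bit PM timer width is also an assumption (some implementations are
32-bit).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Xen): seconds until a free-running
 * counter that is `bits` wide wraps when it ticks at `hz`.
 */
static double wrap_seconds(unsigned int bits, double hz)
{
    /* The shift below is only defined for widths under 64 bits. */
    assert(bits < 64);
    assert(hz > 0.0);

    /* A counter of width `bits` wraps after 2^bits ticks. */
    return (double)(UINT64_C(1) << bits) / hz;
}

/* Assumed nominal rates; real hardware reports its own frequency. */
static double hpet32_wrap_seconds(void)
{
    return wrap_seconds(32, 14318180.0);   /* 32-bit counter, ~14.3 MHz */
}

static double pmtimer24_wrap_seconds(void)
{
    return wrap_seconds(24, 3579545.0);    /* 24-bit counter, ~3.58 MHz */
}
```

Under these assumptions the 32-bit counter wraps after roughly 300
seconds, and the 24-bit counter after between 4 and 5 seconds, which is
the gap between "minutes" and "about 4 seconds" discussed above.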