Re: [Xen-devel] #599161: Xen debug patch for the "clock shifts by 50 minutes" bug.

At 09:39 +0000 on 08 Nov (1352367592), Jan Beulich wrote:
> The plt_wrap < plt_now thing of course is entirely unexplainable
> to me too: Considering that plt_scale doesn't change at all post-
> boot, apart from memory corruption I could only see a memory
> access ordering problem to be the reason (platform_timer_stamp
> and/or stime_platform_stamp changing despite platform_timer_lock
> being held). So maybe taking a snapshot of all three static values
> involved in the calculation in __read_platform_stime() between
> acquiring the lock and the first call to __read_platform_stime(),
> and printing them together with the "live" values in a second
> printk() after the one your original patch added could rule that
> out.
> But the box doesn't even seem to be NUMA (of course it also
> doesn't help that the log level was kept restricted - hint, hint,
> Philippe), nor does there appear to be any S3 cycle or pCPU
> bring-up/-down in between...

S3 looks like it might be a culprit, since resume_platform_timer()
clobbers plt_stamp64 without taking the platform_timer_lock.  But both
the S3 resume code and the plt_overflow timer should only ever run on
CPU 0, so even that should be safe (unless continue_hypercall_on_cpu()
is broken...)
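
For anyone following along without the source open, here is a minimal
stand-alone sketch of the kind of bookkeeping being discussed: a 64-bit
stamp accumulated from a narrower, wrapping hardware counter.  This is
not the Xen code; the names plt_model, plt_model_init and
plt_model_update are hypothetical, and only the general masked-delta
technique is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a platform timer whose hardware counter is
 * narrower than 64 bits and therefore wraps.  Not taken from Xen. */
struct plt_model {
    uint64_t mask;     /* (1 << counter_bits) - 1 */
    uint64_t stamp;    /* last raw counter value seen */
    uint64_t stamp64;  /* accumulated 64-bit tick count */
};

static void plt_model_init(struct plt_model *p, unsigned int bits,
                           uint64_t count)
{
    assert(bits >= 1 && bits <= 64);
    p->mask = (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
    p->stamp = count & p->mask;
    p->stamp64 = 0;
}

/* Fold a fresh raw reading into the 64-bit stamp.  The result is only
 * right if fewer than 2^bits ticks have elapsed since the previous
 * call, i.e. the counter has wrapped at most once. */
static uint64_t plt_model_update(struct plt_model *p, uint64_t count)
{
    count &= p->mask;
    p->stamp64 += (count - p->stamp) & p->mask;
    p->stamp = count;
    return p->stamp64;
}
```

In this model, a reading that appears to go backwards by a single tick
is indistinguishable from almost a full wrap, so anything that modifies
stamp or stamp64 behind the updater's back can show up as a large jump.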

Definitely having loglvl=all would have helped here, to eliminate S3
from our enquiries.

> > I wonder whether the overflow handling should just be removed, or made
> > conditional on a command-line parameter, or on the 32-bit platform counter
> > being at least somewhat likely to overflow before a softirq occurs -- it
> > seems lots of systems are using 14MHz HPET, and that gives us a couple of
> > minutes for the plt_overflow softirq to do its work before overflow occurs.
> > I think we would notice that outage in other ways. :)
> Iirc we added this for a good reason - to cover the, however
> unlikely, event of Xen running for very long without preemption.
> Presumably most of the cases got fixed meanwhile, and indeed
> a wraparound time on the order of minutes should make this
> superfluous, but as the case here shows that code did spot a
> severe anomaly (whatever that may turn out to be).

ISTR when this code went in we were dealing with a timer that had a
period of about 4 seconds (ACPI PMTIMER?).  It might well be OTT for the
HPET, but if there's something weird going on I'd like to track it down
while we have some sort of a handle on it.
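
As a rough check on the figures above, the wrap period of a
free-running counter is 2^bits / frequency.  The sketch below assumes
the usual nominal values, which are not stated in this thread: a 32-bit
HPET at 14.318180 MHz and a 24-bit ACPI PM timer at 3.579545 MHz.  The
helper name wrap_period_seconds is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a free-running counter of the given width wraps.
 * Illustrative helper only; bits and freq_hz are supplied by the
 * caller and are assumptions, not values read from hardware. */
static double wrap_period_seconds(unsigned int bits, double freq_hz)
{
    assert(bits >= 1 && bits <= 63);
    assert(freq_hz > 0.0);
    return (double)(UINT64_C(1) << bits) / freq_hz;
}
```

With those assumed values, wrap_period_seconds(32, 14318180.0) comes to
just under 300 seconds and wrap_period_seconds(24, 3579545.0) to about
4.7 seconds, which matches "a couple of minutes" and "about 4 seconds"
to the order of magnitude.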

