Re: [Xen-devel] [PATCH] x86/watchdog: Use real timestamps for watchdog timeout
At 12:36 +0100 on 24 May (1369398982), Jan Beulich wrote:
> >>> On 24.05.13 at 12:13, Tim Deegan <tim@xxxxxxx> wrote:
> >> > Handling this case is nice, but I wonder whether this patch ought to
> >> > detect and report ludicrous NMI rates rather than silently ignoring
> >> > them.  I guess that's hard to do in an NMI handler, other than by
> >> > adjusting the printk when we crash.
> >
> > Actually on second thoughts it's easier: as well as having this patch
> > (or near equivalent) to avoid premature watchdog expiry, we can detect
> > the NMI rate in, say, the timer softirq and report if it's gone mad.
>
> I may not get how you imagine this to work, but why would you
> assume the timer softirq gets any chance to run anymore with an
> absurd NMI rate (or at least get run enough times to prevent the
> watchdog from firing)?

In that case the watchdog will fire; we can add something to the
watchdog printk if the NMI count is much higher than (timeout * nmi_hz).

I was worrying about the case where there are lots of NMIs but not so
many that the system livelocks entirely.  It would be nice to have some
non-fatal warning that something is wrong.  A one-time (or at least very
limited) printk in the softirq would do that for not much extra cost.

At 12:42 +0100 on 24 May (1369399344), Jan Beulich wrote:
> >>> On 24.05.13 at 12:33, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> > On 24/05/13 11:13, Tim Deegan wrote:
> >> At 10:57 +0100 on 24 May (1369393060), Andrew Cooper wrote:
> >>> On 24/05/13 08:09, Jan Beulich wrote:
> >>>> You can't use NOW() here - while the time updating code is safe
> >>>> against normal interrupts, it's not atomic wrt NMIs.
> >>>
> >>> But NMIs are latched at the hardware level.  If we get a nested NMI
> >>> then Xen will be toast on the exit path anyway.
> >>
> >> The problem is that an NMI can arrive while local_time_calibration()
> >> is writing its results, so calling NOW() in the NMI handler might
> >> return garbage.
> >
> > Aah - I see.
> > Sorry - I misunderstood the original point.
> >
> > Yes - that is an issue.
> >
> > Two solutions come to mind:
> >
> > 1) Along with the local_irq_disable()/enable() pairs in
> > local_time_calibration(), have an atomic_t indicating "time data
> > update in progress", allowing the NMI handler to decide to bail early.
> >
> > 2) Modify local_time_calibration() to fill in a shadow cpu_time set,
> > and a different atomic_t to indicate which one is consistent.  This
> > would allow the NMI handler to always use one consistent set of
> > timing information.

Of those two, I prefer (1), just because it doesn't add any cost to the
normal users of NOW().

Using TSC to gate the actual watchdog crash might get a bit messy,
especially if it ends up adding code to the users of write_tsc().  But
all we need is to double-check the existing count -- I think it would be
enough to say "the watchdog fired but it looks like it was caused by an
NMI storm" and crash anyway.  (At least assuming timeout * nmi_hz is a
largish number.)  And nmi_watchdog_tick() can just check regs->eip as a
hint not to trust the scale factors. :)

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel