[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [Patch] x86/HVM: Fix RTC interrupt modelling

On 10/02/14 17:46, George Dunlap wrote:
> On 02/10/2014 04:34 PM, Jan Beulich wrote:
>>>    * The results from XenRT suggest that the new emulation is better
>>> than the
>>>      old.
>> "Better" in the sense of the limited set of uses of the virtual hardware
>> by whatever selection of guest OSes is being run there. But very
>> likely not "better" in the sense on matching up with how the respective
>> hardware specification would require it to behave.
> The context of the above sentence was in a justification for including
> it in 4.4.  Obviously "occasionally gets stuck during boot" is a
> pretty bad bug that we'd like to see fix.  But given the tricky nature
> of this whole area, there's a risk that this will cause regressions in
> *other* situations or operating systems.  What I understand Andy to be
> saying is that with the patch, the RTC appears to cause less problems
> than without it.

That was indeed the message I was trying to convey.

> What your analysis is missing, Andy, is what the effects might be if
> there were a bug.  Obviously other guests might hang during boot; but
> what else?  Might they hang at some point much later, perhaps when
> being pounded with interrupts due to heavy network traffic? Might the
> clock begin to drift or jump around?  Would the XenRT testing catch
> that if it happened?  And, would those potential bugs be worse than
> what we have now?

This is much more complicated to answer.  Lets try.

The patch only touches RTC and Periodic Timer code.  All of the Periodic
Timer code is reversions back to the previous code (mid Xen-4.3 and
before), along with an attempted justification as to why the old code
was more correct than the current code.

Some of the RTC changes are reversions, but there is also new logic. 
All new logic is to do with how to update REG_B and REG_C correctly. 
None of the other functionality is touched.

REG_B and REG_C are to do with interrupts, and which events
should(B)/have(C) generated interrupts.  The worst case is that a guest
gets none/too-few/too-many interrupts when trying to drive the RTC. 
None of this should lead to clock skew, as reading the time values
directly will still provide the same information as before, although any
guest which attempts to guess time based on counting periodic interrupts
from the RTC is a) already broken and b) already having massive skew as
a VM due to vcpu scheduling.

XenRT does have tests for clock drift, but don't know for certain
whether they have been run against the new code yet.

I will ensure they get run on v2 of the patch.


> There's a reason for trying to go through the whole exercise,
> particularly in bugs like this.  I do have a lot of faith in our
> intuition to consider hundreds of individual factors and come up with
> a reasonably good judgement of the probabilities -- but only if it is
> guided properly.  We are all very prone to only consider the things we
> happen to be thinking about, and to completely ignore all the things
> we don't happen to be looking at.  My own temptation, looking at this
> bug, is to say, "Random hangs during boot, yeah, that's pretty bad; we
> should take it."  But then I'm only looking at the positives of the
> patch: I'm not really making a balance of the positives versus the
> negatives.
> The goal of going through the "worst-case-scenario" exercise is to
> bring to our minds the potential outcomes we are prone to ignore. Only
> then can we reasonably trust our intuition to make a properly informed
> judgement.
>  -George

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.