[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled
On 21.11.2022 13:23, Andrew Cooper wrote: > On 21/11/2022 08:56, Jan Beulich wrote: >> On 18.11.2022 15:27, Andrew Cooper wrote: >>> On 18/11/2022 12:54, Jan Beulich wrote: >>>> On 18.11.2022 13:33, Andrew Cooper wrote: >>>>> On 18/11/2022 10:31, Jan Beulich wrote: >>>>>> Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has >>>>>> exposed a problem with the marking of the respective vector as >>>>>> pending: For quite some time Linux has been checking whether any stale >>>>>> ISR or IRR bits would still be set while preparing the LAPIC for use. >>>>>> This check is now triggering on the upcall vector, as the registration, >>>>>> at least for APs, happens before the LAPIC is actually enabled. >>>>>> >>>>>> In software-disabled state an LAPIC would not accept any interrupt >>>>>> requests and hence no IRR bit would newly become set while in this >>>>>> state. As a result it is also wrong for us to mark the upcall vector as >>>>>> having a pending request when the vLAPIC is in this state. >>>>> I agree with this. >>>>> >>>>>> To compensate for the "enabled" check added to the assertion logic, add >>>>>> logic to (conditionally) mark the upcall vector as having a request >>>>>> pending at the time the LAPIC is being software-enabled by the guest. >>>>> But this, I don't think is appropriate. >>>>> >>>>> The point of raising on enable is allegedly to work around setup race >>>>> conditions. I'm unconvinced by this reasoning, but it is what it is, >>>>> and the stated behaviour is to raise there and then. >>>>> >>>>> If a guest enables evtchn while the LAPIC is disabled, then the >>>>> interrupt is lost. Like every other interrupt in an x86 system. >>>> Edge triggered ones you mean, I suppose, but yes. >>> For IO-APIC systems, you mostly lose line interrupts too, don't you? >>> >>> The line will remain pending at the IO-APIC, but nothing in the system >>> will unwedge until someone polls the IO-APIC. >>> >>> Either way... >>> >>>>> I don't think there is any credible way a guest kernel author can expect >>>>> the weird evtchn edgecase to wait for an arbitrary point in the future, >>>>> and it's a corner case that I think is worth not keeping. >>>> Well - did you look at 7b5b8ca7dffd ("x86/upcall: inject a spurious event >>>> after setting upcall vector"), referenced by the Fixes: tag? The issue is >>>> that with evtchn_upcall_pending once set, there would never again be a >>>> notification. >>> Ok, so we do need to do something. >>> >>>> So if what you say is to be the model we follow, then that >>>> earlier change was perhaps wrong as well. Instead it should then have >>>> been a guest change (as also implicit from your reply) to clear >>>> evtchn_upcall_pending after vCPU info registration (there) or LAPIC >>>> enabling (here), perhaps by way of "manually" invoking the handling of >>>> that pending event, or by issuing a self-IPI with that vector. >>>> Especially the LAPIC enabling case would then be yet another Xen-specific >>>> on a guest code path which better wouldn't have to be aware of Xen. >>> Without trying to prescribe how to fix this specific issue, wherever >>> possible we should be trying to limit the Xen-isms from non-Xen areas. >>> There's a whole lot of poorly described and surprising behaviours which >>> have not stood the test of time. >>> >>> In this case, it seems that we have yet another x86 PV-ism which hasn't >>> translated well x86 HVM. Specifically, we're trying to overlay an >>> entirely shared-memory (and delayed return-to-guest) interrupt >>> controller onto one which is properly constructed to handle events in >>> realtime. >>> >>> >>> I even got as far as writing that maybe leaving it as-is was the best >>> option (principle of least surprise for Xen developers), but our >>> "friend" apic acceleration strikes again. >>> >>> Xen doesn't always get a VMExit when the guest clears SW_DISABLE, >>> because microcode may have accelerated it. >> But as per "APIC-Write Emulation" in the SDM we'd still get an APIC-write >> VM exit. > > Intel isn't the only accelerated implementation, and there future > details not in the public docs. > > There will be an implementation we will want to support where Xen > doesn't get a vmexit for a write to SPIV. I see. >> If we didn't, how would our internal accounting of APIC enabled >> state (VLAPIC_SW_DISABLED) work? > > It doesn't. > > One of many problems on the "known errors" list from an incomplete > original attempt to get acceleration working. > > There's no good reason to cache those disables in the first place (both > are both trivially available from other positions in memory), and > correctness reasons not to. > >> And the neighboring (to where I'm adding >> the new code) pt_may_unmask_irq() call then also wouldn't occur. >> >> I'm actually pretty sure we do too much in this case - in particular none >> of the vlapic_set_reg() should be necessary. But we certainly can't get >> away with doing nothing, and hence we depend on that VM exit to actually >> occur. > > We must do exactly and only what real hardware does, so that the state > changes emulated by Xen are identical to those accelerated by microcode. > > Other than that, I really wouldn't make any presumptions about the > existing vLAPIC logic being correct. > > It is, at best, enough of an approximation to the spec for major OSes to > function, with multiple known bugs and no coherent testing. But can we leave resolving of the wider issue then separate, and leave the change here as it presently is? Yes, mimic-ing the same behavior later may be "interesting", but if we can't achieve consistent behavior with yet more advanced acceleration, maybe we simply can't use that (perhaps until a complete overhaul of everything involved in LAPIC handling, possibly including a guest side indicator that they're happy without the extra signaling, at which point that yet-more-advanced acceleration could then be enabled for that guest). Otherwise - do you have any suggestion as to alternative logic which I might use in this patch? Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |