|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids
On 08/02/2021 09:41, Jürgen Groß wrote: On 08.02.21 10:11, Julien Grall wrote:Hi Juergen, On 07/02/2021 12:58, Jürgen Groß wrote:On 06.02.21 19:46, Julien Grall wrote:Hi Juergen, On 06/02/2021 10:49, Juergen Gross wrote:The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by: I don't think we can easily divide the two because we rely on the IRQ framework to handle the lifecycle of the event. So... Our problem isn't due to the IRQ still being pending, but due it being raised again, which should happen for a one shot IRQ the same way. ... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up). Cheers, -- Julien Grall
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |