Re: [Xen-devel] HPET stack overflow, and general problems with do_IRQ()

>>> On 15.08.13 at 22:21, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> Hello,
> I have finally managed to get a full stack dump from affected hardware.
> The logs can be found here (including hypervisor with debugging symbols):
> http://xenbits.xen.org/people/andrewcoop/hpet-overflow-full-stackdump.tar.gz 
> The interesting log file is xen.pcpu0.stack.log
> By my count (grepping for e008 as CS), there are 8 exception frames
> on the Xen stack (all stack page 6)
> However, because of the early ack() at the LAPIC, and disabling of
> interrupts, the vectors (in order of interrupts arriving) are
> c1, 99, b1, b9, a9, a1, 91, 89

So these are all HPET interrupts as it seems to me. You said the
box just has 8 of them, so the fundamental problem is not the
general handling of interrupts that you talk about below, but the
fact that _all_ these channels are bound to CPU0: That's an
insane side effect of the way channel management works when
there are (potentially) more CPUs than channels. So _I_ think
this is what needs fixing.
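
To illustrate why the binding matters, here is a minimal sketch in plain
C. It is not Xen code: spread_channels(), worst_case_per_cpu() and
MAX_CPUS are names made up for this example, and round-robin is merely
one conceivable assignment policy, not what the channel management does
or should do. The only point is that the number of channels bound to a
CPU bounds how many HPET frames can pile up on that CPU's stack.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS 64   /* arbitrary limit for this sketch */

/* Hypothetical policy: bind channel i to CPU (i % nr_cpus). */
void spread_channels(unsigned int *owner, size_t nr_channels,
                     unsigned int nr_cpus)
{
    size_t i;

    assert(nr_cpus > 0 && nr_cpus <= MAX_CPUS);

    for (i = 0; i < nr_channels; i++)
        owner[i] = (unsigned int)(i % nr_cpus);
}

/*
 * Largest number of channels bound to any single CPU, i.e. an upper
 * bound on the HPET interrupt frames that can nest on one stack.
 */
unsigned int worst_case_per_cpu(const unsigned int *owner,
                                size_t nr_channels)
{
    unsigned int count[MAX_CPUS] = { 0 };
    unsigned int worst = 0;
    size_t i;

    for (i = 0; i < nr_channels; i++) {
        assert(owner[i] < MAX_CPUS);
        if (++count[owner[i]] > worst)
            worst = count[owner[i]];
    }

    return worst;
}
```

With all 8 channels owned by CPU0 the bound is 8, matching the frame
count in the dump; spreading them over 4 CPUs brings it down to 2.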

That's all the more so because the above sequence would be impossible
for guest interrupts (which don't get EOI-ed immediately, and
interrupts don't get re-enabled on that path either). Hence in the
discussion here we only need to be concerned with interrupts that
Xen uses for itself: timer, console, iommu, and HPET. Out of these,
timer and console - going through the IO-APIC - are safe from this
because of how io_apic.c implements the ->ack()/->end() pairs.
Both IOMMU implementations ack their IRQs in the LAPIC only in
->end(). And that's what I suggested switching HPET to as well. And
contrary to what I said about this earlier, disabling interrupts in
the ->end() handler isn't even necessary, as it already gets called
with them disabled.
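
As a rough illustration of the difference between acking in ->ack() and
in ->end(), here is a toy model in plain C. It is not Xen code, and
prio_class() and max_nesting() are names invented for this sketch. It
assumes that interrupts are re-enabled inside the handler and that no
handler finishes before the next vector arrives (i.e. the worst case).
It relies on the LAPIC only delivering a vector whose priority class
(upper four bits) is higher than that of anything still in service.

```c
#include <assert.h>
#include <stddef.h>

/* Priority class of a vector as the LAPIC sees it: the upper four bits. */
static unsigned int prio_class(unsigned int vector)
{
    return (vector >> 4) & 0xf;
}

/*
 * Worst-case nesting depth for a burst of vectors.
 *
 * early_eoi != 0: EOI is written up front, so nothing is left in
 *                 service and any later vector may nest.
 * early_eoi == 0: EOI is deferred until the handler is done, so only a
 *                 vector of a strictly higher priority class may nest.
 */
unsigned int max_nesting(const unsigned int *vec, size_t n, int early_eoi)
{
    unsigned int depth = 0;
    unsigned int in_service_class = 0; /* highest class still in service */
    size_t i;

    assert(vec != NULL || n == 0);

    for (i = 0; i < n; i++) {
        unsigned int cls = prio_class(vec[i]);

        if (early_eoi || depth == 0 || cls > in_service_class) {
            depth++;
            if (cls > in_service_class)
                in_service_class = cls;
        }
        /* Otherwise the vector just stays pending; no new frame. */
    }

    return depth;
}
```

Feeding it the sequence from the dump (c1, 99, b1, b9, a9, a1, 91, 89)
gives a depth of 8 with the early EOI, but 1 with the EOI deferred,
since c1 arrives first and nothing after it has a higher priority class.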

So we have two possible fixes to the HPET, either of which is
very likely to deal with the problem on its own.


