
Re: [Xen-devel] cpuidle and un-eoid interrupts at the local apic

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
From: "Thimo E." <abc@xxxxxxxxxx>
Date: Wed, 04 Sep 2013 21:56:40 +0200
Cc: Keir Fraser <keir@xxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, "Dong, Eddie" <eddie.dong@xxxxxxxxx>, Xen-develList <xen-devel@xxxxxxxxxxxxx>, "Nakajima, Jun" <jun.nakajima@xxxxxxxxx>, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx>, "Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>
Delivery-date: Wed, 04 Sep 2013 19:57:44 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hello Andrew,

thanks for your response. At least I have already seen the trigger of the new crash (2e) before, so the crashes seem to belong together.

I can't imagine that I am the only one in the world who is using a Haswell board. And as I haven't seen any other Xen bug/crash reports like mine (and, one time, yours), nor bug reports from users of other operating systems, I ask myself whether only my hardware is buggy or whether other operating systems handle those "spurious" interrupts in another way?


What does "ioapic_ack=old" change?

Best regards
  Thimo

On 04.09.2013 20:55, Andrew Cooper wrote:

On 04/09/13 19:32, Thimo E. wrote:

Hello again,

for the last two weeks there was no crash with dom0_vcpus_pin set and
dom0 restricted to 1 cpu. But yesterday it crashed again, so I changed
the command line again to:

iommu=no-intremap noirqbalance com1=115200,8n1,0xe050,0
console=com1,vga mem=1024G dom0_max_vcpus=4 dom0_mem=752M,max:752M
watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M
cpuid_mask_xsave_eax=0

And today the server crashed again and produced a lot of debugging
messages, see attached. A "..." in the log files means that the
message above the dots was repeated very often.

My summary so far:
- With only 1 cpu attached to dom0 the server was stable for 2 weeks;
the crash there did not really show any irq problems, see
crash20130903.txt
    You can find Andrew's thoughts on this in
http://forums.citrix.com/thread.jspa?messageID=1760771#1760771
- With more than 1 cpu and irqbalance the server produced the crashes
I've already posted before
- Without irqbalance it crashed with some other fancy output, see
crash20130904.txt

Next step is to change the network card.

Zhang, any update from your side? Or do the others have any idea?
Could "ioapic_ack=old" help somewhere?

Best regards
   Thimo

Ok - the second attachment (crash20130903.txt) is the one I have triaged
before, and the crash is impossible given the expected code flow through
the function.

%r14 is calculated as the per-cpu cpu_info, which cannot possibly be
-1 at the point of the fault.  The only explanation is that the
pagefault is a result of a spurious jump to this location.
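
To illustrate the reasoning, here is a minimal sketch of how a per-cpu cpu_info pointer can be derived from the stack pointer. It assumes the pointer is the top of the stack that contains %rsp minus the structure size, which is how I understand Xen's get_cpu_info() to work; the names STACK_SIZE, CPU_INFO_SIZE and cpu_info_addr, and both constant values, are assumptions for illustration and are not taken from the crash log.

```python
# Sketch only: both constants are assumptions, not values from the crash log.
STACK_SIZE = 8 * 4096      # assumed: per-CPU stack of 8 pages
CPU_INFO_SIZE = 232        # hypothetical sizeof(struct cpu_info)
MASK64 = (1 << 64) - 1     # emulate 64-bit register arithmetic


def cpu_info_addr(rsp: int) -> int:
    """Address of the cpu_info block at the top of the stack containing rsp."""
    # Round rsp down to the start of its STACK_SIZE-aligned stack.
    stack_base = rsp & ~(STACK_SIZE - 1) & MASK64
    # cpu_info sits at the very top of that stack.
    return (stack_base + STACK_SIZE - CPU_INFO_SIZE) & MASK64


if __name__ == "__main__":
    print(hex(cpu_info_addr(0xFFFF83007F2D7E40)))
```

Under these assumptions the result is always a STACK_SIZE-aligned base plus a fixed offset, so an all-ones value (-1) could only come out if the structure were a single byte long.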

From a quick glance at the other crash, vector 2e was the problematic
one (iirc).  The "Bad vmexit (reason 3)" at the top would suggest that
something on the system has sent an INIT to pcpu 2, which seems antisocial.
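
For reference, a small sketch of how the number in "Bad vmexit (reason 3)" maps to a name. It assumes the numbering of the first few VMX basic exit reasons from the Intel SDM, with the basic reason held in the low 16 bits of the exit reason field; the table and function names are made up for illustration and are not Xen code.

```python
# First few VMX basic exit reasons as listed in the Intel SDM.
# Only a subset is shown; names here are illustrative, not Xen identifiers.
VMX_BASIC_EXIT_REASONS = {
    0: "Exception or NMI",
    1: "External interrupt",
    2: "Triple fault",
    3: "INIT signal",
    4: "Start-up IPI (SIPI)",
}


def describe_exit_reason(exit_reason: int) -> str:
    """Return a readable name for the basic exit reason (low 16 bits)."""
    basic = exit_reason & 0xFFFF
    return VMX_BASIC_EXIT_REASONS.get(basic, f"unlisted basic exit reason {basic}")


if __name__ == "__main__":
    print(describe_exit_reason(3))
```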

As we have identified that the hardware is delivering invalid
interrupts, I wouldn't necessarily read any more into this new crash;
something is very broken in the hardware.

I would be interested in any update from Intel regarding the ISR violation.

~Andrew



