RE: [Xen-devel] Instability with Xen, interrupt routing frozen, HPET broadcast
I am the original developer of the HPET broadcast code.

First of all, no additional patch is required to disable HPET broadcast.
Simply add the option "cpuidle=off" or "max_cstate=1" to the Xen cmdline
in /boot/grub/grub.conf.

Second, I noticed that the issue occurs only on pre-Nehalem server
processors. I will check whether I can reproduce it.

Meanwhile, I am looking forward to seeing whether Jeremy's and Xiantao's
suggestions have any effect. So Andreas, could you please try their
suggestions?

Jimmy

On , xen-devel-bounces@xxxxxxxxxxxxxxxxxxx wrote:
> Maybe you can try disabling pirq_set_affinity with the following
> patch. It may trigger IRQ migration in the hypervisor, and the IRQ
> migration logic for (especially shared) level-triggered IOAPIC IRQs
> is not well tested because it had no users before. After
> pirq_set_affinity was introduced in #Cset21625, this logic is used
> frequently whenever vcpu migration occurs, so I suspect it may
> expose the issue you hit.
> Besides, there is a bug in the event driver which is fixed in the
> latest pv_ops dom0; it seems the dom0 you are using doesn't include
> the fix. This bug may result in lost events in dom0 and eventually
> cause dom0 to hang. To work around this bug, you can disable
> irqbalance in dom0. Good luck!
> Xiantao
>
> diff -r fc29e13f669d xen/arch/x86/irq.c
> --- a/xen/arch/x86/irq.c Mon Aug 09 16:36:07 2010 +0100
> +++ b/xen/arch/x86/irq.c Thu Sep 30 20:33:11 2010 +0800
> @@ -516,6 +516,7 @@ void irq_set_affinity(struct irq_desc *d
>
>  void pirq_set_affinity(struct domain *d, int pirq, const cpumask_t *mask)
>  {
> +#if 0
>      unsigned long flags;
>      struct irq_desc *desc = domain_spin_lock_irq_desc(d, pirq, &flags);
>
> @@ -523,6 +524,7 @@ void pirq_set_affinity(struct domain *d,
>          return;
>      irq_set_affinity(desc, mask);
>      spin_unlock_irqrestore(&desc->lock, flags);
> +#endif
>  }
>
>  DEFINE_PER_CPU(unsigned int, irq_count);
>
>
> Andreas Kinzler wrote:
>> On 21.09.2010 13:56, Pasi Kärkkäinen wrote:
>>>> I have been talking with Jan (via email) for a while now to track
>>>> down the following problem, and he suggested that I report it on
>>>> xen-devel:
>>>>
>>>> Jul 9 01:48:04 virt kernel: aacraid: Host adapter reset request. SCSI hang ?
>>>> Jul 9 01:49:05 virt kernel: aacraid: SCSI bus appears hung
>>>> Jul 9 01:49:10 virt kernel: Calling adapter init
>>>> Jul 9 01:49:49 virt kernel: IRQ 16/aacraid: IRQF_DISABLED is not guaranteed on shared IRQs
>>>> Jul 9 01:49:49 virt kernel: Acquiring adapter information
>>>> Jul 9 01:49:49 virt kernel: update_interval=30:00 check_interval=86400s
>>>> Jul 9 01:53:13 virt kernel: aacraid: aac_fib_send: first asynchronous command timed out.
>>>> Jul 9 01:53:13 virt kernel: Usually a result of a PCI interrupt routing problem;
>>>> Jul 9 01:53:13 virt kernel: update mother board BIOS or consider utilizing one of
>>>> Jul 9 01:53:13 virt kernel: the SAFE mode kernel options (acpi, apic etc)
>>>>
>>>> After the VMs have been running for a while, the aacraid driver
>>>> reports a non-responding RAID controller. Most of the time the NIC
>>>> is also no longer working.
>>>> I tried nearly every combination of dom0 kernel (pvops0, xenified
>>>> SUSE 2.6.31.x, xenified SUSE 2.6.32.x, xenified SUSE 2.6.34.x) with
>>>> Xen hypervisor 3.4.2, 3.4.4-cs19986, 4.0.1, unstable.
>>>> No success in two months. Sooner or later every combination showed
>>>> the problem above. I did extensive tests to make sure that the
>>>> hardware is OK. And it is - I am sure it is a Xen/dom0 problem.
>>>>
>>>> Jan suggested trying the fix in c/s 22051, but it did not help. My
>>>> answer to him:
>>>>
>>>>> In the meantime I did try xen-unstable c/s 22068 (contains staging
>>>>> c/s 22051) and it did not fix the problem at all. I was able to
>>>>> fix a problem with the serial console and so I got some debug info
>>>>> that is attached to this email. The following line looks
>>>>> suspicious to me (irr=1, delivery_status=1):
>>>>
>>>>> (XEN) IRQ 16 Vec216:
>>>>> (XEN) Apic 0x00, Pin 16: vector=216, delivery_mode=1,
>>>>> dest_mode=logical, delivery_status=1, polarity=1,
>>>>> irr=1, trigger=level, mask=0, dest_id:1
>>>>
>>>>> IRQ 16 is the aacraid controller, which after a while seems to
>>>>> be unable to receive interrupts. Can you see from the debug info
>>>>> what is going on?
>>>>
>>>> I also applied a small patch which disables HPET broadcast. The
>>>> machine has now been running for 110 hours without a crash, while
>>>> normally it crashes within a few minutes. Is there something wrong
>>>> (race, deadlock) with HPET broadcasts in relation to blocked
>>>> interrupt reception (see above)?
>>> What kind of hardware does this happen on?
>>
>> It is a Supermicro X8SIL-F, Intel Xeon 3450 system.
>>
>>> Should this patch be merged?
>>
>> Not easy to answer. I spent more than 10 weeks searching nearly full
>> time for the cause of the stability issues. Finally I was able to
>> track it down to the HPET broadcast code.
>>
>> We need to find the developer of the HPET broadcast code. Then, he
>> should try to fix the code.
>> I consider it quite a severe bug, as it renders Xen nearly useless
>> on affected systems. That is why I (and my boss who pays me) spent
>> so much time (developing/fixing Xen is not really my core job) and
>> money (buying an E5620 machine just for testing Xen).
>>
>> I think many people on affected systems are having problems. See
>> http://lists.xensource.com/archives/html/xen-users/2010-09/msg00370.html
>>
>> Regards
>> Andreas
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>> http://lists.xensource.com/xen-devel
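
For reference, the cmdline change Jimmy describes at the top of this
message might look roughly like the following in a legacy GRUB
/boot/grub/grub.conf. This is only a sketch under assumptions: the
title, root device, and the hypervisor, kernel and initrd file names are
placeholders and do not come from Andreas's system. Only the
"cpuidle=off" and "max_cstate=1" options are taken from the message.

```
# Hypothetical grub.conf entry; adjust the paths and devices to your system.
title Xen (deep C-states disabled)
    root (hd0,0)
    # Xen hypervisor line: append cpuidle=off here ...
    kernel /boot/xen.gz cpuidle=off
    # ... or, alternatively, use max_cstate=1 instead of cpuidle=off:
    # kernel /boot/xen.gz max_cstate=1
    module /boot/vmlinuz-xen root=/dev/sda1 ro
    module /boot/initrd-xen.img
```

The option goes on the "kernel" line because that line loads the Xen
hypervisor. The dom0 kernel is loaded by the first "module" line and
has its own, separate command line. A reboot is needed for the change
to take effect.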