[RFC] Xen crashes on ASSERT on suspend/resume, suggested fix
Hi all,

After many PVH Dom0 suspend/resume cycles we are seeing the following
Xen crash (it is random and doesn't reproduce reliably):

(XEN) [555.042981] [<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
(XEN) [555.042986] [<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
(XEN) [555.042991] [<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
(XEN) [555.042995] [<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
(XEN) [555.043001] [<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
(XEN) [555.043006] [<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
(XEN) [555.043013] [<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
(XEN) [555.043018] [<ffff82d0402640c3>] F vpci_write+0x11f/0x268
(XEN) [555.043024] [<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
(XEN) [555.043029] [<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
(XEN) [555.043034] [<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
(XEN) [555.043039] [<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
(XEN) [555.043043] [<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
(XEN) [555.043048] [<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
(XEN) [555.043053] [<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
(XEN) [555.043059] [<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
(XEN) [555.043064] [<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
(XEN) [555.043067]
(XEN) [555.469861]
(XEN) [555.471855] ****************************************
(XEN) [555.477315] Panic on CPU 9:
(XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
(XEN) [555.489882] ****************************************

Looking at the code in question, the ASSERT looks wrong to me.

Specifically, looking at send_cleanup_vector and
irq_move_cleanup_interrupt, it is entirely possible to have old_vector
still valid and move_in_progress still set, while only some of the
per_cpu(vector_irq, me)[vector] entries have been cleared. This seems
especially likely when an MSI has a large old_cpu_mask: the cleanup
interrupt clears per_cpu(vector_irq, me)[vector] on each CPU one by
one, so there is a window in which not all of the entries have been
cleared yet and old_vector is still set. If we enter _clear_irq_vector
at that point, we hit the ASSERT above. (The standalone sketch after
the patch illustrates this intermediate state.)

My suggestion is to turn the ASSERT into an if. Any better ideas?

Cheers,

Stefano

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 20150b1c7f..c82c6b350a 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -230,9 +230,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
 
     for_each_cpu(cpu, tmp_mask)
     {
-        ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-        TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-        per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
+        {
+            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+        }
     }
 
     release_old_vec(desc);
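
For illustration, here is a minimal standalone sketch of the racy state
described above. It is not Xen code: SIM_NR_CPUS, sim_vector_irq and the
other names are invented for this example, and it only models the
contents of the per-CPU vector_irq tables at the point where the
move-cleanup interrupt has been handled on some, but not all, of the
CPUs in old_cpu_mask, while old_vector/move_in_progress are still set.

/*
 * Standalone illustration only -- not Xen code.  Models the per-CPU
 * vector_irq tables while the move-cleanup IPI has run on only some of
 * the CPUs in old_cpu_mask.  All names (SIM_NR_CPUS, sim_vector_irq,
 * ...) are invented for this sketch.
 */
#include <stdio.h>

#define SIM_NR_CPUS    4
#define SIM_NR_VECTORS 256

static int sim_vector_irq[SIM_NR_CPUS][SIM_NR_VECTORS];

int main(void)
{
    const int irq = 42, old_vector = 0x30;
    int cpu;

    /* Every CPU in old_cpu_mask still maps old_vector to irq. */
    for ( cpu = 0; cpu < SIM_NR_CPUS; cpu++ )
        sim_vector_irq[cpu][old_vector] = irq;

    /*
     * The cleanup interrupt has been handled on CPUs 0 and 1, which
     * cleared their entries (Xen writes ~irq), but CPUs 2 and 3 have
     * not taken it yet, and old_vector is still recorded for the IRQ.
     */
    for ( cpu = 0; cpu < 2; cpu++ )
        sim_vector_irq[cpu][old_vector] = ~irq;

    /*
     * _clear_irq_vector() entered now would walk every CPU in
     * old_cpu_mask and assert vector_irq[old_vector] == irq, which no
     * longer holds on the CPUs that already ran the cleanup handler.
     */
    for ( cpu = 0; cpu < SIM_NR_CPUS; cpu++ )
        printf("cpu%d: entry %s the ASSERT\n", cpu,
               sim_vector_irq[cpu][old_vector] == irq ? "satisfies"
                                                      : "violates");

    return 0;
}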