
Re: [Xen-devel] Xen crash: map_domain_page() on an NMI path



On 19/12/13 11:00, Tim Deegan wrote:
> At 19:37 +0000 on 18 Dec (1387391848), Andrew Cooper wrote:
>> Hello,
>>
>> This is a stack trace caught by automated testing.  The server BMC has
>> indicated that it has genuinely injected an IOCK NMI (which is believed
>> to be caused by a system erratum we are aware of and trying to work around)
>>
>> However, the interesting point is the nested crash.  This is a failed
>> assertion while attempting to execute the kexec crash path.  Xen is
>> 4.3.1 based, and built with debug, so the stack trace below is generated
>> with frame pointers, and is correct.
>>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82c4c01634ac>] __context_switch+0xb0/0x41e
>> (XEN)    [<ffff82c4c016388d>] __sync_local_execstate+0x73/0x83
>> (XEN)    [<ffff82c4c01638a6>] sync_local_execstate+0x9/0xb
>> (XEN)    [<ffff82c4c0166789>] map_domain_page+0x98/0x5c4
>> (XEN)    [<ffff82c4c0153820>] map_vtd_domain_page+0xd/0x1d
>> (XEN)    [<ffff82c4c015139f>] queue_invalidate_context+0x94/0x141
>> (XEN)    [<ffff82c4c0151891>] flush_context_qi+0x55/0x66
>> (XEN)    [<ffff82c4c014d1ed>] iommu_flush_all+0x68/0x12f
>> (XEN)    [<ffff82c4c014f770>] vtd_crash_shutdown+0x15/0x64
>> (XEN)    [<ffff82c4c0149eec>] iommu_crash_shutdown+0x3f/0x4f
>> (XEN)    [<ffff82c4c01a8790>] machine_crash_shutdown+0x273/0x2eb
>> (XEN)    [<ffff82c4c0114af2>] kexec_crash+0x4c/0x70
>> (XEN)    [<ffff82c4c01442f2>] panic+0x12c/0x15b
>> (XEN)    [<ffff82c4c0190815>] fatal_trap+0xb8/0xc6
>> (XEN)    [<ffff82c4c0190f1c>] do_nmi+0xf9/0x180
>> (XEN)    [<ffff82c4c02366fc>] handle_ist_exception+0x92/0xf6
>> (XEN)    [<ffff82c4c0167558>] write_cr3+0x6a/0x83
>> (XEN)    [<ffff82c4c0176b08>] write_ptbase+0x10/0x12
>> (XEN)    [<ffff82c4c016374b>] __context_switch+0x34f/0x41e
>> (XEN)    [<ffff82c4c016388d>] __sync_local_execstate+0x73/0x83
>> (XEN)    [<ffff82c4c01638a6>] sync_local_execstate+0x9/0xb
>> (XEN)    [<ffff82c4c012df35>] do_tasklet_work+0x9d/0xeb
>> (XEN)    [<ffff82c4c012e152>] tasklet_softirq_action+0x44/0x92
>> (XEN)    [<ffff82c4c012b4bc>] __do_softirq+0x9f/0xb0
>> (XEN)    [<ffff82c4c012b4e0>] do_softirq+0x13/0x15
>> (XEN)    [<ffff82c4c01628bc>] idle_loop+0x66/0x6c
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 0:
>> (XEN) Assertion 'cpumask_empty(n->vcpu_dirty_cpumask)' failed at
>> domain.c:1321
>> (XEN) ****************************************
>> (XEN)
>>
>> Here, we have managed to re-enter the __context_switch() path because of
>> an NMI interrupting it.  The sync_local_execstate() in map_domain_page()
>> is by way of mapcache_current_vcpu().
>>
>> I am struggling to work out how best to fix this.  Would it be best for
>> the crash path to unconditionally change to the idle_pagetables and use
>> mapcache_override_current(NULL)?
> I think it would be best for the iommu_crash_shutdown() path to be
> made crash-safe -- after all, that code takes spinlocks too.
> Presumably we can do something a bit ruder in crash code, like just
> turn the IOMMUs off entirely?
>
> Or are there other map_domain_page() ops on the crash path?  Does
> kexec need it?
>
> Tim.

I don't believe we can safely just disable the IOMMU without tearing it
down in a sensible fashion.

Having said that, we certainly should try to make the crash path as
"crash safe" as possible.


I don't think it is reasonable to forbid the use of map_domain_page()
on the crash path (that would be too invasive).  However,
mapcache_override_current(NULL) is an override which prevents any
playing with the pagetables, with the caveat that mfn_to_virt(mfn) needs
to work for all mfns in the current set of pagetables.
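
To make the idea concrete, here is a toy model in plain C.  It is not
Xen code: every toy_* name is hypothetical, the "direct map" is just a
static array, and it assumes the override behaves as described above
(with the override set, a mapping request is served straight from the
direct map and never reaches the lazy state sync that re-entered
__context_switch() in the trace).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins only; none of these are Xen's real definitions. */
#define TOY_PAGE_SHIFT 12
#define TOY_NR_FRAMES  4UL

/* Pretend direct map: every frame has a fixed virtual address. */
static unsigned char toy_directmap[TOY_NR_FRAMES << TOY_PAGE_SHIFT];

static bool toy_override_active;      /* models "override is set"      */
static unsigned int toy_sync_calls;   /* counts simulated state syncs  */

/* Models the lazy context-switch sync that is unsafe on the NMI path. */
static void toy_sync_local_execstate(void)
{
    toy_sync_calls++;
}

/* Models switching the override on or off for the current CPU. */
static void toy_mapcache_override(bool on)
{
    toy_override_active = on;
}

/* Models mfn_to_virt(): only valid for frames inside the direct map. */
static void *toy_mfn_to_virt(unsigned long mfn)
{
    /* The caveat: this must hold for every frame the caller maps. */
    assert(mfn < TOY_NR_FRAMES);
    return &toy_directmap[mfn << TOY_PAGE_SHIFT];
}

/* Models a mapping request made with or without the override. */
static void *toy_map_domain_page(unsigned long mfn)
{
    if (toy_override_active)
        return toy_mfn_to_virt(mfn);  /* no pagetable or state changes */

    toy_sync_local_execstate();       /* the step the crash path must avoid */
    return toy_mfn_to_virt(mfn);
}
```

In this model a crash path would call toy_mapcache_override(true) once
before any toy_map_domain_page() call, after which toy_sync_calls stops
increasing.  Whether the real override gives the same guarantee on every
path is exactly the question raised here.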

~Andrew

