
Re: [Xen-devel] Xen-unstable-staging: Xen BUG at iommu_map.c:455



On 11/04/15 17:32, Andrew Cooper wrote:
> On 11/04/15 17:21, Sander Eikelenboom wrote:
>> Saturday, April 11, 2015, 4:21:56 PM, you wrote:
>>
>>> On 11/04/15 15:11, Sander Eikelenboom wrote:
>>>> Friday, April 10, 2015, 8:55:27 PM, you wrote:
>>>>
>>>>> On 10/04/15 11:24, Sander Eikelenboom wrote:
>>>>>> Hi Andrew,
>>>>>>
>>>>>> Finally got some time to figure this out .. and I have narrowed it down to:
>>>>>> git://xenbits.xen.org/staging/qemu-upstream-unstable.git
>>>>>> commit 7665d6ba98e20fb05c420de947c1750fd47e5c07 "Xen: Use the
>>>>>> ioreq-server API when available"
>>>>>> A straight revert of this commit prevents the issue from happening.
>>>>>>
>>>>>> The reasons I had a hard time figuring this out:
>>>>>> - I wasn't aware of this earlier, since git pulling the main xen tree
>>>>>>   doesn't auto-update the qemu-* trees.
>>>>> This has caught me out so many times.  It is very non-obvious behaviour.
>>>>>> - So I happened to get this when I cloned a fresh tree to try to figure out
>>>>>>   the other issue I was seeing.
>>>>>> - After that, checking out previous versions of the main xen tree didn't
>>>>>>   resolve this new issue, because the qemu tree doesn't get auto-updated
>>>>>>   and is set to "master".
>>>>>> - Cloning xen-stable-4.5.0 made it go away .. because that has a specific
>>>>>>   git://xenbits.xen.org/staging/qemu-upstream-unstable.git tag which is not
>>>>>>   master.
>>>>>>
>>>>>> *sigh* 
>>>>>>
>>>>>> This is tested with the xen main tree at last commit
>>>>>> 3a28f760508fb35c430edac17a9efde5aff6d1d5
>>>>>> (normal xen-unstable, not the staging branch).
>>>>>>
>>>>>> Ok, so I have added some extra debug info (see attached diff) and this is
>>>>>> the output when it crashes due to something the commit above triggered; the
>>>>>> level is out of bounds and the pfn looks fishy too.
>>>>>> Complete serial logs from both the bad and the good case (specific commit
>>>>>> reverted) are attached.
>>>>> Just to confirm, you are positively identifying a qemu changeset as
>>>>> causing this crash?
>>>>> If so, the qemu change has discovered a pre-existing issue in the
>>>>> toolstack pci-passthrough interface.  Whatever qemu is or isn't doing,
>>>>> it should not be able to cause a crash like this.
>>>>> With this in mind, I need to brush up on my AMD-Vi details.
>>>>> In the meantime, can you run with the following patch to identify what
>>>>> is going on, domctl wise?  I assume it is the assign_device which is
>>>>> failing, but it will be nice to observe the differences between the
>>>>> working and failing case, which might offer a hint.
>>>> Hrrm, with your patch I end up with a fatal page fault in iommu_do_pci_domctl:
>>>>
>>>> (XEN) [2015-04-11 14:03:31.833] ----[ Xen-4.6-unstable  x86_64  debug=y  Tainted:    C ]----
>>>> (XEN) [2015-04-11 14:03:31.857] CPU:    5
>>>> (XEN) [2015-04-11 14:03:31.868] RIP:    e008:[<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>> (XEN) [2015-04-11 14:03:31.894] RFLAGS: 0000000000010256   CONTEXT: hypervisor
>>>> (XEN) [2015-04-11 14:03:31.915] rax: 0000000000000008   rbx: 0000000000000800   rcx: ffffffffffebe5ed
>>>> (XEN) [2015-04-11 14:03:31.942] rdx: 0000000000000800   rsi: 0000000000000000   rdi: ffff830256ef7e38
>>>> (XEN) [2015-04-11 14:03:31.968] rbp: ffff830256ef7c98   rsp: ffff830256ef7c08   r8:  00000000deadbeef
>>>> (XEN) [2015-04-11 14:03:31.995] r9:  00000000deadbeef   r10: ffff82d08024e500   r11: 0000000000000282
>>>> (XEN) [2015-04-11 14:03:32.022] r12: 0000000000000000   r13: 0000000000000008   r14: 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.049] r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000000006f0
>>>> (XEN) [2015-04-11 14:03:32.076] cr3: 00000002336a6000   cr2: 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.096] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>>> (XEN) [2015-04-11 14:03:32.121] Xen stack trace from rsp=ffff830256ef7c08:
>>>> (XEN) [2015-04-11 14:03:32.141]    ffff830256ef7c78 ffff82d08012c178 ffff830256ef7c28 ffff830256ef7c28
>>>> (XEN) [2015-04-11 14:03:32.168]    0000000000000010 0000000000000000 0000000000000000 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.195]    00000000000006f0 00007fe300000000 ffff830256eb7790 ffff83025cc6d300
>>>> (XEN) [2015-04-11 14:03:32.222]    ffff82d080330c60 00007fe396bab004 0000000000000000 00007fe396bab004
>>>> (XEN) [2015-04-11 14:03:32.249]    0000000000000000 0000000000000005 ffff830256ef7ca8 ffff82d08014900b
>>>> (XEN) [2015-04-11 14:03:32.276]    ffff830256ef7d98 ffff82d080161f2d 0000000000000010 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.303]    0000000000000000 ffff830256ef7ce8 ffff82d08018b655 ffff830256ef7d48
>>>> (XEN) [2015-04-11 14:03:32.330]    ffff830256ef7cf8 ffff82d08018b66a ffff830256ef7d38 ffff82d08012925e
>>>> (XEN) [2015-04-11 14:03:32.357]    ffff830256efc068 0000000800000001 800000022e12c167 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.384]    0000000000000002 ffff830256ef7e38 0000000800000000 800000022e12c167
>>>> (XEN) [2015-04-11 14:03:32.411]    0000000000000003 ffff830256ef7db8 0000000000000000 00007fe396780eb0
>>>> (XEN) [2015-04-11 14:03:32.439]    0000000000000202 ffffffffffffffff 0000000000000000 00007fe396bab004
>>>> (XEN) [2015-04-11 14:03:32.466]    0000000000000000 0000000000000005 ffff830256ef7ef8 ffff82d08010497f
>>>> (XEN) [2015-04-11 14:03:32.493]    0000000000000001 0000000000100001 800000022e12c167 ffff88001f7ecc00
>>>> (XEN) [2015-04-11 14:03:32.520]    00007fe396780eb0 ffff88001c849508 0000000e00000007 ffffffff8105594a
>>>> (XEN) [2015-04-11 14:03:32.547]    000000000000e033 0000000000000202 ffff88001ece3d40 000000000000e02b
>>>> (XEN) [2015-04-11 14:03:32.574]    ffff830256ef7e28 ffff82d080194933 000000000000beef ffffffff81bd6c85
>>>> (XEN) [2015-04-11 14:03:32.601]    ffff830256ef7f08 ffff82d080193edd 0000000b0000002d 0000000000000001
>>>> (XEN) [2015-04-11 14:03:32.628]    0000000100000800 00007fe3962abbd0 ffff000a81050001 00007fe39656ce6e
>>>> (XEN) [2015-04-11 14:03:32.655]    00007ffdf2a654f0 00007fe39656d0c9 00007fe39656ce6e 00007fe3969a9a55
>>>> (XEN) [2015-04-11 14:03:32.682] Xen call trace:
>>>> (XEN) [2015-04-11 14:03:32.695]    [<ffff82d08014c52c>] iommu_do_pci_domctl+0x2dc/0x740
>>>> (XEN) [2015-04-11 14:03:32.718]    [<ffff82d08014900b>] iommu_do_domctl+0x17/0x1a
>>>> (XEN) [2015-04-11 14:03:32.739]    [<ffff82d080161f2d>] arch_do_domctl+0x2469/0x26e1
>>>> (XEN) [2015-04-11 14:03:32.762]    [<ffff82d08010497f>] do_domctl+0x1a1f/0x1d60
>>>> (XEN) [2015-04-11 14:03:32.783]    [<ffff82d080234c6b>] syscall_enter+0xeb/0x145
>>>> (XEN) [2015-04-11 14:03:32.804] 
>>>> (XEN) [2015-04-11 14:03:32.813] Pagetable walk from 0000000000000000:
>>>> (XEN) [2015-04-11 14:03:32.831]  L4[0x000] = 0000000234075067 000000000001f2a8
>>>> (XEN) [2015-04-11 14:03:32.852]  L3[0x000] = 0000000229ad4067 0000000000014c49
>>>> (XEN) [2015-04-11 14:03:32.873]  L2[0x000] = 0000000000000000 ffffffffffffffff
>>>> (XEN) [2015-04-11 14:03:32.894] 
>>>> (XEN) [2015-04-11 14:03:32.903] ****************************************
>>>> (XEN) [2015-04-11 14:03:32.922] Panic on CPU 5:
>>>> (XEN) [2015-04-11 14:03:32.935] FATAL PAGE FAULT
>>>> (XEN) [2015-04-11 14:03:32.948] [error_code=0000]
>>>> (XEN) [2015-04-11 14:03:32.961] Faulting linear address: 0000000000000000
>>>> (XEN) [2015-04-11 14:03:32.981] ****************************************
>>>> (XEN) [2015-04-11 14:03:33.000] 
>>>> (XEN) [2015-04-11 14:03:33.009] Reboot in five seconds...
>>>>
>>>> The RIP resolves to the printk added by your patch in:
>>>>
>>>>     case XEN_DOMCTL_test_assign_device:
>>>>         ret = xsm_test_assign_device(XSM_HOOK, domctl->u.assign_device.machine_sbdf);
>>>>         if ( ret )
>>>>             break;
>>>>
>>>>         seg = domctl->u.assign_device.machine_sbdf >> 16;
>>>>         bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>>         devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>>
>>>>         printk("*** %pv->d%d: test_assign_device({%04x:%02x:%02x.%u})\n",
>>>>                current, d->domain_id,
>>>>                seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>
>>>>         if ( device_assigned(seg, bus, devfn) )
>>>>         {
>>>>             printk(XENLOG_G_INFO
>>>>                    "%04x:%02x:%02x.%u already assigned, or non-existent\n",
>>>>                    seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>>>             ret = -EINVAL;
>>>>         }
>>>>         break;
>>> hmm - 'd' is NULL.  This ought to work better.
>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>> index 9f3413c..85ff1fc 100644
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -1532,6 +1532,11 @@ int iommu_do_pci_domctl(
>>>          max_sdevs = domctl->u.get_device_group.max_sdevs;
>>>          sdevs = domctl->u.get_device_group.sdev_array;
>>>  
>>> +        printk("*** %pv->d%d: get_device_group({%04x:%02x:%02x.%u, %u})\n",
>>> +               current, d->domain_id,
>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
>>> +               max_sdevs);
>>> +
>>>          ret = iommu_get_device_group(d, seg, bus, devfn, sdevs, max_sdevs);
>>>          if ( ret < 0 )
>>>          {
>>> @@ -1558,6 +1563,9 @@ int iommu_do_pci_domctl(
>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>  
>>> +        printk("*** %pv: test_assign_device({%04x:%02x:%02x.%u})\n",
>>> +               current, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>> +
>>>          if ( device_assigned(seg, bus, devfn) )
>>>          {
>>>              printk(XENLOG_G_INFO
>>> @@ -1582,6 +1590,10 @@ int iommu_do_pci_domctl(
>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>  
>>> +        printk("*** %pv->d%d: assign_device({%04x:%02x:%02x.%u})\n",
>>> +               current, d->domain_id,
>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>> +
>>>          ret = device_assigned(seg, bus, devfn) ?:
>>>                assign_device(d, seg, bus, devfn);
>>>          if ( ret == -ERESTART )
>>> @@ -1604,6 +1616,10 @@ int iommu_do_pci_domctl(
>>>          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
>>>          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
>>>  
>>> +        printk("*** %pv->d%d: deassign_device({%04x:%02x:%02x.%u})\n",
>>> +               current, d->domain_id,
>>> +               seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
>>> +
>>>          spin_lock(&pcidevs_lock);
>>>          ret = deassign_device(d, seg, bus, devfn);
>>>          spin_unlock(&pcidevs_lock);
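For anyone reading along who is unfamiliar with the machine_sbdf encoding used
in the snippets above: it packs segment, bus and devfn into one 32-bit value,
and devfn is further split into slot and function.  Below is a small,
self-contained sketch of the decode (plain C, not Xen code; 0x0800 is only an
example value, which happens to decode to 0000:08:00.0, one of the devices
discussed in this thread):

    /*
     * Standalone sketch (not Xen code) of how machine_sbdf is unpacked:
     * segment in bits 31..16, bus in bits 15..8, devfn in bits 7..0, with
     * devfn split into a 5-bit slot and a 3-bit function.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)  /* standard devfn split */
    #define PCI_FUNC(devfn) ((devfn) & 0x07)

    int main(void)
    {
        uint32_t machine_sbdf = 0x0800;              /* example value only */
        uint16_t seg   = machine_sbdf >> 16;
        uint8_t  bus   = (machine_sbdf >> 8) & 0xff;
        uint8_t  devfn = machine_sbdf & 0xff;

        printf("%04x:%02x:%02x.%u\n", seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
        return 0;
    }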
>> Hi Andrew,
>>
>> Attached are the serial logs good (with revert) and bad (without):
>>
>> Some things that seem strange to me:
>> - The numerous calls to get the device 08:00.0 assigned ... for 0a:00.0 there
>>   was only one call to both test assign and assign.
>> - However, these numerous calls are there in both the good and the bad case,
>>   so perhaps it's strange and wrong .. but not the cause ..
>> - I had a hunch it could be due to the 08:00.0 using MSI-X, but when only
>>   passing through 0a:00.0, I get the same numerous calls but now for the
>>   0a:00.0 which uses IntX, so I think that is more related to being the *first*
>>   device to be passed through to a guest.
> I have also observed this behaviour, but not had time to investigate.
> It doesn't appear problematic in the long run, but it is probably a toolstack
> issue which wants fixing (if only in the name of efficiency).

And just after I sent this email, I realised why.

The first assign device will have to build IO pagetables, which is a
long operation and subject to hypercall continuations.  The second
device will reuse the same pagetables, so is quick to complete.
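To make that re-entry visible, here is a minimal, self-contained toy model
(plain C, not Xen code, all names invented): the real handler does a bounded
amount of IO-pagetable work per invocation and arranges, via a hypercall
continuation, to be re-entered until the work is done, so a printk at the top
of the handler fires once per re-entry.

    /*
     * Toy model (NOT Xen code) of why one logical "assign device" request
     * shows up as several handler entries: a bounded chunk of work is done,
     * then the hypercall is re-issued until nothing is left to do.
     */
    #include <stdio.h>

    #define FAKE_ERESTART   (-85)    /* stand-in for Xen's -ERESTART        */
    #define TOTAL_PAGES     10000    /* pretend IO-pagetable mappings       */
    #define PAGES_PER_CALL  3000     /* work done before "continuing"       */

    static int pages_done;

    static int fake_assign_device(void)
    {
        printf("*** assign_device handler entered\n"); /* like the debug printk */

        for ( int i = 0; i < PAGES_PER_CALL && pages_done < TOTAL_PAGES; i++ )
            pages_done++;

        return (pages_done < TOTAL_PAGES) ? FAKE_ERESTART : 0;
    }

    int main(void)
    {
        int ret;

        /* Models the hypercall being re-executed until it stops asking to restart. */
        do {
            ret = fake_assign_device();
        } while ( ret == FAKE_ERESTART );

        printf("assignment complete after %d pages\n", pages_done);
        return 0;
    }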

~Andrew

>
>> - In both the good and the bad case, "current" indicates more than one vcpu
>>   was used, so that doesn't seem to be a pointer either.
> That is to be expected.  It will be whichever vcpu the toolstack process
> happened to be scheduled on at the point that it made the hypercall, and
> is liable to change at the whim of the dom0 scheduler.
>
>> - The bad log seems to indicate it crashes on or before hd->arch.paging_mode
>>   gets to 3.
>>
>> Is there anything useful to dump in update_paging_mode() when it updates the
>> paging mode to the invalid value (8)?
>> (And shouldn't that be denied in the first place?)
>>
>> To me that looks like the first indication that things are starting to go
>> wrong.
> I would concur.  I am currently trying to develop another debugging
> patch, but to do that I also want to understand what I am debugging,
> which is why I have the AMD-Vi spec open :)
>
> ~Andrew
>
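Following up on Sander's question above about update_paging_mode() and the
invalid paging mode of 8: AMD-Vi page tables support at most 6 levels, so any
gfn that would require more than that is clearly bogus and could be rejected
with a diagnostic instead of tripping the BUG in iommu_map.c later on.  Below
is a self-contained sketch of that bounds check (plain C, not the real Xen
function; the constants and names are assumptions made for illustration):

    /*
     * Sketch only (NOT the real update_paging_mode()): given a gfn, work out
     * how many AMD-Vi page-table levels are needed (9 bits of gfn per level)
     * and flag anything beyond the 6 levels the hardware can express, rather
     * than letting an impossible paging mode (such as the 8 seen here) through.
     */
    #include <stdio.h>

    #define PTE_PER_TABLE_SHIFT  9   /* 512 entries per AMD-Vi page table   */
    #define AMD_IOMMU_MAX_LEVELS 6   /* hardware limit on the paging mode   */

    static int levels_needed(unsigned long long gfn)
    {
        int level = 1;

        while ( gfn >> (PTE_PER_TABLE_SHIFT * level) )
            level++;

        return level;
    }

    int main(void)
    {
        /* Made-up example gfns, not values taken from the attached logs. */
        unsigned long long examples[] = { 0xfee00ULL, 1ULL << 54 };

        for ( unsigned int i = 0; i < sizeof(examples) / sizeof(examples[0]); i++ )
        {
            int level = levels_needed(examples[i]);

            if ( level > AMD_IOMMU_MAX_LEVELS )
                printf("gfn %#llx needs level %d: reject with a diagnostic\n",
                       examples[i], level);
            else
                printf("gfn %#llx fits in paging mode %d\n", examples[i], level);
        }
        return 0;
    }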


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

