Re: PCI pass-through problem for SN570 NVME SSD
Some updated findings after extra triage effort. Detailed logs can be found in the attachments.

1. Confirmed that the stock Debian 11.2 kernel (5.10) shows the same syndrome.
2. With loglvl=all, the log reveals why the mapping failure happens: it looks like it comes from a duplicated mapping of the same GFN.

(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1 <===========Here
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) ioport_map:add: dom1 gport=c170 mport=4090 nr=8
(XEN) ioport_map:add: dom1 gport=c178 mport=4080 nr=4
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1 <===========Here
(XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
(XEN) domain_crash called from p2m.c:1301
(XEN) Domain 1 reported crashed by domain 0 on cpu#2:
(XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
(XEN) memory_map:remove: dom1 gfn=f3078 mfn=a2504 nr=1

3. Recompiled the kernel with DEBUG enabled for the xen_pciback driver and played with xl pci-assignable-XXX on it.
3.1 Confirmed that the DPC / AER error log appears only when xen_pciback attempts to seize or release the device.
3.1.1 It only happens on the first of each of the add / remove operations.
3.2 There is still an 'MSI-X preparation failed' message later on, but otherwise adding / removing the device appears to succeed after the first attempt.
3.3 Not necessarily related, but the DPC / AER log looks similar to this report [1].

[1]: https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@xxxxxxxxxxxxx/#24713767

PS: Attempting to fix the line-wrapping issue below. I have no idea what happened to the formatting.

On Sun, Jul 3, 2022 at 1:43 AM G.R.
<firemeteor@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi everybody,
>
> I run into problems passing through a SN570 NVME SSD to a HVM guest.
> So far I have no idea if the problem is with this specific SSD or with
> the CPU + motherboard combination or the SW stack.
> Looking for some suggestions on troubleshooting.
>
> List of build info:
> CPU+motherboard: E-2146G + Gigabyte C246N-WU2
> XEN version: 4.14.3
> Dom0: Linux Kernel 5.10 (built from Debian 11.2 kernel source package)
> The SN570 SSD sits here in the PCI tree:
> +-1d.0-[05]----00.0 Sandisk Corp Device 501a
>
> Syndromes observed:
> With ASPM enabled, pciback has problem seizing the device.
>
> Jul 2 00:36:54 gaia kernel: [ 1.648270] pciback 0000:05:00.0: xen_pciback: seizing device
> ...
> Jul 2 00:36:54 gaia kernel: [ 1.768646] pcieport 0000:00:1d.0: AER: enabled with IRQ 150
> Jul 2 00:36:54 gaia kernel: [ 1.768716] pcieport 0000:00:1d.0: DPC: enabled with IRQ 150
> Jul 2 00:36:54 gaia kernel: [ 1.768717] pcieport 0000:00:1d.0: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
> ...
> Jul 2 00:36:54 gaia kernel: [ 1.770039] xen: registering gsi 16 triggering 0 polarity 1
> Jul 2 00:36:54 gaia kernel: [ 1.770041] Already setup the GSI :16
> Jul 2 00:36:54 gaia kernel: [ 1.770314] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f11 source:0x0000
> Jul 2 00:36:54 gaia kernel: [ 1.770315] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul 2 00:36:54 gaia kernel: [ 1.770320] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> Jul 2 00:36:54 gaia kernel: [ 1.770371] pcieport 0000:00:1d.0: device [8086:a330] error status/mask=00200000/00010000
> Jul 2 00:36:54 gaia kernel: [ 1.770413] pcieport 0000:00:1d.0: [21] ACSViol (First)
> Jul 2 00:36:54 gaia kernel: [ 1.770466] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul 2 00:36:54 gaia kernel: [ 1.920195] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul 2 00:36:54 gaia kernel: [ 1.920260] pcieport 0000:00:1d.0: AER: device recovery successful
> Jul 2 00:36:54 gaia kernel: [ 1.920263] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f01 source:0x0000
> Jul 2 00:36:54 gaia kernel: [ 1.920264] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul 2 00:36:54 gaia kernel: [ 1.920267] pciback 0000:05:00.0: xen_pciback: device is not found/assigned
> Jul 2 00:36:54 gaia kernel: [ 1.938406] xen: registering gsi 16 triggering 0 polarity 1
> Jul 2 00:36:54 gaia kernel: [ 1.938408] Already setup the GSI :16
> Jul 2 00:36:54 gaia kernel: [ 1.938666] xen_pciback: backend is vpci
> ...
> Jul 2 00:43:48 gaia kernel: [ 420.231955] pcieport 0000:00:1d.0: DPC: containment event, status:0x1f01 source:0x0000
> Jul 2 00:43:48 gaia kernel: [ 420.231961] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error detected
> Jul 2 00:43:48 gaia kernel: [ 420.231993] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> Jul 2 00:43:48 gaia kernel: [ 420.235775] pcieport 0000:00:1d.0: device [8086:a330] error status/mask=00100000/00010000
> Jul 2 00:43:48 gaia kernel: [ 420.235779] pcieport 0000:00:1d.0: [20] UnsupReq (First)
> Jul 2 00:43:48 gaia kernel: [ 420.235783] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 05000010 00000000 88458845
> Jul 2 00:43:48 gaia kernel: [ 420.235819] pci 0000:05:00.0: AER: can't recover (no error_detected callback)
> Jul 2 00:43:48 gaia kernel: [ 420.384349] pcieport 0000:00:1d.0: AER: device recovery successful
> ... // The following might relate to an attempt to assign the device to guest, not very sure...
> Jul 2 00:46:06 gaia kernel: [ 559.147333] pciback 0000:05:00.0: xen_pciback: seizing device
> Jul 2 00:46:06 gaia kernel: [ 559.147435] pciback 0000:05:00.0: enabling device (0000 -> 0002)
> Jul 2 00:46:06 gaia kernel: [ 559.147508] xen: registering gsi 16 triggering 0 polarity 1
> Jul 2 00:46:06 gaia kernel: [ 559.147511] Already setup the GSI :16
> Jul 2 00:46:06 gaia kernel: [ 559.147558] pciback 0000:05:00.0: xen_pciback: MSI-X preparation failed (-6)
>
>
> With pcie_aspm=off, the error log related to pciback goes away.
> But I suspect there are still some problems hidden -- since I don't
> see any AER enabled messages so errors may be hidden.
> I have the xen_pciback built directly into the kernel and assigned the
> SSD to it in the kernel command-line.
> However, the result from pci-assignable-xxx commands are not very consistent:
>
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
> root@gaia:~# xl pci-assignable-remove 05:00.0
> libxl: error: libxl_pci.c:853:libxl__device_pci_assignable_remove: failed to de-quarantine 0000:05:00.0 <===== Here!!!
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:794:libxl__device_pci_assignable_add: 0000:05:00.0 already assigned to pciback <==== Here!!!
> root@gaia:~# xl pci-assignable-remove 05:00.0
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> root@gaia:~# xl pci-assignable-add 05:00.0
> libxl: warning: libxl_pci.c:814:libxl__device_pci_assignable_add: 0000:05:00.0 not bound to a driver, will not be rebound.
> root@gaia:~# xl pci-assignable-list
> 0000:00:17.0
> 0000:05:00.0
>
>
> After the 'xl pci-assignable-list' appears to be self-consistent, creating VM
> with the SSD assigned still leads to a guest crash:
> From qemu log:
> [00:06.0] xen_pt_region_update: Error: create new mem mapping failed! (err: 1)
> qemu-system-i386: terminating on signal 1 from pid 1192 (xl)
>
> From the 'xl dmesg' output:
> (XEN) d1: GFN 0xf3078 (0xa2616,0,5,7) -> (0xa2504,0,5,7) not permitted
> (XEN) domain_crash called from p2m.c:1301
> (XEN) Domain 1 reported crashed by domain 0 on cpu#4:
> (XEN) memory_map:fail: dom1 gfn=f3078 mfn=a2504 nr=1 ret:-1
>
>
> Which of the three syndromes are more fundamental?
> 1. The DPC / AER error log
> 2. The inconsistency in 'xl pci-assignable-list' state tracking
> 3. The GFN mapping failure on guest setup
>
> Any suggestions for the next step?
>
>
> Thanks,
> G.R.

Attachment: xldmesg_sn570_pt_fail.log
Attachment: pciback_dbg_xl-pci_assignable_XXX.log
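
For anyone triaging a similar 'xl dmesg' log, the conflicting mapping from finding 2 can be picked out mechanically rather than by eye. Below is a minimal sketch in Python, not part of Xen or xl. The function name find_conflicting_gfns is made up for illustration, the nr field is assumed to be hexadecimal like gfn and mfn, and memory_map:remove lines are ignored for simplicity, so a GFN that was legitimately unmapped and later remapped would also be reported.

```python
import re
from collections import defaultdict

# Matches Xen log lines such as:
#   (XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1
# Assumption: gfn, mfn and nr are all printed in hex without a 0x prefix.
MAP_ADD = re.compile(
    r"memory_map:add: dom(\d+) gfn=([0-9a-f]+) mfn=([0-9a-f]+) nr=([0-9a-f]+)"
)

# The memory_map:add lines quoted in finding 2 of this mail.
SAMPLE = """\
(XEN) memory_map:add: dom1 gfn=f3074 mfn=a2610 nr=2
(XEN) memory_map:add: dom1 gfn=f3077 mfn=a2615 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2616 nr=1
(XEN) ioport_map:add: dom1 gport=c140 mport=4060 nr=20
(XEN) memory_map:add: dom1 gfn=f3070 mfn=a2500 nr=2
(XEN) memory_map:add: dom1 gfn=f3073 mfn=a2503 nr=1
(XEN) memory_map:add: dom1 gfn=f3078 mfn=a2504 nr=1
"""


def find_conflicting_gfns(log_text):
    """Return {(domid, gfn): [mfn, ...]} for guest frames mapped to more than one MFN.

    Hypothetical helper: it only looks at memory_map:add lines and does not
    track memory_map:remove, so it may over-report on longer logs.
    """
    seen = defaultdict(list)
    for dom, gfn, mfn, nr in MAP_ADD.findall(log_text):
        gfn, mfn, nr = int(gfn, 16), int(mfn, 16), int(nr, 16)
        # Expand each range into individual frames so overlaps are caught too.
        for i in range(nr):
            mfns = seen[(int(dom), gfn + i)]
            if mfn + i not in mfns:
                mfns.append(mfn + i)
    return {key: mfns for key, mfns in seen.items() if len(mfns) > 1}


if __name__ == "__main__":
    for (dom, gfn), mfns in find_conflicting_gfns(SAMPLE).items():
        print(f"dom{dom} gfn={gfn:x} mapped to mfns {[f'{m:x}' for m in mfns]}")
```

On the lines quoted above this flags only gfn f3078 in dom1, with mfns a2616 and a2504, which is the same pair Xen rejects in the 'not permitted' message.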