Re: PCI pass-through problem for SN570 NVME SSD
On Fri, Jul 8, 2022 at 12:38 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 07.07.2022 17:24, G.R. wrote:
> > On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >>
> >> On 06.07.2022 08:25, G.R. wrote:
> >>> On Tue, Jul 5, 2022 at 7:59 PM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >>>> Nothing useful in there. Yet independent of that I guess we need to
> >>>> separate the issues you're seeing. Otherwise it'll be impossible to
> >>>> know what piece of data belongs where.
> >>> Yep, I think I'm seeing several different issues here:
> >>> 1. The FLR related DPC / AER message seen on the 1st attempt only when
> >>> pciback tries to seize and release the SN570
> >>> - Later-on pciback operations appear just fine.
> >>> 2. MSI-X preparation failure message that shows up each time the SN570
> >>> is seized by pciback or when it's passed to domU.
> >>> 3. XEN tries to map BAR from two devices to the same page
> >>> 4. The "write-back to unknown field" message in QEMU log that goes
> >>> away with permissive=1 passthrough config.
> >>> 5. The "irq 16: nobody cared" message shows up *sometimes* in a
> >>> pattern that I haven't figured out (See attached)
> >>> 6. The FreeBSD domU sees the device but fails to use it because low
> >>> level commands sent to it are aborted.
> >>> 7. The device does not return to the pci-assignable-list when the domU
> >>> it was assigned shuts-down. (See attached)
> >>>
> >>> #3 appears to be a known issue that could be worked around with
> >>> patches from the list.
> >>> I suspect #1 may have something to do with the device itself. It's
> >>> still not clear if it's deadly or just annoying.
> >>> I was able to update the firmware to the latest version and confirmed
> >>> that the new firmware didn't make any noticeable difference.
> >>>
> >>> I suspect issue #2, #4, #5, #6, #7 may be related, and the
> >>> pass-through was not completely successful...
> >>>
> >>> Should I expect a debug build of XEN hypervisor to give better
> >>> diagnostic messages, without the debug patch that Roger mentioned?
> >>
> >> Well, "expect" is perhaps too much to say, but with problems like
> >> yours (and even more so with multiple ones) using a debug
> >> hypervisor (or kernel, if such a build mode existed) is imo
> >> always a good idea. As is using as up-to-date a version as
> >> possible.
> >
> > I built both 4.14.3 debug version and 4.16.1 release version for
> > testing purposes.
> > Unfortunately they gave me absolutely zero information, since both of
> > them are not able to get through issue #1
> > the FLR related DPC / AER issue.
> > With 4.16.1 release, it actually can survive the 'xl
> > pci-assignable-add' which triggers the first AER failure.
>
> Then that's what needs debugging first. Yet from all I've seen so
> far I'm not sure who on the Xen side could be doing that, the more
> without themselves being able to repro - this seems more like a
> Linux side issue (and even outside of the pciback driver).
>
Yep, this one is likely not XEN related, as I've seen some discussions
([1],[2]) on similar symptoms (not necessarily the same root cause
though).
The question is why this only shows up during the FLR attempt and
whether subsequent pci-assignable-adds that do not trigger the error are
actually reliable.
BTW, I'm under the impression that the device is still usable in dom0
afterwards, I'll have to double check though...

[1] https://patchwork.kernel.org/project/linux-pci/patch/20220408153159.106741-1-kai.heng.feng@xxxxxxxxxxxxx/
[2] https://patchwork.kernel.org/project/linux-pci/patch/20220127025418.1989642-1-kai.heng.feng@xxxxxxxxxxxxx/#24713767

> > But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
> >> [ 655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp
> >> 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
> >> [ 655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c
> >> 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2
> >> <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>
> That'll need debugging. Cc-ing Anthony for awareness, but I'm sure
> he'll need more data to actually stand a chance of doing something
> about it.
>
> Is there any chance you could be doing some debugging work yourself,
> at the very least to figure out where this (apparent) NULL deref is
> happening?
Yep, I can collect the call-stack for sure.

>
> Jan
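
As a first step towards locating the apparent NULL deref, the kernel segfault line quoted above can be decoded by hand. The following is a minimal Python sketch, not part of the original exchange: the helper name `parse_segfault` is hypothetical, and it assumes the usual x86 page-fault error-code bits (bit 0 = protection violation, bit 1 = write, bit 2 = user mode). The computed offset is relative to the start of the mapping the kernel printed in brackets, which may or may not equal the library's load base depending on how libxenlight was linked, so treat it as a starting point only.

```python
import re

# Kernel segfault line from this thread, with the mail-client line wrap undone.
LOG_LINE = (
    "xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 00007ffd73a3d4d0 "
    "error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]"
)

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<start>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: decode one x86 'segfault at ...' kernel log line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    start = int(m["start"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - start,
        "mapping_size": int(m["size"], 16),
        # Assumed x86 page-fault error-code bits.
        "protection_fault": bool(err & 1),  # False means page not present
        "write": bool(err & 2),             # False means read access
        "user_mode": bool(err & 4),
    }


info = parse_segfault(LOG_LINE)
print(f"{info['object']}: fault addr {info['fault_addr']:#x}, "
      f"mapping offset {info['offset']:#x}")
```

For the line above this yields a fault address of 0 (a user-mode read of a non-present page) at offset 0x1d71f into the mapping. With debug symbols available, that offset could be tried with something like `addr2line -f -e libxenlight.so.4.16.0 0x1d71f`, adjusting for the text segment's offset if the result looks implausible. A gdb backtrace from a core dump remains the more reliable route.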