Re: MSI-X cleanup(?) issue with passthrough after domU restart
On Tue, Aug 26, 2025 at 04:52:17PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 26, 2025 at 04:47:50PM +0200, Roger Pau Monné wrote:
> > On Tue, Aug 26, 2025 at 03:55:05PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Aug 26, 2025 at 12:57:50PM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Tue, Aug 26, 2025 at 10:28:56AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, Aug 26, 2025 at 08:16:56AM +0200, Jan Beulich wrote:
> > > > > > On 26.08.2025 03:49, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm hitting an MSI-X issue after rebooting the domU. The symptoms are
> > > > > > > rather boring: on initial domU start the device (realtek eth card) works
> > > > > > > fine, but after domU restart, the link doesn't come up (there is no
> > > > > > > "Link is Up" message anymore). No errors from domU driver or Xen. I
> > > > > > > tracked it down to MSI-X - if I force INTx (via pci=nomsi on domU
> > > > > > > cmdline) it works fine. Convincing the driver to poll instead of waiting
> > > > > > > for an interrupt also workarounds the issue.
> > > > > > >
> > > > > > > I noticed also some interrupts are not cleaned up on restart. The list
> > > > > > > of MSIs in 'Q' debug key output grows:
> > > > > > >
> > > > > > > (XEN) 0000:03:00.0 - d22 - node -1 - MSIs < 41 42 43 44 45 46 47 >
> > > > > > > restart sys-net domU
> > > > > > > (XEN) 0000:03:00.0 - d24 - node -1 - MSIs < 41 42 43 44 45 46 47 48 >
> > > > > > > restart sys-net domU
> > > > > > > (XEN) 0000:03:00.0 - d26 - node -1 - MSIs < 41 42 43 44 45 46 47 48 49 >
> > > > > > >
> > > > > > > and 'M' output is:
> > > > > > >
> > > > > > > (XEN) MSI-X 41 vec=b1 lowest edge assert log lowest dest=00000001 mask=1/H /1
> > > > > > > (XEN) MSI-X 42 vec=b9 lowest edge assert log lowest dest=00000004 mask=1/HG/1
> > > > > > > (XEN) MSI-X 43 vec=c1 lowest edge assert log lowest dest=00000010 mask=1/HG/1
> > > > > > > (XEN) MSI-X 44 vec=d9 lowest edge assert log lowest dest=00000001 mask=1/HG/1
> > > > > > > (XEN) MSI-X 45 vec=e1 lowest edge assert log lowest dest=00000001 mask=1/HG/1
> > > > > > > (XEN) MSI-X 46 vec=e9 lowest edge assert log lowest dest=00000040 mask=1/HG/1
> > > > > > > (XEN) MSI-X 47 vec=32 lowest edge assert log lowest dest=00000004 mask=1/HG/1
> > > > > > > (XEN) MSI-X 48 vec=3a lowest edge assert log lowest dest=00000040 mask=1/HG/1
> > > > > > > (XEN) MSI-X 49 vec=42 lowest edge assert log lowest dest=00000010 mask=1/ G/1
> > > > > > >
> > > > > > > And also, after starting and stopping the domU, `xl pci-assignable-remove 03:00.0`
> > > > > > > makes pciback to complain:
> > > > > > >
> > > > > > > [ 1180.919874] pciback 0000:03:00.0: xen_pciback: MSI-X release failed (-16)
> > > > > > >
> > > > > > > This is all running on Xen 4.19.3, but I don't see much changes in this
> > > > > > > area since then.
> > > > > > >
> > > > > > > Some more info collected at
> > > > > > > https://github.com/QubesOS/qubes-issues/issues/9335
> > > > > > >
> > > > > > > My question is: what should be responsible for this cleanup on domain
> > > > > > > destroy? Xen, or maybe device model (which is QEMU in stubdomain here)?
> > > > > >
> > > > > > The expectation is that qemu invokes the necessary cleanup, but of
> > > > > > course ...
> > > > > >
> > > > > > > I see some cleanup (apparently not enough) happening via QEMU when the
> > > > > > > domU driver is unloaded, but logically correct cleanup shouldn't depend
> > > > > > > on correct domU operation...
> > > > > >
> > > > > > ... Xen may not make itself dependent upon either DomU or QEMU.
> > > > >
> > > > > AFAICT free_domain_pirqs() called by arch_domain_destroy() should take
> > > > > care of unbinding and freeing pirqs (but obviously not in this case).
> > > > > Can you repeat the test with a debug=y hypervisor and post the
> > > > > resulting serial or dmesg here? Some of the errors on those paths are
> > > > > printed with dprintk() and won't be visible unless using a Xen debug
> > > > > build.
> > > >
> > > > Sure, will do.
> > >
> > > Output collected during domU shutdown and subsequent startup (dom0 logs
> > > to Xen console here too):
> > > https://gist.github.com/marmarek/6dc3ac14d3ba840482e6361fcbd37c30
> > >
> > > I don't see any errors there...
> >
> > Hm, yes, I don't see any errors either. Do you think you could
> > instrument unmap_domain_pirq() and figure out whether it gets called,
> > and if such call to unmap the PIRQ succeeds?
>
> Sure, now that I know where to look at, I'll try to find out what goes
> wrong.

Ok, after adding several debug prints there and looking why it's not
called, I found it's a completely different issue.
arch_domain_destroy() is not called, because there was a dom0 userspace
process still having a page mapped of that domain, and indeed there was
a zombie on xl list. Killing that process fixes it.

Sorry for the noise...

> Yeah, it's a single boot and that's it. I can improve that (and double
> check if MSI-X is covered too). But this might be a good idea anyway.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
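
For anyone who wants to spot the same symptom without reading the 'Q' debug key output by eye, here is a minimal Python sketch. It assumes the device lines look exactly like the ones quoted in this thread; the regular expression and the helper names (parse_q_line, msi_counts, is_growing) are made up for illustration and are not part of Xen or xl. A growing MSI list is only a hint that the previous domain was never fully destroyed (as with the zombie domain found here), not proof of a Xen bug.

```python
import re

# Assumed line format, taken from the 'Q' debug-key lines quoted in this thread:
#   (XEN) 0000:03:00.0 - d22 - node -1 - MSIs < 41 42 43 44 45 46 47 >
Q_LINE = re.compile(
    r"(?P<sbdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+-\s+"
    r"d(?P<domid>\d+)\s+-\s+node\s+\S+\s+-\s+"
    r"MSIs\s+<(?P<msis>[\d\s]*)>"
)


def parse_q_line(line):
    """Return (sbdf, domid, [msi, ...]) for a device line, or None if no match."""
    m = Q_LINE.search(line)
    if m is None:
        return None
    msis = [int(tok) for tok in m.group("msis").split()]
    return m.group("sbdf"), int(m.group("domid")), msis


def msi_counts(lines):
    """Return [(domid, number_of_msis), ...] for every line that parses."""
    parsed = (parse_q_line(line) for line in lines)
    return [(p[1], len(p[2])) for p in parsed if p is not None]


def is_growing(counts):
    """True if the MSI count strictly increases from one snapshot to the next."""
    sizes = [n for _, n in counts]
    return len(sizes) > 1 and all(a < b for a, b in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    # Snapshots copied from the report above, one per domU restart.
    snapshots = [
        "(XEN) 0000:03:00.0 - d22 - node -1 - MSIs < 41 42 43 44 45 46 47 >",
        "(XEN) 0000:03:00.0 - d24 - node -1 - MSIs < 41 42 43 44 45 46 47 48 >",
        "(XEN) 0000:03:00.0 - d26 - node -1 - MSIs < 41 42 43 44 45 46 47 48 49 >",
    ]
    counts = msi_counts(snapshots)
    print(counts, "growing:", is_growing(counts))
```

If the count keeps growing across restarts, it is worth checking xl list for a domain that never finished being destroyed before digging into the pirq teardown path.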