
Re: MSI-X cleanup(?) issue with passthrough after domU restart



On Tue, Aug 26, 2025 at 10:28:56AM +0200, Roger Pau Monné wrote:
> On Tue, Aug 26, 2025 at 08:16:56AM +0200, Jan Beulich wrote:
> > On 26.08.2025 03:49, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > > 
> > > I'm hitting an MSI-X issue after rebooting the domU. The symptoms are
> > > rather boring: on initial domU start the device (realtek eth card) works
> > > fine, but after domU restart, the link doesn't come up (there is no
> > > "Link is Up" message anymore). No errors from domU driver or Xen. I
> > > tracked it down to MSI-X - if I force INTx (via pci=nomsi on domU
> > > cmdline) it works fine. Convincing the driver to poll instead of waiting
> > > for an interrupt also works around the issue.
> > > 
> > > I also noticed that some interrupts are not cleaned up on restart. The list
> > > of MSIs in 'Q' debug key output grows:
> > > 
> > >     (XEN) 0000:03:00.0 - d22 - node -1  - MSIs < 41 42 43 44 45 46 47 >
> > >     restart sys-net domU
> > >     (XEN) 0000:03:00.0 - d24 - node -1  - MSIs < 41 42 43 44 45 46 47 48 >
> > >     restart sys-net domU
> > >     (XEN) 0000:03:00.0 - d26 - node -1  - MSIs < 41 42 43 44 45 46 47 48 49 >
> > > 
> > > and 'M' output is:
> > > 
> > >     (XEN)  MSI-X   41 vec=b1 lowest  edge   assert  log lowest dest=00000001 mask=1/H /1
> > >     (XEN)  MSI-X   42 vec=b9 lowest  edge   assert  log lowest dest=00000004 mask=1/HG/1
> > >     (XEN)  MSI-X   43 vec=c1 lowest  edge   assert  log lowest dest=00000010 mask=1/HG/1
> > >     (XEN)  MSI-X   44 vec=d9 lowest  edge   assert  log lowest dest=00000001 mask=1/HG/1
> > >     (XEN)  MSI-X   45 vec=e1 lowest  edge   assert  log lowest dest=00000001 mask=1/HG/1
> > >     (XEN)  MSI-X   46 vec=e9 lowest  edge   assert  log lowest dest=00000040 mask=1/HG/1
> > >     (XEN)  MSI-X   47 vec=32 lowest  edge   assert  log lowest dest=00000004 mask=1/HG/1
> > >     (XEN)  MSI-X   48 vec=3a lowest  edge   assert  log lowest dest=00000040 mask=1/HG/1
> > >     (XEN)  MSI-X   49 vec=42 lowest  edge   assert  log lowest dest=00000010 mask=1/ G/1
> > > 
> > > And also, after starting and stopping the domU,
> > > `xl pci-assignable-remove 03:00.0` makes pciback complain:
> > > 
> > >     [ 1180.919874] pciback 0000:03:00.0: xen_pciback: MSI-X release failed (-16)
> > > 
> > > This is all running on Xen 4.19.3, but I don't see many changes in this
> > > area since then.
> > > 
> > > Some more info collected at https://github.com/QubesOS/qubes-issues/issues/9335
> > > 
> > > My question is: what should be responsible for this cleanup on domain
> > > destroy? Xen, or maybe device model (which is QEMU in stubdomain here)?
> > 
> > The expectation is that qemu invokes the necessary cleanup, but of course ...
> > 
> > > I see some cleanup (apparently not enough) happening via QEMU when the
> > > domU driver is unloaded, but logically correct cleanup shouldn't depend
> > > on correct domU operation...
> > 
> > ... Xen may not make itself dependent upon either DomU or QEMU.
> 
> AFAICT free_domain_pirqs() called by arch_domain_destroy() should take
> care of unbinding and freeing pirqs (but obviously not in this case).
> Can you repeat the test with a debug=y hypervisor and post the
> resulting serial or dmesg here?  Some of the errors on those paths are
> printed with dprintk() and won't be visible unless using a Xen debug
> build.

Sure, will do.

> > What I find puzzling (assuming I can take the quoted output plus your
> > annotations verbatim) is that the device apparently uses multiple vectors,

No, the domU had already been restarted before I started collecting this
output. After a fresh boot there is just one vector.

> > and we're leaking exactly one of them. Also, since reboot is generally
> > nothing else than shutdown and immediate relaunch, is there a leak also
> > after shutdown? I ask because it might help to know which of the multiple
> > vectors is leaked (first, last, random).
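
To see which entry is left behind after each restart, something like the
following could be run over successive 'Q' dumps. It is only a sketch: the
regular expression is guessed from the line format quoted above, it expects
each device on a single unwrapped line, and parse_q_line / msi_growth are
made-up helper names, not part of any Xen tooling.

```python
import re

# Assumed line format, taken from the 'Q' debug key output quoted above:
#   (XEN) 0000:03:00.0 - d22 - node -1  - MSIs < 41 42 43 >
Q_LINE = re.compile(
    r"(?P<sbdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+-\s+"
    r"(?P<dom>d\d+)\s+-.*?MSIs\s+<(?P<msis>[^>]*)>"
)


def parse_q_line(line):
    """Return (sbdf, domain, [msi, ...]) or None if the line does not match."""
    m = Q_LINE.search(line)
    if m is None:
        return None
    return m["sbdf"], m["dom"], [int(x) for x in m["msis"].split()]


def msi_growth(lines, sbdf):
    """Yield (domain, newly_listed_msis) for each snapshot of one device."""
    seen = set()
    for line in lines:
        parsed = parse_q_line(line)
        if parsed is None or parsed[0] != sbdf:
            continue
        _, dom, msis = parsed
        # Anything not present in an earlier snapshot counts as new.
        new = [n for n in msis if n not in seen]
        seen.update(msis)
        yield dom, new


if __name__ == "__main__":
    import sys

    # Usage: python3 script.py 0000:03:00.0 < serial.log
    for dom, new in msi_growth(sys.stdin, sys.argv[1]):
        print(dom, "new MSIs:", new)
```

Fed the three 'Q' lines quoted above, it reports 48 as new for d24 and 49
for d26 (the first snapshot lists everything as new), while all earlier
entries remain listed.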
> 
> Can we maybe get the output of `lspci -vv` when the device is
> attached?

Both are below, collected on the first domU start, when the device still
works; the output is identical once it breaks.

Collected in dom0:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 06)
        Subsystem: Gigabyte Technology Co., Ltd Onboard Ethernet
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: I/O ports at e000 [size=256]
        Region 2: Memory at f7c00000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at f0000000 (64-bit, prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 1
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- FltModeDis-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported, FltMode-
        Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data
                Not readable
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
        Kernel driver in use: pciback
        Kernel modules: r8169


and the domU view:

00:06.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller (rev 06)
        Subsystem: Gigabyte Technology Co., Ltd Onboard Ethernet
        Physical Slot: 6
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 40
        Region 0: I/O ports at c200 [size=256]
        Region 2: Memory at f2018000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at f2010000 (64-bit, prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 1
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- FltModeDis-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported, FltMode-
        Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data
                Not readable
        Kernel driver in use: r8169
        Kernel modules: r8169


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
