Xen project Mailing List

Re: [PATCH v3] x86/msi: fix locking for SR-IOV devices

From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>

Date: Sun, 25 Aug 2024 14:03:26 -0400

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Teddy Astie <teddy.astie@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Sun, 25 Aug 2024 18:03:59 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 8/13/24 10:01, Jan Beulich wrote:
> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable x86_64 debug=y Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN) [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN) [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN) [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN) [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN) [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN) [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN) [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN) [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN) [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN) [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN) [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add a link from the VF's struct pci_dev to the associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>> and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_add_device() invocation fails to find the associated PF (returning
>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>> return 0 in that case.
>>
>> Note that currently the only way for Xen to know if a device is a VF is
>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>> a VF is not a case that Xen handles.
>
> How does the toolstack come into play here? It's still the Dom0 kernel to
> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
> is even more kind of deprecated).

This last bit of the commit description isn't adding much, I'll remove
it.

>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 slot,
>> +                            u8 func, u8 bir, int vf)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>
>> -        if ( !pdev )
>> -            return 0;
>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info ||
>
> See further down - I think it would be nice if we didn't need this new
> check.

OK

>> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      int vf;
>>      paddr_t pba_paddr;
>>      unsigned int pba_offset;
>> +    struct pf_info *pf_info = NULL;
>
> Pointer-to-const?

OK

>> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            if ( dev->virtfn.pf_pdev )
>> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir,
>> +                                       vf);
>
> I don't think putting the new arg first is very nice. SBDF should remain
> first imo. Between the latter arguments I'm not as fussed.

OK

>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +    {
>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>> +
>> +        if ( pf_pdev )
>> +        {
>> +            list_del(&pdev->virtfn.next);
>> +            if ( pf_pdev->physfn.dealloc_pending &&
>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>> +                xfree(pf_pdev);
>> +        }
>> +        xfree(pdev);
>> +    }
>> +    else
>> +    {
>> +        if ( list_empty(&pdev->physfn.vf_list) )
>> +            xfree(pdev);
>> +        else
>> +            pdev->physfn.dealloc_pending = true;
>> +    }
>
> Could I talk you into un-indenting the last if/else by a level, by making
> the earlier else and "else if()"?

Yep. Actually, there will be no need for dealloc_pending after reworking
pci_remove_device() (see below).

> One aspect I didn't properly consider when making the suggestion: What if,
> without all VFs having gone away, the PF is re-added? In that case we
> would better recycle the existing structure. That's getting a little
> complicated, so maybe a better approach is to refuse the request (in
> pci_remove_device()) when the list isn't empty?

I set up a test case locally to remove a PF without removing the
associated VFs by hacking an SR-IOV PF driver. Although the PF driver
*should* remove the VFs first, it's completely up to the particular PF
driver how VFs/PFs are removed during hot-un-plug, in what order, or
whether at all to remove the VFs before removing the PF.

Anyway, during PF-only removal, at least the Linux PCI subsystem warns
about it:

[ 106.500334] igb 0000:01:00.0: driver left SR-IOV enabled after remove

Returning an error code in pci_remove_device() results in only a warning
from Linux:

[ 106.507011] pci 0000:01:00.0: Failed to delete - passthrough or MSI/MSI-X might fail!

Despite the warning, Linux still proceeds to remove the PF, and we would
retain a stale PF in Xen. Re-adding (hotplugging) the just-removed PF
led to Xen crashing in another weird way.

To handle this more gracefully, I suggest removing the VFs right away
along with the PF in pci_remove_device() when a PF removal request comes
along. This would satisfy the test case described here without Xen
crashing.

>> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
>> +            }
>> +        }
>
> Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
> we drop the lock intermediately. I wonder though if we wouldn't better fail
> the function if we can't find the PF instance.

Yes

>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    struct pf_info {
>> +        /* Only populated for PFs (info.is_virtfn == false) */
>> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        struct list_head vf_list; /* List head */
>> +        bool dealloc_pending;
>> +    } physfn;
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
>> +        struct pci_dev *pf_pdev; /* Link from VF to PF */
>> +        struct list_head next; /* Entry in vf_list */
>
> For doubly-linked lists "next" isn't really a good name. Since both struct
> variants need such a member, why not use vf_list uniformly?

I'd like to retain some distinction between what's a list head and
what's an entry in the list. I suggest the name "entry".

> That'll then
> also lower the significance of my other question:
>
>> + } virtfn;
>
> Should the two new struct-s perhaps be a union?

Yes
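
For illustration only (this is not the next revision of the patch): a
minimal standalone sketch of the direction discussed above, assuming the
two sub-structures become a union, the VF's list member is named "entry",
adding a VF fails when its PF can't be found, and removing a PF removes
any remaining VFs first. The toy_* names, NUM_BARS and the small list
helpers are hypothetical stand-ins for Xen's struct pci_dev,
PCI_SRIOV_NUM_BARS and the list_head helpers; locking is left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_BARS 6 /* hypothetical stand-in for PCI_SRIOV_NUM_BARS */

/* Minimal circular doubly-linked list, modelled on Xen's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }
static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}
static void list_del(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Toy model of the relevant part of struct pci_dev. */
struct toy_pdev {
    bool is_virtfn;
    union {
        struct pf_info {
            /* Only valid for PFs (is_virtfn == false) */
            uint64_t vf_rlen[NUM_BARS];
            struct list_head vf_list;   /* List head */
        } physfn;
        struct {
            /* Only valid for VFs (is_virtfn == true) */
            struct toy_pdev *pf_pdev;   /* Link from VF to PF */
            struct list_head entry;     /* Entry in the PF's vf_list */
        } virtfn;
    };
};

struct toy_pdev *toy_add_pf(void)
{
    struct toy_pdev *pf = calloc(1, sizeof(*pf));

    if ( pf )
        list_init(&pf->physfn.vf_list);
    return pf;
}

/* Refuse the request (return NULL) when no valid PF is supplied. */
struct toy_pdev *toy_add_vf(struct toy_pdev *pf)
{
    struct toy_pdev *vf;

    if ( !pf || pf->is_virtfn )
        return NULL;
    vf = calloc(1, sizeof(*vf));
    if ( !vf )
        return NULL;
    vf->is_virtfn = true;
    vf->virtfn.pf_pdev = pf;
    list_add(&vf->virtfn.entry, &pf->physfn.vf_list);
    return vf;
}

void toy_remove_vf(struct toy_pdev *vf)
{
    list_del(&vf->virtfn.entry);
    free(vf);
}

/* Remove all VFs still linked to the PF, then the PF; returns #VFs removed. */
unsigned int toy_remove_pf(struct toy_pdev *pf)
{
    unsigned int removed = 0;

    while ( !list_empty(&pf->physfn.vf_list) )
    {
        struct list_head *e = pf->physfn.vf_list.next;
        struct toy_pdev *vf = (struct toy_pdev *)
            ((char *)e - offsetof(struct toy_pdev, virtfn.entry));

        toy_remove_vf(vf);
        removed++;
    }
    free(pf);
    return removed;
}
```

With this shape, a PF removal request that arrives while VFs are still
present leaves neither a stale PF nor a dangling pf_pdev pointer behind,
which is the situation the igb test case above exercises.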
