[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] x86/msi: fix locking for SR-IOV devices


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Date: Wed, 14 Aug 2024 21:28:19 -0400
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=suse.com smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none (0)
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=B0l3Hu/nacDiYCFoeaN96Ca9/rly+kx6EWAp7kHpDLA=; b=Dr3CzZfkJThA58fM0oMd/WaBHZB2vJRi9quh4ygRPy48/+LmS9xeCK3wEFfm8331FCl6Wa2oIaG06IaHuh2iIYr7ATtcWwf6rDiFYiVseVF3JGtcjxxCnbknNGV4LONo8Zx5mCsOnDtOnQrzjZQzwMxSSwO1dri6A8Sgfu0b3YYOkvWjxsCCo4ULNBQqvGm2TzhXQGKRXhmbhoU6hgzaCslS0XB2O68XWcvbkGVUofHvSbl5t7rIF9QjH3NHRSJ9tyYSTjNr6BG0HD8x0Pc0gAyGoVCpFDzPm/FisbqCdZ3dGhWW6+63mExcH90d1IPOn5PgJbNdMWwhPz4IpNG6xA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=y1yrR/LdoGWZ5SZj2mm7VnLnFSX/qvetE9xGVjybM2HT08WdeZiRV1/k+TlgYvglSr23lMjdzM67TOCJp+ZS9rOyPXaRtDfIGCdmpB+d8FG5X3B+qwChBCrI8SGFcCtpeCiDI6MmQPSMpj77moS0g551pY5l2T5kc0+iGiZYBY9Vmyynf6A4UwzpsqEsCCwRHxEVkZwqY6ACRjAgEYTZcaSXsdNaVfte9ATWbDkEclOxBnf65ItmexyZGDb0Ss1JJOfnq9WVS1uDsCp45YfCFU1k0G0qxbDQAA0lNVMhMH1jTJj6woZyzC+5eU0KOE7ffPsc9G9imLm7nXMXgKKYRw==
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Teddy Astie <teddy.astie@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 15 Aug 2024 01:28:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 8/13/24 10:01, Jan Beulich wrote:
> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at 
>> drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F 
>> arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add a link from the VF's struct pci_dev to the associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>> and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_add_device() invocation fails to find the associated PF (returning
>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>> return 0 in that case.
>>
>> Note that currently the only way for Xen to know if a device is a VF is
>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>> a VF is not a case that Xen handles.
> 
> How does the toolstack come into play here? It's still the Dom0 kernel to
> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
> is even more kind of deprecated).

I guess I meant to say Dom0 kernel, not toolstack. I'm actually
questioning how much value this last portion of the commit description
is really adding. Maybe it would be better to just remove this bit.

>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>  
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int 
>> vf)
>> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 
>> slot,
>> +                            u8 func, u8 bir, int vf)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>  
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>  
>> -        if ( !pdev )
>> -            return 0;
>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>  
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info ||
> 
> See further down - I think it would be nice if we didn't need this new
> check.
> 
>> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>          int vf;
>>          paddr_t pba_paddr;
>>          unsigned int pba_offset;
>> +        struct pf_info *pf_info = NULL;
> 
> Pointer-to-const?
> 
>> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            if ( dev->virtfn.pf_pdev )
>> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>  
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, 
>> bir,
>> +                                       vf);
> 
> I don't think putting the new arg first is very nice. SBDF should remain
> first imo. Between the latter arguments I'm not as fussed.
> 
>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct 
>> pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +    {
>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>> +
>> +        if ( pf_pdev )
>> +        {
>> +            list_del(&pdev->virtfn.next);
>> +            if ( pf_pdev->physfn.dealloc_pending &&
>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>> +                xfree(pf_pdev);
>> +        }
>> +        xfree(pdev);
>> +    }
>> +    else
>> +    {
>> +        if ( list_empty(&pdev->physfn.vf_list) )
>> +            xfree(pdev);
>> +        else
>> +            pdev->physfn.dealloc_pending = true;
>> +    }
> 
> Could I talk you into un-indenting the last if/else by a level, by making
> the earlier else and "else if()"?
> 
> One aspect I didn't properly consider when making the suggestion: What if,
> without all VFs having gone away, the PF is re-added? In that case we
> would better recycle the existing structure. That's getting a little
> complicated, so maybe a better approach is to refuse the request (in
> pci_remove_device()) when the list isn't empty?

Hm. Right. The growing complexity of maintaining these PF<->VF links
(introduced on a hunch that they may be useful in the future) is making
me favor the previous approach from v2 of simply copying the vf_len
array. So unless there are any objections I'll go back to that approach
for v4.

>> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
>> +            }
>> +        }
> 
> Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
> we drop the lock intermediately. I wonder though if we wouldn't better fail
> the function if we can't find the PF instance.
> 
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    struct pf_info {
>> +        /* Only populated for PFs (info.is_virtfn == false) */
>> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        struct list_head vf_list;             /* List head */
>> +        bool dealloc_pending;
>> +    } physfn;
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
>> +        struct pci_dev *pf_pdev;              /* Link from VF to PF */
>> +        struct list_head next;                /* Entry in vf_list */
> 
> For doubly-linked lists "next" isn't really a good name. Since both struct
> variants need such a member, why not use vf_list uniformly? That'll then
> also lower the significance of my other question:
> 
>> +    } virtfn;
> 
> Should the two new struct-s perhaps be a union?
> 
> Jan




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.