
Re: [PATCH v4] x86/msi: fix locking for SR-IOV devices


  • To: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 28 Aug 2024 12:36:32 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Teddy Astie <teddy.astie@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 28 Aug 2024 10:36:58 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 27.08.2024 05:59, Stewart Hildebrand wrote:
> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
> call path wasn't updated to reflect the change, leading to a failed
> assertion observed under the following conditions:
> 
> * PV dom0
> * Debug build (debug=y) of Xen
> * There is an SR-IOV device in the system with one or more VFs enabled
> * Dom0 has loaded the driver for the VF and enabled MSI-X
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> 
> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
> associated PF to access the vf_rlen array. This array is initialized in
> pci_add_device() and is only populated in the associated PF's struct
> pci_dev.
> 
> Add links between the VF's struct pci_dev and associated PF struct
> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
> VFs have been removed. Access the vf_rlen array via the new link to the
> PF, and remove the troublesome call to pci_get_pdev().
> 
> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
> pci_add_device() to set up the link from VF to PF. In case the new
> pci_get_pdev() invocation fails to find the associated PF (returning
> NULL), return an error.
> 
> In pci_remove_device(), handle an issue with a corner case when a PF is
> removed with VFs enabled, then re-added with VFs disabled. During PF
> removal, a buggy PF driver may fail to disable SR-IOV, leaving stale
> dangling VFs. At least Linux warns in this case:
> 
> [  106.500334]  0000:01:00.0: driver left SR-IOV enabled after remove
> 
> If the PF is subsequently re-added with VFs disabled, the previously
> associated VFs in Xen would not be backed by a device. Any attempt to
> access the config space of the stale VF would be invalid. Avoid issues
> in this case by removing the VFs right away when removing the PF. This
> also happens to simplify the maintenance of the PF<->VF links since the
> associated PF will always exist during the lifetime of a VF. Note,
> however, if a PF is removed before the VFs, Xen will return an error
> for the VF removals since they were already removed.

Not very nice, but probably tolerable.

> ---
> When I tested removing a PF with VFs enabled, then re-adding with VFs
> disabled, I observed the following Xen crash when dom0 attempted to
> access the config space of a stale VF:
> 
> (XEN) Assertion 'pos' failed at arch/x86/msi.c:1275
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040346834>] R pci_msi_conf_write_intercept+0xa2/0x1de
> (XEN)    [<ffff82d04035d6b4>] F pci_conf_write_intercept+0x68/0x78
> (XEN)    [<ffff82d0403264e5>] F arch/x86/pv/emul-priv-op.c#pci_cfg_ok+0xa0/0x114
> (XEN)    [<ffff82d04032660e>] F arch/x86/pv/emul-priv-op.c#guest_io_write+0xb5/0x1c8
> (XEN)    [<ffff82d0403267bb>] F arch/x86/pv/emul-priv-op.c#write_io+0x9a/0xe0
> (XEN)    [<ffff82d04037c77a>] F x86_emulate+0x100e5/0x25f1e
> (XEN)    [<ffff82d0403941a8>] F x86_emulate_wrapper+0x29/0x64
> (XEN)    [<ffff82d04032802b>] F pv_emulate_privileged_op+0x12e/0x217
> (XEN)    [<ffff82d040369f12>] F do_general_protection+0xc2/0x1b8
> (XEN)    [<ffff82d040201aa7>] F x86_64/entry.S#handle_exception_saved+0x2b/0x8c
> 
> This crash is avoided by removing the VFs right away with the PF.

If I'm not mistaken, the same would happen independently of your change
if Dom0 disabled VFs without first reporting their removal. That's
technically no different from removing with VFs enabled, then adding
with VFs disabled. We may want to be able to cope with that.

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -341,6 +341,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>  
>      list_add(&pdev->alldevs_list, &pseg->alldevs_list);
>  
> +    INIT_LIST_HEAD(&pdev->physfn.vf_list);

There is a certain risk in doing this uniformly when the field is part
of a union. Yes, little initialization has happened up to here, but I'm
still concerned. (This is one of the reasons I don't like the struct
list_head instances being split, despite your legitimate point regarding
naming.) At the very least this wants moving yet earlier in the
function, before the new struct is passed anywhere else.

> @@ -700,10 +706,31 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * extended function.
>           */
>          if ( pdev->info.is_virtfn )
> +        {
> +            struct pci_dev *pf_pdev;
> +
>              pdev->info.is_extfn = pf_is_extfn;
> +            pf_pdev = pci_get_pdev(NULL,
> +                                   PCI_SBDF(seg, info->physfn.bus,
> +                                            info->physfn.devfn));
> +            if ( pf_pdev )
> +            {
> +                pdev->virtfn.pf_pdev = pf_pdev;
> +                list_add(&pdev->virtfn.entry, &pf_pdev->physfn.vf_list);
> +            }
> +            else
> +            {
> +                printk(XENLOG_WARNING "Failed to add SR-IOV device PF %pp for VF %pp\n",
> +                       &PCI_SBDF(seg, info->physfn.bus, info->physfn.devfn),
> +                       &pdev->sbdf);
> +                free_pdev(pseg, pdev);
> +                ret = -ENODEV;

Why -ENODEV? Ideally the error code from the earlier pci_add_device()
could be retained (even if some fallback would be needed in case that
call succeeded, or the path wasn't taken at all). Yet on the whole I
wonder whether we wouldn't be better off delaying that recursive call
somewhat, such that the lock would no longer be dropped between the
call and getting here. This would then also eliminate the need for the
new pci_get_pdev() invocation.

> @@ -818,6 +847,24 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>      list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>          if ( pdev->bus == bus && pdev->devfn == devfn )
>          {
> +            ret = 0;
> +
> +            if ( !pdev->info.is_virtfn )
> +            {
> +                struct pci_dev *vf_pdev, *tmp;
> +
> +                /*
> +                 * If a PF with VFs enabled is removed, then re-added without
> +                 * VFs enabled, the previously associated VFs would no longer be
> +                 * backed by a device. Remove the associated VFs right away.
> +                 */

While "enabled" and "disabled" are kind of tolerable in the
description, at least in the code comment I think we want to be more
precise. The question isn't whether VFs are enabled, but whether we
know of the VFs.

It's further unclear here what "a device" is.

> +                list_for_each_entry_safe(vf_pdev, tmp, &pdev->physfn.vf_list,
> +                                         virtfn.entry)
> +                    ret = pci_remove_device(vf_pdev->sbdf.seg,
> +                                            vf_pdev->sbdf.bus,
> +                                            vf_pdev->sbdf.devfn) ?: ret;

And if this fails, the VF will still remain orphaned. I think in the
model I had suggested no such risk would exist.

MISRA also isn't going to like the recursion here.

> @@ -826,7 +873,7 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
>                  write_unlock(&pdev->domain->pci_lock);
>              }
>              pci_cleanup_msi(pdev);
> -            ret = iommu_remove_device(pdev);
> +            ret = iommu_remove_device(pdev) ?: ret;
>              printk(XENLOG_DEBUG "PCI remove device %pp\n", &pdev->sbdf);
>              free_pdev(pseg, pdev);
>              break;

Is it even legitimate to continue cleaning up if there was an earlier
failure?

In any event, please consider e.g. the case where the XSM check allows
the PF removal but denies the removal of some of the VFs.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -150,7 +150,18 @@ struct pci_dev {
>          unsigned int count;
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
> -    u64 vf_rlen[6];
> +    union {
> +        struct pf_info {
> +            /* Only populated for PFs (info.is_virtfn == false) */
> +            struct list_head vf_list;             /* List head */
> +            uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
> +        } physfn;
> +        struct {
> +            /* Only populated for VFs (info.is_virtfn == true) */
> +            struct list_head entry;               /* Entry in vf_list */
> +            struct pci_dev *pf_pdev;              /* Link from VF to PF */

const?

Jan



 

