Xen project Mailing List

Re: [Xen-devel] pci-passthrough loses msi-x interrupts ability after domain destroy

To: Christopher Clark <christopher.w.clark@xxxxxxxxx>

From: Pasi Kärkkäinen <pasik@xxxxxx>

Date: Fri, 22 Sep 2017 09:58:05 +0300

Cc: Kyle Temkin <temkink@xxxxxxxxxxxx>, Sander Eikelenboom <linux@xxxxxxxxxxxxxx>, Jérôme Oufella <jerome.oufella@xxxxxxxxxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Roger Pau Monné <roger.pau@xxxxxxxxxx>

Delivery-date: Fri, 22 Sep 2017 06:58:18 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hi, On Thu, Sep 21, 2017 at 07:09:12PM -0700, Christopher Clark wrote: > On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom > <linux@xxxxxxxxxxxxxx> wrote: > > > > On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote: > > > >> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote: > >>> > >>> I'm using PCI pass-through to map a PCIe (intel i210) controller into > >>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI > >>> device from Dom0. > >>> > >>> When creating the HVM domain after an hypervisor cold boot, the HVM > >>> domain can access and use the PCIe controller without problem. > >>> > >>> However, if the HVM domain is destroyed then restarted, it won't be > >>> able to use the pass-through PCI device anymore. The PCI device is > >>> seen and can be mapped, however, the interrupts will not be passed to > >>> the HVM domain anymore (this is visible under a Linux guest as > >>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest > >>> is the same. > >>> > >>> A few interesting hints I noticed: > >>> > >>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and > >>> the "muted interrupts" states, I noted a difference between the > >>> MSI-X caps: > >>> > >>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if > >>> domain started > >>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work > >>> if domain started > >>> ^^^^^^^ > > > >> IMHO it seems that either your device is not able to perform a reset > >> successfully, or Linux is not correctly performing such reset. I don't > >> think there's a lot that can be done from the Xen side. > > > > Unfortunately for a lot of pci-devices a simple reset as performed by > > default isn't enough, > > but also almost none support a real pci FLR. > > > > In the distant past Konrad has made a patchset that implemented a bus reset > > and > > reseting config space. (It piggy backed on already existing libxl mechanism > > of > > trying to call on a syfs "do_flr" attribute which triggers pciback to > > perform > > the busreset and rewrite of config space for the device. > > > > I use that patchset ever since for my pci-passtrough needs and it works > > pretty > > well. I can shutdown an restart VM's with pci devices passed trhough (also > > AMD > > Radeon graphic cards). > > Just to confirm the utility of that piece of work: OpenXT also uses an > extended version of that same patch to perform device reset for > passthrough. > > I've attached a copy of that OpenXT patch to this message and it can > also be obtained from our git repository: > https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch > This version creates a sysfs node named "reset_device" and the OpenXT > libxl toolstack is patched to use that node instead of "do_flr". > > Konrad's original work encountered pushback on upstream acceptance at > the time it was developed. I'm not sure I've found where that > discussion ended. Is there any prospect of a more comprehensive reset > mechanism being accepted into xen-pciback, or elsewhere in the kernel? > > As noted in the original LKML threads, vfio has similar relevant pci > device reset retry logic. (Thanks to Rich Persaud for this pointer:) > http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353 > > libvirt also performs similar reset logic, using a direct low level > interface to config space (Thanks to Marek for this pointer, libvirt > is used by Qubes:) > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929 > I thinks this indicates that it would be possible to extend libxl to > do something similar, but that seems less satisfactory compared to > performing the work in a kernel-provided implementation. > > Is there a way forward to providing this functionality within Xen > software or Linux? > Adding Konrad to CC .. -- Pasi > Christopher > -- > > openxt.org > From d686351d8ea4a1ea1d755d0a10f6f14d1c870911 Mon Sep 17 00:00:00 2001 > From: Kyle Temkin <ktemkin@xxxxxxxxxxxxxx> > Date: Wed, 8 Apr 2015 00:58:24 -0400 > Subject: [PATCH] Add thorough reset interface to pciback's sysfs. > > -------------------------------------------------------------------------------- > SHORT DESCRIPTION: > -------------------------------------------------------------------------------- > Adds an interface that allows "more thorough" resets to be performed > on devices which don't support Function Level Resets (FLRs). This > interface should allow the toolstack to ensure that a PCI device is in a > known state prior to passing it through to a VM. > > -------------------------------------------------------------------------------- > LONG DESCRIPTION: > -------------------------------------------------------------------------------- > > From Konrad Rzeszutek Wilk's original post to xen-devel and the LKML: > > The life-cycle of a PCI device in Xen pciback is complex > and is constrained by the PCI generic locking mechanism. > > It starts with the device being binded to us - for which > we do a device function reset (and done via SysFS > so the PCI lock is held) > > If the device is unbinded from us - we also do a function > reset (also done via SysFS so the PCI lock is held). > > If the device is un-assigned from a guest - we do a function > reset (no PCI lock). > > All on the individual PCI function level (so bus:device:function). > > Unfortunatly a function reset is not adequate for certain > PCIe devices. The reset for an individual PCI function "means > device must support FLR (PCIe or AF), PM reset on D3hot->D0 > device specific reset, or be a singleton device on a bus > a secondary bus reset. FLR does not have widespread support, > reset is not very reliable, and bus topology is dictated by the > and device design. We need to provide a means for a user to > a bus reset in cases where the existing mechanisms are not > or not reliable. " (Adam Williamson, 'vfio-pci: PCI hot reset > interface' commit 8b27ee60bfd6bbb84d2df28fa706c5c5081066ca). > > As such to do a slot or a bus reset is we need another mechanism. > This is not exposed SysFS as there is no good way of exposing > a bus topology there. > > This is due to the complexity - we MUST know that the different > functions off a PCIe device are not in use by other drivers, or > if they are in use (say one of them is assigned to a guest > and the other is idle) - it is still OK to reset the slot > (assuming both of them are owned by Xen pciback). > > This patch does that by doing an slot or bus reset (if > slot not supported) if all of the functions of a PCIe > device belong to Xen PCIback. We do not care if the device is > in-use as we depend on the toolstack to be aware of this - > however if it is we will WARN the user. > > Due to the complexity with the PCI lock we cannot do > the reset when a device is binded ('echo $BDF > bind') > or when unbinded ('echo $BDF > unbind') as the pci_[slot|bus]_reset > also take the same lock resulting in a dead-lock. > > Putting the reset function in a workqueue or thread > won't work either - as we have to do the reset function > outside the 'unbind' context (it holds the PCI lock). > But once you 'unbind' a device the device is no longer > under the ownership of Xen pciback and the pci_set_drvdata > has been reset so we cannot use a thread for this. > > Instead of doing all this complex dance, we depend on the toolstack > doing the right thing. As such implement [... a SysFS attribute] > which [... the toolstack] uses when a device is detached or attached > from/to a guest. It bypasses the need to worry about the PCI lock. > > To not inadvertly do a bus reset that would affect devices that > are in use by other drivers (other than Xen pciback) prior > to the reset we check that all of the devices under the bridge > are owned by Xen pciback. If they are not we do not do > the bus (or slot) reset. > > We also warn the user if the device is in use - but still > continue with the reset. This should not happen as the toolstack > also does the check. > > -- > > Our version of the patch has been modified to use a less confusing > sysfs name. The original name ('do_flr') is inappropriate, as it > implies a function level reset; it is entirely possible that the patch > code will use a bus-level reset when appropriate. > > The new sysfs entry is located at: > > /sys/bus/pci/drivers/pciback/reset_device > > and can be activated by writing a domain:bus:device:function device > identifier into the sysfs file. As an example: > > echo "0000:01:00.0" > /sys/bus/pci/drivers/pciback/reset_device > > would reset the device matching the D:BDF descriptor above. > > -------------------------------------------------------------------------------- > CHANGELOG: > -------------------------------------------------------------------------------- > This is a port of a patch that likely had many authors, including: > -Konrad Rzeszutek Wilk > -Alex Williamson > -Ross Phillipson <rphilipson@xxxxxxxxxxxx> > Ported to OpenXT by: Kyle J. Temkin <temkink@xxxxxxxxxxxx>, 4/8/15 > Rewrite by: Kyle J. Temkin <temkink@xxxxxxxxxxxx>, 4/10/15 > > -------------------------------------------------------------------------------- > DEPENDENCIES > -------------------------------------------------------------------------------- > This patch requires ONE of the following: > -A relatively modern linux kernel (3.18+) as a base; which provides > the PCI functions used; or > -Our PCI reset backports patch (backport-pci-reset-functionality.patch), > which backports the relevant functionality to 3.11. > > To take advantage of this patch, the utilized toolstack should be > changed to: > -Use the provided "reset_device" property, rather than the PCI > device's sysfs "reset" entry. This enables resets beyond a FLR to be > used. > -Ensure that all functions of a given device are passed through > together. This allows us to use some of the more thorugh resetting > techniques, when possible. > > -------------------------------------------------------------------------------- > REMOVAL > -------------------------------------------------------------------------------- > This patch provides a service which is necessary for proper passthrough > of many PCI cards: a generalized ability to reset PCI devices, without > requiring that the device support FLR or power-management based resets. > > This patch will be necessary until either the Linux PCI subsystem or Xen > PCIback drivers are modified to provide this support; or until cards > without proper FLR support are no longer supported. > > -------------------------------------------------------------------------------- > UPSTREAM PLAN > -------------------------------------------------------------------------------- > > This code is taken from a patch which was originally proposed and > rejected from upstream on the LKML and xen-devel. An upstream > implementation of the functionality of this patch is still necessary; > and can and should be implemented. > > This patch will hopefully be replaced with an upstream version when > community concensus has produced a single "blessed" method of > accomplishing its functionality. > > -------------------------------------------------------------------------------- > PATCHES > -------------------------------------------------------------------------------- > --- > drivers/xen/xen-pciback/pci_stub.c | 338 > ++++++++++++++++++++++++++++++++++--- > 1 file changed, 312 insertions(+), 26 deletions(-) > > Index: linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c > =================================================================== > --- linux-4.9.40.orig/drivers/xen/xen-pciback/pci_stub.c > +++ linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c > @@ -102,10 +102,9 @@ static void pcistub_device_release(struc > > xen_unregister_device_domain_owner(dev); > > - /* Call the reset function which does not take lock as this > - * is called from "unbind" which takes a device_lock mutex. > - */ > - __pci_reset_function_locked(dev); > + > + /* Reset is done by the toolstack by using 'reset_device' on the > + * SysFS. */ > if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state)) > dev_info(&dev->dev, "Could not reload PCI state\n"); > else > @@ -125,9 +124,6 @@ static void pcistub_device_release(struc > err); > } > > - /* Disable the device */ > - xen_pcibk_reset_device(dev); > - > kfree(dev_data); > pci_set_drvdata(dev, NULL); > > @@ -224,6 +220,271 @@ struct pci_dev *pcistub_get_pci_dev_by_s > return found_dev; > } > > + > +/** > + * Returns true iff the given device supports PCIe FLRs. > + */ > +static bool __device_supports_pcie_flr(struct pci_dev *dev) > +{ > + u32 cap; > + > + /* > + * Read the device's capabilities. Note that this can be used even > on legacy > + * PCI devices (and not just on PCIe devices)-- it indicates that no > capabilities > + * are supported if the device is legacy PCI by setting cap to 0. > + */ > + pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap); > + > + /* Return true iff the device advertises supporting an FLR. */ > + return (cap & PCI_EXP_DEVCAP_FLR); > +} > + > + > +/** > + * Returns true iff the given device supports PCI Advanced Functionality > (AF) FLRs. > + */ > +static bool __device_supports_pci_af_flr(struct pci_dev *dev) > +{ > + int pos; > + u8 capability_flags; > + > + /* First, try to find the location of the PCI Advanced Functionality > capability byte. */ > + pos = pci_find_capability(dev, PCI_CAP_ID_AF); > + > + /* > + * If we weren't able to find the capability byte, this device doesn't > support > + * the Advanced Functionality extensions, and thus won't support AF FLR. > + */ > + if (!pos) > + return false; > + > + /* Read the capabilities advertised in the AF capability byte. */ > + pci_read_config_byte(dev, pos + PCI_AF_CAP, &capability_flags); > + > + /* > + * If the device does support AF, it will advertise FLR support via the > + * PCI_AF_CAP_FLR bit. We'll also check for the Transactions Pending > (TP) > + * mechanism, as the kernel requires this extension to issue an AF FLR. > + * (Internally, the PCI reset code needs to be able to wait for all > + * pending transactions to complete prior to issuing the AF FLR.) > + */ > + return (capability_flags & PCI_AF_CAP_TP) && (capability_flags & > PCI_AF_CAP_FLR); > +} > + > + > +/** > + * Returns true iff the given device adverstises supporting function- > + * level-reset (FLR). > + */ > +static bool device_supports_flr(struct pci_dev *dev) > +{ > + return __device_supports_pci_af_flr(dev) || > __device_supports_pcie_flr(dev); > +} > + > + > +/** > + * Returns true iff the given device is located in a slot that > + * supports hotplugging slot resets. > + */ > +static bool device_supports_slot_reset(struct pci_dev *dev) > +{ > + return !pci_probe_reset_slot(dev->slot); > +} > + > + > +/** > + * Returns true iff the given device is located on a bus that > + * we can reset. Note that root bridges are excluded, as this > + * would cause more than just an SBR. > + */ > +static bool device_supports_bus_reset(struct pci_dev *dev) > +{ > + return !pci_is_root_bus(dev->bus) && !pci_probe_reset_bus(dev->bus); > +} > + > + > +/** > + * Out argument for the __safe_to_sbr_device_callback function. > + */ > +struct safe_to_sbr_arguments { > + > + //Stores the most recently encountered PCI device that does > + //not belong to pciback. As used below, this is the result of a > + //search for a non-pciback device on a bus; we stop upon finding > + //the first non-pciback device. > + struct pci_dev *last_non_pciback_device; > + > + //Stores the number of pciback devices that appear to be in use > + //on the bus in question. > + int use_count; > + > +}; > + > + > +/** > + * A callback function which determines if a given PCI device is owned by > pciback, > + * and whether the given device is in use. Used by safe_to_sbr_device. > + * > + * @param dev The PCI device to be checked. > + * @param data An out argument of type struct > safe_to_sbr_device_callback_arguments. > + * Updated to indicate the result of the search. See the > struct's definition > + * for more details. > + * > + */ > +static int __safe_to_sbr_device_callback(struct pci_dev *dev, void *data) > +{ > + > + struct pcistub_device *psdev; > + > + bool device_owned_by_pciback = false; > + struct safe_to_sbr_arguments *arg = data; > + > + unsigned long flags; > + > + //Ensure that we have exclusive access to the list of PCI devices, > + //so we can traverse it. > + spin_lock_irqsave(&pcistub_devices_lock, flags); > + > + //Iterate over all PCI devices owned by the pci stub. > + list_for_each_entry(psdev, &pcistub_devices, dev_list) { > + > + //If the given device is owned by pciback... > + if (psdev->dev == dev) { > + > + //mark it as a pciback device. > + device_owned_by_pciback = true; > + > + //If we have a physical device associated with the > pciback device, > + //mark this device as in-use. > + if (psdev->pdev) > + arg->use_count++; > + > + //Stop searching; we've found a the PCIback device > associated with this one. > + break; > + } > + } > + > + //Release the PCI device lock... > + spin_unlock_irqrestore(&pcistub_devices_lock, flags); > + > + //... and report if we've found a device that's not owned by pciback. > + dev_dbg(&dev->dev, "%s\n", device_owned_by_pciback ? "is owned by > pciback, and can be reset if not in use." > + : "not owned by pciback, and thus cannot be reset."); > + > + //If we've found a device that's not owned by pciback, update our data > + //argument so it points to the most recent unowned device. (We check > + //this like a flag, later: if it's never set, no one owns the device!) > + if (!device_owned_by_pciback) > + arg->last_non_pciback_device = dev; > + > + //If we've found a device that's not owned by pciback, return false-- > + //this indicates that pci_walk_bus should cease its walk. > + return !device_owned_by_pciback; > +} > + > + > +/** > + * Returns true iff it should be safe to issue a secondary bus reset > + * to the device; that is, if an SBR can be issued without disrupting > + * other devices. > + */ > +static bool safe_to_sbr_device(struct pci_dev *dev) > +{ > + struct safe_to_sbr_arguments walk_result = { .last_non_pciback_device = > NULL, .use_count = 0 }; > + > + //Walk the PCI bus, attempting to find if any of the given devices > + pci_walk_bus(dev->bus, __safe_to_sbr_device_callback, &walk_result); > + > + //If the device is in use, emit a warning error. > + if(walk_result.use_count > 0) > + dev_dbg(&dev->dev, "is in use; currently not safe to SBR > device.\n"); > + > + //Return true iff we did not pick up any other devices > + //that were either in use, or not owned by pciback. > + return (walk_result.last_non_pciback_device == NULL) && > (walk_result.use_count == 0); > +} > + > + > +/** > + * Attempt a raw reset of the provided PCI device-- via any > + * method available to us. This method prefers the gentlest > + * possible reset method-- currently an FLR, which many > + * PCIe devices should support. > + * > + * @param dev The pci device to be reset. > + * @return Zero on success, or the error code generated by the reset method > on failure. > + */ > +static int __pcistub_raw_device_reset(struct pci_dev *dev) > +{ > + //Determine if bus resetting techniques (SBR, slot resets) > + //are safe, and thus should be allowed. > + int allow_bus_reset = safe_to_sbr_device(dev); > + > + //If FLRs are supported; we'll try to let the linux kernel > + //manually reset the device. > + if(device_supports_flr(dev)) { > + dev_dbg(&dev->dev, "Resetting device using an FLR."); > + return pci_reset_function(dev); > + } > + > + //Next, we'll try the next gentlest: a hotplugging reset > + //of the PCI slot. > + if(allow_bus_reset && device_supports_slot_reset(dev)) { > + dev_dbg(&dev->dev, "Resetting device using a slot reset."); > + return pci_try_reset_slot(dev->slot); > + } > + > + //Finally, we'll try the most drastic: resetting the parent > + //PCI bus-- which we can only do conditionally. > + if(allow_bus_reset && device_supports_bus_reset(dev)) { > + dev_dbg(&dev->dev, "Resetting device using an SBR."); > + return pci_try_reset_bus(dev->bus); > + } > + > + //If we weren't able to reset the device by any of our known-good > methods, > + //fall back to the linux kernel's reset function. Unfortunately, this > considers a > + //power management reset to be a valid reset; though this doesn't work > for many devices-- > + //especially GPUs. > + dev_err(&dev->dev, "No reset methods available for %s. Falling back to > kernel reset.", pci_name(dev)); > + pci_reset_function(dev); > + > + //Return an error code, indicating that we likely did not reset the > device correctly. > + return -ENOTTY; > +} > + > + > +/** > + * Resets the target (pciback-owned) PCI device. Primarily intended > + * for use by the toolstack, so it can ensure a consistent PCI device > + * state on VM startup. > + * > + * @param dev The device to be reset. > + * @return Zero on success, or a negated error code on failure. > + */ > +static int pcistub_reset_pci_dev(struct pci_dev *dev) > +{ > + int rc; > + > + if (!dev) > + return -EINVAL; > + > + /* > + * Takes the PCI lock. OK to do it as we are never called > + * from 'unbind' state and don't deadlock. > + */ > + rc =__pcistub_raw_device_reset(dev); > + pci_restore_state(dev); > + > + /* This disables the device. */ > + xen_pcibk_reset_device(dev); > + > + /* And cleanup up our emulated fields. */ > + xen_pcibk_config_reset_dev(dev); > + return rc; > +} > + > + > + > struct pci_dev *pcistub_get_pci_dev(struct xen_pcibk_device *pdev, > struct pci_dev *dev) > { > @@ -279,11 +540,13 @@ void pcistub_put_pci_dev(struct pci_dev > * pcistub and xen_pcibk when AER is in processing > */ > down_write(&pcistub_sem); > - /* Cleanup our device > - * (so it's ready for the next domain) > - */ > device_lock_assert(&dev->dev); > - __pci_reset_function_locked(dev); > + /* > + * Reset is up to the toolstack. > + * The toolstack has to call 'reset_device' before > + * providing the PCI device to a guest (see pcistub_reset_device). > + */ > + //__pci_reset_function_locked(dev); > > dev_data = pci_get_drvdata(dev); > ret = pci_load_saved_state(dev, dev_data->pci_saved_state); > @@ -1460,6 +1723,41 @@ static ssize_t restrictive_add(struct de > } > static DRIVER_ATTR(restrictive, S_IWUSR, NULL, restrictive_add); > > +/** > + * Handles the "reset_device" sysfs attribute. This is the primary reset > interface > + * utilized by the toolstack. > + */ > +static ssize_t pcistub_sysfs_reset_device(struct device_driver *drv, const > char *buf, size_t count) > +{ > + int domain, bus, slot, func, err; > + struct pcistub_device *psdev; > + > + //Attempt to convert the user's string to a BDF/slot. > + err = str_to_slot(buf, &domain, &bus, &slot, &func); > + if (err) > + return -ENODEV; > + > + //... and then use that slot to find the pciback device. > + psdev = pcistub_device_find(domain, bus, slot, func); > + > + //If we have a device, attempt to reset it using our internal reset > path. > + if (psdev) { > + err = pcistub_reset_pci_dev(psdev->dev); > + pcistub_device_put(psdev); > + > + //If we were not able to reset the device, return the relevant > error code. > + if(err) > + err = -ENODEV; > + } > + //Otherwise, indicate that there's no such device. > + else { > + err = -ENODEV; > + } > + > + return err ? err : count; > + > +} > +static DRIVER_ATTR(reset_device, S_IWUSR, NULL, pcistub_sysfs_reset_device); > > static void pcistub_exit(void) > { > @@ -1476,6 +1774,8 @@ static void pcistub_exit(void) > &driver_attr_irq_handlers); > driver_remove_file(&xen_pcibk_pci_driver.driver, > &driver_attr_irq_handler_state); > + driver_remove_file(&xen_pcibk_pci_driver.driver, > + &driver_attr_reset_device); > pci_unregister_driver(&xen_pcibk_pci_driver); > } > > @@ -1572,6 +1872,9 @@ static int __init pcistub_init(void) > if (!err) > err = driver_create_file(&xen_pcibk_pci_driver.driver, > &driver_attr_irq_handler_state); > + if (!err) > + err = driver_create_file(&xen_pcibk_pci_driver.driver, > + &driver_attr_reset_device); > if (err) > pcistub_exit(); > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > https://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.