[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.



> From: Xu, Quan
> Sent: Wednesday, December 23, 2015 4:26 PM
> 
> Now if IOTLB/Context/IETC flush is timeout, panic hypervisor.
> The coming patch set will fix it.

again, please adjust comment to reflect what this patch is
doing, i.e. only about improvement on Device-TLB flush.

> 
> If Device-TLB flush is timeout, we'll hide the target ATS
> device and crash the domain owning this ATS device.
> 
> If impacted domain is hardware domain, just throw out a warning.
> 
> The hided Device will be disallowed to be further assigned to
> any domain.

hided Device -> hidden device

> 
> Signed-off-by: Quan Xu <quan.xu@xxxxxxxxx>
> ---
>  xen/drivers/passthrough/pci.c         |  2 +-
>  xen/drivers/passthrough/vtd/extern.h  |  2 +
>  xen/drivers/passthrough/vtd/qinval.c  | 80
> ++++++++++++++++++++++++++++++++++-
>  xen/drivers/passthrough/vtd/x86/ats.c | 13 ++++++
>  4 files changed, 94 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 27b3ca7..2d7dc59 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -407,7 +407,7 @@ static void _pci_hide_device(struct pci_dev *pdev)
>      list_add(&pdev->domain_list, &dom_xen->arch.pdev_list);
>  }
> 
> -int __init pci_hide_device(int bus, int devfn)
> +int pci_hide_device(int bus, int devfn)
>  {
>      struct pci_dev *pdev;
>      int rc = -ENOMEM;
> diff --git a/xen/drivers/passthrough/vtd/extern.h
> b/xen/drivers/passthrough/vtd/extern.h
> index ec9c513..a2123ce 100644
> --- a/xen/drivers/passthrough/vtd/extern.h
> +++ b/xen/drivers/passthrough/vtd/extern.h
> @@ -58,6 +58,8 @@ int ats_device(const struct pci_dev *, const struct 
> acpi_drhd_unit *);
> 
>  int dev_invalidate_iotlb(struct iommu *iommu, u16 did,
>                           u64 addr, unsigned int size_order, u64 type);
> +int dev_invalidate_iotlb_sync(struct iommu *iommu, u16 did,
> +                              u16 seg, u8 bus, u8 devfn);
> 
>  int qinval_device_iotlb(struct iommu *iommu,
>                          u32 max_invs_pend, u16 sid, u16 size, u64 addr);
> diff --git a/xen/drivers/passthrough/vtd/qinval.c 
> b/xen/drivers/passthrough/vtd/qinval.c
> index b227e4e..7330c5d 100644
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu *iommu,
>  static int invalidate_sync(struct iommu *iommu)
>  {
>      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> +    int rc;
> 
>      if ( qi_ctrl->qinval_maddr )
> -        return queue_invalidate_wait(iommu, 0, 1, 1);
> +    {
> +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> +        if ( rc )
> +        {
> +            if ( rc == -ETIMEDOUT )
> +                panic("Queue invalidate wait descriptor was timeout.\n");
> +            return rc;
> +        }
> +    }
> +

why do you still do panic here? w/ PATCH 1/3 you should return rc
to caller directly, right?

>      return 0;
>  }
> 
> @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>      return 0;
>  }
> 
> +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> +                                         u16 seg, u8 bus, u8 devfn)
> +{
> +    struct domain *d;
> +    struct pci_dev *pdev;
> +
> +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> +    ASSERT(d);
> +    for_each_pdev(d, pdev)
> +    {
> +        if ( (pdev->seg == seg) &&
> +             (pdev->bus == bus) &&
> +             (pdev->devfn == devfn) )
> +        {
> +            if ( pdev->domain )
> +            {
> +                list_del(&pdev->domain_list);
> +                pdev->domain = NULL;
> +            }
> +
> +            if ( pci_hide_device(bus, devfn) )
> +            {
> +                printk(XENLOG_ERR
> +                       "IOMMU hide device %04x:%02x:%02x error.",
> +                       seg, bus, devfn);
> +                break;
> +            }

I'm not sure whether using pci_hide_device is enough or not break
other code path here? For example, now you have errors propagated
to callers. What's the final policy against flush error? If they will
finally go to cleanup some more state relying on pdev->domain, then
that code path might be broken. If it's fine to do cleanup at this point,
then just hidden is not enough. If you look at pci_remove_device, there's
quite some state to be further cleaned up...

I didn't think about the full logic thoroughly now. But it would always be
good to not hide device now until a point where all related states have
been cleaned up in error handling path chained up.

Thanks
Kevin 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.