
Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.



>>> On 20.01.16 at 11:26, <quan.xu@xxxxxxxxx> wrote:
>> On January 15, 2016 at 9:10, <JBeulich@xxxxxxxx> wrote:
>> >>> On 23.12.15 at 09:25, <quan.xu@xxxxxxxxx> wrote:
>> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
>> >      return 0;
>> >  }
>> >
>> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
>> > +                                         u16 seg, u8 bus, u8 devfn)
>> > +{
>> > +    struct domain *d;
>> > +    struct pci_dev *pdev;
>> > +
>> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
>> > +    ASSERT(d);
>> 
>> Don't you need to acquire some lock in order to safely assert this?
> 
> Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen checks
> whether the domain is NULL.
> Could I also replace the 'ASSERT(d)' with
>           + if ( d == NULL )
>           +     return;
> ?

At first glance this doesn't look right, but in the end that's
something you need to answer.

>> Also note that unused slots hold zero, i.e. there's at least a theoretical
>> risk of problems here when you don't also look at iommu->domid_bitmap.
>> 
> I am not clear on how to fix this point. Do you have a good idea?
> Add a lock to 'iommu->domid_bitmap'?

How would a lock help avoid mistaking an unused slot for Dom0?
As already suggested - I think you simply need to consult the
bitmap along with the domain ID array.
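Something along these lines is what I have in mind - an untested sketch
only, relying on the existing iommu->domid_bitmap / iommu->domid_map
fields and test_bit():

    static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
                                             u16 seg, u8 bus, u8 devfn)
    {
        struct domain *d;

        /*
         * Unused slots in domid_map hold zero, which would otherwise be
         * mistaken for Dom0, so only trust the map when the bitmap marks
         * the slot as in use.
         */
        if ( !test_bit(did, iommu->domid_bitmap) )
            return;

        d = rcu_lock_domain_by_id(iommu->domid_map[did]);
        if ( d == NULL )
            return;

        /* ... handle the timed-out device ... */

        rcu_unlock_domain(d);
    }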

>> > +            {
>> > +                list_del(&pdev->domain_list);
>> 
>> This should happen under pcidevs_lock - you need to either acquire it or
>> ASSERT() it being held.
>> 
> 
> I should check whether it is under pcidevs_lock, using
> spin_is_locked(&pcidevs_lock).
> If it is not under pcidevs_lock, I should acquire it.
> I think ASSERT() is not a good idea. The hypervisor acquires this lock and
> then removes the resource.

I don't understand this last sentence.
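For reference, the two options from my earlier comment would look roughly
like this (untested sketch; whether simply asserting is sufficient depends
on the call paths into this function):

    /* Option 1: require every caller to already hold the lock. */
    ASSERT(spin_is_locked(&pcidevs_lock));
    list_del(&pdev->domain_list);

    /* Option 2: take the lock here (only valid if no caller holds it yet). */
    spin_lock(&pcidevs_lock);
    list_del(&pdev->domain_list);
    spin_unlock(&pcidevs_lock);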

>> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
>> >          queue_invalidate_iotlb(iommu,
>> >                                 type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
>> >                                 dw, did, size_order, 0, addr);
>> > +
>> > +        /*
>> > +         * Synchronize with hardware for invalidation request descriptors
>> > +         * submitted before Device-TLB invalidate descriptor.
>> > +         */
>> > +        rc = invalidate_sync(iommu);
>> > +        if ( rc )
>> > +             return rc;
>> > +
>> >          if ( flush_dev_iotlb )
>> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
>> > -        rc = invalidate_sync(iommu);
>> > +
>> >          if ( !ret )
>> >              ret = rc;
>> >      }
>> 
>> This change is because of ...?
>> 
> As per Kevin's comments,
> I am adding 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
> Then I can know which device's flush timed out.

I don't understand: How is your reply related to you moving up the
invocation of invalidate_sync()?
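For reference, the flow the hunk above appears to aim at (an untested
sketch based purely on that hunk; dev_invalidate_iotlb_sync() is only the
helper name taken from your reply):

        queue_invalidate_iotlb(iommu, type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
                               dw, did, size_order, 0, addr);

        /* Wait for all descriptors queued so far before touching Device-TLBs. */
        rc = invalidate_sync(iommu);
        if ( rc )
            return rc;

        if ( flush_dev_iotlb )
            /*
             * Each Device-TLB invalidation is then synced individually inside
             * dev_invalidate_iotlb() (via dev_invalidate_iotlb_sync()), so a
             * timeout can be attributed to a specific device.
             */
            ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);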

>> Plus logging the error code and device coordinates might turn out helpful in
>> such cases. But first of all - if there was a timeout, we'd make it here only
>> for Dom0. Perhaps the printing should move into an else to the domain_crash()?
>> And if there was another error, the message would be outright wrong.
>> 
> IMO, the print for Dom0 is enough.
> I think it is not a good idea to move it into an else to domain_crash();
> domain_crash() is quite a common code path.
> Anyway, I can improve it at low priority.

At the very least the message should not end up being actively
misleading.
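One possible shape for such a message, purely as an illustration (the
availability of an error code variable at that point is an assumption):

    /* 'rc' stands for whatever error code is available at this point. */
    if ( !is_hardware_domain(d) )
        domain_crash(d);
    else
        printk(XENLOG_WARNING VTDPREFIX
               " Device-TLB flush failed (%d) for %04x:%02x:%02x.%u\n",
               rc, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));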

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

