Re: [Xen-devel] [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
> On January 20, 2016 at 7:29 pm, <JBeulich@xxxxxxxx> wrote:
> >>> On 20.01.16 at 11:26, <quan.xu@xxxxxxxxx> wrote:
> >> On January 15, 2016 at 9:10, <JBeulich@xxxxxxxx> wrote:
> >> >>> On 23.12.15 at 09:25, <quan.xu@xxxxxxxxx> wrote:
> >> > @@ -229,6 +239,63 @@ int qinval_device_iotlb(struct iommu *iommu,
> >> >      return 0;
> >> >  }
> >> >
> >> > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> >> > +                                         u16 seg, u8 bus, u8 devfn)
> >> > +{
> >> > +    struct domain *d;
> >> > +    struct pci_dev *pdev;
> >> > +
> >> > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> >> > +    ASSERT(d);
> >>
> >> Don't you need to acquire some lock in order to safely assert this?
> >
> > Referring to the other use cases of 'rcu_lock_domain_by_id()', Xen
> > checks whether the domain is 'NULL'.
> > Could I also replace the 'ASSERT(d)' with
> > +    if ( d == NULL )
> > +        return;
> > ?
>
> At a first glance this doesn't look right, but in the end that's
> something you need to answer.

Is it quite similar to the question of whether the domain has already
been destroyed while the Device-TLB is being flushed? Correct me if I
still don't get your point.

> >> Also note that unused slots hold zero, i.e. there's at least a
> >> theoretical risk of problems here when you don't also look at
> >> iommu->domid_bitmap.
> >>
> > I am not clear on how to fix this point. Do you have a good idea?
> > Add a lock to 'iommu->domid_bitmap'?
>
> How would a lock help avoid mistaking an unused slot to mean Dom0? As
> already suggested - I think you simply need to consult the bitmap
> along with the domain ID array.

So the fix is to validate 'did' against iommu->domid_bitmap before
getting the domain id with iommu->domid_map[did]?
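For example, a rough sketch of what I understand you to mean (the
did_to_domain() helper name is only illustrative, not part of the
patch):

static struct domain *did_to_domain(struct iommu *iommu, u16 did)
{
    /*
     * Unused domid_map slots hold zero, which would otherwise be
     * mistaken for Dom0, so reject DIDs the bitmap marks as unused.
     */
    if ( !test_bit(did, iommu->domid_bitmap) )
        return NULL;

    /* May still be NULL if the domain is already being destroyed. */
    return rcu_lock_domain_by_id(iommu->domid_map[did]);
}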
> >> > +    {
> >> > +        list_del(&pdev->domain_list);
> >>
> >> This should happen under pcidevs_lock - you need to either acquire
> >> it or ASSERT() it being held.
> >>
> >
> > I should check whether it is under pcidevs_lock -- with
> > spin_is_locked(&pcidevs_lock).
> > If it is not under pcidevs_lock, I should acquire it.
> > I think ASSERT() is not a good idea. The hypervisor acquires this
> > lock and then removes the resource.
>
> I don't understand this last sentence.

For example, in pci_remove_device():

    spin_lock(&pcidevs_lock);
    ...
    iommu_remove_device();
    ...
    spin_unlock(&pcidevs_lock);

The Device-TLB flush may fail inside iommu_remove_device(), in which
case the timeout handler is already running under pcidevs_lock. So I
need to check whether pcidevs_lock is held; if not, I need to acquire
it.

> >> > @@ -349,9 +416,18 @@ static int flush_iotlb_qi(
> >> >          queue_invalidate_iotlb(iommu,
> >> >              type >> DMA_TLB_FLUSH_GRANU_OFFSET, dr,
> >> >              dw, did, size_order, 0, addr);
> >> > +
> >> > +        /*
> >> > +         * Synchronize with hardware for invalidation request descriptors
> >> > +         * submitted before Device-TLB invalidate descriptor.
> >> > +         */
> >> > +        rc = invalidate_sync(iommu);
> >> > +        if ( rc )
> >> > +            return rc;
> >> > +
> >> >          if ( flush_dev_iotlb )
> >> >              ret = dev_invalidate_iotlb(iommu, did, addr, size_order, type);
> >> > -        rc = invalidate_sync(iommu);
> >> > +
> >> >          if ( !ret )
> >> >              ret = rc;
> >> >      }
> >>
> >> This change is because of ...?
> >>
> > As per Kevin's comments,
> > I added 'dev_invalidate_iotlb_sync()' to dev_invalidate_iotlb().
> > Then I can know which device's flush timed out.
>
> I don't understand: How is your reply related to you moving up the
> invocation of invalidate_sync()?

This invalidate_sync() synchronizes with hardware for the invalidation
request descriptors submitted before the Device-TLB invalidate
descriptor. If I didn't add this invalidate_sync(),
dev_invalidate_iotlb_timeout() could not tell whether the timeout was
caused by the Device-TLB flush or by an error in one of the preceding
iotlb/iec/context flushes.

> >> Plus logging the error code and device coordinates might turn out
> >> helpful in such cases. But first of all - if there was a timeout,
> >> we'd make it here only for Dom0. Perhaps the printing should move
> >> into an else to the domain_crash()?
> >> And if there was another error, the message would be outright wrong.
> >>
> > IMO, the print for Dom0 is enough.
> > I think it is not a good idea to move it into an else to
> > domain_crash(). domain_crash() is quite a common path.
> > Anyway, I can improve it at low priority.
>
> At the very least the message should not end up being actively
> misleading.

What about the below?

+    printk(XENLOG_ERR
+           "Device %04x:%02x:%02x flush failed.",
+           seg, bus, devfn);

Quan
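P.S.: To be sure I've understood all of the above, the timeout handler
would then look roughly like the below. This is only a sketch of my
current understanding, reusing the illustrative did_to_domain() helper
from above, with an illustrative message text; it is not the final
patch:

static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
                                         u16 seg, u8 bus, u8 devfn)
{
    struct domain *d;
    struct pci_dev *pdev;
    int locked;

    /* Consult domid_bitmap along with domid_map (see above). */
    d = did_to_domain(iommu, did);
    if ( d == NULL )
        return;

    /*
     * The handler can be reached both with and without pcidevs_lock
     * held (e.g. via pci_remove_device() -> iommu_remove_device()).
     * Note that spin_is_locked() only says *some* CPU holds the lock,
     * so this check is as proposed above, not a general solution.
     */
    locked = spin_is_locked(&pcidevs_lock);
    if ( !locked )
        spin_lock(&pcidevs_lock);

    for_each_pdev(d, pdev)
        if ( pdev->seg == seg && pdev->bus == bus && pdev->devfn == devfn )
        {
            /* Hide the device, as in the hunk quoted at the top. */
            list_del(&pdev->domain_list);
            break;
        }

    if ( !locked )
        spin_unlock(&pcidevs_lock);

    printk(XENLOG_ERR
           "Device-TLB flush timed out for device %04x:%02x:%02x.%u\n",
           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));

    if ( !is_hardware_domain(d) )
        domain_crash(d);

    rcu_unlock_domain(d);
}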