Re: [Xen-devel] Revisit VT-d asynchronous flush issue
On 02/11/15 08:03, Tian, Kevin wrote:
> Let's start a new thread with a summary of previous discussion, and
> then our latest experiment data and updated proposal.
>
> From previous discussions, it's suggested that a spin model is accepted,
> only when spin timeout doesn't exceed the order of a scheduling time
> slice, or other blocking operations like what WBINVD might take.
> Otherwise async-flush model is preferred to prevent misbehaving guests
> taking long spins if possible, to impact whole system.
>
> Below are some thresholds to be considered:
>
> 1) scheduling time slice in Credit is 1ms.

1ms is the minimum scheduling timeslice. 30ms is the current default
(not that this should affect the following reasoning).

>
> 2) WBINVD cost is 4.6ms in worst case on an IVT platform (32 cores,
> 10GB NIC assigned to the VM, running iperf). Detail data is append in
> the bottom. Actual cost varies on different platforms, due to different
> cache size/layout. For example, we also heard from other colleagues
> about 10ms level cost on another platform.
>
> 3) PCI SIG strongly recommends that Completion Timeout mechanism
> not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
> 2 Register). It means CPU MMIO read might already take >10ms which
> we just didn't note.
>
> Based on above information, at least we can think a timeout range
> between [1ms, 10ms] would likely not introduce bad system behavior.
> Or conservatively, we can define the spin timeout default as 1ms,
> while allowing boot-time override up to 10ms for more flexibility.
>
> Then regarding to VT-d flush:
>
> - For context/iotlb/iec flush, our measurements show worst cases
> <10us. We also confirmed with hardware team, that 1ms is large
> enough for IOMMU internal flush.
>
> - For ATS device-TLB flush, PCI spec defines up to 60s, but:
>
> * Our hardware team confirms that 1ms should be enough for
> integrated PCI devices w/ ATS.
>
> * for discrete PCI devices w/ ATS, it's uncertain whether 1ms
> or 10ms is too restrictive to them, but there are only a few devices
> now in the market.
>
> Based on above information, we propose to continue spin-timeout
> model w/ some adjustment, which fixes current timeout concern
> and also allows limited ATS support in a light way:
>
> 1) reduce spin timeout to 1ms, which can be boot-time changed
> up to 10ms.

If this is going to be configurable on the command line, don't impose an
upper limit. Given the uncertainty with external devices, it might be
necessary to experiment with timeouts greater than 10ms.

>
> 2) if timeout expires, kill the VM which the target device is assigned
> to. Optionally hypervisor may mark device non-assignable.
>
> It works for devices w/o ATS. It works for integrated devices w/ ATS.
> It might or might not work for discrete devices w/ ATS, but we can
> re-evaluate the gain vs. software complexity of async flush until we
> see many discrete devices breaking the timeout assumptions in the
> future.
>
> Thoughts?

As presented, this is probably an improvement, but I am concerned about
the case of external devices. Then again, as none of this currently works
at all, we would not be in a worse state.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel