[Xen-devel] Revisit VT-d asynchronous flush issue
Let's start a new thread with a summary of the previous discussion, followed by our latest experiment data and an updated proposal.

From previous discussions, the suggestion was that a spin model is acceptable only when the spin timeout doesn't exceed the order of a scheduling time slice, or of other blocking operations such as WBINVD. Otherwise an async-flush model is preferred, so that misbehaving guests cannot cause long spins that impact the whole system.

Below are some thresholds to consider:

1) The scheduling time slice in Credit is 1ms.

2) WBINVD cost is 4.6ms in the worst case on an IVT platform (32 cores, 10GB NIC assigned to the VM, running iperf). Detailed data is appended at the bottom. Actual cost varies across platforms, due to different cache size/layout. For example, we also heard from other colleagues about 10ms-level cost on another platform.

3) PCI SIG strongly recommends that the Completion Timeout mechanism not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities 2 Register). This means a CPU MMIO read might already take >10ms, which we just hadn't noticed.

Based on the above information, we think a timeout in the range [1ms, 10ms] would likely not introduce bad system behavior. Or, conservatively, we can define the spin timeout default as 1ms, while allowing a boot-time override up to 10ms for more flexibility.

Then, regarding VT-d flush:

- For context/iotlb/iec flush, our measurements show worst cases <10us. We also confirmed with the hardware team that 1ms is large enough for IOMMU internal flush.

- For ATS device-TLB flush, the PCI spec defines up to 60s, but:
  * Our hardware team confirms that 1ms should be enough for integrated PCI devices w/ ATS.
  * For discrete PCI devices w/ ATS, it's uncertain whether 1ms or 10ms is too restrictive for them, but there are only a few such devices on the market now.
Based on the above information, we propose to continue with the spin-timeout model with some adjustments, which addresses the current timeout concern and also allows limited ATS support in a lightweight way:

1) Reduce the spin timeout to 1ms, which can be changed at boot time up to 10ms.

2) If the timeout expires, kill the VM to which the target device is assigned. Optionally, the hypervisor may mark the device non-assignable.

This works for devices w/o ATS. It works for integrated devices w/ ATS. It might or might not work for discrete devices w/ ATS, but we can defer re-evaluating the gain vs. software complexity of async flush until we see many discrete devices breaking the timeout assumptions in the future.

Thoughts?

----
<detail data>

           Min(us)   Max(us)   Average(us)
context       5.24      5.49          5.36
iotlb         1.90      2.07          2.03
iec           5.54      7.86          6.58
wbinvd     2721.42   4655.71       3571.43

Platform info:

1. Base Board Information
   Manufacturer: Intel Corporation
   Product Name: S2600CP
   Version: E99552-561

2. CPU:
   cpu family : 6
   model : 62
   model name : Genuine Intel(R) CPU @ 2.80GHz

Thanks
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel