Re: [Xen-devel] Revisit VT-d asynchronous flush issue

To: "Tian, Kevin" <kevin.tian@xxxxxxxxx>, "george.dunlap@xxxxxxxxxxxxx" <george.dunlap@xxxxxxxxxxxxx>, "tim@xxxxxxx" <tim@xxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, "Dong, Eddie" <eddie.dong@xxxxxxxxx>, "ian.campbell@xxxxxxxxxx" <ian.campbell@xxxxxxxxxx>, "jbeulich@xxxxxxxx" <jbeulich@xxxxxxxx>, "Nakajima, Jun" <jun.nakajima@xxxxxxxxx>, "keir@xxxxxxx" <keir@xxxxxxx>, "Zhang, Yang Z" <yang.z.zhang@xxxxxxxxx>, "Xu, Quan" <quan.xu@xxxxxxxxx>, "Sankaran, Rajesh" <rajesh.sankaran@xxxxxxxxx>
From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
Date: Mon, 2 Nov 2015 11:39:49 +0000
Delivery-date: Mon, 02 Nov 2015 11:40:07 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 02/11/15 08:03, Tian, Kevin wrote:
> Let's start a new thread with a summary of previous discussion, and 
> then our latest experiment data and updated proposal.
>
> From previous discussions, the suggestion was that a spin model is
> acceptable only when the spin timeout doesn't exceed the order of a
> scheduling time slice, or of other blocking operations such as WBINVD.
> Otherwise an async-flush model is preferred where possible, to prevent
> misbehaving guests from causing long spins that impact the whole system.
>
> Below are some thresholds to be considered:
>
> 1) scheduling time slice in Credit is 1ms.

1ms is the minimum scheduling timeslice.  30ms is the current default
(not that this should affect the following reasoning).

>
> 2) WBINVD cost is 4.6ms in the worst case on an IVT platform (32 cores,
> 10GB NIC assigned to the VM, running iperf). Detailed data is appended
> at the bottom. Actual cost varies across platforms, due to different
> cache size/layout. For example, we also heard from other colleagues
> about costs on the order of 10ms on another platform.
>
> 3) PCI SIG strongly recommends that Completion Timeout mechanism
> not expire in less than 10ms (PCIe 3.0 spec, 7.8.15, Device Capabilities
> 2 Register). This means a CPU MMIO read might already take >10ms,
> which we simply haven't noticed.
>
> Based on the above information, we think a timeout in the range
> [1ms, 10ms] is unlikely to introduce bad system behavior.
> Or, conservatively, we can set the default spin timeout to 1ms,
> while allowing a boot-time override up to 10ms for more flexibility.
>
> Then, regarding VT-d flush:
>
> - For context/iotlb/iec flush, our measurements show worst cases
> <10us. We also confirmed with the hardware team that 1ms is large
> enough for an IOMMU internal flush.
>
> - For ATS device-TLB flush, PCI spec defines up to 60s, but:
>
>       * Our hardware team confirms that 1ms should be enough for 
> integrated PCI devices w/ ATS.
>
>       * for discrete PCI devices w/ ATS, it's uncertain whether 1ms 
> or 10ms is too restrictive for them, but there are only a few such
> devices on the market now.
>
> Based on the above information, we propose to continue with the
> spin-timeout model w/ some adjustments, which addresses the current
> timeout concern and also allows limited ATS support in a lightweight way:
>
> 1) reduce the spin timeout to 1ms, which can be changed at boot time
> up to 10ms.

If this is going to be command line configurable, don't have an upper limit.

Given the uncertainty with external devices, it might be necessary to
experiment with timeouts greater than 10ms.
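
To make that concrete, here is a rough sketch in C of a bounded spin whose
timeout defaults to 1ms and accepts an override with no upper clamp. Every
name in it (parse_flush_timeout_us, spin_for_flush, struct fake_dev) is
hypothetical and not taken from Xen's VT-d code. The clock and the
completion check are passed in as callbacks purely so the sketch stays
self-contained, and what to do on timeout is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not Xen's actual VT-d code. */

#define FLUSH_TIMEOUT_DEFAULT_US 1000ULL /* 1ms default from the proposal */

/*
 * Parse an override given as a decimal number of microseconds.
 * Missing, malformed, zero or out-of-range input falls back to the
 * default. There is deliberately no upper clamp on valid values.
 */
static uint64_t parse_flush_timeout_us(const char *arg)
{
    char *end;
    unsigned long long val;

    if (arg == NULL || *arg < '0' || *arg > '9')
        return FLUSH_TIMEOUT_DEFAULT_US;

    errno = 0;
    val = strtoull(arg, &end, 10);
    if (errno != 0 || *end != '\0' || val == 0)
        return FLUSH_TIMEOUT_DEFAULT_US;

    return val;
}

enum flush_result { FLUSH_OK, FLUSH_TIMEOUT };

typedef bool (*flush_done_fn)(void *ctx);   /* has the flush completed? */
typedef uint64_t (*now_us_fn)(void *ctx);   /* monotonic time in us */

/* Spin until done() reports completion or timeout_us has elapsed. */
static enum flush_result spin_for_flush(flush_done_fn done, now_us_fn now,
                                        void *ctx, uint64_t timeout_us)
{
    uint64_t start;

    assert(done != NULL && now != NULL);

    start = now(ctx);
    while (!done(ctx)) {
        if (now(ctx) - start >= timeout_us)
            return FLUSH_TIMEOUT; /* caller decides the policy */
    }
    return FLUSH_OK;
}

/* Stand-in device for illustration: time advances 1us per clock read. */
struct fake_dev {
    uint64_t now_us;   /* simulated clock */
    uint64_t done_at;  /* simulated completion time */
};

static uint64_t fake_now(void *ctx)
{
    struct fake_dev *d = ctx;
    return d->now_us++;
}

static bool fake_done(void *ctx)
{
    const struct fake_dev *d = ctx;
    return d->now_us >= d->done_at;
}
```

With the fake device, a flush that completes after a few microseconds
returns FLUSH_OK, while one that would need 50ms hits FLUSH_TIMEOUT under
the 1ms default and succeeds if the override is raised to, say, 100ms.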

>
> 2) if the timeout expires, kill the VM to which the target device is
> assigned. Optionally, the hypervisor may mark the device non-assignable.
>
> It works for devices w/o ATS. It works for integrated devices w/ ATS.
> It might or might not work for discrete devices w/ ATS, but we can
> defer re-evaluating the gain vs. software complexity of async flush
> until we see many discrete devices breaking the timeout assumptions
> in the future.
>
> Thoughts?

As presented, this is probably an improvement, but I am concerned about
the case of external devices.

Then again, as none of this currently works at all, we are not in a
worse state.

~Andrew
