
Re: [Xen-devel] Enabling VT-d PI by default



On 27/04/17 08:08, Jan Beulich wrote:
>>>> On 26.04.17 at 19:11, <george.dunlap@xxxxxxxxxx> wrote:
>> On 18/04/17 07:24, Tian, Kevin wrote:
>>>> From: Gao, Chao
>>>> Sent: Monday, April 17, 2017 4:14 AM
>>>>
>>>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>>>>>>>> On 11.04.17 at 02:59, <chao.gao@xxxxxxxxx> wrote:
>>>>>> As you know, with VT-d PI enabled, hardware can directly deliver external
>>>>>> interrupts to a guest without any VMM intervention. This reduces overall
>>>>>> interrupt latency to the guest and the overheads otherwise incurred by the
>>>>>> VMM for virtualizing interrupts. In my mind, it's an important feature for
>>>>>> interrupt virtualization.
>>>>>>
>>>>>> But the VT-d PI feature is disabled by default on Xen because of some
>>>>>> corner cases and bugs. Based on Feng's work, we have fixed those corner
>>>>>> cases related to VT-d PI. Do you think it is time to enable VT-d PI by
>>>>>> default? If not, could you list your concerns so that we can resolve them?
>>>>> I don't recall you addressing the main issue (blocked vCPU-s list
>>>>> length; see the comment next to the iommu_intpost definition).
>>>>>
>>>> Indeed. I have gone through the discussion that happened in April 2016 [1, 2].
>>>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
>>>> [2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567
>>>>
>>>> First of all, I admit this is an issue in extreme cases and we should
>>>> come up with a solution.
>>>>
>>>> The problem we are facing is:
>>>> There is a per-cpu list used to maintain all the blocked vCPUs on a
>>>> pCPU.  When a wakeup interrupt comes, the interrupt handler traverses
>>>> the list to wake the vCPUs whose pi_desc indicates an interrupt has
>>>> been posted.  There is no policy to restrict the size of the list, so
>>>> in some extreme cases the list can get long enough to cause problems
>>>> (the most obvious one being interrupt latency).
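
[For reference, the mechanism described above looks roughly like the
following. This is a simplified sketch built on Xen-style primitives, not
the actual implementation; names such as pi_blocked_vcpu, blocked_vcpu_list
and the handler signature are illustrative stand-ins.]

    /* Simplified sketch of the per-pCPU blocked list and its wakeup handler.
     * Not the actual Xen code; names are illustrative. */
    struct pi_blocked_vcpu {
        struct list_head list;      /* entry in the per-pCPU blocked list */
        struct pi_desc  *pi_desc;   /* posted-interrupt descriptor of the vCPU */
        struct vcpu     *vcpu;
    };

    /* One list (plus lock) per pCPU; every blocked vCPU of a domain with
     * assigned devices is queued on the list of the pCPU it blocked on. */
    static DEFINE_PER_CPU(struct list_head, blocked_vcpu_list);
    static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_lock);

    /* Handler for the shared wakeup vector: it cannot tell which vCPU the
     * posted interrupt was meant for, so it has to walk the whole per-pCPU
     * list and test each vCPU's outstanding-notification (ON) bit. */
    static void pi_wakeup_handler(void)
    {
        unsigned int cpu = smp_processor_id();
        struct pi_blocked_vcpu *b, *tmp;

        spin_lock(&per_cpu(blocked_vcpu_lock, cpu));
        list_for_each_entry_safe ( b, tmp, &per_cpu(blocked_vcpu_list, cpu), list )
        {
            if ( pi_test_on(b->pi_desc) )   /* interrupt posted for this vCPU? */
            {
                list_del(&b->list);
                vcpu_unblock(b->vcpu);      /* wake it */
            }
        }
        spin_unlock(&per_cpu(blocked_vcpu_lock, cpu));
    }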
>>>>
>>>> The theoretical maximum number of entries in the list is 4M, as one host
>>>> can have 32k domains and every domain can have 128 vCPUs (32k x 128 = 4M).
>>>> If all the vCPUs are blocked on one list, the list reaches this maximum.
>>>>
>>>> The root cause of this issue, I think, is that the wakeup interrupt
>>>> vector is shared by all the vCPUs on one pCPU. Lacking enough
>>>> information (such as which device sent the interrupt or which IRTE
>>>> translated it), there is no effective way to identify the interrupt's
>>>> destination vCPU other than traversing this list. Right?  So we can
>>>> only mitigate this issue by decreasing or limiting the maximum number
>>>> of entries in one list.
>>>>
>>>> Several methods we could take to mitigate this issue (a rough sketch of
>>>> #1 and #3 is given after the list):
>>>> 1. According to your discussions, evenly distributing all the blocked
>>>> vCPUs among all pCPUs can mitigate this issue. With this approach,
>>>> having all vCPUs blocked on a single list can be avoided, decreasing
>>>> the maximum number of entries in one list by a factor of N (N being
>>>> the number of pCPUs).
>>>>
>>>> 2. Don't put blocked vCPUs which won't be woken by the wakeup
>>>> interrupt into the per-cpu list. Currently, we put onto the list the
>>>> blocked vCPUs belonging to domains that have assigned devices. But if
>>>> a blocked vCPU of such a domain is not the destination of any posted-
>>>> format IRTE, it needn't be added to the per-cpu list; it will be woken
>>>> by IPIs or other virtual interrupts instead. From this aspect, we can
>>>> decrease the number of entries in the per-cpu list.
>>>>
>>>> 3. Like what we do for struct irq_guest_action_t, we could limit the
>>>> maximum number of entries supported in the list. With this approach,
>>>> during domain creation, we calculate the available entries and compare
>>>> them with the domain's vCPU count to decide whether the domain can use
>>>> VT-d PI.
>>> VT-d PI is global instead of per-domain. I guess you actually mean
>>> failing the device assignment operation if counting the new domain's
>>> #vCPUs exceeds the limit.
>>>
>>>> This method would pose a strict restriction on the maximum number of
>>>> entries in one list, but it may affect vCPU hotplug.
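
[Purely as an illustration of the shape of #1 and #3, a rough sketch
follows. Everything here is hypothetical: pi_blocked_count, PI_LIST_LIMIT,
pi_choose_blocking_cpu() and pi_can_assign_device() are made-up names, the
real accounting would need proper locking, and the budget check reflects
Kevin's correction that it is the device assignment, not PI itself, that
would be refused.]

    /* Hypothetical sketch of methods #1 and #3, not real Xen code. */

    #define PI_LIST_LIMIT 1024          /* illustrative per-pCPU cap */

    static DEFINE_PER_CPU(unsigned int, pi_blocked_count);

    /* Method #1: rather than always queueing a blocking vCPU on the pCPU it
     * last ran on, queue it on the online pCPU with the shortest blocked
     * list, so that no single list grows out of proportion. */
    static unsigned int pi_choose_blocking_cpu(void)
    {
        unsigned int cpu, best = smp_processor_id();

        for_each_online_cpu ( cpu )
            if ( per_cpu(pi_blocked_count, cpu) < per_cpu(pi_blocked_count, best) )
                best = cpu;

        return best;
    }

    /* Method #3, with Kevin's correction: PI stays globally enabled; what
     * gets refused is the device assignment, if this domain's vCPUs could
     * push the lists past the overall budget. */
    static bool pi_can_assign_device(const struct domain *d)
    {
        unsigned int cpu, used = 0;

        for_each_online_cpu ( cpu )
            used += per_cpu(pi_blocked_count, cpu);

        return used + d->max_vcpus <= num_online_cpus() * PI_LIST_LIMIT;
    }

[A static budget of this kind is also where the vCPU hotplug concern
mentioned above comes from.]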
>>>>
>>>> Based on your intuition, which of these methods are feasible and
>>>> acceptable? I will attempt to mitigate this issue per your advice.
>>>>
>>> My understanding is that we need them all. #1 is the baseline,
>>> with #2/#3 as further optimization. :-)
>> Actually, regarding #2, is that the case?
>>
>> If we do reference counting (as in patches 3 and 4 of Chao Gao's recent
>> series), then we are guaranteed never to have more vcpus on any given
>> wakeup list than there are machine IRQs on the system.  Are we ever
>> going to have a system with so many IRQs that going through such a list
>> would be problematic?
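
[The patches George refers to are not quoted in this thread, so the
following is only a sketch of the general reference-counting idea behind
#2, with hypothetical names: a vCPU goes onto a wakeup list only while at
least one posted-format IRTE can target it.]

    /* Sketch of the reference-counting idea behind #2 (hypothetical names). */
    struct pi_refcount {
        atomic_t nr_posted_irtes;   /* posted-format IRTEs targeting this vCPU */
    };

    /* Called when an IRTE is switched to posted format targeting the vCPU
     * this counter belongs to. */
    static void pi_irte_bind(struct pi_refcount *r)
    {
        atomic_inc(&r->nr_posted_irtes);
    }

    /* Called when that IRTE is torn down or retargeted elsewhere. */
    static void pi_irte_unbind(struct pi_refcount *r)
    {
        atomic_dec(&r->nr_posted_irtes);
    }

    /* At block time: only vCPUs with a non-zero count need to sit on a
     * per-pCPU wakeup list; any other vCPU will be woken by an IPI or a
     * virtual interrupt anyway.  Since each listed vCPU is backed by at
     * least one posted-format IRTE, the total number of listed vCPUs is
     * bounded by the number of machine IRQs. */
    static bool pi_needs_wakeup_list(struct pi_refcount *r)
    {
        return atomic_read(&r->nr_posted_irtes) != 0;
    }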
> I'm afraid this is not impossible, considering that people have already
> run into the interrupt vector limitation coming from there only being
> about 200 vectors per CPU (and there not being, in physical mode,
> any sharing of vectors between multiple CPUs, iirc). Devices using
> MSI-X in particular can use an awful lot of vectors. Perhaps Andrew
> remembers numbers observed on actual systems here...

Citrix NetScaler SDX boxes have more MSI-X interrupts than fit in the
cumulative IDTs of a top-end dual-socket Xeon server system.  Some of
the device drivers are purposefully designed to use fewer interrupts
than they otherwise would want to.

Using PI is the proper long-term solution, because doing so would remove
any need to allocate IDT vectors for these interrupts; the IOMMU could be
programmed to dump device vectors straight into the PI block without
them ever going through Xen's IDT.

However, fixing that requires rewriting Xen's interrupt remapping
handling so that it doesn't rewrite the cpu/vector in every interrupt
source, and instead only rewrites the interrupt remapping table.
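
[A very rough picture of why this works: a posted-format IRTE carries the
guest's virtual vector and a pointer to the target vCPU's posted-interrupt
descriptor, so the device's MSI message only ever names an IRTE index. The
struct below is a simplified sketch whose field names loosely follow Xen's
struct iremap_entry; the widths and positions are not the real layout, the
VT-d specification has the authoritative format.]

    /* Simplified sketch of a posted-format IRTE.  Field names loosely follow
     * Xen's struct iremap_entry; widths/positions are illustrative only. */
    struct posted_irte_sketch {
        uint64_t present : 1;
        uint64_t im      : 1;    /* IRTE mode: 1 = posted, 0 = remapped */
        uint64_t vector  : 8;    /* guest's virtual vector, never an IDT vector */
        uint64_t pda_l   : 26;   /* posted-interrupt descriptor address, low  */
        uint32_t sid;            /* requester (source) id for validation */
        uint32_t pda_h;          /* posted-interrupt descriptor address, high */
    };

    /* Because the device's MSI message only names an IRTE index, moving or
     * retargeting the interrupt means rewriting just this entry (its virtual
     * vector and/or the descriptor it points at), without touching the
     * cpu/vector programmed into the device and without consuming an entry
     * in Xen's IDT. */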

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

