
Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling



On 09/03/16 05:22, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxx]
>> Sent: Wednesday, March 9, 2016 1:06 AM
>> To: Jan Beulich <JBeulich@xxxxxxxx>; George Dunlap
>> <George.Dunlap@xxxxxxxxxxxxx>; Wu, Feng <feng.wu@xxxxxxxxx>
>> Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>; Dario Faggioli
>> <dario.faggioli@xxxxxxxxxx>; Tian, Kevin <kevin.tian@xxxxxxxxx>; xen-
>> devel@xxxxxxxxxxxxx; Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>; Keir
>> Fraser <keir@xxxxxxx>
>> Subject: Re: [Xen-devel] Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt
>> core logic handling
>>
>> On 08/03/16 15:42, Jan Beulich wrote:
>>>>>> On 08.03.16 at 15:42, <George.Dunlap@xxxxxxxxxxxxx> wrote:
>>>> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng.wu@xxxxxxxxx> wrote:
>>>>>> -----Original Message-----
>>>>>> From: George Dunlap [mailto:george.dunlap@xxxxxxxxxx]
>>>> [snip]
>>>>>> It seems like there are a couple of ways we could approach this:
>>>>>>
>>>>>> 1. Try to optimize the reverse look-up code so that it's not a linear
>>>>>> linked list (getting rid of the theoretical fear)
>>>>>
>>>>> Good point.
>>>>>
>>>>>>
>>>>>> 2. Try to test engineered situations where we expect this to be a
>>>>>> problem, to see how big of a problem it is (proving the theory to be
>>>>>> accurate or inaccurate in this case)
>>>>>
>>>>> Maybe we can run an SMP guest with all the vcpus pinned to a dedicated
>>>>> pCPU, run some benchmark in the guest with VT-d PI and without VT-d PI,
>>>>> and then see the performance difference between these two scenarios.
>>>>
>>>> This would give us an idea of what the worst-case scenario would be.
>>>
>>> How would a single VM ever give us an idea about the worst
>>> case? Something getting close to worst case is a ton of single
>>> vCPU guests all temporarily pinned to one and the same pCPU
>>> (could be multi-vCPU ones, but the more vCPU-s the more
>>> artificial this pinning would become) right before they go into
>>> blocked state (i.e. through one of the two callers of
>>> arch_vcpu_block()), the pinning removed while blocked, and
>>> then all getting woken at once.
>>
>> Why would removing the pinning be important?
>>
>> And I guess it's actually the case that we don't need all VMs to
>> actually be *receiving* interrupts; they just need to be *capable* of
>> receiving interrupts for there to be a long chain all blocked on the
>> same physical cpu.
>>
>>>
>>>>  But
>>>> pinning all vcpus to a single pcpu isn't really a sensible use case we
>>>> want to support -- if you have to do something stupid to get a
>>>> performance regression, then as far as I'm concerned it's not a
>>>> problem.
>>>>
>>>> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
>>>> then pound them all with posted interrupts, and there is *no*
>>>> significant performance regression, then that will conclusively prove
>>>> that the theoretical performance regression is of no concern, and we
>>>> can enable PI by default.
>>>
>>> The point isn't the pinning. The point is what pCPU they're on when
>>> going to sleep. And that could involve quite a few more than just
>>> 10 vCPU-s, provided they all sleep long enough.
>>>
>>> And the "theoretical performance regression is of no concern" is
>>> also not a proper way of looking at it, I would say: Even if such
>>> a situation would happen extremely rarely, if it can happen at all,
>>> it would still be a security issue.
>>
>> What I'm trying to get at is -- exactly what situation?  What actually
>> constitutes a problematic interrupt latency / interrupt processing
>> workload, how many vcpus must be sleeping on the same pcpu to actually
>> risk triggering that latency / workload, and how feasible is it that
>> such a situation would arise in a reasonable scenario?
>>
>> If 200us is too long, and it only takes 3 sleeping vcpus to get there,
>> then yes, there is a genuine problem we need to try to address before we
>> turn it on by default.  If we say that up to 500us is tolerable, and it
>> takes 100 sleeping vcpus to reach that latency, then this is something I
>> don't really think we need to worry about.
>>
>> "I think something bad may happen" is a really difficult to work with.
>> "I want to make sure that even a high number of blocked cpus won't cause
>> the interrupt latency to exceed 500us; and I want it to be basically
>> impossible for the interrupt latency to exceed 5ms under any
>> circumstances" is a concrete target someone can either demonstrate that
>> they meet, or aim for when trying to improve the situation.
>>
>> Feng: It should be pretty easy for you to:
> 
> George, thanks a lot for pointing out a possible way to move forward.
> 
>> * Implement a modified version of Xen where
>>  - *All* vcpus get put on the waitqueue
> 
> So this means all the vcpus are blocked, and hence waiting on the
> blocking list, right?

No.

For testing purposes, we need a lot of vcpus on the list, but we only
need one vcpu to actually be woken up to see how long it takes to
traverse the list.

At the moment, a vcpu will only be put on the list if it has the
arch_block callback defined; and it will have the arch_block callback
defined only if the domain it's a part of has a device assigned to it.
But it would be easy enough to make it so that *all* VMs have the
arch_block callback defined; then all vcpus would end up on the
pi_blocked list when they're blocked, even if they don't have a device
assigned.

That way you could have a really long pi_blocked list while only needing
a single device to pass through to the guest.
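
To be concrete, a minimal (untested) sketch of what I mean is below:
install the PI hooks for every VMX domain at initialisation time when
posted interrupts are available, rather than only on device assignment.
The helper, flag, and function names (vmx_pi_hooks_assign(),
iommu_intpost, vmx_domain_initialise()) are the ones I believe this
series uses, but treat them as assumptions and adjust to whatever your
tree actually calls them.  Something like the following, added at the
end of vmx_domain_initialise():

    /*
     * Testing-only hack: give *every* domain the blocking hook, so that
     * all of its vCPUs land on the per-pCPU blocking list when they
     * block, even with no device assigned.  Names are assumptions from
     * this series; adjust as needed.
     */
    if ( iommu_intpost )
        vmx_pi_hooks_assign(d);   /* normally only done on device assignment */

(The matching vmx_pi_hooks_deassign() on domain destruction would need
the same treatment, of course.)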

>>  - Measure how long it took to run the loop in pi_wakeup_interrupt
>> * Have one VM receiving posted interrupts on a regular basis.
>> * Slowly increase the number of vcpus blocked on a single cpu (e.g., by
>> creating more guests), stopping when you either reach 500us or 500
>> vcpus. :-)
> 
> This may depend on the environment. I was using a 10G NIC to do the
> test; if we increase the number of guests, I will need more NICs to
> assign to the guests. I will see if I can get them.

...which is why I suggested setting the arch_block() callback for all
domains, even those which don't have devices assigned, so that you could
get away with a single passed-through device. :-)
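
As for the measurement itself, a rough sketch of the kind of
instrumentation I have in mind is below: timestamp the list walk in
pi_wakeup_interrupt() with NOW() and report the worst case seen on each
pCPU.  The overall shape of the handler follows this series, but the
per-CPU variable and field names (vmx_pi_blocking, pi_blocking.list,
pi_desc, and the added pi_wakeup_max_ns) are assumptions on my part, so
they will almost certainly need adjusting to match your tree:

    static DEFINE_PER_CPU(s_time_t, pi_wakeup_max_ns);

    static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
    {
        struct arch_vmx_struct *vmx, *tmp;
        unsigned int cpu = smp_processor_id();
        spinlock_t *lock = &per_cpu(vmx_pi_blocking, cpu).lock;
        struct list_head *blocked_vcpus = &per_cpu(vmx_pi_blocking, cpu).list;
        s_time_t start, elapsed;

        ack_APIC_irq();

        spin_lock(lock);

        start = NOW();    /* ns-resolution system time */

        /*
         * This walk is what the latency target is about: its cost grows
         * with the number of vCPUs currently blocked on this pCPU.
         */
        list_for_each_entry_safe(vmx, tmp, blocked_vcpus, pi_blocking.list)
        {
            if ( pi_test_on(&vmx->pi_desc) )
            {
                list_del(&vmx->pi_blocking.list);
                vmx->pi_blocking.lock = NULL;
                vcpu_unblock(container_of(vmx, struct vcpu, arch.hvm_vmx));
            }
        }

        elapsed = NOW() - start;
        if ( elapsed > this_cpu(pi_wakeup_max_ns) )
        {
            this_cpu(pi_wakeup_max_ns) = elapsed;
            printk("pi_wakeup_interrupt: cpu%u worst-case list walk %"PRI_stime" ns\n",
                   cpu, elapsed);
        }

        spin_unlock(lock);
    }

That per-CPU maximum (or a histogram, if you want more detail) should
make it easy to correlate the measured walk time with the number of
blocked vcpus as you add guests.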

 -George



 

