
Re: [Xen-devel] [PATCH 0/4] mitigate the per-pCPU blocking list may be too long



>>> On 08.05.17 at 18:15, <chao.gao@xxxxxxxxx> wrote:
> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>> On 03.05.17 at 12:08, <george.dunlap@xxxxxxxxxx> wrote:
>>> On 02/05/17 06:45, Chao Gao wrote:
>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>> I compared the maximum number of entries in one list (#entry) and the
>>>>>> number of events adding an entry to a PI blocking list (#event), with
>>>>>> and without the latter three patches. Here is the result:
>>>>>> ------------------------------------------------------
>>>>>> |     Items      |  Maximum of #entry  |   #event    |
>>>>>> ------------------------------------------------------
>>>>>> | W/ the patches |          6          |    22740    |
>>>>>> | W/O the patches|         128         |    46481    |
>>>>>> ------------------------------------------------------
>>>>>
>>>>> Any chance you could trace how long the list traversal took?  It would
>>>>> be good for future reference to have an idea what kinds of timescales
>>>>> we're talking about.
>>>> 
>>>> Hi.
>>>> 
>>>> I made a simple test to get the time consumed by the list traversal.
>>>> Apply the patch below and create one hvm guest with 128 vcpus and a
>>>> passthrough 40 NIC. All guest vcpus are pinned to one pcpu. Collect
>>>> data with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it with
>>>> xentrace_format. When the list length is about 128, the traversal time
>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>> frequency is 1795.788MHz, therefore the time consumed is in the range
>>>> of 1us to 22us. If 0.5ms is the upper bound the system can tolerate,
>>>> at most 2900 vcpus can be added into the list.
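
For reference, the arithmetic behind those figures, assuming the traversal
cost scales linearly with the list length:

    39330 cycles / 1795.788 MHz ~= 21.9 us   (worst observed traversal)
     1750 cycles / 1795.788 MHz ~=  1.0 us   (best observed traversal)
    128 entries * (500 us / 22 us) ~= 2900 entries within the 0.5 ms budget
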
>>> 
>>> Great, thanks Chao Gao, that's useful.
>>
>>Looks like Chao Gao has been dropped ...
>>
>>>  I'm not sure a fixed latency --
>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>> to have interrupts staggered at 500us intervals it could easily lock up
>>> the cpu for nearly a full second.  But I'm having trouble formulating a
>>> good limit scenario.
>>> 
>>> In any case, 22us should be safe from a security standpoint*, and 128
>>> should be pretty safe from a "make the common case fast" standpoint:
>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>> traffic will be the least of your performance problems I should think.
>>> 
>>>  -George
>>> 
>>> * Waiting for Jan to contradict me on this one. :-)
>>
>>22us would certainly be fine, if this was the worst case scenario.
>>I'm not sure the value measured for 128 list entries can be easily
>>scaled to several thousands of them, due to cache and/or NUMA
>>effects. I continue to think that we primarily need theoretical
>>proof of an upper boundary on list length being enforced, rather
>>than any measurements or randomized balancing. And just to be
>>clear - if someone overloads their system, I do not see a need to
>>have a guaranteed maximum list traversal latency here. All I ask
>>for is that list traversal time scales with total vCPU count divided
>>by pCPU count.
> 
> Thanks, Jan & George.
> 
> I think it is now clearer to me what I should do next.
> 
> In my understanding, we should distribute the wakeup interrupts like
> this:
> 1. By default, distribute it to the local pCPU ('local' means the pCPU
> the vCPU is on) to make the common case fast.
> 2. When the list grows to a point where we think traversing it may
> consume too much time, still distribute the wakeup interrupt to the
> local pCPU, on the assumption that the admin has intentionally
> overloaded their system.
> 3. When the list length reaches the theoretical average maximum (i.e.
> the maximal vCPU count divided by the pCPU count), distribute the
> wakeup interrupt to another, underutilized pCPU.
> 
> But I am confused: if we don't care that someone overloads their
> system, why do we need stage #3? Without it, I have no idea how to meet
> Jan's request that the list traversal time scale with the total vCPU
> count divided by the pCPU count. Or will we reach stage #3 before
> stage #2?

The thing is that imo point 2 is too fuzzy to be of any use, i.e. point 3
should take effect immediately. We don't mean to ignore any admin decisions
here; it is just that if they overload their systems, the net effect of
point 3 may still not be good enough to provide smooth behavior. But that's
then a result of them overloading their systems in the first place. IOW,
you should try to evenly distribute vCPU-s as soon as their count on
a given pCPU exceeds the calculated average.
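
A minimal sketch of that policy, purely for illustration: keep a blocked
vCPU on the local pCPU's list while the list is no longer than the computed
average, otherwise pick a less loaded pCPU. num_online_cpus() is an existing
Xen helper; total_blocked_vcpu_count(), pi_blocking_list_len() and
pick_least_loaded_pcpu() are hypothetical placeholders that would have to
map onto the real PI blocking list bookkeeping:

/*
 * Illustrative sketch only (not from the actual series): choose the pCPU
 * whose PI blocking list a halting vCPU should be added to.  Helpers
 * other than num_online_cpus() are hypothetical.
 */
static unsigned int pi_pick_dest_cpu(unsigned int local_cpu)
{
    /* Average number of blocked vCPUs each pCPU should carry. */
    unsigned int avg = total_blocked_vcpu_count() / num_online_cpus();

    /* Common case: keep the wakeup entry on the local pCPU's list. */
    if ( pi_blocking_list_len(local_cpu) <= avg )
        return local_cpu;

    /*
     * The local list already exceeds the average: spread the load to a
     * pCPU with a shorter blocking list.
     */
    return pick_least_loaded_pcpu();
}

With such a check done when an entry is added, the per-pCPU list length,
and hence the traversal time, stays bounded by roughly the total vCPU
count divided by the pCPU count.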

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

