Re: [Xen-devel] [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
On Mon, May 08, 2017 at 02:39:25AM -0600, Jan Beulich wrote:
>>>> On 08.05.17 at 18:15, <chao.gao@xxxxxxxxx> wrote:
>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>> On 03.05.17 at 12:08, <george.dunlap@xxxxxxxxxx> wrote:
>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>> I compared the maximum of #entry in one list and #event (adding an
>>>>>>> entry to the PI blocking list) with and without the latter three
>>>>>>> patches. Here is the result:
>>>>>>> ------------------------------------------------
>>>>>>> | Items           | Maximum of #entry | #event |
>>>>>>> ------------------------------------------------
>>>>>>> | W/ the patches  |                 6 |  22740 |
>>>>>>> ------------------------------------------------
>>>>>>> | W/O the patches |               128 |  46481 |
>>>>>>> ------------------------------------------------
>>>>>>
>>>>>> Any chance you could trace how long the list traversal took? It would
>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>> we're talking about.
>>>>>
>>>>> Hi.
>>>>>
>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>> Apply the patch below and create one HVM guest with 128 vCPUs and a
>>>>> passthrough 40 NIC. All guest vCPUs are pinned to one pCPU. Collect
>>>>> data with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it
>>>>> with xentrace_format. When the list length is about 128, the traversal
>>>>> time is in the range of 1750 cycles to 39330 cycles. The physical
>>>>> CPU's frequency is 1795.788MHz, so the time consumed is in the range
>>>>> of 1us to 22us. If 0.5ms is the upper bound the system can tolerate,
>>>>> at most about 2900 vCPUs can be added to the list.
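For reference (this sketch is not part of the original thread), the arithmetic behind the figures quoted above can be reproduced with a few lines of C. The only inputs are the measured cycle counts, the stated 1795.788MHz clock, the 0.5ms tolerance, and the assumption that traversal time grows linearly with list length:

/* Reproduce the cycles -> time -> list-length-bound arithmetic above. */
#include <stdio.h>

int main(void)
{
    const double mhz = 1795.788;          /* pCPU frequency in MHz */
    const double best_cycles = 1750.0;    /* fastest traversal, ~128 entries */
    const double worst_cycles = 39330.0;  /* slowest traversal, ~128 entries */

    double best_us = best_cycles / mhz;   /* ~1.0us */
    double worst_us = worst_cycles / mhz; /* ~21.9us */

    /* Assuming linear scaling with list length, how many entries fit
     * into a 0.5ms (500us) budget in the worst case? */
    double max_entries = 128.0 * 500.0 / worst_us;   /* ~2900 */

    printf("traversal: %.1fus .. %.1fus; ~%.0f entries within 0.5ms\n",
           best_us, worst_us, max_entries);
    return 0;
}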
>>>>
>>>> Great, thanks Chao Gao, that's useful.
>>>
>>> Looks like Chao Gao has been dropped ...
>>>
>>>> I'm not sure a fixed latency -- say 500us -- is the right thing to look
>>>> at; if all 2900 vcpus arranged to have interrupts staggered at 500us
>>>> intervals it could easily lock up the cpu for nearly a full second.
>>>> But I'm having trouble formulating a good limit scenario.
>>>>
>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>> traffic will be the least of your performance problems, I should think.
>>>>
>>>>  -George
>>>>
>>>> * Waiting for Jan to contradict me on this one. :-)
>>>
>>> 22us would certainly be fine, if this was the worst case scenario.
>>> I'm not sure the value measured for 128 list entries can be easily
>>> scaled to several thousands of them, due to cache and/or NUMA effects.
>>> I continue to think that we primarily need theoretical proof of an
>>> upper boundary on list length being enforced, rather than any
>>> measurements or randomized balancing. And just to be clear - if
>>> someone overloads their system, I do not see a need to have a
>>> guaranteed maximum list traversal latency here. All I ask for is that
>>> list traversal time scales with total vCPU count divided by pCPU count.
>>
>> Thanks, Jan & George.
>>
>> It is now much clearer to me what I should do next.
>>
>> In my understanding, we should distribute the wakeup interrupts like
>> this:
>> 1. By default, distribute it to the local pCPU ('local' means the pCPU
>> the vCPU is on) to make the common case fast.
>> 2. When the list grows to a point where traversing it may consume too
>> much time, still distribute the wakeup interrupt to the local pCPU,
>> ignoring the case where the admin has intentionally overloaded the
>> system.
>> 3. When the list length reaches the theoretical average maximum (i.e.
>> the maximal vCPU count divided by the pCPU count), distribute the
>> wakeup interrupt to another, underutilized pCPU.
>>
>> But I am confused: if we don't care that someone overloads their
>> system, why do we need stage #3? Yet without it, I have no idea how to
>> meet Jan's request that list traversal time scale with total vCPU count
>> divided by pCPU count. Or will we reach stage #3 before stage #2?
>
> The thing is that imo point 2 is too fuzzy to be of any use, i.e. 3
> should take effect immediately. We don't mean to ignore any admin
> decisions here, it is just that if they overload their systems, the net
> effect of 3 may still not be good enough to provide smooth behavior.
> But that's then a result of them overloading their systems in the first
> place. IOW, you should try to evenly distribute vCPU-s as soon as their
> count on a given pCPU exceeds the calculated average.

Very helpful and reasonable. Thank you, Jan.
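For illustration, here is a rough sketch of the policy Jan describes: keep using the local pCPU's blocking list while its length stays at or below the average (total vCPUs divided by pCPUs), and spill over to a less loaded pCPU once it exceeds that. All names and data structures below are hypothetical placeholders, not the actual Xen per-CPU PI blocking list code or the code in this series, and locking of the counters is omitted:

/*
 * Hypothetical sketch of "distribute once above the average".
 * blocked_cnt[], total_vcpus and pi_pick_dest_cpu() are invented for
 * illustration only.
 */
#define NR_PCPUS 64                          /* assumed number of pCPUs */

static unsigned int blocked_cnt[NR_PCPUS];   /* #vCPUs on each pCPU's list */
static unsigned int total_vcpus;             /* all vCPUs in the system */

/* Pick the pCPU whose blocking list a vCPU about to block should join. */
unsigned int pi_pick_dest_cpu(unsigned int local_cpu)
{
    /* Average list length if vCPUs were spread evenly over all pCPUs. */
    unsigned int avg = (total_vcpus + NR_PCPUS - 1) / NR_PCPUS;
    unsigned int cpu, best = local_cpu;

    /* Common case: the local list is not above average, so stay local
     * and keep the wakeup interrupt on the pCPU the vCPU is on. */
    if ( blocked_cnt[local_cpu] <= avg )
        return local_cpu;

    /* Otherwise fall back to the least loaded pCPU. */
    for ( cpu = 0; cpu < NR_PCPUS; cpu++ )
        if ( blocked_cnt[cpu] < blocked_cnt[best] )
            best = cpu;

    return best;
}

With a rule of this shape, no pCPU's list grows much beyond total vCPUs divided by pCPUs, which is the scaling behaviour asked for above, while the common lightly-loaded case still takes the fast local path.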