Re: [Xen-devel] [PATCH 0/4] mitigate the per-pCPU blocking list may be too long

On Mon, May 08, 2017 at 03:24:47AM -0600, Jan Beulich wrote:
>(Chao Gao got lost from the recipients list again; re-adding)
>
>>>> On 08.05.17 at 11:13, <george.dunlap@xxxxxxxxxx> wrote:
>> On 08/05/17 17:15, Chao Gao wrote:
>>> On Wed, May 03, 2017 at 04:21:27AM -0600, Jan Beulich wrote:
>>>>>>> On 03.05.17 at 12:08, <george.dunlap@xxxxxxxxxx> wrote:
>>>>> On 02/05/17 06:45, Chao Gao wrote:
>>>>>> On Wed, Apr 26, 2017 at 05:39:57PM +0100, George Dunlap wrote:
>>>>>>> On 26/04/17 01:52, Chao Gao wrote:
>>>>>>>> I compared the maximum of #entry in one list and #event (adding an entry
>>>>>>>> to the PI blocking list) with and without the three latter patches. Here
>>>>>>>> is the result:
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>> | Items           | Maximum of #entry | #event |
>>>>>>>> ------------------------------------------------
>>>>>>>> | W/ the patches  |                 6 |  22740 |
>>>>>>>> ------------------------------------------------
>>>>>>>> | W/O the patches |               128 |  46481 |
>>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> Any chance you could trace how long the list traversal took? It would
>>>>>>> be good for future reference to have an idea what kinds of timescales
>>>>>>> we're talking about.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I made a simple test to get the time consumed by the list traversal.
>>>>>> Apply the patch below and create one hvm guest with 128 vcpus and a
>>>>>> passthrough 40 NIC. All guest vcpus are pinned to one pcpu. Collect
>>>>>> data with 'xentrace -D -e 0x82000 -T 300 trace.bin' and decode it with
>>>>>> xentrace_format. When the list length is about 128, the traversal time
>>>>>> is in the range of 1750 cycles to 39330 cycles. The physical cpu's
>>>>>> frequency is 1795.788MHz, therefore the time consumed is in the range
>>>>>> of 1us to 22us. If 0.5ms is the upper bound the system can tolerate,
>>>>>> at most 2900 vcpus can be added to the list.
>>>>>
>>>>> Great, thanks Chao Gao, that's useful.
>>>>
>>>> Looks like Chao Gao has been dropped ...
>>>>
>>>>> I'm not sure a fixed latency --
>>>>> say 500us -- is the right thing to look at; if all 2900 vcpus arranged
>>>>> to have interrupts staggered at 500us intervals it could easily lock up
>>>>> the cpu for nearly a full second. But I'm having trouble formulating a
>>>>> good limit scenario.
>>>>>
>>>>> In any case, 22us should be safe from a security standpoint*, and 128
>>>>> should be pretty safe from a "make the common case fast" standpoint:
>>>>> i.e., if you have 128 vcpus on a single runqueue, the IPI wake-up
>>>>> traffic will be the least of your performance problems, I should think.
>>>>>
>>>>> -George
>>>>>
>>>>> * Waiting for Jan to contradict me on this one. :-)
>>>>
>>>> 22us would certainly be fine, if this was the worst case scenario.
>>>> I'm not sure the value measured for 128 list entries can be easily
>>>> scaled to several thousands of them, due to cache and/or NUMA
>>>> effects. I continue to think that we primarily need theoretical
>>>> proof of an upper boundary on list length being enforced, rather
>>>> than any measurements or randomized balancing. And just to be
>>>> clear - if someone overloads their system, I do not see a need to
>>>> have a guaranteed maximum list traversal latency here. All I ask
>>>> for is that list traversal time scales with total vCPU count divided
>>>> by pCPU count.
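
For reference, the arithmetic behind the figures quoted above can be checked with a small stand-alone program (a rough sketch only; it assumes the traversal cost scales linearly with list length, and the 0.5ms budget is just the tolerance assumed in the measurement mail, not an established limit):

/*
 * Back-of-the-envelope check of the numbers above.  The constants are the
 * measured values from Chao Gao's test (roughly 128-entry list, 1795.788MHz
 * pCPU); linear scaling of the worst case is an assumption, not a measurement.
 */
#include <stdio.h>

int main(void)
{
    const double mhz = 1795.788;         /* measured pCPU frequency (MHz)   */
    const double best_cycles = 1750.0;   /* best traversal of ~128 entries  */
    const double worst_cycles = 39330.0; /* worst traversal of ~128 entries */
    const double budget_us = 500.0;      /* assumed tolerable bound (0.5ms) */

    double best_us = best_cycles / mhz;             /* ~1us              */
    double worst_us = worst_cycles / mhz;           /* ~22us             */
    double per_entry_us = worst_us / 128.0;         /* ~0.17us per entry */
    double max_entries = budget_us / per_entry_us;  /* ~2900 entries     */

    printf("traversal: %.1fus .. %.1fus for ~128 entries\n", best_us, worst_us);
    printf("entries fitting in %.0fus: ~%.0f\n", budget_us, max_entries);

    return 0;
}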
>>>
>>> Thanks, Jan & George.
>>>
>>> It is now much clearer to me what I should do as the next step.
>>>
>>> In my understanding, we should distribute the wakeup interrupts like
>>> this:
>>> 1. By default, distribute it to the local pCPU ('local' means the vCPU
>>> is on the pCPU) to make the common case fast.
>>> 2. When the list grows to a point where we think it may consume too
>>> much time to traverse, still distribute the wakeup interrupt to the
>>> local pCPU, ignoring the case where the admin intentionally overloads
>>> their system.
>>> 3. When the list length reaches the theoretical average maximum (meaning
>>> the maximal vCPU count divided by the pCPU count), distribute the wakeup
>>> interrupt to another underutilized pCPU.
>>
>> By "maximal vCPU count" do you mean, "total number of active vcpus on
>> the system"? Or some other theoretical maximum vcpu count (e.g., 32k
>> domains * 512 vcpus each or something)?
>
>The former.

Ok. Actually I meant the latter. But now, I realize I was wrong.

>> What about saying that the limit of vcpus for any given pcpu will be:
>>
>> (v_tot / p_tot) + K
>>
>> where v_tot is the total number of vcpus on the system, p_tot is the
>> total number of pcpus in the system, and K is a fixed number (such as
>> 128) such that 1) the additional time walking the list is minimal, and
>> 2) in the common case we should never come close to reaching that number?
>>
>> Then the algorithm for choosing which pcpu to have the interrupt
>> delivered to would be:
>> 1. Set p = current_pcpu
>> 2. If len(list(p)) < v_tot / p_tot + K, choose p
>> 3. Otherwise, choose another p and goto 2
>>
>> The "choose another p" could be random / pseudorandom selection, or it
>> could be some other mechanism (rotate, look for pcpus nearby on the
>> topology, choose the lowest one, &c). But as long as we check the
>> length before assigning it, it should satisfy Jan.

Very clear and helpful. Otherwise, I might have needed to spend several
months to reach this solution. Thanks, George. :)
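
To make the selection rule above concrete, here is a minimal, self-contained C sketch of the algorithm George outlines. Everything in it is illustrative only: the per-pCPU list lengths are plain counters in an array rather than the real PI blocking lists, "choose another p" is a simple rotation, and none of the names correspond to existing Xen code.

/*
 * Illustrative sketch of the pCPU selection George outlines above.
 * With limit = v_tot/p_tot + K and at most v_tot blocked vCPUs in total,
 * not every list can be at the limit at once, so the rotation below
 * always terminates (pigeonhole argument).
 */
#include <stdio.h>

#define NR_PCPUS 4
#define K        128   /* slack so the common case stays on the local pCPU */

static unsigned int list_len[NR_PCPUS]; /* stand-in for per-pCPU list lengths */

static unsigned int pick_dest_pcpu(unsigned int local, unsigned int v_tot)
{
    unsigned int limit = v_tot / NR_PCPUS + K;
    unsigned int p = local;

    /* Prefer the local pCPU; move on only if its list already hit the cap. */
    while ( list_len[p] >= limit )
        p = (p + 1) % NR_PCPUS;  /* could be random/topology-aware instead */

    return p;
}

int main(void)
{
    unsigned int v_tot = 1024, i;

    /* Overload pCPU 0 so the fallback path is exercised. */
    list_len[0] = v_tot / NR_PCPUS + K;

    for ( i = 0; i < NR_PCPUS; i++ )
        printf("vCPU blocking on pCPU %u -> queued on pCPU %u\n",
               i, pick_dest_pcpu(i, v_tot));

    return 0;
}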