[Xen-devel] [PATCH 00 of 11 v3] NUMA aware credit scheduling
Hello Everyone,

Here is v3 of the NUMA aware scheduling series. It is nothing more than v2 with
all the review comments I got addressed... or so I think. :-) I added a new
patch to the series (#3), dealing with the suboptimal SMT load balancing in
credit there, instead of within what is now patch #4.

I ran the following benchmarks (again):

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.

 * Sysbench-memory measures the time it takes to write a fixed amount of memory
   (and it is then the throughput that is reported). We expect locality to be
   important, but at the same time the potential imbalances due to pinning
   could have a say in it.

 * LMBench-proc measures the time it takes for a process to fork a fixed number
   of children. This is much more about latency than throughput, with locality
   of memory accesses playing a smaller role and, again, imbalances due to
   pinning being a potential issue.

Summarizing, we expect pinning to win on the throughput biased benchmarks
(SpecJBB and Sysbench), while having no affinity at all should be better when
latency is important, especially under (over)load (i.e., on LMBench). NUMA
aware scheduling tries to get the best out of the two approaches: take
advantage of locality, but in a flexible way. Therefore, it would be nice for
it to sit in the middle:

 - not as bad as no-affinity (or, if you prefer, almost as good as pinning)
   when looking at the SpecJBB and Sysbench results;
 - not as bad as pinning (or, if you prefer, almost as good as no-affinity)
   when looking at the LMBench results.

On a 2 node, 16 core system, where I can have from 2 to 10 VMs (2 vCPUs, 960MB
RAM each) executing the benchmarks concurrently, here is what I get:

----------------------------------------------------
| SpecJBB2005, throughput (the higher the better)  |
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
|   2  |  43318.613  | 49715.158 |    49822.545    |
|   6  |  29587.838  | 33560.944 |    33739.412    |
|  10  |  19223.962  | 21860.794 |    20089.602    |
----------------------------------------------------
| Sysbench memory, throughput (the higher the better)
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
|   2  |  469.37667  | 534.03167 |    555.09500    |
|   6  |  411.45056  | 437.02333 |    463.53389    |
|  10  |  292.79400  | 309.63800 |    305.55167    |
----------------------------------------------------
| LMBench proc, latency (the lower the better)     |
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
----------------------------------------------------
|   2  |  788.06613  | 753.78508 |    750.07010    |
|   6  |  986.44955  | 1076.7447 |    900.21504    |
|  10  |  1211.2434  | 1371.6014 |    1285.5947    |
----------------------------------------------------

Reasoning in terms of percentage performance increase/decrease, this is how
NUMA aware scheduling compares to no affinity at all and to pinning:

----------------------------------
|    SpecJBB2005 (throughput)    |
----------------------------------
| #VMs | No affinity |  Pinning  |
|   2  |   +13.05%   |  +0.21%   |
|   6  |   +12.30%   |  +0.53%   |
|  10  |    +4.31%   |  -8.82%   |
----------------------------------
|  Sysbench memory (throughput)  |
----------------------------------
| #VMs | No affinity |  Pinning  |
|   2  |   +15.44%   |  +3.79%   |
|   6  |   +11.24%   |  +5.72%   |
|  10  |    +4.18%   |  -1.34%   |
----------------------------------
|    LMBench proc (latency)      |  NOTICE: -x.xx% = GOOD here
----------------------------------
| #VMs | No affinity |  Pinning  |
----------------------------------
|   2  |    -5.66%   |  -0.50%   |
|   6  |    -9.58%   | -19.61%   |
|  10  |    +5.78%   |  -6.69%   |
----------------------------------
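A quick note on how to read the percentages above: each figure is the
difference relative to the NUMA scheduling result, i.e., (NUMA - other) / NUMA.
For instance, for SpecJBB with 2 VMs:

  (49822.545 - 43318.613) / 49822.545 = +13.05%

So, for the throughput benchmarks, positive numbers mean NUMA scheduling is
doing better while, for LMBench (latency), it is the negative numbers that are
the good ones, as the NOTICE in the last table says.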
So, not bad at all. :-) In particular, when not in overload, NUMA scheduling is
the absolute best of the three, even where one of the other approaches could
have been expected to win. In fact, looking at the 2 and 6 VM cases, it beats
pinning (although by a very small amount in the SpecJBB case) on both
throughput biased benchmarks, and it beats no-affinity on LMBench. Of course it
does a lot better than no-pinning on throughput and than pinning on latency
(esp. with 6 VMs), but that was expected.

In the overloaded case, NUMA scheduling scores "in the middle", i.e., better
than no-affinity but worse than pinning on the throughput benchmarks, and vice
versa on the latency benchmark. That is exactly what we expected and intended,
so this is fine. It must be noticed, however, that the benefits are not as big
as in the non-overloaded case. I chased the reason and found out that our load
balancing approach --in particular the fact that we rely on tickling idle
pCPUs to come and pick up new work by themselves-- couples particularly badly
with the new concept of node affinity. I spent some time looking for a simple
"fix" for this, but it does not seem easily amendable, so I will prepare a
patch using a completely different approach and send it separately from this
series (hopefully on top of it, in case this has hit the repo by then :-D).
For now, I really think we can be happy with the performance figures this
series enables... After all, I am overloading the box by 20% (without counting
Dom0's vCPUs!) and still seeing improvements, although perhaps not as big as
they could have been.

Thoughts?

Here are the patches included in the series. I '*'-ed the ones that already
received one or more Acks during previous rounds. Of course, I retained those
Acks only for the patches that have not been touched, or that only underwent
minor cleanups. And of course, feel free to re-review everything, whether your
Ack is there or not!

 * [ 1/11] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/11] xen, libxc: introduce node maps and masks
   [ 3/11] xen: sched_credit: when picking, make sure we get an idle one, if any
   [ 4/11] xen: sched_credit: let the scheduler know about node-affinity
 * [ 5/11] xen: allow for explicitly specifying node-affinity
 * [ 6/11] libxc: allow for explicitly specifying node-affinity
 * [ 7/11] libxl: allow for explicitly specifying node-affinity
   [ 8/11] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 9/11] libxl: automatic placement deals with node-affinity
 * [10/11] xl: add node-affinity to the output of `xl list`
   [11/11] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel