
[Xen-devel] [PATCH 00 of 11 v3] NUMA aware credit scheduling



Hello Everyone,

V3 of the NUMA aware scheduling series. It is nothing more than v2 with all
the review comments I got, addressed... Or so I think. :-)

I added a new patch to the series (#3), dealing with the suboptimal SMT load
balancing in credit there, instead of within what is now patch #4.

I ran the following benchmarks (again):

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.

 * Sysbench-memory measures the time it takes to write a fixed amount of memory
   (and reports that as throughput). We expect locality to be important, but at
   the same time the potential imbalances due to pinning could have a say in
   it.

 * LMBench-proc measures the time it takes for a process to fork a fixed number
   of children. This is much more about latency than throughput, with locality
   of memory accesses playing a smaller role and, again, imbalances due to
   pinning being a potential issue.

Summarizing, we expect pinning to win on the throughput biased benchmarks
(SpecJBB and Sysbench), while having no affinity at all should be better when
latency is important, especially under (over)load (i.e., on LMBench). NUMA aware
scheduling tries to get the best out of the two approaches: take advantage of
locality, but in a flexible way. Therefore, it would be nice for it to sit in
the middle:
 - not as bad as no-affinity (or, if you prefer, almost as good as pinning)
   when looking at SpecJBB and Sysbench results;
 - not as bad as pinning (or, if you prefer, almost as good as no-affinity)
   when looking at LMBench results.

On a 2-node, 16-core system, where I can have from 2 to 10 VMs (2 vCPUs,
960MB RAM each) executing the benchmarks concurrently, here's what I get:

 -------------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)     |
 -------------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling    |
 |    2 |  43318.613  | 49715.158 |    49822.545       |
 |    6 |  29587.838  | 33560.944 |    33739.412       |
 |   10 |  19223.962  | 21860.794 |    20089.602       |
 -------------------------------------------------------
 | Sysbench memory, throughput (the higher the better) |
 -------------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling    |
 |    2 |  469.37667  | 534.03167 |    555.09500       |
 |    6 |  411.45056  | 437.02333 |    463.53389       |
 |   10 |  292.79400  | 309.63800 |    305.55167       |
 -------------------------------------------------------
 | LMBench proc, latency (the lower the better)        |
 -------------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling    |
 |    2 |  788.06613  | 753.78508 |    750.07010       |
 |    6 |  986.44955  | 1076.7447 |    900.21504       |
 |   10 |  1211.2434  | 1371.6014 |    1285.5947       |
 -------------------------------------------------------

Reasoning in terms of % performance increase/decrease, here is how NUMA aware
scheduling compares to having no affinity at all and to pinning (see the
worked example right after the tables for how the deltas are computed):

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +13.05%   |  +0.21%   |
     |    6 |   +12.30%   |  +0.53%   |
     |   10 |    +4.31%   |  -8.82%   |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +15.44%   |  +3.79%   |
     |    6 |   +11.24%   |  +5.72%   |
     |   10 |    +4.18%   |  -1.34%   |
     ----------------------------------
     | LMBench proc (latency)         | NOTICE: -x.xx% = GOOD here
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |    -5.66%   |  -0.50%   |
     |    6 |    -9.58%   | -19.61%   |
     |   10 |    +5.78%   |  -6.69%   |
     ----------------------------------
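
As a worked example of how to read the deltas: with 2 VMs, SpecJBB scores
49822.545 under NUMA scheduling and 43318.613 with no affinity at all, and

     (49822.545 - 43318.613) / 49822.545 = +13.05% ,

which is the figure in the corresponding cell above (i.e., the deltas are
relative to the NUMA scheduling results).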

So, not bad at all. :-) In particular, when not in overload, NUMA scheduling is
the absolute best of the three, even where one of the other approaches could
have been expected to win. In fact, looking at the 2 and 6 VMs cases, it beats
pinning on both throughput biased benchmarks (although by a very small amount
in the SpecJBB case), and it beats no-affinity on LMBench. Of course it does a
lot better than no-affinity on throughput and than pinning on latency
(especially with 6 VMs), but that was expected.

Regarding the overloaded case, NUMA scheduling scores "in the middle", i.e.,
better than no-affinity but worse than pinning on the throughput benchmarks,
and vice-versa on the latency benchmark, which is exactly what we expected and
intended, so this is fine. It must be noted, however, that the benefits are not
as huge as in the non-overloaded case. I chased the reason and found out that
our load-balancing approach (in particular, the fact that we rely on tickling
idle pCPUs to come pick up new work by themselves) couples particularly badly
with the new concept of node affinity. I spent some time looking for a simple
"fix" for this, but the problem does not seem amenable to one, so I'll prepare
a patch, using a completely different approach, and send it separately from
this series (hopefully on top of it, in case the series has hit the repo by
then :-D). For now, I really think we can be happy with the performance figures
this series enables... After all, I'm overloading the box by 20% (without
counting Dom0 vCPUs!) and still seeing improvements, although perhaps not as
huge as they could have been. Thoughts?
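
To make the problematic interaction more concrete, here is a toy sketch in C
(purely illustrative: all names are made up, this is not the actual
sched_credit code). On wakeup we just tickle whatever pCPU happens to be idle,
and that pCPU then pulls the work by itself, with no notion of node affinity,
so a vCPU can easily end up running far away from its memory:

/* Toy model of the wakeup/tickle/steal interaction.  NOT Xen code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_PCPUS  16

struct toy_vcpu { int home_node; };        /* node its memory lives on  */
struct toy_pcpu { int node; bool idle; };  /* physical CPU and its node */

static struct toy_pcpu pcpus[NR_PCPUS];

/* Wakeup path: kick the first idle pCPU we find, regardless of the
 * waking vCPU's node affinity. */
static struct toy_pcpu *tickle_an_idler(void)
{
    for (int i = 0; i < NR_PCPUS; i++)
        if (pcpus[i].idle)
            return &pcpus[i];
    return NULL;
}

/* The tickled pCPU then picks the vCPU up by itself.  Nothing here
 * prevents a pCPU on node 1 from grabbing a vCPU whose memory is on
 * node 0, which is what spoils locality under load. */
static bool pull_work(struct toy_pcpu *p, const struct toy_vcpu *v)
{
    p->idle = false;
    return p->node != v->home_node;  /* true == locality lost */
}

int main(void)
{
    /* Node 0 (pCPUs 0-7) is fully busy, node 1 has one idler. */
    for (int i = 0; i < NR_PCPUS; i++) {
        pcpus[i].node = (i < 8) ? 0 : 1;
        pcpus[i].idle = (i == 12);
    }

    struct toy_vcpu v = { .home_node = 0 };  /* its memory is on node 0 */
    struct toy_pcpu *p = tickle_an_idler();

    if (p && pull_work(p, &v))
        printf("vCPU with memory on node %d now runs on node %d\n",
               v.home_node, p->node);
    return 0;
}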

Here are the patches included in the series. I '*'-ed the ones that already
received one or more Acks during previous rounds. Of course, I retained these
Acks only for the patches that have not been touched, or that only underwent
minor cleanups. Feel free to re-review everything, whether your Ack is there
or not!

 * [ 1/11] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/11] xen, libxc: introduce node maps and masks
   [ 3/11] xen: sched_credit: when picking, make sure we get an idle one, if any
   [ 4/11] xen: sched_credit: let the scheduler know about node-affinity
 * [ 5/11] xen: allow for explicitly specifying node-affinity
 * [ 6/11] libxc: allow for explicitly specifying node-affinity
 * [ 7/11] libxl: allow for explicitly specifying node-affinity
   [ 8/11] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 9/11] libxl: automatic placement deals with node-affinity
 * [10/11] xl: add node-affinity to the output of `xl list`
   [11/11] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

