[Xen-devel] [PATCH 0/6] xen: sched: improve scalability of Credit1, and optimize a bit both Credit1 and Credit2
Hello,

This series introduces some optimizations and performance improvements in Credit1 (in certain specific situations), and lightly touches Credit2 as well.

The core of the series is patches 3 and 4, which aim at both redistributing and reducing spinlock contention during load balancing. In fact, Credit1 load balancing is based on "work stealing": when a pCPU is about to go idle, it looks around inside other pCPUs' runqueues, to see if there are vCPUs waiting to run, and steals the first one it finds. This scan of the other pCPUs happens NUMA node by NUMA node, and always starts from the first pCPU of each node. That may lead to higher scheduler lock pressure on the lower-ID pCPUs of each node, as well as to stealing happening more frequently from them. This is what patch 4 fixes. It is not necessarily expected to improve performance per se, although a fairer lock pressure is likely to bring benefits.

Still about load balancing: when deciding whether or not to try to steal work from a pCPU, we only consider the ones that are non-idle. However, a pCPU which is running a vCPU and has no other vCPU waiting in its runqueue is not idle, and yet there is nothing we can steal from it. It is therefore possible that we check a number of pCPUs (which includes at least trying to take their runqueue locks), only to find that there is no vCPU we can grab, and that we need to keep checking other processors. On a large system, in situations where the load (i.e., the number of runnable and running vCPUs) is only _slightly_ higher than the number of pCPUs, this can have a significant performance impact. A way of improving this is to keep track not only of whether pCPUs are idle, but also of which ones have more than one runnable vCPU, which basically means they have at least one vCPU ready to be stolen by anyone that would otherwise go idle. This is exactly what patch 3 does (a rough sketch of the idea follows below).

Finally, patch 6 does for Credit2 something similar to what patch 3 does for Credit1, although the context is actually different. There are places in Credit2 where we just want the scheduler to give us one pCPU from a certain runqueue. We do that by means of cpumask_any(), which is great, but comes at a price. As a matter of fact, we don't really care much which pCPU we get, as a subsequent call to runq_tickle() will override the choice anyway. But, within runq_tickle() itself, the pCPU we choose is at least used as a hint, so we really don't want to give up completely and introduce a bias (by, e.g., just using cpumask_first()). We therefore use an approach similar to the one in patch 3: we record which pCPU we chose last time, and start from it next time.
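To make the idea behind patch 3 a bit more concrete, here is a minimal user space sketch, and not the actual Xen code: the names (overloaded_cpus, runq_len, steal_work, ...) and the pthread-based locking are purely illustrative assumptions. The point it tries to show is that a pCPU about to go idle only probes, and only takes the runqueue locks of, the pCPUs known to have something stealable.

/*
 * Minimal sketch of the "track overloaded pCPUs" idea -- NOT the actual
 * Xen code.  All names and the pthread-based locking are assumptions
 * made for illustration only.
 */
#include <stdio.h>
#include <pthread.h>

#define NR_CPUS 16

/* One bit per pCPU, set when that pCPU has more than one runnable vCPU,
 * i.e. at least one vCPU that somebody else could steal. */
static unsigned long overloaded_cpus;       /* assumes NR_CPUS <= bits in a long */
static pthread_mutex_t runq_lock[NR_CPUS];  /* stand-in for per-runqueue locks   */
static int runq_len[NR_CPUS];               /* vCPUs queued (i.e. stealable)     */

/* Call with runq_lock[cpu] held, whenever cpu's runqueue length changes. */
static void update_overloaded(int cpu)
{
    if (runq_len[cpu] > 0)
        __atomic_or_fetch(&overloaded_cpus, 1UL << cpu, __ATOMIC_RELAXED);
    else
        __atomic_and_fetch(&overloaded_cpus, ~(1UL << cpu), __ATOMIC_RELAXED);
}

/*
 * Work stealing for a pCPU that is about to go idle: rather than probing
 * every non-idle pCPU (taking its runqueue lock each time) only to find
 * nothing to grab, scan just the pCPUs flagged as overloaded.
 */
static int steal_work(int my_cpu)
{
    unsigned long mask = __atomic_load_n(&overloaded_cpus, __ATOMIC_RELAXED);

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == my_cpu || !(mask & (1UL << cpu)))
            continue;
        pthread_mutex_lock(&runq_lock[cpu]);
        if (runq_len[cpu] > 0) {
            runq_len[cpu]--;                /* "steal" one queued vCPU */
            update_overloaded(cpu);
            pthread_mutex_unlock(&runq_lock[cpu]);
            return cpu;                     /* stolen from this pCPU */
        }
        pthread_mutex_unlock(&runq_lock[cpu]);   /* raced, nothing left here */
    }
    return -1;                              /* nothing to steal, go idle */
}

int main(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        pthread_mutex_init(&runq_lock[cpu], NULL);

    /* Pretend pCPU 3 is running one vCPU and has another one queued. */
    pthread_mutex_lock(&runq_lock[3]);
    runq_len[3] = 1;
    update_overloaded(3);
    pthread_mutex_unlock(&runq_lock[3]);

    /* pCPU 0 about to go idle: it only probes pCPU 3, not all the others. */
    printf("stole a vCPU from pCPU %d\n", steal_work(0));
    return 0;
}

On top of something like this, patch 4's tweak roughly amounts to not always starting the scan from the first pCPU of the node (e.g., by rotating the starting point), so that lock pressure and stealing are spread more evenly.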
As said already, the performance benefits of this series are to be expected on large systems, under very specific load conditions. I've done some benchmarking on a 16-CPU NUMA box that I have at hand, running three experiments: a Xen compile ('MAKEXEN') inside a 16-vCPU guest; two Xen compiles running concurrently inside two 16-vCPU VMs; and a Xen compile and Iperf ('IPERF') running concurrently inside two 16-vCPU VMs.

Here are the results for Credit1. For MAKEXEN, lower is better, while for IPERF, higher is better. The tables below show average and standard deviation over 10 runs.

     |CREDIT1                                                            |
     |--------------------------------------------------------------------|
     |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN      vm2: IPERF      |
     |baseline patched|baseline patched|baseline patched  baseline patched|
     |----------------|----------------|----------------------------------|
 avg |  18.154  17.906|  52.832  51.088|  29.306  28.936    15.840  18.580|
 stdd|   0.580   0.059|   1.061   1.717|   0.757   0.296     4.264   2.492|

So, with the series applied, Xen compiles a little bit faster, and Iperf achieves a higher throughput, which is great. :-D

As far as Credit2 goes, here are the numbers:

     |CREDIT2                                                            |
     |--------------------------------------------------------------------|
     |MAKEXEN, 1VM    |MAKEXEN, 2VMs   |vm1: MAKEXEN      vm2: IPERF      |
     |baseline patched|baseline patched|baseline patched  baseline patched|
     |----------------|----------------|----------------------------------|
 avg |  18.062  17.894|  53.136  52.968|  32.754  32.880    18.160  19.240|
 stdd|   0.331   0.205|   0.886   0.566|   0.787   0.548     1.910   1.842|

In this case, the expected impact of the series is smaller, and that indeed matches what we get, with baseline and patched numbers very close. What I wanted to verify is that I was not introducing regressions, and that seems to be confirmed.

Thanks and Regards,
Dario

---

Dario Faggioli (6):
      xen: credit1: simplify csched_runq_steal() a little bit.
      xen: credit: (micro) optimize csched_runq_steal().
      xen: credit1: increase efficiency and scalability of load balancing.
      xen: credit1: treat pCPUs more evenly during balancing.
      xen/tools: tracing: add record for credit1 runqueue stealing.
      xen: credit2: avoid cpumask_any() in pick_cpu().

 tools/xentrace/formats       |    1 
 tools/xentrace/xenalyze.c    |   11 ++
 xen/common/sched_credit.c    |  199 +++++++++++++++++++++++++++++-------------
 xen/common/sched_credit2.c   |   22 ++++-
 xen/include/xen/perfc_defn.h |    1 
 5 files changed, 169 insertions(+), 65 deletions(-)

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)