
[Xen-devel] [benchmarks] Linux scheduling domain *magic* tricks



Hello Oracle chaps,
plus George,
plus Juergen,
plus everyone on xen-devel, :-)

As promised, I'll have a deep look at the tests and benchmark results
that Elena dumped on us all ASAP. However, it's only fair if I also
spam you with a huge load of numbers over which you can scratch (or
bang?!?) your heads, isn't it? :-D

So, here we are. I'm starting a new thread because this is somewhat
independent of the topology related side of things, which Elena is
talking about (and which Juergen and I were also investigating and
working on already).

In fact, Linux's scheduling domains can be configured in a variety of
ways, by means of a set of flags (this is normally done during Linux's
boot). In a way, everything there is really related to cpu topology
(scheduling domains _are_ the Linux scheduler's interface to cpu
topology!). But strictly speaking, there are 'pure topology' related
flags, and more abstract 'behavioral' flags.

This is the list of these flags, BTW:
http://lxr.free-electrons.com/source/include/linux/sched.h#L981
/*
 * sched-domains (multiprocessor balancing) declarations:
 */
#define SD_LOAD_BALANCE         0x0001  /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE      0x0002  /* Balance when about to become idle */
#define SD_BALANCE_EXEC         0x0004  /* Balance on exec */
#define SD_BALANCE_FORK         0x0008  /* Balance on fork, clone */
#define SD_BALANCE_WAKE         0x0010  /* Balance on wakeup */
#define SD_WAKE_AFFINE          0x0020  /* Wake task to waking CPU */
#define SD_SHARE_CPUCAPACITY    0x0080  /* Domain members share cpu power */
#define SD_SHARE_POWERDOMAIN    0x0100  /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES  0x0200  /* Domain members share cpu pkg resources */
#define SD_SERIALIZE            0x0400  /* Only a single load balancing instance */
#define SD_ASYM_PACKING         0x0800  /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING       0x1000  /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP              0x2000  /* sched_domains of this level overlap */
#define SD_NUMA                 0x4000  /* cross-node balancing */

To check how scheduling domains are configured (and to change it), look
here: /proc/sys/kernel/sched_domain/cpu*/domain*/flags
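
To save everyone some hex arithmetic, here is a tiny decoder for the value
stored in those files (just a sketch I'm adding for convenience, nothing that
ships anywhere; the flag values are the ones from the sched.h snippet above):

/*
 * Decode a sched_domain flags value into the SD_* names listed above.
 * Usage (illustrative):
 *   ./sd_decode $(cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags)
 */
#include <stdio.h>
#include <stdlib.h>

static const struct { unsigned int bit; const char *name; } sd_flags[] = {
    { 0x0001, "SD_LOAD_BALANCE" },       { 0x0002, "SD_BALANCE_NEWIDLE" },
    { 0x0004, "SD_BALANCE_EXEC" },       { 0x0008, "SD_BALANCE_FORK" },
    { 0x0010, "SD_BALANCE_WAKE" },       { 0x0020, "SD_WAKE_AFFINE" },
    { 0x0080, "SD_SHARE_CPUCAPACITY" },  { 0x0100, "SD_SHARE_POWERDOMAIN" },
    { 0x0200, "SD_SHARE_PKG_RESOURCES" },{ 0x0400, "SD_SERIALIZE" },
    { 0x0800, "SD_ASYM_PACKING" },       { 0x1000, "SD_PREFER_SIBLING" },
    { 0x2000, "SD_OVERLAP" },            { 0x4000, "SD_NUMA" },
};

int main(int argc, char *argv[])
{
    unsigned int flags = argc > 1 ? (unsigned int)strtoul(argv[1], NULL, 0) : 0;
    unsigned int i;

    printf("flags = %u (0x%x):\n", flags, flags);
    for (i = 0; i < sizeof(sd_flags) / sizeof(sd_flags[0]); i++)
        if (flags & sd_flags[i].bit)
            printf("  %s\n", sd_flags[i].name);
    return 0;
}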

I noticed some oddities in the way Linux's and Xen's schedulers
interacted in some cases, and changing the 'behavioral' flags had an
impact. So I ran a preliminary set of experiments with Unixbench, with
the following results:

(Hint: look at the "Execl Throughput" and "Process Creation" rows, in
the 1x case.)

# ./Run -c 1 (1 parallel copy of each benchmark inside a 4 vcpus HVM guest)
    Flags                                          4143     4135     4131     4151     4147     4115     4099     4128
1 x Dhrystone 2 using register variables        2299.0   2298.4   2302.0   2311.4   2312.1   2312.1   2299.2   2301.6
1 x Double-Precision Whetstone                   619.5    619.5    619.8    619.0    619.0    619.1    619.2    619.6
1 x Execl Throughput                             458.0    449.6   1017.0    449.4   1012.1   1017.4   1018.2   1022.6
1 x File Copy 1024 bufsize 2000 maxblocks       2188.8   2317.4   2403.1   2412.5   2420.8   2423.8   2422.7   2430.5
1 x File Copy 256 bufsize 500 maxblocks         1459.7   1576.1   1648.3   1647.7   1649.4   1663.5   1652.4   1649.0
1 x File Copy 4096 bufsize 8000 maxblocks       3467.8   3581.9   3621.1   3624.7   3635.9   3619.8   3606.1   3608.8
1 x Pipe Throughput                             1518.3   1505.3   1519.0   1514.7   1518.9   1516.5   1517.2   1518.0
1 x Pipe-based Context Switching                 803.7    798.7    801.8    801.4    797.9    132.9     92.0    809.7
1 x Process Creation                             404.3    931.8    942.5    950.4    932.7    967.4    960.1    962.7
1 x Shell Scripts (1 concurrent)                1304.4   1256.4   1755.1   1259.5   1756.5   1741.3   1726.0   1819.6
1 x Shell Scripts (8 concurrent)                4564.2   4704.1   4714.0   4691.8   4710.2   4570.8   4571.0  1694.6*
1 x System Call Overhead                        2251.1   2249.6   2250.1   2248.9   2250.3   2249.9   2251.0   2249.0
    System Benchmarks Index Score               1380.2   1495.1   1662.2   1511.4   1661.5   1431.8   1384.9   1536.5
                                          +/-    0.00%   +8.32%  +20.43%   +9.51%  +20.38%   +3.74%   +0.34%  +11.32%

# ./Run -c 4 (4 parallel copies of each benchmark inside a 4 vcpus HVM guest)
    Flags                                          4143     4135     4131     4151     4147     4115     4099     4128
4 x Dhrystone 2 using register variables        8619.4   8551.3   8661.7   8694.1   8731.8   8578.0   8591.7   2293.4
4 x Double-Precision Whetstone                  2351.8   2348.9   2352.2   2352.8   2351.7   2351.3   2352.4   2470.6
4 x Execl Throughput                            3264.3   3346.9   3743.0   3365.7   3726.6   3745.7   3759.2   1017.0
4 x File Copy 1024 bufsize 2000 maxblocks       2741.7   2789.7   2871.5   2842.5   2793.5   2935.9   2846.5   2376.7
4 x File Copy 256 bufsize 500 maxblocks         1736.6   1754.9   1841.4   1815.1   1763.5   1829.7   1836.6   1579.4
4 x File Copy 4096 bufsize 8000 maxblocks       4457.1   4461.9   4284.4   4566.8   4476.7   4619.0   4815.9   3648.2
4 x Pipe Throughput                             5724.0   5719.3   5732.4   5747.6   5747.2   5720.8   5740.1   1509.6
4 x Pipe-based Context Switching                2847.8   2841.9   2831.7   2826.2   2844.5   2433.2   2832.3    745.2
4 x Process Creation                            1863.1   3383.8   3358.6   3365.9   3339.1   3206.9   3338.2    924.7
4 x Shell Scripts (1 concurrent)                5126.8   4992.7   6739.1   4973.7   6773.8   6770.9   6806.4   1823.5
4 x Shell Scripts (8 concurrent)                5969.8   6021.7   6258.9   6018.9   6302.5   6284.0   6323.6   1683.6
4 x System Call Overhead                        6647.9   6661.2   6672.6   6669.9   6665.0   6641.8   6649.9   2244.1
    System Benchmarks Index Score               3786.1   3987.8   4155.4   4018.1   4151.5   4116.1   4195.1   1695.8
                                          +/-    0.00%   +5.33%   +9.75%   +6.13%   +9.65%   +8.72%  +10.80%  -55.21%
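
(In case it is not obvious: the '+/-' rows are just each column's System
Benchmarks Index Score relative to the default flags, 4143. A quick sketch of
that computation, using the 1x scores; the 4x row is obtained the same way,
against 3786.1:)

/* How the "+/-" row is derived from the Index Score row (1x run). */
#include <stdio.h>

int main(void)
{
    const double baseline = 1380.2;   /* index score with default flags, 4143 */
    const double scores[] = { 1380.2, 1495.1, 1662.2, 1511.4,
                              1661.5, 1431.8, 1384.9, 1536.5 };
    const int flags[]     = { 4143, 4135, 4131, 4151, 4147, 4115, 4099, 4128 };
    int i;

    for (i = 0; i < 8; i++)
        printf("%d: %+.2f%%\n", flags[i],
               (scores[i] - baseline) / baseline * 100.0);
    return 0;
}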

4131 and 4147 are the ones that looked most promising.

Linux's default, for an HVM guest, is 4143:
  LOAD_BALANCE    |
  BALANCE_NEWIDLE |
  BALANCE_EXEC    |
  BALANCE_FORK    |
  WAKE_AFFINE     |
  PREFER_SIBLING

Using 4131 means:
  LOAD_BALANCE    |
  BALANCE_NEWIDLE |
  WAKE_AFFINE     |
  PREFER_SIBLING

4147 means:
  LOAD_BALANCE    |
  BALANCE_NEWIDLE |
  BALANCE_WAKE    |
  WAKE_AFFINE     |
  PREFER_SIBLING

For now, I focused on 4131 (as the results of other ad-hoc benchmarks
were hinting at that), but I want to investigate 4147 too (see below).
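
Just as a sanity check of those decompositions (flag values as in the sched.h
snippet above; this is only illustrative C, not anything from the kernel):

#include <stdio.h>

#define SD_LOAD_BALANCE     0x0001
#define SD_BALANCE_NEWIDLE  0x0002
#define SD_BALANCE_EXEC     0x0004
#define SD_BALANCE_FORK     0x0008
#define SD_BALANCE_WAKE     0x0010
#define SD_WAKE_AFFINE      0x0020
#define SD_PREFER_SIBLING   0x1000

int main(void)
{
    printf("%d\n", SD_LOAD_BALANCE | SD_BALANCE_NEWIDLE | SD_BALANCE_EXEC |
                   SD_BALANCE_FORK | SD_WAKE_AFFINE | SD_PREFER_SIBLING);   /* 4143 */
    printf("%d\n", SD_LOAD_BALANCE | SD_BALANCE_NEWIDLE |
                   SD_WAKE_AFFINE | SD_PREFER_SIBLING);                     /* 4131 */
    printf("%d\n", SD_LOAD_BALANCE | SD_BALANCE_NEWIDLE | SD_BALANCE_WAKE |
                   SD_WAKE_AFFINE | SD_PREFER_SIBLING);                     /* 4147 */
    return 0;
}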

Basically, with 4131 instead of the default 4143 as the scheduling
domain's flags (there's only one domain in one of our HVM guests right
now), I'm telling the Linux scheduler that its load balancing logic
should *not* trigger as a result of fork() or exec(), nor when a task
wakes up (it seems a bit aggressive, but still...).

So, I arranged to compare the performance of the default set of flags
with 4131 in a more extensive way.
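
Concretely, switching the flags of all of a guest's domains just means writing
the new value into the /proc files mentioned above. Something along these lines
does it (a sketch only, not the exact tool I used; it needs to run as root
inside the guest):

/* Flip all sched_domain flags in this guest to 4131. */
#include <glob.h>
#include <stdio.h>

int main(void)
{
    const char *pattern = "/proc/sys/kernel/sched_domain/cpu*/domain*/flags";
    const char *newval = "4131";
    glob_t g;
    size_t i;

    if (glob(pattern, 0, NULL, &g) != 0) {
        fprintf(stderr, "no sched_domain flags files found\n");
        return 1;
    }
    for (i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "w");

        if (!f) {
            perror(g.gl_pathv[i]);
            continue;
        }
        fprintf(f, "%s\n", newval);
        fclose(f);
        printf("%s <- %s\n", g.gl_pathv[i], newval);
    }
    globfree(&g);
    return 0;
}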

Here's what I did (it's a bit of a long explanation, but it's meant to
make sure you know what each benchmark configuration actually did).

I selected the following benchmarks:
 - makexen: how long it takes to compile Xen
            (results: lower == better)
 - iperf: iperf from guest(s) toward the host
          (results: higher == better)
 - sysbench-cpu: pure number crunching
                 (results: lower == better)
 - sysbench-oltp: concurrent database transactions
                  (results: higher == better)
 - unixbench: runs a set of tests and computes a global perf index
              (results: higher == better [1])

The actual workloads were always run inside guests. A varying number of
HVM guests was used, and the number of vcpus and amount of memory of
the guests also varied.

All the benchmarks were "homogeneous", i.e., all the guests used in a
particular instance were equal (in terms of number of vcpus and amount
of RAM), and all ran the same workload. They were also "synchronous",
i.e., all the guests started running the workload at the same time.

Each benchmark was repeated 5 times. This first set of results shows
the average of all the output samples of all the iterations from all
the guests involved in each benchmark.[2]

The benchmarks were run on a host with 24 pCPUs (arranged in 2 NUMA
nodes) and 32GB of RAM. The Xen version was always the same (what
staging was a few weeks ago). The Linux dom0 kernel was 4.3.0; the
guests' kernels were 4.2.0. Results are collected for the default case,
and for the case where I reconfigured the guests' scheduling domain
flags to 4131 (yes, only the guests' flags for now; I can rerun
changing dom0's flags as well).

A particular benchmark is characterized as follows:
 * host load: basically, how many guest vcpus were being used. It can
              be sequential, small, medium, large, full, overload or
              overwhelmed
 * guest size: how big (in terms of vcpus and memory) the guests were.
               It can be sequential, small, medium or large
 * guest load: how busy the guests were kept. It can be sequential,
               moderate, full or overbooked.

In some more detail:
 - host load:
    * sequential load means there is only 1 VM
    * small load means the total number of guest vcpus was ~1/3 of
      the host pcpus (i.e., 24/3 = 8 vcpus)
    * medium load means the total number of guest vcpus was ~1/2 of the
      host pcpus (i.e., 12, but it was 16 some of the times)
    * large load means the total number of guest vcpus was ~2/3 of the
      host pcpus (i.e., 16, but it was actually 20)
    * full load means the total number of guest vcpus was == to the
      host pcpus (i.e., 24)
    * overload means the total number of guest vcpus was ~1.5x the host
      pcpus (i.e., 36, but it was 32 some of the times)
    * overwhelmed means the total number of guest vcpus was 2x the host
      pcpus (i.e., 48)

 - guest size:
    * sequential guests had 1 vcpu and 2048MB of RAM
    * small guests had 4 vcpus and 4096MB of RAM
    * medium guests had 8 vcpus and 5120MB of RAM
    * large guests had 12 vcpus and 8192MB of RAM

 - guest load:
    * sequential means the benchmark was run sequentially (e.g.,
      make -j1, unixbench -c1, sysbench --num-threads=1, etc.)
    * moderate means the benchmark was keeping half of the guest's
      vcpus busy (e.g., make -j4 in an 8 vcpus guest)
    * full means the benchmark was keeping all the guest's vcpus busy
      (e.g., make -j8 in an 8 vcpus guest)
    * overbooked means the benchmark was running with 2x the degree of
      parallelism w.r.t. the guest's vcpus (e.g., make -j16 in an 8
      vcpus guest)

Combining these three 'parameters', several benchmark configurations
were generated. Not all the possible combinations are meaningful, but I
ended up with quite a few cases.
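
For reading the attached files: a configuration name is just
<benchmark>-<host load>hload-<guest size>vm-<guest load>wkload, glued together
from the levels above. A rough sketch of the enumeration (most of the short
codes here are made up for illustration; only the ones appearing in the
examples below actually occur in this mail):

#include <stdio.h>

int main(void)
{
    /* Only "smallhload", "overldhload", "smallvm", "medvm", "fullwkload"
     * and "modrtwkload" are taken from the examples below; the rest of
     * the short codes are hypothetical. */
    const char *hload[] = { "seqhload", "smallhload", "medhload", "largehload",
                            "fullhload", "overldhload", "overwhlmdhload" };
    const char *vmsz[]  = { "seqvm", "smallvm", "medvm", "largevm" };
    const char *wkld[]  = { "seqwkload", "modrtwkload", "fullwkload", "overbkdwkload" };
    unsigned int i, j, k, n = 0;

    for (i = 0; i < sizeof(hload) / sizeof(hload[0]); i++)
        for (j = 0; j < sizeof(vmsz) / sizeof(vmsz[0]); j++)
            for (k = 0; k < sizeof(wkld) / sizeof(wkld[0]); k++, n++)
                printf("makexen-%s-%s-%s\n", hload[i], vmsz[j], wkld[k]);
    printf("%u combinations per benchmark (before dropping the meaningless ones)\n", n);
    return 0;
}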

Let me just put down a couple of examples, to make sure what happened
during a particular benchmark can be fully understood. So, for
instance, the sysboltp-smallhload-smallvm-fullwkload benchmark was:
 + running sysbench --oltp
 + running it inside 2 HVM guests at the same time (small host load)
 + the guests had 4 vcpus (small vms)
 + running it with --num-threads=4 (full workload)

Another one, makexen-overldhload-medvm-modrtwkload, was:
 + running a Xen build
 + running it in 4 HVM guests at the same time (32 vcpus total,
   overloaded host)
 + the guests had 8 vcpus (medium vms)
 + running it with -j4 (moderate workload)

And so on and so forth. The first column in the attached results files
identifies the specific benchmark configuration, according to this
characterization. The other columns show, for each workload, the
performance with the default flags and with 4131, and the percent
increase we got by using 4131. Of course, when lower is better (like
in makexen and the sysbench-es), an actual increase is a bad thing.
In fact, I highlighted with an '*' the instances where changing the
flags to 4131 caused a regression.

I'm too tired right now to do a proper analysis, but just very
quickly:
 - for Xen-build-alike workloads, using 4131 is just awesome; :-)
 - for Unixbench, likewise (at least in the runs that I have);
 - for iperf, there are issues;
 - for the two sysbench-es, there are a few issues; cpu is worse
   than OLTP, which is somewhat good, as OLTP is more representative
   of real workloads, while cpu is purely synthetic.

The iperf (and perhaps also the OLTP) results are what make me want to
re-try everything with 4147. In fact, I expect (although this is
nothing more than speculation) that allowing the load balancer to act
upon a Linux task's wakeup has the potential of making things better
for that kind of workload. We shall see whether that affects the other
ones that much...

So, now I need to go to bed (especially considering that I've been up
since 4:30 AM, to catch a flight). I'll continue looking at and
working on this... in the meanwhile, feel free to voice your
opinions. :-D

Thanks and Regards,
Dario

[1] There must have been an issue with Unixbench runs, and I don't have
the numbers from all the configurations I wanted to test, so I'm
attaching what I have, and re-running.

[2] We can aggregate data like this because of the homogeneous and
synchronous nature of the benchmarks themselves. Another interesting
analysis that could be attempted is about fairness, i.e., checking by
how much the performance of the guests involved in each instance
varies among them (ideally, very little). I haven't done this kind of
analysis yet.
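
(A minimal sketch of what I have in mind for that fairness check, with made-up
per-guest scores, just to show the computation; build with -lm:)

/* Relative standard deviation of the per-guest scores of one instance. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double score[] = { 1662.2, 1640.8, 1671.5, 1655.0 };  /* one per guest (made up) */
    const int n = sizeof(score) / sizeof(score[0]);
    double mean = 0.0, var = 0.0;
    int i;

    for (i = 0; i < n; i++)
        mean += score[i] / n;
    for (i = 0; i < n; i++)
        var += (score[i] - mean) * (score[i] - mean) / n;
    printf("mean %.1f, rel. stddev %.2f%%\n", mean, sqrt(var) / mean * 100.0);
    return 0;
}
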
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: mainline-vs-4131.txt
Description: Text document

Attachment: mainline-vs-4131_unixbench.txt
Description: Text document

