[Xen-devel] [benchmarks] Linux scheduling domain *magic* tricks
Hello Oracle chaps, plus George, plus Juergen, plus everyone on xen-devel, :-)

As promised, I'll have a deep look at the tests and benchmark results that Elena dumped on us all ASAP. However, this is only fair if I also spam you with a huge load of numbers over which you can scratch (or bang?!?) your heads, isn't it? :-D

So, here we are. I'm starting a new thread because this is somewhat independent from the topology related side of things, which Elena is talking about (and which myself and Juergen were also investigating and working on already). In fact, Linux's scheduling domains can be configured in a variety of ways, by means of a set of flags (this is normally done during Linux's boot). In a way, everything there is really related to cpu topology (scheduling domains _are_ the Linux scheduler's interface to cpu topology!). But strictly speaking, there are 'pure topology' related flags, and more abstract 'behavioral' flags.

This is the list of these flags, BTW:
http://lxr.free-electrons.com/source/include/linux/sched.h#L981

/*
 * sched-domains (multiprocessor balancing) declarations:
 */
#define SD_LOAD_BALANCE         0x0001  /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE      0x0002  /* Balance when about to become idle */
#define SD_BALANCE_EXEC         0x0004  /* Balance on exec */
#define SD_BALANCE_FORK         0x0008  /* Balance on fork, clone */
#define SD_BALANCE_WAKE         0x0010  /* Balance on wakeup */
#define SD_WAKE_AFFINE          0x0020  /* Wake task to waking CPU */
#define SD_SHARE_CPUCAPACITY    0x0080  /* Domain members share cpu power */
#define SD_SHARE_POWERDOMAIN    0x0100  /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES  0x0200  /* Domain members share cpu pkg resources */
#define SD_SERIALIZE            0x0400  /* Only a single load balancing instance */
#define SD_ASYM_PACKING         0x0800  /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING       0x1000  /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP              0x2000  /* sched_domains of this level overlap */
#define SD_NUMA                 0x4000  /* cross-node balancing */

To check how the scheduling domains are configured (and to change that), look here:

  /proc/sys/kernel/sched_domain/cpu*/domain*/flags

I noticed some oddities in the way Linux's and Xen's schedulers interacted in some cases, and I noticed that changing the 'behavioral' flags had an impact. I did run a preliminary set of experiments with Unixbench, with the following results (hint: look at the "Execl Throughput" and "Process Creation" rows, in the 1x case).
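For whoever wants to map the flag values in the column headers of the tables below back to the SD_* names above, this is a minimal sketch (in Python, just an illustration, not part of the benchmarks) of how I read and decode the flags of each scheduling domain. It assumes CONFIG_SCHED_DEBUG and the /proc layout mentioned above, as found on the 4.x kernels used here.

#!/usr/bin/env python
# Minimal sketch: read /proc/sys/kernel/sched_domain/cpu*/domain*/flags
# and decode each value into the SD_* names from include/linux/sched.h.
# Assumes CONFIG_SCHED_DEBUG and the procfs layout of the 4.x kernels
# used for these tests.
import glob

SD_FLAGS = {
    0x0001: "SD_LOAD_BALANCE",
    0x0002: "SD_BALANCE_NEWIDLE",
    0x0004: "SD_BALANCE_EXEC",
    0x0008: "SD_BALANCE_FORK",
    0x0010: "SD_BALANCE_WAKE",
    0x0020: "SD_WAKE_AFFINE",
    0x0080: "SD_SHARE_CPUCAPACITY",
    0x0100: "SD_SHARE_POWERDOMAIN",
    0x0200: "SD_SHARE_PKG_RESOURCES",
    0x0400: "SD_SERIALIZE",
    0x0800: "SD_ASYM_PACKING",
    0x1000: "SD_PREFER_SIBLING",
    0x2000: "SD_OVERLAP",
    0x4000: "SD_NUMA",
}

def decode(value):
    # e.g. decode(4143) == ['SD_LOAD_BALANCE', 'SD_BALANCE_NEWIDLE',
    #   'SD_BALANCE_EXEC', 'SD_BALANCE_FORK', 'SD_WAKE_AFFINE',
    #   'SD_PREFER_SIBLING']
    return [name for bit, name in sorted(SD_FLAGS.items()) if value & bit]

for path in sorted(glob.glob("/proc/sys/kernel/sched_domain/cpu*/domain*/flags")):
    with open(path) as f:
        value = int(f.read())
    print("%s: %d = %s" % (path, value, " | ".join(decode(value))))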
# ./Run -c 1 (1 parallel copy of each benchmark inside a 4 vcpus HVM guest)

    Flags                                      4143     4135     4131     4151     4147     4115     4099     4128
1 x Dhrystone 2 using register variables     2299.0   2298.4   2302.0   2311.4   2312.1   2312.1   2299.2   2301.6
1 x Double-Precision Whetstone                619.5    619.5    619.8    619.0    619.0    619.1    619.2    619.6
1 x Execl Throughput                          458.0    449.6   1017.0    449.4   1012.1   1017.4   1018.2   1022.6
1 x File Copy 1024 bufsize 2000 maxblocks    2188.8   2317.4   2403.1   2412.5   2420.8   2423.8   2422.7   2430.5
1 x File Copy 256 bufsize 500 maxblocks      1459.7   1576.1   1648.3   1647.7   1649.4   1663.5   1652.4   1649.0
1 x File Copy 4096 bufsize 8000 maxblocks    3467.8   3581.9   3621.1   3624.7   3635.9   3619.8   3606.1   3608.8
1 x Pipe Throughput                          1518.3   1505.3   1519.0   1514.7   1518.9   1516.5   1517.2   1518.0
1 x Pipe-based Context Switching              803.7    798.7    801.8    801.4    797.9    132.9     92.0    809.7
1 x Process Creation                          404.3    931.8    942.5    950.4    932.7    967.4    960.1    962.7
1 x Shell Scripts (1 concurrent)             1304.4   1256.4   1755.1   1259.5   1756.5   1741.3   1726.0   1819.6
1 x Shell Scripts (8 concurrent)             4564.2   4704.1   4714.0   4691.8   4710.2   4570.8   4571.0  1694.6*
1 x System Call Overhead                     2251.1   2249.6   2250.1   2248.9   2250.3   2249.9   2251.0   2249.0
    System Benchmarks Index Score            1380.2   1495.1   1662.2   1511.4   1661.5   1431.8   1384.9   1536.5
    +/-                                       0.00%   +8.32%  +20.43%   +9.51%  +20.38%   +3.74%   +0.34%  +11.32%

# ./Run -c 4 (4 parallel copies of each benchmark inside a 4 vcpus HVM guest)

    Flags                                      4143     4135     4131     4151     4147     4115     4099     4128
4 x Dhrystone 2 using register variables     8619.4   8551.3   8661.7   8694.1   8731.8   8578.0   8591.7   2293.4
4 x Double-Precision Whetstone               2351.8   2348.9   2352.2   2352.8   2351.7   2351.3   2352.4   2470.6
4 x Execl Throughput                         3264.3   3346.9   3743.0   3365.7   3726.6   3745.7   3759.2   1017.0
4 x File Copy 1024 bufsize 2000 maxblocks    2741.7   2789.7   2871.5   2842.5   2793.5   2935.9   2846.5   2376.7
4 x File Copy 256 bufsize 500 maxblocks      1736.6   1754.9   1841.4   1815.1   1763.5   1829.7   1836.6   1579.4
4 x File Copy 4096 bufsize 8000 maxblocks    4457.1   4461.9   4284.4   4566.8   4476.7   4619.0   4815.9   3648.2
4 x Pipe Throughput                          5724.0   5719.3   5732.4   5747.6   5747.2   5720.8   5740.1   1509.6
4 x Pipe-based Context Switching             2847.8   2841.9   2831.7   2826.2   2844.5   2433.2   2832.3    745.2
4 x Process Creation                         1863.1   3383.8   3358.6   3365.9   3339.1   3206.9   3338.2    924.7
4 x Shell Scripts (1 concurrent)             5126.8   4992.7   6739.1   4973.7   6773.8   6770.9   6806.4   1823.5
4 x Shell Scripts (8 concurrent)             5969.8   6021.7   6258.9   6018.9   6302.5   6284.0   6323.6   1683.6
4 x System Call Overhead                     6647.9   6661.2   6672.6   6669.9   6665.0   6641.8   6649.9   2244.1
    System Benchmarks Index Score            3786.1   3987.8   4155.4   4018.1   4151.5   4116.1   4195.1   1695.8
    +/-                                       0.00%   +5.33%   +9.75%   +6.13%   +9.65%   +8.72%  +10.80%  -55.21%

(The +/- rows show the change of each column's System Benchmarks Index Score relative to the 4143 default.)

4131 and 4147 are the ones that looked more promising. Linux's default for an HVM guest is 4143:

  LOAD_BALANCE | BALANCE_NEWIDLE | BALANCE_EXEC | BALANCE_FORK | WAKE_AFFINE | PREFER_SIBLING

Using 4131 means:

  LOAD_BALANCE | BALANCE_NEWIDLE | WAKE_AFFINE | PREFER_SIBLING

4147 means:

  LOAD_BALANCE | BALANCE_NEWIDLE | BALANCE_WAKE | WAKE_AFFINE | PREFER_SIBLING

For now, I focused on 4131 (as the results of other ad-hoc benchmarks were hinting at that), but I want to investigate 4147 too (see below). Basically, with 4131 as the scheduling domain's flags (there's only one domain in one of our HVM guests right now), I'm telling the Linux scheduler that its load balancing logic should *not* trigger as a result of fork() or exec(), nor when a task wakes up (seems a bit aggressive, but still...).

So, I arranged for comparing the performance of the default set of flags with 4131, in an extensive way. Here's what I did (it's a bit of a long explanation, but it's for making sure you know what each benchmarking configuration did).

I selected the following benchmarks:
 - makexen: how long it takes to compile Xen
            (results: lower == better)
 - iperf: iperf from guest(s) toward the host
          (results: higher == better)
 - sysbench-cpu: pure number crunching
                 (results: lower == better)
 - sysbench-oltp: concurrent database transactions
                  (results: higher == better)
 - unixbench: runs a set of tests and computes a global perf index
              (results: higher == better [1])

The actual workload was always run in guests. A varying number of HVM guests was used, and the number of vcpus and the amount of memory of the guests also varied. All the benchmarks were "homogeneous", i.e., all the guests used in a particular instance were equal (in terms of number of vcpus and amount of RAM), and all ran the same workload. They were also "synchronous", i.e., all the guests started running the workload at the same time. Each benchmark was repeated 5 times. This first set of results shows the average of all the output samples of all the iterations from all the guests involved in each benchmark. [2]

The benchmarks were run on a 24 pCPUs host (arranged in 2 NUMA nodes), with 32GB of RAM. The Xen version was always the same (what staging was a few weeks ago). The Linux dom0 kernel was 4.3.0; the guests' kernels were 4.2.0. Results are collected for just the default case, and for the case where I reconfigured the guests' scheduling domain flags to 4131 (yes, only the flags of the guests for now; I can rerun changing dom0's flags as well).

A particular benchmark is characterized as follows:
 * host load: basically, how many guest vcpus were being used. It can
              be sequential, small, medium, large, full, overload or
              overwhelmed
 * guest size: how big (in terms of vcpus and memory) the guests were.
               It can be sequential, small, medium or large
 * guest load: how busy the guests were kept. It can be sequential,
               moderate, full or overbooked.
In some more detail:
 - host load:
    * sequential load means there is only 1 VM
    * small load means the total number of guest vcpus was ~1/3 of
      the host pcpus (i.e., 24/3 = 8 vcpus)
    * medium load means the total number of guest vcpus was ~1/2 of the
      host pcpus (i.e., 12, but it was 16 some of the times)
    * large load means the total number of guest vcpus was ~2/3 of the
      host pcpus (i.e., 16, but it was actually 20)
    * full load means the total number of guest vcpus was == to the
      host pcpus (i.e., 24)
    * overload means the total number of guest vcpus was ~1.5x the host
      pcpus (i.e., 36, but it was 32 some of the times)
    * overwhelmed means the total number of guest vcpus was 2x the host
      pcpus (i.e., 48)
 - guest size:
    * sequential guests had 1 vcpu and 2048MB of RAM
    * small guests had 4 vcpus and 4096MB of RAM
    * medium guests had 8 vcpus and 5120MB of RAM
    * large guests had 12 vcpus and 8192MB of RAM
 - guest load:
    * sequential means the benchmark was run sequentially (e.g.,
      make -j1, unixbench -c1, sysbench --num-threads=1, etc.)
    * moderate means the benchmark was keeping half of the guest's
      vcpus busy (e.g., make -j4 in an 8 vcpus guest)
    * full means the benchmark was keeping all the guest's vcpus busy
      (e.g., make -j8 in an 8 vcpus guest)
    * overbooked means the benchmark was running with a 2x degree of
      parallelism wrt the guest's vcpus (e.g., make -j16 in an 8 vcpus
      guest)

Combining these three 'parameters', several benchmark configurations were generated. Not all the possible combinations are meaningful, but I ended up with quite a few cases. Let me just put down a couple of examples, to make sure what happened during a particular benchmark can be fully understood.

So, for instance, the sysboltp-smallhload-smallvm-fullwkload benchmark was:
 + running sysbench --oltp
 + running it inside 2 HVM guests at the same time (small host load)
 + the guests had 4 vcpus (small vms)
 + running it with --num-threads=4 (full workload)

Another one, makexen-overldhload-medvm-modrtwkload, was:
 + running a Xen build
 + running it in 4 HVM guests at the same time (32 vcpus total,
   overloaded host)
 + the guests had 8 vcpus (medium vms)
 + running it with -j4 (moderate workload)

And so on and so forth. The first column in the attached results files identifies the specific benchmark configuration, according to this characterization. The other columns show, for each workload, the performance with the default flags and with 4131, and the percent increase we got by using 4131. Of course, when lower is better (like in makexen and the sysbench-es) an actual increase is a bad thing. In fact, I highlighted with an '*' the instances where changing the flags to 4131 caused a regression.

I'm too tired right now to do an appropriate analysis, but just very quickly:
 - for Xen build-alike workloads, using 4131 is just awesome; :-)
 - for Unixbench, likewise (at least in the runs that I have);
 - for iperf, there are issues;
 - for the two sysbench-es, there are a few issues; cpu is worse than
   OLTP, which is somewhat good, as OLTP is more representative of real
   workloads, while cpu is purely synthetic.

The iperf (and perhaps also the OLTP) results are what makes me want to re-try everything with 4147. In fact, I expect (although it's nothing more than speculation) that allowing the load balancer to act upon a Linux task's wakeup has the potential of making things better for that kind of workload. We shall see whether that affects the other ones that much...
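In case anyone fancies repeating some of these runs with 4147 (or any other value) themselves, this is, more or less, the kind of little helper I use for flipping the flags inside a guest. Treat it as a sketch only: it assumes CONFIG_SCHED_DEBUG and that the flags files under /proc are writable, as they are on the 4.2/4.3 kernels used here.

#!/usr/bin/env python
# Sketch only: write the same flags value (e.g. 4131 or 4147) into every
# scheduling domain of every (v)cpu. Run it as root inside the guest;
# it assumes the /proc/sys/kernel/sched_domain/ hierarchy is present
# (CONFIG_SCHED_DEBUG) and that the 'flags' files are writable.
import glob
import sys

def set_sched_domain_flags(value):
    paths = sorted(glob.glob("/proc/sys/kernel/sched_domain/cpu*/domain*/flags"))
    if not paths:
        sys.exit("no scheduling domains found (is CONFIG_SCHED_DEBUG enabled?)")
    for path in paths:
        with open(path, "w") as f:
            f.write(str(value))
        print("%s <- %d" % (path, value))

if __name__ == "__main__":
    # e.g.:  ./set-sd-flags.py 4147
    set_sched_domain_flags(int(sys.argv[1]))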
So, now I need to go to bed (especially considering that I've been up since 4:30 AM, to catch a flight). I'll continue looking at and working on this... in the meantime, feel free to voice your opinions. :-D

Thanks and Regards,
Dario

[1] There must have been an issue with the Unixbench runs, and I don't have the numbers from all the configurations I wanted to test, so I'm attaching what I have, and re-running.

[2] We can aggregate data like this because of the homogeneous and synchronous nature of the benchmarks themselves. Another interesting analysis that could be attempted is about fairness, i.e., we can check by how much the performance of each of the guests involved in each instance varies between them (ideally, very little). I haven't done this kind of analysis yet.

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Attachment: mainline-vs-4131.txt
Attachment: mainline-vs-4131_unixbench.txt