
Re: [Xen-devel] schedulers and topology exposing questions



On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote:
> On 22/01/16 16:54, Elena Ufimtseva wrote:
> > Hello all!
> > 
> > Dario, George, or anyone else, your help would be appreciated.
> > 
> > Let me give some intro to our findings. I may forget something or not be
> > explicit enough, so please ask me.
> > 
> > A customer filed a bug where some of their applications were running
> > slowly in their HVM DomU setups.
> > The run times were compared against bare metal running the same kernel
> > version as the HVM DomU.
> > 
> > After some investigation by different parties, a test case scenario was
> > found where the problem was easily seen. The test app is a UDP
> > server/client pair where the client passes some message n number of
> > times.
> > The test case was executed on bare metal and in a Xen DomU with kernel
> > version 2.6.39.
> > Bare metal showed a 2x better result than DomU.
> > 
> > Konrad came up with a workaround: setting the flags for the scheduling
> > domain in the Linux guest.
> > As the guest is not aware of SMT-related topology, it initializes a flat
> > topology.
> > In 2.6.39 the kernel sets the scheduling-domain flags for the CPU
> > scheduling domain to 4143.
> > Konrad discovered that changing the CPU scheduling domain's flags to
> > 4655 works as a workaround and makes Linux think that the topology has
> > SMT threads.
> > This workaround makes the test complete in almost the same time as on
> > bare metal (or insignificantly worse).
> > 
> > As we discovered, this workaround is not suitable for newer kernel
> > versions.
> > 
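
(A side note on those two numbers: 4655 - 4143 = 512 = 0x200, which in
2.6.39's include/linux/sched.h is SD_SHARE_PKG_RESOURCES, i.e. the bit
telling the scheduler that the CPUs in the domain share a cache. The decode
below is only a sketch: the SD_* bit layout varies across kernel versions,
so the names are an assumption tied to 2.6.39.)

```python
# Sketch: decode the scheduling-domain flag values discussed above.
# Bit values assumed to match 2.6.39's include/linux/sched.h SD_* flags.
SD_FLAGS = {
    0x0001: "SD_LOAD_BALANCE",
    0x0002: "SD_BALANCE_NEWIDLE",
    0x0004: "SD_BALANCE_EXEC",
    0x0008: "SD_BALANCE_FORK",
    0x0010: "SD_BALANCE_WAKE",
    0x0020: "SD_WAKE_AFFINE",
    0x0040: "SD_PREFER_LOCAL",
    0x0080: "SD_SHARE_CPUPOWER",
    0x0100: "SD_POWERSAVINGS_BALANCE",
    0x0200: "SD_SHARE_PKG_RESOURCES",
    0x0400: "SD_SERIALIZE",
    0x0800: "SD_ASYM_PACKING",
    0x1000: "SD_PREFER_SIBLING",
}

def decode(flags):
    """Return the SD_* flag names set in the given flags value."""
    return [name for bit, name in sorted(SD_FLAGS.items()) if flags & bit]

# The single bit the workaround adds on top of the default value:
added = set(decode(4655)) - set(decode(4143))
```

On kernels of that era built with SCHED_DEBUG, these flags are exposed under
/proc/sys/kernel/sched_domain/cpu*/domain*/flags, which is presumably where
the workaround poked the new value.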
> > The hackish way of making DomU Linux think that it has SMT threads
> > (along with a matching CPUID)
> > made us think that the problem comes from the fact that the CPU topology
> > is not exposed to the
> > guest, so the Linux scheduler cannot make intelligent scheduling
> > decisions.
> > 
> > Joao Martins from Oracle developed a set of patches that fixes the
> > SMT/core/cache
> > topology numbering and provides matching pinning of vcpus and enabling
> > options,
> > allowing the correct topology to be exposed to the guest.
> > I guess Joao will be posting it at some point.
> > 
> > With these patches we decided to test the performance impact on
> > different kernel versions and Xen versions.
> > 
> > The test described above was labeled as IO-bound test.
> 
> So just to clarify: The client sends a request (presumably not much more
> than a ping) to the server, and waits for the server to respond before
> sending another one; and the server does the reverse -- receives a
> request, responds, and then waits for the next request.  Is that right?

Yes.
> 
> How much data is transferred?

1 packet, UDP
> 
> If the amount of data transferred is tiny, then the bottleneck for the
> test is probably the IPI time, and I'd call this a "ping-pong"
> benchmark[1].  I would only call this "io-bound" if you're actually
> copying large amounts of data.
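
For concreteness, a minimal ping-pong pair of the kind described might look
like the sketch below. This is a reconstruction, not the customer's actual
test program; the host, port, and round count are made up:

```python
import socket
import threading

# Hypothetical endpoint and round count -- not from the original test case.
HOST, PORT = "127.0.0.1", 9999
ROUNDS = 1000  # "client passes some message n number of times"

# Bind before starting the client so no datagram is sent to an unbound port.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind((HOST, PORT))

def server():
    # Receive one datagram per round and echo it straight back.
    for _ in range(ROUNDS):
        data, addr = srv.recvfrom(64)
        srv.sendto(data, addr)

def client():
    # Send one small datagram, then block until the echo arrives before
    # sending the next one -- exactly one packet in flight at a time.
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for _ in range(ROUNDS):
            s.sendto(b"ping", (HOST, PORT))
            data, _ = s.recvfrom(64)
            received += data == b"ping"
    return received

t = threading.Thread(target=server)
t.start()
echoes = client()
t.join()
srv.close()
```

Each round-trip here is latency-bound: almost no data moves, so wakeup
latency dominates the total run time.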

What we found is that on baremetal the scheduler would put both apps
on the same CPU and schedule them right after each other. This would
have a high IPI as the scheduler would poke itself.

On Xen it would put the two applications on separate CPUs - and there
would be hardly any IPI.

Digging deeper in the code I found out that if you do a UDP sendmsg
without any timeouts - it would put the packet on a queue and just call
schedule.

On baremetal the schedule would result in the scheduler picking up the
other task and starting it - which would dequeue the packet immediately.

On Xen - the schedule() would go to HLT.. and then later be woken up by
the VIRQ_TIMER. And since the two applications were on separate CPUs -
the single packet would just stick in the queue until the VIRQ_TIMER
arrived.


I found out that if I expose the SMT topology to the guest (which is
what baremetal sees) suddenly the Linux scheduler would behave the same
way as under baremetal.

To be fair - this is very much a ping-pong, non-CPU-bound workload.

If the amount of communication was huge it would probably behave a bit
differently - as the queue would fill up - and by the time the VIRQ_TIMER
hit the other CPU - it would have a nice chunk of data to eat through.
 
> 
> Regarding placement wrt topology: If two threads are doing a large
> amount of communication, then putting them close in the topology will
> increase performance, because they share cache, and the IPI distance
> between them is much shorter.  If they rarely run at the same time,
> being on the same thread is probably the ideal.

This is a ping-pong type workload - very much serialized.
> 
> On the other hand, if two threads are running mostly independently, and
> each one is using a lot of cache, then having the threads at opposite
> ends of the topology will increase performance, since that will increase
> the aggregate cache used by both.  The ideal in this case would
> certainly be for each thread to run on a separate socket.
> 
> At the moment, neither the Credit1 and Credit2 schedulers take
> communication into account; they only account for processing time, and
> thus silently assume that all workloads are cache-hungry and
> non-communicating.


And this is very much the opposite of that :-)

> 
> [1] https://www.google.co.uk/search?q=ping+pong+benchmark
> 
> > We have run the io-bound test with and without the smt patches. The
> > improvement
> > compared to the base case (no smt patches, flat topology) shows a 22-23%
> > gain.
> > 
> > While we have seen improvement with io-bound tests, the same did not happen 
> > with cpu-bound workload.
> > As the cpu-bound test we use a kernel module which runs a requested
> > number of kernel threads,
> > and each thread compresses and decompresses some data.
> > 
> > Here is the setup for tests:
> > Intel Xeon E5-2600:
> > 8 cores, 25MB cache, 2 sockets, 2 threads per core.
> > Xen 4.4.3, default timeslice and ratelimit
> > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
> > Dom0: kernel 4.1.0, 2 vcpus, not pinned.
> > DomU has 8 vcpus (except some cases).
> > 
> > 
> > For io-bound tests, results were better with the smt patches applied,
> > for every kernel.
> >
> > For the cpu-bound test, the results differed depending on whether the
> > vcpus were pinned or not and how many vcpus were assigned to the guest.
> 
> Looking through your mail, I can't quite figure out if "io-bound tests
> with the smt patches applied" here means "smt+pinned" or just "smt"
> (unpinned).  (Or both.)
> 
> Assuming that the Linux kernel takes process communication into account
> in its scheduling decisions, I would expect smt+pinning to have the kind
> of performance improvement you observe.  I would expect that smt without
> pinning would have very little effect -- or might be actively worse,
> since the topology information would then be actively wrong as soon as
> the scheduler moved the vcpus.
> 
> The fact that exposing topology of the cpu-bound workload didn't help
> sounds expected to me -- the Xen scheduler already tries to optimize for
> the cpu-bound case, so in the [non-smt, unpinned] case probably places
> things on the physical hardware similar to the way Linux places it in
> the [smt, pinned] case.
> 
> > Please take a look at the graphs captured by xentrace -e 0x0002f000.
> > On the graphs, X is time in seconds since xentrace started, Y is the
> > pcpu number,
> > and each point represents the event of the scheduler placing a vcpu on
> > a pcpu.
> > 
> > The graphs #1 & #2:
> > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io-bound test, one
> > client/server pair
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu-bound test, 8
> > kernel threads
> > config: domu, 8 vcpus not pinned, smt patches not applied, 2.6.39
> > kernel.
> > 
> > As can be seen here, the scheduler places the vcpus correctly on empty
> > cores.
> > As seen on both graphs, vcpu0 gets scheduled on pcpu 31. Why is this?
> 
> Well it looks like vcpu0 does the lion's share of the work, while the
> other vcpus more or less share the work.  So the scheduler gives vcpu0
> its own socket (more or less), while the other ones share the other
> socket (optimizing for maximum cache usage).
> 
> > Take a look at 
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
> > where I split data per vcpus.
> > 
> > 
> > Now to the cpu-bound tests.
> > With the smt patches applied, the vcpus pinned correctly to match the
> > topology, and
> > the guest aware of the topology, the cpu-bound tests did not show
> > improvement with kernel 2.6.39.
> > With an upstream kernel we see some improvements. The test was repeated
> > 5 times back to back.
> > The number of vcpus was increased to 16 to match the test case where
> > Linux was not
> > aware of the topology and assumed all cpus were cores.
> >  
> > On some iterations one can see that the vcpus are being scheduled as
> > expected.
> > For some runs the vcpus are placed on the same core (core/thread pair)
> > (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
> > That doubles the time it takes for the test to complete (the first three
> > runs show close to bare-metal execution time).
> > 
> > END: cycles: 31209326708 (29 seconds)
> > END: cycles: 30928835308 (28 seconds)
> > END: cycles: 31191626508 (29 seconds)
> > END: cycles: 50117313540 (46 seconds)
> > END: cycles: 49944848614 (46 seconds)
> > 
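
(A quick consistency check on the numbers above: every run works out to
roughly 1.08e9 cycles per reported second, so the 46-second runs really did
take proportionally longer; it is not the counter drifting. A trivial sketch
of that arithmetic:)

```python
# Cycle counts and reported seconds from the five runs quoted above.
runs = [
    (31209326708, 29),
    (30928835308, 28),
    (31191626508, 29),
    (50117313540, 46),
    (49944848614, 46),
]

# Cycles per reported second for each run; the rate should be stable.
ratios = [cycles / secs for cycles, secs in runs]
spread = max(ratios) / min(ratios)  # how much the rate varies across runs
```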
> > Since the vcpus are pinned, my guess is that the Linux scheduler is
> > making wrong decisions?
> 
> Hmm -- could it be that the logic detecting whether the threads are
> "cpu-bound" (and thus want their own cache) vs "communicating" (and thus
> want to share a thread) is triggering differently in each case?
> 
> Or maybe neither is true, and placement from the Linux side is more or
> less random. :-)
> 
> > So I ran the test with the smt patches enabled but the vcpus not
> > pinned.
> > 
> > The result shows the same as above (see
> > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png).
> > Also see the per-cpu graph 
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
> > 
> > END: cycles: 49740185572 (46 seconds)
> > END: cycles: 45862289546 (42 seconds)
> > END: cycles: 30976368378 (28 seconds)
> > END: cycles: 30886882143 (28 seconds)
> > END: cycles: 30806304256 (28 seconds)
> > 
> > I cut the timeslice where it can be seen that vcpu0 and vcpu2 run on
> > the same core while other cores are idle:
> > 
> > 35v2 9.881103815 7
> > 35v0 9.881104013 6   -> vcpu0 gets scheduled right after vcpu2 on same core
> > 35v2 9.892746452 7
> > 35v0 9.892746546 6   -> same here
> > 35v0 9.904388175 6
> > 35v2 9.904388205 7
> > 35v2 9.916029791 7
> > 35v0 9.916029992 6
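
The excerpt lines have the shape "<dom>v<vcpu> <time> <pcpu>". A small
parser like the one below makes the same-core co-scheduling easy to spot
programmatically; the format is inferred from the excerpt, not from any
documented xentrace output, so treat it as an assumption:

```python
import re

# "<dom>v<vcpu> <time> <pcpu>", as the lines appear in the excerpt above.
LINE = re.compile(r"^(\d+)v(\d+)\s+([\d.]+)\s+(\d+)$")

def parse(lines):
    """Return (time, dom, vcpu, pcpu) tuples for lines that match."""
    events = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            dom, vcpu, t, pcpu = m.groups()
            events.append((float(t), int(dom), int(vcpu), int(pcpu)))
    return events

sample = [
    "35v2 9.881103815 7",
    "35v0 9.881104013 6",
]
events = parse(sample)
```

With events in hand, flagging back-to-back placements on sibling pcpus
(here 6 and 7, which per the discussion above are two threads of one core)
is a simple scan over consecutive tuples.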
> > 
> > Disabling the smt option in the Linux config (which essentially means
> > the guest does not
> > have the correct topology, just a flat one) shows slightly better
> > results - there
> > are no core/thread pairs being scheduled together while other cores are
> > empty.
> > 
> > END: cycles: 41823591845 (38 seconds)
> > END: cycles: 41105093568 (38 seconds)
> > END: cycles: 30987224290 (28 seconds)
> > END: cycles: 31138979573 (29 seconds)
> > END: cycles: 31002228982 (28 seconds)
> > 
> > and graph is attached 
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png).
> 
> This is a bit strange.  You're showing that for *unpinned* vcpus, with
> empty cores, there are vcpus sharing the same thread for significant
> periods of time?  That definitely shouldn't happen.
> 
> It looks like you still have a fairly "bimodal" distribution even in the
> "no-smt unpinned" scenario -- just 28<->38 rather than 28<->45-ish.
> 
> Could you try a couple of these tests with the credit2 scheduler, just
> to see?  You'd have to make sure and use one of the versions that has
> hard pinning enabled; I don't think that made 4.6, so you'd have to use
> xen-unstable I think.
> 
> > I may have forgotten something here.. Please ask me questions if I did.
> > 
> > Maybe you have some ideas what can be done here? 
> > 
> > We are trying to make guests topology-aware, but it looks like for
> > cpu-bound workloads
> > it's not that easy.
> > Any suggestions are welcome.
> 
> Well one option is always, as you say, to try to expose the topology to
> the guest.  But that is a fairly limited solution -- in order for that
> information to be accurate, the vcpus need to be pinned, which in turn
> means 1) a lot more effort required by admins, and 2) a lot less
> opportunity for sharing of resources which is one of the big 'wins' for
> virtualization.
> 
> The other option is, as Dario said, to remove all topology information
> from Linux, and add functionality to the Xen schedulers to attempt to
> identify vcpus which are communicating or sharing in some other way, and
> try to co-locate them.  This is a lot easier and more flexible for
> users, but a lot more work for us.
> 
>  -George
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

