Re: [Xen-devel] schedulers and topology exposing questions
On Wed, Jan 27, 2016 at 02:01:35PM +0000, Dario Faggioli wrote:
> On Fri, 2016-01-22 at 11:54 -0500, Elena Ufimtseva wrote:
> > Hello all!
> >
> Hey, here I am again,
>
> > Konrad came up with a workaround that was setting the flag for the domain
> > scheduler in linux.
> > As the guest is not aware of SMT-related topology, it has a flat
> > topology initialized.
> > The kernel has the domain scheduler flags for the CPU scheduling domain
> > set to 4143 for 2.6.39.
> > Konrad discovered that changing the flag for the CPU sched domain to 4655
> >
> So, as you've seen, I also have been doing quite a bit of benchmarking
> on something similar (I used more recent kernels, and decided to test
> 4131 as flags).
>
> In your case, according to this:
>  http://lxr.oss.org.cn/source/include/linux/sched.h?v=2.6.39#L807
>
> 4655 means:
>  SD_LOAD_BALANCE        |
>  SD_BALANCE_EXEC        |
>  SD_BALANCE_WAKE        |
>  SD_PREFER_LOCAL        | [*]
>  SD_SHARE_PKG_RESOURCES |
>  SD_SERIALIZE
>
> and another bit (0x4000) that I don't immediately see what it is.
>
> Things have changed a bit since then, it appears. However, I'm quite sure
> I've tested turning on SD_SERIALIZE in 4.2.0 and 4.3.0, and the results
> were really pretty bad (as you also seem to say later).
>
> > works as a workaround and makes Linux think that the topology has SMT
> > threads.
> >
> Well, yes and no. :-) I don't want to make this all a terminology
> bunfight; something that also matters here is how many scheduling
> domains you have.
>
> To check that (in recent kernels, at least), look here:
>
>  ls /proc/sys/kernel/sched_domain/cpu2/ (any cpu is ok)
>
> and see how many domain[0-9] you have.
>
> On baremetal, on an HT cpu, I've got this:
>
> $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
> SMT
> MC
>
> So, two domains, one of which is the SMT one. If you check their flags,
> they're different:
>
> $ cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags
> 4783
> 559
>
> So, yes, you are right in saying that 4655 is related to SMT. In fact,
> it is what (among other things) tells the load balancer that *all* the
> cpus (well, all the scheduling groups, actually) in this domain are SMT
> siblings... which is a legitimate thing to do, but it's not what
> happens on SMT baremetal.
>
> At least it is consistent, IMO. I.e., it still creates a pretty flat
> topology, as if there were one big core of which _all_ the vcpus are
> part, as SMT siblings.
>
> The other option (the one I'm leaning toward) was to get rid of that
> one flag. I've only done preliminary experiments with it on and off,
> and the ones with it off looked better, so I kept it off for the big
> run... but we can test with it again.
>
> > This workaround makes the test complete in almost the same time as on
> > baremetal (or insignificantly worse).
> >
> > This workaround is not suitable for kernels of higher versions, as we
> > discovered.
> >
> There may be more than one reason for this (as said, a lot has changed!)
> but it matches what I've found when SD_SERIALIZE was kept on for the
> scheduling domain where all the vcpus are.
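For quick experiments with these values, here is a minimal sketch of a
decoder, assuming the bit values from 2.6.39's include/linux/sched.h (later
kernels renamed and renumbered some of these, so adjust accordingly; note
that the files under /proc/sys/kernel/sched_domain print the flags word in
decimal):

    #!/bin/sh
    # decode_sd_flags.sh -- print the SD_* names set in a sched_domain
    # flags word. Bit values assumed from include/linux/sched.h as of
    # 2.6.39 (SD_LOAD_BALANCE == 0x0001 ... SD_PREFER_SIBLING == 0x1000).
    flags=${1:?usage: decode_sd_flags.sh <flags-in-decimal>}
    i=0
    for name in SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC \
                SD_BALANCE_FORK SD_BALANCE_WAKE SD_WAKE_AFFINE \
                SD_PREFER_LOCAL SD_SHARE_CPUPOWER SD_POWERSAVINGS_BALANCE \
                SD_SHARE_PKG_RESOURCES SD_SERIALIZE SD_ASYM_PACKING \
                SD_PREFER_SIBLING; do
        # Print the flag's name if its bit is set in the given word.
        [ $(( flags & (1 << i) )) -ne 0 ] && echo "$name"
        i=$(( i + 1 ))
    done

Running it on both 4143 and 4655 shows exactly which bits the workaround
toggles, at least under a decimal reading of those figures.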
> > The hackish way of making domU linux think that it has SMT threads
> > (along with a matching cpuid)
> > made us think that the problem comes from the fact that the cpu
> > topology is not exposed to the guest, and the Linux scheduler cannot
> > make intelligent scheduling decisions.
> >
> As said, I think it's the other way around: we expose too much of it
> (and this is more of an issue for PV than for HVM). Basically, either
> you do the pinning you're doing or, whatever you expose, will be
> *wrong*... and the only way to expose non-wrong data is to actually
> not expose anything! :-)
>
> > The test described above was labeled as an IO-bound test.
> >
> > We have run the io-bound test with and without smt-patches. The
> > improvement compared to the base case (no smt patches, flat topology)
> > shows a 22-23% gain.
> >
> I'd be curious to see the content of the /proc/sys/kernel/sched_domain
> directory and subdirectories with Joao's patches applied.
>
> > While we have seen improvement with io-bound tests, the same did not
> > happen with the cpu-bound workload.
> > As the cpu-bound test we use a kernel module which runs a requested
> > number of kernel threads, and each thread compresses and decompresses
> > some data.
> >
> That is somewhat what I would have expected, although to what extent
> is hard to tell in advance.
>
> It also matches my findings, both for the results I've already shared
> on list, and for others that I'll be sharing in a bit.
>
> > Here is the setup for the tests:
> > Intel Xeon E5 2600
> > 8 cores, 25MB cache, 2 sockets, 2 threads per core.
> > Xen 4.4.3, default timeslice and ratelimit.
> > Kernels: 2.6.39, 4.1.0, 4.3.0-rc7+.
> > Dom0: kernel 4.1.0, 2 vcpus, not pinned.
> > DomU has 8 vcpus (except in some cases).
> >
> > For io-bound tests, results were better with the smt patches applied,
> > for every kernel.
> >
> > For the cpu-bound test, the results differed depending on whether the
> > vcpus were pinned and on how many vcpus were assigned to the guest.
> >
> Right. In general, this also makes sense... Can we see the actual
> numbers? I mean the results of the tests with improvements/regressions
> highlighted, in addition to the traces that you already shared?
>
> > Please take a look at the graphs captured by xentrace -e 0x0002f000.
> > On the graphs, X is the time in seconds since xentrace start, Y is
> > the pcpu number; the graph itself represents the event when the
> > scheduler places a vcpu on a pcpu.
> >
> > The graphs #1 & #2:
> > trace_iobound_nosmt_dom0notpinned.out.plot.err.png - io-bound test,
> > one client/server
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err.png - cpu-bound test,
> > 8 kernel threads
> > config: domu, 8 vcpus not pinned, smt patches not applied, 2.6.39
> > kernel.
> >
> Ok, so this is the "baseline", the result of just running your tests
> with a pretty standard Xen, Dom0 and DomU status and configuration,
> right?
>
> > As can be seen here, the scheduler places the vcpus correctly on
> > empty cores.
> > As seen on both, vcpu0 gets scheduled on pcpu 31. Why is this?
> > Take a look at
> > trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png
> > where I split the data per vcpu.
> >
> Well, why not, I would say? I mean, where a vcpu starts to run at an
> arbitrary point in time, especially if the system was otherwise idle
> before, can be considered random (it's not, it depends on both the
> vcpu's and the system's previous history, but in a non-linear way, and
> that is not in the graph anyway).
>
> In any case, since there are idle cores, the fact that vcpus do not
> move much, even if they're not pinned, I consider a good thing, don't
> you? If vcpuX wakes up on processor Y, where it has always run before,
> and it finds out it can still run there, migrating somewhere else
> would be pure overhead.
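For anyone wanting to reproduce graphs like the ones above, the capture
pipeline is roughly the following (a sketch; file names are placeholders,
the 30-second duration is arbitrary, and the `formats` description file
ships in the Xen source tree under tools/xentrace/):

    # 0x0002f000 is TRC_SCHED, i.e. all scheduler-class trace events.
    xentrace -e 0x0002f000 trace.bin &
    sleep 30                  # let the benchmark run while tracing
    kill %1
    # Turn the binary records into text; each line then carries a
    # timestamp, the event name, and its data (dom:vcpu, pcpu, ...).
    xentrace_format formats < trace.bin > trace.txt

The vcpu-placement events in trace.txt are what get plotted as (time, pcpu)
points in the graphs discussed here.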
>
> The only potential worry of mine about
> trace_cpuboud_nosmt_dom0notpinned.out.plot.err_pervcpu.nodom0.png is
> that vcpus 4 and 7 (or 4 and 2, the colors are too similar to be sure)
> run for some time (the burst around t=17) on pcpus 5 and 6. Are these
> two pcpus SMT siblings? Doing the math myself on the pCPU IDs, I don't
> think they are, so all would be fine. If they are, that should not
> happen.
>
> However, you're using 4.4, so even if you had an issue there, we don't
> know if it's still in staging.
>
> In any case, and just to be sure, can you produce the output of `xl
> vcpu-list' while this case is running?
>
> > Now to the cpu-bound tests.
> > When the smt patches are applied, the vcpus are pinned correctly to
> > match the topology, and the guest is made aware of the topology, the
> > cpu-bound tests did not show improvement with kernel 2.6.39.
> > With an upstream kernel we see some improvements. The test was
> > repeated 5 times back to back.
> >
> Again, 'some' being?
>
> > The number of vcpus was increased to 16 to match the test case where
> > linux was not aware of the topology and assumed all cpus were cores.
> >
> > On some iterations one can see that the vcpus are being scheduled as
> > expected.
> > For some runs the vcpus are placed on the same core (core/thread)
> > (see trace_cpu_16vcpus_8threads_5runs.out.plot.err.png).
> > It doubles the time it takes for the test to complete (the first
> > three runs show close-to-baremetal execution time).
> >
> No, sorry, I don't think I fully understood this part. So:
>  1. can you point me at where (time equal to ?) what you are saying
>     happens?
>  2. more important, you are saying that the vcpus are pinned. If you
>     pin the vcpus they just should not move. Period. If they move,
>     it's a bug, no matter where they go and whether the other SMT
>     sibling of the pcpu where they go is busy or idle! :-O
>
>     So, are you saying that you pinned the vcpus of the guest and you
>     see them moving and/or not being _always_ scheduled where you
>     pinned them? Can we see `xl vcpu-list' again, to see how they're
>     actually pinned?
>
> > END: cycles: 31209326708 (29 seconds)
> > END: cycles: 30928835308 (28 seconds)
> > END: cycles: 31191626508 (29 seconds)
> > END: cycles: 50117313540 (46 seconds)
> > END: cycles: 49944848614 (46 seconds)
> >
> > Since the vcpus are pinned, my guess is that the Linux scheduler
> > makes wrong decisions?
> >
> Ok, so now it seems to me that you agree that the vcpus don't have
> many alternatives.
>
> If yes (which would be of great relief for me :-) ), it could indeed
> be that the Linux scheduler is working suboptimally.
>
> Perhaps it's worth trying to run the benchmark inside the guest with
> Linux's threads pinned to the vcpus. That should give you perfectly
> consistent results over all the 5 runs.
>
> One more thing. You say the guest has 16 vcpus, and that there are 8
> threads running inside it. However, I seem to be able to identify in
> the graphs at least a few vertical lines where more than 8 vcpus are
> running on some pcpu. So, if Linux is working well, and it really only
> has 8 vcpus to place, it would put them on different cores. However,
> if at some point in time it has more than that to place, it will
> necessarily have to 'invade' an already busy core. Am I right in
> seeing those lines, or are my eyes deceiving me? (I think a per-vcpu
> breakup of the graph above, like you did for dom0, would help figure
> this out.)
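As an aside, the 1:1, topology-matching pinning being discussed can be
expressed with xl along these lines (a sketch; the guest name "domu" is a
placeholder, and whether pcpus 2k/2k+1 really are SMT siblings on a given
box should be checked with `xl info -n` first):

    # Pin vcpu N of the guest to pcpu N, so that the SMT pairs exposed
    # to the guest (vcpu 0/1, 2/3, ...) sit on real sibling pairs --
    # assuming the host enumerates siblings as adjacent pcpu IDs.
    for v in $(seq 0 15); do
        xl vcpu-pin domu $v $v
    done
    # Verify: the 'CPU Affinity' column should show one pcpu per vcpu.
    xl vcpu-list domu

With pinning expressed this way, any vcpu observed on a pcpu other than its
own would indeed be the bug Dario describes above.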
> > So I ran the test with smt patches enabled, but not pinned vcpus.
> >
> AFAICT, this does not make much sense. So, if I understood correctly
> what you mean, by doing as you say, you're telling Linux that, for
> instance, vcpu0 and vcpu1 are SMT siblings, but then Xen is free to
> run vcpu0 and vcpu1 at the same time wherever it likes... same core,
> different core on the same socket, different socket, etc.

Correct. I did run this to see what happens in this pseudo-random case.

> This, I would say, brings us back to the pseudo-random situation we
> have by default already, without any patching and any pinning, or just
> to a different variant of it.
>
> > The result also shows the same as above (see
> > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png):
> > Also see the per-cpu graph
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
> >
> Ok. I'll look at this graph more closely, with the aim of showing an
> example of my theory above, as soon as my brain (which is not in its
> best shape today) manages to deal with all the colors (I'm not
> complaining, BTW; there's no other way in which you can show things,
> it's just me! :-D).

At the same time, if you think I can improve the data representation, that
would be awesome!

> > END: cycles: 49740185572 (46 seconds)
> > END: cycles: 45862289546 (42 seconds)
> > END: cycles: 30976368378 (28 seconds)
> > END: cycles: 30886882143 (28 seconds)
> > END: cycles: 30806304256 (28 seconds)
> >
> > I cut the timeslice where it's seen that vcpu0 and vcpu2 run on the
> > same core while other cores are idle:
> >
> > 35v2 9.881103815 7
> > 35v0 9.881104013 6
> >
> > 35v2 9.892746452 7
> > 35v0 9.892746546 6   -> vcpu0 gets scheduled right after vcpu2 on
> >                         the same core
> >
> > 35v0 9.904388175 6
> > 35v2 9.904388205 7   -> same here
> >
> > 35v2 9.916029791 7
> > 35v0 9.916029992 6
> >
> Yes, this, in theory, should not happen. However, our scheduler (like
> Linux's, or any other OS's --perhaps each in its own way--) can't
> always be _instantly_ perfect! In this case, for instance, the SMT
> load balancing logic in Credit1 is triggered:
>  - from outside of sched_credit.c, by vcpu_migrate(), which is called
>    upon in response to a bunch of events, but _not_ at every vcpu
>    wakeup;
>  - from inside sched_credit.c, by csched_vcpu_acct(), if the vcpu has
>    been active for a while.
>
> This means it is not triggered upon each and every vcpu wakeup (it
> might be, but not for the vcpu that is waking up). So, seeing samples
> of a vcpu not being scheduled according to optimal SMT load balancing,
> especially right after it woke up, is to be expected. Then, after a
> while, the logic should indeed trigger (via csched_vcpu_acct()) and
> move the vcpu away to an idle core.
>
> To tell how long the perfect SMT balancing violation lasts, and
> whether or not it happens as a consequence of task wakeups, we need
> more records from the trace file, coming from around the point where
> the violation happens.
>
> Does this make sense to you?

Dario, thanks for the explanations. I am going to verify some numbers, and
I am also collecting more trace data. I am going to send it shortly; sorry
for the delay.

Elena
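On the suggestion above of pinning the benchmark threads inside the guest:
for a userspace version of the workload, that could look like the sketch
below ("worker" is a placeholder for whatever spawns one compute thread;
the actual test here is a kernel module, whose kthreads would need
kthread_bind() or similar inside the module rather than taskset):

    # Inside the guest: one worker per exposed core, pinned to every
    # second vcpu, so that with the SMT topology exposed each worker
    # owns its own "core" (assumes vcpus 2k/2k+1 are the SMT pairs).
    for i in $(seq 0 7); do
        taskset -c $(( 2 * i )) ./worker &
    done
    wait    # placement is now fixed, so runs should be comparable

This removes the in-guest load balancer from the picture, which should make
the 5 runs consistent, as Dario suggests.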
> Regards, and thanks for sharing all this! :-)
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)