Re: [Xen-devel] schedulers and topology exposing questions
On Wed, 2016-01-27 at 11:03 -0500, Elena Ufimtseva wrote:
> On Wed, Jan 27, 2016 at 10:27:01AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jan 27, 2016 at 03:10:01PM +0000, George Dunlap wrote:
> > > On 27/01/16 14:33, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, Jan 26, 2016 at 11:21:36AM +0000, George Dunlap wrote:
> > > > > On 22/01/16 16:54, Elena Ufimtseva wrote:
> > > > > > Hello all!
> > > > > >
> > > > > > Dario, George or anyone else, your help will be appreciated.
> > > > > >
> > > > > > Let me give some intro to our findings. I may forget something
> > > > > > or not explain something explicitly enough - please ask me.
> > > > > >
> > > > > > A customer filed a bug where some of the applications were
> > > > > > running slowly in their HVM DomU setups.
> > > > > > These running times were compared against bare metal running
> > > > > > the same kernel version as the HVM DomU.
> > > > > >
> > > > > > After some investigation by different parties, a test case was
> > > > > > found where the problem was easily seen. The test app is a UDP
> > > > > > server/client pair where the client passes a message n number
> > > > > > of times.
> > > > > > The test case was executed on bare metal and on a Xen DomU,
> > > > > > both with kernel version 2.6.39.
> > > > > > Bare metal showed a 2x better result than DomU.
> > > > > >
> > > > > > Konrad came up with a workaround: setting the flags of the
> > > > > > Linux guest's CPU scheduling domain.
> > > > > > As the guest is not aware of SMT-related topology, it
> > > > > > initializes a flat topology.
> > > > > > In 2.6.39 the kernel sets the CPU scheduling domain's flags
> > > > > > to 4143.
> > > > > > Konrad discovered that changing the CPU sched domain flags to
> > > > > > 4655 works as a workaround and makes Linux think that the
> > > > > > topology has SMT threads.
> > > > > > This workaround makes the test complete in almost the same
> > > > > > time as on bare metal (or insignificantly worse).
> > > > > >
> > > > > > This workaround is not suitable for kernels of higher
> > > > > > versions, as we discovered.
> > > > > >
> > > > > > The hackish way of making DomU Linux think that it has SMT
> > > > > > threads (along with a matching cpuid) made us think that the
> > > > > > problem comes from the fact that the CPU topology is not
> > > > > > exposed to the guest, so the Linux scheduler cannot make
> > > > > > intelligent scheduling decisions.
> > > > > >
> > > > > > Joao Martins from Oracle developed a set of patches that fixes
> > > > > > the SMT/core/cache topology numbering and provides matching
> > > > > > pinning of vCPUs and enabling options, allowing the correct
> > > > > > topology to be exposed to the guest.
> > > > > > I guess Joao will be posting it at some point.
> > > > > >
> > > > > > With these patches we decided to test the performance impact
> > > > > > on different kernel and Xen versions.
> > > > > >
> > > > > > The test described above was labeled as the IO-bound test.
> > > > >
> > > > > So just to clarify: the client sends a request (presumably not
> > > > > much more than a ping) to the server, and waits for the server
> > > > > to respond before sending another one; and the server does the
> > > > > reverse -- receives a request, responds, and then waits for the
> > > > > next request.  Is that right?
> > > >
> > > > Yes.
> > > >
> > > > > How much data is transferred?
> > > >
> > > > 1 packet, UDP
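[The exact client/server program from the bug report is not part of this
thread; for anyone wanting to reproduce the shape of the workload, a
minimal stand-in - one tiny UDP datagram bounced back and forth a
configurable number of times - could look like the sketch below. The
file name, port number, payload size and iteration count are arbitrary
illustration choices, not taken from the original test.]

/*
 * udp_pingpong.c - minimal stand-in for the UDP server/client pair
 * described above: the client sends one small datagram, waits for the
 * echo, and repeats; the server echoes every datagram it receives.
 *
 * Build:  gcc -O2 -o udp_pingpong udp_pingpong.c   (add -lrt on older glibc)
 * Server: ./udp_pingpong server 5000
 * Client: ./udp_pingpong client <server-ip> 5000 100000
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

int main(int argc, char **argv)
{
	char buf[64] = "ping";			/* one tiny UDP payload */
	struct sockaddr_in addr, peer;
	socklen_t plen = sizeof(peer);
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0) { perror("socket"); return 1; }
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;

	if (argc >= 3 && !strcmp(argv[1], "server")) {
		addr.sin_addr.s_addr = htonl(INADDR_ANY);
		addr.sin_port = htons(atoi(argv[2]));
		if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			perror("bind"); return 1;
		}
		for (;;) {			/* echo every datagram back */
			ssize_t n = recvfrom(s, buf, sizeof(buf), 0,
					     (struct sockaddr *)&peer, &plen);
			if (n > 0)
				sendto(s, buf, n, 0,
				       (struct sockaddr *)&peer, plen);
		}
	}

	if (argc >= 5 && !strcmp(argv[1], "client")) {
		long i, iters = atol(argv[4]);
		struct timespec t0, t1;

		addr.sin_addr.s_addr = inet_addr(argv[2]);
		addr.sin_port = htons(atoi(argv[3]));
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++) {	/* send one packet, wait for echo */
			sendto(s, buf, sizeof(buf), 0,
			       (struct sockaddr *)&addr, sizeof(addr));
			recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("%ld round trips in %.3f s\n", iters,
		       (t1.tv_sec - t0.tv_sec) +
		       (t1.tv_nsec - t0.tv_nsec) / 1e9);
		return 0;
	}

	fprintf(stderr, "usage: %s server <port> | client <ip> <port> <iters>\n",
		argv[0]);
	return 1;
}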
> > > > >
> > > > > If the amount of data transferred is tiny, then the bottleneck
> > > > > for the test is probably the IPI time, and I'd call this a
> > > > > "ping-pong" benchmark[1].  I would only call this "io-bound" if
> > > > > you're actually copying large amounts of data.
> > > >
> > > > What we found is that on bare metal the scheduler would put both
> > > > apps on the same CPU and schedule them right after each other.
> > > > This would have a high IPI count as the scheduler would poke
> > > > itself.
> > > > On Xen it would put the two applications on separate CPUs - and
> > > > there would be hardly any IPIs.
> > >
> > > Sorry -- why would the scheduler send itself an IPI if it's on the
> > > same logical cpu (which seems pretty pointless), but *not* send an
> > > IPI to the *other* processor when it was actually waking up another
> > > task?
> > >
> > > Or do you mean a high context switch rate?
> >
> > Yes, very high.
> >
> > > > Digging deeper into the code, I found out that if you do a UDP
> > > > sendmsg without any timeouts - it would put the message in a
> > > > queue and just call schedule.
> > >
> > > You mean, it would mark the other process as runnable somehow, but
> > > not actually send an IPI to wake it up?  Is that a new "feature"
> > > designed
> >
> > Correct - because the other process was not on its vCPU's runqueue.
> >
> > > for large systems, to reduce the IPI traffic or something?
> >
> > This is just the normal Linux scheduler. The only way it would do an
> > IPI to the other CPU was if the UDP message had a timeout. The
> > default timeout is infinite, so it didn't bother to send an IPI.
> >
> > > > On bare metal the schedule would result in the scheduler picking
> > > > up the other task and starting it - which would dequeue
> > > > immediately.
> > > >
> > > > On Xen - the schedule() would go HLT.. and then later be woken up
> > > > by the VIRQ_TIMER. And since the two applications were on
> > > > separate CPUs - the single packet would just stick in the queue
> > > > until the VIRQ_TIMER arrived.
> > >
> > > I'm not sure I understand the situation right, but it sounds a bit
> > > like what you're seeing is just a quirk of the fact that Linux
> > > doesn't always send IPIs to wake other processes up (either by
> > > design or by accident),
> >
> > It does and it does not :-)
> >
> > > but relies on scheduling timers to check for work to do.  Presumably
> >
> > It .. I am not explaining it well. The Linux kernel scheduler, when
> > 'schedule' is called (from the UDP sendmsg), would either pick the
> > next application and do a context switch - or, if there were none,
> > go to sleep.
> > [Kind of - it may also do an IPI to the other CPU if requested, but
> > that requires some hints from underlying layers.]
> > Since there were only two apps on the runqueue - the UDP sender and
> > the UDP receiver - it would run them back to back (this is on bare
> > metal).
> >
> > However, if SMT was not exposed, the Linux kernel scheduler would put
> > them on separate per-CPU runqueues, meaning each CPU only had one app
> > on its runqueue.
> >
> > Hence no need to do a context switch.
> > [Unless you modified the UDP message to have a timeout; then it would
> > send an IPI.]
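[For reference: the 4143 -> 4655 sched-domain flag change mentioned at
the top of the thread ties into exactly this behaviour. Decoded against
the SD_* definitions as they appear in 2.6.39's include/linux/sched.h
(values copied into the sketch below - double-check them against the
exact tree in use), the only bit the workaround adds is
SD_SHARE_PKG_RESOURCES, i.e. it tells the wakeup path that the CPUs in
the domain share package resources (cache), which is consistent with
the back-to-back-on-one-runqueue behaviour Konrad describes. On kernels
of that era the per-domain flags are visible (and, with
CONFIG_SCHED_DEBUG, writable) under
/proc/sys/kernel/sched_domain/cpu*/domain*/flags; the thread does not
say so explicitly, but that is presumably where the values were read
and poked.]

/*
 * sd_flags_decode.c - decode the two scheduling-domain flag values
 * discussed in this thread (4143 and 4655).  The SD_* numbers below
 * are copied from 2.6.39 include/linux/sched.h; verify them against
 * the kernel actually being tested before drawing conclusions.
 *
 * Build/run: gcc -o sd_flags_decode sd_flags_decode.c && ./sd_flags_decode
 */
#include <stdio.h>

static const struct { unsigned int bit; const char *name; } sd_flags[] = {
	{ 0x0001, "SD_LOAD_BALANCE" },
	{ 0x0002, "SD_BALANCE_NEWIDLE" },
	{ 0x0004, "SD_BALANCE_EXEC" },
	{ 0x0008, "SD_BALANCE_FORK" },
	{ 0x0010, "SD_BALANCE_WAKE" },
	{ 0x0020, "SD_WAKE_AFFINE" },
	{ 0x0040, "SD_PREFER_LOCAL" },
	{ 0x0080, "SD_SHARE_CPUPOWER" },
	{ 0x0100, "SD_POWERSAVINGS_BALANCE" },
	{ 0x0200, "SD_SHARE_PKG_RESOURCES" },
	{ 0x0400, "SD_SERIALIZE" },
	{ 0x0800, "SD_ASYM_PACKING" },
	{ 0x1000, "SD_PREFER_SIBLING" },
};

static void decode(unsigned int val)
{
	unsigned int i;

	printf("%u (0x%04x):\n", val, val);
	for (i = 0; i < sizeof(sd_flags) / sizeof(sd_flags[0]); i++)
		if (val & sd_flags[i].bit)
			printf("  %s\n", sd_flags[i].name);
}

int main(void)
{
	decode(4143);	/* flags reported for the 2.6.39 CPU domain         */
	decode(4655);	/* Konrad's workaround: adds SD_SHARE_PKG_RESOURCES */
	return 0;
}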
> > > they knew that low performance on ping-pong workloads might be a
> > > possibility when they wrote the code that way; I don't see a reason
> > > why we should try to work around that in Xen.
> >
> > Which is not what I am suggesting.
> >
> > Our first idea was that since this is a Linux kernel scheduler
> > characteristic - let us give the guest all the information it needs
> > to do this. That is, make it look as bare-metal as possible - and
> > that is where the vCPU pinning and the exposing of SMT information
> > came about. That (Elena, please correct me if I am wrong) did indeed
> > show that the guest was doing what we expected.
> >
> > But naturally that requires pinning and all that - and while it is a
> > useful case for those that have the vCPUs to spare and can do it -
> > that is not a general use case.
> >
> > So Elena started looking at the CPU-bound case, seeing how Xen
> > behaves there and whether we can improve the floating situation, as
> > she saw some abnormal behaviour.
>
> Maybe it's normal? :)
>
> While having satisfactory results with the ping-pong test and having
> Joao's SMT patches in hand, we decided to try a cpu-bound workload.
> And that is where exposing SMT does not work that well.
> I mostly refer here to the case where two vCPUs are being placed on
> the same core while there are idle cores.
>
> This, I think, is what Dario is asking me for more details about in
> another reply, and I am going to answer his questions.
>
Yes, exactly. We need to see more trace entries around the one where we
see the vcpus being placed on SMT-siblings.

You could also send, or upload somewhere, the full trace, and I'll have
a look myself as soon as I can. :-)

> > I do not see any way to fix the UDP single-message mechanism except
> > by modifying the Linux kernel scheduler - and indeed it looks like
> > later kernels modified their behavior. Also, doing the vCPU pinning
> > and SMT exposing did not hurt in those cases (Elena?).
>
> Yes, the drastic performance differences from bare metal were only
> observed with the 2.6.39-based kernel.
> For this ping-pong UDP test, exposing the SMT topology to kernels of
> higher versions did help: tests show about a 20 percent performance
> improvement compared to the tests where the SMT topology is not
> exposed. This assumes that SMT exposure goes along with pinning.
>
> > kernel.
>
> hypervisor. :-D
:-D :-D :-D

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
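[One more aside, since the SMT-exposure numbers above only mean
something if the guest kernel really sees the sibling topology: the
sketch below (standard sysfs topology files only, not part of Joao's
series; the file name is an arbitrary choice) prints the topology the
guest's scheduler is working with, which is a quick way to confirm that
the exposure-plus-pinning setup took effect. On a guest with the usual
flat topology, every CPU shows up as its own core with a single-entry
thread_siblings list.]

/*
 * show_guest_topology.c - print the CPU topology the guest kernel's
 * scheduler sees (package id, core id, thread siblings per CPU), using
 * the standard sysfs topology files.  A convenience sketch only.
 *
 * Build/run: gcc -o show_guest_topology show_guest_topology.c &&
 *            ./show_guest_topology
 */
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs file into buf; return 0 on success. */
static int read_str(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f || !fgets(buf, len, f)) {
		if (f)
			fclose(f);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(void)
{
	char path[128], pkg[32], core[32], sib[128];
	int cpu;

	printf("%4s %8s %6s  %s\n", "cpu", "package", "core",
	       "thread_siblings");
	for (cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
			 cpu);
		if (read_str(path, pkg, sizeof(pkg)))
			break;		/* no such CPU: we are done */

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/core_id",
			 cpu);
		if (read_str(path, core, sizeof(core)))
			strcpy(core, "?");

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
			 cpu);
		if (read_str(path, sib, sizeof(sib)))
			strcpy(sib, "?");

		printf("%4d %8s %6s  %s\n", cpu, pkg, core, sib);
	}
	return 0;
}

[If the pinning and exposure are working as intended, the siblings
reported here should line up with the host SMT siblings the vCPUs are
pinned to.]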