Xen project Mailing List

Re: [Xen-devel] multi-core VMM

To: "G.R." <firemeteor@xxxxxxxxxxxxxxxxxxxxx>

From: Dario Faggioli <raistlin@xxxxxxxx>

Date: Fri, 11 Jan 2013 12:24:19 +0100

Cc: ZHANG Zhi <zhizhang@xxxxxxxxxx>, François-Frédéric Ozog <ff@xxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Fri, 11 Jan 2013 11:26:03 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Fri, 2013-01-11 at 00:44 +0800, G.R. wrote:
> Hmm, I know that there is a 'SMT scheduler support' in the linux kernel
> config.
> But I never tried to check out how it works.
>
Ok, sorry for joining this thread late. In my defence, it was quite hard
to follow, with wrong reply-to-s, missing messages, etc. :-)

That being said, I can contribute with what I remember about Linux SMT
and what I know about Xen SMT support. Well, although they're very
different if you look at the actual code, both approaches take into
account the differences between a full-fledged core/processor/whatever
you want to call it, and a hyperthread (HT).

Simplifying the thing quite a bit, we could say that both at least try
not to run a task/vCPU on the sibling of a busy hyperthread, in case
there are idle cores[1]. So, at least in Xen's credit scheduler, if you
have 3 cores with 2 HTs each, 2 running vCPUs and a waking-up one, the
scheduler will try to send the new vCPU to the idle core.

That's how it works here... Hope this helps sorting out your doubts
(feel free to ask further if not).

Notice that all this is well buried in the internals of the Xen (credit)
scheduler, and it is _not_ exported to almost anyone. From outside the
scheduler, all of these are simply seen as pCPUs. Yes, `xl info -n' (on
a Xen host) will tell you which pCPUs belong to which socket, which
socket(s?) to which node, etc., and you certainly can infer which pCPUs
are hyperthreads, cores or full processors, but that does not change the
fact that your 2 sockets, 4 cores-per-socket, 2 threads-per-core system
will be seen as a 16-pCPU Xen host (and the same for Linux on baremetal,
try `cat /proc/cpuinfo' there and you'll see 16 entries).

> Having application developer to control is not an effective way.
> And I never heard of such API...
>
Me neither, although that does not mean it can't exist, at least for
Linux on baremetal.

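As a rough illustration of the "prefer an idle core over the sibling of
a busy hyperthread" heuristic described earlier in this message, here is
a toy Python model. It is only a sketch under simplifying assumptions
(load, credits, affinity and power management are all ignored), it is
not the credit scheduler's actual code, and pick_pcpu, cores and busy
are made-up names:

```python
# Toy model of SMT-aware placement; NOT the actual Xen credit scheduler.
# All names here (pick_pcpu, cores, busy) are hypothetical.

def pick_pcpu(cores, busy):
    """Return a pCPU for a waking vCPU, or None if every pCPU is busy.

    cores: list of lists, each inner list holding the pCPU ids of the
           hyperthreads of one core.
    busy:  set of pCPU ids that are currently running a vCPU.
    """
    # First choice: a thread of a core whose threads are all idle.
    for threads in cores:
        if not any(t in busy for t in threads):
            return threads[0]
    # Fallback: an idle thread whose sibling is busy.
    for threads in cores:
        for t in threads:
            if t not in busy:
                return t
    return None


# The example from the text: 3 cores with 2 HTs each, 2 running vCPUs.
cores = [[0, 1], [2, 3], [4, 5]]
busy = {0, 2}
print(pick_pcpu(cores, busy))  # 4, a thread of the only fully idle core
```

With both of the other cores half busy, the waking vCPU lands on pCPU 4
rather than on pCPU 1 or 3, which are siblings of busy hyperthreads.
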
Exporting such kind of information to guests (here, I'm mostly
interested in this, given where we are, rather than in a Linux baremetal
HT API) is something that could probably be done, but not without having
clear what the situation is, what limitations we would be introducing
and whether the benefits surpass the costs and pay for the effort. :-)

First of all, the vast majority of systems are still (or at least so I
think) homogeneous: with hyperthreading enabled, every pCPU is a thread!
Pick, for instance, the system I was talking about before (2 sockets, 4
cores, 2 threads): you either keep HT enabled, and you have 16 threads,
or disable it, and you have 8 "full-cores". So, in this case, we can say
that either all the information is already exported, or that there is
not much information to export in the first place! :-)

However, let's assume you can put your hands on a heterogeneous system,
where some cores have hyperthreading enabled and others don't. For
instance, you can sort of create one by off-lining some of the pCPUs
(which, as said, are hyperthreads), assuming that leaving only one of
the 2 HTs on-line on a certain core would allow us to call it a
"full-core". IOW, let's say that, on core 0, you have both its HTs
on-line, so pCPUs 0 and 1 are HTs, while on core 1 you off-line one of
its 2 HTs (say pCPU 3), so that pCPU 2 can be considered a full core.

Now, if you run a 2 vCPUs guest on such a host, as I was saying above,
Xen will almost always run one of the guest's vCPUs on either pCPU 0 or
1, and the other one on pCPU 2 (where the "almost" comes from the fact
that this depends on the load, since, even if the guest is the only one
in the system, there always are Dom0's vCPUs too!).

Now you can think that telling the guest that one of its vCPUs is a
thread and one is a full-core might be nice, and it is definitely true
that this would trigger the guest OS's logic for dealing with SMT (and,
as said, Linux has some).

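The heterogeneous example above (core 0 with both HTs on-line, core 1
with pCPU 3 off-lined) can be restated in a few lines of Python. This is
only an illustrative sketch: classify_pcpus and its arguments are
hypothetical names, and "full-core" simply means "the only on-line
thread of its core", as assumed in the text:

```python
# Hypothetical helper, only meant to restate the example in code.

def classify_pcpus(cores, offline):
    """Map each on-line pCPU to "thread" or "full-core".

    cores:   list of lists of sibling pCPU ids, one inner list per core.
    offline: set of pCPU ids that have been off-lined.
    """
    kinds = {}
    for threads in cores:
        online = [t for t in threads if t not in offline]
        # A pCPU with no on-line sibling has the core to itself.
        kind = "full-core" if len(online) == 1 else "thread"
        for t in online:
            kinds[t] = kind
    return kinds


# Core 0 keeps both HTs (pCPUs 0 and 1); core 1 has pCPU 3 off-lined.
print(classify_pcpus([[0, 1], [2, 3]], offline={3}))
# {0: 'thread', 1: 'thread', 2: 'full-core'}
```

Note that this describes pCPUs only. Which vCPU actually runs on which
pCPU is still up to the scheduler.
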
However, among its two vCPUs, which one would you advertise to the guest
as a thread and which one as a core? Placement is based on heuristics,
so it's not impossible that you end up executing _both_ the guest's
vCPUs on the same core (i.e., on pCPUs 0 and 1), if for instance the
Dom0 vCPUs are very busy, or if you start a new guest at some point, in
which case you'll be giving the guest OS the wrong information,
potentially worsening performance.

The only way to avoid the above that I can think of is vCPU-to-pCPU
pinning. In fact, if you statically pin, at guest creation time, vCPU 0
to pCPU 0 and vCPU 1 to pCPU 2 (thanks to the fact that you disabled, by
off-lining it, pCPU 3, the sibling HT of pCPU 2), you'll never see
vCPU 1 running on a busy HT. However, this not only means that you have
to pin the guest's vCPUs at the time you create the guest itself, but
also that you can't change this during the whole guest lifetime, so, no
vCPU re-pinning, no fiddling with cpupools, probably not even suspension
and live migration (unless you're sure you're going to resume the guest
at a time and on a host with the exact same characteristics).

So, with all these limitations, is it still worth it? Don't get me
wrong, I'm not at all saying it never is, what I'm saying is it would be
a solution for a very small subset of use cases, so we better
concentrate on something else for now... But, hey, if something comes
out that implements this (of course, if it doesn't cause regressions on
other workloads, if it doesn't complicate the code too much, ...), I'd
at least look at the patches with interest! :-P

> Maybe I just need to check out the linux solution...
> I still tend to believe there are HW support for application behavior
> statistics.
>
I'm not sure I understood what you mean with "HW support for application
behaviour statistics". There certainly is some information you can try
to infer from things like hardware counters, etc., and since SMT
scheduling (like all scheduling? :-D) is mostly heuristics, we sure
could use it to try to do better.

On the down side, if talking for example about hardware performance
counters, most of them are hardware/CPU/chipset specific and dependent,
so it's very hard to take advantage of them in something that has to run
on a wide range of CPUs and chipsets, from different manufacturers and
different time periods!

Of course you can construct _in_software_ almost every statistic you
like, but that will definitely come at some price, especially if it has
to be accurate enough to be useful for scheduling decisions, which are,
by their nature, a quite high frequency activity.

Again, I hope this helps.

Thanks and Regards,
Dario

[1] Unless there are power management constraints. Xen, for instance,
    has a parameter that you can use to tell the scheduler (well, at
    least the credit scheduler) to do exactly the opposite, i.e.,
    consolidate work instead of spreading it

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
