
Re: [Xen-devel] multi-core VMM

On Fri, 2013-01-11 at 00:44 +0800, G.R. wrote:
> Hmm, I know that there is a 'SMT scheduler support' in the linux kernel 
> config.
> But I never tried to check out how it works.
Ok, sorry for joining this thread late. In my defence, it was quite hard
to follow, with wrong reply-to-s, missing messages, etc. :-)

That being said, I can contribute with what I remember about Linux SMT
and what I know about Xen SMT support. Although they're very different
if you look at the actual code, both approaches take into account the
difference between a full-fledged core/processor/whatever you want to
call it, and a hyperthread (HT). Simplifying things quite a bit, we
could say that both at least try not to run a task/vCPU on the sibling
of a busy hyperthread when there are idle cores[1]. So, at least in
Xen's credit scheduler, if you have 3 cores with 2 HTs each, 2 running
vCPUs and a waking-up one, the scheduler will try to send the new vCPU
to the idle core. That's how it works here... Hope this helps sort out
your doubts (feel free to ask further if not).
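To make the heuristic concrete, here is a much-simplified sketch in
Python. The function name and the data structures are mine, purely
illustrative, and look nothing like the actual credit scheduler code:

```python
# Much-simplified sketch of the "prefer a fully idle core" idea
# (purely illustrative; not Xen's actual credit scheduler logic).

def pick_pcpu(idle_pcpus, sibling_of):
    """Pick a pCPU for a waking vCPU, preferring one whose
    HT sibling is also idle (i.e., a fully idle core)."""
    for pcpu in sorted(idle_pcpus):
        if sibling_of[pcpu] in idle_pcpus:
            return pcpu                 # both HTs idle: a whole idle core
    # fall back to any idle HT, even if its sibling is busy
    return min(idle_pcpus) if idle_pcpus else None

# 3 cores with 2 HTs each: sibling pairs are (0,1), (2,3), (4,5)
sibling_of = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}
# pCPUs 0 and 2 are busy running vCPUs, so only core (4,5) is fully idle
idle = {1, 3, 4, 5}
print(pick_pcpu(idle, sibling_of))      # -> 4: the waking vCPU gets the idle core
```

If no fully idle core exists, the sketch (like the real thing) settles
for a busy core's idle sibling rather than leaving the vCPU waiting.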

Notice that all this is well buried in the internals of the Xen (credit)
scheduler, and it is _not_ exported to almost anyone. From outside the
scheduler, every hardware thread is just seen as a pCPU. Yes, `xl info -n'
(on a Xen host) will tell you which pCPUs belong to which socket, which
socket(s?) to which node, etc., and you certainly can infer which pCPUs
are hyperthreads, cores or full processors, but that does not change the
fact that your 2-socket, 4-cores-per-socket, 2-threads-per-core system
will be seen as a 16-pCPU Xen host (and the same goes for Linux on
baremetal: try `cat /proc/cpuinfo' there and you'll see 16 entries).
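Just to illustrate the flattening, here is a toy Python enumeration of
such a topology. The pCPU numbering is made up: the real one depends on
the firmware, so check `xl info -n' or /proc/cpuinfo on your own host:

```python
# Toy enumeration of how a 2-socket, 4-cores-per-socket,
# 2-threads-per-core box gets flattened into 16 pCPUs.
# The numbering is a made-up example, not what any real firmware does.

from itertools import product

SOCKETS, CORES, THREADS = 2, 4, 2

topology = {}                           # pCPU id -> (socket, core, thread)
for pcpu, (s, c, t) in enumerate(
        product(range(SOCKETS), range(CORES), range(THREADS))):
    topology[pcpu] = (s, c, t)

print(len(topology))                    # -> 16
print(topology[0], topology[1])         # -> (0, 0, 0) (0, 0, 1): HT siblings
```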

> Having application developer to control is not an effective way.
> And I never heard of such API...
Me neither, although that does not mean it can't exist, at least for
Linux on baremetal.

Exporting this kind of information to guests (here, given where we are,
I'm mostly interested in that rather than in a Linux baremetal HT API)
is something that could probably be done, but not without first being
clear about what the situation is, what limitations we would be
introducing, and whether the benefits surpass the costs and pay for the
effort. :-)

First of all, the vast majority of systems are still (or at least so I
think) homogeneous: with hyperthreading enabled, every pCPU is a thread!
Pick, for instance, the system I was talking about before (2 sockets, 4
cores, 2 threads): you either keep HT enabled, and you have 16 threads,
or disable it, and you have 8 "full-cores". So, in this case, we can say
that either all the information is already exported, or that there is
not much information to export in the first place! :-)

However, let's assume you can put your hands on a heterogeneous system,
where some cores have hyperthreading enabled and others don't. For
instance, you can sort of create one by off-lining some of the pCPUs
(which, as said, are hyperthreads), assuming that leaving only one of
the 2 HTs on-line on a certain core allows us to call it a full core.
IOW, let's say that, on core 0, you have both its HTs on-line, so pCPUs 0
and 1 are HTs, while on core 1 you off-line one of its 2 HTs (say pCPU
3), so that pCPU 2 can be considered a full core. Now, if you run a 2
vCPU guest on such a host, as I was saying above, Xen will almost always
run one of the guest's vCPUs on either pCPU 0 or 1, and the other one on
pCPU 2 (where the 'almost' comes from the fact that this depends on the
load, since, even if the guest is the only one in the system, there are
always also Dom0's vCPUs!). Now, you may think that telling the guest
that one of its vCPUs is a thread and the other a full core might be
nice, and it is definitely true that this would trigger the guest OS's
logic for dealing with SMT (and, as said, Linux has some). However,
among its two vCPUs, which one would you advertise to the guest as a
thread and which one as a core? It's heuristics, so it's not impossible
that you end up executing _both_ the guest's vCPUs on the same core
(i.e., on pCPUs 0 and 1), if for instance the Dom0 vCPUs are very busy,
or if you start a new guest at some point, in which case you'd be giving
the guest OS the wrong information, potentially worsening performance.
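For clarity, here is a toy Python model of that heterogeneous setup.
The "thread vs. full core" classification is my own invention for this
example, not something Xen actually computes or exports:

```python
# Toy model of the heterogeneous example above: core 0 keeps both its
# HTs (pCPUs 0 and 1), while core 1 has pCPU 3 off-lined, so pCPU 2
# behaves like a full core. The classification is a made-up helper,
# nothing that Xen itself provides.

sibling_of = {0: 1, 1: 0, 2: 3, 3: 2}
online = {0, 1, 2}                      # pCPU 3 has been taken off-line

def kind(pcpu):
    """A pCPU whose sibling is also on-line is 'just' a thread."""
    return "thread" if sibling_of[pcpu] in online else "full core"

for pcpu in sorted(online):
    print(pcpu, kind(pcpu))             # 0 thread / 1 thread / 2 full core
```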

The only way I can think of to avoid the above is vCPU-to-pCPU
pinning. In fact, if you statically pin, at guest creation time, vCPU 0
to pCPU 0 and vCPU 1 to pCPU 2 (thanks to the fact that you disabled, by
off-lining it, pCPU 3, the sibling HT of pCPU 2), you'll never see vCPU
1 running on a busy HT. However, this not only means that you have to
pin the guest's vCPUs at the time you create the guest itself, but also
that you can't change this during the whole guest lifetime: so, no vCPU
re-pinning, no playing with cpupools, probably not even suspension or
live migration (unless you're sure you're going to resume the guest at a
time, and on a host, with the exact same characteristics). So, with all
these limitations, is it still worth it? Don't get me wrong, I'm not at
all saying it never is; what I'm saying is that it would be a solution
for a very small subset of use cases, so we'd better concentrate on
something else for now... But, hey, if something comes out that
implements this (of course, provided it doesn't cause regressions on
other workloads, doesn't complicate the code too much, ...), I'd at
least look at the patches with interest! :-P
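For completeness, static pinning at creation time would look something
like this in the guest's xl config file (the name and memory values here
are made up; see the xl.cfg documentation for the exact syntax of
per-vCPU affinity):

```
# guest.cfg (made-up name/memory): pin vCPU 0 -> pCPU 0, vCPU 1 -> pCPU 2
name   = "smt-test"
memory = 1024
vcpus  = 2
cpus   = ["0", "2"]   # per-vCPU hard affinity list
```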

> Maybe I just need to check out the linux solution...
> I still tend to believe there are HW support for application behavior
> statistics.
I'm not sure I understood what you mean by "HW support for application
behaviour statistics". There certainly is some information you can try
to infer from things like hardware counters, etc., and since SMT
scheduling (like all scheduling? :-D) is mostly heuristics, we could
surely use them to try to do better. On the downside, if talking for
example about hardware performance counters, most of them are
hardware/CPU/chipset specific and dependent, so it's very hard to take
advantage of them in something that has to run on a wide range of CPUs
and chipsets, from different manufacturers and different time periods!

Of course you can construct _in_software_ almost every statistic you
like, but that will definitely come at some price, especially if it has
to be accurate enough to be useful for scheduling decisions, which are,
by their nature, a quite high-frequency activity.
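Just as a completely made-up toy of what such an in-software statistic
could look like: an exponentially-weighted moving average of a task's
recent runtimes, updated at every scheduling event. Cheap per update,
but still a cost paid at high frequency:

```python
# Made-up toy of an in-software statistic a scheduler might keep:
# an exponentially-weighted moving average (EWMA) of a task's recent
# runtimes. Not taken from any real scheduler's code.

def ewma_update(avg, sample, alpha=0.25):
    """Blend a new runtime sample into the running average."""
    return (1 - alpha) * avg + alpha * sample

avg = 10.0                              # initial estimate, made-up units
for sample in (12.0, 8.0, 11.0):        # runtimes observed at each event
    avg = ewma_update(avg, sample)
print(avg)                              # -> 10.15625
```

Even something this trivial is a multiply-add plus bookkeeping on every
event; anything accurate enough to matter will cost noticeably more.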

Again, I hope this helps.

Thanks and Regards,

[1] Unless there are power management constraints. Xen, for instance,
has a parameter that you can use to tell the scheduler (well, at least
the credit scheduler) to do right the opposite, i.e., to consolidate
work instead of spreading it.

<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
