
Re: [Xen-devel] schedulers and topology exposing questions

On Tue, 2016-01-26 at 11:21 +0000, George Dunlap wrote:
> On 22/01/16 16:54, Elena Ufimtseva wrote:
> Regarding placement wrt topology: If two threads are doing a large
> amount of communication, then putting them close in the topology will
> increase performance, because they share cache, and the IPI distance
> between them is much shorter. If they rarely run at the same time,
> being on the same thread is probably the ideal.
Yes, this makes sense to me... a bit hard to do the guessing right, but
if we could, it would be a good thing to do.

> On the other hand, if two threads are running mostly independently, and
> each one is using a lot of cache, then having the threads at opposite
> ends of the topology will increase performance, since that will
> increase the aggregate cache used by both. The ideal in this case would
> certainly be for each thread to run on a separate socket.
> At the moment, neither the Credit1 nor the Credit2 scheduler takes
> communication into account; they only account for processing time, and
> thus silently assume that all workloads are cache-hungry and
> non-communicating.
I don't think Linux's scheduler does anything like that either. One
could say that --speaking again about the flags of the scheduling
domains-- by using SD_BALANCE_FORK and SD_BALANCE_EXEC, together with
knowledge of who runs first between parent and child, something like
what you describe could be implemented... But even in that case, it's
not at all explicit, and it would only be effective near fork() and
exec().

The same is true, with due differences, for the other flags.
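For illustration, here is a minimal sketch of how one could decode the
flags a scheduling domain exposes (on kernels of that era, via
/proc/sys/kernel/sched_domain/cpuN/domainM/flags, with CONFIG_SCHED_DEBUG).
The bit values follow include/linux/sched.h of recent kernels and may
differ between versions, so treat them as an assumption, not a reference:

```python
# Hedged sketch: decode the numeric SD_* flags of a Linux scheduling
# domain. Bit positions taken from include/linux/sched.h circa v4.x;
# they have changed across kernel versions, so verify against your tree.
SD_FLAGS = {
    0x0001: "SD_LOAD_BALANCE",
    0x0002: "SD_BALANCE_NEWIDLE",
    0x0004: "SD_BALANCE_EXEC",
    0x0008: "SD_BALANCE_FORK",
    0x0010: "SD_BALANCE_WAKE",
    0x0020: "SD_WAKE_AFFINE",
}

def decode_sd_flags(value):
    """Return the names of the SD_* bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(SD_FLAGS.items()) if value & bit]

# e.g. a domain with exec+fork balancing enabled:
print(decode_sd_flags(0x0c))  # ['SD_BALANCE_EXEC', 'SD_BALANCE_FORK']
```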

Also, there is a great amount of logic to deal with task groups and,
e.g., provide fairness to task groups rather than to single tasks,
etc. And I guess one can assume that tasks in the same group do
communicate, and things like that, but that's again nothing that
specifically takes any communication pattern into account (and it has
changed --growing and losing features-- quite frenetically over the
last few years, so I don't think it has a say in what Elena is seeing).

> Assuming that the Linux kernel takes process communication into
> account in its scheduling decisions, I would expect smt+pinning to
> have the kind of performance improvement you observe. I would expect
> that smt without pinning would have very little effect -- or might be
> actively worse, since the topology information would then be actively
> wrong as soon as the scheduler moved the vcpus.
We'd better check, then, whether Linux has these characteristics,
because I don't think it does. I can, and will, check myself, just not
right now.

So I ran the test with smt patches enabled, but not pinned vcpus.

> > 
> > result also shows the same as above (see
> > trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err.png):
> > Also see the per-cpu graph
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt1_ups.out.plot.err_pervcpu.png).
> > 
> > END: cycles: 49740185572 (46 seconds)
> > END: cycles: 45862289546 (42 seconds)
> > END: cycles: 30976368378 (28 seconds)
> > END: cycles: 30886882143 (28 seconds)
> > END: cycles: 30806304256 (28 seconds)
> > 
> > I cut the timeslice where it's seen that vcpu0 and vcpu2 run on the
> > same core while other cores are idle:
> > 
> > 35v2 9.881103815
> > 35v0 9.881104013 6
> > 35v2 9.892746452
> > 35v0 9.892746546 6  -> vcpu0 gets scheduled right after vcpu2 on the same core
> > 35v0 9.904388175
> > 35v2 9.904388205 7  -> same here
> > 35v2 9.916029791
> > 35v0 9.916029992
> > 
> > Disabling the smt option in the Linux config (which essentially
> > means that the guest does not have correct topology and it's just
> > flat) shows slightly better results - there are no cores and threads
> > being scheduled in pairs while other cores are empty.
> > 
> > END: cycles: 41823591845 (38 seconds)
> > END: cycles: 41105093568 (38 seconds)
> > END: cycles: 30987224290 (28 seconds)
> > END: cycles: 31138979573 (29 seconds)
> > END: cycles: 31002228982 (28 seconds)
> > 
> > and graph is attached
> > (trace_cpu_16vcpus_8threads_5runs_notpinned_smt0_ups.out.plot.err.png).
> This is a bit strange. You're showing that for *unpinned* vcpus, with
> empty cores, there are vcpus sharing the same thread for significant
> periods of time? That definitely shouldn't happen.
I totally agree with George on this, the key word in what he is saying
being _significant_. In fact, this is the perfect summary/bottom line
of my explanation of how our SMT load balancer in Credit1 works...

From just looking at the graph, I can't spot many places where this
really happens for a significant amount of time. Am I wrong? To know
for sure, we need to check the full trace.
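To make that check mechanical, something like the following could scan
the trace for vcpus landing on sibling threads. This is a sketch only:
it assumes records shaped like the excerpt above (`35v0 9.881104013 6`,
i.e. domain+vcpu, timestamp, and optionally the pcpu), and assumes
threads 2k and 2k+1 are siblings; neither is the real xentrace format.

```python
# Hedged sketch: flag moments when two vcpus of the same guest occupy
# sibling hyperthreads. The line format and the "pcpu 2k/2k+1 are
# siblings" rule are assumptions made from the trace excerpt above.
import re

LINE = re.compile(r"(\d+)v(\d+)\s+([\d.]+)(?:\s+(\d+))?")

def sibling(pcpu):
    return pcpu ^ 1  # assumed: threads 2k and 2k+1 share a core

def shared_thread_events(lines):
    where = {}   # vcpu -> pcpu it was last seen running on
    events = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m or m.group(4) is None:
            continue  # lines without a pcpu field carry no placement info
        vcpu, t, pcpu = int(m.group(2)), float(m.group(3)), int(m.group(4))
        where[vcpu] = pcpu
        for other, p in where.items():
            if other != vcpu and p == sibling(pcpu):
                events.append((t, vcpu, other, pcpu))
    return events

trace = [
    "35v2 9.881103815",
    "35v0 9.881104013 6",
    "35v2 9.892746452 7",
    "35v0 9.892746546 6",  # v0 and v2 now on siblings 6/7
]
print(shared_thread_events(trace))
```

Summing the gaps between consecutive such events would then tell us
whether the sharing really lasts a significant amount of time.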

Ah, given the above, one could ask why we do not change Credit1 to
actually do the SMT load balancing more frequently, e.g., at each vcpu
wakeup. That is certainly a possibility, but there is the risk that
the overhead of doing that too frequently (and there is indeed some
overhead!) absorbs the benefits of more efficient placement (and I do
think that would be the case: the wakeup path, in Credit1, is already
complex and crowded enough! :-/).

> Could you try a couple of these tests with the credit2 scheduler,
> just to see? You'd have to make sure and use one of the versions that
> has hard pinning enabled; I don't think that made 4.6, so you'd have
> to use xen-unstable, I think.
Nope, sorry, I would not do that yet. I've got things half done
already, but I have been sidetracked by other things (including this
one), and so Credit2 is not yet in a shape where running the
benchmarks with it, even using staging, would represent a fair
comparison with Credit1.

I'll get back to the work that is still pending to make that possible
in a bit (after FOSDEM, i.e., next week).

> > We try to make guests topology aware but it looks like, for cpu
> > bound workloads, it's not that easy.
> > Any suggestions are welcome.
> Well, one option is always, as you say, to try to expose the topology
> to the guest. But that is a fairly limited solution -- in order for
> that information to be accurate, the vcpus need to be pinned, which
> in turn means 1) a lot more effort required by admins, and 2) a lot
> less opportunity for sharing of resources, which is one of the big
> 'wins' for virtualization.

Exactly. I indeed think it would be good to support this mode, but it
has to be made very clear that either the pinning does not ever ever
ever ever ever... ever change, or you'll get back to pseudo-random
scheduler(s)' behavior, leading to unpredictable and inconsistent
performance (maybe better, maybe worse, but certainly inconsistent).
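Just to make "pinning" concrete, something like this sketch could
generate the `xl vcpu-pin` invocations for a fixed mapping. The 1:1
identity map and the domain name are arbitrary examples: any fixed map
would do, as long as it is then never changed.

```python
# Hedged sketch: emit xl vcpu-pin commands for a fixed 1:1 vcpu->pcpu
# mapping, so that the guest's virtual SMT pairs (2k, 2k+1) land on
# host threads numbered the same way. Check the host's real sibling
# layout (topology/thread_siblings_list in sysfs) before using any map.
def pin_commands(domain, n_vcpus):
    return [f"xl vcpu-pin {domain} {v} {v}" for v in range(n_vcpus)]

for cmd in pin_commands("guest", 4):
    print(cmd)
```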

Or, when pinning changes, we figure out a way to tell Linux (the
scheduler and every other component that needs to know) that something
like that happened. As far as scheduling domains go, I think there is
a way to ask the kernel to rebuild the hierarchy, but I've never tried
that, and I don't know whether it's available to userspace (already).
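For what it's worth, one userspace action that does make the kernel
rebuild its scheduling domains is a CPU hotplug cycle, since domains
are rebuilt on every hotplug event. A sketch (needs root on a real
system; the sysfs root is parameterised here purely as an assumption,
so the logic can be exercised against a fake tree):

```python
# Hedged sketch: force a scheduling-domain rebuild via a CPU hotplug
# cycle. Writing 0 then 1 to cpuN/online makes the kernel tear down
# and rebuild the domain hierarchy; it is a blunt instrument, since
# the cpu briefly disappears from the system.
from pathlib import Path

def hotplug_cycle(cpu, sysfs="/sys/devices/system/cpu"):
    node = Path(sysfs) / f"cpu{cpu}" / "online"
    node.write_text("0")  # offline: domains rebuilt without this cpu
    node.write_text("1")  # online again: domains rebuilt including it
```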

> The other option is, as Dario said, to remove all topology
> information from Linux, and add functionality to the Xen schedulers
> to attempt to identify vcpus which are communicating or sharing in
> some other way, and try to co-locate them. This is a lot easier and
> more flexible for users, but a lot more work for us.
If pinning is not used, this is the only way Linux will see something
that is not wrong, so I think we really want this, and we want it to
be the default (and Juergen has patches for this! :-D)

Thanks again and Regards,
<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Xen-devel mailing list