
Re: [Xen-devel] Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU



Hey, guys,

I know, I know, I'm soooo late to the party! Sorry, I got trapped
into a thing that I really needed to finish... :-/

I've got no intention to resurrect this old thread, I just wanted to
point out a few things.

On Wed, 2015-04-08 at 16:52 -0400, Meng Xu wrote:
> 2015-04-08 5:13 GMT-04:00 George Dunlap <george.dunlap@xxxxxxxxxxxxx>:
> > On 04/07/2015 09:25 PM, Meng Xu wrote:
> >> Hi George, Dario and Konrad,
> >>
> >> I finished a prototype of the RTDS scheduler with the dedicated CPU
> >> feature and did some quick evaluation on this feature. Right now, I
> >> need to refactor the code (because it is kind of messy when I was
> >> exploring different approaches :() and will send out the clean patch
> >> later (this week or next week). But the design follows our discussion
> >> at 
> >> http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> >>
The idea of a 'dedicated CPU' makes sense. It's also always been quite
common, in the Linux community, to see it as a real-time oriented
feature. I personally don't agree much, as real-time is about
determinism, and dedicating a CPU to a task (in our case, that would mean
dedicating a pCPU to a vCPU and then, in the guest, that vCPU to a task)
does not automatically give you determinism.

Sure, it cuts off some overhead and some sources of unpredictable
behavior (e.g., scheduler code), but not all of them (what about, for
instance, caches shared with non-isolated pCPUs?). No, IMO, if you want
determinism, you should make the code deterministic, not get rid of
it! :-D

In fact, Linux has a feature similar to the one Meng investigated, and it
has traditionally been used (at least until I was involved with Linux
scheduling) by HPC people, database engines and high-frequency trading
use cases (which are also often categorized as 'real-time workloads' but
just aren't, IMO).

It's called isolcpus. For sure there was a boot-time parameter for it,
and it looks like it is still there:
http://wiki.linuxcnc.org/cgi-bin/wiki.pl?The_Isolcpus_Boot_Parameter_And_GRUB2
http://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
http://lxr.linux.no/linux+v3.19.1/Documentation/kernel-parameters.txt#L1530

I'm not sure whether they grew interfaces to set this up at runtime, but
I doubt it.
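Just as an illustration, this is roughly how one uses it at boot time; a
minimal sketch, assuming a GRUB2 system (the CPU numbers and the task
binary name are made-up examples):

```shell
# In /etc/default/grub, add isolcpus to the kernel command line so that
# CPUs 2 and 3 are excluded from the general scheduler's load balancing:
GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3"

# Then regenerate the GRUB configuration and reboot:
#   sudo update-grub && sudo reboot

# After boot, nothing runs on the isolated CPUs unless explicitly
# pinned there, e.g. with taskset (my_realtime_task is hypothetical):
taskset -c 2 ./my_realtime_task
```

The point being: isolation is opt-in per task, which is why something
comparable on the Xen side would have to operate at vCPU granularity.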

For us, I'm not sure whether something like that would be useful. To be
fruitfully used together with something similar to Linux's isolcpus, it
would need to work the way Meng is doing it, i.e., it ought to be possible
to handle single vCPUs, not full domains. However...

> > Secondly, adding an entirely new interface, as implementing the
> > "dedicated cpu" would require, on the other hand, is a fairly
> > significant cost.  It's costly for users to learn and configure the new
> > interface, it's costly to document, and once it's there we have to
> > continue to support it perhaps for a long time to come; and the feature
> > itself is also fairly complicated and increases the code maintenance.
> >
> > So the performance improvement you've shown so far I think is nowhere
> > near high enough a benefit to outweigh this cost.
> 
... I agree with George on this...

> OK. I see and agree.
> 
... and I'm happy you also do! :-D

> > And in any case, as you say, it looks like the source of the overhead is
> > the very frequent invocation of the RTDS scheduler.
>
Exactly! I'd put it this way: there are more urgent and more useful
optimizations, in general, but especially in RTDS, to be done before
thinking about something like this.

>   You could probably
> > get the same kinds of benefits without adding any new interfaces by
> > reducing the amount of time the scheduler gets invoked when there are no
> > other tasks to run on that cpu.
> 
Exactly. And again, that is particularly relevant to RTDS, as the numbers
show. Looking again at the Linux world, this (i.e., avoiding invoking the
scheduler when there is only one task on a CPU) is also something
they've introduced rather recently.

It's called full dynticks:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
http://ertl.jp/~shinpei/conf/ospert13/slides/FredericWeisbecker.pdf
https://lwn.net/Articles/549580/
http://thread.gmane.org/gmane.linux.kernel/1485210 [*]

[*] check out Linus' replies... "awesome" as usual, he even managed to
rant about virtualization, all by himself!! :-P
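For completeness, full dynticks is also enabled via boot parameters (it
requires a kernel built with CONFIG_NO_HZ_FULL; the CPU ranges below are
just example values):

```shell
# Kernel command line fragment enabling full dynticks on CPUs 1-3: the
# periodic tick is stopped on those CPUs whenever they run one task.
# rcu_nocbs offloads RCU callbacks away from the same CPUs, and is
# usually combined with nohz_full.
GRUB_CMDLINE_LINUX_DEFAULT="quiet nohz_full=1-3 rcu_nocbs=1-3"

# After boot, the CPUs running in adaptive-tick mode can be checked via:
cat /sys/devices/system/cpu/nohz_full
```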

That is IMO a line of action that may deserve some investigation.
RTDS-wise, for sure... the Credit schedulers are not at all bad from that
perspective (as your numbers also show), but it might be possible to do
better.

> Yes. This is what Dagaen (cc.ed) is doing right now. He had an RFC
> patch and sent it to me last week. We are working on refining the patch
> before sending it out to the mailing list.
> 
I'll be super glad to see this! :-D

> >
> > What I was expecting you to test, for the RTDS scheduler, was the
> > wake-up latency.  Have you looked at that at all?
> 
Indeed, that would be really interesting.

> Ah, I didn't realize this... Do you have any concrete evaluation plan for 
> this?
> In my mind, I can issue hypercalls in domU to wake-up and sleep a vcpu
> and measure how long it takes to wake up a vcpu. Maybe you have some
> better idea in mind?
> (The wake-up latency of a vcpu will depend on the priority of the
> vcpu and how heavily loaded the system is, in my speculation.)
> 
Yes, that is something that could (should?) be done, as the wakeup
latency of a vcpu is a lower bound for the wakeup latency of in-guest
workloads, so we really want to know where we stand with respect to that,
whether we need to improve things, and if yes, how.

It's priority- and load-dependent... yes, of course, but that's what we
have real-time schedulers for, isn't it? :-P Jokes apart, for the actual
'lower bound', we're clearly interested in measuring a vcpu when running
alone on a pCPU, or with top priority.

On the other hand, to look at wakeup latency from within the guest,
cyclictest is the way to go:
https://rt.wiki.kernel.org/index.php/Cyclictest

What we want is to run it inside a guest, under different host and guest
load conditions (and using different schedulers, varying the scheduling
parameters, etc), and see what happens... Ever looked at that? I think
it would be interesting.
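A minimal sketch of such a run inside the guest (the thread count,
priority, interval, and loop count below are just example values, and
stress-ng is only one possible load generator):

```shell
# Run cyclictest in the guest: one measurement thread per CPU (-t),
# SCHED_FIFO priority 80 (-p), 1 ms wakeup interval (-i, in usec),
# and a fixed number of loops (-l) so the run terminates. --quiet
# prints only the final min/avg/max latency summary per thread.
sudo cyclictest -t $(nproc) -p 80 -i 1000 -l 100000 --quiet

# Optionally generate CPU load in the guest at the same time, e.g.:
#   stress-ng --cpu $(nproc) --timeout 120s
# and repeat under different Xen schedulers and scheduling parameters
# to compare the reported wakeup latencies.
```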

I've done something similar while preparing this talk:
https://archive.fosdem.org/2014/schedule/event/virtiaas16/

But I never got the chance to repeat the experiments (nor did I do any
further reasoning or investigation about how the timestamps are
obtained, TSC emulation, etc., as George pointed out).

That's all... Sorry again for chiming in only now. :-(

Regards,
Dario


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

