
Re: [Xen-devel] Design and Question: Eliminate Xen (RTDS) scheduler overhead on dedicated CPU



On Tue, Mar 24, 2015 at 3:27 PM, Meng Xu <xumengpanda@xxxxxxxxx> wrote:
>> The simplest way to get your prototype working, in that case, would be
>> to return the idle vcpu for that pcpu if the guest is blocked.
>
>
> Exactly! Thank you so much for pointing this out!  I had hardwired it to
> always return the vcpu that is supposed to be blocked. Now I totally
> understand what happened. :-)
>
> But this leads to another issue with my design:
> If I return the idle vcpu when the dedicated VCPU is blocked, it will do
> context_switch(prev, next); when the dedicated VCPU is unblocked, another
> context_switch() is triggered.
> It means that we cannot eliminate the context_switch overhead for the
> dedicated CPU.
> The ideal performance for the dedicated VCPU on the dedicated CPU should be
> super-close to the bare-metal CPU. Here we still have the context_switch
> overhead, which is about 1500-2000 cycles.
>
> Can we avoid the context switch overhead?

If you look at xen/arch/x86/domain.c:context_switch(), you'll see that
it already has clever logic for avoiding as much context-switch work as
possible.  In particular, __context_switch() (which on x86 does the
actual work of context switching) won't be called when switching *into*
the idle vcpu; nor will it be called if you're switching from the idle
vcpu back to the vcpu it switched away from (curr_vcpu == next).  I'm
not familiar with the ARM path, but hopefully it does something similar.

IOW, a context switch to the idle domain isn't really a context switch. :-)
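
Roughly, the check has this shape (paraphrased and abridged from
xen/arch/x86/domain.c, so don't take the exact details as gospel):

    void context_switch(struct vcpu *prev, struct vcpu *next)
    {
        unsigned int cpu = smp_processor_id();

        /* per_cpu(curr_vcpu, cpu) tracks whose register/FPU state is
         * actually loaded on this pcpu, which can lag behind 'current'
         * after a lazy switch to idle. */
        if ( (per_cpu(curr_vcpu, cpu) == next) ||
             (is_idle_vcpu(next) && cpu_online(cpu)) )
        {
            /* Lazy path: going to idle, or coming back to the vcpu whose
             * state never left the pcpu -- skip the heavy state swap. */
            local_irq_enable();
        }
        else
        {
            /* Full path: save/restore registers, FPU, page tables, etc. */
            __context_switch();
            /* ... remainder of the switch ... */
        }
        /* ... */
    }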

> However, because the credit2 scheduler accounts credit at the domain
> level, the function that counts the credit burned cannot be avoided.

Actually, that's not true.  In credit2, the weight is set at a domain
level, but that only changes the "burn rate".  Individual vcpus are
assigned and charged their own credits; and the credit of a vcpu in one
runqueue is never compared with, and has no direct effect on, the credit
of a vcpu in another runqueue.  It wouldn't be at all inconsistent to
simply not do the credit calculation for a "dedicated" vcpu.  The effect
on other vcpus would be exactly the same as having that vcpu on a
runqueue by itself.
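
To make that concrete, here is a purely illustrative sketch -- the
CSFLAG_dedicated flag and the scale_by_weight() helper are invented,
not existing sched_credit2.c code -- of how the per-vcpu charging
routine could simply skip a dedicated vcpu:

    /* Hypothetical: short-circuit credit accounting for a dedicated vcpu.
     * CSFLAG_dedicated and scale_by_weight() are made up for illustration;
     * the point is only that charging happens per-vcpu, so skipping it for
     * one vcpu can't skew any other vcpu's credit. */
    static void burn_credits(struct csched2_vcpu *svc, s_time_t now)
    {
        if ( svc->flags & CSFLAG_dedicated )
            return;   /* nothing shares this vcpu's pcpu; no credit to track */

        /* Normal case: charge the vcpu for the time it just ran, scaled
         * by its domain's weight (the "burn rate"). */
        svc->credit -= scale_by_weight(now - svc->start_time, svc);
        svc->start_time = now;
    }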

>> But it's not really accurate to say
>> that you're avoiding the scheduler entirely.  At the moment, as far as
>> I can tell, you're still going through all the normal schedule.c
>> machinery between wake-up and actually running the vm; and the normal
>> machinery for interrupt delivery.
>
>
> Yes. :-(
> Ideally, I want to isolate all such interference from the dedicated CPU so
> that the dedicated VCPU on it will have performance that is close to a
> bare-metal CPU. However, I'm concerned about how complex it will be and
> how it will affect the existing functionality that relies on interrupts.

Right; so there are several bits of overhead you might address:

1. The overhead of scheduling calculations -- credit, load balancing,
sorting lists, &c; and regular scheduling interrupts.

2. The overhead in the generic code of having the flexibility to run
more than one vcpu.  This would (probably) be measured in the number
of instructions from a waking interrupt to actually running the guest
OS handler.

3. The maintenance things that happen in softirq context, like
periodic clock synchronization, &c.

Addressing #1 is fairly easy.  The simplest thing to do would be to
make a new scheduler and use cpupools; but it shouldn't be terribly
difficult to build the functionality within existing schedulers.
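
To give an idea of how small such a scheduler could be, here is a
sketch only -- the per-cpu 'dedicated_vcpu' variable is invented
bookkeeping, though the task_slice return convention matches the
existing do_schedule hooks.  Each pcpu in the pool has exactly one vcpu
pinned to it, so the scheduling decision is almost trivial:

    /* Sketch of a do_schedule() hook for a "dedicated" scheduler running
     * in its own cpupool.  per_cpu(dedicated_vcpu, ...) is invented:
     * the single vcpu pinned to this pcpu, or NULL. */
    static struct task_slice
    dedicated_schedule(const struct scheduler *ops, s_time_t now,
                       bool_t tasklet_work_scheduled)
    {
        unsigned int cpu = smp_processor_id();
        struct vcpu *v = per_cpu(dedicated_vcpu, cpu);
        struct task_slice ret;

        /* Run the pinned vcpu if it's runnable; otherwise (or if tasklet
         * work is pending) hand the pcpu to the idle vcpu. */
        if ( tasklet_work_scheduled || v == NULL || !vcpu_runnable(v) )
            ret.task = idle_vcpu[cpu];
        else
            ret.task = v;

        ret.time = -1;      /* no time slice, so no periodic scheduler timer */
        ret.migrated = 0;

        return ret;
    }

A cpupool using that scheduler would then hold only the dedicated
pcpus, leaving the rest of the system on credit/credit2 as usual.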

My guess is that #2 would involve basically rewriting a parallel set
of entry / exit routines which were pared down to an absolute minimum,
and then having machinery in place to switch a CPU to use those
routines (with a specific vcpu) rather than the current, more
fully-functional ones.  It might also require cutting back on the
functionality given to the guest in terms of hypercalls -- making this
"minimalist" Xen environment work with all the existing hypercalls
might be a lot of work.

That sounds like a lot of very complicated work, and before you tried
it I think you'd want to be very much convinced that it would pay off
in terms of reduced wake-up latency.  Getting from 5000 cycles down to
1000 cycles might be worth it; getting from 1400 cycles down to 1000,
or 5000 cycles down to 4600, maybe not so much. :-)

I'm not sure exactly what #3 would entail; it might involve basically
taking the cpu offline from Xen's perspective.  (Again, not sure if
it's possible or worth it.)

You might take a look at this presentation from FOSDEM last year, to
see if you can get any interesting ideas:

https://archive.fosdem.org/2014/schedule/event/virtiaas13/

Any opinions, Dario / Jan / Tim?

 -George



 

