
Re: [Xen-devel] Performance evaluation and Questions: Eliminating Xen (RTDS) scheduler overhead on dedicated CPU



On 04/07/2015 09:25 PM, Meng Xu wrote:
> Hi George, Dario and Konrad,
> 
> I finished a prototype of the RTDS scheduler with the dedicated CPU
> feature and did some quick evaluation of it. Right now, I need to
> refactor the code (it got kind of messy while I was exploring
> different approaches :() and will send out the clean patch later
> (this week or next week). The design follows our discussion at
> http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02854.html.
> 
> In a nutshell, when a CPU is marked as a dedicated CPU, the scheduler
> on that CPU returns the dedicated VCPU with a negative time, which
> disables the scheduler timer on that CPU, and other CPUs no longer
> send SCHEDULE_SOFTIRQ to the dedicated CPU. The scheduler on the
> dedicated CPU may still be invoked when the dedicated VCPU is
> blocked/unblocked by the domU. When that happens, the scheduler takes
> a fast path that just returns the idle VCPU or the dedicated VCPU
> instead of going through the runq.
> 
> I did the following evaluation to show the benefits of introducing
> this dedicated CPU feature:
> 
> I created a simple CPU-intensive task that just performs a
> multiplication a specified number of times:
>         start = rdtsc();
>         while ( i++ < cpu_measurement->multiply_times )
>             result += i * i;
>         finish = rdtsc();
>         latencies[k] = finish - start;
> 
> I ran this task and measured the execution time of the above piece of
> code in different environments: native Linux on bare metal, domU on
> Xen with the RTDS scheduler, domU on Xen with the RTDS scheduler plus
> the dedicated CPU feature, and domU on Xen with the Credit/Credit2
> scheduler.
> 
> The difference between the execution time in the virtualized
> environment and the execution time on native Linux on bare metal is
> the virtualization overhead introduced by Xen.
> 
> I wanted to see that
> 1) The virtualization overhead decreases a lot once the dedicated CPU
> feature is employed for the RTDS scheduler (because the task no
> longer suffers the scheduler overhead).
> 2) The frequency of invoking the scheduler on the dedicated CPU
> becomes very low once the dedicated CPU feature is applied.
> 
> The result is as follows:
> When the CPU-intensive task performed 1024 multiplications, the
> execution time of the piece of code was:
> 9264 cycles on native Linux on bare metal;
> 10320 cycles on Xen RTDS scheduler with dedicated CPU feature;
> 10324 cycles on Xen RTDS scheduler without dedicated CPU feature;
> 
> We didn't see an improvement from the dedicated CPU feature here
> because the execution time is too short for the task to experience
> the scheduler overhead.
> 
> When the CPU-intensive task performed 536870912 multiplications, the
> execution time of the piece of code was:
> 4838016028 cycles on native Linux on bare metal;
> 4839649567 cycles on Xen RTDS scheduler with dedicated CPU feature;
> 4855509977 cycles on Xen RTDS scheduler without dedicated CPU feature;

Hey Meng!  Thanks for looking at this.

One thing: it's not entirely clear to me whether the numbers for
"without dedicated CPU feature" are still with the equivalent of
"pinning" -- i.e., is it guaranteed that no other vcpu will be run on
the same cpu as the test program?

Assuming that's the case, the numbers you give above show a 0.3%
improvement for the "dedicated" cpu for cpu-intensive workloads.


> We can see that the dedicated CPU feature did save time for the
> CPU-intensive task. Without the dedicated CPU feature, the hypervisor
> scheduler may steal time from the domU and delay the execution of the
> task inside domU.
> 
> I varied the number of multiplications in the above piece of code in
> the CPU-intensive task, and drew a figure to show the relation
> between the overhead and the execution time of the task on native
> Linux. The figure can be found at
> http://www.cis.upenn.edu/~mengxu/xen-ml/cpu-base-alone_multiply_0_0_100.virtOhVSwcetnative.pdf.
> Please note the x-axis is the "log" value of the execution time. So
> the overhead is actually linear in the execution time of the task.

I'm not sure I can gain any useful information out of this graph.  A
more useful comparison would be to graph the execution time as an
*overhead* compared to the Linux execution time.  For instance, in the
numbers above, you'd have Linux = 1, RTDS+dedicated = 1.000337, RTDS =
1.003616.

But what it sounds like you're saying is that if you did such a graph,
the overhead would be pretty flat.  That's what I'd expect -- a fairly
constant overhead, regardless of how long you were running the test.
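As a rough illustration of the normalization suggested above, here is
a minimal C sketch that turns a raw cycle count into a ratio against
the native-Linux baseline.  The function names are hypothetical and
not part of any Xen or benchmark code; the only inputs are the cycle
counts quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Execution time relative to the native baseline (native run == 1.0).
 * Hypothetical helper, for illustration only. */
double normalized_time(uint64_t cycles, uint64_t native_cycles)
{
    assert(native_cycles > 0);
    return (double)cycles / (double)native_cycles;
}

/* Overhead relative to native, expressed as a percentage. */
double overhead_percent(uint64_t cycles, uint64_t native_cycles)
{
    return (normalized_time(cycles, native_cycles) - 1.0) * 100.0;
}
```

With the long-run numbers, normalized_time(4839649567, 4838016028) is
about 1.000337 and normalized_time(4855509977, 4838016028) is about
1.003616; the gap between the two is roughly 0.33 percentage points.
Plotting this ratio against run length would show directly whether the
overhead stays flat.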

> As to the frequency of invoking the RTDS scheduler with/without the
> dedicated CPU feature, I added some code to trace which event
> triggers the scheduler on the dedicated cpu and how frequently.
> 
> Before we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 3.5us on
> average.
> (XEN) cpu 3 has invoked 356805936 SCHED_SOFTIRQ (sched) within 1267613845122 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(356789129), do_pool(18)
> (XEN) vcpu_yield(0), vcpu_block(10483)
> 
> 
> After we apply the dedicated CPU feature to the RTDS scheduler, the
> scheduler on the dedicated CPU 3 was invoked once every 136ms on
> average, and only because of vcpu_block/vcpu_unblock events. (We
> could modify Linux in domU as Konrad suggests to avoid the hypercall
> when the vcpu is blocked/unblocked, but I'm unsure if it is better to
> do that since it involves a change in domU.)
> (XEN) cpu 3 has invoked 5396 SCHED_SOFTIRQ (nooh) within 736973916783 ns
> (XEN) tasklet_enqueue(0), do_tasklet(0), s_timer_fn(0), do_pool(0)
> (XEN) vcpu_yield(0), vcpu_block(2698)
> 
> 
> Here are some conclusions/observations we have:
> 1) The dedicated CPU feature removes the scheduler overhead seen by
> domU and thus reduces the virtualization overhead.
> 2) From the application's point of view, the scheduler overhead of
> the current RTDS scheduler is higher than that of the current
> credit/credit2 schedulers because the RTDS scheduler is invoked much
> more frequently. (The RTDS scheduler is invoked at least once every
> 1ms, while the credit2 scheduler is invoked once every 30ms.) This
> shows we do need to move the RTDS scheduler from quantum driven to
> event driven (i.e., timer-driven) and only call the scheduler when
> it is necessary.

So a couple of things.  First, the vast majority of people using
virtualization don't care *that much* about the CPU overhead.  Even in
the case of embedded, a 0.3% overhead reduction would probably translate
to a 0.3% improvement in battery life -- an amount so minuscule that it
would be lost in the noise.

Secondly, implementing the "dedicated cpu" feature would require adding
an entirely new interface, which is a fairly significant cost.  It's
costly for users to learn and configure the new interface, it's costly
to document, and once it's there we have to continue to support it
perhaps for a long time to come; and the feature itself is also fairly
complicated and increases the code maintenance burden.

So I think the performance improvement you've shown so far is nowhere
near large enough a benefit to outweigh this cost.

And in any case, as you say, it looks like the source of the overhead is
the very frequent invocation of the RTDS scheduler.  You could probably
get the same kinds of benefits without adding any new interfaces by
reducing how often the scheduler gets invoked when there are no other
tasks to run on that cpu.
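To make the invocation rates concrete, the mean interval between
scheduler invocations follows directly from the two counters in the
quoted (XEN) trace lines: the measurement window divided by the
SCHED_SOFTIRQ count.  A minimal sketch, with a hypothetical function
name that is not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Mean time between scheduler invocations, in nanoseconds.
 * `invocations` is the SCHED_SOFTIRQ count from the trace and
 * `window_ns` the length of the measurement window. */
double mean_invocation_interval_ns(uint64_t invocations, uint64_t window_ns)
{
    assert(invocations > 0);
    return (double)window_ns / (double)invocations;
}
```

Fed with the quoted counters, this gives about 3553 ns (roughly
3.55us) for 356805936 invocations in 1267613845122 ns, and about
136.6 ms for 5396 invocations in 736973916783 ns.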

What I was expecting you to test, for the RTDS scheduler, was the
wake-up latency.  Have you looked at that at all?
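For reference, one common way to measure wake-up latency from inside
a guest is a cyclictest-style loop: sleep until an absolute deadline
and record how late the wake-up actually happened.  The sketch below
assumes a Linux guest with POSIX clock_nanosleep(); the function names
are made up for illustration and this is not the test used anywhere in
this thread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* clock_gettime(), clock_nanosleep() */
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/* Sleep until a deadline `period_ns` from now and return how many
 * nanoseconds past the deadline we woke up, or -1 on error. */
int64_t wakeup_latency_ns(int64_t period_ns)
{
    struct timespec now, deadline, woke;
    int rc;

    assert(period_ns > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
        return -1;

    int64_t target = ts_to_ns(&now) + period_ns;
    deadline.tv_sec = (time_t)(target / NSEC_PER_SEC);
    deadline.tv_nsec = (long)(target % NSEC_PER_SEC);

    /* clock_nanosleep() returns the error number directly;
     * retry if a signal interrupted the sleep. */
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    } while (rc == EINTR);
    if (rc != 0)
        return -1;

    if (clock_gettime(CLOCK_MONOTONIC, &woke) != 0)
        return -1;
    return ts_to_ns(&woke) - target;
}

/* Worst wake-up latency observed over `samples` sleeps, or -1 on error. */
int64_t max_wakeup_latency_ns(int64_t period_ns, int samples)
{
    int64_t worst = 0;

    assert(samples > 0);
    for (int i = 0; i < samples; i++) {
        int64_t lat = wakeup_latency_ns(period_ns);
        if (lat < 0)
            return -1;
        if (lat > worst)
            worst = lat;
    }
    return worst;
}
```

Running max_wakeup_latency_ns() with, say, a 1ms period in a pinned
domU under each scheduler would give comparable worst-case figures.
Note that the result also includes the guest kernel's own wake-up
path, so it is an upper bound on the hypervisor's share.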

> 3) There exists some constant virtualization overhead (see the case
> when the task's execution time is very small, 9264 cycles). I don't
> know where this kind of constant virtualization overhead comes from
> and if we can eliminate/bound this kind of overhead. Do you have any
> suggestion/advice on this?

How are you testing this -- running RDTSC?  Do you know what TSC mode
you're running in?  If you're trapping on TSC reads, that might account
for some of the overhead for very short runs.
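One quick way to check is to time back-to-back counter reads inside
the guest: if the smallest delta between two consecutive RDTSCs is far
larger than on bare metal, the reads are probably being trapped and
emulated.  The sketch below is an assumption-laden illustration:
min_read_cost() and fake_counter() are hypothetical names, the counter
is passed in as a function pointer so the logic can be tried without
real hardware, and read_tsc() is only compiled on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter via the compiler intrinsic (x86 only). */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
#endif

typedef uint64_t (*counter_fn)(void);

/* Smallest delta between two consecutive counter reads, taken over
 * `samples` attempts.  The minimum filters out interrupts and other
 * noise, leaving an estimate of the cost of one read. */
uint64_t min_read_cost(counter_fn read_counter, size_t samples)
{
    uint64_t best = UINT64_MAX;

    assert(read_counter != NULL && samples > 0);
    for (size_t i = 0; i < samples; i++) {
        uint64_t t0 = read_counter();
        uint64_t t1 = read_counter();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Deterministic stand-in counter that advances 25 ticks per read;
 * only here so the helper can be exercised without a real TSC. */
uint64_t fake_counter(void)
{
    static uint64_t ticks;
    ticks += 25;
    return ticks;
}
```

On x86, calling min_read_cost(read_tsc, 100000) on bare metal and
inside the domU and comparing the two results gives a rough estimate
of the fixed per-read cost that ends up inside every start/finish
measurement.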

Other than that, nothing comes to mind off the top of my head.

 -George




 

