
Re: [Xen-devel] xen: credit2: credit2 can’t reach the throughput as expected



Forwarding to xen-devel, as it was dropped.
---
Hi, Dario,

> On Thu, 2019-02-14 at 07:10 +0000, zheng chuan wrote:
> > Hi, Dario,
> >
> Hi,
> 
> > I have put the test demo in attachment, please run it as follows:
> > 1. compile it
> > gcc upress.c -o upress
> > 2. calculate the loops in dom0 first
> > ./upress -l 100
> > For example, the output is
> > cpu khz : 2200000
> > calculate loops: 4472.
> > We get 4472.
> > 3. give the 20% pressure for each vcpu in guest by ./upress -l 20 -z 4472 &
> > It is better to bind each pressure task to vcpu by taskset.
> >
> Ok, thanks for the code and the instructions, I will give it a try.
> 

If you have questions about the test code, please let me know:)
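
In case the attachment gets stripped again, the idea of upress is
roughly the following kind of loop (only a simplified sketch with
illustrative constants to convey the idea; the attached upress.c is the
authoritative version, in particular for how the calibrated loop count
relates to real time):

/*
 * Simplified sketch of a CPU-pressure generator in the spirit of
 * upress.c (illustrative only, not the attached code).
 *   -l <percent> : target CPU load
 *   -z <loops>   : calibrated busy-loop count from the calibration run,
 *                  assumed here to correspond to roughly 1ms of spinning
 */
#include <stdlib.h>
#include <unistd.h>

static void burn(long loops)
{
    volatile long i;

    for (i = 0; i < loops; i++)
        ;               /* pure busy loop, keeps the vcpu runnable */
}

int main(int argc, char *argv[])
{
    long load = 20, loops = 4472;       /* defaults only for the sketch */
    int opt;

    while ((opt = getopt(argc, argv, "l:z:")) != -1) {
        if (opt == 'l')
            load = atol(optarg);
        else if (opt == 'z')
            loops = atol(optarg);
    }
    if (load > 100)
        load = 100;

    /*
     * 100ms period: spin for 'load' ms worth of calibrated loops, then
     * sleep for the rest, so the task consumes about load% of one CPU.
     */
    for (;;) {
        long ms;

        for (ms = 0; ms < load; ms++)
            burn(loops);
        usleep((100 - load) * 1000);
    }
    return 0;
}

One instance of this is started per vcpu and pinned with taskset, as in
step 3 above.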

> > Sorry for the mess picture, you can see the figure below.
> >
> Yeah, that's clearer. However, it is preferred to avoid HTML emails. In
> these cases, you could put the table in some online accessible document,
> and post the link. :-)
> 
> Also, let me ask this again, is this coming from actual tracing (like
> with `xentrace` etc)?
> 
> > The green one means the vcpu is running, while the red one means idle.
> > In Fig.1, vcpu1 and vcpu2 run in a staggered way: vcpu1 runs 20ms and
> > then vcpu2 runs 20ms while vcpu1 is sleeping.
> >
> How do you know it's sleeping and not, for instance, that it has been
> preempted and hence is waiting to run?
> 

It is a schematic diagram of the behaviour in theory, not a real trace.
I'm sorry, xentrace has a problem on my machine; I'll post the trace as
soon as I fix it.

> My point being that, when you set up a workload like this, and only
> look at the throughput you achieve, it is expected that schedulers
> with longer timeslices do better.
> 
> It would be interesting to look at both throughput and latency, though.
> In fact, (assuming the analysis is correct) in the Credit1 case, if two
> vcpus wake up at about the same time, the one that wins the pcpu runs
> for a full timeslice, or until it blocks, i.e., in this case, for 20ms.
> This means the other vcpu will have to wait for that long, before being
> able to do anything.
> 
> > In Fig.2, vcpu1 and vcpu2 run at the same time, which means vcpu1 and
> > vcpu2 compete for the pCPU and then go to sleep at the same time.
> > Obviously, the smaller the time-slice is, the worse the competition.
> >
> But the better the latency. :-D
> 
> What I mean is that achieving the best throughput and the best latency
> at the same time is often impossible, and the job of a scheduler is to
> come up with a trade-off, as well as with tunables for letting people
> that care more about either one or the other steer it in that direction.
> 
> Achieving better latency than Credit1 has been a goal of Credit2 since
> design time. However, it's possible that we ended up sacrificing
> throughput too much, or that we lack tunables to let users decide what
> they want.
> 
> Of course, this is all assuming that the analysis of the problem that
> you're providing is correct, which I'll be looking into confirming. :-)
> 

I agree that it is difficult to balance throughput against
sched_latency :(
In my workload, if we enlarge the credit difference between vcpus, the
vcpus end up running in a staggered way over the long term. I suspect
the sched_latency could still be low if we spread the running vcpus
across pCPUs, since the pCPUs are not used up to 100%.

I have measured sched_latency in CFS with perf, using these scheduler
parameters:
linux-GMwmmh:~ # cat /proc/sys/kernel/sched_min_granularity_ns 
3000000
linux-GMwmmh:~ # cat /proc/sys/kernel/sched_latency_ns 
24000000
linux-GMwmmh:~ #
With these, the sched_latency of a vcpu is at most around 21ms.

But I don't know how to compare that sched_latency with Credit2, since
the analysis tools and the schedulers are totally different :(.
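
(A back-of-the-envelope reading of those two knobs, just my own
interpretation: with sched_latency_ns = 24ms and
sched_min_granularity_ns = 3ms, CFS targets one full rotation of the
runqueue every 24ms as long as no more than 24 / 3 = 8 tasks are
runnable, so a task that has just used its ~3ms slice can wait roughly
24 - 3 = 21ms before it runs again, which would match the ~21ms maximum
I measured.)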

> > As you mentioned, Credit2 does not have a real timeslice: the vcpu
> > can be preempted dynamically based on the difference of credit
> > (+ sched_ratelimit_us).
> >
> Actually, it's:
> 
> difference_of_credit + min(CSCHED2_MIN_TIMER, sched_ratelimit_us)
> 

Thank you for the correction.

> > > Perhaps, one thing that can be done to try to confirm this
> > > analysis, would be to make the scheduling less frequent in Credit2
> > > and, on the other hand, to make it more frequent in Credit1.
> 
> > Here is the further test result:
> > i. it is interesting that it still works well if I set the Credit1
> > timeslice to 1ms by xl sched-credit -s -t 1
> > linux-sodv:~ # xl sched-credit
> > Cpupool Pool-0: tslice=1ms ratelimit=1000us migration-delay=0us
> > Name ID Weight Cap
> > Domain-0 0 256 0
> > Xenstore 1 256 0
> > guest_1 2 256 0
> > guest_2 3 256 0
> >
> Hah, yes, it is interesting indeed! It shows us one more time how
> unpredictable Credit1's behavior is, because of all the hacks it
> accumulated over time (some of which are my doing, I know... :-P).
> 
> > ii. it works well if sched_ratelimit_us is set to 30ms or above.
> > linux-sodv:~ # xl sched-credit2 -s -p Pool-0
> > Cpupool Pool-0: ratelimit=30000us
> >
> Ok, good to know, thanks for doing the experiment.
> 
> If you have time, can you try other values? I mean, while still on
> Credit2, try to set ratelimiting to, like, 20, 15, 10, 5, and report
> what happens?
> 

It still has the problem if I set ratelimiting below 30ms.
1): 20ms: not stable, the guests sometimes get up to 80% and 160%
xentop - 20:08:14   Xen 4.11.0_02-1
4 domains: 1 running, 3 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      112    3.0 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   --b---      110   67.1  1048832  1.6  1049600 1.6  4 1 636 4 1 0 4072 2195 191451 10983 0
guest_2   --b---      186  134.8  1048832  1.6  1049600 1.6  8 1 630 4 1 0 4203 1166 191619 10921 0

2): 15 ms
xentop - 20:10:07   Xen 4.11.0_02-1
4 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      116    2.7 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   --b---      193   73.9  1048832  1.6  1049600 1.6  4 1 927 5 1 0 4072 2198 191451 10992 0
guest_2   -----r      350  146.6  1048832  1.6  1049600 1.6  8 1 921 6 1 0 4203 1169 191619 10930 0

3): 10 ms
xentop - 20:07:35   Xen 4.11.0_02-1
4 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      111    3.1 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   -----r       81   67.1  1048832  1.6  1049600 1.6  4 1 588 3 1 0 4072 2193 191451 10980 0
guest_2   ------      130  125.5  1048832  1.6  1049600 1.6  8 1 583 4 1 0 4203 1164 191619 10918 0

4): 5 ms
xentop - 20:07:12   Xen 4.11.0_02-1
4 domains: 3 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      110    2.8 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   -----r       66   64.5  1048832  1.6  1049600 1.6  4 1 386 3 1 0 4072 2187 191451 10835 0
guest_2   -----r      101  124.3  1048832  1.6  1049600 1.6  8 1 381 3 1 0 4203 1161 191619 10822 0
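
(For each of the cases above, the ratelimit was changed on the fly
before re-running the pressure tasks; with xl that should be something
along the lines of "xl sched-credit2 -s -p Pool-0 -r 20000" for the
20ms case, the value being in microseconds. I am quoting the
-r/--ratelimit_us option from memory, so please double-check it against
your xl version.)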

> > However, sched_ratelimit_us is not so elegant and flexible, since it
> > rigidly guarantees a specific time-slice.
> >
> Well, I personally never loved it, but it is not completely unrelated
> to what we're seeing and discussing, TBH. It indeed was introduced to
> improve the throughput, in workloads where there were too many wakeups
> (which, in Credit1, also resulted in invoking the scheduler and often
> in context switching, due to boosting).
> 
> > It may very likely degrade other scheduler criteria like
> > sched_latency.
> > As far as I know, CFS can adjust the time-slice according to the
> > number of tasks in the runqueue (in __sched_period()).
> > Could Credit2 also have a similar ability to adjust the time-slice
> > automatically?
> >
> Well, let's see. Credit2 and CFS are very similar, in principle, but
> the code is actually quite different. But yeah, we may be able to
> the code is actually quite different. But yeah, we may be able to come
> up with something more clever than just plain ratelimiting, for
> adjusting what CFS calls "the granularity".
> 

Yes, but I think it is a little difficult to do it the same way as CFS,
because Credit2 has one runqueue per socket, so we cannot rely on
per-cpu runqueues anymore.
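
For reference, the CFS behaviour I have in mind is roughly the
following (a paraphrase of __sched_period() from kernel/sched/fair.c,
rewritten as standalone C for discussion, so the types and globals are
simplified and the real kernel code may differ in detail):

/*
 * Approximate paraphrase of CFS's __sched_period(): once there are
 * more runnable tasks than sched_latency / sched_min_granularity, the
 * period stretches so that every task still gets at least
 * min_granularity of CPU time per period.
 */
static unsigned long long sched_latency_ns         = 24000000ULL; /* 24ms, as on my box */
static unsigned long long sched_min_granularity_ns =  3000000ULL; /*  3ms, as on my box */

static unsigned long long sched_period_ns(unsigned long nr_running)
{
    unsigned long nr_latency = sched_latency_ns / sched_min_granularity_ns; /* 8 here */

    if (nr_running > nr_latency)
        return nr_running * sched_min_granularity_ns;

    return sched_latency_ns;
}

Something similar in Credit2 would presumably have to scale with the
number of runnable vcpus in the per-socket runqueue rather than with a
per-pcpu one, which is exactly where I think it gets harder.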

Best Regards.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

