
Re: [Xen-devel] xen: credit2: credit2 can’t reach the throughput as expected



Forwarding to xen-devel, as it was dropped.
---
Hi, Dario,

> On Thu, 2019-02-14 at 07:10 +0000, zheng chuan wrote:
> > Hi, Dario,
> >
> Hi,
> 
> > I have put the test demo in attachment, please run it as follows:
> > 1. compile it
> > gcc upress.c -o upress
> > 2. calculate the loops in dom0 first
> > ./upress -l 100
> > For example, the output is
> > cpu khz : 2200000
> > calculate loops: 4472.
> > We get 4472.
> > 3. give the 20% pressure for each vcpu in guest by ./upress -l 20 -z 4472 &
> > It is better to bind each pressure task to vcpu by taskset.
> >
> Ok, thanks for the code and the instructions, I will give it a try.
> 

If you have questions about the test code, please let me know:)
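
In case the attachment gets stripped again, the idea of upress is
roughly the following kind of loop (only a simplified sketch with
illustrative constants to convey the idea; the attached upress.c is the
authoritative version, in particular for how the calibrated loop count
relates to real time):

/*
 * Simplified sketch of a CPU-pressure generator in the spirit of
 * upress.c (illustrative only, not the attached code).
 *   -l <percent> : target CPU load
 *   -z <loops>   : calibrated busy-loop count from the calibration run,
 *                  assumed here to correspond to roughly 1ms of spinning
 */
#include <stdlib.h>
#include <unistd.h>

static void burn(long loops)
{
    volatile long i;

    for (i = 0; i < loops; i++)
        ;               /* pure busy loop, keeps the vcpu runnable */
}

int main(int argc, char *argv[])
{
    long load = 20, loops = 4472;       /* defaults only for the sketch */
    int opt;

    while ((opt = getopt(argc, argv, "l:z:")) != -1) {
        if (opt == 'l')
            load = atol(optarg);
        else if (opt == 'z')
            loops = atol(optarg);
    }
    if (load > 100)
        load = 100;

    /*
     * 100ms period: spin for 'load' ms worth of calibrated loops, then
     * sleep for the rest, so the task consumes about load% of one CPU.
     */
    for (;;) {
        long ms;

        for (ms = 0; ms < load; ms++)
            burn(loops);
        usleep((100 - load) * 1000);
    }
    return 0;
}

One instance of this is started per vcpu and pinned with taskset, as in
step 3 above.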

> > Sorry for the mess picture, you can see the figure below.
> >
> Yeah, that's clearer. However, it is preferred to avoid HTML emails. In
> these cases, you could put the table in some online accessible document,
> and post the link. :-)
> 
> Also, let me ask this again, is this coming from actual tracing (like
> with `xentrace` etc)?
> 
> > The green one means the vcpu is running, while the red one means idle.
> > In Fig.1, vcpu1 and vcpu2 run in a staggered way: vcpu1 runs 20ms and
> > then vcpu2 runs 20ms while vcpu1 is sleeping.
> >
> How do you know it's sleeping and not, for instance, that it has been
> preempted and hence is waiting to run?
> 

It is a schematic diagram of the behaviour in theory, not a real trace.
I'm sorry, xentrace has a problem on my machine; I'll post the trace as
soon as I fix it.

> My point being that, when you set up a workload like this, and only
> look at the throughput you achieve, it is expected that schedulers
> with longer timeslices do better.
> 
> It would be interesting to look at both throughput and latency, though.
> In fact, (assuming the analysis is correct) in the Credit1 case, if two
> vcpus wake up at about the same time, the one that wins the pcpu runs
> for a full timeslice, or until it blocks, i.e., in this case, for 20ms.
> This means the other vcpu will have to wait for that long, before being
> able to do anything.
> 
> > In Fig.2, vcpu1 and vcpu2 run at the same time, which means vcpu1 and
> > vcpu2 compete for the pCPU and then go to sleep at the same time.
> > Obviously, the smaller the time-slice is, the worse the competition.
> >
> But the better the latency. :-D
> 
> What I mean is that achieving the best throughput and the best latency
> at the same time is often impossible, and the job of a scheduler is to
> come up with a trade-off, as well as with tunables for letting people
> that care more about either one or the other steer it in that direction.
> 
> Achieving better latency than Credit1 has been a goal of Credit2 since
> design time. However, it's possible that we ended up sacrificing
> throughput too much, or that we lack tunables to let users decide what
> they want.
> 
> Of course, this is all assuming that the analysis of the problem that
> you're providing is correct, which I'll be looking into confirming. :-)
> 

I agree that it is difficult to balance throughput against
sched_latency :(
In my workload, if we enlarge the credit difference between vcpus, the
vcpus end up running in a staggered way over the long term. I suspect
the sched_latency could still be low if we spread the running vcpus
across pCPUs, since the pCPUs are not used up to 100%.

I have measured sched_latency in CFS with perf, using these scheduler
parameters:
linux-GMwmmh:~ # cat /proc/sys/kernel/sched_min_granularity_ns 
3000000
linux-GMwmmh:~ # cat /proc/sys/kernel/sched_latency_ns 
24000000
linux-GMwmmh:~ #
With these, the sched_latency of a vcpu is at most around 21ms.

But I don't know how to compare that sched_latency with Credit2, since
the analysis tools and the schedulers are totally different :(.
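
(A back-of-the-envelope reading of those two knobs, just my own
interpretation: with sched_latency_ns = 24ms and
sched_min_granularity_ns = 3ms, CFS targets one full rotation of the
runqueue every 24ms as long as no more than 24 / 3 = 8 tasks are
runnable, so a task that has just used its ~3ms slice can wait roughly
24 - 3 = 21ms before it runs again, which would match the ~21ms maximum
I measured.)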

> > As you mentioned, Credit2 does not have a real timeslice: the vcpu
> > can be preempted dynamically based on the difference of credit
> > (+ sched_ratelimit_us).
> >
> Actually, it's:
> 
> difference_of_credit + min(CSCHED2_MIN_TIMER, sched_ratelimit_us)
> 

Thank you for the correction.

> > > Perhaps, one thing that can be done to try to confirm this
> > > analysis, would be to make the scheduling less frequent in Credit2
> > > and, on the other hand, to make it more frequent in Credit1.
> 
> > Here is the further test result:
> > i. it is interesting that it still works well if I set the Credit1
> > timeslice to 1ms by xl sched-credit -s -t 1
> > linux-sodv:~ # xl sched-credit
> > Cpupool Pool-0: tslice=1ms ratelimit=1000us migration-delay=0us
> > Name ID Weight Cap
> > Domain-0 0 256 0
> > Xenstore 1 256 0
> > guest_1 2 256 0
> > guest_2 3 256 0
> >
> Hah, yes, it is interesting indeed! It shows us one more time how
> unpredictable Credit1's behavior is, because of all the hacks it
> accumulated over time (some of which are my doing, I know... :-P).
> 
> > ii. it works well if sched_ratelimit_us is set to 30ms or above.
> > linux-sodv:~ # xl sched-credit2 -s -p Pool-0
> > Cpupool Pool-0: ratelimit=30000us
> >
> Ok, good to know, thanks for doing the experiment.
> 
> If you have time, can you try other values? I mean, while still on
> Credit2, try to set ratelimiting to, like, 20, 15, 10, 5, and report
> what happens?
> 

It still has the problem if I set ratelimiting below 30ms.
1): 20ms: not stable, the guests sometimes get up to 80% and 160%
xentop - 20:08:14   Xen 4.11.0_02-1
4 domains: 1 running, 3 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      112    3.0 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   --b---      110   67.1  1048832  1.6  1049600 1.6  4 1 636 4 1 0 4072 2195 191451 10983 0
guest_2   --b---      186  134.8  1048832  1.6  1049600 1.6  8 1 630 4 1 0 4203 1166 191619 10921 0

2): 15 ms
xentop - 20:10:07   Xen 4.11.0_02-1
4 domains: 2 running, 2 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      116    2.7 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   --b---      193   73.9  1048832  1.6  1049600 1.6  4 1 927 5 1 0 4072 2198 191451 10992 0
guest_2   -----r      350  146.6  1048832  1.6  1049600 1.6  8 1 921 6 1 0 4203 1169 191619 10930 0

3): 10 ms
xentop - 20:07:35   Xen 4.11.0_02-1
4 domains: 2 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      111    3.1 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   -----r       81   67.1  1048832  1.6  1049600 1.6  4 1 588 3 1 0 4072 2193 191451 10980 0
guest_2   ------      130  125.5  1048832  1.6  1049600 1.6  8 1 583 4 1 0 4203 1164 191619 10918 0

4): 5 ms
xentop - 20:07:12   Xen 4.11.0_02-1
4 domains: 3 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 67079796k total, 67078968k used, 828k free    CPUs: 32 @ 2599MHz
NAME      STATE  CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0  -----r      110    2.8 64050452 95.5 no limit n/a 32 0 0 0 0 0 0 0 0 0 0
guest_1   -----r       66   64.5  1048832  1.6  1049600 1.6  4 1 386 3 1 0 4072 2187 191451 10835 0
guest_2   -----r      101  124.3  1048832  1.6  1049600 1.6  8 1 381 3 1 0 4203 1161 191619 10822 0
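
(For each of the cases above, the ratelimit was changed on the fly
before re-running the pressure tasks; with xl that should be something
along the lines of "xl sched-credit2 -s -p Pool-0 -r 20000" for the
20ms case, the value being in microseconds. I am quoting the
-r/--ratelimit_us option from memory, so please double-check it against
your xl version.)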

> > However, sched_ratelimit_us is not so elegant and flexible, since it
> > rigidly guarantees a specific time-slice.
> >
> Well, I personally never loved it, but it is not completely unrelated
> to what we're seeing and discussing, TBH. It indeed was introduced to
> improve the throughput, in workloads where there were too many wakeups
> (which, in Credit1, also resulted in invoking the scheduler and often
> in context switching, due to boosting).
> 
> > It may very likely degrade other scheduler criteria like
> > sched_latency.
> > As far as I know, CFS can adjust the time-slice according to the
> > number of tasks in the runqueue (in __sched_period()).
> > Could Credit2 also have a similar ability to adjust the time-slice
> > automatically?
> >
> Well, let's see. Credit2 and CFS are very similar, in principle, but
> the code is actually quite different. But yeah, we may be able to
> the code is actually quite different. But yeah, we may be able to come
> up with something more clever than just plain ratelimiting, for
> adjusting what CFS calls "the granularity".
> 

Yes, but I think it is a little difficult to do it the same way as CFS,
because Credit2 has one runqueue per socket, so we cannot rely on
per-cpu runqueues anymore.
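
For reference, the CFS behaviour I have in mind is roughly the
following (a paraphrase of __sched_period() from kernel/sched/fair.c,
rewritten as standalone C for discussion, so the types and globals are
simplified and the real kernel code may differ in detail):

/*
 * Approximate paraphrase of CFS's __sched_period(): once there are
 * more runnable tasks than sched_latency / sched_min_granularity, the
 * period stretches so that every task still gets at least
 * min_granularity of CPU time per period.
 */
static unsigned long long sched_latency_ns         = 24000000ULL; /* 24ms, as on my box */
static unsigned long long sched_min_granularity_ns =  3000000ULL; /*  3ms, as on my box */

static unsigned long long sched_period_ns(unsigned long nr_running)
{
    unsigned long nr_latency = sched_latency_ns / sched_min_granularity_ns; /* 8 here */

    if (nr_running > nr_latency)
        return nr_running * sched_min_granularity_ns;

    return sched_latency_ns;
}

Something similar in Credit2 would presumably have to scale with the
number of runnable vcpus in the per-socket runqueue rather than with a
per-pcpu one, which is exactly where I think it gets harder.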

Best Regards.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

