Xen project Mailing List

Re: [Xen-devel] [PATCH 0/2] xen: credit2: fix vcpu starvation due to too few credits

To: Jürgen Groß <jgross@xxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 12 Mar 2020 16:27:06 +0000


Cc: Charles Arnold <carnold@xxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Glen <glenbarney@xxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Tomas Mozes <hydrapolic@xxxxxxxxx>, Sarah Newman <srn@xxxxxxxxx>

Delivery-date: Thu, 12 Mar 2020 16:27:21 +0000


List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 12/03/2020 15:51, Jürgen Groß wrote:
> On 12.03.20 14:44, Dario Faggioli wrote:
>> Hello everyone,
>>
>> There have been reports of a Credit2 issue due to which vCPUs were
>> being starved, to the point that the guest kernel would complain or
>> even crash.
>>
>> See the following xen-users and xen-devel threads:
>> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
>> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
>> https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg01158.html
>>
>> I did some investigations, and figured out that the vCPUs in question
>> are not scheduled for long time intervals because they somehow manage
>> to be given an amount of credits which is less than the credit the
>> idle vCPU has.
>>
>> An example of this situation is shown here. In fact, we can see d0v1
>> sitting in the runqueue while all the CPUs are idle, as it has
>> -1254238270 credits, which is smaller than -2^30 = -1073741824:
>>
>> (XEN) Runqueue 0:
>> (XEN)   ncpus = 28
>> (XEN)   cpus = 0-27
>> (XEN)   max_weight = 256
>> (XEN)   pick_bias = 22
>> (XEN)   instload = 1
>> (XEN)   aveload = 293391 (~111%)
>> (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>> (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
>> (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
>> [...]
>> (XEN) Runqueue 0:
>> (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
>> (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
>> [...]
>> (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
>> (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
>> (XEN) RUNQ:
>> (XEN)   0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)
>>
>> This happens because --although very rarely-- vCPUs are allowed to
>> execute for much more than the scheduler would want them to.
>>
>> For example, I have a trace showing that csched2_schedule() is
>> invoked at t=57970746155ns. At t=57970747658ns (+1503ns) the s_timer
>> is set to fire at t=57979485083ns, i.e., 8738928ns in the future.
>> That's because the credit of snext is exactly that 8738928ns. Then,
>> what I see is that the next call to burn_credits(), coming from
>> csched2_schedule() for the same vCPU, happens at t=60083283617ns.
>> That is *a lot* (2103798534ns) later than when we expected and asked.
>> Of course, that also means that delta is 2112537462ns, and therefore
>> credits will sink to -2103798534!
>
> Current ideas are:
>
> - Could it be the vcpu is busy for a very long time in the hypervisor?
>   So either fighting with another vcpu for a lock, doing a long
>   running hypercall, ...

Using watchdog=2 might catch that. (There is a counting issue which I've
not had time to fix yet, which makes the watchdog more fragile with a
smaller timeout, but 2 should be ok.)

>
> - The timer used is not reliable.
>
> - The time base is not reliable (tsc or whatever is used for getting
>   the time has jumped 2 seconds into the future).

Worth instrumenting the TSC rendezvous for unexpectedly large jumps?

>
> - System management mode has kicked in.

SMM handlers need to rendezvous to keep their secrets secret these days,
but I suppose this is always a possibility.

There are non-architectural SMI_COUNT MSRs (0x34 on Intel, can't
remember AMD off the top of my head) which can be used to see if any
have occurred, and this has proved useful in the past for debugging.

~Andrew
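
The figures in Dario's quoted trace can be re-derived with a few lines of Python. This is only an arithmetic check, not Xen code. It assumes credit is burned 1:1 with elapsed nanoseconds, which is what the quoted numbers imply (weight equal to max_weight, both 256). The constant names and the burn() helper are hypothetical and do not correspond to anything in the scheduler source.

```python
# Timestamps and credit, in ns, copied from the trace quoted in the message.
T_SCHEDULE = 57_970_746_155        # csched2_schedule() invoked
T_TIMER_SET = 57_970_747_658       # s_timer programmed
T_TIMER_EXPECTED = 57_979_485_083  # when s_timer was asked to fire
T_NEXT_BURN = 60_083_283_617       # next burn_credits() for the same vCPU
CREDIT_AT_SCHEDULE = 8_738_928     # credit of snext when scheduled

# Idle vCPU credit as quoted in the message (-2^30).
IDLE_CREDIT = -(2 ** 30)


def burn(credit, start_ns, now_ns):
    """Hypothetical helper: subtract elapsed ns from credit (1:1 assumption)."""
    return credit - (now_ns - start_ns)


setup_latency = T_TIMER_SET - T_SCHEDULE    # time to program the timer
timeslice = T_TIMER_EXPECTED - T_SCHEDULE   # intended run time
overrun = T_NEXT_BURN - T_TIMER_EXPECTED    # how late the next burn was
delta = T_NEXT_BURN - T_SCHEDULE            # total time charged
credit_after = burn(CREDIT_AT_SCHEDULE, T_SCHEDULE, T_NEXT_BURN)

print("setup latency:", setup_latency)
print("timeslice:    ", timeslice)
print("overrun:      ", overrun)
print("delta:        ", delta)
print("credit after: ", credit_after)
print("below idle:   ", credit_after < IDLE_CREDIT)
```

Under that assumption, the vCPU ends up with fewer credits than the idle vCPU after a single overrun of roughly 2.1 seconds, which matches the starvation pattern in the runqueue dump (credit=-1254238270, also below -2^30).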
