Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin
On Tue, 2018-04-10 at 09:34 +0000, George Dunlap wrote:
> Assuming the bug is this one:
>
> BUG_ON( cpu != snext->vcpu->processor );
>
Yes, it is that one. Another stack trace, this time from a debug=y
built hypervisor, of what we think is the same bug (although
reproduced in a slightly different way) is this:

(XEN) ----[ Xen-4.7.2_02-36.1.12847.11.PTF  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    45
(XEN) RIP:    e008:[<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
...
(XEN) Xen call trace:
(XEN)    [<ffff82d08012508f>] sched_credit.c#csched_schedule+0x361/0xaa9
(XEN)    [<ffff82d08012c233>] schedule.c#schedule+0x109/0x5d6
(XEN)    [<ffff82d08012fb5f>] softirq.c#__do_softirq+0x7f/0x8a
(XEN)    [<ffff82d08012fbb4>] do_softirq+0x13/0x15
(XEN)    [<ffff82d0801fd5c5>] vmx_asm_do_vmentry+0x25/0x2a

(I can provide it all, if necessary.)

I've done some analysis, although from back when we were not yet
entirely sure that changing the affinities was the actual cause (or,
at least, the trigger of the whole thing).

In the specific case of this stack trace, the vcpu currently running
on CPU 45 is d3v11. It is not in the runqueue, because it has been
removed and not added back, and the reason is that it is not runnable
(it has VPF_migrating set in pause_flags).

The runqueue of pcpu 45 looks fine (i.e., it is not corrupt or
anything like that); it contains d3v10, d9v1 and d32767v45, in this
order. d3v11->processor is 45, so that is also fine. Basically, d3v11
wants to move away from pcpu 45, and this might (but that's not
certain) be the reason why we're rescheduling. The fact that there are
vcpus wanting to migrate can very well be a consequence of the
affinity changes.

Now, the problem is that, looking into the runqueue, I found that
d3v10->processor == 32. I.e., d3v10 is queued in pcpu 45's runqueue
with processor=32, which really shouldn't happen. This is what makes
the bug trigger: in csched_schedule() we read the head of the runqueue
with:

  snext = __runq_elem(runq->next);

and then we pass snext to csched_load_balance(), where the BUG_ON is.

Another thing I've found is that all the "misplaced" vcpus (in this
and also in other manifestations of this bug) have
csched_vcpu.flags == 4, which is CSCHED_FLAGS_VCPU_MIGRATING. This,
again, is a sign of vcpu_migrate() having been called on d3v10 as
well, which in turn has called csched_vcpu_pick(). (A compressed toy
model of the resulting invariant violation is sketched at the end of
this mail.)

> a nasty race condition… a vcpu has just been taken off the runqueue
> of the current pcpu, but it’s apparently been assigned to a
> different cpu.
>
Nasty indeed. I've been looking into this on and off, but so far I
haven't found the root cause. Now that we know for sure that it is
changing affinity that triggers it, the field of investigation can be
narrowed a little bit... but I still find it hard to spot where the
race happens.

I'll look more into this later in the afternoon, and I'll let you know
if something comes to mind.

> Let me take a look.
>
Thanks! :-)
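To make the invariant (and the way we see it violated) concrete, here
is a toy model. To be clear, this is NOT Xen code: all the names in it
(toy_vcpu, toy_pcpu, safe_migrate, ...) are invented, the race is
compressed into straight-line code, and it only sketches the
data-structure invariant, under my reading of what the dump shows:

/*
 * Toy model of the invariant the BUG_ON() enforces: every vcpu
 * sitting on pcpu N's runqueue must have v->processor == N.
 * Not Xen code; all names are invented for illustration.
 */
#include <assert.h>
#include <stddef.h>

struct toy_vcpu {
    int processor;              /* pcpu this vcpu claims to be on */
    struct toy_vcpu *next;      /* single-linked stand-in for the runq */
};

struct toy_pcpu {
    int id;
    struct toy_vcpu *runq;      /* head == next vcpu to be picked */
};

static void runq_insert(struct toy_pcpu *p, struct toy_vcpu *v)
{
    v->next = p->runq;
    p->runq = v;
}

static void runq_remove(struct toy_pcpu *p, struct toy_vcpu *v)
{
    struct toy_vcpu **pp;

    for ( pp = &p->runq; *pp; pp = &(*pp)->next )
        if ( *pp == v )
        {
            *pp = v->next;
            break;
        }
}

/* The protocol that keeps the invariant: dequeue first, only then
 * rewrite v->processor, then re-queue on the new pcpu. */
static void safe_migrate(struct toy_vcpu *v, struct toy_pcpu *from,
                         struct toy_pcpu *to)
{
    runq_remove(from, v);
    v->processor = to->id;
    runq_insert(to, v);
}

/* What csched_schedule() + csched_load_balance() effectively assert
 * about the head of the runqueue. */
static struct toy_vcpu *pick_next(struct toy_pcpu *p)
{
    struct toy_vcpu *snext = p->runq;   /* __runq_elem(runq->next) */

    assert(snext->processor == p->id);  /* the BUG_ON() */
    return snext;
}

int main(void)
{
    struct toy_pcpu pcpu45 = { .id = 45, .runq = NULL };
    struct toy_pcpu pcpu32 = { .id = 32, .runq = NULL };
    struct toy_vcpu d3v10  = { .processor = 45, .next = NULL };

    runq_insert(&pcpu45, &d3v10);

    /* Done as one logical step, migration is fine: */
    safe_migrate(&d3v10, &pcpu45, &pcpu32);
    pick_next(&pcpu32);                 /* 32 == 32, no complaint */

    /* The state the dump shows: queued on pcpu 45's runqueue while
     * v->processor still says 32, i.e. the queueing and the
     * retargeting somehow got out of sync: */
    runq_remove(&pcpu32, &d3v10);
    runq_insert(&pcpu45, &d3v10);       /* on 45's runq, processor == 32 */
    pick_next(&pcpu45);                 /* assert fires, like the BUG_ON */

    return 0;
}

In the real code, of course, the dequeueing and the rewriting of
v->processor are both supposed to happen under the relevant runqueue
lock(s), so the question remains which path (vcpu_migrate() and
csched_vcpu_pick() being the prime suspects, given the flags) manages
to do one without the other.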
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE  https://www.suse.com/