|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] crash in csched_load_balance after xl vcpu-pin
On Wed, 2018-04-11 at 00:59 +0200, Dario Faggioli wrote:
> [Adding Andrew, not because I expect anything, but just because
> we've chatted about this issue on IRC :-) ]
>
Except, I did not add it. :-P
Anyway...
> On Tue, 2018-04-10 at 22:37 +0200, Olaf Hering wrote:
> > On Tue, Apr 10, Dario Faggioli wrote:
> >
> > BUG_ON(__vcpu_on_runq(CSCHED_VCPU(vc)));
> >
... patch attached.
Olaf, can you give it a try? It should be fine to run it on top of the
last debug patch (the one that produced this crash).
Regards,
Dario
> > (XEN) Xen BUG at sched_credit.c:876
> > (XEN) ----[ Xen-4.11.20180410T125709.50f8ba84a5-
> > 3.bug1087289_411 x86_64 debug=y Not tainted ]----
> > (XEN) CPU: 118
> > (XEN) RIP: e008:[<ffff82d080229ab4>]
> > sched_credit.c#csched_vcpu_migrate+0x27/0x51
> > ...
> > (XEN) Xen call trace:
> > (XEN) [<ffff82d080229ab4>]
> > sched_credit.c#csched_vcpu_migrate+0x27/0x51
> > (XEN) [<ffff82d080236348>] schedule.c#vcpu_move_locked+0xbb/0xc2
> > (XEN) [<ffff82d08023764c>] schedule.c#vcpu_migrate+0x226/0x25b
> > (XEN) [<ffff82d08023935f>] context_saved+0x8d/0x94
> > (XEN) [<ffff82d08027797d>] context_switch+0xe66/0xeb0
> > (XEN) [<ffff82d080236943>] schedule.c#schedule+0x5f4/0x627
> > (XEN) [<ffff82d080239f15>] softirq.c#__do_softirq+0x85/0x90
> > (XEN) [<ffff82d080239f6a>] do_softirq+0x13/0x15
> > (XEN) [<ffff82d08031f5db>] vmx_asm_do_vmentry+0x2b/0x30
> >
>
> Hey... unless I've really put there a totally bogous BUG_ON(), this
> looks interesting and potentially useful.
>
> It says that the vcpu which is being context switched out, and on
> which
> we are calling vcpu_migrate() on, because we found it to be
> VPF_migrating, is actually in the runqueue already when we actually
> get
> to execute vcpu_migrate()->vcpu_move_locked().
>
> Mmm... let's see.
>
> CPU A CPU B
> . .
> schedule(current == v) vcpu_set_affinity(v)
> prev = current // == v .
> schedule_lock(CPU A) .
> csched_schedule() schedule_lock(CPU A)
> if (runnable(v)) //YES x
> runq_insert(v) x
> return next != v x
> schedule_unlock(CPU A) x // takes the lock
> context_switch(prev,next) set_bit(v,
> VPF_migrating) [*]
> context_saved(prev) // still == v .
> v->is_running = 0 schedule_unlock(CPU A)
> SMP_MB .
> if (test_bit(v, VPF_migrating)) // YES!!
> vcpu_migrate(v) .
> for { .
> schedule_lock(CPU A) .
> SCHED_OP(v, pick_cpu) .
> set_bit(v, CSCHED_MIGRATING) .
> return CPU C .
> pick_called = 1 .
> schedule_unlock(CPU A) .
> schedule_lock(CPU A + CPU C) .
> if (pick_called && ...) // YES .
> break .
> } .
> // v->is_running is 0 .
> //!test_and_clear(v, VPF_migrating)) is false!!
> clear_bit(v, VPF_migrating) .
> vcpu_move_locked(v, CPU C) .
> BUG_ON(__vcpu_on_runq(v)) .
>
> [*] after this point, and until someone manages to call
> vcpu_sleep(),
> v sits in CPU A's runqueue with the VPF_migrating pause flag
> set
>
> So, basically, the race is between context_saved() and
> vcpu_set_affinity(). Basically, vcpu_set_affinity() sets the
> VPF_migrating pause flags on a vcpu in a runqueue, with the intent of
> letting either a vcpu_sleep_nosync() or a reschedule remove it from
> there, but context_saved() manage to see the flag, before the removal
> can happen.
>
> And I think this explains also the original BUG at
> sched_credit.c:1694
> (it's just a bit more involved).
>
> As it can be seen above (and also in the code comment in ) there is a
> barrier (which further testify that this is indeed a tricky passage),
> but I guess it is not that effective! :-/
>
> TBH, I have actually never fully understood what that comment really
> meant, what the barrier was protecting, and how... e.g., isn't it
> missing its paired one? In fact, there's another comment, clearly
> related, right in vcpu_set_affinity(). But again I'm a bit at loss at
> properly figuring out what the big idea is.
>
> George, what do you think? Does this make sense?
>
> Well, I'll think more about this, and to a possible fix, tomorrow
> morning.
>
> Regards,
> Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/Attachment:
xen-sched-debug-vcpumigrate-race.patch Attachment:
signature.asc _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |