
Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED



On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
> Hi all,
> 
Hi Julien,

I was looking at this, and I have a couple of questions...

> On 08/01/2020 23:14, Julien Grall wrote:
> > On Wed, 8 Jan 2020 at 21:40, osstest service owner
> > <osstest-admin@xxxxxxxxxxxxxx> wrote:
> > ****************************************
> > Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
> > Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
> > Jan  8 15:02:26.951492 (XEN)
> > ****************************************
> > 
> So I managed to reproduce it on Arm by hacking the hypercall path to
> call:
> 
> domain_pause_nosync(current->domain);
> domain_unpause(current->domain);
> 
> With a debug build and a 2-vCPU dom0, the crash happens within a few
> seconds. When the unit is not scheduled, rt_unit_wake() expects the
> unit to be in none of the queues.
> 
> The interaction is as follows:
> 
> CPU0                                  | CPU1
>                                       |
> do_domain_pause()                     |
>   -> atomic_inc(&d->pause_count)      |
>   -> vcpu_sleep_nosync(vCPU A)        |  schedule()
>                                       |       -> Lock
>                                       |       -> rt_schedule()
>                                       |          -> snext = runq_pick(...)
>                                       |          /* return unit A (aka vCPU A) */
>                                       |          /* Unit is not runnable */
>                                       |          -> Remove from the q
>                                       |     [....]
>                                       |       -> Lock
>     -> Lock                           |
>     -> rt_unit_sleep()                |
>      /* Unit not scheduled */         |
>      /* Nothing to do */              |
> 
Thanks a lot for the analysis. As said above, just a few questions, to
be sure I'm understanding properly what is happening.

You have a 2 vCPUs dom0, and how many other vCPUs from other domains?
Or do you only have those 2 dom0 vCPUs and you are actually pausing
dom0?

In general, what is running (I mean which vcpu) on CPU0, when the
domain_pause() happens? And what is running on CPU1 when schedule()
happens?

If you just have the 2 dom0 vCPUs around (call them vCPU A and
vCPU B), the only case in which I can imagine runq_pick() returning A
on CPU1 is if CPU0 were running vCPU B (and had invoked the
hypercall from it) and CPU1 was idle... is this the case?

> When schedule() grabs the lock first (as shown above), the unit will
> only be removed from the Q. However, when vcpu_sleep_nosync() grabs
> the lock first and the unit was not scheduled, rt_unit_sleep() will
> remove the unit from two queues (runQ/depleteQ and replenishQ).
> 
> So I think we want schedule() to remove the unit from the 2 queues if
> it is not runnable. Any opinions?
> 
Mmm... that may work, but I'm not sure.

In fact, I'm starting to think that patch 7c7b407e777 "xen/sched:
introduce unit_runnable_state()", which added the 'q_remove(snext)' in
rt_schedule(), might not be correct.

If runq_pick() returns a vCPU which is in the runqueue, but is not
runnable (e.g., because we're racing with do_domain_pause(), which has
already incremented pause_count), it's not rt_schedule()'s job to
dequeue it from anything.

We probably should just ignore it and pick another vCPU, if any (and
idle otherwise). Then, after we release the lock, it will be
rt_unit_sleep(), called by do_domain_pause() in this case, that
finishes the job of properly dequeueing it...
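
To make the two orderings concrete, here is a toy model in plain C. It
is not Xen code: every toy_* name is made up for illustration, the two
booleans stand in for membership of the runQ/depletedQ and of the
replenishment queue, and "taking the lock first" is modelled simply as
running first. It assumes, as in the interaction described above, that
the sleep path only dequeues a unit it still finds on the run queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an RTDS unit: queue membership and pause state only. */
struct toy_unit {
    bool on_runq;   /* on the runQ/depletedQ */
    bool on_replq;  /* on the replenishment queue */
    bool paused;    /* models pause_count > 0, i.e. not runnable */
};

/* Models the sleep path for a unit that is not currently scheduled:
 * if it is still on the run queue, take it off both queues; otherwise
 * there is "nothing to do". */
static void toy_sleep(struct toy_unit *u)
{
    assert(u->paused);  /* in this model, pause always precedes sleep */
    if (u->on_runq) {
        u->on_runq = false;
        u->on_replq = false;
    }
}

/* Models the scheduler meeting a non-runnable unit at the head of the
 * run queue. With partial_dequeue it drops the unit from the run queue
 * only; without it, the unit is left alone. */
static void toy_schedule(struct toy_unit *u, bool partial_dequeue)
{
    if (u->paused && partial_dequeue)
        u->on_runq = false;
}

/* Models what the wake path expects: the unit is on neither queue. */
static bool toy_wake_precondition_holds(const struct toy_unit *u)
{
    return !u->on_runq && !u->on_replq;
}

/* Runs pause + sleep ("CPU0") against schedule ("CPU1") in the given
 * lock order and reports whether a later wake would find the unit in
 * a consistent state. */
static bool toy_race(bool schedule_locks_first, bool partial_dequeue)
{
    struct toy_unit a = { .on_runq = true, .on_replq = true, .paused = false };

    a.paused = true;  /* the pause count is raised before either lock */
    if (schedule_locks_first) {
        toy_schedule(&a, partial_dequeue);
        toy_sleep(&a);
    } else {
        toy_sleep(&a);
        toy_schedule(&a, partial_dequeue);
    }
    return toy_wake_precondition_holds(&a);
}
```

In this model only toy_race(true, true) returns false: the scheduler
goes first, takes the unit off the run queue only, and the sleep path
then finds nothing to do, so the unit stays on the replenishment queue.
If the scheduler leaves the unit alone, both orderings end with the
unit on neither queue.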

Another strange thing is that, as the code looks right now, runq_pick()
returns the first unit in the runq (i.e., the one with the earliest
deadline), without checking whether it is runnable. Then, in
rt_schedule(), if the unit is not runnable, we (only partially, as you
figured out) dequeue it, and use idle instead, as our candidate for
being the next scheduled unit... But what if there were other
*runnable* units in the runqueue?
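
As a rough sketch of the difference, again in toy C rather than Xen
code: the struct, its fields and both function names below are
hypothetical, the run queue is a singly linked list assumed to be
sorted by deadline, and everything else the real runq_pick() may check
is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy run-queue entry; the list is kept sorted by ascending deadline. */
struct toy_rq_unit {
    int id;
    long deadline;
    bool runnable;
    struct toy_rq_unit *next;
};

/* Models the behaviour described above: look at the head only, and
 * fall back to idle (NULL) if the head is not runnable. */
static struct toy_rq_unit *toy_pick_head_only(struct toy_rq_unit *head)
{
    if (head != NULL && head->runnable)
        return head;
    return NULL;
}

/* Alternative: skip non-runnable units and return the runnable unit
 * with the earliest deadline, or NULL (idle) if there is none. */
static struct toy_rq_unit *toy_pick_first_runnable(struct toy_rq_unit *head)
{
    for (struct toy_rq_unit *u = head; u != NULL; u = u->next) {
        /* The model relies on the list being sorted by deadline. */
        assert(u->next == NULL || u->deadline <= u->next->deadline);
        if (u->runnable)
            return u;
    }
    return NULL;
}
```

With a queue holding a non-runnable unit followed by a runnable one
with a later deadline, the head-only variant ends up with idle, while
the skipping variant returns the second unit and leaves the first one
on the queue for the sleep path to deal with.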

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

