Re: [Xen-devel] [xen-unstable test] 145796: tolerable FAIL - PUSHED
Hi Dario,

Apologies for the late answer.

On 22/01/2020 03:40, Dario Faggioli wrote:
> On Fri, 2020-01-10 at 18:24 +0000, Julien Grall wrote:
>> Hi all,
>
> Hi Julien, I was looking at this, and I have a couple of questions...
>
>> On 08/01/2020 23:14, Julien Grall wrote:
>>> On Wed, 8 Jan 2020 at 21:40, osstest service owner
>>> <osstest-admin@xxxxxxxxxxxxxx> wrote:
>>>
>>> ****************************************
>>> Jan  8 15:02:26.943794 (XEN) Panic on CPU 1:
>>> Jan  8 15:02:26.945872 (XEN) Assertion '!unit_on_replq(svc)' failed at sched_rt.c:586
>>> Jan  8 15:02:26.951492 (XEN) ****************************************
>>
>> So I managed to reproduce it on Arm by hacking the hypercall path to call:
>>
>>     domain_pause_nosync(current->domain);
>>     domain_unpause(current->domain);
>>
>> With a debug build and a 2-vCPU dom0, the crash happens within a few
>> seconds.
>>
>> When the unit is not scheduled, rt_unit_wake() expects the unit to be on
>> none of the queues. The interaction is as follows:
>>
>>  CPU0                                 | CPU1
>>                                       |
>>  do_domain_pause()                    |
>>   -> atomic_inc(&d->pause_count)      |
>>   -> vcpu_sleep_nosync(vCPU A)        | schedule()
>>                                       |  -> Lock
>>                                       |  -> rt_schedule()
>>                                       |     -> snext = runq_pick(...)
>>                                       |        /* returns unit A (aka vCPU A) */
>>                                       |     /* Unit is not runnable */
>>                                       |     -> Remove from the q
>>      [....]                           |
>>   -> Lock                             |
>>   -> rt_unit_sleep()                  |
>>      /* Unit not scheduled */         |
>>      /* Nothing to do */              |
>
> Thanks a lot for the analysis. As said above, just a few questions, to be
> sure I'm understanding properly what is happening.
>
> You have a 2-vCPU dom0, and how many other vCPUs from other domains? Or do
> you only have those 2 dom0 vCPUs and you are actually pausing dom0?

Only dom0, with 2 vCPUs, is running. On every hypercall, it will try to
pause/unpause itself. This is to roughly match the behaviour of the Arm guest
atomic helpers.
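For concreteness, here is a minimal sketch of that kind of hack. It is
illustrative only: the helper name and the idea of calling it from the Arm
hypercall dispatch path are assumptions; the thread only says the hypercall
path was hacked to pause and unpause the current domain.

    /*
     * Illustrative sketch, not the actual debugging patch: pause and
     * immediately unpause the calling domain on every hypercall.  Only the
     * two pause/unpause calls come from the thread; the helper name and
     * call site are made up.
     */
    #include <xen/sched.h>

    static void stress_pause_unpause(void)
    {
        struct domain *d = current->domain;

        /*
         * Increment d->pause_count and put the domain's vCPUs to sleep
         * without waiting for them to be descheduled (vcpu_sleep_nosync()
         * on each vCPU)...
         */
        domain_pause_nosync(d);

        /*
         * ...then wake them straight back up, racing with any schedule()
         * already in progress on another pCPU.
         */
        domain_unpause(d);
    }

With a debug build and a 2-vCPU dom0, doing something like this on every
hypercall is, as described above, enough to hit the assertion within a few
seconds.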
> In general, what is running (I mean which vCPU) on CPU0 when the
> domain_pause() happens? And what is running on CPU1 when schedule()
> happens? If you just have the 2 dom0 vCPUs around (and we call them vCPU A
> and vCPU B), the only case in which I can imagine runq_pick() returning A
> on CPU1 would be if CPU0 were running vCPU B (and invoked the hypercall
> from it) and CPU1 were idle... is this the case?

This is indeed the case. The schedule() on CPU1 happened because vCPU A was
woken up (e.g. an interrupt was received and injected into the vCPU).

>> When schedule() grabs the lock first (as shown above), the unit will only
>> be removed from the runQ/depletedQ. However, when vcpu_sleep_nosync()
>> grabs the lock first and the unit was not scheduled, rt_unit_sleep() will
>> remove the unit from both queues (runQ/depletedQ and replenishQ).
>>
>> So I think we want schedule() to remove the unit from the two queues if
>> it is not runnable. Any opinions?
>
> Mmm... that may work, but I'm not sure. In fact, I'm starting to think that
> patch 7c7b407e777 "xen/sched: introduce unit_runnable_state()", which added
> the 'q_remove(snext)' in rt_schedule(), might not be correct.

I have tested Xen before this commit and didn't manage to reproduce the
crash. As soon as the commit was applied, it would crash quite quickly.

> In fact, if runq_pick() returns a vCPU which is in the runqueue but is not
> runnable (e.g., because we're racing with do_domain_pause(), which has
> already set pause_count), it's not rt_schedule()'s job to dequeue it from
> anything. We probably should just ignore it and pick another vCPU, if any
> (and idle otherwise). Then, after we release the lock, it will be
> rt_unit_sleep(), called by do_domain_pause() in this case, that will finish
> the job of properly dequeueing it...
>
> Another strange thing is that, as the code looks right now, runq_pick()
> returns the first unit in the runq (i.e., the one with the earliest
> deadline), without checking whether it is runnable. Then, in rt_schedule(),
> if the unit is not runnable, we (only partially, as you figured out)
> dequeue it, and use idle instead as our candidate for the next scheduled
> unit... But what if there were other *runnable* units in the runqueue?

My knowledge of the scheduler is quite limited. Maybe Meng would be able to
answer this question?
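To make the two directions discussed above concrete, here is a rough,
untested sketch written against the approximate shape of rt_schedule() after
commit 7c7b407e777. The surrounding code is paraphrased and may not match the
tree exactly; it is a sketch of the discussion, not a proposed patch.

    /*
     * Sketch of a fragment from the body of rt_schedule(); helpers such as
     * runq_pick(), q_remove(), replq_remove() and unit_runnable_state()
     * come from xen/common/sched_rt.c and the core scheduling code.
     */
    snext = runq_pick(ops, cpumask_of(sched_cpu));

    if ( snext == NULL )
        snext = rt_unit(sched_idle_unit(sched_cpu));
    else if ( !unit_runnable_state(snext->unit) )
    {
        /*
         * First option above: if we dequeue a non-runnable unit here, do it
         * consistently, i.e. take it off the replenishment queue as well as
         * the runQ/depletedQ, so that a racing rt_unit_wake() does not trip
         * over ASSERT(!unit_on_replq(svc)).
         */
        q_remove(snext);
        replq_remove(ops, snext);

        /*
         * Second option above: do not dequeue here at all (leave that to
         * rt_unit_sleep(), which do_domain_pause() will get to once it takes
         * the lock) and, ideally, keep scanning the runQ for another
         * *runnable* unit instead of falling straight back to idle, as this
         * line does.
         */
        snext = rt_unit(sched_idle_unit(sched_cpu));
    }

Either way, the unit would end up on both queues or on neither; the current
q_remove()-only dequeue leaves it off the runQ/depletedQ but still on the
replenishment queue, which is exactly the state the assertion in
rt_unit_wake() catches.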
Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel