[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [RFC PATCH] xen/sched/null: avoid crash after failed domU creation
On 4/25/23 03:42, Jan Beulich wrote: > On 25.04.2023 08:36, Juergen Gross wrote: >> On 24.04.23 23:00, Stewart Hildebrand wrote: >>> When creating a domU, but the creation fails, we may end up in a state >>> where a vcpu has not yet been added to the null scheduler, but >>> unit_deassign() is invoked. >> >> This is not really true. The vcpu has been added, but it was offline at >> that time. This resulted in null_unit_insert() returning early and not >> calling unit_assign(). >> >> Later the vcpu was onlined during XEN_DOMCTL_setvcpucontext handling, >> resulting in null_unit_remove() calling unit_deassign(). Makes sense. I'll reword the message in the next revision. >>> In this case, when running a debug build of >>> Xen, we will hit an ASSERT and crash Xen: >>> >>> (XEN) **************************************** >>> (XEN) Panic on CPU 0: >>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379 >>> (XEN) **************************************** >>> >>> To work around this, remove the ASSERT and introduce a check for the >>> case where npc->unit is NULL and simply return false from >>> unit_deassign(). >> >> I think the correct fix would be to call unit_deassign() from >> null_unit_remove() only, if npc->unit isn't NULL. Dario might have a >> different opinion, though. :-) Yes, this seems cleaner to me, thanks for the suggestion. I did a quick test, and this approach works to avoid the crash too. I'll wait a few days in case anyone else wants to chime in, and if there aren't any more comments I'll send out a new patch following this suggestion. > Furthermore, even if the proposed solution was (roughly) followed, ... > >>> --- a/xen/common/sched/null.c >>> +++ b/xen/common/sched/null.c >>> @@ -376,7 +376,14 @@ static bool unit_deassign(struct null_private *prv, >>> const struct sched_unit *uni >>> struct null_pcpu *npc = get_sched_res(cpu)->sched_priv; >>> >>> ASSERT(list_empty(&null_unit(unit)->waitq_elem)); >>> - ASSERT(npc->unit == unit); >>> + >>> + if ( !npc->unit ) >>> + { >>> + dprintk(XENLOG_G_INFO, "%d <-- NULL (%pdv%d)\n", cpu, unit->domain, >>> + unit->unit_id); >>> + return false; >>> + } >>> + > > ... shouldn't the assertion be kept, with the new if() inserted ahead of > it? Plus the log message probably better wouldn't print a unit ID like a > vCPU one, but instead use e.g. %pdu%u? Sure, although, with Juergen's suggested fix in null_unit_remove(), I think we could simply drop this snippet and leave unit_deassign() unmodified. Your suggested print format is an improvement, but perhaps it would be better suited for a separate patch since there are several more instances throughout null.c that would also want to be changed. Example with %pdv%d: # xl create ... (XEN) common/sched/null.c:355: 3 <-- d1v0 # xl destroy ... (XEN) common/sched/null.c:385: 3 <-- NULL (d1v0) Example with %pdu%u: # xl create ... (XEN) common/sched/null.c:355: 3 <-- d1u0 # xl destroy ... (XEN) common/sched/null.c:385: 3 <-- NULL (d1u0)
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |