Re: Xen crash in scheduler during cpu hotplug
On 15/05/2024 2:38 pm, Jürgen Groß wrote:
> On 15.05.24 15:22, Andrew Cooper wrote:
>> On 15/05/2024 1:39 pm, Jan Beulich wrote:
>>> On 15.05.2024 13:58, Andrew Cooper wrote:
>>>> Just so it doesn't get lost. In XenRT, we've found:
>>>>
>>>>> (XEN) ----[ Xen-4.19.0-1-d  x86_64  debug=y  Tainted:   H  ]----
>>>>> (XEN) CPU:    45
>>>>> (XEN) RIP:    e008:[<ffff82d040244cbf>] common/sched/credit.c#csched_load_balance+0x41/0x877
>>>>> (XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
>>>>> (XEN) rax: ffff82d040981618   rbx: ffff82d040981618   rcx: 0000000000000000
>>>>> (XEN) rdx: 0000003ff68cd000   rsi: 000000000000002d   rdi: ffff83103723d450
>>>>> (XEN) rbp: ffff83207caa7d48   rsp: ffff83207caa7b98   r8:  0000000000000000
>>>>> (XEN) r9:  ffff831037253cf0   r10: ffff83103767c3f0   r11: 0000000000000009
>>>>> (XEN) r12: ffff831037237990   r13: ffff831037237990   r14: ffff831037253720
>>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 0000000000f526e0
>>>>> (XEN) cr3: 000000005bc2f000   cr2: 0000000000000010
>>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>> (XEN) Xen code around <ffff82d040244cbf> (common/sched/credit.c#csched_load_balance+0x41/0x877):
>>>>> (XEN)  48 8b 0c 10 48 8b 49 08 <48> 8b 79 10 48 89 bd b8 fe ff ff 49 8b 4e 28 48
>>>>> <snip>
>>>>> (XEN) Xen call trace:
>>>>> (XEN)    [<ffff82d040244cbf>] R common/sched/credit.c#csched_load_balance+0x41/0x877
>>> While this is of course pretty little information, I've still tried to
>>> decipher it, first noticing it's credit1 that's being used here. Once
>>> forcing csched_load_balance() non-inline (no idea why it is a separate
>>> function in your build), I can see a sufficiently matching pattern at
>>> approximately the same offset into the function. That's
>>>
>>>      const struct cpupool *c = get_sched_res(cpu)->cpupool;
>>> ...
>>>      const cpumask_t *online = c->res_valid;
>>> ...
>>>      BUG_ON(get_sched_res(cpu) != snext->unit->res);
>>>
>>> overlapping, with the crash being on the middle of the quoted lines.
>>> IOW the CPU pool is still NULL for this sched resource. Cc-ing
>>> Jürgen for possible clues ...
>>
>> We've seen it in 4.13, 4.17 and upstream, after Roger extended the
>> existing CPU hotplug testing to try and reproduce the MTRR watchdog
>> failure. We've found yet another "no irq for handler" from this too.
>>
>> It's always a dereference at NULL+0x10, somewhere within
>> csched_schedule().
>
> I think I've found the reason.
>
> In schedule_cpu_add() the cpupool and granularity are set only after
> releasing the scheduling lock. I think those must be inside the locked
> region.
>
> Can you give this one a try (not tested at all)?
>
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 0cb33831d2..babac7aad6 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -3176,6 +3176,8 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
>
>      sr->scheduler = new_ops;
>      sr->sched_priv = ppriv;
> +    sr->granularity = cpupool_get_granularity(c);
> +    sr->cpupool = c;
>
>      /*
>       * Reroute the lock to the per pCPU lock as /last/ thing. In fact,
> @@ -3188,8 +3190,6 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
>      /* _Not_ pcpu_schedule_unlock(): schedule_lock has changed! */
>      spin_unlock_irqrestore(old_lock, flags);
>
> -    sr->granularity = cpupool_get_granularity(c);
> -    sr->cpupool = c;
>      /* The cpu is added to a pool, trigger it to go pick up some work */
>      cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);

This change seems to be standing up to the test, in a way that the
previous form very much didn't.

Thanks for the quick fix.

~Andrew
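The window Jürgen describes is between dropping the scheduling lock and
assigning sr->cpupool. Below is a minimal, self-contained sketch of that
publication-ordering hazard. It is not Xen code: the types, field names,
pthread locking and thread structure are simplified, hypothetical stand-ins.
It only illustrates why a CPU that starts scheduling in that window can read
a NULL pool and, in the credit scheduler (which assumes the pool is set),
fault on the pool's res_valid field at NULL+0x10.

```c
/*
 * Minimal sketch (not Xen code; simplified, hypothetical types) of the
 * ordering hazard discussed above.  A "hotplug" path publishes a sched
 * resource's cpupool; a "scheduler" path reads it.  Filling in cpupool
 * only after the lock is dropped leaves a window in which the scheduler
 * observes NULL.
 */
#include <pthread.h>
#include <stdio.h>

struct cpupool {
    unsigned long pad[2];
    unsigned long *res_valid;   /* stand-in for the field at offset 0x10 */
};

struct sched_resource {
    pthread_mutex_t lock;
    struct cpupool *cpupool;    /* NULL until the CPU joins a pool */
    unsigned int granularity;
};

static unsigned long online_mask = ~0UL;
static struct cpupool pool;
static struct sched_resource sr = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Consumer: roughly what the load-balancing path does with the pool. */
static void *scheduler_thread(void *arg)
{
    (void)arg;
    for ( int i = 0; i < 1000000; i++ )
    {
        pthread_mutex_lock(&sr.lock);
        struct cpupool *c = sr.cpupool;
        if ( c )                    /* Xen has no such check here ...       */
            (void)c->res_valid;     /* ... so a NULL pool faults at +0x10   */
        pthread_mutex_unlock(&sr.lock);
    }
    return NULL;
}

/* Producer: a schedule_cpu_add()-like path, with the fixed ordering. */
static void add_cpu(void)
{
    pthread_mutex_lock(&sr.lock);
    sr.granularity = 1;
    sr.cpupool = &pool;             /* published while still locked */
    pthread_mutex_unlock(&sr.lock);

    /*
     * The buggy ordering the patch removes would unlock first and only
     * then assign sr.cpupool, so anything that starts scheduling in that
     * window reads a NULL pool.
     */
}

int main(void)
{
    pthread_t t;

    pool.res_valid = &online_mask;
    pthread_create(&t, NULL, scheduler_thread, NULL);
    add_cpu();
    pthread_join(t, NULL);
    puts("done");
    return 0;
}
```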