[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 04/16] xen/cpu: prevent disable_nonboot_cpus crash on ARM64
On Thu, Mar 13, 2025 at 5:43 PM Jan Beulich <jbeulich@xxxxxxxx> wrote: > > On 11.03.2025 21:47, Julien Grall wrote: > > Hi Mykola, > > > > On 05/03/2025 09:11, Mykola Kvach wrote: > >> If we call disable_nonboot_cpus on ARM64 with system_state set > >> to SYS_STATE_suspend, the following assertion will be triggered: > >> > >> ``` > >> (XEN) [ 25.582712] Disabling non-boot CPUs ... > >> (XEN) [ 25.587032] Assertion '!in_irq() && (local_irq_is_enabled() || > >> num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714 > >> [...] > >> (XEN) [ 25.975069] Xen call trace: > >> (XEN) [ 25.978353] [<00000a000022e098>] xfree+0x130/0x1a4 (PC) > >> (XEN) [ 25.984314] [<00000a000022e08c>] xfree+0x124/0x1a4 (LR) > >> (XEN) [ 25.990276] [<00000a00002747d4>] release_irq+0xe4/0xe8 > >> (XEN) [ 25.996152] [<00000a0000278588>] > >> time.c#cpu_time_callback+0x44/0x60 > >> (XEN) [ 26.003150] [<00000a000021d678>] notifier_call_chain+0x7c/0xa0 > >> (XEN) [ 26.009717] [<00000a00002018e0>] > >> cpu.c#cpu_notifier_call_chain+0x24/0x48 > >> (XEN) [ 26.017148] [<00000a000020192c>] cpu.c#_take_cpu_down+0x28/0x34 > >> (XEN) [ 26.023801] [<00000a0000201944>] cpu.c#take_cpu_down+0xc/0x18 > >> (XEN) [ 26.030281] [<00000a0000225c5c>] > >> stop_machine.c#stopmachine_action+0xbc/0xe4 > >> (XEN) [ 26.038057] [<00000a00002264bc>] > >> tasklet.c#do_tasklet_work+0xb8/0x100 > >> (XEN) [ 26.045229] [<00000a00002268a4>] do_tasklet+0x68/0xb0 > >> (XEN) [ 26.051018] [<00000a000026e120>] domain.c#idle_loop+0x7c/0x194 > >> (XEN) [ 26.057585] [<00000a0000277e30>] start_secondary+0x21c/0x220 > >> (XEN) [ 26.063978] [<00000a0000361258>] 00000a0000361258 > >> ``` > >> > >> This happens because before invoking take_cpu_down via the stop_machine_run > >> function on the target CPU, stop_machine_run requests > >> the STOPMACHINE_DISABLE_IRQ state on that CPU. Releasing memory in > >> the release_irq function then triggers the assertion: > >> > >> /* > >> * Heap allocations may need TLB flushes which may require IRQs to be > >> * enabled (except when only 1 PCPU is online). > >> */ > >> #define ASSERT_ALLOC_CONTEXT() > >> > >> This patch introduces a new tasklet to perform the CPU_DYING call chain for > >> a particular CPU. However, we cannot call take_cpu_down from the tasklet > >> because the __cpu_disable function disables local IRQs, causing the system > >> to crash inside spin_lock_irq, which is called after the tasklet function > >> invocation inside do_tasklet_work: > >> > >> void _spin_lock_irq(spinlock_t *lock) > >> { > >> ASSERT(local_irq_is_enabled()); > >> > >> To resolve this, take_cpu_down is split into two parts. The first part > >> triggers > >> the CPU_DYING call chain, while the second part, __cpu_disable, is invoked > >> from > >> stop_machine_run. > > > > Rather than modifying common code, have you considered allocating from > > the IRQ action from the percpu area? This would also reduce the number > > of possible failure when bringup up a pCPU. > > I'd go further and question whether release_irq() really wants calling when > suspending. At least on x86, a requirement is that upon resume the same > number and kinds of CPUs will come back up. Hence the system will look the > same, including all the interrupts that are in use. Plus resume will be > faster if things are left set up during suspend. I tried that approach and encountered another issue. - in the case of the hardware domain, it triggered a domain watchdog; - in the case of domU, it caused a crash inside the Linux kernel due to an RCU stall. Both scenarios suggest that something is wrong with IRQ delivery to the guest. It might be necessary to revisit the entire logic related to GIC resume/suspend instead. > > Jan Best regards, Mykola
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |