[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 04/16] xen/cpu: prevent disable_nonboot_cpus crash on ARM64


  • To: Julien Grall <julien@xxxxxxx>, Mykola Kvach <xakep.amatop@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Thu, 13 Mar 2025 16:42:58 +0100
  • Autocrypt: addr=jbeulich@xxxxxxxx; keydata= xsDiBFk3nEQRBADAEaSw6zC/EJkiwGPXbWtPxl2xCdSoeepS07jW8UgcHNurfHvUzogEq5xk hu507c3BarVjyWCJOylMNR98Yd8VqD9UfmX0Hb8/BrA+Hl6/DB/eqGptrf4BSRwcZQM32aZK 7Pj2XbGWIUrZrd70x1eAP9QE3P79Y2oLrsCgbZJfEwCgvz9JjGmQqQkRiTVzlZVCJYcyGGsD /0tbFCzD2h20ahe8rC1gbb3K3qk+LpBtvjBu1RY9drYk0NymiGbJWZgab6t1jM7sk2vuf0Py O9Hf9XBmK0uE9IgMaiCpc32XV9oASz6UJebwkX+zF2jG5I1BfnO9g7KlotcA/v5ClMjgo6Gl MDY4HxoSRu3i1cqqSDtVlt+AOVBJBACrZcnHAUSuCXBPy0jOlBhxPqRWv6ND4c9PH1xjQ3NP nxJuMBS8rnNg22uyfAgmBKNLpLgAGVRMZGaGoJObGf72s6TeIqKJo/LtggAS9qAUiuKVnygo 3wjfkS9A3DRO+SpU7JqWdsveeIQyeyEJ/8PTowmSQLakF+3fote9ybzd880fSmFuIEJldWxp Y2ggPGpiZXVsaWNoQHN1c2UuY29tPsJgBBMRAgAgBQJZN5xEAhsDBgsJCAcDAgQVAggDBBYC AwECHgECF4AACgkQoDSui/t3IH4J+wCfQ5jHdEjCRHj23O/5ttg9r9OIruwAn3103WUITZee e7Sbg12UgcQ5lv7SzsFNBFk3nEQQCACCuTjCjFOUdi5Nm244F+78kLghRcin/awv+IrTcIWF hUpSs1Y91iQQ7KItirz5uwCPlwejSJDQJLIS+QtJHaXDXeV6NI0Uef1hP20+y8qydDiVkv6l IreXjTb7DvksRgJNvCkWtYnlS3mYvQ9NzS9PhyALWbXnH6sIJd2O9lKS1Mrfq+y0IXCP10eS FFGg+Av3IQeFatkJAyju0PPthyTqxSI4lZYuJVPknzgaeuJv/2NccrPvmeDg6Coe7ZIeQ8Yj t0ARxu2xytAkkLCel1Lz1WLmwLstV30g80nkgZf/wr+/BXJW/oIvRlonUkxv+IbBM3dX2OV8 AmRv1ySWPTP7AAMFB/9PQK/VtlNUJvg8GXj9ootzrteGfVZVVT4XBJkfwBcpC/XcPzldjv+3 HYudvpdNK3lLujXeA5fLOH+Z/G9WBc5pFVSMocI71I8bT8lIAzreg0WvkWg5V2WZsUMlnDL9 mpwIGFhlbM3gfDMs7MPMu8YQRFVdUvtSpaAs8OFfGQ0ia3LGZcjA6Ik2+xcqscEJzNH+qh8V m5jjp28yZgaqTaRbg3M/+MTbMpicpZuqF4rnB0AQD12/3BNWDR6bmh+EkYSMcEIpQmBM51qM EKYTQGybRCjpnKHGOxG0rfFY1085mBDZCH5Kx0cl0HVJuQKC+dV2ZY5AqjcKwAxpE75MLFkr wkkEGBECAAkFAlk3nEQCGwwACgkQoDSui/t3IH7nnwCfcJWUDUFKdCsBH/E5d+0ZnMQi+G0A nAuWpQkjM1ASeQwSHEeAWPgskBQL
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Mykola Kvach <mykola_kvach@xxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 13 Mar 2025 15:43:08 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 11.03.2025 21:47, Julien Grall wrote:
> Hi Mykola,
> 
> On 05/03/2025 09:11, Mykola Kvach wrote:
>> If we call disable_nonboot_cpus on ARM64 with system_state set
>> to SYS_STATE_suspend, the following assertion will be triggered:
>>
>> ```
>> (XEN) [   25.582712] Disabling non-boot CPUs ...
>> (XEN) [   25.587032] Assertion '!in_irq() && (local_irq_is_enabled() || 
>> num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
>> [...]
>> (XEN) [   25.975069] Xen call trace:
>> (XEN) [   25.978353]    [<00000a000022e098>] xfree+0x130/0x1a4 (PC)
>> (XEN) [   25.984314]    [<00000a000022e08c>] xfree+0x124/0x1a4 (LR)
>> (XEN) [   25.990276]    [<00000a00002747d4>] release_irq+0xe4/0xe8
>> (XEN) [   25.996152]    [<00000a0000278588>] 
>> time.c#cpu_time_callback+0x44/0x60
>> (XEN) [   26.003150]    [<00000a000021d678>] notifier_call_chain+0x7c/0xa0
>> (XEN) [   26.009717]    [<00000a00002018e0>] 
>> cpu.c#cpu_notifier_call_chain+0x24/0x48
>> (XEN) [   26.017148]    [<00000a000020192c>] cpu.c#_take_cpu_down+0x28/0x34
>> (XEN) [   26.023801]    [<00000a0000201944>] cpu.c#take_cpu_down+0xc/0x18
>> (XEN) [   26.030281]    [<00000a0000225c5c>] 
>> stop_machine.c#stopmachine_action+0xbc/0xe4
>> (XEN) [   26.038057]    [<00000a00002264bc>] 
>> tasklet.c#do_tasklet_work+0xb8/0x100
>> (XEN) [   26.045229]    [<00000a00002268a4>] do_tasklet+0x68/0xb0
>> (XEN) [   26.051018]    [<00000a000026e120>] domain.c#idle_loop+0x7c/0x194
>> (XEN) [   26.057585]    [<00000a0000277e30>] start_secondary+0x21c/0x220
>> (XEN) [   26.063978]    [<00000a0000361258>] 00000a0000361258
>> ```
>>
>> This happens because before invoking take_cpu_down via the stop_machine_run
>> function on the target CPU, stop_machine_run requests
>> the STOPMACHINE_DISABLE_IRQ state on that CPU. Releasing memory in
>> the release_irq function then triggers the assertion:
>>
>> /*
>>   * Heap allocations may need TLB flushes which may require IRQs to be
>>   * enabled (except when only 1 PCPU is online).
>>   */
>> #define ASSERT_ALLOC_CONTEXT()
>>
>> This patch introduces a new tasklet to perform the CPU_DYING call chain for
>> a particular CPU. However, we cannot call take_cpu_down from the tasklet
>> because the __cpu_disable function disables local IRQs, causing the system
>> to crash inside spin_lock_irq, which is called after the tasklet function
>> invocation inside do_tasklet_work:
>>
>> void _spin_lock_irq(spinlock_t *lock)
>> {
>>      ASSERT(local_irq_is_enabled());
>>
>> To resolve this, take_cpu_down is split into two parts. The first part 
>> triggers
>> the CPU_DYING call chain, while the second part, __cpu_disable, is invoked 
>> from
>> stop_machine_run.
> 
> Rather than modifying common code, have you considered allocating from 
> the IRQ action from the percpu area? This would also reduce the number 
> of possible failure when bringup up a pCPU.

I'd go further and question whether release_irq() really wants calling when
suspending. At least on x86, a requirement is that upon resume the same
number and kinds of CPUs will come back up. Hence the system will look the
same, including all the interrupts that are in use. Plus resume will be
faster if things are left set up during suspend.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.