Re: cpupool / credit2 misuse of xfree() (was: Re: [BUG] Xen causes a host hang by using xen-hptool cpu-offline)
On 01.08.2022 09:38, Juergen Gross wrote:
> On 27.07.22 08:32, Jan Beulich wrote:
>> On 27.07.2022 03:19, Gao, Ruifeng wrote:
>>> Problem Description:
>>> Trying to execute "/usr/local/sbin/xen-hptool cpu-offline <cpuid>", the
>>> host will hang immediately.
>>>
>>> Version-Release and System Details:
>>> Platform: Ice Lake Server
>>> Host OS: Red Hat Enterprise Linux 8.3 (Ootpa)
>>> Kernel: 5.19.0-rc6
>>> HW: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
>>> Xen Version: 4.17-unstable(ab2977b027-dirty)
>>>
>>> Reproduce Steps:
>>> 1. Boot from Xen and check the information:
>>> [root@icx-2s1 ~]# xl info
>>> host                  : icx-2s1
>>> release               : 5.19.0-rc6
>>> xen_version           : 4.17-unstable
>>> xen_caps              : xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
>>> platform_params       : virt_start=0xffff800000000000
>>> xen_changeset         : Thu Jul 14 19:45:36 2022 +0100 git:ab2977b027-dirty
>>> 2. Execute the cpu-offline command, here cpuid is 48 as an example:
>>> [root@icx-2s1 ~]# /usr/local/sbin/xen-hptool cpu-offline 48
>>>
>>> Actual Results:
>>> The host will hang immediately.
>>
>> Well, it crashes (which is an important difference). Also you've hidden
>> the important details (allowing to easily identify what area the issue
>> is in) quite well in the attachment.
>>
>> Jürgen (and possibly George / Dario),
>>
>> this
>>
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d04023be76>] R xfree+0x150/0x1f7
>> (XEN)    [<ffff82d040248795>] F common/sched/credit2.c#csched2_free_udata+0xc/0xe
>> (XEN)    [<ffff82d040259169>] F schedule_cpu_rm+0x38d/0x4b3
>> (XEN)    [<ffff82d0402430ca>] F common/sched/cpupool.c#cpupool_unassign_cpu_finish+0x17e/0x22c
>> (XEN)    [<ffff82d04021d402>] F common/sched/cpupool.c#cpu_callback+0x3fb/0x4dc
>> (XEN)    [<ffff82d040229fc3>] F notifier_call_chain+0x6b/0x96
>> (XEN)    [<ffff82d040204df7>] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x33
>> (XEN)    [<ffff82d040204e33>] F common/cpu.c#_take_cpu_down+0x24/0x2b
>> (XEN)    [<ffff82d040204e43>] F common/cpu.c#take_cpu_down+0x9/0x10
>> (XEN)    [<ffff82d040231517>] F common/stop_machine.c#stopmachine_action+0x86/0x96
>> (XEN)    [<ffff82d040231cc5>] F common/tasklet.c#do_tasklet_work+0x72/0xa5
>> (XEN)    [<ffff82d040231f42>] F do_tasklet+0x58/0x8a
>> (XEN)    [<ffff82d040320b60>] F arch/x86/domain.c#idle_loop+0x8d/0xee
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 48:
>> (XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:704
>> (XEN) ****************************************
>>
>> is pointing at the problem quite clearly. Conceptually I think it
>> has always been wrong to call xfree() from stop-machine context. It
>> just so happened that we got away with that so far, because the CPU
>> being brought down was the only one using respective functions (and
>> hence there was no other risk of locking issues).
>>
>> Question is whether we want to continue building upon this (and
>> hence the involved assertion would need to "learn" to ignore
>> stop-machine context) or whether instead the freeing of the memory
>> here can be deferred, e.g. to be taken care of by the CPU driving
>> the offlining process.
>
> This is even more complicated.
>
> I think ASSERT_ALLOC_CONTEXT() will trigger more often, especially
> with core scheduling enabled. In fact I think this is the reason why
> I've seen very rare strange failures with core scheduling when trying
> cpu hotplug operations, as there are even xmalloc() calls in stop
> machine context.
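[Editor's note: the assertion quoted in the panic is Xen's allocation-context
check (the ASSERT_ALLOC_CONTEXT() Juergen names above). A minimal sketch of the
check, reconstructed from the assertion text in the trace rather than copied
from the Xen tree:

    /*
     * Reconstructed from the assertion text above: xmalloc()/xfree()
     * require a context where the CPU is not in an IRQ handler, and
     * where either IRQs are enabled or at most one CPU is online
     * (e.g. early boot).  In stop-machine context every online CPU
     * spins with IRQs disabled, so on a multi-CPU host the condition
     * below is false and the ASSERT fires.
     */
    #define ASSERT_ALLOC_CONTEXT() \
        ASSERT(!in_irq() && \
               (local_irq_is_enabled() || num_online_cpus() <= 1))

This is why the trace panics inside xfree(): it is reached via take_cpu_down()
-> stopmachine_action(), i.e. with interrupts off on all CPUs.]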
>
> I'm seeing the following possibilities:
>
> 1) Pre-allocating the needed data and deferring freeing of no longer
>    needed data when taking a cpu down. Apart from some refactoring
>    in common/sched/cpupool.c and common/sched/core.c this should be
>    doable.
>
> 2) In case stop_machine() is called for action on only one cpu allow
>    memory allocations and freeing with interrupts off and flush the
>    TLBs locally when enabling interrupts again. This would require
>    rather limited changes, but wouldn't be as clean as the other
>    approach.
>
> Any preferences? I'd be fine with both variants and could write the
> patches.

I'd prefer 1 over 2, but in the unlikely event that 1 ends up unwieldy
I could live with an extensively commented form of 2.

Jan
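[Editor's note: a minimal sketch of option 1 above (deferred freeing), under
the assumption that the stop-machine callback can merely detach the scheduler
data and leave the xfree() to the CPU driving the offline. All identifiers
below (cpu_rm_deferred, detach_sched_priv(), etc.) are hypothetical
illustrations, not the names used in any patch that resulted from this thread:

    /*
     * Hypothetical sketch of option 1.  In stop-machine context (IRQs
     * off on all CPUs) we only unlink the per-CPU scheduler data; the
     * actual xfree() happens later, on the CPU driving the offline,
     * where IRQs are enabled and ASSERT_ALLOC_CONTEXT() holds.
     */
    struct cpu_rm_deferred {
        void *sched_priv;    /* per-CPU scheduler private data */
        void *sched_udata;   /* e.g. what csched2_free_udata() frees */
    };

    /* Called from the stop-machine callback: detach only, never free. */
    static void schedule_cpu_rm_detach(struct cpu_rm_deferred *d,
                                       unsigned int cpu)
    {
        d->sched_priv  = detach_sched_priv(cpu);   /* hypothetical */
        d->sched_udata = detach_sched_udata(cpu);  /* hypothetical */
    }

    /* Called afterwards on the CPU driving the offline, IRQs enabled. */
    static void schedule_cpu_rm_free(struct cpu_rm_deferred *d)
    {
        xfree(d->sched_priv);
        xfree(d->sched_udata);
    }

The point is simply that the free then runs in a legal allocation context, at
the cost of plumbing the deferred pointers out of the stop-machine callback.]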