Re: kernel BUG around vmap/vfree - xen_enter_lazy_mmu()/xen_leave_lazy_mmu() - Linux 7.0-rc1
- To: Jürgen Groß <jgross@xxxxxxxx>, Kevin Brodsky <kevin.brodsky@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
- From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
- Date: Fri, 8 May 2026 11:28:22 +0100
- Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
- Delivery-date: Fri, 08 May 2026 10:28:39 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On 08/05/2026 11:09 am, Jürgen Groß wrote:
> On 08.05.26 11:54, Kevin Brodsky wrote:
>> On 08/05/2026 10:53, Juergen Gross wrote:
>>> [...]
>>>
>>> But now I think I have found the real culprit in
>>> lazy_mmu_mode_enable():
>>>
>>> static inline void lazy_mmu_mode_enable(void)
>>> {
>>>     struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>>
>>>     if (in_interrupt() || state->pause_count > 0)
>>>         return;
>>>
>>>     VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
>>>
>>>     if (state->enable_count++ == 0)
>>>         arch_enter_lazy_mmu_mode();
>>> }
>>>
>>> Consider a preemption just before calling arch_enter_lazy_mmu_mode().
>>> The enable_count will be 1 now, but there was no switch to lazy mode yet.
>>>
>>> When the task becomes active again, context switch handling will see
>>> lazy mode enabled (enable_count > 0), so it will call
>>> arch_enter_lazy_mmu_mode(). And then the task resumes and calls
>>> arch_enter_lazy_mmu_mode() another time.
>>
>> Agreed, this must be the problem. I did wonder whether the lack of
>> atomicity would cause trouble...
>>
>> arm64 isn't impacted because it tracks related state in task_struct
>> only. powerpc and sparc do use percpu variables but that shouldn't
>> matter as they disable preemption in the entire lazy MMU section.
>>
>>>
>>> The only chance I'm seeing to avoid that would be to disable
>>> preemption around all instances of testing a condition and then
>>> enabling or disabling lazy mmu mode.
>>
>> I don't immediately see why we would need such a big hammer. If we
>> revert commit 291b3abed657 ("x86/xen: use lazy_mmu_state when
>> context-switching"), then arch_{start,end}_context_switch() should once
>> again do the right thing for Xen since the TIF_LAZY_MMU_UPDATES flag is
>> separate from lazy_mmu_state. I think it looks like this:
>>
>> lazy_mmu_mode_enable()
>>   state->enable_count++
>> <PREEMPT>
>>   arch_start_context_switch()
>>     xen_lazy_mode == XEN_LAZY_NONE -> do nothing
>>   <other task runs; this task is scheduled again>
>>
>>   arch_end_context_switch()
>>     TIF_LAZY_MMU_UPDATES not set -> do nothing
>>
>> <exception return>
>> enter_lazy(XEN_LAZY_MMU)
>>
>> Nothing else should be checking lazy MMU state during the context
>> switch.
>>
>> Does that make sense?
>
> This would work, yes.
>
> OTOH I don't like the multiple conditions used for testing
> (state->enable_count, TIF_LAZY_MMU_UPDATES, xen_lazy_mode).
>
> Another variant would be to just let the Xen specific code tolerate the
> double calls by disabling preemption in the Xen code and checking via
> __task_lazy_mmu_mode_active() whether anything needs to be done.
>
> I'd really like to get rid of xen_lazy_mode completely.
Without wishing to interrupt the flow too much.
In XenServer, work on migration performance[1] has demonstrated that a
very large number of multicalls issued by Linux are single-op multicalls.
(I blindly assert) these must be coming from the lazy_mode logic, and
they're even less efficient than making the hypercall normally, owing to
the need to marshal it through the multicall ABI.
There's a possibility that you can simply delete lazy mode and stuff
gets faster. (Although it's far more likely that the difference is in
the noise).
~Andrew
[1] The dominating perf problem for migration is ptwr emulation and
Linux not using a hypercall, which IIRC accounts for 40% of wallclock
time during live migration.