
Re: kernel BUG around vmap/vfree - xen_enter_lazy_mmu()/xen_leave_lazy_mmu() - Linux 7.0-rc1


  • To: Kevin Brodsky <kevin.brodsky@xxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Juergen Gross <jgross@xxxxxxxx>
  • Date: Fri, 8 May 2026 10:53:01 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
  • Delivery-date: Fri, 08 May 2026 08:53:14 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 07.05.26 18:31, Jürgen Groß wrote:
On 07.04.26 11:23, Kevin Brodsky wrote:
On 05/04/2026 11:41, Marek Marczykowski-Górecki wrote:
On Thu, Feb 26, 2026 at 02:41:12PM +0100, Jürgen Groß wrote:
On 26.02.26 14:27, Andrew Cooper wrote:
On 26/02/2026 1:17 pm, Marek Marczykowski-Górecki wrote:
Hi,

When testing Linux 7.0-rc1 in PV dom0, I hit the following panic
sometimes:

[  436.849614] ------------[ cut here ]------------
[  436.849669] kernel BUG at arch/x86/include/asm/xen/hypervisor.h:78!
[  436.849693] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  436.849710] CPU: 3 UID: 0 PID: 4021 Comm: kworker/u25:1 Not tainted 7.0.0-0.rc1.1.qubes.1001.fc41.x86_64 #1 PREEMPT(full)
[  436.849729] Hardware name: Star Labs StarBook/StarBook, BIOS 8.97 10/03/2023
[  436.849743] Workqueue: i915_flip intel_atomic_commit_work [i915]
[  436.850226] RIP: e030:xen_enter_lazy_mmu+0x24/0x30
[  436.850245] Code: 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 8b 05 b8 e5 02 03 85 c0 75 10 65 c7 05 a9 e5 02 03 01 00 00 00 c3 cc cc cc cc <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[  436.850270] RSP: e02b:ffffc90045727a68 EFLAGS: 00010202
[  436.850283] RAX: 0000000000000001 RBX: ffff8881042fa6d0 RCX: 000fffffffe00000
[  436.850296] RDX: 0000000000000001 RSI: ffff88810a5a2980 RDI: 0000000000000000
[  436.850308] RBP: ffffc90049eda000 R08: ffffc90049edc000 R09: ffffc90049edc000
[  436.850320] R10: ffffc90049edc000 R11: ffffc90049edbfff R12: ffffc90049edc000
[  436.850332] R13: ffffc90045727bb0 R14: ffffc90045727b28 R15: 800000000000006b
[  436.850356] FS:  0000000000000000(0000) GS:ffff888201e6e000(0000) knlGS:0000000000000000
[  436.850371] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[  436.850383] CR2: 00006543dbade250 CR3: 0000000115ef1000 CR4: 0000000000050660
[  436.850401] Call Trace:
[  436.850410]  <TASK>
[  436.850420]  vmap_pages_pud_range+0x47c/0x530
[  436.850439]  vmap_small_pages_range_noflush+0x1f1/0x2b0
[  436.850451]  ? __get_vm_area_node+0x10a/0x170
[  436.850465]  vmap+0x79/0xd0
[  436.850476]  i915_gem_object_map_page+0x13b/0x210 [i915]
[  436.850812]  i915_gem_object_pin_map+0x1e2/0x210 [i915]
[  436.851123]  i915_gem_object_pin_map_unlocked+0x2d/0xa0 [i915]
[  436.851424]  intel_dsb_buffer_create+0xed/0x1a0 [i915]
[  436.851778]  intel_dsb_prepare+0xca/0x1a0 [i915]
[  436.852110]  intel_atomic_dsb_finish+0x92/0x350 [i915]
[  436.852456]  intel_atomic_commit_tail+0x326/0xd40 [i915]
[  436.852769]  process_one_work+0x18d/0x380
[  436.852779]  worker_thread+0x196/0x300
[  436.852787]  ? __pfx_worker_thread+0x10/0x10
[  436.852796]  kthread+0xe3/0x120
[  436.852805]  ? __pfx_kthread+0x10/0x10
[  436.852815]  ret_from_fork+0x19e/0x260
[  436.852824]  ? __pfx_kthread+0x10/0x10
[  436.852832]  ret_from_fork_asm+0x1a/0x30
[  436.852842]  </TASK>
[  436.852847] Modules linked in: snd_seq_dummy snd_hrtimer snd_hda_codec_intelhdmi snd_hda_codec_hdmi snd_hda_codec_alc269 snd_hda_codec_realtek_lib snd_hda_scodec_component snd_hda_codec_generic snd_hda_intel snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel snd_sof_intel_hda_sdw_bpt snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_sdw_utils snd_soc_acpi crc8 intel_rapl_msr soundwire_bus intel_rapl_common snd_soc_sdca snd_soc_avs snd_soc_hda_codec snd_hda_ext_core snd_hda_codec vfat intel_uncore_frequency_common fat snd_hda_core snd_intel_dspcfg snd_intel_sdw_acpi snd_hwdep intel_powerclamp snd_soc_core iwlwifi snd_compress spi_nor iTCO_wdt ac97_bus intel_pmc_bxt ee1004 mtd snd_pcm_dmaengine snd_seq cfg80211 snd_seq_device pcspkr spi_intel_pci snd_pcm rfkill spi_intel snd_timer snd
[  436.852939]  i2c_i801 soundcore i2c_smbus idma64 intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_hid intel_pmc_ssram_telemetry intel_scu_pltdrv sparse_keymap joydev loop fuse xenfs nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock zram vmw_vmci lz4hc_compress lz4_compress dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt xe drm_ttm_helper drm_suballoc_helper gpu_sched drm_gpuvm drm_exec drm_gpusvm_helper i915 i2c_algo_bit drm_buddy hid_multitouch i2c_hid_acpi ghash_clmulni_intel video nvme wmi ttm i2c_hid nvme_core nvme_keyring drm_display_helper nvme_auth xhci_pci pinctrl_tigerlake thunderbolt hkdf cec xhci_hcd intel_vsec serio_raw xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput i2c_dev
[  436.853183] ---[ end trace 0000000000000000 ]---

or this:

[  548.736884] ------------[ cut here ]------------
[  548.736907] kernel BUG at arch/x86/include/asm/xen/hypervisor.h:85!
[  548.736923] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[  548.736935] CPU: 0 UID: 0 PID: 206 Comm: kworker/0:2 Not tainted 7.0.0-0.rc1.1.qubes.1001.fc41.x86_64 #1 PREEMPT(full)
[  548.736949] Hardware name: LENOVO 2347A45/2347A45, BIOS CBET4000 Nitrokey-v0.2.0-2608-ga649597 01/01/1970
[  548.736962] Workqueue: events delayed_vfree_work
[  548.736976] RIP: e030:xen_leave_lazy_mmu+0x44/0x50
[  548.736989] Code: 02 03 83 f8 01 75 23 65 c7 05 6c e4 02 03 00 00 00 00 65 ff 0d 7d b8 02 03 74 05 c3 cc cc cc cc e8 61 5d fd ff c3 cc cc cc cc <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[  548.737010] RSP: e02b:ffffc90040607cf0 EFLAGS: 00010297
[  548.737018] RAX: 0000000000000000 RBX: ffff888164a70408 RCX: 0000000000000000
[  548.737029] RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffff8881069c0000
[  548.737039] RBP: ffffc90049681000 R08: ffffc90049681000 R09: 0000000000000027
[  548.737050] R10: 0000000000000027 R11: fefefefefefefeff R12: ffffc90049681000
[  548.737060] R13: ffff8881002fd258 R14: 0000000000000000 R15: ffffc90040607dac
[  548.737079] FS:  0000000000000000(0000) GS:ffff8881f88ee000(0000) knlGS:0000000000000000
[  548.737090] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.737099] CR2: 000055576c2e6058 CR3: 000000010d47b000 CR4: 0000000000050660
[  548.737115] Call Trace:
[  548.737123]  <TASK>
[  548.737128]  vunmap_pmd_range.isra.0+0x1f1/0x2e0
[  548.737142]  vunmap_p4d_range+0x17d/0x290
[  548.737151]  __vunmap_range_noflush+0x182/0x1d0
[  548.737161]  ? _raw_spin_unlock+0xe/0x30
[  548.737171]  remove_vm_area+0x40/0x70
[  548.737180]  vfree.part.0+0x1b/0x290
[  548.737189]  delayed_vfree_work+0x35/0x50
[  548.737198]  process_one_work+0x18d/0x380
[  548.737207]  worker_thread+0x196/0x300
[  548.737215]  ? __pfx_worker_thread+0x10/0x10
[  548.737224]  kthread+0xe3/0x120
[  548.737233]  ? __pfx_kthread+0x10/0x10
[  548.737242]  ret_from_fork+0x19e/0x260
[  548.737250]  ? __pfx_kthread+0x10/0x10
[  548.737258]  ret_from_fork_asm+0x1a/0x30
[  548.737269]  </TASK>
[  548.737274] Modules linked in: vfat fat snd_seq_dummy snd_hrtimer ath9k ath9k_common snd_hda_codec_intelhdmi snd_hda_codec_hdmi ath9k_hw snd_hda_codec_alc269 snd_hda_codec_realtek_lib snd_hda_scodec_component snd_hda_codec_generic snd_hda_intel snd_hda_codec mac80211 snd_hda_core snd_intel_dspcfg snd_intel_sdw_acpi snd_hwdep ath snd_seq snd_seq_device snd_ctl_led cfg80211 snd_pcm at24 thinkpad_acpi intel_rapl_msr i2c_i801 snd_timer sparse_keymap iTCO_wdt intel_rapl_common platform_profile intel_powerclamp intel_pmc_bxt pcspkr i2c_smbus rfkill libarc4 snd soundcore mei_me e1000e mei joydev lpc_ich loop fuse xenfs nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock zram vmw_vmci lz4hc_compress lz4_compress dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt i915 i2c_algo_bit drm_buddy ghash_clmulni_intel ttm sdhci_pci drm_display_helper sdhci_uhs2 sdhci video xhci_pci cqhci wmi cec xhci_hcd ehci_pci mmc_core ehci_hcd serio_raw xen_acpi_processor xen_privcmd xen_pciback
[  548.737348]  xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput i2c_dev
[  548.737469] ---[ end trace 0000000000000000 ]---

I don't have a clear pattern for when this happens: one occurred during
host suspend, the other during a "normal" test run (starting/stopping
domUs and running stuff around them). Note also that one of those is Intel
and the other AMD, so it isn't really hardware specific.

Slightly more details with links (especially serial0.txt in the logs
tab) at
https://github.com/QubesOS/qubes-linux-kernel/pull/662#issuecomment-3963326188

Any idea?

That looks like the issue Juergen fixed with:

https://lore.kernel.org/xen-devel/20260220123715.834848-1-jgross@xxxxxxxx/

No, it doesn't. The fix is already in rc1, and the crash was quite early during
boot (before any secondary CPUs were brought up).

I guess this problem is related to the lazy_mmu_state series [1].

That may well be the case - it seems that xen_enter_lazy_mmu() is called
while already in lazy MMU mode (first splat), and xen_leave_lazy_mmu()
is called without being in lazy MMU mode (second splat). I expect this
is something specific to Xen, which I didn't get the chance to test.

Looking into this again.

I think the main problem is the call of arch_end_context_switch() in
__switch_to(). For Xen this is xen_end_context_switch(), which does:

   if (__task_lazy_mmu_mode_active(next))
       arch_enter_lazy_mmu_mode();

But this is wrong here, as current hasn't been switched to "next" yet.

I don't think we can just move the call of arch_end_context_switch(), as
it is needed for issuing the context switch related hypercall for switching
all the needed non-MMU settings.

What we probably really want is to call lazy_mmu_mode_pause() before the
call of arch_start_context_switch() and later call lazy_mmu_mode_resume()
after switching context to next. In xen_start_context_switch() and
xen_end_context_switch() the lazy mmu mode handling should be removed.

I will test that tomorrow, unless someone talks me out of it. :-)

That wasn't it, as the reasoning was wrong.

But now I think I have found the real culprit in lazy_mmu_mode_enable():

static inline void lazy_mmu_mode_enable(void)
{
        struct lazy_mmu_state *state = &current->lazy_mmu_state;

        if (in_interrupt() || state->pause_count > 0)
                return;

        VM_WARN_ON_ONCE(state->enable_count == U8_MAX);

        if (state->enable_count++ == 0)
                arch_enter_lazy_mmu_mode();
}

Consider a preemption right after enable_count has been incremented, but
before arch_enter_lazy_mmu_mode() is called. enable_count is now 1, but
lazy mode has not been entered yet.

When the task becomes active again, context switch handling will see lazy
mode enabled (enable_count > 0), so it will call arch_enter_lazy_mmu_mode().
Then the task resumes and calls arch_enter_lazy_mmu_mode() a second time.

The only way I see to avoid that would be to disable preemption around
every place that tests a condition and then enables or disables lazy mmu
mode.
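
To make the window easier to see, here is a minimal userspace sketch of the
race described above. It is only a model under stated assumptions: struct
lazy_model and all model_*() names are hypothetical and not kernel APIs, and
the context switch is assumed to leave lazy mode on switch-out and re-enter
it on switch-in whenever enable_count is non-zero. Disabling preemption
around the test-and-enter sequence corresponds to preempt_in_window being
false.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; none of these names are kernel APIs. */
struct lazy_model {
	unsigned int enable_count;  /* per-task counter, like state->enable_count */
	bool cpu_lazy;              /* is the CPU currently in lazy MMU mode? */
	unsigned int double_enter;  /* enters while already lazy (first splat) */
};

static void model_arch_enter(struct lazy_model *m)
{
	if (m->cpu_lazy)
		m->double_enter++;  /* the real Xen code hits BUG() here */
	m->cpu_lazy = true;
}

/*
 * Assumed context switch behaviour: leave lazy mode when switching out,
 * re-enter it on switch-in if the task's enable_count is non-zero.
 */
static void model_preempt_and_resume(struct lazy_model *m)
{
	m->cpu_lazy = false;
	if (m->enable_count > 0)
		model_arch_enter(m);
}

/* Returns how many double enters one outermost enable produced. */
static unsigned int model_enable(bool preempt_in_window)
{
	struct lazy_model m = { 0, false, 0 };

	if (m.enable_count++ == 0) {
		/* Window: the counter is already 1, the arch hook not yet called. */
		if (preempt_in_window)
			model_preempt_and_resume(&m);
		model_arch_enter(&m);
	}
	return m.double_enter;
}
```

In this model, model_enable(true) reports one double enter, matching the
first splat, while model_enable(false) reports none.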


Juergen
