
Re: Linux 6.13-rc3 many different panics in Xen PV dom0


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Jürgen Groß <jgross@xxxxxxxx>
  • Date: Thu, 2 Jan 2025 20:17:00 +0100
  • Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Thu, 02 Jan 2025 19:17:13 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
On 02.01.25 11:20, Jürgen Groß wrote:
On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
Hi,

It crashes on boot like below, most of the time, but sometimes (rarely)
it manages to stay alive. Below I'm pasting a few of the crashes that
look distinctly different; if you follow the links, you can find more of
them. IMHO it looks like a memory corruption bug somewhere. I also
tested Linux 6.13-rc2 before, and it had a very similar issue.

...


Full log:
https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt

I can reproduce a crash with 6.13-rc5 PV dom0.

What is really interesting in the logs: most crashes seem to happen right
after a module is loaded (in my reproducer it was right after loading
the first module).

I need to go through the 6.13 commits, but I think I remember having seen
a patch optimizing module loading by using large pages for addressing the
loaded modules. Maybe the case of no large pages being available isn't
handled properly.

Seems I was right.

For me the following diff fixes the issue. Marek, can you please confirm
it fixes your crashes, too?

Thanks for looking into it!
Will do, I've pushed it to
https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
and then I'll post it to openQA.

It is much better!

Tests are still running, but I already see that many are green.

So are you fine with me adding your "Tested-by:"?

There is one issue (likely unrelated to this change): sys-usb (an HVM
domU with USB controllers passed through) crashes on a system with a
Raptor Lake CPU (only that one; others, including ADL and MTL, look fine):

[   75.770849] Bluetooth: Core ver 2.22
[   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
[   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This code looks suspicious. Large areas of binary 0 in a normal function?
And the code itself is nonsense, as it uses a memory access via ES: (the
26 prefix byte at the faulting position), which doesn't make any sense in
a 64-bit kernel.
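
To make the two observations above easy to check on other dumps, here is a minimal Python sketch. It parses the "Code:" bytes from the oops, reports what fraction of them are zero, and says whether the byte marked with <..> is an x86 segment-override prefix. The helper names (parse_code_line, summarize) are made up for illustration; the prefix values are the standard x86 encodings.

```python
# The "Code:" bytes from the oops in this thread, one string per wrapped line.
OOPS_CODE = (
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
)

# x86 segment-override prefix bytes (fixed by the instruction encoding).
SEGMENT_PREFIXES = {
    0x26: "ES", 0x2E: "CS", 0x36: "SS", 0x3E: "DS", 0x64: "FS", 0x65: "GS",
}


def parse_code_line(text):
    """Return (all_bytes, index_of_faulting_byte) for an oops Code: dump.

    The kernel marks the byte at the faulting RIP with angle brackets.
    """
    tokens = text.split()
    fault_index = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    data = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    return data, fault_index


def summarize(text):
    """Collect the facts discussed above: zero padding and prefix at RIP."""
    data, fault = parse_code_line(text)
    return {
        "total_bytes": len(data),
        "zero_fraction": data.count(0) / len(data),
        "fault_byte": data[fault],
        "segment_prefix": SEGMENT_PREFIXES.get(data[fault]),
    }


if __name__ == "__main__":
    info = summarize(OOPS_CODE)
    print(f"bytes dumped:   {info['total_bytes']}")
    print(f"zero fraction:  {info['zero_fraction']:.2f}")
    print(f"byte at RIP:    {info['fault_byte']:#04x}")
    print(f"segment prefix: {info['segment_prefix']}")
```

A high zero fraction around RIP only shows that the page does not hold ordinary function code. It does not say how the page ended up like that.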


Juergen


[   75.770943] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
[   75.770950] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
[   75.770958] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
[   75.770967] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
[   75.770975] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
[   75.770983] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
[   75.770991] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
[   75.771000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.771007] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
[   75.771016] PKRU: 55555554
[   75.771019] Call Trace:
[   75.771024]  <TASK>
[   75.771028]  ? show_trace_log_lvl+0x1b0/0x2f0
[   75.771036]  ? show_trace_log_lvl+0x1b0/0x2f0
[   75.771042]  ? do_one_initcall+0x58/0x310
[   75.771048]  ? __die_body.cold+0x8/0x12
[   75.771053]  ? die_addr+0x3c/0x60
[   75.771059]  ? exc_general_protection+0x17d/0x400
[   75.771066]  ? asm_exc_general_protection+0x26/0x30
[   75.771074]  ? msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.771095]  ? bt_init+0x54/0x1d0 [bluetooth]
[   75.771114]  ? __pfx_bt_init+0x10/0x10 [bluetooth]
[   75.771131]  ? do_one_initcall+0x58/0x310
[   75.771137]  ? do_init_module+0x90/0x250
[   75.771142]  ? init_module_from_file+0x86/0xc0
[   75.771149]  ? idempotent_init_module+0x115/0x310
[   75.771156]  ? __x64_sys_finit_module+0x65/0xc0
[   75.771163]  ? do_syscall_64+0x82/0x160
[   75.771168]  ? backing_file_read_iter+0x156/0x1f0
[   75.771176]  ? ovl_read_iter+0x94/0xa0 [overlay]
[   75.771189]  ? __pfx_ovl_file_accessed+0x10/0x10 [overlay]
[   75.771199]  ? rseq_get_rseq_cs+0x1d/0x220
[   75.771205]  ? rseq_ip_fixup+0x8d/0x1d0
[   75.771210]  ? __seccomp_filter+0x303/0x520
[   75.771216]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[   75.771224]  ? syscall_exit_to_user_mode+0x10/0x210
[   75.771231]  ? do_syscall_64+0x8e/0x160
[   75.771236]  ? do_sys_openat2+0x9c/0xe0
[   75.771241]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[   75.771249]  ? syscall_exit_to_user_mode+0x10/0x210
[   75.771255]  ? do_syscall_64+0x8e/0x160
[   75.771260]  ? do_user_addr_fault+0x1ec/0x7b0
[   75.771267]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.771274]  </TASK>
[   75.771277] Modules linked in: bluetooth(+) rfkill snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject intel_rapl_msr intel_rapl_common nft_ct intel_uncore_frequency_common intel_pmc_core intel_vsec joydev nft_masq pmt_telemetry pmt_class nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni xhci_pci polyval_generic ghash_clmulni_intel xhci_hcd sha512_ssse3 sha256_ssse3 nf_tables sha1_ssse3 ehci_pci mei_me ehci_hcd pcspkr mei ata_generic pata_acpi i2c_piix4 i2c_smbus serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[   75.771370] ---[ end trace 0000000000000000 ]---
[   75.771376] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
[   75.771397] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   75.771416] RSP: 0000:ffffad644108fa40 EFLAGS: 00010246
[   75.771422] RAX: ffff93da8a149600 RBX: c9d2315bc82c3810 RCX: 0000000100000000
[   75.771431] RDX: 0000000000000001 RSI: ffff93da905e9180 RDI: ffff93da81404598
[   75.771439] RBP: ffffad644108fa58 R08: 0000000000000064 R09: 00000000000012ab
[   75.771446] R10: ffff93da81207000 R11: 0000000000000286 R12: ffffad644108fb00
[   75.771454] R13: ffffad644108fa68 R14: ffff93da9089b840 R15: ffff93da8c265100
[   75.771463] FS:  000078fa4cec4bc0(0000) GS:ffff93da97000000(0000) knlGS:0000000000000000
[   75.771471] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   75.771477] CR2: 000074fa64aadc08 CR3: 00000000105d2006 CR4: 0000000000770ef0
[   75.771485] PKRU: 55555554
[   75.771488] Kernel panic - not syncing: Fatal exception
[   75.771519] Kernel Offset: 0x3b800000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
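
As a cross-check of the dump above, the following Python sketch hand-decodes the bytes that follow the 26 prefix at RIP (2b 8b ad 03 00 00) and compares the resulting effective address with the one in the "Oops:" line. It is an illustration under stated assumptions, not a general disassembler: it handles only the [reg + disp32] form without a SIB byte, it assumes 48-bit virtual addresses (4-level paging), and the function names are made up. RBX and the reported address are copied from the log.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr, va_bits=48):
    """True if bits 63 down to va_bits-1 are all equal (sign-extended)."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (65 - va_bits)) - 1


def decode_disp32_operand(insn):
    """Return (base_register_number, disp32) for `opcode modrm disp32`.

    Only ModRM mod=10 (32-bit displacement) without a SIB byte is
    handled, which is all this particular dump needs.
    """
    modrm = insn[1]
    mod, rm = modrm >> 6, modrm & 0b111
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only [reg + disp32] without SIB is handled")
    disp = int.from_bytes(insn[2:6], "little", signed=True)
    return rm, disp


# Values copied from the oops above.
RBX = 0xC9D2315BC82C3810
REPORTED = 0xC9D2315BC82C3BBD

# Opcode 0x2b (sub reg, r/m), ModRM 0x8b, then the 32-bit displacement.
insn = bytes.fromhex("2b8bad030000")
rm, disp = decode_disp32_operand(insn)
if rm != 3:  # ModRM register number 3 is RBX (without a REX.B prefix)
    raise SystemExit("unexpected base register")

effective = (RBX + disp) & MASK64

if __name__ == "__main__":
    print(f"displacement:      {disp:#x}")
    print(f"RBX + disp:        {effective:#018x}")
    print(f"matches oops addr: {effective == REPORTED}")
    print(f"canonical:         {is_canonical(effective)}")
```

This only shows that the register dump, the bytes at RIP and the reported address agree with each other. It says nothing about why the module text contains these bytes.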

Full log inside
https://openqa.qubes-os.org/tests/124736/file/usbvm-var_log.tar.gz
(log/xen/console/guest-sys-usb.log)




 

