
Re: Linux 6.13-rc3 many different panics in Xen PV dom0



On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > > > > distinctly different, if you follow the links, you can find more of
> > > > > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > > 
> > > > > > > Full log:
> > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > > 
> > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > > 
> > > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > > after a module being loaded (in my reproducer it was right after loading
> > > > > > the first module).
> > > > > > 
> > > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > > handled properly.
> > > > > 
> > > > > Seems I was right.
> > > > > 
> > > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > > it fixes your crashes, too?
> > > > 
> > > > Thanks for looking into it!
> > > > Will do, I've pushed it to
> > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > > and then I'll post it to openQA.
> > > 
> > > It is much better!
> > > 
> > > Tests are still running, but I already see that many are green.
> > 
> > So are you fine with me adding your "Tested-by:"?
> 
> Yes.
> 
> > > There is
> > > one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > > controllers passed through) crashes on a system with Raptor Lake CPU
> > > (only, others, including ADL and MTL look fine):

Correction: it does happen on some other systems too. I just got the crash on
the ADL system, although it looks a bit different ("Corrupted page table at ..."):

sys-usb login: [2025-01-02 23:44:58] [    7.295556] Bluetooth: hci0: Waiting for firmware download to complete
[    7.296996] Bluetooth: hci0: Firmware loaded in 2882606 usecs
[    7.297276] Bluetooth: hci0: Waiting for device to boot
[    7.313074] Bluetooth: hci0: Device booted in 15473 usecs
[    7.318447] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-1040-0041.ddc
[    7.321060] Bluetooth: hci0: Applying Intel DDC parameters completed
[    7.322057] Bluetooth: hci0: No support for BT device in ACPI firmware
[    7.324037] Bluetooth: hci0: Firmware timestamp 2024.33 buildtype 1 build 81755
[    7.324085] Bluetooth: hci0: Firmware SHA1: 0xd028ffe4
[    7.327995] Bluetooth: hci0: Fseq status: Success (0x00)
[    7.328017] Bluetooth: hci0: Fseq executed: 00.00.02.41
[    7.328032] Bluetooth: hci0: Fseq BT Top: 00.00.02.41
[    7.396950] Bluetooth: MGMT ver 1.23
[    9.352650] kauditd_printk_skb: 82 callbacks suppressed
[    9.352655] audit: type=1131 audit(1735861500.506:81): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   15.808157] audit: type=1100 audit(1735861506.961:82): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[   15.808860] audit: type=1100 audit(1735861506.962:83): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[   15.814137] audit: type=1103 audit(1735861506.967:84): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[   15.814816] audit: type=1006 audit(1735861506.968:85): pid=867 uid=0 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=tty7 old-ses=4294967295 ses=1 res=1
[   15.815078] audit: type=1300 audit(1735861506.968:85): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe29c03a70 a2=4 a3=0 items=0 ppid=712 pid=867 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=tty7 ses=1 comm="qubes-gui-runus" exe="/usr/bin/qubes-gui-runuser" subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 key=(null)
[   15.815164] audit: type=1327 audit(1735861506.968:85): proctitle=2F7573722F62696E2F71756265732D6775692D72756E757365720075736572002F62696E2F7368002D6C002D630065786563202F7573722F62696E2F78696E6974202F6574632F5831312F78696E69742F78696E69747263202D2D202F7573722F6C69622F71756265732F71756265732D786F72672D77726170706572203A30
[   15.815420] audit: type=1103 audit(1735861506.969:86): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[   15.816039] audit: type=1006 audit(1735861506.969:87): pid=866 uid=0 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=2 res=1
[   15.817029] audit: type=1300 audit(1735861506.969:87): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe550c1c30 a2=4 a3=0 items=0 ppid=864 pid=866 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=2 comm="qrexec-agent" exe="/usr/lib/qubes/qrexec-agent" subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 key=(null)
[   15.817160] audit: type=1327 audit(1735861506.969:87): proctitle="/usr/lib/qubes/qrexec-agent"
[   16.111133] systemd-journald[366]: Time jumped backwards, rotating.
th: RFCOMM TTY layer initialized
[   18.286026] Bluetooth: RFCOMM socket layer initialized
[   18.286035] Bluetooth: RFCOMM ver 1.11
[   18.469074] abrt-dump-journ: Corrupted page table at address 78c64b600010
[   18.469096] PGD 14980067 P4D 14980067 PUD 14981067 PMD 38c8047 PTE 243c8b48ffffff57
[   18.469117] Oops: Bad pagetable: 000d [#1] PREEMPT SMP NOPTI
[   18.469132] CPU: 1 UID: 0 PID: 657 Comm: abrt-dump-journ Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[   18.469152] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[   18.469165] RIP: 0033:0x78c64e1bc9a0
[   18.469177] Code: 86 f5 01 00 00 49 8b 7c 24 38 48 85 ff 0f 84 08 03 00 00 48 8d 0d 40 e6 ff ff ba 18 00 00 00 e8 46 c7 fa ff e9 d1 01 00 00 90 <0f> b6 50 10 38 96 c8 01 00 00 0f 85 63 fd ff ff 80 fa 02 0f 84 4c
[   18.469211] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[   18.469223] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[   18.469238] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[   18.469253] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[   18.469268] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[   18.469284] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[   18.469299] FS:  000078c64d675400 GS:  0000000000000000
[   18.469310] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore rfcomm bnep btusb btrtl btintel btbcm btmtk bluetooth rfkill nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_ct nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 joydev nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xhci_pci ehci_pci xhci_hcd ehci_hcd pcspkr i2c_piix4 i2c_smbus ata_generic pata_acpi serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[   18.469484] ---[ end trace 0000000000000000 ]---
[   18.469495] RIP: 0033:0x78c64e1bc9a0
[   18.469504] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[   18.469516] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[   18.469531] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[   18.469547] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[   18.469562] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[   18.469577] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[   18.469593] FS:  000078c64d675400(0000) GS:ffff9de397100000(0000) knlGS:0000000000000000
[   18.469609] CS:  0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.469623] CR2: 000078c64b600010 CR3: 0000000000164004 CR4: 0000000000770ef0
[   18.469640] PKRU: 55555554
[   18.469646] Kernel panic - not syncing: Fatal exception
[   18.469706] Kernel Offset: 0x2ec00000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
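
For anyone reading the "Bad pagetable: 000d" line above: the number is the x86
page-fault error code, and the kernel reports "Corrupted page table" when its
reserved-bit flag is set. Below is a minimal Python sketch that decodes it and
checks the logged PTE for reserved physical-address bits. The function names
are made up for illustration, and the MAXPHYADDR value of 46 is an assumption
(the real width comes from CPUID on the affected machine), so treat this as a
reading aid rather than a diagnosis.

```python
# Architectural x86 page-fault error code bits (low six bits only).
PF_BITS = {
    0: "PROT",   # set: fault on a present page; clear: page not present
    1: "WRITE",  # set: write access; clear: read access
    2: "USER",   # set: access came from user mode
    3: "RSVD",   # set: a reserved bit was set in a paging-structure entry
    4: "INSTR",  # set: fault was an instruction fetch
    5: "PK",     # set: protection-key violation
}


def decode_pf_error(code: int) -> list[str]:
    """Return the names of the flags set in a page-fault error code."""
    return [name for bit, name in PF_BITS.items() if code & (1 << bit)]


def reserved_pte_bits(pte: int, maxphyaddr: int) -> int:
    """Return the PTE bits in the range 51:maxphyaddr, which must be zero."""
    mask = (1 << 52) - (1 << maxphyaddr)
    return pte & mask


if __name__ == "__main__":
    # Values copied from the oops above.
    print(decode_pf_error(0x000D))
    # ASSUMPTION: 46 physical address bits; adjust for the actual CPU.
    print(hex(reserved_pte_bits(0x243C8B48FFFFFF57, maxphyaddr=46)))
```

A non-zero result from `reserved_pte_bits` would be consistent with the RSVD
flag in the error code, i.e. the PTE content itself looks like garbage rather
than a valid mapping.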


> > > [   75.770849] Bluetooth: Core ver 2.22
> > > [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 
> > This code is looking suspicious. Large areas of binary 0 in a normal function?
> > And the code itself is nonsense, as it is using a memory access via ES:, which
> > doesn't make any sense in 64-bit kernel.
> 
> Could it be still something related to modules layout in memory?
> It seems it's not 100% reliable crash, I see in at least one instance
> sys-usb remained running (unfortunately I don't have collected full
> sys-usb console log from successful test...).
> 
> I just checked again that this crash didn't happen with any 6.12 or 6.11
> kernels.
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
