Re: Linux 6.13-rc3 many different panics in Xen PV dom0
On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > It crashes on boot like below, most of the times. But sometimes (rarely)
> > > > > > > it manages to stay alive. Below I'm pasting few of the crashes that look
> > > > > > > distinctly different, if you follow the links, you can find more of
> > > > > > > them. IMHO it looks like some memory corruption bug somewhere. I tested
> > > > > > > also Linux 6.13-rc2 before, and it had very similar issue.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > >
> > > > > > > Full log:
> > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > >
> > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > >
> > > > > > What is really interesting in the logs: most crashes seem to happen right
> > > > > > after a module being loaded (in my reproducer it was right after loading
> > > > > > the first module).
> > > > > >
> > > > > > I need to go through the 6.13 commits, but I think I remember having seen
> > > > > > a patch optimizing module loading by using large pages for addressing the
> > > > > > loaded modules. Maybe the case of no large pages being available isn't
> > > > > > handled properly.
> > > > >
> > > > > Seems I was right.
> > > > >
> > > > > For me the following diff fixes the issue. Marek, can you please confirm
> > > > > it fixes your crashes, too?
> > > >
> > > > Thanks for looking into it!
> > > > Will do, I've pushed it to
> > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build it
> > > > and then I'll post it to openQA.
> > >
> > > It is much better!
> > >
> > > Tests are still running, but I already see that many are green.
> >
> > So are you fine with me adding your "Tested-by:"?
>
> Yes.
>
> > > There is
> > > one issue (likely unrelated to this change) - sys-usb (HVM domU with USB
> > > controllers passed through) crashes on a system with Raptor Lake CPU
> > > (only, others, including ADL and MTL look fine):

Correction, it does happen on some others too, just got the crash on the
ADL system, although looks a bit different ("Corrupted page table at ..."):

sys-usb login: [2025-01-02 23:44:58] [ 7.295556] Bluetooth: hci0: Waiting for firmware download to complete
[ 7.296996] Bluetooth: hci0: Firmware loaded in 2882606 usecs
[ 7.297276] Bluetooth: hci0: Waiting for device to boot
[ 7.313074] Bluetooth: hci0: Device booted in 15473 usecs
[ 7.318447] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-1040-0041.ddc
[ 7.321060] Bluetooth: hci0: Applying Intel DDC parameters completed
[ 7.322057] Bluetooth: hci0: No support for BT device in ACPI firmware
[ 7.324037] Bluetooth: hci0: Firmware timestamp 2024.33 buildtype 1 build 81755
[ 7.324085] Bluetooth: hci0: Firmware SHA1: 0xd028ffe4
[ 7.327995] Bluetooth: hci0: Fseq status: Success (0x00)
[ 7.328017] Bluetooth: hci0: Fseq executed: 00.00.02.41
[ 7.328032] Bluetooth: hci0: Fseq BT Top: 00.00.02.41
[ 7.396950] Bluetooth: MGMT ver 1.23
[ 9.352650] kauditd_printk_skb: 82 callbacks suppressed
[ 9.352655] audit: type=1131 audit(1735861500.506:81): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 15.808157] audit: type=1100 audit(1735861506.961:82): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[ 15.808860] audit: type=1100 audit(1735861506.962:83): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:authentication grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[ 15.814137] audit: type=1103 audit(1735861506.967:84): pid=867 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/bin/qubes-gui-runuser" hostname=sys-usb addr=? terminal=/dev/tty7 res=success'
[ 15.814816] audit: type=1006 audit(1735861506.968:85): pid=867 uid=0 subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=tty7 old-ses=4294967295 ses=1 res=1
[ 15.815078] audit: type=1300 audit(1735861506.968:85): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe29c03a70 a2=4 a3=0 items=0 ppid=712 pid=867 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=tty7 ses=1 comm="qubes-gui-runus" exe="/usr/bin/qubes-gui-runuser" subj=system_u:system_r:xdm_t:s0-s0:c0.c1023 key=(null)
[ 15.815164] audit: type=1327 audit(1735861506.968:85): proctitle=2F7573722F62696E2F71756265732D6775692D72756E757365720075736572002F62696E2F7368002D6C002D630065786563202F7573722F62696E2F78696E6974202F6574632F5831312F78696E69742F78696E69747263202D2D202F7573722F6C69622F71756265732F71756265732D786F72672D77726170706572203A30
[ 15.815420] audit: type=1103 audit(1735861506.969:86): pid=866 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 msg='op=PAM:setcred grantors=pam_rootok acct="user" exe="/usr/lib/qubes/qrexec-agent" hostname=? addr=? terminal=? res=success'
[ 15.816039] audit: type=1006 audit(1735861506.969:87): pid=866 uid=0 subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=2 res=1
[ 15.817029] audit: type=1300 audit(1735861506.969:87): arch=c000003e syscall=1 success=yes exit=4 a0=3 a1=7ffe550c1c30 a2=4 a3=0 items=0 ppid=864 pid=866 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=2 comm="qrexec-agent" exe="/usr/lib/qubes/qrexec-agent" subj=system_u:system_r:local_login_t:s0-s0:c0.c1023 key=(null)
[ 15.817160] audit: type=1327 audit(1735861506.969:87): proctitle="/usr/lib/qubes/qrexec-agent"
[ 16.111133] systemd-journald[366]: Time jumped backwards, rotating.
th: RFCOMM TTY layer initialized
[ 18.286026] Bluetooth: RFCOMM socket layer initialized
[ 18.286035] Bluetooth: RFCOMM ver 1.11
[ 18.469074] abrt-dump-journ: Corrupted page table at address 78c64b600010
[ 18.469096] PGD 14980067 P4D 14980067 PUD 14981067 PMD 38c8047 PTE 243c8b48ffffff57
[ 18.469117] Oops: Bad pagetable: 000d [#1] PREEMPT SMP NOPTI
[ 18.469132] CPU: 1 UID: 0 PID: 657 Comm: abrt-dump-journ Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
[ 18.469152] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
[ 18.469165] RIP: 0033:0x78c64e1bc9a0
[ 18.469177] Code: 86 f5 01 00 00 49 8b 7c 24 38 48 85 ff 0f 84 08 03 00 00 48 8d 0d 40 e6 ff ff ba 18 00 00 00 e8 46 c7 fa ff e9 d1 01 00 00 90 <0f> b6 50 10 38 96 c8 01 00 00 0f 85 63 fd ff ff 80 fa 02 0f 84 4c
[ 18.469211] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[ 18.469223] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[ 18.469238] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[ 18.469253] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[ 18.469268] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[ 18.469284] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[ 18.469299] FS: 000078c64d675400 GS: 0000000000000000
[ 18.469310] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device snd_timer snd soundcore rfcomm bnep btusb btrtl btintel btbcm btmtk bluetooth rfkill nft_reject_ipv6 nf_reject_ipv6 nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_ct nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 joydev nf_tables intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 xhci_pci ehci_pci xhci_hcd ehci_hcd pcspkr i2c_piix4 i2c_smbus ata_generic pata_acpi serio_raw xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback xen_evtchn loop fuse nfnetlink overlay xen_blkfront
[ 18.469484] ---[ end trace 0000000000000000 ]---
[ 18.469495] RIP: 0033:0x78c64e1bc9a0
[ 18.469504] RSP: 002b:00007ffcdc67a8b0 EFLAGS: 00010246
[ 18.469516] RAX: 000078c64b600000 RBX: 00006045c444c890 RCX: 0000000000000048
[ 18.469531] RDX: 0000000000000000 RSI: 00006045c444c890 RDI: 00006045c444f040
[ 18.469547] RBP: 00007ffcdc67a930 R08: 00006045c43a1010 R09: 0000000000000001
[ 18.469562] R10: 00006045c44098b0 R11: 0000000000000246 R12: 00006045c444f040
[ 18.469577] R13: 00006045c4409890 R14: 00006045c444c890 R15: 0000000000000000
[ 18.469593] FS: 000078c64d675400(0000) GS:ffff9de397100000(0000) knlGS:0000000000000000
[ 18.469609] CS: 0033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.469623] CR2: 000078c64b600010 CR3: 0000000000164004 CR4: 0000000000770ef0
[ 18.469640] PKRU: 55555554
[ 18.469646] Kernel panic - not syncing: Fatal exception
[ 18.469706] Kernel Offset: 0x2ec00000 from 0xffffffff80200000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

> > > [ 75.770849] Bluetooth: Core ver 2.22
> > > [ 75.770866] Oops: general protection fault, probably for non-canonical
> > > address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > [ 75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted
> > > 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > [ 75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > [ 75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > [ 75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65
> > > 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > This code is looking suspicious. Large areas of binary 0 in a normal
> > function?
> > And the code itself is nonsense, as it is using a memory access via ES:,
> > which doesn't make any sense in 64-bit kernel.
>
> Could it be still something related to modules layout in memory?
> It seems it's not 100% reliable crash, I see in at least one instance
> sys-usb remained running (unfortunately I don't have collected full
> sys-usb console log from successful test...).
>
> I just checked again that this crash didn't happen with any 6.12 or 6.11
> kernels.
>
> --
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
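
For readers following the two traces quoted in this message, here is a minimal sketch of how the "Bad pagetable: 000d" error code and the "non-canonical address" report can be checked by hand. It is not part of the original mail. It assumes the standard x86-64 page-fault error-code bit layout and 4-level paging (48-bit virtual addresses). The helper names decode_pf_error and is_canonical are hypothetical, chosen for illustration only.

```python
# Sketch only: helper names are hypothetical, not kernel or Xen APIs.
# Assumes the x86 page-fault error-code bit layout and 48-bit
# virtual addresses (4-level paging).

PF_BITS = {
    0: "present",            # fault on a present page (not a missing one)
    1: "write",              # access was a write
    2: "user",               # access came from user mode
    3: "reserved-bit",       # reserved bit set in a paging-structure entry
    4: "instruction-fetch",  # fault on an instruction fetch
    5: "protection-key",     # protection-key violation
}


def decode_pf_error(code: int) -> list[str]:
    """Return the names of the error-code bits that are set."""
    return [name for bit, name in PF_BITS.items() if code & (1 << bit)]


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of addr are all equal."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


if __name__ == "__main__":
    # Error code from "Oops: Bad pagetable: 000d"
    print(decode_pf_error(0x000D))
    # Address from the general protection fault in the quoted trace
    print(is_canonical(0xC9D2315BC82C3BBD))
    # CR2 from the "Corrupted page table" trace
    print(is_canonical(0x000078C64B600010))
```

Under these assumptions, error code 000d has the reserved-bit flag set, which matches the kernel labelling the fault as a bad page table rather than an ordinary user-space fault, while the faulting address itself is a well-formed user address.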