
Re: Linux 6.13-rc3 many different panics in Xen PV dom0



On Fri, Jan 03, 2025 at 01:18:31AM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 02, 2025 at 08:39:16PM +0100, Marek Marczykowski-Górecki wrote:
> > On Thu, Jan 02, 2025 at 08:17:00PM +0100, Jürgen Groß wrote:
> > > On 02.01.25 19:54, Marek Marczykowski-Górecki wrote:
> > > > On Thu, Jan 02, 2025 at 01:24:21PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Thu, Jan 02, 2025 at 12:30:10PM +0100, Juergen Gross wrote:
> > > > > > On 02.01.25 11:20, Jürgen Groß wrote:
> > > > > > > On 19.12.24 17:14, Marek Marczykowski-Górecki wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > > It crashes on boot like below, most of the time. But sometimes
> > > > > > > > > (rarely) it manages to stay alive. Below I'm pasting a few of the
> > > > > > > > > crashes that look distinctly different; if you follow the links,
> > > > > > > > > you can find more of them. IMHO it looks like some memory
> > > > > > > > > corruption bug somewhere. I also tested Linux 6.13-rc2 before, and
> > > > > > > > > it had a very similar issue.
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > > 
> > > > > > > > Full log:
> > > > > > > > https://openqa.qubes-os.org/tests/122879/logfile?filename=serial0.txt
> > > > > > > 
> > > > > > > I can reproduce a crash with 6.13-rc5 PV dom0.
> > > > > > > 
> > > > > > > > What is really interesting in the logs: most crashes seem to happen
> > > > > > > > right after a module is loaded (in my reproducer it was right after
> > > > > > > > loading the first module).
> > > > > > > > 
> > > > > > > > I need to go through the 6.13 commits, but I think I remember having
> > > > > > > > seen a patch optimizing module loading by using large pages for
> > > > > > > > addressing the loaded modules. Maybe the case of no large pages being
> > > > > > > > available isn't handled properly.
> > > > > > 
> > > > > > Seems I was right.
> > > > > > 
> > > > > > > For me the following diff fixes the issue. Marek, can you please
> > > > > > > confirm it fixes your crashes, too?
> > > > > 
> > > > > Thanks for looking into it!
> > > > > > Will do, I've pushed it to
> > > > > > https://github.com/QubesOS/qubes-linux-kernel/pull/662, CI will build
> > > > > > it and then I'll post it to openQA.
> > > > 
> > > > It is much better!
> > > > 
> > > > Tests are still running, but I already see that many are green.
> > > 
> > > So are you fine with me adding your "Tested-by:"?
> > 
> > Yes.
> > 
> > > > There is one issue (likely unrelated to this change): sys-usb (an HVM
> > > > domU with USB controllers passed through) crashes on a system with a
> > > > Raptor Lake CPU (only that one; others, including ADL and MTL, look fine):
> 
> Correction: it does happen on some others too. I just got the crash on the
> ADL system, although it looks a bit different ("Corrupted page table at ..."):

I've collected more of these crashes at
https://github.com/QubesOS/qubes-issues/issues/9681

Should I start a new thread for this? On the one hand, it's a different
domain type (HVM); on the other hand, many of these crashes also happen
around module loading.

> > > > [   75.770849] Bluetooth: Core ver 2.22
> > > > [   75.770866] Oops: general protection fault, probably for non-canonical address 0xc9d2315bc82c3bbd: 0000 [#1] PREEMPT SMP NOPTI
> > > > [   75.770880] CPU: 0 UID: 0 PID: 923 Comm: (udev-worker) Not tainted 6.13.0-0.rc5.2.qubes.1.fc41.x86_64 #1
> > > > [   75.770890] Hardware name: Xen HVM domU, BIOS 4.19.0 01/02/2025
> > > > [   75.770897] RIP: 0010:msft_monitor_device_del+0x93/0x170 [bluetooth]
> > > > [   75.770924] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d0 65 21 <26> 2b 8b ad 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 
> > > This code looks suspicious. Large areas of binary 0 in a normal function?
> > > And the code itself is nonsense, as it uses a memory access via ES:,
> > > which doesn't make any sense in a 64-bit kernel.
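
Both observations (mostly zero bytes, and an ES: override at the faulting
instruction) can be read mechanically off the "Code:" line, where the byte
in <...> is the one at RIP. A minimal Python sketch of such a check follows;
it is not something posted in this thread, and the function name
summarize_oops_code is hypothetical. The prefix table lists the standard x86
segment-override prefix bytes. It only classifies the marked byte and counts
zeros; it does not disassemble anything.

```python
import re

# Standard x86 segment-override prefix bytes.
SEGMENT_PREFIXES = {
    0x26: "ES", 0x2E: "CS", 0x36: "SS",
    0x3E: "DS", 0x64: "FS", 0x65: "GS",
}


def summarize_oops_code(code_line):
    """Summarize the byte dump of an (unwrapped) oops "Code:" line.

    Hypothetical helper: returns the number of dumped bytes, how many
    of them are zero, the byte marked with <..> (the one at RIP), and
    the segment register it selects if it is an override prefix.
    """
    text = code_line.split("Code:", 1)[-1]
    values = []
    fault_byte = None
    for token in text.split():
        match = re.fullmatch(r"(<)?([0-9a-fA-F]{2})(>)?", token)
        if match is None:
            continue  # skip timestamps, quote markers and other noise
        value = int(match.group(2), 16)
        if match.group(1) and match.group(3):
            fault_byte = value
        values.append(value)
    return {
        "total": len(values),
        "zero_bytes": sum(1 for v in values if v == 0),
        "fault_byte": fault_byte,
        "fault_prefix": SEGMENT_PREFIXES.get(fault_byte),
    }


if __name__ == "__main__":
    # Same byte sequence as the Code: line quoted above, built compactly.
    sample = ("Code: " + "00 " * 39 + "d0 65 21 <26> 2b 8b ad 03 "
              + "00 " * 17)
    print(summarize_oops_code(sample))
```

A dump that is almost entirely zeros, with a segment-override prefix at the
faulting byte, points at the module text itself being wrong in memory
rather than at a logic bug in the function named in RIP.
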
> > 
> > Could it still be something related to the module layout in memory?
> > The crash does not seem to be 100% reliable: in at least one instance
> > sys-usb remained running (unfortunately I did not collect the full
> > sys-usb console log from the successful test...).
> > 
> > I just checked again that this crash didn't happen with any 6.12 or 6.11
> > kernels.



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



 

