Re: E820 memory allocation issue on Threadripper platforms
On 17.01.2024 07:12, Patrick Plenefisch wrote:
> On Tue, Jan 16, 2024 at 4:33 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
>> On 16.01.2024 01:22, Patrick Plenefisch wrote:
>>> I managed to set up serial access and saved the output with the requested
>>> flags as the attached logs
>>
>> Thanks. While you didn't ...
>>
>> ... fiddle with the Linux message, ...
>>
>
> I last built the kernel over a decade ago, and so was hoping to not have to
> look up how to do that again, but I can research how to go about that again
> if it would help?
>
>>
>> ... as per
>>
>> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x4a00000
>>
>> there's an overlap with not exactly a hole, but with an
>> EfiACPIMemoryNVS region:
>>
>> (XEN)  0000000100000-0000003159fff type=2 attr=000000000000000f
>> (XEN)  000000315a000-0000003ffffff type=7 attr=000000000000000f
>> (XEN)  0000004000000-0000004045fff type=10 attr=000000000000000f
>> (XEN)  0000004046000-0000009afefff type=7 attr=000000000000000f
>>
>> (the 3rd of the 4 lines). Considering there's another region higher
>> up:
>>
>> (XEN)  00000a747f000-00000a947efff type=10 attr=000000000000000f
>>
>> I'm inclined to say it is poor firmware (or, far less likely, boot
>> loader) behavior to clobber a rather low and entirely arbitrary RAM
>>
>
> Bootloader is Grub 2.06 EFI platform as packaged by Debian 12
>
>
>> range, rather than consolidating all such regions near the top of
>> RAM below 4Gb. There are further such odd regions, btw:
>>
>> (XEN)  0000009aff000-0000009ffffff type=0 attr=000000000000000f
>> ...
>> (XEN)  000000b000000-000000b020fff type=0 attr=000000000000000f
>>
>> If the kernel image was sufficiently much larger, these could become
>> a problem as well. Otoh if the kernel wasn't built with
>> CONFIG_PHYSICAL_START=0x1000000, i.e. to start at 16Mb, but at, say,
>> 2Mb, things should apparently work even with this unusual memory
>> layout (until the kernel would grow enough to again run into that
>> very region).
>
> I'm currently talking to the vendor's support team and testing a beta BIOS
> for unrelated reasons, is there something specific I should forward to
> them, either as a question or as a request for a fix?

Well, first it would need figuring out whether the "interesting" regions are
being put in place by firmware or the boot loader. If it's firmware (pretty
likely at least for the region you're having trouble with), you may want to
ask them to re-do where they place that specific data.

> As someone who hasn't built a kernel in over a decade, should I figure out
> how to do a kernel build with CONFIG_PHYSICAL_START=0x2000000 and report
> back?

That was largely a suggestion to perhaps allow you to gain some workable
setup. It would be of interest to us largely for completeness.

>> It remains to be seen in how far it is reasonably possible to work
>> around this in the kernel. While (sadly) still unsupported, in the
>> meantime you may want to consider running Dom0 in PVH mode.
>>
>
> I tried this by adding dom0=pvh, and instead got this boot error:
>
> (XEN) xenoprof: Initialization failed. AMD processor family 25 is not supported
> (XEN) NX (Execute Disable) protection active
> (XEN) Dom0 has maximum 1400 PIRQs
> (XEN) *** Building a PVH Dom0 ***
> (XEN) Failed to load kernel: -1
> (XEN) Xen dom0 kernel broken ELF: <NULL>
> (XEN) Failed to load Dom0 kernel
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Could not construct domain 0
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...

Hmm, that's sad. All the more so since the error messages aren't really
informative. You did check though that your kernel is PVH-capable? (With a
debug build of Xen, and with suitably high logging level, various of the ELF
properties would be logged. Such output may or may not give further hints
towards what's actually wrong.
Albeit with you using 4.17, this would further require you to pull in commit
ea3dabfb80d7 ["x86/PVH: allow Dom0 ELF parsing to be verbose"].)

But wait - aren't you running into the same collision there with that memory
region? I think that explains the unhelpful output.

Whereas I assume the native kernel can deal with that as long as it's built
with CONFIG_RELOCATABLE=y. I don't think we want to get into the business of
interpreting the kernel's internal representation of the relocations needed,
so it's not really clear to me what we might do in such a case. Perhaps the
only way is to signal to the kernel that it needs to apply relocations itself
(which in turn would require the kernel to signal to us that it's capable of
doing so). Cc-ing Roger in case he has any neat idea.

Jan
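
For anyone wanting to re-check the overlap described above against their own log, here is a minimal sketch in Python. It is not anything Xen does internally. The map entries and the kernel's paddr range are copied from the quoted output; the helper name `conflicts`, the set `RAM_LIKE_TYPES` (EFI types 1-4 and 7 treated as usable RAM), the reading of the paddr end as exclusive, and the idea that the image keeps its size when linked at a different start are all assumptions made for illustration.

```python
# EFI memory map entries copied from the quoted Xen log:
# (start, inclusive end, EFI memory type). Only the quoted lines are
# listed, so ranges missing from the excerpt are not checked at all.
EFI_MAP = [
    (0x0000000100000, 0x0000003159fff, 2),
    (0x000000315a000, 0x0000003ffffff, 7),
    (0x0000004000000, 0x0000004045fff, 10),  # EfiACPIMemoryNVS
    (0x0000004046000, 0x0000009afefff, 7),
    (0x0000009aff000, 0x0000009ffffff, 0),
    (0x000000b000000, 0x000000b020fff, 0),
    (0x00000a747f000, 0x00000a947efff, 10),  # EfiACPIMemoryNVS
]

# Assumption for this sketch: loader code/data (1, 2), boot services
# code/data (3, 4) and conventional memory (7) count as usable RAM.
RAM_LIKE_TYPES = {1, 2, 3, 4, 7}

# "paddr 0x1000000 -> 0x4a00000" from the log, end taken as exclusive.
KERNEL_START = 0x1000000
KERNEL_END = 0x4a00000
KERNEL_SIZE = KERNEL_END - KERNEL_START


def conflicts(load_start, load_end, memmap=EFI_MAP):
    """Return the non-RAM map entries intersecting [load_start, load_end)."""
    return [
        (start, end, mem_type)
        for (start, end, mem_type) in memmap
        if mem_type not in RAM_LIKE_TYPES
        and start < load_end      # region begins before the image ends
        and end >= load_start     # region ends at or after the image start
    ]


if __name__ == "__main__":
    # 16Mb (as built), 2Mb (as suggested), and the 0x2000000 value asked about.
    for start in (0x1000000, 0x200000, 0x2000000):
        hits = conflicts(start, start + KERNEL_SIZE)
        print(f"start {start:#x}: {[(hex(s), hex(e), t) for s, e, t in hits]}")
```

Under these assumptions the 16Mb start reports the type=10 region at 0x4000000, while a 2Mb start (0x200000) reports nothing. Note that 0x2000000 is 32Mb rather than 2Mb, and with the same image size it runs into that very region again.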