
[PATCH v5] RFC: x86/pvh: Make Xen PVH entrypoint PIC for x86-64


  • To: <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Jason Andryuk <jason.andryuk@xxxxxxx>
  • Date: Tue, 26 Mar 2024 17:47:01 -0400
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, "Juergen Gross" <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Jason Andryuk <jason.andryuk@xxxxxxx>
  • Delivery-date: Tue, 26 Mar 2024 21:47:19 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

The Xen PVH entrypoint is 32bit non-PIC code running at a default load
address of 0x1000000 (16MB) (CONFIG_PHYSICAL_START).  Xen loads the
kernel at that physical address inside the PVH container.

When running a PVH Dom0, the system reserved addresses are mapped 1-1
into the PVH container.  There exist system firmwares (Coreboot/EDK2)
with reserved memory at 16MB.  This creates a conflict where the PVH
kernel cannot be loaded at that address.

Modify the PVH entrypoint to be position-independent to allow flexibility
in load address.  Only the 64bit entry path is converted.  A 32bit
kernel is not PIC, so calling into other parts of the kernel, like
xen_prepare_pvh() and mk_pgtable_32(), doesn't work properly when
relocated.
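
As a rough illustration of the addressing scheme the converted entry
path relies on (a C model, not the assembly in this patch): once the
runtime address of pvh_start_xen is known, every other symbol is reached
as that base plus its link-time distance from pvh_start_xen.  The
link-time addresses and the example_rva() helper below are made-up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical link-time (non-relocated) physical addresses. */
#define LINK_PVH_START_XEN	0x01000400u
#define LINK_GDT		0x01a00010u

/* Link-time distance of a symbol from pvh_start_xen, like rva(x). */
static uint32_t rva(uint32_t link_addr)
{
	return link_addr - LINK_PVH_START_XEN;
}

/*
 * Recover where pvh_start_xen is actually running from the runtime
 * address of any label (the value call/pop yields) and that label's
 * link-time address.
 */
static uint32_t runtime_base(uint32_t label_runtime, uint32_t label_link)
{
	return label_runtime - rva(label_link);
}

/* C analogue of "leal rva(sym)(%ebp), %reg". */
static uint32_t runtime_addr(uint32_t base, uint32_t link_addr)
{
	return base + rva(link_addr);
}

void example_rva(void)
{
	/* Suppose a label linked at 0x01000500 is found at 0x02000500. */
	uint32_t base = runtime_base(0x02000500u, 0x01000500u);

	assert(base == 0x02000400u);	/* 16 MiB above the link address */
	assert(runtime_addr(base, LINK_GDT) == 0x02a00010u);
}
```

The point of the model is that no absolute link-time address is used
directly; everything is derived from the one value obtained at runtime.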

Initial PVH entry runs at the physical addresses and then transitions to
the identity mapped address.  While executing xen_prepare_pvh(), calls
through pv_ops function pointers transition to the high mapped
addresses.  Additionally, __va() is called on some hvm_start_info
physical addresses, so the directmap address range is used as well.  So
we need to run on page tables with all of those ranges mapped.
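
For reference, a small C sketch of which top-level slots those three
ranges land in.  The two base addresses are what I understand arch/x86
to use with 4-level paging; treat them as assumptions here, and
example_indices() is only an illustrative helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for 4-level paging. */
#define START_KERNEL_MAP	0xffffffff80000000ULL	/* __START_KERNEL_map */
#define PAGE_OFFSET_BASE_L4	0xffff888000000000ULL	/* __PAGE_OFFSET_BASE_L4 */

/* Index into the top-level table: bits 47:39 of the virtual address. */
static unsigned int l4_index(uint64_t va)
{
	return (va >> 39) & 511;
}

/* Index into the level-3 table: bits 38:30 (PUD_SHIFT assumed to be 30). */
static unsigned int pud_index(uint64_t va)
{
	return (va >> 30) & 511;
}

void example_indices(void)
{
	assert(l4_index(0) == 0);			/* identity map */
	assert(l4_index(PAGE_OFFSET_BASE_L4) == 273);	/* directmap */
	assert(l4_index(START_KERNEL_MAP) == 511);	/* kernel high map */
	assert(pud_index(START_KERNEL_MAP) == 510);
}
```

Three distinct top-level entries are therefore needed, with the
identity map and the directmap able to share one lower-level table.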

Modifying init_top_pgt tables ran into issues since
startup_64/__startup_64() will modify those page tables again.  Use a
dedicated set of page tables - pvh_init_top_pgt - for the PVH entry to
avoid unwanted interactions.
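
The relocation applied to one page of these tables can be modelled in C
roughly as below.  This is a simplified sketch, not the code in the
patch: it treats an entry as empty only when the whole 64-bit value is
zero, and relocate_table()/example_relocate() are names made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_TABLE 512

/*
 * Add the load offset to every populated entry of one page-table page.
 * Only the low 32 bits are adjusted, as 32-bit entry code would do;
 * this assumes the relocated addresses stay below 4 GiB.
 */
static void relocate_table(uint64_t table[PTRS_PER_TABLE], uint32_t offset)
{
	for (int i = 0; i < PTRS_PER_TABLE; i++) {
		uint32_t lo;

		if (table[i] == 0)
			continue;	/* empty slot: leave it empty */

		lo = (uint32_t)table[i] + offset;
		table[i] = (table[i] & 0xffffffff00000000ULL) | lo;
	}
}

void example_relocate(void)
{
	static uint64_t table[PTRS_PER_TABLE];

	table[0] = 0x01a03000ULL | 0x63;	/* hypothetical address | flags */
	relocate_table(table, 0x01000000u);

	assert(table[0] == (0x02a03000ULL | 0x63));	/* flags preserved */
	assert(table[1] == 0);				/* still empty */
}
```

Because the offset is a multiple of the required alignment, adding it
to an entry moves the address bits and leaves the low flag bits alone.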

In xen_pvh_init(), __pa() is called to find the physical address of the
hypercall page.  Set phys_base temporarily before calling into
xen_prepare_pvh(), which calls xen_pvh_init(), and clear it afterwards.
__startup_64() assumes phys_base is zero and adds load_delta to it.  If
phys_base is already set, the calculation results in an incorrect cr3.
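
A toy C model of why phys_base has to be cleared again.  It only
captures the addition described above and ignores anything else
__startup_64() does; the load address and the function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000ULL	/* default CONFIG_PHYSICAL_START */

/*
 * Model of the one step that matters here: the load delta is *added*
 * to whatever phys_base already holds.
 */
static uint64_t startup_64_phys_base(uint64_t phys_base, uint64_t load_addr)
{
	uint64_t load_delta = load_addr - LOAD_PHYSICAL_ADDR;

	return phys_base + load_delta;
}

void example_phys_base(void)
{
	uint64_t load_addr = 0x2000000ULL;	/* hypothetical load address */
	uint64_t delta = load_addr - LOAD_PHYSICAL_ADDR;

	/* Cleared before startup_64: ends up as the true delta. */
	assert(startup_64_phys_base(0, load_addr) == delta);

	/* Left set by the PVH entry: the delta is counted twice. */
	assert(startup_64_phys_base(delta, load_addr) == 2 * delta);
}
```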

TODO: Sync elfnote.h from xen.git commit xxxxxxxxxx when it is
committed.

Signed-off-by: Jason Andryuk <jason.andryuk@xxxxxxx>
---
Put this out as an example for the Xen modifications

Instead of setting and clearing phys_base, add a dedicated variable?
Clearing phys_base is a little weird, but it leaves the kernel more
consistent when running non-entry code.

Make __startup_64() exit if phys_base is already set to allow calling
multiple times, and use that and init_top_pgt instead of adding
additional page tables?  That won't work.  __startup_64 is 64bit code,
and pvh needs to create page tables in 32bit code before it can
transition to 64bit long mode.  Hence it can't be re-used to relocate
page tables.
---
 arch/x86/platform/pvh/head.S    | 184 +++++++++++++++++++++++++++++---
 include/xen/interface/elfnote.h |  18 +++-
 2 files changed, 189 insertions(+), 13 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index f7235ef87bc3..13cfd4a35462 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -50,11 +50,32 @@
 #define PVH_CS_SEL             (PVH_GDT_ENTRY_CS * 8)
 #define PVH_DS_SEL             (PVH_GDT_ENTRY_DS * 8)
 
+#define rva(x) ((x) - pvh_start_xen)
+
 SYM_CODE_START_LOCAL(pvh_start_xen)
        UNWIND_HINT_END_OF_STACK
        cld
 
-       lgdt (_pa(gdt))
+       /*
+        * See the comment for startup_32 for more details.  We need to
+        * execute a call to get the execution address to be position
+        * independent, but we don't have a stack.  Save and restore the
+        * magic field of start_info in ebx, and use that as the stack.
+        */
+       mov     (%ebx), %eax
+       leal    4(%ebx), %esp
+       ANNOTATE_INTRA_FUNCTION_CALL
+       call    1f
+1:     popl    %ebp
+       mov     %eax, (%ebx)
+       subl    $ rva(1b), %ebp
+       movl    $0, %esp
+
+       leal    rva(gdt)(%ebp), %eax
+       movl    %eax, %ecx
+       leal    rva(gdt_start)(%ebp), %ecx
+       movl    %ecx, 2(%eax)
+       lgdt    (%eax)
 
        mov $PVH_DS_SEL,%eax
        mov %eax,%ds
@@ -62,14 +83,14 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
        mov %eax,%ss
 
        /* Stash hvm_start_info. */
-       mov $_pa(pvh_start_info), %edi
+       leal rva(pvh_start_info)(%ebp), %edi
        mov %ebx, %esi
-       mov _pa(pvh_start_info_sz), %ecx
+       movl rva(pvh_start_info_sz)(%ebp), %ecx
        shr $2,%ecx
        rep
        movsl
 
-       mov $_pa(early_stack_end), %esp
+       leal rva(early_stack_end)(%ebp), %esp
 
        /* Enable PAE mode. */
        mov %cr4, %eax
@@ -83,29 +104,83 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
        btsl $_EFER_LME, %eax
        wrmsr
 
+       mov %ebp, %ebx
+       subl $LOAD_PHYSICAL_ADDR, %ebx /* offset */
+       jz .Lpagetable_done
+
+       /* Fixup page-tables for relocation. */
+       leal rva(pvh_init_top_pgt)(%ebp), %edi
+       movl $512, %ecx
+2:
+       movl 0x00(%edi), %eax
+       addl 0x04(%edi), %eax
+       jz 1f
+       addl %ebx, 0x00(%edi)
+1:
+       addl $8, %edi
+       decl %ecx
+       jnz 2b
+
+       /* L3 ident has a single entry. */
+       leal rva(pvh_level3_ident_pgt)(%ebp), %edi
+       addl %ebx, 0x00(%edi)
+
+       leal rva(pvh_level3_kernel_pgt)(%ebp), %edi
+       addl %ebx, (4096 - 16)(%edi)
+       addl %ebx, (4096 - 8)(%edi)
+
+       /* pvh_level2_ident_pgt is fine - large pages */
+
+       /* pvh_level2_kernel_pgt needs adjustment - large pages */
+       leal rva(pvh_level2_kernel_pgt)(%ebp), %edi
+       movl $512, %ecx
+2:
+       movl 0x00(%edi), %eax
+       addl 0x04(%edi), %eax
+       jz 1f
+       addl %ebx, 0x00(%edi)
+1:
+       addl $8, %edi
+       decl %ecx
+       jnz 2b
+
+.Lpagetable_done:
        /* Enable pre-constructed page tables. */
-       mov $_pa(init_top_pgt), %eax
+       leal rva(pvh_init_top_pgt)(%ebp), %eax
        mov %eax, %cr3
        mov $(X86_CR0_PG | X86_CR0_PE), %eax
        mov %eax, %cr0
 
        /* Jump to 64-bit mode. */
-       ljmp $PVH_CS_SEL, $_pa(1f)
+       pushl $PVH_CS_SEL
+       leal  rva(1f)(%ebp), %eax
+       pushl %eax
+       lretl
 
        /* 64-bit entry point. */
        .code64
 1:
        /* Set base address in stack canary descriptor. */
        mov $MSR_GS_BASE,%ecx
-       mov $_pa(canary), %eax
+       leal rva(canary)(%ebp), %eax
        xor %edx, %edx
        wrmsr
 
+       /* Calculate load offset from LOAD_PHYSICAL_ADDR and store in
+        * phys_base.  __pa() needs phys_base set to calculate the
+        * hypercall page in xen_pvh_init(). */
+       movq %rbp, %rbx
+       subq $LOAD_PHYSICAL_ADDR, %rbx
+       movq %rbx, phys_base(%rip)
        call xen_prepare_pvh
+       /* Clear phys_base.  startup_64/__startup_64 will *add* to its value,
+          so start from 0. */
+       xor  %rbx, %rbx
+       movq %rbx, phys_base(%rip)
 
        /* startup_64 expects boot_params in %rsi. */
-       mov $_pa(pvh_bootparams), %rsi
-       mov $_pa(startup_64), %rax
+       lea rva(pvh_bootparams)(%ebp), %rsi
+       lea rva(startup_64)(%ebp), %rax
        ANNOTATE_RETPOLINE_SAFE
        jmp *%rax
 
@@ -137,13 +212,14 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
 
        ljmp $PVH_CS_SEL, $_pa(startup_32)
 #endif
+
 SYM_CODE_END(pvh_start_xen)
 
        .section ".init.data","aw"
        .balign 8
 SYM_DATA_START_LOCAL(gdt)
-       .word gdt_end - gdt_start
-       .long _pa(gdt_start)
+       .word gdt_end - gdt_start - 1
+       .long _pa(gdt_start) /* x86-64 will overwrite if relocated. */
        .word 0
 SYM_DATA_END(gdt)
 SYM_DATA_START_LOCAL(gdt_start)
@@ -163,5 +239,89 @@ SYM_DATA_START_LOCAL(early_stack)
        .fill BOOT_STACK_SIZE, 1, 0
 SYM_DATA_END_LABEL(early_stack, SYM_L_LOCAL, early_stack_end)
 
+#ifdef CONFIG_X86_64
+/*
+ * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
+ * because we need identity-mapped pages.
+ */
+#define l4_index(x)     (((x) >> 39) & 511)
+#define pud_index(x)    (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+
+L4_PAGE_OFFSET  = l4_index(__PAGE_OFFSET_BASE_L4)
+L4_START_KERNEL = l4_index(__START_KERNEL_map)
+L3_START_KERNEL = pud_index(__START_KERNEL_map)
+
+#define SYM_DATA_START_PAGE_ALIGNED(name)                      \
+       SYM_START(name, SYM_L_GLOBAL, .balign PAGE_SIZE)
+
+/* Automate the creation of 1 to 1 mapping pmd entries */
+#define PMDS(START, PERM, COUNT)                       \
+       i = 0 ;                                         \
+       .rept (COUNT) ;                                 \
+       .quad   (START) + (i << PMD_SHIFT) + (PERM) ;   \
+       i = i + 1 ;                                     \
+       .endr
+
+/*
+ * Xen PVH needs a set of identity mapped and kernel high mapping
+ * page tables.  pvh_start_xen starts running on the identity mapped
+ * page tables, but xen_prepare_pvh calls into the high mapping.
+ * These page tables need to be relocatable and are only used until
+ * startup_64 transitions to init_top_pgt.
+ */
+SYM_DATA_START_PAGE_ALIGNED(pvh_init_top_pgt)
+       .quad   pvh_level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+       .org    pvh_init_top_pgt + L4_PAGE_OFFSET*8, 0
+       .quad   pvh_level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+       .org    pvh_init_top_pgt + L4_START_KERNEL*8, 0
+       /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+       .quad   pvh_level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+SYM_DATA_END(pvh_init_top_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(pvh_level3_ident_pgt)
+       .quad   pvh_level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+       .fill   511, 8, 0
+SYM_DATA_END(pvh_level3_ident_pgt)
+SYM_DATA_START_PAGE_ALIGNED(pvh_level2_ident_pgt)
+       /*
+        * Since I easily can, map the first 1G.
+        * Don't set NX because code runs from these pages.
+        *
+        * Note: This sets _PAGE_GLOBAL despite whether
+        * the CPU supports it or it is enabled.  But,
+        * the CPU should ignore the bit.
+        */
+       PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
+SYM_DATA_END(pvh_level2_ident_pgt)
+SYM_DATA_START_PAGE_ALIGNED(pvh_level3_kernel_pgt)
+       .fill   L3_START_KERNEL,8,0
+       /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
+       .quad   pvh_level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+       .quad   0 /* no fixmap */
+SYM_DATA_END(pvh_level3_kernel_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(pvh_level2_kernel_pgt)
+       /*
+        * Kernel high mapping.
+        *
+        * The kernel code+data+bss must be located below KERNEL_IMAGE_SIZE in
+        * virtual address space, which is 1 GiB if RANDOMIZE_BASE is enabled,
+        * 512 MiB otherwise.
+        *
+        * (NOTE: after that starts the module area, see MODULES_VADDR.)
+        *
+        * This table is eventually used by the kernel during normal runtime.
+        * Care must be taken to clear out undesired bits later, like _PAGE_RW
+        * or _PAGE_GLOBAL in some cases.
+        */
+       PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
+SYM_DATA_END(pvh_level2_kernel_pgt)
+
+       ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_RELOC,
+                    .long CONFIG_PHYSICAL_ALIGN;
+                    .long LOAD_PHYSICAL_ADDR;
+                    .long KERNEL_IMAGE_SIZE - 1)
+#endif
+
        ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,
-                    _ASM_PTR (pvh_start_xen - __START_KERNEL_map))
+                    .long (pvh_start_xen - __START_KERNEL_map))
diff --git a/include/xen/interface/elfnote.h b/include/xen/interface/elfnote.h
index 38deb1214613..9ebd4e79bb41 100644
--- a/include/xen/interface/elfnote.h
+++ b/include/xen/interface/elfnote.h
@@ -185,9 +185,25 @@
  */
 #define XEN_ELFNOTE_PHYS32_ENTRY 18
 
+/*
+ * Physical loading constraints for PVH kernels
+ *
+ * Used to place constraints on the guest physical loading addresses and
+ * alignment for a PVH kernel.
+ *
+ * The presence of this note indicates the kernel supports relocating itself.
+ *
+ * The note may include up to three 32bit values in the following order:
+ *  - a maximum address for the entire image to be loaded below (default
+ *      0xffffffff)
+ *  - a minimum address for the start of the image (default 0)
+ *  - a required start alignment (default 0x200000)
+ */
+#define XEN_ELFNOTE_PHYS32_RELOC 19
+
 /*
  * The number of the highest elfnote defined.
  */
-#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_ENTRY
+#define XEN_ELFNOTE_MAX XEN_ELFNOTE_PHYS32_RELOC
 
 #endif /* __XEN_PUBLIC_ELFNOTE_H__ */
-- 
2.44.0