
Re: [Xen-devel] [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables





On 06/08/15 20:52, Andrew Cooper wrote:
On 06/08/15 17:45, Ben Catterall wrote:
The paging structure mappings for the deprivileged mode are added
to the monitor page table for HVM guests. The entries are generated by
walking the page tables and mapping in new pages. If a higher-level page table
mapping exists, we attempt to traverse all the way down to a leaf page and add
the page there. Access bits are flipped as needed.

The page entries are generated for the deprivileged .text, .data and a stack. The
data is copied from sections allocated by the linker. The mappings are set up in
an unused portion of the Xen virtual address space. The pages are mapped in as
user-mode accessible, with NX bits set for the data and stack regions; the
code region is executable and read-only.

The needed pages are allocated on the HAP page heap and are deallocated when
those heap pages are deallocated (on domain destruction).

Signed-off-by: Ben Catterall <Ben.Catterall@xxxxxxxxxx>
---
  xen/arch/x86/hvm/Makefile          |   1 +
  xen/arch/x86/hvm/deprivileged.c    | 441 +++++++++++++++++++++++++++++++++++++
  xen/arch/x86/mm/hap/hap.c          |  10 +-
  xen/arch/x86/xen.lds.S             |  19 ++
  xen/include/asm-x86/config.h       |  10 +-
  xen/include/xen/hvm/deprivileged.h |  94 ++++++++
  6 files changed, 573 insertions(+), 2 deletions(-)
  create mode 100644 xen/arch/x86/hvm/deprivileged.c
  create mode 100644 xen/include/xen/hvm/deprivileged.h

diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 794e793..bd83ba3 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -16,6 +16,7 @@ obj-y += pmtimer.o
  obj-y += quirks.o
  obj-y += rtc.o
  obj-y += save.o
+obj-y += deprivileged.o

We attempt to keep objects linked in alphabetical order, where possible.

apologies
  obj-y += stdvga.o
  obj-y += vioapic.o
  obj-y += viridian.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
new file mode 100644
index 0000000..071d900
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -0,0 +1,441 @@
+/*
+ * HVM deprivileged mode to provide support for running operations in
+ * user mode from Xen
+ */
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <xen/compiler.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <xen/hvm/deprivileged.h>
+
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
+{
+    int ret;
+    void *p;
+    unsigned long int size;
+
+    /* Copy the .text segment for deprivileged code */

Why do you need to copy the deprivileged text section at all?  It is
read only and completely common.  All you should need to do is make
userspace mappings to it where it resides in the Xen linked area.

tbh: at the time, I was unsure whether it could open a hole: if Xen switched over to superpages and the deprivileged code was not aligned on a 4KB page boundary, mapping it in might prove problematic, mainly because we'd map in some of Xen's ring 0 code by accident. That could prove useful to an attacker, who could then call it from userspace.

Though, now I've thought about it more, I can ensure in the linker that it's aligned properly and I think map_pages_to_xen() will do what I need.

i.e. this is an operation which should be performed once at startup, not
once per domain.  You will also reduce your memory overhead by doing this.

got it.
+    size = (unsigned long int)&__hvm_deprivileged_text_end -

Don't bother qualifying unsigned long with a further "int".  A plain
"unsigned long" is the expected declaration.
understood.

+           (unsigned long int)&__hvm_deprivileged_text_start;
+
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)&__hvm_deprivileged_text_start,
+                              (unsigned long int)HVM_DEPRIVILEGED_TEXT_ADDR,
+                              size, 0 /* No write */);
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv .text. Code: %d\n", ret);
+        domain_crash(d);
+        return;
+    }
+
+    /* Copy the .data segment for ring3 code */
+    size = (unsigned long int)&__hvm_deprivileged_data_end -
+           (unsigned long int)&__hvm_deprivileged_data_start;
+
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)&__hvm_deprivileged_data_start,
+                              (unsigned long int)HVM_DEPRIVILEGED_DATA_ADDR,
+                              size, _PAGE_NX | _PAGE_RW);

What do you expect to end up in a data section like this?

It's for passing configuration data in at the start, when we run the device (to reduce the number of system calls needed), and for local data which the emulated devices may need. It can also be used to hold the results of any system calls the deprivileged code makes.
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv .data. Code: %d\n", ret);
+        domain_crash(d);
+        return;
+    }
+
+    /* Setup the stack mappings.

Xen recommended style for multiline comments is

/*
  * Setup ...

apologies. I'll revise all of these.
+     * This is a bit of a hack: by allocating a blank area
+     * we can reuse the hvm_deprivileged_copy_pages code.
+     * The stacks are actually allocated in the per-vcpu initialisation
+     * routine.
+     */
+    size = STACK_SIZE;
+
+    p = alloc_xenheap_pages(STACK_ORDER, 0);
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)p,
+                              (unsigned long int)HVM_DEPRIVILEGED_STACK_ADDR,
+                              size, _PAGE_NX | _PAGE_RW);
+
+    free_xenheap_pages(p, STACK_ORDER);

The userspace stacks really don't need to be this large.  Xens primary
stack, which runs all of this code normally, is limited at 2 pages,
which is 1/4 of STACK_SIZE.

got it. will do.
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv stack. Code: %d\n", ret);
+        domain_crash(d);
+        return;
+    }
+}
+
+void hvm_deprivileged_destroy(struct domain *d)
+{
+    struct page_info *pg = NULL;
+
+    /* Free up all the pages we allocated from the HAP HVM list */
+    while( (pg = page_list_remove_head(&d->arch.paging.hap.hvmlist)) )
+    {
+        d->arch.paging.hap.free_pages++;
+        page_list_add_tail(pg, &d->arch.paging.hap.freelist);
+    }

See c/s 0174da5b7 for why loops like this are bad in general.  However,
if you use the shadow pool, this should be able to disappear.

I'll switch pools as you suggested.
+}
+
+/* Create a copy of the data at the specified virtual address.
+ * When writing to the destination, it will walk the page table hierarchy, creating
+ * new levels as needed.
+ *
+ * If we find a leaf entry in a page table (one which holds the
+ * mfn of a 4KB, 2MB, etc. page frame) which has already been
+ * mapped in, then we bail as we have a collision and this likely
+ * means a bug or the memory configuration has been changed.
+ *
+ * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
+ * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
+ * The PAGE_RW flag will be enabled for all page structure entries
+ * above the leaf page if that leaf page has PAGE_RW set. This is needed to
+ * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
+ *
+ * TODO: We proceed down to L1 4KB pages and then map these in. We should
+ * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
+ * page access. When we stop would depend on size (e.g. use 2MB pages for a
+ * few MBs)
+ */
+int hvm_deprivileged_copy_pages(struct domain *d,
+                            l4_pgentry_t *l4t_base,
+                            unsigned long int src_start,
+                            unsigned long int dst_start,
+                            unsigned long int size,
+                            unsigned int l1_flags)

<large snip>

Does map_pages_to_xen() not do everything you need here?  It certainly
should be fine for the .text section, and will keep the idle and monitor
tables up to date for you.

As it updates the idle tables, I haven't used it. From my understanding, the idle tables are also used by PV guests: when a guest is created, the idle table is copied from FIRST_XEN_SLOT up to the end of the table. In config.h, the region where I'm mapping in my new mode has been allocated for PV guest-defined usage. I could edit init_guest_l4_table so that it doesn't copy this last slot across, but I don't know if that would have further repercussions.

What would you suggest? Thanks!

Either way, it would be useful to start early on with a description of
what the new memory layout is expected to be, and which bits are global,
which are per pcpu and which are per vcpu.

Ok, I'll add this to the comments. Briefly:

The .text mappings are global for all HVM vcpus.
The data mappings need to be per-vcpu, as each vcpu's depriv mode has its own local data and configuration.
The stack mappings need to be per-vcpu for the same reason.

These are mapped in at 0xffffff8000000000, which is currently unused for HVM guests.

I'll revise the global page bits as, after further consideration (and our discussion clarifying them), they are no longer needed.


diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 593cba7..abc5113 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -40,7 +40,8 @@
  #include <asm/domain.h>
  #include <xen/numa.h>
  #include <asm/hvm/nestedhvm.h>
-
+#include <xen/hvm/deprivileged.h>
+#include <asm/hvm/vmx/vmx.h>

This looks like a spurious addition.

apologies.
  #include "private.h"

  /* Override macros from asm/page.h to make them work with mfn_t */
@@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
             &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
             ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));

+    /* Initialise the HVM deprivileged mode feature */
+    hvm_deprivileged_init(d, l4e);
+
      /* Install the per-domain mappings for this domain */
      l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
          l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
@@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)

      /* Put the memory back in the pool */
      hap_free(d, mmfn);
+
+    /* Destroy the HVM tables */
+    ASSERT(paging_locked_by_me(d));
+    hvm_deprivileged_destroy(d);
  }

  /************************************************/
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 6553cff..0bfe0cf 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -50,6 +50,25 @@ SECTIONS
         _etext = .;             /* End of text section */
    } :text = 0x9090

+  /* HVM deprivileged mode segments
+   * Used to map the ring3 static data and .text
+   */
+
+  . = ALIGN(PAGE_SIZE);
+  .hvm_deprivileged_text : {
+      __hvm_deprivileged_text_start = . ;
+      *(.hvm_deprivileged_enhancement.text)
+      __hvm_deprivileged_text_end = . ;
+  } : text
+
+  . = ALIGN(PAGE_SIZE);
+  .hvm_deprivileged_data : {
+      __hvm_deprivileged_data_start = . ;
+      *(.hvm_deprivileged_enhancement.data)
+      __hvm_deprivileged_data_end = . ;
+  } : text
+
+  . = ALIGN(PAGE_SIZE);
    .rodata : {
         /* Bug frames table */
         . = ALIGN(4);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3e9be83..302399d 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -186,7 +186,7 @@ extern unsigned char boot_edid_info[128];
   *  0xffff880000000000 - 0xffffff7fffffffff [119.5TB,           PML4:272-510]
   *    HVM/idle: continuation of 1:1 mapping
   *  0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes  PML4:511]
- *    HVM/idle: unused
+ *    HVM: HVM deprivileged mode
   *
   * Compatibility guest area layout:
   *  0x0000000000000000 - 0x00000000f57fffff [3928MB,            PML4:0]
@@ -280,6 +280,14 @@ extern unsigned char boot_edid_info[128];
  #endif
  #define DIRECTMAP_VIRT_END      (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)

+/* Slot 511: HVM deprivileged mode
+ * The virtual addresses where the .text, .data and stack should be
+ * placed.
+ */
+#define HVM_DEPRIVILEGED_TEXT_ADDR  (PML4_ADDR(511))
+#define HVM_DEPRIVILEGED_DATA_ADDR  (PML4_ADDR(511) + 0xa000000)
+#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
+
  #ifndef __ASSEMBLY__

  /* This is not a fixed value, just a lower limit. */
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
new file mode 100644
index 0000000..6cc803e
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -0,0 +1,94 @@
+#ifndef __X86_HVM_DEPRIVILEGED
+#define __X86_HVM_DEPRIVILEGED
+
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
+
+#include <asm/page.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+
+/* Initialise the HVM deprivileged mode. This just sets up the general
+ * page mappings for .text and .data. It does not prepare each HVM vcpu's data
+ * or stack which needs to be done separately using
+ * hvm_deprivileged_prepare_vcpu.
+ */
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e);
+
+/* Free up the data used by the HVM deprivileged enhancements.
+ * This frees general page mappings. It does not destroy the per-vcpu
+ * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
+ * This method should be called after those per-vcpu destruction routines.
+ */
+void hvm_deprivileged_destroy(struct domain *d);
+
+/* Create a copy of the data at the specified virtual address.
+ * When writing to the destination, it will walk the page table hierarchy,
+ * creating new levels as needed, and then copy the data across.
+ */
+int hvm_deprivileged_copy_pages(struct domain *d,
+                                l4_pgentry_t *l4e_base,
+                                unsigned long int src_start,
+                                unsigned long int dst_start,
+                                unsigned long int size,
+                                unsigned int l1_flags);
+
+int hvm_deprivileged_copy_l3(struct domain *d,
+                             l3_pgentry_t *l3t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+int hvm_deprivileged_copy_l2(struct domain *d,
+                             l2_pgentry_t *l2t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+/* The leaf case of the copy. Will allocate the pages and actually copy the
+ * data.
+ */
+int hvm_deprivileged_copy_l1(struct domain *d,
+                             l1_pgentry_t *l1t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+
+/* The segments where the user mode .text and .data are stored */
+extern unsigned long int __hvm_deprivileged_text_start;
+extern unsigned long int __hvm_deprivileged_text_end;
+extern unsigned long int __hvm_deprivileged_data_start;
+extern unsigned long int __hvm_deprivileged_data_end;

extern unsigned long __hvm_$foo[];

Will be a more useful type to work with, and harder to accidentally misuse.

thanks, will do!
+
+#endif
+
+/* The sizes of the pages */
+#define HVM_1GB_PAGESIZE 0x40000000ul
+#define HVM_2MB_PAGESIZE 0x00200000ul
+#define HVM_4KB_PAGESIZE 0x00001000ul
+
+/* The size in bytes that a single L{1,2,3,4} entry covers.
+ * There are 512 (left shift by 9) entries in each page-structure.
+ */
+#define HVM_L4_PAGE_RANGE (HVM_1GB_PAGESIZE << 9)
+#define HVM_L3_PAGE_RANGE (HVM_2MB_PAGESIZE << 9)
+#define HVM_L2_PAGE_RANGE (HVM_4KB_PAGESIZE << 9)
+#define HVM_L1_PAGE_RANGE (HVM_4KB_PAGESIZE     )

Values like this should be straight from page.h rather than redefining
here.  if page.h is missing some, that is the correct place to add them.

got it.
~Andrew

+
+#define HVM_ERR_PG_ALLOC -1
+
+#endif


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

