
[Xen-devel] [PATCH RFC v3 5/6] HVM x86 deprivileged mode: Syscall and deprivileged operation dispatcher



We have two operations:
1) dispatching a deprivileged mode operation
2) deprivileged mode executing a system call

For (1):
We have a table of methods which can be dispatched. All deprivileged mode
methods which can be dispatched need to be in this array. This aims to
prevent dispatching functions which are not designed for deprivileged mode
and means that we do not dispatch on an arbitrary pointer. We then dispatch
to the function pointer stored in this array. This goes via an assembly stub
in deprivileged mode which calls the function and then issues a syscall to
return to privileged mode when the operation completes. This allows the
deprivileged function to return normally.
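As a rough illustration of the checks described above, the sketch below models the lookup done by hvm_deprivileged_user_mode() as it might run on an ordinary host. It bounds-checks the operation number, checks that the table entry lies inside the deprivileged text section, and then rebases the entry onto the ring 3 alias of that section. The addresses, operation_table, resolve_operation() and DISPATCH_ERR are hypothetical stand-ins, and plain integers replace real function pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define DISPATCH_ERR (-1)

/*
 * Hypothetical layout: where the deprivileged text section lives in Xen's
 * own mapping, and the address at which it is aliased for ring 3.
 */
static const uintptr_t text_start = 0x1000;
static const uintptr_t text_end   = 0x2000;
static const uintptr_t text_alias = 0x800000;

/* Hypothetical operation table holding privileged-mapping addresses. */
static const uintptr_t operation_table[] = {
    0x1010,   /* op 0 */
    0x1200,   /* op 1 */
    0x3000,   /* op 2: deliberately outside the text section */
};

/*
 * Translate an operation number into the address ring 3 should call.
 * Returns 0 and writes *target on success, DISPATCH_ERR otherwise.
 */
int resolve_operation(unsigned long op, uintptr_t *target)
{
    uintptr_t f, offset;

    /* Only indices into the table can be dispatched. */
    if ( op >= ARRAY_SIZE(operation_table) )
        return DISPATCH_ERR;

    f = operation_table[op];

    /* The entry must lie inside the deprivileged text section. */
    if ( f < text_start || f >= text_end )
        return DISPATCH_ERR;

    /* Rebase onto the ring 3 alias, refusing to wrap on the addition. */
    offset = f - text_start;
    if ( UINTPTR_MAX - offset < text_alias )
        return DISPATCH_ERR;

    *target = text_alias + offset;
    return 0;
}
```

Each DISPATCH_ERR return corresponds to one of the domain_crash() paths in the patch.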

For (2):
We again have a table of methods which are the syscall handlers. All
system calls which we handle need to be in this table. Deprivileged mode
passes an integer to select which operation to call. The system call is
wrapped to marshal the parameters as necessary and then jumps to a stub which
issues the syscall.

Data transfer for (1):
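To pass data to deprivileged mode, we can pass up to five integer or
pointer parameters in registers. This relies on the 64-bit Linux (System V
AMD64) calling convention, which passes the first six integer or pointer
arguments in 64-bit registers; the dispatcher uses one of them (rdi) for the
address of the operation, which leaves five for parameters. The dispatch code takes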
these parameters and arranges them so that when the deprivileged mode
operation executes, they are in the registers specified by the calling
convention. This means that it is transparent to the operation that it was
not invoked by a function call.
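The register shuffle performed by hvm_deprivileged_ring3 can be modelled in C as below. This is only a model: struct regs and shuffle_for_call() are hypothetical, and each field stands for the register of the same name. On arrival in ring 3 the operation address is in rdi and the five parameters are in rsi, rdx, r10, r8 and r9 (r10 replaces rcx because sysret uses rcx for the return RIP). After the shuffle, the parameters sit in the first five System V argument registers.

```c
#include <assert.h>

/* Hypothetical model of the integer registers involved in the shuffle. */
struct regs {
    unsigned long rdi, rsi, rdx, rcx, r8, r9, r10, r15;
};

/*
 * Move the target out of rdi and shift the parameters down one slot so
 * that a..e end up in rdi, rsi, rdx, rcx and r8.
 */
void shuffle_for_call(struct regs *r)
{
    r->r15 = r->rdi;   /* f: the operation to call */
    r->rdi = r->rsi;   /* a */
    r->rsi = r->rdx;   /* b */
    r->rdx = r->r10;   /* c: carried in r10 across sysret */
    r->rcx = r->r8;    /* d */
    r->r8  = r->r9;    /* e */
}
```

Because rdi is consumed by the operation address, only five of the six argument registers are left for parameters, which is where the limit of five comes from.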

To pass the data which pointers correspond to, we use the deprivileged data
section. We copy this data to the section and change the pointer so that it
points into this section. Any extra parameters are also copied into the
data section.
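A minimal sketch of the copy-in and pointer rewrite, assuming a fixed-size data section managed as a bump allocator. struct depriv_data, DATA_SECTION_SIZE, struct write_req, copy_data_to() and marshal_write_req() are hypothetical; in the patch the offset lives in vcpu->arch.hvm_vcpu.depriv_data_offset. The bounds check compares against the remaining space instead of computing offset + size, which avoids the wrapping concern noted in the TODO list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DATA_SECTION_SIZE 256   /* hypothetical section size */

struct depriv_data {
    unsigned char section[DATA_SECTION_SIZE];
    size_t offset;              /* next free byte in section */
};

/* Hypothetical argument block whose buffer pointer must be rewritten. */
struct write_req {
    unsigned int port;
    const unsigned char *buf;
    size_t len;
};

/* Copy size bytes into the section; return the copy or NULL if full. */
void *copy_data_to(struct depriv_data *d, const void *src, size_t size)
{
    void *dst;

    if ( d->offset > DATA_SECTION_SIZE ||
         size > DATA_SECTION_SIZE - d->offset )
        return NULL;

    dst = d->section + d->offset;
    memcpy(dst, src, size);
    d->offset += size;
    return dst;
}

/* Copy req->buf into the section and point req->buf at the copy. */
int marshal_write_req(struct depriv_data *d, struct write_req *req)
{
    const unsigned char *copy = copy_data_to(d, req->buf, req->len);

    if ( copy == NULL )
        return -1;

    req->buf = copy;
    return 0;
}
```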

To return data from deprivileged mode, the operation can supply a return
value which we pass back to the caller. If the caller needs extra data,
for example to make logical decisions after the operation returns, the
operation places it at the end of the data section, where the caller can
then access it. We copy back the data which we
initially copied in so that the caller sees any changes made by the callee.
NOTE: You need to handle the case where these structures can be updated
whilst in deprivileged mode.

It is necessary to clear out the data page between deprivileged mode
operations to prevent data, which _may_ be useful to an attacker, from
leaking from one operation to the next.
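A minimal sketch of the copy-back and clearing steps, under the same assumption of a fixed-size data section. struct depriv_data, copy_data_from() and end_operation() are hypothetical names. The point it illustrates is that the bytes to clear after a copy-back are the ones just copied from (src), and that the whole used region can be wiped and the offset reset once the operation has finished.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_SECTION_SIZE 256   /* hypothetical section size */

struct depriv_data {
    unsigned char section[DATA_SECTION_SIZE];
    size_t offset;              /* next free byte in section */
};

/*
 * Copy size bytes at src (inside the section) back to dest, then zero
 * them. Returns dest, or NULL if src..src+size is not inside the section.
 */
void *copy_data_from(struct depriv_data *d, void *dest, void *src,
                     size_t size)
{
    uintptr_t base = (uintptr_t)d->section;
    uintptr_t p = (uintptr_t)src;

    if ( p < base || p - base > DATA_SECTION_SIZE ||
         size > DATA_SECTION_SIZE - (p - base) )
        return NULL;

    memcpy(dest, src, size);

    /* Clear the bytes just copied from, not the next free region. */
    memset(src, 0, size);
    return dest;
}

/* Wipe everything handed out during this operation and start afresh. */
void end_operation(struct depriv_data *d)
{
    memset(d->section, 0, d->offset);
    d->offset = 0;
}
```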

Data transfer for (2):
We need to transfer data to the syscall handler and then back to the
deprivileged mode operation. To pass data, we use the same method as in (1)
for the first five parameters. For extra data, this will be placed at the
end of the data section and will be fetched by the handler. We also use
the same method as in (1) for passing data back to the operation.

The general process to create a deprivileged mode operation is as follows:
 - Keep the old method prototype the same so that callers do not need to be
   modified. This helps to reduce the impact of this feature on the rest of the
   code base.
 - Move the old code into a new depriv_F version of the function.
 - Marshal and unmarshal arguments as needed in the old function.
 - Call the depriv version using the depriv<n>(F, params) wrappers (depriv0 to
   depriv5, where n is the number of parameters). These wrap
   hvm_deprivileged_user_mode(F, params) in case we want to change this
   interface later or need better/extra argument marshalling.
 - Use the return code to work out what further processing is needed, then
   return.
 - Add an entry to the depriv_operation_table and add an operation number.
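The steps above can be sketched as a host-side model. The comments number the bullets in order. example_scale, depriv_example_scale, DEPRIV_OPERATION_example_scale and fake_user_mode() are hypothetical, and fake_user_mode() stands in for hvm_deprivileged_user_mode() by calling the table entry directly instead of switching to ring 3. The operation is declared with the five register-sized parameters of the function pointer type so that the indirect call is well defined in C.

```c
#include <assert.h>

typedef unsigned long reg_t;   /* stands in for the patch's register_t */
typedef reg_t (*depriv_fn_t)(reg_t, reg_t, reg_t, reg_t, reg_t);

/* Hypothetical operation number, as it would be defined in the header. */
#define DEPRIV_OPERATION_example_scale 0

/* Step 2: the original body, moved into a depriv_ version. */
static reg_t depriv_example_scale(reg_t value, reg_t factor, reg_t c,
                                  reg_t d, reg_t e)
{
    (void)c; (void)d; (void)e;   /* unused argument slots */
    return value * factor;
}

/* Step 6: the operation table entry, indexed by the operation number. */
static const depriv_fn_t operation_table[] = {
    [DEPRIV_OPERATION_example_scale] = depriv_example_scale,
};

/* Host-side stand-in for hvm_deprivileged_user_mode(). */
static reg_t fake_user_mode(unsigned long op, reg_t a, reg_t b, reg_t c,
                            reg_t d, reg_t e)
{
    assert(op < sizeof(operation_table) / sizeof(operation_table[0]));
    return operation_table[op](a, b, c, d, e);
}

/* Step 4: a fixed-arity wrapper in the style of depriv2(). */
static reg_t depriv2(unsigned long op, reg_t a, reg_t b)
{
    return fake_user_mode(op, a, b, 0, 0, 0);
}

/* Step 1: the original prototype is unchanged, so callers need no edits. */
int example_scale(int value, int factor)
{
    /* Step 3: marshal the arguments into register-sized values. */
    reg_t result = depriv2(DEPRIV_OPERATION_example_scale,
                           (reg_t)value, (reg_t)factor);

    /* Step 5: interpret the return code before handing it back. */
    return (int)result;
}
```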

With this done, there are no edits which need to be made to callers. If aliasing
of data is added to the feature, then this may no longer be the case.

The process to create a syscall is as follows:
 - Create a syscall with a name do_depriv_* using the depriv_syscall_fn_t prototype
 - Write the syscall body
 - Return a result to depriv mode
 - Add an entry to the depriv_syscall_table and create a syscall number
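A host-side sketch of these steps, together with a model of the lookup in do_deprivileged_syscall(). DEPRIV_SYSCALL_example_max, do_depriv_example_max() and dispatch_syscall() are hypothetical names, and the struct mirrors the .fn and .nr_args fields the patch reads from depriv_syscall_t. Every handler takes five register-sized arguments, so the dispatcher can call any of them the same way.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long reg_t;   /* stands in for the patch's register_t */
typedef reg_t (*syscall_fn_t)(reg_t, reg_t, reg_t, reg_t, reg_t);

struct syscall_entry {
    syscall_fn_t fn;
    unsigned int nr_args;      /* how many of the five slots are used */
};

/* Step 4: the syscall number (hypothetical). */
#define DEPRIV_SYSCALL_example_max 0

/* Steps 1 to 3: a do_depriv_* handler that returns its result. */
static reg_t do_depriv_example_max(reg_t a, reg_t b, reg_t c, reg_t d,
                                   reg_t e)
{
    (void)c; (void)d; (void)e;   /* unused by this two-argument syscall */
    return a > b ? a : b;
}

/* Step 4: the table entry, indexed by the syscall number. */
static const struct syscall_entry syscall_table[] = {
    [DEPRIV_SYSCALL_example_max] = { do_depriv_example_max, 2 },
};

#define NR_SYSCALLS (sizeof(syscall_table) / sizeof(syscall_table[0]))

/*
 * Model of do_deprivileged_syscall(): validate nr, then call the handler
 * with all five argument slots. Returns -1 where Xen would crash the domain.
 */
int dispatch_syscall(unsigned long nr, const reg_t args[5], reg_t *result)
{
    const struct syscall_entry *entry;

    if ( nr >= NR_SYSCALLS )
        return -1;

    entry = &syscall_table[nr];
    if ( entry->fn == NULL )
        return -1;

    assert(entry->nr_args <= 5);
    *result = entry->fn(args[0], args[1], args[2], args[3], args[4]);
    return 0;
}
```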

Syscalls are made using DEPRIV_SYSCALL_CALL(op, ret, params), which
takes the syscall number, the return variable and the parameters for the
system call, executes the system call using the Linux 64-bit calling convention
and then sets ret to the return value.

TODO:
-----
 - Alias data for deprivileged mode. There is a large comment at the top of
   deprivileged_syscall.c which outlines considerations.
 - Check if we need to map_domain_page the pages when we do the copy in
   hvm_deprivileged_copy_data_{to/from}
 - Check for unsigned integer wrapping on addition in
   hvm_deprivileged_copy_data_{to/from}
 - Move hvm_deprivileged_syscall into the syscall macro. It's a stub and
   unless extra code is needed there it can be folded into the macro.
 - Check maintainers' thoughts on the deprivileged mode function checks in
   hvm_deprivileged_user_mode. See the TODO comment.
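To illustrate the TODO about unsigned wrapping: the ASSERTs in hvm_deprivileged_copy_data_{to/from} test data_offset + size < HVM_DEPRIV_DATA_SECTION_SIZE, which a very large size can satisfy once the addition wraps. The sketch below contrasts that form with one that compares size against the space remaining after the offset. fits_naive() and fits_checked() are hypothetical names, and uint64_t stands in for unsigned long on x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The current form: offset + size may wrap around to a small value. */
bool fits_naive(uint64_t offset, uint64_t size, uint64_t limit)
{
    return offset + size < limit;
}

/* Wrap-free equivalent: compare size with the space left after offset. */
bool fits_checked(uint64_t offset, uint64_t size, uint64_t limit)
{
    return offset < limit && size < limit - offset;
}
```

For in-range values the two agree; they differ only when offset + size overflows.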

We copy the data for ease of implementation, and for small enough
structures this is acceptable. For larger structures, or ones
which can be updated whilst deprivileged mode uses them, it would be
better to alias them. This would require the caller to provide the data on
separate pages so that only the required data is passed in: we don't want
deprivileged mode to be able to access any other Xen data. We can then
alias these through a page table mapping. It would make sense to
preallocate a set of pages in the monitor table to do this so that, when
aliasing, we just need to switch the mfn in the L1 page table entries rather
than allocating and mapping in a whole new paging hierarchy. Then, we only
need to invalidate those L1 page table TLB entries when we exit the mode.

Signed-off-by: Ben Catterall <Ben.Catterall@xxxxxxxxxx>
---
 xen/arch/x86/hvm/Makefile                  |   1 +
 xen/arch/x86/hvm/deprivileged.c            |  56 +++++-
 xen/arch/x86/hvm/deprivileged_asm.S        | 170 +++++++++++++-----
 xen/arch/x86/hvm/deprivileged_syscall.c    | 277 +++++++++++++++++++++++++++++
 xen/arch/x86/hvm/svm/svm.c                 |   2 +-
 xen/arch/x86/hvm/vmx/vmx.c                 |   1 -
 xen/arch/x86/x86_64/asm-offsets.c          |   1 +
 xen/arch/x86/x86_64/entry.S                |   8 +-
 xen/include/asm-x86/hvm/vcpu.h             |   5 +-
 xen/include/xen/hvm/deprivileged.h         |  17 +-
 xen/include/xen/hvm/deprivileged_syscall.h | 200 +++++++++++++++++++++
 11 files changed, 686 insertions(+), 52 deletions(-)
 create mode 100644 xen/arch/x86/hvm/deprivileged_syscall.c
 create mode 100644 xen/include/xen/hvm/deprivileged_syscall.h

diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index e16960a..cf93e3e 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -4,6 +4,7 @@ subdir-y += vmx
 obj-y += asid.o
 obj-y += deprivileged.o
 obj-y += deprivileged_asm.o
+obj-y += deprivileged_syscall.o
 obj-y += emulate.o
 obj-y += event.o
 obj-y += hpet.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 0b02065..5606f9a 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -20,6 +20,9 @@
 #include <xen/hvm/deprivileged.h>
 #include <xen/hvm/deprivileged_syscall.h>
 
+static depriv_syscall_t depriv_operation_table[] = {
+};
+
 void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4t_base)
 {
     void *p;
@@ -562,17 +565,66 @@ void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
  * This method is then jumped into to restore execution context after
  * exiting user mode.
  */
-int hvm_deprivileged_user_mode(void)
+int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
+                               register_t b, register_t c, register_t d,
+                               register_t e)
 {
     struct vcpu *vcpu = get_current();
+    depriv_syscall_fn_t depriv_f;
+    depriv_syscall_fn_t f;
+    unsigned long offset;
 
     ASSERT( vcpu->arch.hvm_vcpu.depriv_user_mode == 0 );
     ASSERT( vcpu->arch.hvm_vcpu.depriv_rsp == 0 );
+    printk("OP: %lx size: %lx\n", operation, ARRAY_SIZE(depriv_operation_table));
+    ASSERT( operation < ARRAY_SIZE(depriv_operation_table) );
+
+    vcpu->arch.hvm_vcpu.depriv_return_code = 0;
+
+    /* Invalid operation? */
+    if ( operation >= ARRAY_SIZE(depriv_operation_table) ) {
+        domain_crash(vcpu->domain);
+        return HVM_DISPATCH_ERR;
+    }
+
+    f = depriv_operation_table[operation].fn;
+
+    /*
+     * f needs to be a depriv mode function so needs to be in the deprivileged
+     * text segment. This also checks for unsigned integer wrapping on
+     * subtraction. This is probably not necessary as we've indexed via the
+     * array to get the function address which shouldn't be user controllable
+     * so shouldn't represent a security concern. Especially as we only call
+     * the function in deprivileged mode so it cannot access privileged mode
+     * code but, better safe than sorry...
+     * TODO: See what maintainers think
+     */
+    if ( (unsigned long)f < (unsigned long)__hvm_deprivileged_text_start ||
+         (unsigned long)f >= (unsigned long)__hvm_deprivileged_text_end )
+    {
+        domain_crash(vcpu->domain);
+        return HVM_DISPATCH_ERR;
+    }
+
+    /*
+     * Calculate the offset and then test for unsigned integer wrapping on
+     * addition.
+     */
+    offset = (unsigned long)f - (unsigned long)__hvm_deprivileged_text_start;
+
+    if ( ULONG_MAX - offset < (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR )
+    {
+        domain_crash(vcpu->domain);
+        return HVM_DISPATCH_ERR;
+    }
 
+    depriv_f = (depriv_syscall_fn_t)(offset +
+                                     (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR);
+    printk("FUNCTION: %lx\n", (unsigned long)depriv_f);
     vcpu->arch.hvm_vcpu.depriv_ctxt_switch_to(vcpu);
 
     /* The assembly routine to handle moving into/out of deprivileged mode */
-    hvm_deprivileged_user_mode_asm();
+    hvm_deprivileged_user_mode_asm(depriv_f, a, b, c, d, e);
 
     vcpu->arch.hvm_vcpu.depriv_ctxt_switch_from(vcpu);
 
diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
index 07d4216..7d3e632 100644
--- a/xen/arch/x86/hvm/deprivileged_asm.S
+++ b/xen/arch/x86/hvm/deprivileged_asm.S
@@ -11,7 +11,7 @@
 #include <public/xen.h>
 #include <irq_vectors.h>
 #include <xen/hvm/deprivileged.h>
-
+#include <xen/hvm/deprivileged_syscall.h>
 /*
  * Handles entry into the deprivileged mode and returning from this
  * mode.
@@ -24,25 +24,43 @@
  * We're doing a sort-of long jump/set jump with copying to a stack to
  * preserve it and allow returning code to continue executing from
  * within this method.
+ *
+ * Params:
+ *    f - rdi
+ *    a - rsi
+ *    b - rdx
+ *    c - rcx
+ *    d - r8
+ *    e - r9
+ *
+ * NOTE: This function relies on the 64-bit linux calling convention. See the
+ *       header comment in deprivileged_syscall.c for a full description.
+ *
+ * Stack Layout after entering user mode:
+ * The stack grows down in this diagram
+ *
+ * caller
+ * --------------
+ * saved registers
+ * --------------
+ * return rip
+ * --------------
  */
 ENTRY(hvm_deprivileged_user_mode_asm)
         /* Save our registers */
-        push   %rax
-        push   %rbx
-        push   %rcx
-        push   %rdx
-        push   %rsi
-        push   %rdi
-        push   %rbp
-        push   %r8
-        push   %r9
-        push   %r10
-        push   %r11
-        push   %r12
-        push   %r13
-        push   %r14
-        push   %r15
+        /* The ordering here is deliberate, we want to save all but our
+         * parameters.
+         */
         pushfq
+        push   %r15
+        push   %r14
+        push   %r13
+        push   %r12
+        push   %r11
+        push   %r10
+        push   %rbp
+        push   %rbx
+        push   %rax
 
         /* Perform a near call to push rip onto the stack */
         call   1f
@@ -54,6 +72,17 @@ ENTRY(hvm_deprivileged_user_mode_asm)
 1:      addq   $2f-1b, (%rsp)
 
         /*
+         * We need to also save our parameters as they are caller-saved and
+         * we call other functions in this one.
+         */
+        push   %r9
+        push   %r8
+        push   %rcx
+        push   %rdx
+        push   %rsi
+        push   %rdi
+
+        /*
          * Setup the stack pointers for exceptions, syscall and sysenter to be
          * just after our current rsp, adjusted for 16 byte alignment.
          */
@@ -76,11 +105,11 @@ ENTRY(hvm_deprivileged_user_mode_asm)
         jne    3f
         cli
 
-        movq   %rsp, VCPU_depriv_rsp(%r8)        /* The rsp to restore to */
-        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
 
         /* RFLAGS user mode */
         movq   $(X86_EFLAGS_IF | X86_EFLAGS_VIP), %r11
+
+        movabs $(hvm_deprivileged_ring3 - .hvm_deprivileged_enhancement.text + HVM_DEPRIVILEGED_TEXT_ADDR), %rcx /* RIP in user mode */
         movq   $1, VCPU_depriv_user_mode(%r8)    /* Now in user mode */
 
         /*
@@ -94,6 +123,21 @@ ENTRY(hvm_deprivileged_user_mode_asm)
          * the sysret instruction.
          */
         movq   $HVM_STACK_PTR, %rbx
+
+        /*
+         * Pop our parameters so that the registers now hold the needed values
+         * for the deprivileged mode operation.
+         */
+        movq   %r8,  %r15
+        pop    %rdi
+        pop    %rsi
+        pop    %rdx
+        pop    %r10    /* rcx is needed by sysret so use r10 instead */
+        pop    %r8
+        pop    %r9
+
+        movq   %rsp, VCPU_depriv_rsp(%r15)        /* The rsp to restore to */
+
         sysretq                         /* Enter deprivileged mode */
 
 3:      call   hvm_deprivileged_restore_stacks
@@ -102,22 +146,16 @@ ENTRY(hvm_deprivileged_user_mode_asm)
          * Restore registers
          * The return rip has been popped by the ret on the return path
          */
-        popfq
-        pop    %r15
-        pop    %r14
-        pop    %r13
-        pop    %r12
-        pop    %r11
-        pop    %r10
-        pop    %r9
-        pop    %r8
-        pop    %rbp
-        pop    %rdi
-        pop    %rsi
-        pop    %rdx
-        pop    %rcx
-        pop    %rbx
         pop    %rax
+        pop    %rbx
+        pop    %rbp
+        pop    %r10
+        pop    %r11
+        pop    %r12
+        pop    %r13
+        pop    %r14
+        pop    %r15
+        popfq
         ret
 
 /* Finished in user mode so return */
@@ -134,11 +172,14 @@ ENTRY(hvm_deprivileged_finish_user_mode_asm)
         /* Go to user mode return code */
         ret
 
-/* Entry point from the assembly syscall handlers */
+/* Entry point from the assembly hypercall handlers */
 ENTRY(hvm_deprivileged_handle_user_mode)
+        GET_CURRENT(%rbx)
+        movq   UREGS_rdi(%rdi), %rax  /* rdi = regs; return code is regs->rdi */
+        movq   %rax, VCPU_depriv_return_code(%rbx)
 
-        /* Handle a user mode hypercall here */
-
+        /* Handle a user mode syscall here */
+        call do_deprivileged_syscall
 
         /* We are finished in user mode */
         call hvm_deprivileged_finish_user_mode
@@ -146,7 +187,25 @@ ENTRY(hvm_deprivileged_handle_user_mode)
         ret
 
 .section .hvm_deprivileged_enhancement.text,"ax"
-/* HVM deprivileged code */
+/* HVM deprivileged code general entry and exit point
+ *
+ * All entries to and exits from deprivileged mode operations go
+ * through here. This means that the depriv functions do not need
+ * to set up any special state and can return normally. We then
+ * handle the return to the hypervisor here.
+ *
+ * In rdi, we have the address of the function to jump to. Its
+ * parameters follow in the registers listed below and are shuffled
+ * into the 64-bit linux calling convention registers before the call.
+ *
+ * Params:
+ *    f - rdi
+ *    a - rsi
+ *    b - rdx
+ *    c - r10   - rcx is needed by sysret so we can't use it: use r10 instead
+ *    d - r8
+ *    e - r9
+ */
 ENTRY(hvm_deprivileged_ring3)
         /*
          * sysret has loaded eip from rcx and rflags from r11.
@@ -155,13 +214,42 @@ ENTRY(hvm_deprivileged_ring3)
          */
         movabs $HVM_STACK_PTR, %rsp
 
+        /*
+         * Shuffle params across so that the callee has its first argument in
+         * rdi as defined in the calling convention. We have put f in rdi and
+         * effectively moved the other five arguments 'down' one slot. This
+         * makes the depriv invocation transparent to the callee.
+         */
+        movq   %rdi, %r15
+        movq   %rsi, %rdi
+        movq   %rdx, %rsi
+        movq   %r10, %rdx  /* r10 holds param rcx */
+        movq   %r8,  %rcx
+        movq   %r9,  %r8
+
         /* Perform user mode processing */
-        movabs $0xff, %rcx
-1: dec  %rcx
-        cmp $0, %rcx
-        jne 1b
+        callq  *%r15
+
+        /* Result is in rax */
+        mov    %rax, %rdi
 
         /* Return to ring 0 */
         syscall
 
+/*
+ * Dispatch a syscall from within deprivileged mode
+ *
+ * Params:
+ * - syscall number is in rdi
+ * - syscall arguments are in rsi, rdx, rcx, r8 and r9 (in that order)
+ *
+ * TODO: this is currently a stub, it can be folded into DEPRIV_SYSCALL_CALL
+ *   if no extra code is needed.
+ */
+ENTRY(hvm_deprivileged_syscall)
+
+        syscall
+
+        /* Returned from this mode: Get result into rax */
+        ret
 .previous
diff --git a/xen/arch/x86/hvm/deprivileged_syscall.c b/xen/arch/x86/hvm/deprivileged_syscall.c
new file mode 100644
index 0000000..34dfee9
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged_syscall.c
@@ -0,0 +1,277 @@
+/*
+ * A description of deprivileged mode operation dispatch and system call handling
+ * follows.
+ *
+ * We have two operations:
+ * 1) dispatching a deprivileged mode operation
+ * 2) deprivileged mode executing a system call
+ *
+ * For (1):
+ *   We have a table of methods which can be dispatched. All deprivileged mode
+ *   methods which can be dispatched need to be in this array. This aims to
+ *   prevent dispatching functions which are not designed for deprivileged mode
+ *   and means that we do not dispatch on an arbitrary pointer.
+ *
+ * For (2):
+ *   We again have a table of methods which are the syscall handlers. All
+ *   system calls which we handle need to be in this table. Deprivileged mode
+ *   passes an integer to select which operation to call.
+ *
+ * Data transfer for (1):
+ *   To pass data to deprivileged mode, we can pass up to five integer or
+ *   pointer parameters in registers. This is thanks to the 64-bit Linux calling
+ *   convention which puts these into 64-bit registers. The dispatch code takes
+ *   these parameters and arranges them so that when the deprivileged mode
+ *   operation executes, they are in the registers specified by the calling
+ *   convention. This means that it is transparent to the operation that it was
+ *   not invoked by a function call.
+ *
+ *   To pass the data which pointers correspond to, we use the deprivileged data
+ *   section. We copy this data to the section and change the pointer so that it
+ *   points into this section. Any extra parameters are also copied into the
+ *   data section.
+ *
+ *   To return data from deprivileged mode, the operation can supply a return
+ *   value which we pass through back to the caller. If extra data is needed,
+ *   which may be needed to make logical decisions after invocation of the
+ *   operation, then this is placed at the end of the data section. The caller
+ *   of the operation can then access this data. We copy back the data which we
+ *   initially copied in so that the caller sees any changes made by the callee.
+ *   NOTE: You need to handle the case where these structures can be updated
+ *   whilst in deprivileged mode.
+ *
+ *   It is necessary to clear out the data page between deprivileged mode
+ *   operations to prevent data leakage between operations which _may_ be
+ *   useful to an attacker.
+ *
+ * Data transfer for (2):
+ *   We need to transfer data to the syscall handler and then back to the
+ *   deprivileged mode operation. To pass data, we use the same method as in (1)
+ *   for the first five parameters. For extra data, this will be placed at the
+ *   end of the data section and will be fetched by the handler.  We also use
+ *   the same method as in (1) for passing data back to the operation.
+ *
+ *
+ * TODO: We copy the data for ease of implementation and for small enough
+ *   structures, this is acceptable. For larger structures, or ones
+ *   which can be updated whilst deprivileged mode uses them, it would be
+ *   better to alias them. This would require the caller to provide the data on
+ *   separate pages (so that only the required data is passed in, we don't want
+ *   deprivileged mode to be able to access any other Xen data). We can then
+ *   alias these through a page table mapping. It would make sense to
+ *   preallocate a set of pages in the monitor table to do this, so that, when
+ *   aliasing, we just need to switch the mfn on the L1 page table, rather than
+ *   allocating and mapping in a whole new paging hierarchy. Then, we only
+ *   need to invalidate those L1 page table TLB entries when we exit the mode.
+ */
+
+#include <xen/hvm/deprivileged.h>
+#include <xen/hvm/deprivileged_syscall.h>
+
+/*
+ * Similar to Xen's arch/arm/traps.c
+ * Used for handling a syscall from deprivileged mode or dispatching a
+ * deprivileged mode operation.
+ */
+
+
+/* This table holds the functions which can be called from deprivileged mode. */
+static depriv_syscall_t depriv_syscall_table[] = {
+
+};
+
+/* Handle a syscall from deprivileged mode */
+void do_deprivileged_syscall(struct cpu_user_regs *regs)
+{
+    depriv_syscall_fn_t fn = NULL;
+    unsigned long nr = regs->rdi;
+
+    /* Invalid syscall? */
+    if ( nr >= ARRAY_SIZE(depriv_syscall_table) )
+        hvm_deprivileged_crash_domain("Invalid syscall number");
+
+    fn = depriv_syscall_table[nr].fn;
+
+    /* No syscall? */
+    if ( fn == NULL )
+        hvm_deprivileged_crash_domain("No syscall");
+
+    /*
+     * We can use the 64-bit linux calling convention here: the first six
+     * integer and pointer arguments are passed in registers, and rdi carries
+     * the syscall number, leaving five for arguments. As long as every syscall
+     * takes at most five arguments, we can call all of them with five. The
+     * argument registers are caller-saved, so the unused ones do not affect
+     * handlers which use fewer parameters.
+     */
+    ASSERT(depriv_syscall_table[nr].nr_args <= 5);
+
+    DEPRIV_SYSCALL_RESULT(regs) = fn(DEPRIV_SYSCALL_ARGS(regs));
+
+    /* Return results */
+}
+
+/*
+ * Copy data from privileged context to deprivileged context for
+ * use by deprivileged context functions.
+ *
+ * TODO: In future, it might be better to alias such data, we can put
+ * the source data in a page aligned region and then alias it so that
+ * deprivileged mode can access it. This would avoid the overheads of
+ * the copy. See the header of this file.
+ */
+void *hvm_deprivileged_copy_data_to(struct vcpu *vcpu, void *src,
+                                    unsigned long size)
+{
+    unsigned long data_offset = vcpu->arch.hvm_vcpu.depriv_data_offset;
+
+    /*
+     * TODO: Check for unsigned integer wrapping on addition
+     */
+    printk("off: %lx, size: %lx\n section: %lx\n", data_offset, size, HVM_DEPRIV_DATA_SECTION_SIZE);
+    ASSERT(data_offset + size < HVM_DEPRIV_DATA_SECTION_SIZE);
+
+    /*
+     * TODO: We _may_ need to map_domain_page these in first???
+     */
+    memcpy((void *)((unsigned long)HVM_DEPRIVILEGED_DATA_ADDR + data_offset),
+           src, size);
+
+    vcpu->arch.hvm_vcpu.depriv_data_offset += size;
+
+    /* The destination */
+    return (void *)((unsigned long)HVM_DEPRIVILEGED_DATA_ADDR + data_offset);
+}
+
+/* Copy data from deprivileged context to privileged context. */
+void *hvm_deprivileged_copy_data_from(struct vcpu *vcpu, void *dest, void *src,
+                                      unsigned long size)
+{
+    unsigned long data_offset = vcpu->arch.hvm_vcpu.depriv_data_offset;
+
+    /*
+     * TODO: Check for unsigned integer wrapping on addition
+     */
+    ASSERT(data_offset + size < HVM_DEPRIV_DATA_SECTION_SIZE);
+
+    memcpy(dest, src, size);
+
+    /*
+     * Prevent information leakage between separate deprivileged mode operations
+     * by clearing out the region we have just copied from: src points into the
+     * deprivileged data section.
+     */
+    memset(src, 0, size);
+
+    return dest;
+}
+
+/*******************************************************************************
+ *
+ * Deprivileged mode dispatcher wrappers
+ *
+ * These are used to wrap calling a deprivileged mode operation with up to five
+ * parameters in case we change the interface.
+ *
+ ******************************************************************************/
+int depriv0(unsigned long f)
+{
+    return hvm_deprivileged_user_mode(f, 0, 0, 0, 0, 0);
+}
+
+int depriv1(unsigned long f, register_t a)
+{
+    return hvm_deprivileged_user_mode(f, a, 0, 0, 0, 0);
+}
+
+int depriv2(unsigned long f, register_t a, register_t b)
+{
+    return hvm_deprivileged_user_mode(f, a, b, 0, 0, 0);
+}
+
+int depriv3(unsigned long f, register_t a, register_t b, register_t c)
+{
+    return hvm_deprivileged_user_mode(f, a, b, c, 0, 0);
+}
+
+int depriv4(unsigned long f, register_t a, register_t b, register_t c,
+            register_t d)
+{
+    return hvm_deprivileged_user_mode(f, a, b, c, d, 0);
+}
+
+int depriv5(unsigned long f, register_t a, register_t b, register_t c,
+            register_t d, register_t e)
+{
+    return hvm_deprivileged_user_mode(f, a, b, c, d, e);
+}
+
+/*******************************************************************************
+ *
+ * Test dispatchers, used to dispatch a deprivileged mode operation
+ *
+ ******************************************************************************/
+int test_op0(void)
+{
+    return depriv0(DEPRIV_OPERATION_test_op0);
+}
+
+int test_op1(int a)
+{
+    return depriv1(DEPRIV_OPERATION_test_op1, a);
+}
+
+int test_op2(int a, int b)
+{
+    return depriv2(DEPRIV_OPERATION_test_op2, a, b);
+}
+
+int test_op3(int a, int b, int c)
+{
+    return depriv3(DEPRIV_OPERATION_test_op3, a, b, c);
+}
+
+int test_op4(int a, int b, int c, int d)
+{
+    return depriv4(DEPRIV_OPERATION_test_op4, a, b, c, d);
+}
+
+int test_op5(int a, int b, int c, int d, int e)
+{
+    return depriv5(DEPRIV_OPERATION_test_op5, a, b, c, d, e);
+}
+
+/*******************************************************************************
+ *
+ * Test HVM Deprivileged mode functions
+ *
+ ******************************************************************************/
+int depriv_test_op0(void)
+{
+    return 0xDEADBEEF;
+}
+
+int depriv_test_op1(int a)
+{
+    return a;
+}
+
+int depriv_test_op2(int a, int b)
+{
+    return b;
+}
+
+int depriv_test_op3(int a, int b, int c)
+{
+    return c;
+}
+
+int depriv_test_op4(int a, int b, int c, int d)
+{
+    return d;
+}
+
+int depriv_test_op5(int a, int b, int c, int d, int e)
+{
+    return e;
+}
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 3393fb5..4ca6d53 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2670,7 +2670,7 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
             {
                 if( guest_cpu_user_regs()->eax == 0x1)
                 {
-                    hvm_deprivileged_user_mode();
+//                    hvm_deprivileged_user_mode();
                 }
                 __update_guest_eip(regs, vmcb->exitinfo2 - vmcb->rip);
                 break;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1ec23f9..b93b0b6 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3453,7 +3453,6 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
             {
                 if( guest_cpu_user_regs()->eax == 0x1)
                 {
-                    hvm_deprivileged_user_mode();
                 }
                 update_guest_eip(); /* Safe: IN, OUT */
                 break;
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 7af824a..4e2a96c 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -118,6 +118,7 @@ void __dummy__(void)
     OFFSET(VCPU_depriv_rsp, struct vcpu, arch.hvm_vcpu.depriv_rsp);
     OFFSET(VCPU_depriv_user_mode, struct vcpu, arch.hvm_vcpu.depriv_user_mode);
     OFFSET(VCPU_depriv_destroy, struct vcpu, arch.hvm_vcpu.depriv_destroy);
+    OFFSET(VCPU_depriv_return_code, struct vcpu, arch.hvm_vcpu.depriv_return_code);
     BLANK();
 
     OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 9590065..df434f2 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -106,6 +106,7 @@ restore_all_xen:
 /* Returning from user mode */
 ENTRY(handle_hvm_user_mode)
 
+        movq %rsp, %rdi
         call hvm_deprivileged_handle_user_mode
 
         /* fallthrough */
@@ -141,7 +142,12 @@ ENTRY(lstar_enter)
         SAVE_VOLATILE TRAP_syscall
         GET_CURRENT(%rbx)
 
-        /* Were we in Xen's ring 3?  */
+        /*
+         * Were we in Xen's ring 3?
+         * From lstar_enter up to saving all registers, we need to preserve rdi,
+         * rsi, rdx, rcx, r8 and r9 so that syscalls from deprivileged mode can
+         * function as expected.
+         */
         cmpq  $1, VCPU_depriv_user_mode(%rbx)
         je    handle_hvm_user_mode
 
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index f7df9d4..dcdecf1 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -216,7 +216,10 @@ struct hvm_vcpu {
     unsigned long depriv_tss_rsp0;
     unsigned long depriv_destroy;
     unsigned long depriv_watchdog_count;
-    
+    unsigned long depriv_return_code;
+    /* Offset into our data page where we can put data for depriv operations */
+    unsigned long depriv_data_offset;
+
     /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
     struct hvm_trap     inject_trap;
 
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index b6e575d..d7228be 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -17,6 +17,7 @@
 #include <asm-x86/page.h>
 #include <public/domctl.h>
 #include <xen/domain_page.h>
+#include <xen/hvm/deprivileged_syscall.h>
 
 /*
  * Initialise the HVM deprivileged mode. This just sets up the general
@@ -83,16 +84,21 @@ int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
 /* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
 void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
 
-/* Called to perform a user mode operation. */
-int hvm_deprivileged_user_mode(void);
-
 /* Called when the user mode operation has completed */
 void hvm_deprivileged_finish_user_mode(void);
 
-/* Called to move into and then out of user mode. Needed for accessing
+/* Dispatch a deprivileged user mode operation */
+int hvm_deprivileged_user_mode(unsigned long operation, register_t a,
+                               register_t b, register_t c, register_t d,
+                               register_t e);
+
+/*
+ * Called to move into and then out of user mode. Needed for accessing
  * assembly features.
  */
-void hvm_deprivileged_user_mode_asm(void);
+void hvm_deprivileged_user_mode_asm(depriv_syscall_fn_t f, register_t a,
+                                    register_t b, register_t c, register_t d,
+                                    register_t e);
 
 /* Called on the return path to return to the correct execution point */
 void hvm_deprivileged_finish_user_mode_asm(void);
@@ -151,6 +157,7 @@ extern unsigned long __hvm_deprivileged_data_end[];
 #define HVM_ERR_PG_ALLOC -1
 #define HVM_DEPRIV_ALIAS 1
 #define HVM_DEPRIV_COPY 0
+#define HVM_DISPATCH_ERR -1
 
 /*
  * The user mode stack pointer.
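With this patch, `hvm_deprivileged_user_mode()` takes an operation number and up to five register-sized arguments, and `HVM_DISPATCH_ERR` is added. Below is a plain C sketch of the table lookup this implies. It assumes that an operation number outside the table, or one with an empty slot, is rejected with `HVM_DISPATCH_ERR`. It also simply calls the selected function in place, whereas the patch enters deprivileged mode through `hvm_deprivileged_user_mode_asm()`. The operations `op_add2`/`op_sum5`, their numbers and `dispatch_depriv_operation()` are hypothetical.

```c
#include <stddef.h>

/* Stands in for the patch's register_t; renamed to avoid glibc's own. */
typedef unsigned long reg_t;
typedef reg_t (*depriv_op_fn_t)(reg_t, reg_t, reg_t, reg_t, reg_t);

/* Same { fn, nr_args } layout as depriv_syscall_t in the patch. */
typedef struct {
    depriv_op_fn_t fn;
    int nr_args;
} depriv_operation_t;

#define HVM_DISPATCH_ERR (-1)

/* Hypothetical operation numbers, in the style of DEPRIV_OPERATION_* */
#define OP_add2       0
#define OP_sum5       1
#define NR_DEPRIV_OPS 2

static reg_t op_add2(reg_t a, reg_t b, reg_t c, reg_t d, reg_t e)
{
    (void)c; (void)d; (void)e;   /* unused argument registers */
    return a + b;
}

static reg_t op_sum5(reg_t a, reg_t b, reg_t c, reg_t d, reg_t e)
{
    return a + b + c + d + e;
}

/* Only operations listed here can ever be dispatched. */
static const depriv_operation_t depriv_ops[NR_DEPRIV_OPS] = {
    [OP_add2] = { .fn = op_add2, .nr_args = 2 },
    [OP_sum5] = { .fn = op_sum5, .nr_args = 5 },
};

static int dispatch_depriv_operation(unsigned long operation, reg_t a,
                                     reg_t b, reg_t c, reg_t d, reg_t e)
{
    /* Reject numbers outside the table instead of following a raw pointer. */
    if ( operation >= NR_DEPRIV_OPS || depriv_ops[operation].fn == NULL )
        return HVM_DISPATCH_ERR;

    /* The patch switches to deprivileged mode here; the sketch calls directly. */
    return (int)depriv_ops[operation].fn(a, b, c, d, e);
}
```

Dispatching on an index means a bad value can at worst select another function that was explicitly listed for deprivileged mode. The sketch gives every operation the uniform five-register prototype. The patch instead casts functions with their natural prototypes to `depriv_syscall_fn_t`, which relies on the x86-64 calling convention rather than on anything ISO C guarantees.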
diff --git a/xen/include/xen/hvm/deprivileged_syscall.h b/xen/include/xen/hvm/deprivileged_syscall.h
new file mode 100644
index 0000000..3af29ae
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged_syscall.h
@@ -0,0 +1,200 @@
+#ifndef __X86_HVM_DEPRIVILEGED_SYSCALL
+
+/* Table of HVM deprivileged mode syscall array offsets */
+#define DEPRIV_SYSCALL_vpic_get_priority 0
+
+/* Table of HVM deprivileged mode operation array offsets */
+#define DEPRIV_OPERATION_vpic_ioport_write 0
+#define DEPRIV_OPERATION_test_op0 1
+#define DEPRIV_OPERATION_test_op1 2
+#define DEPRIV_OPERATION_test_op2 3
+#define DEPRIV_OPERATION_test_op3 4
+#define DEPRIV_OPERATION_test_op4 5
+#define DEPRIV_OPERATION_test_op5 6
+
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
+#define __X86_HVM_DEPRIVILEGED_SYSCALL
+#include <xen/hvm/deprivileged.h>
+
+/* Handle a syscall from deprivileged mode */
+void do_deprivileged_syscall(struct cpu_user_regs *regs);
+
+/* Dispatch a syscall from within deprivileged mode */
+void hvm_deprivileged_syscall(void);
+
+/*
+ * Copy data from privileged context to deprivileged context for
+ * use by deprivileged context functions.
+ */
+void *hvm_deprivileged_copy_data_to(struct vcpu *vcpu, void *src,
+                                    unsigned long size);
+
+/* Copy data from deprivileged context to privileged context. */
+void *hvm_deprivileged_copy_data_from(struct vcpu *vcpu, void *dest, void *src,
+                                      unsigned long size);
+
+/*
+ * Types that allow us to store and look up system calls with different
+ * prototypes by syscall number.
+ */
+typedef unsigned long register_t;
+
+typedef register_t (*depriv_syscall_fn_t)(
+    register_t, register_t, register_t, register_t, register_t);
+
+typedef struct {
+    depriv_syscall_fn_t fn;
+    int nr_args;
+} depriv_syscall_t;
+
+/* Create an entry in the syscall table */
+#define DEPRIV_SYSCALL(_name, _nr_args)                       \
+    [ DEPRIV_SYSCALL_ ## _name ] = {                          \
+        .fn = (depriv_syscall_fn_t) &do_depriv_ ## _name,     \
+        .nr_args = _nr_args                                   \
+    }
+
+/* Use to extract the arguments from the cpu_user_regs struct */
+#define DEPRIV_SYSCALL_ARGS(r) (r)->rsi, (r)->rdx, (r)->rcx, (r)->r8, (r)->r9
+
+/* Use to set the rax register in the cpu_user_regs struct */
+#define DEPRIV_SYSCALL_RESULT(r) (r)->rax
+
+/*
+ * Use this to make a system call from deprivileged mode.
+ * We take the syscall number and then up to five parameters which we pass in
+ * registers, following the 64-bit Linux calling convention.
+ *
+ * We need to calculate the actual address of the syscall dispatcher as we
+ * relocated it to the deprivileged code area, so its compiled address is not
+ * its actual address. This also means that we can't just do
+ * hvm_deprivileged_syscall(op, a, ...) as that would use the compiled address
+ * rather than the actual relocated address.
+ */
+#define DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, _e)                     \
+    __asm__ volatile("movq  %1, %%rdi;" /* The syscall number */               \
+                     /* Parameters */                                          \
+                     "movq  %2, %%rsi;"                                        \
+                     "movq  %3, %%rdx;"                                        \
+                     "movq  %4, %%rcx;"                                        \
+                     "movq  %5, %%r8;"                                         \
+                     "movq  %6, %%r9;"                                         \
+                     /* Dispatch it */                                         \
+                     "callq *%7;" /* Return value is in rax */                 \
+                     : "=a"(_ret)                                              \
+                     : "rm"((unsigned long)(_op)), "rm"((unsigned long)(_a)),  \
+                       "rm"((unsigned long)(_b)), "rm"((unsigned long)(_c)),   \
+                       "rm"((unsigned long)(_d)), "rm"((unsigned long)(_e)),   \
+                       "rm"((unsigned long)hvm_deprivileged_syscall -          \
+                           (unsigned long)__hvm_deprivileged_text_start +      \
+                           (unsigned long)HVM_DEPRIVILEGED_TEXT_ADDR)          \
+                     : "rdi", "rsi", "rdx", "rcx", "r8", "r9", "r10", "r11", "memory")
+
+#define DEPRIV_SYSCALL_CALL0(_op, _ret)                     \
+    DEPRIV_SYSCALL_CALL(_op, _ret, 0, 0, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL1(_op, _ret, _a)                 \
+    DEPRIV_SYSCALL_CALL(_op, _ret, _a, 0, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL2(_op, _ret, _a, _b)             \
+    DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, 0, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL3(_op, _ret, _a, _b, _c)         \
+    DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, 0, 0)
+
+#define DEPRIV_SYSCALL_CALL4(_op, _ret, _a, _b, _c, _d)     \
+    DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, 0)
+
+#define DEPRIV_SYSCALL_CALL5(_op, _ret, _a, _b, _c, _d, _e) \
+    DEPRIV_SYSCALL_CALL(_op, _ret, _a, _b, _c, _d, _e)
+
+/* Deprivileged mode operation. This can be dispatched. */
+#define DEPRIV_OPERATION(_name, _nr_args)                     \
+    [ DEPRIV_OPERATION_ ## _name ] = {                        \
+        .fn = (depriv_syscall_fn_t) &depriv_ ## _name,        \
+        .nr_args = _nr_args                                   \
+    }
+
+#define DEPRIV_OPERATION_ARGS(r) (r)->rdi, (r)->rsi, (r)->rdx, (r)->rcx, (r)->r8
+
+#define DEPRIV_OPERATION_RESULT(r) (r)->rax
+
+/*
+ * Use this attribute on the prototype of any function which is to be executed
+ * in deprivileged mode.
+ */
+#define DEPRIV_TEXT_SEGMENT                                           \
+    __attribute__((section(".hvm_deprivileged_enhancement.text")))
+
+/*
+ * Wrappers to pass up to five parameters on a deprivileged dispatch in a
+ * uniform manner.
+ */
+int depriv0(unsigned long f);
+
+int depriv1(unsigned long f, register_t a);
+
+int depriv2(unsigned long f, register_t a, register_t b);
+
+int depriv3(unsigned long f, register_t a, register_t b, register_t c);
+
+int depriv4(unsigned long f, register_t a, register_t b, register_t c,
+            register_t d);
+
+int depriv5(unsigned long f, register_t a, register_t b, register_t c,
+            register_t d, register_t e);
+
+/*
+ * We may want both the caller and the callee to have the same types for the
+ * parameters, so use these macros to ensure this is the case. GCC will
+ * complain if the types do not match.
+ *
+ * We have a version of the function with identifier F, which can be the old
+ * function identifier and prototype. The aim is to minimise the intrusiveness
+ * of adding this feature: the original call points remain unchanged and,
+ * instead, we move the contents of the old function into a deprivileged
+ * version and marshal arguments as needed. We then call the deprivileged
+ * version and handle the return result (this may require additional
+ * logic).
+ */
+#define MAKE_DEPRIV0(retn, F) \
+    retn F(void);             \
+    retn depriv_ ##F(void) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV1(retn, F, type1, arg1) \
+    retn F(type1 arg1);                    \
+    retn depriv_ ##F(type1 arg1) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV2(retn, F, type1, arg1, type2, arg2) \
+    retn F(type1 arg1, type2 arg2);                     \
+    retn depriv_ ##F(type1 arg1, type2 arg2) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV3(retn, F, type1, arg1, type2, arg2, type3, arg3)           \
+    retn F(type1 arg1, type2 arg2, type3 arg3);                                \
+    retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV4(retn, F, type1, arg1, type2, arg2, type3, arg3,  \
+                     type4, arg4)                                     \
+    retn F(type1 arg1, type2 arg2, type3 arg3, type4 arg4);           \
+    retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3,              \
+                     type4 arg4) DEPRIV_TEXT_SEGMENT;
+
+#define MAKE_DEPRIV5(retn, F, type1, arg1, type2, arg2, type3, arg3,    \
+                     type4, arg4, type5, arg5)                          \
+    retn F(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5); \
+    retn depriv_ ##F(type1 arg1, type2 arg2, type3 arg3, type4 arg4,    \
+                     type5 arg5) DEPRIV_TEXT_SEGMENT;
+
+/* Test functions covering each argument count for deprivileged dispatch */
+MAKE_DEPRIV0(int, test_op0)
+MAKE_DEPRIV1(int, test_op1, int, a)
+MAKE_DEPRIV2(int, test_op2, int, a, int, b)
+MAKE_DEPRIV3(int, test_op3, int, a, int, b, int, c)
+MAKE_DEPRIV4(int, test_op4, int, a, int, b, int, c, int, d)
+MAKE_DEPRIV5(int, test_op5, int, a, int, b, int, c, int, d, int, e)
+
+#endif /* !__ASSEMBLY__ */
+
+#endif
-- 
2.1.4
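On the privileged side, `do_deprivileged_syscall()` has to turn the saved registers back into a call through the syscall table. `%rdi` carries the syscall number set by `DEPRIV_SYSCALL_CALL()`, and `DEPRIV_SYSCALL_ARGS()` pulls the five arguments from `%rsi`, `%rdx`, `%rcx`, `%r8` and `%r9`. The result goes back in `%rax`. Below is a user-space sketch of that lookup. It assumes a hypothetical `struct depriv_regs` in place of `struct cpu_user_regs` and uses two hypothetical handlers. The error value for unknown numbers is invented, since this part of the patch does not say what the real handler returns.

```c
#include <stddef.h>

/* Stands in for the patch's register_t; renamed to avoid glibc's own. */
typedef unsigned long reg_t;
typedef reg_t (*depriv_syscall_fn_t)(reg_t, reg_t, reg_t, reg_t, reg_t);

typedef struct {
    depriv_syscall_fn_t fn;
    int nr_args;
} depriv_syscall_t;

/* Hypothetical subset of struct cpu_user_regs. */
struct depriv_regs {
    reg_t rax, rdi, rsi, rdx, rcx, r8, r9;
};

/* As in the patch: rdi holds the syscall number, rsi..r9 the arguments. */
#define DEPRIV_SYSCALL_ARGS(r) (r)->rsi, (r)->rdx, (r)->rcx, (r)->r8, (r)->r9
#define DEPRIV_SYSCALL_RESULT(r) (r)->rax

/* Hypothetical syscall numbers and handlers. */
#define DEPRIV_SYSCALL_get_answer 0
#define DEPRIV_SYSCALL_add3       1
#define NR_DEPRIV_SYSCALLS        2

static reg_t do_depriv_get_answer(reg_t a, reg_t b, reg_t c, reg_t d, reg_t e)
{
    (void)a; (void)b; (void)c; (void)d; (void)e;
    return 42;
}

static reg_t do_depriv_add3(reg_t a, reg_t b, reg_t c, reg_t d, reg_t e)
{
    (void)d; (void)e;
    return a + b + c;
}

/* Same shape as the patch's DEPRIV_SYSCALL(), minus the pointer cast. */
#define DEPRIV_SYSCALL(_name, _nr_args)    \
    [ DEPRIV_SYSCALL_ ## _name ] = {       \
        .fn = do_depriv_ ## _name,         \
        .nr_args = _nr_args                \
    }

static const depriv_syscall_t depriv_syscall_table[NR_DEPRIV_SYSCALLS] = {
    DEPRIV_SYSCALL(get_answer, 0),
    DEPRIV_SYSCALL(add3, 3),
};

/* Hypothetical error value for an unknown syscall number. */
#define DEPRIV_SYSCALL_EINVAL ((reg_t)-1)

static void handle_depriv_syscall(struct depriv_regs *regs)
{
    if ( regs->rdi >= NR_DEPRIV_SYSCALLS ||
         depriv_syscall_table[regs->rdi].fn == NULL )
    {
        DEPRIV_SYSCALL_RESULT(regs) = DEPRIV_SYSCALL_EINVAL;
        return;
    }

    DEPRIV_SYSCALL_RESULT(regs) =
        depriv_syscall_table[regs->rdi].fn(DEPRIV_SYSCALL_ARGS(regs));
}
```

The `syscall` instruction itself overwrites `%rcx` with the return address and `%r11` with RFLAGS. A fourth argument placed in `%rcx` therefore presumably has to be saved elsewhere by the assembly stub or entry path (not shown in this file), much as Linux passes its fourth syscall argument in `%r10`. Otherwise `regs->rcx` would not hold the argument.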

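The relocation in `DEPRIV_SYSCALL_CALL()` is plain offset arithmetic. A symbol keeps its offset from the start of the deprivileged text section, so its run-time address is its link-time address minus `__hvm_deprivileged_text_start` plus `HVM_DEPRIVILEGED_TEXT_ADDR`. Below is a small sketch of the same calculation as a function, with an added range check. `depriv_reloc_addr()` and the section-end parameter are hypothetical.

```c
#include <stdint.h>

/*
 * Translate the link-time address of a symbol in the deprivileged text
 * section into the address of its relocated copy. Returns 0 if the symbol
 * does not lie inside the section.
 *
 *   compiled:      link-time address (e.g. of hvm_deprivileged_syscall)
 *   section_start: link-time start   (cf. __hvm_deprivileged_text_start)
 *   section_end:   link-time end of the section (hypothetical parameter)
 *   mapped_base:   run-time base     (cf. HVM_DEPRIVILEGED_TEXT_ADDR)
 */
static uintptr_t depriv_reloc_addr(uintptr_t compiled, uintptr_t section_start,
                                   uintptr_t section_end, uintptr_t mapped_base)
{
    if ( compiled < section_start || compiled >= section_end )
        return 0;

    /* The symbol keeps the same offset from the start of the section. */
    return compiled - section_start + mapped_base;
}
```

For example, with made-up addresses, a symbol linked at 0x1340 in a section that starts at 0x1000 and is mapped at 0x400000 ends up at 0x400340.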

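The `MAKE_DEPRIVn()` comment describes keeping the original function name and prototype as a thin privileged-side wrapper, while the body moves to `depriv_F()`. Below is a user-space sketch of that arrangement for two arguments. It assumes `depriv2()` simply calls through an operation table, where the real one switches to deprivileged mode. The body of `depriv_test_op2()`, `test_op2_entry()`, the table and the `-1` error value are hypothetical.

```c
/* Stands in for the patch's register_t; renamed to avoid glibc's own. */
typedef unsigned long reg_t;
typedef reg_t (*depriv_op_fn_t)(reg_t, reg_t, reg_t, reg_t, reg_t);

/*
 * As in the patch, but without DEPRIV_TEXT_SEGMENT, since that section only
 * exists in Xen's linker script. Declaring both prototypes makes the
 * compiler reject definitions whose parameter types differ.
 */
#define MAKE_DEPRIV2(retn, F, type1, arg1, type2, arg2) \
    retn F(type1 arg1, type2 arg2);                     \
    retn depriv_ ##F(type1 arg1, type2 arg2);

MAKE_DEPRIV2(int, test_op2, int, a, int, b)

#define DEPRIV_OPERATION_test_op2 0

/* Body moved out of the original function (hypothetical contents). */
int depriv_test_op2(int a, int b)
{
    return a * 10 + b;
}

/* Hypothetical adapter giving the body a uniform five-register prototype. */
static reg_t test_op2_entry(reg_t a, reg_t b, reg_t c, reg_t d, reg_t e)
{
    (void)c; (void)d; (void)e;
    return (reg_t)depriv_test_op2((int)a, (int)b);
}

static const depriv_op_fn_t depriv_ops[] = {
    [DEPRIV_OPERATION_test_op2] = test_op2_entry,
};

/* Stand-in for depriv2(): the real one enters deprivileged mode. */
static int depriv2(unsigned long f, reg_t a, reg_t b)
{
    if ( f >= sizeof(depriv_ops) / sizeof(depriv_ops[0]) )
        return -1;   /* hypothetical error value */
    return (int)depriv_ops[f](a, b, 0, 0, 0);
}

/* Same name and prototype as before, so existing call sites are unchanged. */
int test_op2(int a, int b)
{
    return depriv2(DEPRIV_OPERATION_test_op2, (reg_t)a, (reg_t)b);
}
```

`MAKE_DEPRIV2()` emits prototypes for both names. A later definition of `test_op2()` or `depriv_test_op2()` whose parameter types drift from the declared ones is therefore a "conflicting types" compile error, which is the type check the patch comment relies on.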
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

