Re: [Xen-devel] [RFC v2] xSplice design



Hi,

this looks very interesting.

I talked about an experimental Xen hotpatching design at the Linux
Plumbers Conference 2014 in Düsseldorf; the slides are here:

http://www.linuxplumbersconf.net/2014/ocw//system/presentations/2421/original/xen_hotpatching-2014-10-16.pdf

and would like to share a couple of observations from the slides:

* We found splicing at hypervisor exit with a barrier for all CPUs to
  be a good place, because all Xen stacks are effectively empty at
  that point.  We use a barrier with a maximum timeout (that can be
  adapted).  The approach is similar in concept to stop_machine and
  the time rendezvous but is time-bound and does not require all the
  asynchronous operations discussed below.

* Hotpatch generation often requires support for compiling the target
  with -ffunction-sections / -fdata-sections.  Last time I looked, you
  can compile Xen with that option, but the linker scripts are not
  fully adapted to deal with it and the resulting hypervisor cannot be
  booted.

* Xen, as it is now, has a couple of non-unique symbol names, which will
  make runtime symbol identification hard.  Sometimes, static symbols
  simply have the same name in C files, sometimes such symbols get
  included via header files, and some C files are also compiled
  multiple times and linked under different names (guest_walk.c).  I
  think the Linux kernel solves this by aiming at unique symbol names
  even for local symbols.

  nm xen-syms | cut -f3- -d\  | sort | uniq -c | sort -nr | head

* You may need support for adapting or augmenting exception tables if
  patching such code is desired (it probably is).  Hotpatches may need
  to bring their own small exception tables (similar to how Linux
  modules support this).  If you don't plan on supporting hotpatches
  that introduce additional exception-locations, one could also change
  the exception table in-place and reorder it afterwards.

* Hotpatch interdependencies are tricky.  IIRC from the discussion at
  the Live Patching track at LPC, both kpatch and kgraft aimed, at the
  time, at a single large patch that subsumes and replaces all previous
  ones.
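
To illustrate the "change in place and reorder afterwards" option from the
exception table item above, here is a minimal sketch.  The entry layout and
the function names are assumptions made for illustration; they are not
Xen's actual exception table format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry layout: faulting address -> fixup address. */
struct ex_entry {
    uint64_t addr;   /* address of the instruction that may fault */
    uint64_t fixup;  /* address to continue at after the fault */
};

/* Order entries by faulting address. */
static int ex_cmp(const void *a, const void *b)
{
    const struct ex_entry *l = a, *r = b;

    return (l->addr > r->addr) - (l->addr < r->addr);
}

/* Re-sort the table after entries were modified in place. */
void ex_table_sort(struct ex_entry *tab, size_t n)
{
    assert(tab != NULL || n == 0);
    qsort(tab, n, sizeof(*tab), ex_cmp);
}

/* Binary search for a fixup; returns 0 if addr has no entry. */
uint64_t ex_table_search(const struct ex_entry *tab, size_t n, uint64_t addr)
{
    struct ex_entry key = { .addr = addr, .fixup = 0 };
    const struct ex_entry *hit = bsearch(&key, tab, n, sizeof(*tab), ex_cmp);

    return hit ? hit->fixup : 0;
}
```

The lookup relies on the table being sorted, which is why an in-place edit
of an entry's address has to be followed by a re-sort before the next fault
is handled.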

Some more comments inline below ...

Cheers,
Martin

On 15.05.2015 21:44, Konrad Rzeszutek Wilk wrote:
> Hey!
> 
> During the Xen Hacka^H^H^H^HProject Summit? we chatted about live-patching
> the hypervisor. We sketched out how it could be done, and brainstormed
> some of the problems.
> 
> I took that and wrote a design - which is very much RFC. The design is
> laid out in two sections - the format of the ELF payload - and then the
> hypercalls to act on it.
> 
> Hypercall preemption has caused a couple of XSAs so I've baked the need
> for that in the design so we hopefully won't have an XSA for this code.
> 
> There are two big *TODO* in the design which I had hoped to get done
> before sending this out - however I am going on vacation for two weeks
> so I figured it would be better to send this off for folks to mull now
> than to have it languish.
> 
> Please feel free to add more folks on the CC list.
> 
> Enjoy!
> 
> 
> # xSplice Design v1 (EXTERNAL RFC v2)
> 
> ## Rationale
> 
> A mechanism is required to binarily patch the running hypervisor with new
> opcodes that have come about primarily due to security updates.
> 
> This document describes the design of the API that would allow us to
> upload binary patches to the hypervisor.
> 
> ## Glossary
> 
>  * splice - patch in the binary code with new opcodes
>  * trampoline - a jump to a new instruction.
>  * payload - telemetries of the old code along with binary blob of the new
>    function (if needed).
>  * reloc - telemetries contained in the payload to construct proper
>    trampoline.
> 
> ## Multiple ways to patch
> 
> The mechanism needs to be flexible to patch the hypervisor in multiple ways
> and be as simple as possible. The compiled code is contiguous in memory with
> no gaps - so we have no luxury of 'moving' existing code and must either
> insert a trampoline to the new code to be executed - or only modify in-place
> the code if there is sufficient space. The placement of new code has to be
> done by the hypervisor and the virtual address for the new code is allocated
> dynamically.
> 
> This implies that the hypervisor must compute the new offsets when splicing
> in the new trampoline code. Where the trampoline is added (inside
> the function we are patching or just the callers?) is also important.
> 
> To lessen the amount of code in the hypervisor, the consumer of the API
> is responsible for identifying which mechanism to employ and how many
> locations to patch. Combinations of modifying in-place code, adding
> trampolines, etc. have to be supported. The API should allow reading and
> writing any memory within the hypervisor virtual address space.
> 
> We must also have a mechanism to query what has been applied and a mechanism
> to revert it if needed.
> 
> We must also have a mechanism to: provide a copy of the old code - so that
> the hypervisor can verify it against the code in memory; the new code;

As Xen has no stable in-hypervisor API / ABI, you need to make sure
that a generated module matches a target hypervisor.  In our design,
we use build IDs for that (ld --build-id).  We embed build IDs at Xen
compile time and can query a running hypervisor for its ID and only
load matching patches.

This seems to be an alternative to your proposal to include old code
into hotpatch modules.
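
As a minimal sketch of that check (the `struct build_id` container and the
function name are made up for illustration; how the running hypervisor's ID
is actually queried is not shown):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the raw bytes produced by ld --build-id. */
struct build_id {
    const uint8_t *bytes;
    size_t len;
};

/*
 * A hotpatch is only acceptable if the build ID it was generated against
 * is byte-for-byte identical to the one of the running hypervisor.
 * Empty IDs never match.
 */
bool build_id_matches(const struct build_id *running,
                      const struct build_id *wanted)
{
    assert(running != NULL && wanted != NULL);

    return running->len != 0 &&
           running->len == wanted->len &&
           memcmp(running->bytes, wanted->bytes, running->len) == 0;
}
```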

> the symbol name of the function to be patched; or offset from the symbol;
> or virtual address.
> 
> The complications that this design will encounter are explained later
> in this document.
> 
> ## Patching code
> 
> The first mechanism to patch that comes to mind is in-place replacement.
> That is, replace the affected code with new code. Unfortunately x86
> instructions are variable in size, which places limits on how much space we
> have available to replace the instructions.
> 
> The second mechanism is by replacing the call or jump to the
> old function with the address of the new function.

We found splicing via a non-conditional JMP in the beginning of the
old target function to be a sensible and uncomplicated approach.  This
approach works at function level which feels like a very natural unit
to work at.  The really nice thing about this approach is that you don't
have to track all the (potentially indirect) call sites.

As you discussed, if you allocate hotpatch memory within +-2GB of the
patch location, no further trampoline indirection is required; a
5-byte JMP does the trick on x86.  We found that to be sufficient in
our experiments.
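
For reference, a minimal sketch of how such a 5-byte JMP can be encoded.
The function name is made up; the encoding itself (opcode 0xe9 followed by
a little-endian 32-bit displacement relative to the end of the instruction)
is the standard x86 near jump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5

/*
 * Encode "jmp target" for an instruction placed at address `site`.
 * Returns false if the target is outside the +-2GB range reachable
 * with a 32-bit displacement.
 */
bool encode_jmp_rel32(uint64_t site, uint64_t target,
                      uint8_t out[JMP_REL32_LEN])
{
    /* The displacement is relative to the end of the 5-byte instruction. */
    int64_t disp = (int64_t)(target - (site + JMP_REL32_LEN));
    uint32_t rel;

    assert(out != NULL);

    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;   /* too far away: would need an indirection */

    rel = (uint32_t)(int32_t)disp;
    out[0] = 0xe9;                      /* JMP rel32 */
    out[1] = (uint8_t)(rel & 0xff);     /* displacement, little-endian */
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
    return true;
}
```

The range check is exactly the "+-2GB" condition: when it fails, the
hotpatch memory was allocated too far from the patch site.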

> A third mechanism is to add a jump to the new function at the
> start of the old function.
> 
> ### Example of trampoline and in-place splicing
> 
> As an example we will assume the hypervisor does not have XSA-132 (see
> *domctl/sysctl: don't leak hypervisor stack to toolstacks*
> 4ff3449f0e9d175ceb9551d3f2aecb59273f639d) and we would like to binary patch
> the hypervisor with it. The original code looks like this:
> 
> <pre>
>    48 89 e0                  mov    %rsp,%rax  
>    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
> </pre>
> 
> while the new patched hypervisor would be:
> 
> <pre>
>    48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)  
>    48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)  
>    48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)  
>    48 89 e0                  mov    %rsp,%rax  
>    48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax  
> </pre>
> 
> This is inside the arch_do_domctl. This new change adds 24 extra
> bytes of code (three 8-byte movq instructions) which alters all the offsets
> inside the function. To alter these offsets and add the extra 24 bytes of
> code we might not have enough space in .text to squeeze this in.
> 
> As such we could simplify this problem by only patching the site
> which calls arch_do_domctl:
> 
> <pre>
> <do_domctl>:  
>  e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>  
> </pre>
> 
> with a new address for where the new `arch_do_domctl` would be (this
> area would be allocated dynamically).
> 
> Astute readers will wonder what we need to do if we were to patch `do_domctl`
> - which is not called directly by hypervisor but on behalf of the guests via
> the `compat_hypercall_table` and `hypercall_table`.
> Patching the offset in `hypercall_table` for `do_domctl`:
> (ffff82d080103079 <do_domctl>:)
> <pre>
> 
>  ffff82d08024d490:   79 30  
>  ffff82d08024d492:   10 80 d0 82 ff ff   
> 
> </pre>
> with the address of the new `do_domctl` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `arch_do_domctl`.
> 
> In summary this example patched the callers of the affected function by
>  * allocating memory for the new code to live in,
>  * changing the virtual address of all the functions which called the old
>    code (computing the new offset, patching the callq with a new callq).
>  * changing the function pointer tables with the new virtual address of
>    the function (splicing in the new virtual address). Since this table
>    resides in the .rodata section we would need to temporarily change the
>    page table permissions during this part.
> 
> 
> However it has severe drawbacks - the safety checks, which have to make sure
> the function is not on the stack, must also check every caller. For some
> patches this could mean, if there were a sufficiently large number of
> callers, that we would never be able to apply the update.
> 
> ### Example of different trampoline patching.
> 
> An alternative mechanism exists where we can insert a trampoline in the
> existing function to be patched to jump directly to the new code. This
> lessens the locations to be patched to one but it puts pressure on the
> CPU branching logic (I-cache, but it is just one unconditional jump).
> 
> For this example we will assume that the hypervisor has not been compiled
> with fe2e079f642effb3d24a6e1a7096ef26e691d93e (XSA-125: *pre-fill structures
> for certain HYPERVISOR_xen_version sub-ops*) which mem-sets a structure
> in `xen_version` hypercall. This function is not called **anywhere** in
> the hypervisor (it is called by the guest) but referenced in the
> `compat_hypercall_table` and `hypercall_table` (and indirectly called
> from that). Patching the offset in `hypercall_table` for the old
> `do_xen_version` (ffff82d080112f9e <do_xen_version>)
> 
> <pre>
>  ffff82d08024b270 <hypercall_table>  
>  ...  
>  ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff  
> 
> </pre>
> with the address of the new `do_xen_version` is possible. The other
> place where it is used is in `hvm_hypercall64_table` which would need
> to be patched in a similar way. This would require an in-place splicing
> of the new virtual address of `do_xen_version`.
> 
> An alternative solution would be to insert a trampoline in the
> old `do_xen_version` function to directly jump to the new `do_xen_version`.
> 
> <pre>
>  ffff82d080112f9e <do_xen_version>:  
>  ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax  
>  ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi  
>  ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 <do_xen_version+0x534>  
> </pre>
> 
> with:
> 
> <pre>
>  ffff82d080112f9e <do_xen_version>:  
>  ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]  
> </pre>
> 
> which would lessen the amount of patching to just one location.
> 
> In summary this example patched the affected function to jump to the
> new replacement function which required:
>  * allocating memory for the new code to live in,
>  * inserting trampoline with new offset in the old function to point to the
>    new function.
>  * Optionally we can insert in the old function a trampoline jump to a
>    function providing a BUG_ON to catch errant code.
> 
> The disadvantage of this is that the unconditional jump incurs a small
> I-cache penalty. However the simplicity of the patching and of the safety
> checks makes this a worthwhile option.
> 
> ### Security
> 
> With this method we can re-write the hypervisor - and as such we **MUST** be
> diligent in only allowing certain guests to perform this operation.
> 
> Furthermore with SecureBoot or tboot, we **MUST** also verify the signature
> of the payload to be certain it came from a trusted source.
> 
> As such the hypercall **MUST** support an XSM policy to limit which
> guests are allowed to use it. If the system is booted with signature
> checking, the signature checking will be enforced.
> 
> ## Payload format
> 
> The payload **MUST** contain enough data to allow us to apply the update
> and also safely reverse it. As such we **MUST** know:
> 
>  * What the old code is expected to be. We **MUST** verify it against the
>    runtime code.
>  * The locations in memory to be patched. This can be determined dynamically
>    via symbols or via virtual addresses.
>  * The new code to be used.
>  * Signature to verify the payload.
> 
> The payload could be constructed using a custom binary format but
> that has severe disadvantages:
> 
>  * The format might need to be changed and we need a mechanism to accommodate
>    that.
>  * It has to be platform agnostic.
>  * It has to be easily constructed using existing tools.
> 
> As such having the payload in an ELF file is the sensible way. We would be
> carrying the various set of structures (and data) in the ELF sections under
> different names and with definitions. The prefix for the ELF section name
> would always be: *.xsplice_*
> 
> Note that every structure has padding. This is added so that the hypervisor
> can re-use those fields as it sees fit.
> 
> There are five groups of *.xsplice_* sections:
> 
>  * `.xsplice_symbols` and `.xsplice_str`. The array of symbols to be
>    referenced during the update. This can contain the symbols (functions)
>    that will be patched, or the list of symbols (functions) to be checked
>    pre-patching which may not be on the stack.
> 
>  * `.xsplice_reloc` and `.xsplice_reloc_howto`. How to properly construct
>    trampolines for a patch. We can have multiple locations for which we
>    need to insert a trampoline for a payload and each location might require
>    a different way of handling it. This would naturally reference the `.text`
>    section and its proper offset. The `.xsplice_reloc` is not directly
>    concerned with patches but rather is an ELF relocation - describing the
>    target of a relocation and how that is performed. They are also used
>    where the new code references the run code.
> 
>  * `.xsplice_sections`. The safety data for the old code and new code.
>    This contains an array of symbols (pointing to `.xsplice_symbols`
>    and `.text`) which are to be used during safety and dependency checking.
> 
> 
>  * `.xsplice_patches`: The description of the new functions to be patched
>    in (size, type, pointer to code, etc.).
> 
>  * `.xsplice_change`. The structure that ties all of this together and defines
>    the payload.
> 
> Additionally the ELF file would contain:
> 
>  * `.text` section for the new and old code (function).
>  * `.rela.text` relocation data for the `.text` (both new and old).
>  * `.rela.xsplice_patches` relocation data for `.xsplice_patches` (such as
>    offset to the `.text`, `.xsplice_symbols`, or `.xsplice_reloc` section).
>  * `.bss` section for the new code (function)
>  * `.data` and `.data.read_mostly` section for the new and old code (function)
>  * `.rodata` section for the new and old code (function).
> 
> In short the *.xsplice_* sections represent various structures and the
> ELF provides the mechanism to glue it all together when loaded in memory.
> 
> Note that a lot of these ideas are borrowed from kSplice which is
> available at: https://github.com/jirislaby/ksplice
> 
> For ELF understanding the best starting point is the OSDev Wiki
> (http://wiki.osdev.org/ELF). Furthermore the ELF specification is
> at http://www.skyfree.org/linux/references/ELF_Format.pdf and
> at Oracle's web site:
> http://docs.oracle.com/cd/E23824_01/html/819-0690/chapter6-46512.html#scrolltoc
> 
> ### ASCII art of the ELF structures
> 
> *TODO*: Include an ASCII art of how the sections are tied together.
> 
> ### xsplice_symbols
> 
> The section contains an array of structures that outline the name
> of the symbol to be patched (or checked against). The structure is
> as follows:
> 
> <pre>
> struct xsplice_symbol {  
>     const char *name; /* The ELF name of the symbol. */  
>     const char *label; /* A unique xSplice name for the symbol. */  
>     uint8_t pad[16]; /* Must be zero. */  
> };  
> </pre>
> The structures may be in the section in any order and in any amount
> (duplicate entries are permitted).
> 
> Both `name` and `label` would be pointing to entries in `.xsplice_str`.
> 
> The `label` is used for diagnostic purposes - such as including the
> name and the offset.
> 
> ### xsplice_reloc and xsplice_reloc_howto
> 
> The section contains an array of structures that outline the different
> locations (and howto) for which a trampoline is to be inserted.
> 
> The howto defines the change in detail. It contains the type,
> whether the relocation is relative, the size of the relocation,
> the bitmask for which parts of the instruction or data are to be replaced,
> the amount the final relocation is shifted by (to drop unwanted data), and
> whether the replacement should be interpreted as a signed value.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_HOWTO_RELOC_INLINE  0 /* Inline replacement. */  
> #define XSPLICE_HOWTO_RELOC_PATCH   1 /* Add trampoline. */  
> #define XSPLICE_HOWTO_RELOC_DATA    2 /*  __DATE__ type change. */  
> #define XSPLICE_HOWTO_RELOC_TIME    3 /* __TIME__ type change. */  
> #define XSPLICE_HOWTO_BUG           4 /* BUG_ON being replaced.*/  
> #define XSPLICE_HOWTO_EXTABLE       5 /* exception_table change. */  
> #define XSPLICE_HOWTO_SYMBOL        6 /* change in symbol table. */  
> 
> #define XSPLICE_HOWTO_FLAG_PC_REL    0x00000001 /* Is PC relative. */  
> #define XSPLICE_HOWTO_FLAG_SIGN      0x00000002 /* Should the new value be
>                                                    treated as signed value. */  
> 
> struct xsplice_reloc_howto {  
>     uint32_t    type; /* XSPLICE_HOWTO_* */  
>     uint32_t    flag; /* XSPLICE_HOWTO_FLAG_* */  
>     uint32_t    size; /* Size, in bytes, of the item to be relocated. */  
>     uint32_t    r_shift; /* The value the final relocation is shifted right
>                             by; used to drop unwanted data from the
>                             relocation. */  
>     uint64_t    mask; /* Bitmask for which parts of the instruction or data
>                          are replaced with the relocated value. */  
>     uint8_t     pad[8]; /* Must be zero. */  
> };  
> 
> </pre>
> 
> This structure is used in:
> 
> <pre>
> struct xsplice_reloc {  
>     uint64_t addr; /* The address of the relocation (if known). */  
>     struct xsplice_symbol *symbol; /* Symbol for this relocation. */  
>     struct xsplice_reloc_howto  *howto; /* Pointer to the above structure. */  
>     uint64_t isns_added; /* ELF addend resulting from quirks of the
>                             instruction one of whose operands is the
>                             relocation. For example, this is -4 on x86
>                             pc-relative jumps. */  
>     uint64_t isns_target; /* Rest of the ELF addend. This is equal to the
>                              offset against the symbol that the relocation
>                              refers to. */  
>     uint8_t pad[8];  /* Must be zero. */  
> };  
> </pre>
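
To check my reading of the howto fields, here is a minimal sketch of how
one such relocation might be applied.  The cut-down `struct howto`, the
parameter names and the function name are assumptions of mine; sign
handling (the SIGN flag) and range checks are left out on purpose.

```c
#include <assert.h>
#include <stdint.h>

#define HOWTO_FLAG_PC_REL 0x00000001 /* mirrors XSPLICE_HOWTO_FLAG_PC_REL */

/* Hypothetical cut-down howto: only the fields used by this sketch. */
struct howto {
    uint32_t flag;     /* HOWTO_FLAG_* */
    uint32_t size;     /* size in bytes of the field to relocate, 1..8 */
    uint32_t r_shift;  /* right shift applied to the computed value */
    uint64_t mask;     /* which bits of the field are replaced */
};

/*
 * field:      bytes of the relocated field (little-endian)
 * place:      run-time address of that field
 * sym:        resolved address of the symbol
 * added:      instruction quirk addend (e.g. -4 for x86 pc-relative jumps)
 * target_off: offset against the symbol
 */
void apply_reloc(uint8_t *field, uint64_t place, uint64_t sym,
                 int64_t added, int64_t target_off, const struct howto *h)
{
    uint64_t old = 0, val;
    uint32_t i;

    assert(h->size >= 1 && h->size <= 8 && h->r_shift < 64);

    val = sym + (uint64_t)target_off + (uint64_t)added;
    if (h->flag & HOWTO_FLAG_PC_REL)
        val -= place;               /* make it relative to the field */
    val >>= h->r_shift;

    for (i = 0; i < h->size; i++)   /* read the current field contents */
        old |= (uint64_t)field[i] << (8 * i);

    old = (old & ~h->mask) | (val & h->mask);

    for (i = 0; i < h->size; i++)   /* write the patched field back */
        field[i] = (uint8_t)(old >> (8 * i));
}
```

For a `jmp rel32` at 0x1000 targeting 0x2000, the field sits at 0x1001 and
the -4 addend accounts for the displacement being relative to the end of
the instruction, so the field ends up holding 0x2000 - 4 - 0x1001 = 0xffb.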
> 
> ### xsplice_sections
> 
> The structure defined in this section is used to verify that it is safe
> to update with the new changes. It can contain safety data on the old code
> and what kind of matching we are to expect.
> 
> It can also contain safety data on what to check when about to patch.
> That is, whether the memory at each of the addresses (either provided or
> resolved when the payload is loaded by referencing the symbols) contains
> what we expect it to contain.
> 
> As such the flags can be or-ed together:
> 
> <pre>
> #define XSPLICE_SECTION_TEXT   0x00000001 /* Section is in .text */  
> #define XSPLICE_SECTION_RODATA 0x00000002 /* Section is in .rodata */  
> #define XSPLICE_SECTION_DATA   0x00000004 /* Section is in .data */  
> #define XSPLICE_SECTION_STRING 0x00000008 /* Section is in .str */  
> #define XSPLICE_SECTION_ALTINSTRUCTIONS 0x00000010 /* Section has .altinstructions. */  
> #define XSPLICE_SECTION_TEXT_INPLACE 0x00000200 /* Change is in place. */  
> #define XSPLICE_SECTION_MATCH_EXACT 0x00000400 /* Must match exactly. */  
> #define XSPLICE_SECTION_NO_STACKCHECK 0x00000800 /* Do not check the stack. */  
> 
> struct xsplice_section {  
>     struct xsplice_symbol *symbol; /* The symbol associated with this change. */  
>     uint64_t address; /* The address of the section (if known). */  
>     uint64_t size; /* The size of the section. */  
>     uint64_t flags; /* Various XSPLICE_SECTION_* flags. */
>     uint8_t pad[16]; /* To be zero. */  
> };
> 
> </pre>
> 
> ### xsplice_patches
> 
> Within this section we have an array of structures defining the new code
> (patch).
> 
> This structure consists of a pointer to the new code (which in ELF ends up
> pointing to an offset in the `.text` or `.data` section); the type of patch:
> inline - either text or data, or requesting a trampoline; and the size of
> the patch.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_PATCH_INLINE_TEXT   0
> #define XSPLICE_PATCH_INLINE_DATA   1
> #define XSPLICE_PATCH_RELOC_TEXT    2
> 
> struct xsplice_patch {  
>     uint32_t type; /* XSPLICE_PATCH_*. */  
>     uint32_t size; /* Size of patch. */  
>     uint64_t addr; /* The address of the new code (or data). */  
>     void *content; /* The bytes to be installed. */  
>     uint8_t pad[16]; /* Must be zero. */  
> };
> 
> </pre>
> 
> ### xsplice_code
> 
> The structure embedded within this section ties it all together.
> It has the name of the patch, and pointers to all the above
> mentioned structures (the start and end addresses).
> 
> The structure is as follows:
> 
> <pre>
> struct xsplice_code {  
>     const char *name; /* A sensible name for the patch. Up to 40 characters. */  
>     struct xsplice_reloc *relocs, *relocs_end; /* How to patch it */  
>     struct xsplice_section *sections, *sections_end; /* Safety data */  
>     struct xsplice_patch *patches, *patches_end; /* Patch code & data */  
>     uint8_t pad[32]; /* Must be zero. */
> };
> </pre>
> 
> There should only be one such structure in the section.
> 
> ### Example
> 
> *TODO*: Include an objdump of what the ELF would look like for the XSA
> mentioned earlier.
> 
> ## Signature checking requirements.
> 
> The signature checking requires that the layout of the data in memory
> **MUST** be the same for the signature to be verified. This means that the
> payload data layout in ELF format **MUST** match what the hypervisor would be
> expecting such that it can properly do signature verification.
> 
> The signature is based on all of the payload laid out contiguously
> in memory. The signature is to be appended at the end of the ELF payload
> prefixed with the string '~Module signature appended~\n', followed by
> a signature header then followed by the signature, key identifier, and
> signer's name.
> 
> Specifically the signature header would be:
> 
> <pre>
> #define PKEY_ALGO_DSA       0  
> #define PKEY_ALGO_RSA       1  
> 
> #define PKEY_ID_PGP         0 /* OpenPGP generated key ID */  
> #define PKEY_ID_X509        1 /* X.509 arbitrary subjectKeyIdentifier */  
> 
> #define HASH_ALGO_MD4          0  
> #define HASH_ALGO_MD5          1  
> #define HASH_ALGO_SHA1         2  
> #define HASH_ALGO_RIPE_MD_160  3  
> #define HASH_ALGO_SHA256       4  
> #define HASH_ALGO_SHA384       5  
> #define HASH_ALGO_SHA512       6  
> #define HASH_ALGO_SHA224       7  
> #define HASH_ALGO_RIPE_MD_128  8  
> #define HASH_ALGO_RIPE_MD_256  9  
> #define HASH_ALGO_RIPE_MD_320 10  
> #define HASH_ALGO_WP_256      11  
> #define HASH_ALGO_WP_384      12  
> #define HASH_ALGO_WP_512      13  
> #define HASH_ALGO_TGR_128     14  
> #define HASH_ALGO_TGR_160     15  
> #define HASH_ALGO_TGR_192     16  
> 
> 
> struct elf_payload_signature {  
>       u8      algo;           /* Public-key crypto algorithm PKEY_ALGO_*. */  
>       u8      hash;           /* Digest algorithm: HASH_ALGO_*. */  
>       u8      id_type;        /* Key identifier type PKEY_ID*. */  
>       u8      signer_len;     /* Length of signer's name */  
>       u8      key_id_len;     /* Length of key identifier */  
>       u8      __pad[3];  
>       __be32  sig_len;        /* Length of signature data */  
> };
> 
> </pre>
> (Note that this has been borrowed from the Linux module signature code.)
> 
> 
> ## Hypercalls
> 
> We will employ the sub operations of the system management hypercall (sysctl).
> There are to be four sub-operations:
> 
>  * upload the payloads.
>  * listing the summaries of the uploaded payloads and their state.
>  * getting a particular payload summary and its state.
>  * command to apply, delete, or revert the payload.
> 
> The patching is asynchronous therefore the caller is responsible
> to verify that it has been applied properly by retrieving the summary of it
> and verifying that there are no error codes associated with the payload.
> 
> We **MUST** make it asynchronous due to the nature of patching: it requires
> every physical CPU to be in lock-step with each other. The patching mechanism,
> while an implementation detail, is not a short operation and as such
> the design **MUST** assume it will be a long-running operation.
> 
> Furthermore it is possible to have multiple different payloads for the same
> function. As such a unique id has to be visible to allow proper manipulation.
> 
> The hypercall is part of the `xen_sysctl`. The top level structure contains
> one uint32_t to determine the sub-operations:
> 
> <pre>
> struct xen_sysctl_xsplice_op {  
>     uint32_t cmd;  
>       union {  
>           ... see below ...  
>         } u;  
> };  
> 
> </pre>
> while the rest of the hypercall specific structures are part of this
> structure.
> 
> 
> ### XEN_SYSCTL_XSPLICE_UPLOAD (0)
> 
> Upload a payload to the hypervisor. The payload is verified and if there
> are any issues the proper return code will be returned. The payload is
> not applied at this time - that is controlled by *XEN_SYSCTL_XSPLICE_ACTION*.
> 
> The caller provides:
> 
>  * `id` unique id.
>  * `payload` the virtual address of where the ELF payload is.
> 
> The return value is zero if the payload was successfully uploaded and the
> signature was verified. Otherwise an EXX return value is provided.
> Duplicate `id`s are not supported.
> 
> The `payload` is the ELF payload as mentioned in the `Payload format` section.
> 
> The structure is as follows:
> 
> <pre>
> struct xen_sysctl_xsplice_upload {  
>     char id[40];  /* IN, name of the patch. */  
>     uint64_t size; /* IN, size of the ELF file. */
>     XEN_GUEST_HANDLE_64(uint8) payload; /* ELF file. */  
> }; 
> </pre>
> 
> ### XEN_SYSCTL_XSPLICE_GET (1)
> 
> Retrieve a summary of a specific payload. The caller provides:
> 
>  * `id` the unique id.
>  * `status` *MUST* be set to zero.
>  * `rc` *MUST* be set to zero.
> 
> The `summary` structure contains a summary of the payload which includes:
> 
>  * `id` the unique id.
>  * `status` - whether it has been:
>  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION**
>     command.
>  3. *XSPLICE_STATUS_CHECKED*  (2) the ELF payload safety checks passed.
>  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>     reverted.
>  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc`
>     for details.
>  * `rc` - its error state if any.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_STATUS_LOADED    0  
> #define XSPLICE_STATUS_PROGRESS  1  
> #define XSPLICE_STATUS_CHECKED   2  
> #define XSPLICE_STATUS_APPLIED   3  
> #define XSPLICE_STATUS_REVERTED  4  
> #define XSPLICE_STATUS_IN_ERROR  5  
> 
> struct xen_sysctl_xsplice_summary {  
>     char id[40];  /* IN/OUT, name of the patch. */  
>     uint32_t status;   /* OUT */  
>     int32_t rc;  /* OUT */  
> }; 
> </pre>
> 
> ### XEN_SYSCTL_XSPLICE_LIST (2)
> 
> Retrieve an array of abbreviated summaries of the payloads that are loaded
> in the hypervisor.
> 
> The caller provides:
> 
>  * `idx` index iterator. Initially it *MUST* be zero.
>  * `count` the max number of entries to populate.
>  * `summary` virtual address of where to write payload summaries.
> 
> The hypercall returns zero on success and updates the `idx` (index) iterator
> with the number of payloads returned, `count` to the number of remaining
> payloads, and `summary` with a number of payload summaries.
> 
> If the hypercall returns E2BIG the `count` is too big and should be
> lowered.
> 
> Note that due to the asynchronous nature of hypercalls the domain might have
> added or removed payloads in the meantime, making this information stale. It
> is the responsibility of the domain to provide proper accounting.
> 
> The `summary` structure contains a summary of the payload which includes:
> 
>  * `id` unique id.
>  * `status` - whether it has been:
>  1. *XSPLICE_STATUS_LOADED* (0) has been loaded.
>  2. *XSPLICE_STATUS_PROGRESS* (1) acting on the **XEN_SYSCTL_XSPLICE_ACTION**
>     command.
>  3. *XSPLICE_STATUS_CHECKED*  (2) the payload `old` and `addr` match with the
>     hypervisor.
>  4. *XSPLICE_STATUS_APPLIED* (3) loaded, checked, and applied.
>  5. *XSPLICE_STATUS_REVERTED* (4) loaded, checked, applied and then also
>     reverted.
>  6. *XSPLICE_STATUS_IN_ERROR* (5) loaded and in a failed state. Consult `rc`
>     for details.
>  * `rc` - its error state if any.
> 
> The structure is as follows:
> 
> <pre>
> struct xen_sysctl_xsplice_list {  
>     uint32_t idx;  /* IN/OUT */  
>     uint32_t count;  /* IN/OUT */
>     XEN_GUEST_HANDLE_64(xen_sysctl_xsplice_summary) summary;  /* OUT */  
> };  
> 
> struct xen_sysctl_xsplice_summary {  
>     char id[40];  /* OUT, name of the patch. */  
>     uint32_t status;   /* OUT */  
>     int32_t rc;  /* OUT */  
> };  
> 
> </pre>
> ### XEN_SYSCTL_XSPLICE_ACTION (3)
> 
> Perform an operation on the payload structure referenced by the `id` field.
> The operation request is asynchronous and the status should be retrieved
> by using either the **XEN_SYSCTL_XSPLICE_GET** or the
> **XEN_SYSCTL_XSPLICE_LIST** hypercall.
> 
> The caller provides:
> 
>  * `id` the unique id.
>  * `cmd` the command requested:
>   1. *XSPLICE_ACTION_CHECK* (0) check that the payload will apply properly.
>   2. *XSPLICE_ACTION_UNLOAD* (1) unload the payload.  
>   3. *XSPLICE_ACTION_REVERT* (2) revert the payload.  
>   4. *XSPLICE_ACTION_APPLY* (3) apply the payload.   
> 
> 
> The return value will be zero unless the provided fields are incorrect.
> 
> The structure is as follows:
> 
> <pre>
> #define XSPLICE_ACTION_CHECK  0  
> #define XSPLICE_ACTION_UNLOAD 1  
> #define XSPLICE_ACTION_REVERT 2  
> #define XSPLICE_ACTION_APPLY  3  
> 
> struct xen_sysctl_xsplice_action {  
>     char id[40];  /* IN, name of the patch. */  
>     uint32_t cmd; /* IN */  
> };  
> 
> </pre>
> 
> ## Sequence of events.
> 
> The normal sequence of events is to:
> 
>  1. *XEN_SYSCTL_XSPLICE_UPLOAD* to upload the payload. If there are errors 
> *STOP* here.
>  2. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_LOADED* go to next step.
>  3. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_CHECK* command to verify 
> that the payload can be succesfully applied.
>  4. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_CHECKED* go to next 
> step.
>  5. *XEN_SYSCTL_XSPLICE_ACTION* with *XSPLICE_ACTION_APPLY* to apply the 
> patch.
>  6. *XEN_SYSCTL_XSPLICE_GET* to check the `->status`. If in 
> *XSPLICE_STATUS_PROGRESS* spin. If in *XSPLICE_STATUS_APPLIED* exit with 
> success.
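
The six steps above could be driven by a small polling loop; a rough sketch follows. Everything except the two command values is an assumption: `struct xsplice_ops` and its callbacks are hypothetical stand-ins for the sysctl hypercalls, the status values other than *XSPLICE_STATUS_IN_ERROR* (5) are placeholders, and `fake_hv` is an in-memory stand-in for the hypervisor so that the sketch is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSPLICE_ACTION_CHECK 0 /* from the document */
#define XSPLICE_ACTION_APPLY 3 /* from the document */

/* Only IN_ERROR (5) is taken from the quoted text; the rest are placeholders. */
enum {
    XSPLICE_STATUS_LOADED   = 0,
    XSPLICE_STATUS_PROGRESS = 1,
    XSPLICE_STATUS_CHECKED  = 2,
    XSPLICE_STATUS_APPLIED  = 3,
    XSPLICE_STATUS_IN_ERROR = 5,
};

/* Hypothetical wrappers around the UPLOAD, GET and ACTION sysctls. */
struct xsplice_ops {
    int (*upload)(void *ctx, const char *id);
    int (*get)(void *ctx, const char *id, uint32_t *status, int32_t *rc);
    int (*action)(void *ctx, const char *id, uint32_t cmd);
};

/* Spin while the payload is in PROGRESS; give up after max_polls. */
static int wait_for(const struct xsplice_ops *ops, void *ctx, const char *id,
                    uint32_t want, unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        uint32_t status = 0;
        int32_t rc = 0;
        int err = ops->get(ctx, id, &status, &rc);

        if (err)
            return err;
        if (status == want)
            return 0;
        if (status != XSPLICE_STATUS_PROGRESS)
            return rc ? rc : -1; /* IN_ERROR or an unexpected state */
    }
    return -1; /* still in PROGRESS: treat as a timeout */
}

int xsplice_load_and_apply(const struct xsplice_ops *ops, void *ctx,
                           const char *id, unsigned int max_polls)
{
    int err;

    assert(ops != NULL && id != NULL);
    if ((err = ops->upload(ctx, id)) != 0)                         /* step 1 */
        return err;
    if ((err = wait_for(ops, ctx, id, XSPLICE_STATUS_LOADED, max_polls)) != 0)
        return err;                                                /* step 2 */
    if ((err = ops->action(ctx, id, XSPLICE_ACTION_CHECK)) != 0)   /* step 3 */
        return err;
    if ((err = wait_for(ops, ctx, id, XSPLICE_STATUS_CHECKED, max_polls)) != 0)
        return err;                                                /* step 4 */
    if ((err = ops->action(ctx, id, XSPLICE_ACTION_APPLY)) != 0)   /* step 5 */
        return err;
    return wait_for(ops, ctx, id, XSPLICE_STATUS_APPLIED, max_polls); /* 6 */
}

/* In-memory stand-in for the hypervisor, for demonstration only. */
struct fake_hv { uint32_t status, next; unsigned int busy; };

static int fake_start(struct fake_hv *hv, uint32_t next)
{
    hv->status = XSPLICE_STATUS_PROGRESS;
    hv->next = next;
    hv->busy = 2; /* the second poll after this sees the new state */
    return 0;
}

static int fake_upload(void *ctx, const char *id)
{
    (void)id;
    return fake_start(ctx, XSPLICE_STATUS_LOADED);
}

static int fake_get(void *ctx, const char *id, uint32_t *status, int32_t *rc)
{
    struct fake_hv *hv = ctx;

    (void)id;
    if (hv->busy > 0 && --hv->busy == 0)
        hv->status = hv->next;
    *status = hv->status;
    *rc = 0;
    return 0;
}

static int fake_action(void *ctx, const char *id, uint32_t cmd)
{
    (void)id;
    return fake_start(ctx, cmd == XSPLICE_ACTION_CHECK ? XSPLICE_STATUS_CHECKED
                                                       : XSPLICE_STATUS_APPLIED);
}

const struct xsplice_ops fake_ops = { fake_upload, fake_get, fake_action };
```

Bounding the spin with `max_polls` is a choice made for this sketch; the document itself only says to spin while in *XSPLICE_STATUS_PROGRESS*.
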
> 
>  
> ## Addendum
> 
> Implementation quirks should not be discussed in a design document.
> 
> However these observations can provide aid when developing against this
> document.
> 
> 
> ### Alternative assembler
> 
> Alternative assembler is a mechanism to use different instructions depending
> on what the CPU supports. This is done by providing multiple streams of code
> that can be patched in - or if the CPU does not support it - padded with
> `nop` operations. The alternative assembler macros cause the compiler to
> emit the most generic code in place and to record the location in a special
> ELF section. At run-time the hypervisor can leave these areas alone or
> patch them with better suited opcodes.
> 
> As we might be patching the alternative assembler sections as well - by
> providing new, better suited opcodes or perhaps nops - we also need to
> re-run the alternative assembler patching after we have done our
> patching.
> 
> Also, when we are doing safety checks the code we are checking might be
> utilizing alternative assembler. As such we should relax our checks to
> accommodate that.
> 
> ### .rodata sections
> 
> The patching might require strings to be updated as well. As such we must
> also be able to patch the strings as needed. This sounds simple - but the
> compiler has a habit of coalescing strings that are the same - which means
> that if we alter the strings in-place, other users will be inadvertently
> affected as well.
> 
> This is also where pointers to functions live - and we may need to patch
> these as well.
> 
> To guard against that we must be prepared to do patching similar to
> trampoline patching or in-line depending on the flavour. If we can
> do in-line patching we would need to:
> 
>  * alter `.rodata` to be writeable.
>  * inline patch.
>  * alter `.rodata` to be read-only.
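
As a user-space analogy only, the three in-line steps above can be sketched with POSIX `mprotect`. The hypervisor would change its own page-table permissions rather than call `mprotect`, and the helper name `patch_readonly` is hypothetical. The region is assumed to be page-aligned and currently mapped read-only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Overwrite len bytes at offset off inside a read-only, page-aligned
 * region. Returns 0 on success, -1 if a permission change fails.
 */
int patch_readonly(void *region, size_t region_len, size_t off,
                   const void *src, size_t len)
{
    assert(region != NULL && src != NULL);
    assert(off <= region_len && len <= region_len - off);

    /* Step 1: make the region writable. */
    if (mprotect(region, region_len, PROT_READ | PROT_WRITE) != 0)
        return -1;

    /* Step 2: patch in-line. */
    memcpy((char *)region + off, src, len);

    /* Step 3: make the region read-only again. */
    return mprotect(region, region_len, PROT_READ);
}
```

The sketch leaves out what matters most in the hypervisor: other CPUs can observe the region while it is writable, so the real operation has to happen while they are quiesced.
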
> 
> If we are doing trampoline patching we would need to:
> 
>  * allocate a new memory location for the string.
>  * all locations which use this string will have to be updated to use the
>    offset to the string.
>  * mark the region RO when we are done.
> 
> ### .bss sections
> 
> Patching writable data is not suitable as it is unclear what should be done
> depending on the current state of data. As such it should not be attempted.
> 
> 
> ### Patching code which is on the stack.
> 
> We should not patch the code which is on the stack. That can lead
> to corruption.
> 
> ### Trampoline (e9 opcode)
> 
> The e9 opcode used for jmpq takes a 32-bit signed displacement. That means
> the new code must be placed within 2GB of virtual address space of the old
> code. That should not be a problem since the Xen hypervisor has a very
> small footprint.
> 
> However, if needed, we can always chain two trampolines: one at the 2GB
> limit that jumps to the next trampoline.
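
To make the 2GB limit concrete, here is a minimal sketch of encoding such a trampoline. The function name `encode_jmp_rel32` is hypothetical. It relies on the x86 rule that the displacement of the 5-byte `e9` jump is relative to the address of the following instruction, and it refuses targets that do not fit in a signed 32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5 /* one opcode byte plus a 4-byte displacement */

/*
 * Encode "jmp rel32" placed at old_addr and targeting new_addr.
 * Returns 0 and fills insn on success, -1 if the target is out of range.
 */
int encode_jmp_rel32(uint64_t old_addr, uint64_t new_addr,
                     uint8_t insn[JMP_REL32_LEN])
{
    int64_t disp;
    uint32_t rel;

    assert(insn != NULL);

    /* The displacement is measured from the end of the jump instruction. */
    disp = (int64_t)(new_addr - (old_addr + JMP_REL32_LEN));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1; /* further than 2GB: a second trampoline would be needed */

    rel = (uint32_t)(int32_t)disp;
    insn[0] = 0xe9;
    insn[1] = (uint8_t)(rel & 0xff); /* little-endian, as x86 expects */
    insn[2] = (uint8_t)((rel >> 8) & 0xff);
    insn[3] = (uint8_t)((rel >> 16) & 0xff);
    insn[4] = (uint8_t)((rel >> 24) & 0xff);
    return 0;
}
```

For example, a jump at 0x1000 to 0x2000 has a displacement of 0x2000 - 0x1005 = 0xffb and encodes as `e9 fb 0f 00 00`.
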
> 
> ### Time rendezvous code instead of stop_machine for patching
> 
> The hypervisor's time rendezvous code runs synchronously across all CPUs
> every second. Using stop_machine to patch can stall the time rendezvous
> code and result in an NMI. As such, doing the patching at the tail
> of the rendezvous code should avoid this problem.
> 
> ### Security
> 
> Only the privileged domain should be allowed to do this operation.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
> 




 

