Re: [Xen-devel] [PATCH 0 of 5] docs: x86 PV MMU related functions

On Fri, Nov 16, 2012 at 04:55:13PM +0000, Ian Campbell wrote:
> On Fri, 2012-11-02 at 11:18 +0000, Ian Campbell wrote:
> > 
> > I also have a draft of a wiki article on the subject which references
> > the information in the public headers which I hope to post soon. 
> I realised I forgot to do this...
> It needs some polish but the majority of the XXX's are placeholders for
> links to the result of applying this series.


Comments about some small typos..

> 8<------------------------------------
> Paravirtualised X86 Memory Management
> = Intro =
> One of the original innovations of the Xen hypervisor was the a

"was the a paravirtualisation". Extra "a" ? 

> paravirtualisation of the memory management unit (MMU). This allowed
> for fast and efficient virtualisation of Operating Systems which used


> paging compared to contemporary techniques.
> In this article we will describe the functionality of the PV MMU for
> X86 Xen guests. A familiarity with X86 paging and related concepts
> will be assumed.
> Other guest types, such as HVM or PVH guests on X86 or guests on ARM,
> achieve virtualisation of the MMU using other techniques, such as the


> use of hardware assisted or shadow paging.
> = Direct Paging =
> In order to virtualised the memory subsystem all hypervisors introduce

"to virtualise" ? 

-- Pasi

> an additional level of abstraction between what the guest sees as
> physical memory (pseudo-physical) and the underlying memory of the
> machine (called machine addresses in Xen). This is usually done
> through the introduction of a physical to machine (P2M)
> mapping. Typically this would be maintained within the hypervisor and
> hidden from the guest Operating System through techniques such as
> Shadow Paging.
> The Xen paravirtualised MMU model instead requires that the guest be
> aware of the P2M mapping and be modified such that instead of writing
> page table entries mapping virtual addresses to the physical address
> space it would instead write entries mapping virtual addresses
> directly to the machine address space by mapping from pseudo physical
> to machine addresses using the P2M as it writes its page tables. This
> technique is known as direct paging.
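
A minimal sketch of what direct paging means for the guest when it builds a PTE, in case an example helps the article. The array 'toy_p2m', its contents and the helper name are hypothetical; a real guest's P2M is much larger and the flag layout here is simplified to the low 12 bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PTE_FLAGS  0xfffULL   /* simplified: low 12 bits hold the flags */

/* Hypothetical toy P2M: index is a pseudo-physical frame, value is a machine frame. */
static const uint64_t toy_p2m[] = { 0x1a2b, 0x0042, 0x7f00, 0x0003 };
#define TOY_P2M_SIZE (sizeof(toy_p2m) / sizeof(toy_p2m[0]))

/*
 * Build a PTE for pseudo-physical frame 'pfn'. Under direct paging the
 * guest translates to a machine frame before composing the entry, so the
 * value it asks Xen to write already contains a machine address.
 */
uint64_t make_direct_pte(uint64_t pfn, uint64_t flags)
{
    assert(pfn < TOY_P2M_SIZE);
    uint64_t mfn = toy_p2m[pfn];
    return (mfn << PAGE_SHIFT) | (flags & PTE_FLAGS);
}
```

The resulting value would then be handed to Xen via one of the update mechanisms described later, not written to the page table directly.
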
> = Page Types and Invariants =
> In order to ensure that the guest cannot subvert the system, Xen
> requires that certain invariants are met and therefore that all
> updates to the page tables are performed by Xen through the use
> of hypercalls.
> To this end Xen defines a number of page types and ensures that any
> given page has exactly one type at any given time. The type of a page
> is reference counted and can only be changed when the "type count" is
> zero.
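
The "one type at a time, changeable only at type count zero" rule might be easier to follow with a toy model. This is only an illustration: the struct, enum and function names are made up for the example and this is not how Xen implements it (in particular, real validation of page table pages is omitted).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, loosely following the types listed in the article. */
enum toy_page_type { PT_NONE, PT_L1, PT_L2, PT_L3, PT_L4, PT_SEGDESC, PT_WRITABLE };

struct toy_page {
    enum toy_page_type type;
    unsigned int type_count;   /* references held on the current type */
};

/* Take a type reference. The type may only change while the count is zero. */
bool toy_get_type(struct toy_page *pg, enum toy_page_type want)
{
    if (pg->type != want) {
        if (pg->type_count != 0)
            return false;      /* still in use as another type: refuse */
        pg->type = want;       /* a real hypervisor would validate here */
    }
    pg->type_count++;
    return true;
}

/* Drop a type reference previously taken with toy_get_type(). */
void toy_put_type(struct toy_page *pg)
{
    assert(pg->type_count > 0);
    pg->type_count--;
}
```

For example, a page holding a writable reference cannot become an L1 page table until that reference is dropped.
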
> The basic types are:
> * None: No special uses.
> * Page table page: Pages used as page tables (there are separate types
>   for each of the 4 levels on 64 bit and 3 levels on 32 bit PAE
>   guests).
> * Segment descriptor page: Page is used as part of the Global or Local
>   Descriptor table (GDT/LDT).
> * Writable: Page is writable.
> Xen enforces the invariant that only pages with the writable type have
> a writable mapping in the page tables. Likewise it ensures that no
> writable mapping exists of a page with any other type. It also
> enforces other invariants such as requiring that no page table page
> can make a non-privileged mapping of the hypervisor's virtual address
> space etc. By doing this it can ensure that the guest OS is not able
> to directly modify any critical data structures and therefore subvert
> the safety of the system, for example to map machine addresses which
> do not belong to it.
> Whenever a set of page-tables is loaded into the hardware page-table
> base register ('cr3') the hypervisor must take an appropriate type
> reference with the root page-table type (that is, an L4 reference on
> 64-bit or an L3 reference on 32-bit). If the page is not already of
> the required type then in order to take the initial reference it must
> first have a type count of zero (remember, a page's type can only be
> changed while the type count is zero) and must be validated to ensure
> that it respects the invariants. This in turn means that the pages
> referenced by the root page-table must be validated as having the
> correct type (i.e. L3 or L2 on 64- or 32-bit respectively), and so on
> down to the data pages at the leaves of the page-table, thereby
> ensuring that the page table as a whole is safe to load into 'cr3'.
> XXX link to appropriate header.
> In order to maintain the necessary invariants Xen must be involved in
> all updates to the page tables, as well as various other privileged
> operations. These are covered in the following sections.
> In order to prevent guest operating systems from subverting these
> mechanisms it is also necessary for guest kernels to run without the
> normal privileges associated with running in processor ring-0. For this
> reason Xen PV guest kernels usually run in either ring-1 (32-bit
> guests) or ring-3 (64-bit guests).
> = Updating Page Tables =
> Since the page tables are not writable by the guest Xen provides
> several mechanisms by which the guest can update a page table entry.
> == mmu_update hypercall ==
> The first mechanism provided by Xen is the HYPERVISOR_mmu_update
> hypercall [XXX link]. This hypercall has the prototype:
>   struct mmu_update {
>       uint64_t ptr;       /* Machine address of PTE. */
>       uint64_t val;       /* New contents of PTE.    */
>   };
>   long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
>                              unsigned count, unsigned *done_out,
>                              unsigned foreigndom)
> The operation takes an array of 'count' requests 'reqs'. The
> 'done_out' parameter returns an indication of the number of successful
> operations. 'foreigndom' can be used by a suitably privileged domain
> to access memory belonging to other domains (this usage is not covered
> here).
> Each request is a ('ptr','val') pair. The 'ptr' field is further
> divided into 'ptr[1:0]', indicating the type of update to perform, and
> 'ptr[:2]', which indicates the address to update.
> The valid values for 'ptr[1:0]' are:
> * MMU_NORMAL_PT_UPDATE: A normal page table update. 'ptr[:2]' contains
>   the machine address of the entry to update while 'val' is the Page
>   Table Entry to write. This effectively implements '*ptr = val' with
>   checks to ensure that the required invariants are preserved.
> * MMU_MACHPHYS_UPDATE: Update the machine to physical address
>   mapping. This is covered below, see [XXX link]
> * MMU_PT_UPDATE_PRESERVE_AD: As per MMU_NORMAL_PT_UPDATE but
>   preserving the Accessed and Dirty bits in the page table entry. The
>   'val' here is almost a standard Page Table Entry but with some
>   special handling. See the [XXX link hypercall documentation] for more
>   information.
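
Since the command lives in the low two bits of 'ptr', a short sketch of the packing might be worth including. The helper names below are hypothetical, and the two command values are quoted from my recollection of the public header, so please check them against xen.h before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Values as I recall them from the public header; verify before use. */
#define MMU_NORMAL_PT_UPDATE 0
#define MMU_MACHPHYS_UPDATE  1

struct mmu_update {
    uint64_t ptr;   /* Machine address of PTE, command in bits 1:0. */
    uint64_t val;   /* New contents of PTE.                         */
};

/* Pack a request: PTEs are at least 4-byte aligned, so bits 1:0 are free. */
struct mmu_update make_mmu_update(uint64_t pte_maddr, unsigned int cmd,
                                  uint64_t val)
{
    struct mmu_update req;

    assert((pte_maddr & 3) == 0);
    assert(cmd <= 3);
    req.ptr = pte_maddr | cmd;
    req.val = val;
    return req;
}

/* Unpack the two halves of 'ptr' again. */
unsigned int mmu_update_cmd(const struct mmu_update *req)
{
    return (unsigned int)(req->ptr & 3);
}

uint64_t mmu_update_addr(const struct mmu_update *req)
{
    return req->ptr & ~3ULL;
}
```

An array of such requests would be what the guest passes as 'reqs' to the hypercall, with 'count' set to the number of entries.
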
> == update_va_mapping hypercall ==
> The second mechanism provided by Xen is the
> HYPERVISOR_update_va_mapping hypercall [XXX link]. This hypercall has
> the prototype:
>   long
>   HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
>                                enum update_va_mapping_flags flags)
> This operation simply updates the leaf PTE entry (called an L1 in
> Xen) which maps the virtual address 'va' with the given value
> 'val', while of course performing the expected checks to ensure that
> the invariants are maintained. This can be thought of as updating the
> PTE using a [XXX link linear mapping].
> The flags parameter can be used to request that Xen flush the TLB
> entries associated with the update. See the [XXX link hypercall
> documentation for more].
> == Trap and emulate of page table writes ==
> As well as the above Xen can also trap and emulate updates to leaf
> page table entries (L1) only. This trapping and emulating is
> relatively expensive and is best avoided but for little-used code
> paths can provide a reasonable trade off vs. the requirement to modify
> the callsite in the guest OS.
> = Other privileged operations =
> As well as moderating page table updates in order to maintain the
> necessary invariants Xen must also be involved in certain other
> privileged operations, such as setting a new page table base
> ('cr3'). Because the guest kernel no longer runs in ring-0 certain
> other privileged operations must also be done by the hypervisor, such
> as flushing the TLB.
> These operations are performed via the HYPERVISOR_mmuext_op hypercall
> [XXX link]. This hypercall has the following prototype:
>   struct mmuext_op {
>       unsigned int cmd; /* => enum mmuext_cmd */
>       union {
>           xen_pfn_t     mfn;
>           unsigned long linear_addr;
>       } arg1;
>       union {
>           /* SET_LDT */
>           unsigned int nr_ents;
>           const void *vcpumask;
>           /* COPY_PAGE */
>           xen_pfn_t src_mfn;
>       } arg2;
>   };
>   long
>   HYPERVISOR_mmuext_op(struct mmuext_op uops[],
>                        unsigned int count,
>                        unsigned int *pdone,
>                        unsigned int foreigndom)
> The hypercall takes an array of 'count' operations each specified by
> the 'mmuext_op' struct. This hypercall allows access to various
> operations which must be performed via the hypervisor either because
> the guest kernel is no longer privileged or because the hypervisor
> must be involved in order to maintain safety. In general each available
> command corresponds to a low-level processor function. These include
> MMUEXT_NEW_BASEPTR (write cr3), various types of TLB and cache flush, and
> setting the LDT address (see below). For more information on the
> available operations please see [XXX link the hypercall
> documentation].
> = Pinning Page Tables =
> As discussed above Xen ensures that various invariants are met
> concerning whether certain pages are mapped writable or not. This
> in turn means that Xen needs to validate the page tables whenever they
> are loaded into 'cr3'. However this is a potentially expensive
> operation since Xen needs to walk the complete set of page-tables and
> validate each one recursively.
> In order to avoid this expense every time 'cr3' changes (i.e. on every
> context switch), Xen allows a page to be explicitly ''pinned'' to a
> given type. This effectively means taking an extra reference of the
> relevant page table type, thereby forcing Xen to validate the
> page-table up front and to maintain the invariants for as long as the
> pin remains in place. By doing this the guest ensures that when a new
> 'cr3' is loaded the referenced page already has the appropriate type
> (L4 or L3) and therefore the type count can simply be incremented
> without the need to validate.
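
The saving from pinning can be shown with a small counting model. This is a sketch under simplifying assumptions: the names are hypothetical, validation is reduced to bumping a counter, and the page is assumed to lose its validated state whenever its type count drops to zero.

```c
#include <assert.h>

struct toy_root {
    unsigned int type_count;    /* outstanding root page-table type references */
    unsigned int validations;   /* number of full validation walks performed   */
};

/* Take a root page-table type reference, validating on the first one. */
void toy_get_root_type(struct toy_root *rt)
{
    if (rt->type_count == 0)
        rt->validations++;      /* stands in for the recursive walk */
    rt->type_count++;
}

/* Drop a root page-table type reference. */
void toy_put_root_type(struct toy_root *rt)
{
    assert(rt->type_count > 0);
    rt->type_count--;
}

/* Simulate switching to this address space and then away from it again. */
void toy_switch_to_and_away(struct toy_root *rt)
{
    toy_get_root_type(rt);      /* loaded into cr3        */
    toy_put_root_type(rt);      /* another cr3 is loaded  */
}
```

In this model an unpinned root is revalidated on every switch, while a root that first takes one extra reference (the pin) is validated exactly once however many switches follow.
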
> For maximum performance a guest OS kernel will usually want to perform
> a pin operation as late as possible during the setup of a new set of
> page tables, so as to be able to construct them using normal writable
> mappings before blessing them as a set of page tables. Likewise on
> page-table teardown a guest OS will usually want to unpin the pages as
> soon as possible such that it can tear down the page tables without the
> use of hypercalls. These operations are usually referred to as 'late
> pin' and 'early unpin'.
> = The Physical-to-machine and machine-to-physical mapping tables =
> As discussed above, direct paging requires that the guest Operating
> System be aware of the mapping between pseudo-physical and machine
> addresses (the P2M table). In addition, in order to be able to read PTE
> entries (which contain machine addresses) and convert them back into
> pseudo-physical addresses, a translation in the opposite direction is
> required. This is done using the M2P table.
> Each table is a simple array of frame numbers, indexed by either
> physical or machine frames and looking up the other.
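
A toy pair of tables may make the "simple array" point concrete. Everything here is hypothetical (sizes, contents, the invalid marker and the helper names); the address mask assumes the usual x86-64 PTE layout with the frame in bits 51:12.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define PTE_ADDR_MASK   0x000ffffffffff000ULL  /* bits 51:12 of a 64-bit PTE */
#define TOY_INVALID     UINT64_MAX             /* hypothetical "not yours" marker */
#define TOY_GUEST_PAGES 4
#define TOY_HOST_PAGES  8

/* P2M: guest sized, indexed by pseudo-physical frame. */
static const uint64_t toy_p2m[TOY_GUEST_PAGES] = { 5, 2, 7, 0 };
/* M2P: host sized, indexed by machine frame, sparse from the guest's view. */
static uint64_t toy_m2p[TOY_HOST_PAGES];

/* Fill the M2P as the inverse of the P2M, marking all other frames invalid. */
void toy_m2p_init(void)
{
    for (uint64_t mfn = 0; mfn < TOY_HOST_PAGES; mfn++)
        toy_m2p[mfn] = TOY_INVALID;
    for (uint64_t pfn = 0; pfn < TOY_GUEST_PAGES; pfn++)
        toy_m2p[toy_p2m[pfn]] = pfn;
}

/* Read back a PTE: extract the machine frame and look up the pseudo-physical one. */
uint64_t toy_pte_to_pfn(uint64_t pte)
{
    uint64_t mfn = (pte & PTE_ADDR_MASK) >> PAGE_SHIFT;

    assert(mfn < TOY_HOST_PAGES);
    return toy_m2p[mfn];
}
```

Note how half of the M2P entries in this toy host do not belong to the guest at all, which is the sparseness the article goes on to describe.
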
> Since the P2M is sized according to the guest's pseudo-physical
> address it is left entirely up to the guest to provide and maintain in
> its own pages.
> However the M2P must be sized according to the total amount of RAM in
> the host and therefore could be of considerable size compared to the
> amount of RAM available to the guest, not to mention sparse from the
> guest's point of view since the majority of machine pages will not
> belong to it.
> For this reason Xen exposes a read-only M2P of the entire host to the
> guest and allows guests to update this table using the
> MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall [XXX
> link].
> = Descriptor Tables =
> As well as protecting page tables from being writable by the guest Xen
> also requires that various descriptor tables must be made unavailable
> to the guest.
> == Interrupt Descriptor Table ==
> A Xen guest cannot access the IDT directly. Instead Xen maintains its
> own IDT and allows guests to write entries using the
> HYPERVISOR_set_trap_table hypercall. This has the following prototype:
> XXX link.
>   struct trap_info {
>       uint8_t       vector;  /* exception vector                             */
>       uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
>       uint16_t      cs;      /* code selector                                */
>       unsigned long address; /* code offset                                  */
>   };
>   long HYPERVISOR_set_trap_table(const struct trap_info traps[]);
> The entries of the ''trap_info'' struct correspond to the fields of a
> native IDT entry and each will be validated by Xen before it is
> used. The hypercall takes an array of traps terminated by an entry
> where ''address'' is zero.
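
Because no length is passed, the terminating entry matters. A minimal sketch of how the array is delimited follows; the helper name and the selector and address values in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct trap_info {
    uint8_t       vector;  /* exception vector */
    uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
    uint16_t      cs;      /* code selector */
    unsigned long address; /* code offset */
};

/* Count the entries before the terminator (the first entry with address == 0). */
size_t count_traps(const struct trap_info traps[])
{
    size_t n = 0;

    while (traps[n].address != 0)
        n++;
    return n;
}
```

A table registering handlers for vectors 0 and 14 would therefore need three elements, the last one being all zeroes.
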
> == Global/Local Descriptor Tables ==
> A Xen guest is not able to access the Global or Local descriptor
> tables directly. Pages which are in use as part of either table are
> given their own distinct type and must therefore be mapped as
> read-only in the guest. 
> The guest is also not privileged to update the descriptor base
> registers and must therefore do so using a hypercall. The hypercall to
> update the GDT is:
>   long HYPERVISOR_set_gdt(const xen_pfn_t frames[],
>                           unsigned int entries);
> This takes an array of machine frame numbers which are validated and
> loaded into the virtual GDTR. Note that unlike native X86 these are
> machine frames and not virtual addresses. These frames will be mapped
> by Xen into the virtual address which it reserves for this purpose.
> The LDT is set using the MMUEXT_SET_LDT sub-op of the
> HYPERVISOR_mmuext_op hypercall. [XXX link.] XXX a single page?
> Finally since the pages cannot be mapped as writable by the guest the
> HYPERVISOR_update_descriptor hypercall is provided:
>   long HYPERVISOR_update_descriptor(u64 pa, u64 desc);
> It takes a machine physical address of a descriptor entry to update
> and the requested contents of the descriptor itself, in the same
> format as the native descriptors.
> = Start Of Day = 
> The initial boot time environment of a Xen PV guest is somewhat
> different to the normal initial mode of an X86 processor. Rather than
> starting out in 16-bit mode with paging disabled a PV guest is
> started in either 32- or 64- bit mode with paging enabled running on
> an initial set of page tables provided by the hypervisor. These pages
> will be setup so as to meet the required invariants and will be loaded
> into the 'cr3' register but will not be explicitly pinned (in other
> words their type count is effectively one).
> The initial virtual and pseudo-physical layout of a new guest is
> described in XXX
> file:///home/ijc/devel/xen-unstable.hg/docs/html/hypercall/include,public,xen.h.html#incontents_startofday
> = Virtual Address Space =
> Xen enforces certain restrictions on the virtual addresses which are
> available to PV guests. These are enforced as part of the machinery for
> typing and writing page tables.
> Xen uses this to reserve certain addresses for its own use. Certain
> areas are also read-only for guests and contain shared data structures
> such as the Machine-to-physical address lookup table.
> For a 64-bit guest the virtual address space is set out as follows:
> 0x0000000000000000-0x00007fffffffffff Fully available to guests
> 0x0000800000000000-0xffff7fffffffffff Inaccessible (addresses are 48-bit sign extended)
> 0xffff800000000000-0xffff807fffffffff Read only to guests
> 0xffff808000000000-0xffff87ffffffffff Reserved for Xen use
> 0xffff880000000000-0xffffffffffffffff Fully available to guests
> For 32-bit guests running on a 64-bit hypervisor the virtual
> address space under 4G (which is all such guests can access) is:
> 0x00000000-0xf57fffff Fully available to guests
> 0xf5800000-0xffffffff Read only to guests.
> For more information see "Memory Layout" under [XXX link
> xen/include/asm-x86/config.h]
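
The 64-bit layout above can be restated as a small classifier, which might be handy as a sanity check for readers. The boundaries are copied from the table in the draft; the enum and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum va_class {
    VA_GUEST,          /* fully available to guests */
    VA_NONCANONICAL,   /* inaccessible: not a sign-extended 48-bit address */
    VA_READ_ONLY,      /* read only to guests */
    VA_XEN_RESERVED    /* reserved for Xen use */
};

/* Classify a 64-bit PV guest virtual address using the ranges listed above. */
enum va_class classify_va64(uint64_t va)
{
    if (va <= 0x00007fffffffffffULL)
        return VA_GUEST;
    if (va <= 0xffff7fffffffffffULL)
        return VA_NONCANONICAL;
    if (va <= 0xffff807fffffffffULL)
        return VA_READ_ONLY;
    if (va <= 0xffff87ffffffffffULL)
        return VA_XEN_RESERVED;
    return VA_GUEST;
}
```
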
> = Batching =
> For some memory management operations the overhead of making many
> hypercalls can become prohibitively expensive. For this reason many of
> the hypercalls described above take a list of operations to
> perform. In addition Xen provides the concept of a multicall which can
> allow several different hypercalls to be batched
> together. HYPERVISOR_multicall has this prototype:
>   struct multicall_entry {
>       unsigned long op, result;
>       unsigned long args[6];
>   };
>   long HYPERVISOR_multicall(multicall_entry_t call_list[],
>                             unsigned int nr_calls);
> Each entry represents a hypercall and its associated arguments in the
> (hopefully) obvious way.
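
In case the way is less obvious than hoped, here is a rough sketch of filling in a batch. The helper 'mc_prepare' and the 'HYPERCALL_NR_*' names are hypothetical, the two hypercall numbers are from my recollection of the public header and should be checked, and the argument layout shown assumes a 64-bit guest where 'val' fits in a single argument slot.

```c
#include <assert.h>
#include <string.h>

/* Hypercall numbers as I recall them from the public header; verify before use. */
#define HYPERCALL_NR_update_va_mapping 14
#define HYPERCALL_NR_mmuext_op         26

struct multicall_entry {
    unsigned long op, result;
    unsigned long args[6];
};

/* Fill one entry: hypercall number in 'op', arguments in order in 'args'. */
void mc_prepare(struct multicall_entry *mc, unsigned long op,
                const unsigned long *args, unsigned int nr_args)
{
    assert(nr_args <= 6);
    memset(mc, 0, sizeof(*mc));
    mc->op = op;
    for (unsigned int i = 0; i < nr_args; i++)
        mc->args[i] = args[i];
}
```

A guest would prepare several entries in an array this way, pass the array and its length to HYPERVISOR_multicall, and then inspect each entry's 'result' field for the per-call return value.
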
> = Guest Specific Details =
> == Linux paravirt_ops ==
> === General PV MMU operation ===
> The Linux ''paravirt_ops'' infrastructure provides a mechanism by
> which the low-level MMU operations are abstracted into function
> pointers, allowing the native operations to be replaced where necessary.
> From the point of view of MMU operations the main entry point is
> ''struct pv_mmu_ops''. This contains entry points for low level
> operations such as:
>  * Allocating/freeing page table entries. These allow the kernel to
>    mark the pages read-only and read-write as the pages are reused.
>  * Creating, writing and reading PTE entries. These allow the kernel
>    to make the necessary translations between pseudo-physical and
>    machine addressing as well as using hypercalls instead of direct
>    writes.
>  * Reading and writing of control registers, e.g. cr3, to allow
>    hypercalls to be inserted.
>  * Various TLB flush operations, again to allow their replacement by
>    hypercalls.
> As well as these the interface includes some higher-level operations
> which allow for more efficient batching of compound operations such as
> duplicating (forking) a memory map. This is achieved by using the
> ''lazy_mmu_ops'' hooks to implement buffering of operations
> and flushing of larger batches or upon completion.
> The Xen paravirt_ops backend uses an additional page flag,
> ''PG_pinned'', in order to track whether a page has been pinned or not
> and to implement the late-pin early-unpin scheme described above.
> === Start of Day issues ===
> XXX get someone to describe these...
> = References =
> [XXX Xen and the art of virtualisation.]
> [XXX The hypercall interface documentation.]
> [XXX others? Chisnall Book?]
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
