|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 0 of 5] docs: x86 PV MMU related functions
On Fri, Nov 16, 2012 at 04:55:13PM +0000, Ian Campbell wrote:
> On Fri, 2012-11-02 at 11:18 +0000, Ian Campbell wrote:
> >
> > I also have a draft of a wiki article on the subject which references
> > the information in the public headers which I hope to post soon.
>
> I realised I forgot to do this...
>
> It needs some polish but the majority of the XXX's are placeholder for
> links to the result of this applying this series.
>
Hello,
Comments about some small typos..
> 8<------------------------------------
>
> Paravirtualised X86 Memory Management
>
> = Intro =
>
> One of the original innovations of the Xen hypervisor was the a
^^^^^
"was the a paravirtualisation". Extra "a" ?
> paravirtualisation of the memory management unit (MMU). This allowed
> for fas and efficient virtualisation of Operating Systems which used
^^
"fast".
> paging compared to contemporary techniques.
>
> In this article we will describe the functionality of the PV MMU for
> X86 Xen guests. A familiarity with X86 paging and related concepts
> will be assumed.
>
> Other guest types, such as HVM or PVH guests on X86 or guest on ARM
> achieve virtualisation of the MMU usaing other techniques, such as the
^^^^^^
"using".
> use of hardware assisted or shadow paging.
>
> = Direct Paging =
>
> In order to virtualised the memory subsystem all hypervisors introduce
^^^
"to virtualise" ?
-- Pasi
> an additional level of abstraction between what the guest sees as
> physical memory (pseudo-physical) and the underlying memory of the
> machine (called machine addresses in Xen). This is usually done
> through the introduction of a physical to machine (P2M)
> mapping. Typically this would be maintained within the hypervisor and
> hidden from the guest Operating System through techniques such as
> Shadow Paging.
>
> The Xen paravirtualised MMU model instead requires that the guest be
> aware of the P2M mapping and be modified such that instead of writing
> page table entries mapping virtual addresses to the physical address
> space it would instead write entries mapping virtual addresses
> directly to the machine address space by mapping from pseudo physical
> to machine addresses using the P2M as it writes its page tables. This
> technique is known as direct paging.
>
> = Page Types and Invariants =
>
> In order to ensure that the guest cannot subvert the system Xen
> requires that certain invariants are met and therefore that all
> updates to the page table updates are performed by Xen through the use
> of hypercalls.
>
> To this end Xen defines a number of page types and ensures that any
> given page has exactly one type at any given time. The type of a page
> is reference counted and can only be changed when the "type count" is
> zero.
>
> The basic types are:
>
> * None: No special uses.
> * Page table page: Pages used as page tables (there are separate types
> for each of the 4 levels on 64 bit and 3 levels on 32 bit PAE
> guests).
> * Segment descriptor page: Page is used as part of the Global or Local
> Descriptor table (GDT/LDT).
> * Writeable: Page is writable.
>
> Xen enforces the invariant that only pages with the writable type have
> a writable mapping in the page tables. Likewise it ensures that no
> writable mapping exists of a page with any other type. It also
> enforces other invariants such as requiring that no page table page
> can make a non-privlieged mapping of the hypervisor's virutal address
> space etc. By doing this it can ensure that the guest OS is not able
> to directly modify any critical data structures and therefore subvert
> the safety of the system, for example to map machine addresses which
> do not belong to it.
>
> Whenever a set of page-tables is loaded into the hardware page-table
> base register ('cr3') the hypervisor must take an appropriate type
> reference with the root page-table type (that is, an L4 reference on
> 64-bit or an L3 reference on 32-bit). If the page is not already of
> the required type then in order to take the initial reference it must
> first have a type count of zero (remember, a pages' type only be
> change while the type count is zero) and must be validated to ensure
> that it respects the invariants. This in turn means that the pages
> referenced by the root page-table must be validates as having the
> correct type (i.e. L3 or L2 on 64- or 32-bit repsectively), and so on
> down to the data pages at the leafs of the page-table, thereby
> ensuring that the page table as a whole is safe to load into 'cr3'.
>
> XXX link to appropriate header.
>
> In order to maintain the necessary invariants Xen must be involved in
> all updates to the page tables, as well as various other privileged
> operations. These are covered in the following sections.
>
> In order to prevent guest operating systems from subverting these
> mechanisms it is also necessary for guest kernels to run without the
> normal privileges associated with running in processor ring-0. For this
> reason Xen PV guest kernels usually run in either ring-1 (32-bit
> guests) or ring-3 (64-bit guests).
>
> = Updating Page Tables =
>
> Since the page tables are not writable by the guest Xen provides
> several machanisms by which the guest can update a page table entry.
>
> == mmu_update hypercall ==
>
> The first mechanism provided by Xen is the HYPERVISOR_mmu_update
> hypercall [XXX link]. This hypercall has the prototype:
>
> struct mmu_update {
> uint64_t ptr; /* Machine address of PTE. */
> uint64_t val; /* New contents of PTE. */
> };
>
> long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
> unsigned count, unsigned *done_out,
> unsigned foreigndom)
>
> The operation takes an array of 'count' requests 'reqs'. The
> 'done_out' paramter returns an indication of the number of successful
> operations. 'foreigndom' can be used by a suitably privileged domain
> to access memory belonging to other domains (this usage is not covered
> here).
>
> Each request is a ('ptr','value') pair. The 'ptr' field is further
> divides into 'ptr[1:0]' indicating the type of update to perform and
> 'ptr[:2]' which indicates the the address to update.
>
> The valid values for 'ptr[1:0]' are:
>
> * MMU_NORMAL_PT_UPDATE: A normal page table update. 'ptr[:2]' contains
> the machine address of the entry to update while 'val' is the Page
> Table Entry to write. This effectively implements '*ptr = val' with
> checks to ensure that the required invariants aree preserved.
> * MMU_MACHPHYS_UPDATE: Update the machine to physical address
> mapping. This is covered below, see [XXX link]
> * MMU_PT_UPDATE_PRESERVE_AD: As per MMU_NORMAL_PT_UPDATE but
> preserving the Accessed and Dirty bits in the page table entry. The
> 'val' here is almost a standard Page Table Entry but with some
> special handling. See the [XXX link hypercall documentation] for more
> information.
>
> == update_va_mapping hypercall ==
>
> The second mechanism provided by Xen is the
> HYPERVISOR_update_va_mapping hypercall [XXX link]. This hypercall has
> the prototype:
>
> long
> HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
> enum update_va_mapping_flags flags)
>
> This operation simply updates the leaf PTE entry (called and L1 in
> Xen) which maps the virtual address 'va' with the given value
> 'val', while of course performing the expected checks to ensure that
> the invariants are maintained. This can be though of as updating the
> PTE using a [XXX link linear mapping].
>
> The flags parameter can be used to request that Xen flush the TLB
> entries associated with the update. See the [XXX link hypercall
> documentation for more].
>
> == Trap and emulate of page table writes ==
>
> As well as the above Xen can also trap and emulate updates to leaf
> page table entries (L1) only. This trapping and emulating is
> relatively expensive and is best avoided but for little used code
> paths can provide a reasonable trade off vs.the requirement to modify
> the callsite in the guest OS.
>
> = Other privileged operations =
>
> As well as moderating page table updates in order to maintain the
> necessary invariants Xen must also be involved in certain other
> privileged operations, such as setting a new page table base
> ('cr3'). Because the guest kernel no longer runs in ring-0 certain
> other privleged operations must also be done by the hypervisor, such
> as flushing the TLB.
>
> These operations are performed via the HYPERVISOR_mmuext_op hypercall
> [XXX link]. This hypercall has the following prototype:
>
> struct mmuext_op {
> unsigned int cmd; /* => enum mmuext_cmd */
> union {
> /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
> * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
> xen_pfn_t mfn;
> /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
> unsigned long linear_addr;
> } arg1;
> union {
> /* SET_LDT */
> unsigned int nr_ents;
> /* TLB_FLUSH_MULTI, INVLPG_MULTI */
> const void *vcpumask;
> /* COPY_PAGE */
> xen_pfn_t src_mfn;
> } arg2;
> };
>
> long
> HYPERVISOR_mmuext_op(struct mmuext_op uops[],
> unsigned int count,
> unsigned int *pdone,
> unsigned int foreigndom)
>
> The hypercall takes an array of 'count' operations each specified by
> the 'mmuext_op' struct. This hypercall allows access to various
> operations which must be performed via the hypervisor either because
> the guest kernel is no longer privileged or because the hypervisor
> must be involed in order to maintain safety, in general each available
> command corresponds to a low-level processor function. The include
> NEWBASE_PTR (write cr3), various types of TLB and cache flush and to
> set the LDT table address (see below). For more information on the
> available operations please see [XXX link the hypercall
> documentation].
>
> = Pinning Page Tables =
>
> As discussed above Xen ensures that various invariants are met
> concerning whether certain pages are mapped writable or not. This
> in turn means that Xen needs to validate the page tables whenever they
> are loaded into 'cr3'. However this is a potentially expensive
> operation since Xen needs to walk the complete set of page-tables and
> validate each one recursivley.
>
> In order to avoid this expense every time 'cr3' changes (i.e. on every
> context switch). Xen allows a page to be explictly ''pinned'' to a
> give type. This effectively means taking an extra reference of the
> relevant page table type, thereby forcing Xen to validate the
> page-table up front and to maintain the invariants for as long as the
> pin remains in place. By doing this the guest ensures that when a new
> 'cr3' is loaded the referenced page already has the appropriate type
> (L4 or L3) and therefore the type count can simply be incremented
> without the need to validate.
>
> For maximum performance a guest OS kernel will usually want to perform
> a pin operation as late as possible during the setup of a new set of
> page tables, so as to be able to construct them using normal writable
> mappings before blessing them as a set of page tables. Likewise on
> page-table teardown a guest OS will usually want to unpin the pages as
> soon as possible such that it can teardown the page tables without the
> use of hypercalls. These operations are usually refered to as 'late
> pin' and 'early unpin'.
>
> = The Physical-to-machine and machine-to-physical mapping tables =
>
> As discussed above direct paging requires that the guest Operating
> System be aware of the mapping between (pseudo-physical) and machine
> addresses (the P2M table). In addition in order to be able to read PTE
> entries (which contain machine addresses) and convert them back into
> pseudo-physical addresses a translation between, this is done using
> the M2P table.
>
> Each table is a simple array of frame numbers, indexed by either
> physical or machine frames and looking up the other.
>
> Since the P2M is sized according to the guest's pseudo-physical
> address it is left entirely up to the guest to provide and maintain in
> its own pages.
>
> However the M2P must be sized according to the total amount of RAM in
> the host and therefore could be of considerable ize compared to the
> amount of RAM available to the guest, not to mention sparse from the
> guest's point of view since the majority of machine pages will not
> belong to it.
>
> For this reason Xen exposes a read-only M2P of the entire host to the
> guest and allows guests to update this table using the
> MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall [XXX
> link].
>
> = Descriptor Tables =
>
> As well as protecting page tables from being writable by the guest Xen
> also requires that various descriptor tables must be made unavailable
> to the guest.
>
> == Interrupt Descriptor Table ==
>
> A Xen guest cannot access the IDT directly. Instead Xen maintains its
> own IDT and allows guest to write entries using the
> HYPERVISOR_set_trap_table hypercall. This has the following prototype:
> XXX link.
>
> struct trap_info {
> uint8_t vector; /* exception vector
> */
> uint8_t flags; /* 0-3: privilege level; 4: clear event
> enable? */
> uint16_t cs; /* code selector
> */
> unsigned long address; /* code offset
> */
> };
> long HYPERVISOR_set_trap_table(const struct trap_info traps[]);
>
> The entires of the ''trap_info'' struct correspond to the fields of a
> native IDT entry and each will be validated by Xen before it is
> used. The hypercall takes an array of traps terminated by an entry
> where ''address'' is zero.
>
> == Global/Local Descriptor Tables ==
>
> A Xen guest is not able to access the Global or Local descriptor
> tables directly. Pages which are in use as part of either table are
> given their own distinct type and must therefore be mapped as
> read-only in the guest.
>
>
> The guest is also not privileged to update the descriptor base
> registers and must therefore do so using a hypercall. The hypercall to
> update the GDT is:
>
> long HYPERVISOR_set_gdt(const xen_pfn_t frames[], unsigned int
> entries);
>
> This takes an array of machine frame numbers which are validated and
> loaded into the virtual GDTR. Note that unlike native X86 these are
> machine frames and not virtual addresses. These frames will be mapped
> by Xen into the virtual address which it reserves for this purpose.
>
> The LDT is set using the MMUEXT_SET_LDT sub-op of the
> HYPERVISOR_mmuext_op hypercall. [XXX link.] XXX a single page?
>
> Finally since the pages cannot be mapped as writable by the guest the
> HYPERVISOR_update_descriptor hypercall is provided:
>
> long HYPERVISOR_update_descriptor(u64 pa, u64 desc);
>
> It takes a machine physical address of a descriptor entry to update
> and the requested contents of the descriptor itself, in the same
> format as the native descriptors.
>
> = Start Of Day =
>
> The initial boot time environment of a Xen PV guest is somewhat
> different to the normal initial mode of an X86 processor. Rather than
> starting out in 16-bit mode with paging disabled a PV guest is
> started in either 32- or 64- bit mode with paging enabled running on
> an initial set of page tables provided by the hypervisor. These pages
> will be setup so as to meet the required invariants and will be loaded
> into the 'cr3' register but will not be explicitly pinned (in other
> words their type count is effectively one)
>
> The initial virtual and pseudo-physical layout of a new guest is
> described in XXX
> file:///home/ijc/devel/xen-unstable.hg/docs/html/hypercall/include,public,xen.h.html#incontents_startofday
>
> = Virtual Address Space =
>
> Xen enforces certain restrictions on the virtual addresses which are
> available to PV guests. These are enforced as part of the machinery for
> typing and writing page tables.
>
> Xen uses this to reserve certain addresses for its own use. Certain
> areas are also read-only for guests and contain shared datastructures
> such as the Macine-to-physical address lookup table.
>
> For a 64-bit guest Xen the virtual address space is setout as follows:
>
> 0x0000000000000000-0x00007fffffffffff Fully available to guests
> 0x0000800000000000-0xffff7fffffffffff Inaccessible (addresses are 48-bit
> sign extended)
> 0xffff800000000000-0xffff807fffffffff Read only to guests.
> 0xffff808000000000-0xffff87ffffffffff Reserved for Xen use
> 0xffff880000000000-0xffffffffffffffff Fully Available to guests
>
> For 32-bit guests running on a 64-bit hypervisor guests the virtual
> address space under 4G (which is all such guests can access is:
> 0x00000000-0xf57fffff Fully available to guests
> 0xf5800000-0xffffffff Read only to guests.
>
> For more information see "Memory Layout" under [XXX link
> xen/include/asm-x86/config.h]
>
> = Batching =
>
> For some memory management operations the overhead of making many
> hypercalls can become prohibively expensive. For this reason many of
> the hypercalls described above take a list of operations to
> perform. In addition Xen provides the concept of a multicall which can
> allow several different hypercalls to be batched
> together. HYPERVISOR_multicall has this prototype:
>
> struct multicall_entry {
> unsigned long op, result;
> unsigned long args[6];
> };
> long HYPERVISOR_multicall(multicall_entry_t call_list[],
> unsigned int nr_calls);
>
> Each entry represents a hypercall and its associated arguments in the
> (hopefully) obvious way.
>
> = Guest Specific Details
>
> == Linux paravirt_ops ==
>
> === General PV MMU operation ===
>
> The Linux ''paravirt_ops'' infrastructure provides a mechanism by
> which the low-level MMU operations are abstracted into function
> pointers allowing the native operations where necessary.
>
> From the point of view of MMU operations the main entry point is
> ''struct pv_mmu_ops''. This contains entry points for low level
> operations such as:
>
> * Allocating/freeing page table entries. These allow the kernel to
> mark the pages read-only and read-write as the pages are reused.
> * Creating, writing and reading PTE entries. These allow the kernel
> to make the necessary translations between pseudo-physical and
> machine addressing as well as using hypercalls instead of direct
> writes.
> * Reading and writing of control registers, e.g. cr3, to allow
> hypercalls to be inserted.
> * Various TLB flush operations, again to allow their replacement by
> hypercalls.
>
> As well as these the interface includes some higher-level operations
> which allow for more efficient batching of compound operations such as
> duplicating (forking) a memory map. This is achieved by using the
> ''lazy_mmu_ops'' hooks to implement buffering of operations
> and flushing of larger batches or upon completion.
>
> The Xen paravirt_ops backend uses an additional page flag,
> ''PG_pinned'' in order to track whether a page has been pinned or not
> and implemented the late-pin early-unpin scheme described above.
>
> === Start of Day issues ===
>
> XXX get someone to describe these...
>
> = References =
>
> [XXX Xen and the art of virtualisation.]
> [XXX The hypercall interface documentation.]
> [XXX others? Chisnal Book?]
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |