Re: [Xen-devel] [PATCH 0 of 5] docs: x86 PV MMU related functions

On Fri, 2012-11-02 at 11:18 +0000, Ian Campbell wrote:
> I also have a draft of a wiki article on the subject which references
> the information in the public headers which I hope to post soon. 

I realised I forgot to do this...

It needs some polish, but the majority of the XXX's are placeholders
for links to the result of applying this series.


Paravirtualised X86 Memory Management

= Intro =

One of the original innovations of the Xen hypervisor was the
paravirtualisation of the memory management unit (MMU). This allowed
for fast and efficient virtualisation of Operating Systems which used
paging compared to contemporary techniques.

In this article we will describe the functionality of the PV MMU for
X86 Xen guests. A familiarity with X86 paging and related concepts
will be assumed.

Other guest types, such as HVM or PVH guests on X86 or guests on ARM,
achieve virtualisation of the MMU using other techniques, such as the
use of hardware assisted or shadow paging.

= Direct Paging =

In order to virtualise the memory subsystem all hypervisors introduce
an additional level of abstraction between what the guest sees as
physical memory (pseudo-physical) and the underlying memory of the
machine (called machine addresses in Xen). This is usually done
through the introduction of a physical to machine (P2M)
mapping. Typically this would be maintained within the hypervisor and
hidden from the guest Operating System through techniques such as
Shadow Paging.

The Xen paravirtualised MMU model instead requires that the guest be
aware of the P2M mapping. The guest is modified so that, rather than
writing page table entries which map virtual addresses to the
pseudo-physical address space, it writes entries which map virtual
addresses directly to the machine address space. It does this by
translating pseudo-physical addresses to machine addresses using the
P2M as it writes its page tables. This technique is known as direct
paging.
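
As a rough illustration of the translation step described above, the
sketch below builds a PTE from a pseudo-physical frame number by
looking it up in a P2M array. The table contents, the 'toy_p2m' name
and the helper names are made up for this example. Only the idea
(frame field holds a machine frame, not a pseudo-physical one) comes
from the text.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PTE_PRESENT  0x1ULL   /* x86 PTE bit 0 */
#define PTE_WRITABLE 0x2ULL   /* x86 PTE bit 1 */

/* Toy P2M: the index is a pseudo-physical frame number (PFN), the
 * value is the machine frame number (MFN) backing it. The values
 * are invented for the example. */
#define NR_GUEST_PAGES 4
static const uint64_t toy_p2m[NR_GUEST_PAGES] = {
    0x1a2b, 0x0042, 0x7f00, 0x0913
};

/* Translate a PFN to an MFN using the guest's own P2M table. */
static uint64_t pfn_to_mfn(uint64_t pfn)
{
    assert(pfn < NR_GUEST_PAGES);
    return toy_p2m[pfn];
}

/* Build the PTE value a direct-paging guest would ask Xen to
 * install: the frame field holds a machine frame number. */
static uint64_t make_pte(uint64_t pfn, uint64_t flags)
{
    return (pfn_to_mfn(pfn) << PAGE_SHIFT) | flags;
}
```

For example, 'make_pte(1, PTE_PRESENT | PTE_WRITABLE)' yields a PTE
whose frame field is machine frame 0x42 rather than frame 1.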

= Page Types and Invariants =

In order to ensure that the guest cannot subvert the system Xen
requires that certain invariants are met and therefore that all
updates to the page tables are performed by Xen through the use of
hypercalls.

To this end Xen defines a number of page types and ensures that any
given page has exactly one type at any given time. The type of a page
is reference counted and can only be changed when the "type count" is
zero.

The basic types are:

* None: No special uses.
* Page table page: Pages used as page tables (there are separate types
  for each of the 4 levels on 64 bit and 3 levels on 32 bit PAE).
* Segment descriptor page: Page is used as part of the Global or Local
  Descriptor table (GDT/LDT).
* Writable: Page is writable.

Xen enforces the invariant that only pages with the writable type have
a writable mapping in the page tables. Likewise it ensures that no
writable mapping exists of a page with any other type. It also
enforces other invariants such as requiring that no page table page
can make a non-privileged mapping of the hypervisor's virtual address
space etc. By doing this it can ensure that the guest OS is not able
to directly modify any critical data structures and therefore subvert
the safety of the system, for example to map machine addresses which
do not belong to it.

Whenever a set of page-tables is loaded into the hardware page-table
base register ('cr3') the hypervisor must take an appropriate type
reference with the root page-table type (that is, an L4 reference on
64-bit or an L3 reference on 32-bit). If the page is not already of
the required type then in order to take the initial reference it must
first have a type count of zero (remember, a page's type can only be
changed while the type count is zero) and must be validated to ensure
that it respects the invariants. This in turn means that the pages
referenced by the root page-table must be validated as having the
correct type (i.e. L3 or L2 on 64- or 32-bit respectively), and so on
down to the data pages at the leaves of the page-table, thereby
ensuring that the page table as a whole is safe to load into 'cr3'.
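
The type-count rule can be modelled with a small sketch. This is a
toy model and not Xen's implementation: the 'toy_page' structure, the
'TOY_*' names and both helper functions are hypothetical, and real
validation is reduced to a counter. It only shows that a type can
change while the count is zero, and that a conflicting request is
refused otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the basic page types listed above. */
enum toy_type {
    TOY_NONE, TOY_L1, TOY_L2, TOY_L3, TOY_L4, TOY_SEG_DESC, TOY_WRITABLE
};

struct toy_page {
    enum toy_type type;
    unsigned int  type_count;   /* references held on the current type */
    unsigned int  validations;  /* how often validation has run */
};

/* Take a reference of the given type. The type may only change, and
 * validation only runs, when no references are currently held. */
static bool get_page_type(struct toy_page *pg, enum toy_type type)
{
    if (pg->type_count == 0) {
        pg->type = type;
        pg->validations++;      /* stand-in for checking the invariants */
    } else if (pg->type != type) {
        return false;           /* page is in use with another type */
    }
    pg->type_count++;
    return true;
}

/* Drop a reference previously taken with get_page_type(). */
static void put_page_type(struct toy_page *pg)
{
    assert(pg->type_count > 0);
    pg->type_count--;
}
```

In this model a page holding a writable reference cannot be used as
an L4 table until the writable reference has been dropped.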

XXX link to appropriate header.

In order to maintain the necessary invariants Xen must be involved in
all updates to the page tables, as well as various other privileged
operations. These are covered in the following sections.

In order to prevent guest operating systems from subverting these
mechanisms it is also necessary for guest kernels to run without the
normal privileges associated with running in processor ring-0. For this
reason Xen PV guest kernels usually run in either ring-1 (32-bit
guests) or ring-3 (64-bit guests).

= Updating Page Tables =

Since the page tables are not writable by the guest Xen provides
several mechanisms by which the guest can update a page table entry.

== mmu_update hypercall ==

The first mechanism provided by Xen is the HYPERVISOR_mmu_update
hypercall [XXX link]. This hypercall has the prototype:

  struct mmu_update {
      uint64_t ptr;       /* Machine address of PTE. */
      uint64_t val;       /* New contents of PTE.    */
  };

  long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
                             unsigned count, unsigned *done_out,
                             unsigned foreigndom);

The operation takes an array of 'count' requests 'reqs'. The
'done_out' parameter returns an indication of the number of successful
operations. 'foreigndom' can be used by a suitably privileged domain
to access memory belonging to other domains (this usage is not covered
here).

Each request is a ('ptr','val') pair. The 'ptr' field is further
divided into 'ptr[1:0]', indicating the type of update to perform, and
'ptr[:2]', which indicates the address to update.

The valid values for 'ptr[1:0]' are:

* MMU_NORMAL_PT_UPDATE: A normal page table update. 'ptr[:2]' contains
  the machine address of the entry to update while 'val' is the Page
  Table Entry to write. This effectively implements '*ptr = val' with
  checks to ensure that the required invariants are preserved.
* MMU_MACHPHYS_UPDATE: Update the machine to physical address
  mapping. This is covered below, see [XXX link].
* MMU_PT_UPDATE_PRESERVE_AD: As per MMU_NORMAL_PT_UPDATE but
  preserving the Accessed and Dirty bits in the page table entry. The
  'val' here is almost a standard Page Table Entry but with some
  special handling. See the [XXX link hypercall documentation] for more
  information.
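
The packing of the command into the low two bits of 'ptr' can be
sketched as follows. The struct and the three command values mirror
Xen's public header. The helper functions 'make_req', 'req_cmd' and
'req_addr' are hypothetical names used only for this illustration,
and no hypercall is actually issued.

```c
#include <assert.h>
#include <stdint.h>

/* Command values as defined in Xen's public xen.h header. */
#define MMU_NORMAL_PT_UPDATE      0
#define MMU_MACHPHYS_UPDATE       1
#define MMU_PT_UPDATE_PRESERVE_AD 2

struct mmu_update {
    uint64_t ptr;   /* Machine address of PTE, command in bits 1:0. */
    uint64_t val;   /* New contents of PTE. */
};

/* Build a request: page table entries are naturally aligned, so
 * the low two bits of the address are free to carry the command. */
static struct mmu_update make_req(uint64_t pte_maddr, unsigned int cmd,
                                  uint64_t val)
{
    struct mmu_update req;

    assert((pte_maddr & 3) == 0);
    assert(cmd <= 3);
    req.ptr = pte_maddr | cmd;
    req.val = val;
    return req;
}

/* Recover the command ('ptr[1:0]') from a request. */
static unsigned int req_cmd(const struct mmu_update *req)
{
    return (unsigned int)(req->ptr & 3);
}

/* Recover the address ('ptr[:2]') from a request. */
static uint64_t req_addr(const struct mmu_update *req)
{
    return req->ptr & ~(uint64_t)3;
}
```

A guest would fill an array of such requests and pass it to
HYPERVISOR_mmu_update in a single call.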

== update_va_mapping hypercall ==

The second mechanism provided by Xen is the
HYPERVISOR_update_va_mapping hypercall [XXX link]. This hypercall has
the prototype:

  HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
                               enum update_va_mapping_flags flags)

This operation simply updates the leaf PTE (called an L1 in Xen)
which maps the virtual address 'va' with the given value 'val', while
of course performing the expected checks to ensure that the
invariants are maintained. This can be thought of as updating the PTE
using a [XXX link linear mapping].

The flags parameter can be used to request that Xen flush the TLB
entries associated with the update. See the [XXX link hypercall
documentation for more].
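
To make "the L1 entry which maps 'va'" concrete, the sketch below
computes which slot a 64-bit virtual address selects at each level of
a 4-level x86-64 page table (4KiB pages, 512 entries per table). The
'pt_index' helper is a hypothetical name for this example. It does
not show how Xen itself locates the entry.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define ENTRIES_BITS 9      /* 512 entries per table on x86-64 */

/* Index into the table at 'level' for 'va', where level 1 is the
 * leaf (L1) and level 4 is the root (L4). */
static unsigned int pt_index(uint64_t va, unsigned int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned int)((va >> (PAGE_SHIFT + ENTRIES_BITS * (level - 1)))
                          & 0x1ff);
}
```

For the address 0x201000 this gives slot 1 at both L1 and L2 and slot
0 at L3 and L4.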

== Trap and emulate of page table writes ==

As well as the above Xen can also trap and emulate updates to leaf
page table entries (L1) only. This trapping and emulating is
relatively expensive and is best avoided, but for little-used code
paths it can provide a reasonable trade-off vs. the requirement to
modify the callsite in the guest OS.

= Other privileged operations =

As well as moderating page table updates in order to maintain the
necessary invariants Xen must also be involved in certain other
privileged operations, such as setting a new page table base
('cr3'). Because the guest kernel no longer runs in ring-0 certain
other privileged operations must also be done by the hypervisor, such
as flushing the TLB.

These operations are performed via the HYPERVISOR_mmuext_op hypercall
[XXX link]. This hypercall has the following prototype:

  struct mmuext_op {
      unsigned int cmd; /* => enum mmuext_cmd */
      union {
          xen_pfn_t     mfn;
          unsigned long linear_addr;
      } arg1;
      union {
          /* SET_LDT */
          unsigned int nr_ents;
          const void *vcpumask;
          /* COPY_PAGE */
          xen_pfn_t src_mfn;
      } arg2;
  };

  HYPERVISOR_mmuext_op(struct mmuext_op uops[],
                       unsigned int count,
                       unsigned int *pdone,
                       unsigned int foreigndom);

The hypercall takes an array of 'count' operations each specified by
the 'mmuext_op' struct. This hypercall allows access to various
operations which must be performed via the hypervisor either because
the guest kernel is no longer privileged or because the hypervisor
must be involved in order to maintain safety. In general each
available command corresponds to a low-level processor function. These
include MMUEXT_NEW_BASEPTR (write cr3), various types of TLB and cache
flush, and MMUEXT_SET_LDT to set the LDT table address (see below).
For more information on the available operations please see [XXX link
the hypercall documentation].

= Pinning Page Tables =

As discussed above Xen ensures that various invariants are met
concerning whether certain pages are mapped writable or not. This
in turn means that Xen needs to validate the page tables whenever they
are loaded into 'cr3'. However this is a potentially expensive
operation since Xen needs to walk the complete set of page-tables and
validate each one recursively.

In order to avoid this expense every time 'cr3' changes (i.e. on every
context switch), Xen allows a page to be explicitly ''pinned'' to a
given type. This effectively means taking an extra reference of the
relevant page table type, thereby forcing Xen to validate the
page-table up front and to maintain the invariants for as long as the
pin remains in place. By doing this the guest ensures that when a new
'cr3' is loaded the referenced page already has the appropriate type
(L4 or L3) and therefore the type count can simply be incremented
without the need to validate.
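
The saving from pinning can be seen in a toy model which counts how
often a root page table would be validated across repeated context
switches. Everything here ('toy_root', 'take_ref', 'drop_ref',
'run_switches') is hypothetical and stands in for Xen's real type
reference handling. A pin is modelled simply as one extra reference
held for the lifetime of the page tables.

```c
#include <assert.h>

struct toy_root {
    unsigned int type_count;   /* references of the root page-table type */
    unsigned int validations;  /* number of full page-table walks */
};

/* Take a type reference, validating only if it is the first one. */
static void take_ref(struct toy_root *r)
{
    if (r->type_count == 0)
        r->validations++;      /* first reference: validate the tree */
    r->type_count++;
}

/* Drop a type reference. */
static void drop_ref(struct toy_root *r)
{
    assert(r->type_count > 0);
    r->type_count--;
}

/* Load and unload the table 'switches' times, as repeated context
 * switches to and from this address space would. Returns the total
 * number of validations performed so far. */
static unsigned int run_switches(struct toy_root *r, unsigned int switches)
{
    for (unsigned int i = 0; i < switches; i++) {
        take_ref(r);   /* table loaded into cr3 */
        drop_ref(r);   /* another table loaded in its place */
    }
    return r->validations;
}
```

Without a pin, five switches cost five validations in this model.
With one 'take_ref' up front acting as the pin, the same five
switches cost a single validation.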

For maximum performance a guest OS kernel will usually want to perform
a pin operation as late as possible during the setup of a new set of
page tables, so as to be able to construct them using normal writable
mappings before blessing them as a set of page tables. Likewise on
page-table teardown a guest OS will usually want to unpin the pages as
soon as possible such that it can tear down the page tables without
the use of hypercalls. These operations are usually referred to as
'late pin' and 'early unpin'.

= The Physical-to-machine and machine-to-physical mapping tables =

As discussed above direct paging requires that the guest Operating
System be aware of the mapping between pseudo-physical and machine
addresses (the P2M table). In addition, in order to be able to read
PTE entries (which contain machine addresses) and convert them back
into pseudo-physical addresses, a translation from machine to
pseudo-physical addresses is required. This is done using the M2P
table.

Each table is a simple array of frame numbers, indexed by either
physical or machine frames and looking up the other.
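
The relationship between the two arrays can be sketched with toy
tables. The sizes, contents and the 'INVALID_ENTRY' marker below are
invented for the example, and the M2P is built locally here purely for
illustration rather than being provided by the hypervisor.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ENTRY  (~(uint64_t)0)  /* hypothetical "not ours" marker */
#define NR_GUEST_PAGES 3
#define NR_HOST_PAGES  8

/* P2M: indexed by pseudo-physical frame, yields a machine frame. */
static const uint64_t toy_p2m[NR_GUEST_PAGES] = { 5, 2, 7 };

/* M2P: indexed by machine frame, yields a pseudo-physical frame. */
static uint64_t toy_m2p[NR_HOST_PAGES];

/* Fill the M2P as the inverse of the P2M. Machine frames which do
 * not belong to this guest keep the invalid marker, which is why
 * the table is sparse from the guest's point of view. */
static void build_m2p(void)
{
    for (uint64_t mfn = 0; mfn < NR_HOST_PAGES; mfn++)
        toy_m2p[mfn] = INVALID_ENTRY;
    for (uint64_t pfn = 0; pfn < NR_GUEST_PAGES; pfn++)
        toy_m2p[toy_p2m[pfn]] = pfn;
}

/* Convert a machine frame (e.g. read back from a PTE) to a PFN. */
static uint64_t mfn_to_pfn(uint64_t mfn)
{
    assert(mfn < NR_HOST_PAGES);
    return toy_m2p[mfn];
}
```

After 'build_m2p()', 'mfn_to_pfn(toy_p2m[pfn])' returns 'pfn' for
every guest frame, while most machine frames map to nothing.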

Since the P2M is sized according to the guest's pseudo-physical
address space it is left entirely up to the guest to provide and
maintain in its own pages.

However the M2P must be sized according to the total amount of RAM in
the host and therefore could be of considerable size compared to the
amount of RAM available to the guest, not to mention sparse from the
guest's point of view since the majority of machine pages will not
belong to it.

For this reason Xen exposes a read-only M2P of the entire host to the
guest and allows guests to update this table using the
MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall [XXX
link].

= Descriptor Tables =

As well as protecting page tables from being writable by the guest Xen
also requires that various descriptor tables must be made unavailable
to the guest.

== Interrupt Descriptor Table ==

A Xen guest cannot access the IDT directly. Instead Xen maintains its
own IDT and allows guests to write entries using the
HYPERVISOR_set_trap_table hypercall [XXX link]. This has the
following prototype:

  struct trap_info {
      uint8_t       vector;  /* exception vector */
      uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
      uint16_t      cs;      /* code selector */
      unsigned long address; /* code offset */
  };

  long HYPERVISOR_set_trap_table(const struct trap_info traps[]);

The entries of the ''trap_info'' struct correspond to the fields of a
native IDT entry and each will be validated by Xen before it is
used. The hypercall takes an array of traps terminated by an entry
where ''address'' is zero.
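
The zero-terminated array convention can be illustrated as below. The
struct matches the prototype above. The table contents (selector and
handler addresses) are made-up values and 'trap_table_len' is a
hypothetical helper, shown only to make the termination rule
explicit.

```c
#include <stddef.h>
#include <stdint.h>

struct trap_info {
    uint8_t       vector;  /* exception vector */
    uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
    uint16_t      cs;      /* code selector */
    unsigned long address; /* code offset */
};

/* Example table with invented selector and handler addresses. The
 * final all-zero entry terminates the list. */
static const struct trap_info example_traps[] = {
    {    0, 0, 0x10, 0x1000 },  /* divide error */
    {   14, 0, 0x10, 0x2000 },  /* page fault */
    { 0x80, 3, 0x10, 0x3000 },  /* vector callable from privilege level 3 */
    {    0, 0,    0,      0 },  /* terminator: address == 0 */
};

/* Count the entries before the terminating entry. */
static size_t trap_table_len(const struct trap_info traps[])
{
    size_t n = 0;

    while (traps[n].address != 0)
        n++;
    return n;
}
```

Note that the terminator is recognised by its zero ''address'' alone,
so vector 0 remains usable as a real entry.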

== Global/Local Descriptor Tables ==

A Xen guest is not able to access the Global or Local descriptor
tables directly. Pages which are in use as part of either table are
given their own distinct type and must therefore be mapped as
read-only in the guest. 

The guest is also not privileged to update the descriptor base
registers and must therefore do so using a hypercall. The hypercall to
update the GDT is:

  long HYPERVISOR_set_gdt(const xen_pfn_t frames[], unsigned int

This takes an array of machine frame numbers which are validated and
loaded into the virtual GDTR. Note that unlike native X86 these are
machine frames and not virtual addresses. These frames will be mapped
by Xen into the virtual address range which it reserves for this
purpose.

The LDT is set using the MMUEXT_SET_LDT sub-op of the
HYPERVISOR_mmuext_op hypercall. [XXX link.] XXX a single page?

Finally since the pages cannot be mapped as writable by the guest the
HYPERVISOR_update_descriptor hypercall is provided:

  long HYPERVISOR_update_descriptor(u64 pa, u64 desc);

It takes a machine physical address of a descriptor entry to update
and the requested contents of the descriptor itself, in the same
format as the native descriptors.

= Start Of Day = 

The initial boot time environment of a Xen PV guest is somewhat
different to the normal initial mode of an X86 processor. Rather than
starting out in 16-bit mode with paging disabled a PV guest is
started in either 32- or 64-bit mode with paging enabled, running on
an initial set of page tables provided by the hypervisor. These pages
will be set up so as to meet the required invariants and will be
loaded into the 'cr3' register but will not be explicitly pinned (in
other words their type count is effectively one).

The initial virtual and pseudo-physical layout of a new guest is
described in XXX

= Virtual Address Space =

Xen enforces certain restrictions on the virtual addresses which are
available to PV guests. These are enforced as part of the machinery for
typing and writing page tables.

Xen uses this to reserve certain addresses for its own use. Certain
areas are also read-only for guests and contain shared data structures
such as the Machine-to-physical address lookup table.

For a 64-bit guest the virtual address space is set out as follows:

0x0000000000000000-0x00007fffffffffff Fully available to guests
0x0000800000000000-0xffff7fffffffffff Inaccessible (addresses are
                                      48-bit sign extended)
0xffff800000000000-0xffff807fffffffff Read only to guests
0xffff808000000000-0xffff87ffffffffff Reserved for Xen use
0xffff880000000000-0xffffffffffffffff Fully available to guests

For 32-bit guests running on a 64-bit hypervisor the virtual address
space under 4G (which is all such guests can access) is:

0x00000000-0xf57fffff Fully available to guests
0xf5800000-0xffffffff Read only to guests

For more information see "Memory Layout" under [XXX link].
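
The 64-bit layout above can be expressed as a simple classification
function. The boundaries are taken directly from the table. The enum
and the 'classify_va64' name are hypothetical and exist only for this
sketch.

```c
#include <stdint.h>

enum va_class {
    VA_GUEST,            /* fully available to guests */
    VA_NON_CANONICAL,    /* inaccessible: not a sign-extended 48-bit address */
    VA_GUEST_READ_ONLY,  /* read only to guests */
    VA_XEN_RESERVED      /* reserved for Xen use */
};

/* Classify a 64-bit PV guest virtual address according to the
 * layout table, checking each range's upper bound in order. */
static enum va_class classify_va64(uint64_t va)
{
    if (va <= 0x00007fffffffffffULL)
        return VA_GUEST;
    if (va <= 0xffff7fffffffffffULL)
        return VA_NON_CANONICAL;
    if (va <= 0xffff807fffffffffULL)
        return VA_GUEST_READ_ONLY;
    if (va <= 0xffff87ffffffffffULL)
        return VA_XEN_RESERVED;
    return VA_GUEST;
}
```

For instance 0xffff800000000000 falls in the read-only region while
0xffff880000000000 is the first upper-half address fully available to
the guest.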

= Batching =

For some memory management operations the overhead of making many
hypercalls can become prohibitively expensive. For this reason many of
the hypercalls described above take a list of operations to
perform. In addition Xen provides the concept of a multicall which can
allow several different hypercalls to be batched
together. HYPERVISOR_multicall has this prototype:

  struct multicall_entry {
      unsigned long op, result;
      unsigned long args[6];
  };
  typedef struct multicall_entry multicall_entry_t;

  long HYPERVISOR_multicall(multicall_entry_t call_list[],
                            unsigned int nr_calls);

Each entry represents a hypercall and its associated arguments in the
(hopefully) obvious way.
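
A toy dispatcher makes the shape of a multicall batch concrete. The
struct matches the prototype above, but the operation numbers, the
handlers and 'toy_multicall' itself are hypothetical stand-ins. Real
entries would carry hypercall numbers and their arguments, and the
dispatch would happen inside the hypervisor.

```c
#include <stddef.h>

struct multicall_entry {
    unsigned long op, result;
    unsigned long args[6];
};

/* Hypothetical operation numbers, for this sketch only. */
enum { TOY_OP_ADD = 0, TOY_OP_MUL = 1, TOY_NR_OPS = 2 };

typedef unsigned long (*toy_handler_t)(const unsigned long args[6]);

static unsigned long toy_add(const unsigned long a[6]) { return a[0] + a[1]; }
static unsigned long toy_mul(const unsigned long a[6]) { return a[0] * a[1]; }

static const toy_handler_t toy_handlers[TOY_NR_OPS] = { toy_add, toy_mul };

/* Stand-in for the hypervisor side: run each entry in order and
 * store its return value in 'result'. Returns 0 on success or -1
 * if an entry names an unknown operation. */
static long toy_multicall(struct multicall_entry calls[],
                          unsigned int nr_calls)
{
    for (unsigned int i = 0; i < nr_calls; i++) {
        if (calls[i].op >= (unsigned long)TOY_NR_OPS)
            return -1;
        calls[i].result = toy_handlers[calls[i].op](calls[i].args);
    }
    return 0;
}
```

The point of the design is that one transition into the hypervisor
services the whole array, with each entry's outcome reported
individually through its 'result' field.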

= Guest Specific Details =

== Linux paravirt_ops ==

=== General PV MMU operation ===

The Linux ''paravirt_ops'' infrastructure provides a mechanism by
which the low-level MMU operations are abstracted into function
pointers, allowing the native operations to be overridden where
necessary.

From the point of view of MMU operations the main entry point is
''struct pv_mmu_ops''. This contains entry points for low level
operations such as:

 * Allocating/freeing page table entries. These allow the kernel to
   mark the pages read-only and read-write as the pages are reused.
 * Creating, writing and reading PTE entries. These allow the kernel
   to make the necessary translations between pseudo-physical and
   machine addressing as well as using hypercalls instead of direct
   writes.
 * Reading and writing of control registers, e.g. cr3, to allow
   hypercalls to be inserted.
 * Various TLB flush operations, again to allow their replacement by
   hypercalls.

As well as these the interface includes some higher-level operations
which allow for more efficient batching of compound operations such as
duplicating (forking) a memory map. This is achieved by using the
''lazy_mmu_ops'' hooks to implement buffering of operations
and flushing them in larger batches or upon completion.
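
The buffering idea can be sketched as a small queue which is flushed
when it fills up or when the caller has finished. This is not the
Linux implementation. 'toy_batch', 'batch_queue', 'batch_flush' and
the batch size are all hypothetical, and a flush here merely counts
what a real backend would submit in one hypercall.

```c
#include <stdint.h>

#define BATCH_MAX 4   /* arbitrary batch size for the example */

struct toy_update {
    uint64_t ptr, val;
};

struct toy_batch {
    struct toy_update pending[BATCH_MAX];
    unsigned int nr_pending;
    unsigned int nr_flushes;   /* each flush stands in for one hypercall */
    unsigned int nr_applied;   /* total updates submitted so far */
};

/* Submit everything queued so far as a single batch. */
static void batch_flush(struct toy_batch *b)
{
    if (b->nr_pending == 0)
        return;
    b->nr_applied += b->nr_pending;
    b->nr_pending = 0;
    b->nr_flushes++;
}

/* Queue one update, flushing first if the buffer is already full. */
static void batch_queue(struct toy_batch *b, uint64_t ptr, uint64_t val)
{
    if (b->nr_pending == BATCH_MAX)
        batch_flush(b);
    b->pending[b->nr_pending].ptr = ptr;
    b->pending[b->nr_pending].val = val;
    b->nr_pending++;
}
```

Queueing ten updates and then calling 'batch_flush' once at the end
results in three flushes in this model instead of ten individual
calls.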

The Xen paravirt_ops backend uses an additional page flag,
''PG_pinned'', in order to track whether a page has been pinned or not
and to implement the late-pin early-unpin scheme described above.

=== Start of Day issues ===

XXX get someone to describe these...

= References =

[XXX Xen and the art of virtualisation.]
[XXX The hypercall interface documentation.]
[XXX others? Chisnal Book?]
