Re: [Xen-devel] [PATCH 0 of 5] docs: x86 PV MMU related functions
On Fri, 2012-11-02 at 11:18 +0000, Ian Campbell wrote:
> I also have a draft of a wiki article on the subject which references
> the information in the public headers which I hope to post soon.

I realised I forgot to do this... It needs some polish and the majority of the XXX's are placeholders for links to the result of applying this series.

8<------------------------------------

Paravirtualised X86 Memory Management

= Intro =

One of the original innovations of the Xen hypervisor was the paravirtualisation of the memory management unit (MMU). This allowed for fast and efficient virtualisation of Operating Systems which used paging, compared to contemporary techniques. In this article we will describe the functionality of the PV MMU for X86 Xen guests. A familiarity with X86 paging and related concepts is assumed. Other guest types, such as HVM or PVH guests on X86 or guests on ARM, achieve virtualisation of the MMU using other techniques, such as hardware assisted paging or shadow paging.

= Direct Paging =

In order to virtualise the memory subsystem all hypervisors introduce an additional level of abstraction between what the guest sees as physical memory (pseudo-physical) and the underlying memory of the machine (called machine addresses in Xen). This is usually done through the introduction of a physical to machine (P2M) mapping. Typically this would be maintained within the hypervisor and hidden from the guest Operating System through techniques such as Shadow Paging. The Xen paravirtualised MMU model instead requires that the guest be aware of the P2M mapping and be modified such that, instead of writing page table entries mapping virtual addresses to the physical address space, it writes entries mapping virtual addresses directly to the machine address space, translating from pseudo-physical to machine addresses using the P2M as it writes its page tables. This technique is known as direct paging.
= Page Types and Invariants =

In order to ensure that the guest cannot subvert the system, Xen requires that certain invariants are met and therefore that all page table updates are performed by Xen through the use of hypercalls. To this end Xen defines a number of page types and ensures that any given page has exactly one type at any given time. The type of a page is reference counted and can only be changed when the "type count" is zero. The basic types are:

* None: No special uses.
* Page table page: Pages used as page tables (there are separate types for each of the 4 levels on 64-bit and 3 levels on 32-bit PAE guests).
* Segment descriptor page: Page is used as part of the Global or Local Descriptor Table (GDT/LDT).
* Writeable: Page is writable.

Xen enforces the invariant that only pages with the writable type have a writable mapping in the page tables. Likewise it ensures that no writable mapping exists of a page with any other type. It also enforces other invariants, such as requiring that no page table page can make a non-privileged mapping of the hypervisor's virtual address space. By doing this it can ensure that the guest OS is not able to directly modify any critical data structures and thereby subvert the safety of the system, for example by mapping machine addresses which do not belong to it.

Whenever a set of page-tables is loaded into the hardware page-table base register ('cr3') the hypervisor must take an appropriate type reference with the root page-table type (that is, an L4 reference on 64-bit or an L3 reference on 32-bit). If the page is not already of the required type then, in order to take the initial reference, it must first have a type count of zero (remember, a page's type can only be changed while the type count is zero) and must be validated to ensure that it respects the invariants. This in turn means that the pages referenced by the root page-table must be validated as having the correct type (i.e. L3 or L2 on 64- or 32-bit respectively), and so on down to the data pages at the leaves of the page-table, thereby ensuring that the page table as a whole is safe to load into 'cr3'. XXX link to appropriate header.

In order to maintain the necessary invariants Xen must be involved in all updates to the page tables, as well as various other privileged operations. These are covered in the following sections. In order to prevent guest operating systems from subverting these mechanisms it is also necessary for guest kernels to run without the normal privileges associated with running in processor ring-0. For this reason Xen PV guest kernels usually run in either ring-1 (32-bit guests) or ring-3 (64-bit guests).

= Updating Page Tables =

Since the page tables are not writable by the guest, Xen provides several mechanisms by which the guest can update a page table entry.

== mmu_update hypercall ==

The first mechanism provided by Xen is the HYPERVISOR_mmu_update hypercall [XXX link]. This hypercall has the prototype:

 struct mmu_update {
     uint64_t ptr; /* Machine address of PTE. */
     uint64_t val; /* New contents of PTE. */
 };
 long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
                            unsigned count, unsigned *done_out,
                            unsigned foreigndom)

The operation takes an array of 'count' requests 'reqs'. The 'done_out' parameter returns an indication of the number of successful operations. 'foreigndom' can be used by a suitably privileged domain to access memory belonging to other domains (this usage is not covered here).

Each request is a ('ptr','val') pair. The 'ptr' field is further divided into 'ptr[1:0]', indicating the type of update to perform, and 'ptr[:2]', which indicates the address to update. The valid values for 'ptr[1:0]' are:

* MMU_NORMAL_PT_UPDATE: A normal page table update. 'ptr[:2]' contains the machine address of the entry to update while 'val' is the Page Table Entry to write.
This effectively implements '*ptr = val' with checks to ensure that the required invariants are preserved.
* MMU_MACHPHYS_UPDATE: Update the machine to physical address mapping. This is covered below, see [XXX link].
* MMU_PT_UPDATE_PRESERVE_AD: As per MMU_NORMAL_PT_UPDATE but preserving the Accessed and Dirty bits in the page table entry.

The 'val' here is almost a standard Page Table Entry but with some special handling. See the [XXX link hypercall documentation] for more information.

== update_va_mapping hypercall ==

The second mechanism provided by Xen is the HYPERVISOR_update_va_mapping hypercall [XXX link]. This hypercall has the prototype:

 long HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
                                   enum update_va_mapping_flags flags)

This operation simply updates the leaf PTE (called an L1 entry in Xen) which maps the virtual address 'va' with the given value 'val', while of course performing the expected checks to ensure that the invariants are maintained. This can be thought of as updating the PTE using a [XXX link linear mapping]. The flags parameter can be used to request that Xen flush the TLB entries associated with the update. See the [XXX link hypercall documentation] for more.

== Trap and emulate of page table writes ==

As well as the above, Xen can also trap and emulate updates to leaf page table entries (L1) only. This trapping and emulating is relatively expensive and is best avoided, but for little used code paths it can provide a reasonable trade off vs. the requirement to modify the callsite in the guest OS.

= Other privileged operations =

As well as moderating page table updates in order to maintain the necessary invariants, Xen must also be involved in certain other privileged operations, such as setting a new page table base ('cr3'). Because the guest kernel no longer runs in ring-0, certain other privileged operations must also be done by the hypervisor, such as flushing the TLB.
These operations are performed via the HYPERVISOR_mmuext_op hypercall [XXX link]. This hypercall has the following prototype:

 struct mmuext_op {
     unsigned int cmd; /* => enum mmuext_cmd */
     union {
         /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR,
          * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
         xen_pfn_t mfn;
         /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
         unsigned long linear_addr;
     } arg1;
     union {
         /* SET_LDT */
         unsigned int nr_ents;
         /* TLB_FLUSH_MULTI, INVLPG_MULTI */
         const void *vcpumask;
         /* COPY_PAGE */
         xen_pfn_t src_mfn;
     } arg2;
 };
 long HYPERVISOR_mmuext_op(struct mmuext_op uops[],
                           unsigned int count, unsigned int *pdone,
                           unsigned int foreigndom)

The hypercall takes an array of 'count' operations, each specified by a 'mmuext_op' struct. This hypercall allows access to various operations which must be performed via the hypervisor, either because the guest kernel is no longer privileged or because the hypervisor must be involved in order to maintain safety. In general each available command corresponds to a low-level processor function. These include NEW_BASEPTR (write cr3), various types of TLB and cache flush, and setting the LDT address (see below). For more information on the available operations please see [XXX link the hypercall documentation].

= Pinning Page Tables =

As discussed above, Xen ensures that various invariants are met concerning whether certain pages are mapped writable or not. This in turn means that Xen needs to validate the page tables whenever they are loaded into 'cr3'. However this is a potentially expensive operation, since Xen needs to walk the complete set of page-tables and validate each one recursively. In order to avoid this expense every time 'cr3' changes (i.e. on every context switch) Xen allows a page to be explicitly ''pinned'' to a given type.
This effectively means taking an extra reference of the relevant page table type, thereby forcing Xen to validate the page-table up front and to maintain the invariants for as long as the pin remains in place. By doing this the guest ensures that when a new 'cr3' is loaded the referenced page already has the appropriate type (L4 or L3) and therefore the type count can simply be incremented without the need to validate.

For maximum performance a guest OS kernel will usually want to perform a pin operation as late as possible during the setup of a new set of page tables, so as to be able to construct them using normal writable mappings before blessing them as a set of page tables. Likewise on page-table teardown a guest OS will usually want to unpin the pages as soon as possible, such that it can tear down the page tables without the use of hypercalls. These operations are usually referred to as 'late pin' and 'early unpin'.

= The Physical-to-machine and machine-to-physical mapping tables =

As discussed above, direct paging requires that the guest Operating System be aware of the mapping between pseudo-physical and machine addresses (the P2M table). In addition, in order to be able to read PTE entries (which contain machine addresses) and convert them back into pseudo-physical addresses, a translation in the other direction is also needed; this is provided by the M2P table. Each table is a simple array of frame numbers, indexed by either physical or machine frame number and yielding the other.

Since the P2M is sized according to the guest's pseudo-physical address space it is left entirely up to the guest to provide and maintain in its own pages. However the M2P must be sized according to the total amount of RAM in the host and therefore could be of considerable size compared to the amount of RAM available to the guest, not to mention sparse from the guest's point of view since the majority of machine pages will not belong to it.
For this reason Xen exposes a read-only M2P of the entire host to the guest and allows guests to update this table using the MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall [XXX link].

= Descriptor Tables =

As well as protecting page tables from being writable by the guest, Xen also requires that various descriptor tables be made unavailable to the guest.

== Interrupt Descriptor Table ==

A Xen guest cannot access the IDT directly. Instead Xen maintains its own IDT and allows guests to write entries using the HYPERVISOR_set_trap_table hypercall. This has the following prototype: XXX link.

 struct trap_info {
     uint8_t       vector;  /* exception vector                             */
     uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
     uint16_t      cs;      /* code selector                                */
     unsigned long address; /* code offset                                  */
 };
 long HYPERVISOR_set_trap_table(const struct trap_info traps[]);

The entries of the ''trap_info'' struct correspond to the fields of a native IDT entry and each will be validated by Xen before it is used. The hypercall takes an array of traps terminated by an entry where ''address'' is zero.

== Global/Local Descriptor Tables ==

A Xen guest is not able to access the Global or Local Descriptor Tables directly. Pages which are in use as part of either table are given their own distinct type and must therefore be mapped as read-only in the guest. The guest is also not privileged to update the descriptor base registers and must therefore do so using a hypercall.

The hypercall to update the GDT is:

 long HYPERVISOR_set_gdt(const xen_pfn_t frames[], unsigned int entries);

This takes an array of machine frame numbers which are validated and loaded into the virtual GDTR. Note that, unlike native X86, these are machine frames and not virtual addresses. These frames will be mapped by Xen into the virtual address range which it reserves for this purpose.

The LDT is set using the MMUEXT_SET_LDT sub-op of the HYPERVISOR_mmuext_op hypercall. [XXX link.] XXX a single page?
Finally, since these pages cannot be mapped as writable by the guest, the HYPERVISOR_update_descriptor hypercall is provided:

 long HYPERVISOR_update_descriptor(u64 pa, u64 desc);

It takes the machine physical address of a descriptor entry to update and the requested contents of the descriptor itself, in the same format as native descriptors.

= Start Of Day =

The initial boot time environment of a Xen PV guest is somewhat different to the normal initial mode of an X86 processor. Rather than starting out in 16-bit mode with paging disabled, a PV guest is started in either 32- or 64-bit mode with paging enabled, running on an initial set of page tables provided by the hypervisor. These pages will be set up so as to meet the required invariants and will be loaded into the 'cr3' register, but will not be explicitly pinned (in other words their type count is effectively one). The initial virtual and pseudo-physical layout of a new guest is described in XXX file:///home/ijc/devel/xen-unstable.hg/docs/html/hypercall/include,public,xen.h.html#incontents_startofday

= Virtual Address Space =

Xen enforces certain restrictions on the virtual addresses which are available to PV guests. These are enforced as part of the machinery for typing and writing page tables. Xen uses this to reserve certain addresses for its own use. Certain areas are also read-only for guests and contain shared data structures such as the Machine-to-physical address lookup table. For a 64-bit guest the virtual address space is laid out as follows:

 0x0000000000000000-0x00007fffffffffff Fully available to guests
 0x0000800000000000-0xffff7fffffffffff Inaccessible (addresses are 48-bit sign extended)
 0xffff800000000000-0xffff807fffffffff Read only to guests
 0xffff808000000000-0xffff87ffffffffff Reserved for Xen use
 0xffff880000000000-0xffffffffffffffff Fully available to guests

For 32-bit guests running on a 64-bit hypervisor the virtual address space under 4G (which is all such guests can access) is:

 0x00000000-0xf57fffff Fully available to guests
 0xf5800000-0xffffffff Read only to guests

For more information see "Memory Layout" under [XXX link xen/include/asm-x86/config.h].

= Batching =

For some memory management operations the overhead of making many hypercalls can become prohibitively expensive. For this reason many of the hypercalls described above take a list of operations to perform. In addition Xen provides the concept of a multicall, which allows several different hypercalls to be batched together. HYPERVISOR_multicall has this prototype:

 struct multicall_entry {
     unsigned long op, result;
     unsigned long args[6];
 };
 long HYPERVISOR_multicall(multicall_entry_t call_list[],
                           unsigned int nr_calls);

Each entry represents a hypercall and its associated arguments in the (hopefully) obvious way.

= Guest Specific Details =

== Linux paravirt_ops ==

=== General PV MMU operation ===

The Linux ''paravirt_ops'' infrastructure provides a mechanism by which the low-level MMU operations are abstracted into function pointers, allowing them to be replaced with Xen-aware versions while retaining the native operations where appropriate. From the point of view of MMU operations the main entry point is ''struct pv_mmu_ops''. This contains entry points for low level operations such as:

* Allocating/freeing page table pages. These allow the kernel to mark the pages read-only and read-write as the pages are reused.
* Creating, writing and reading PTE entries. These allow the kernel to make the necessary translations between pseudo-physical and machine addressing as well as using hypercalls instead of direct writes.
* Reading and writing of control registers, e.g. cr3, to allow hypercalls to be inserted.
* Various TLB flush operations, again to allow their replacement by hypercalls.
As well as these, the interface includes some higher-level operations which allow for more efficient batching of compound operations such as duplicating (forking) a memory map. This is achieved by using the ''lazy_mmu_ops'' hooks to implement buffering of operations and flushing them in larger batches or upon completion. The Xen paravirt_ops backend uses an additional page flag, ''PG_pinned'', in order to track whether a page has been pinned or not, and implements the late-pin early-unpin scheme described above.

=== Start of Day issues ===

XXX get someone to describe these...

= References =

[XXX Xen and the art of virtualisation.]
[XXX The hypercall interface documentation.]
[XXX others? Chisnall Book?]

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel