Xen project Mailing List

Re: [Xen-devel] [PATCH 0 of 5] docs: x86 PV MMU related functions

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

From: Pasi Kärkkäinen <pasik@xxxxxx>

Date: Sun, 18 Nov 2012 23:02:14 +0200

Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Sun, 18 Nov 2012 21:02:54 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Fri, Nov 16, 2012 at 04:55:13PM +0000, Ian Campbell wrote:
> On Fri, 2012-11-02 at 11:18 +0000, Ian Campbell wrote:
> >
> > I also have a draft of a wiki article on the subject which references
> > the information in the public headers which I hope to post soon.
>
> I realised I forgot to do this...
>
> It needs some polish but the majority of the XXX's are placeholders for
> links to the result of applying this series.
>

Hello,

Comments about some small typos..

> 8<------------------------------------
>
> Paravirtualised X86 Memory Management
>
> = Intro =
>
> One of the original innovations of the Xen hypervisor was the a
                                                            ^^^^^
"was the a paravirtualisation". Extra "a" ?

> paravirtualisation of the memory management unit (MMU). This allowed
> for fas and efficient virtualisation of Operating Systems which used
      ^^
"fast".

> paging compared to contemporary techniques.
>
> In this article we will describe the functionality of the PV MMU for
> X86 Xen guests. A familiarity with X86 paging and related concepts
> will be assumed.
>
> Other guest types, such as HVM or PVH guests on X86 or guests on ARM
> achieve virtualisation of the MMU usaing other techniques, such as the
                                    ^^^^^^
"using".

> use of hardware assisted or shadow paging.
>
> = Direct Paging =
>
> In order to virtualised the memory subsystem all hypervisors introduce
                      ^^^
"to virtualise" ?

-- Pasi

> an additional level of abstraction between what the guest sees as
> physical memory (pseudo-physical) and the underlying memory of the
> machine (called machine addresses in Xen). This is usually done
> through the introduction of a physical to machine (P2M)
> mapping. Typically this would be maintained within the hypervisor and
> hidden from the guest Operating System through techniques such as
> Shadow Paging.
>
> The Xen paravirtualised MMU model instead requires that the guest be
> aware of the P2M mapping and be modified such that, instead of writing
> page table entries mapping virtual addresses to the physical address
> space, it writes entries mapping virtual addresses directly to the
> machine address space, translating from pseudo-physical to machine
> addresses using the P2M as it writes its page tables. This technique
> is known as direct paging.
>
> = Page Types and Invariants =
>
> In order to ensure that the guest cannot subvert the system Xen
> requires that certain invariants are met and therefore that all
> updates to the page tables are performed by Xen through the use
> of hypercalls.
>
> To this end Xen defines a number of page types and ensures that any
> given page has exactly one type at any given time. The type of a page
> is reference counted and can only be changed when the "type count" is
> zero.
>
> The basic types are:
>
>  * None: No special uses.
>  * Page table page: Pages used as page tables (there are separate types
>    for each of the 4 levels on 64 bit and 3 levels on 32 bit PAE
>    guests).
>  * Segment descriptor page: Page is used as part of the Global or Local
>    Descriptor table (GDT/LDT).
>  * Writable: Page is writable.
>
> Xen enforces the invariant that only pages with the writable type have
> a writable mapping in the page tables. Likewise it ensures that no
> writable mapping exists of a page with any other type. It also
> enforces other invariants such as requiring that no page table page
> can make a non-privileged mapping of the hypervisor's virtual address
> space etc. By doing this it can ensure that the guest OS is not able
> to directly modify any critical data structures and therefore subvert
> the safety of the system, for example to map machine addresses which
> do not belong to it.
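
A small example might help readers here. Below is a minimal sketch of
the "one type at a time, retype only at count zero" rule, assuming a
simplified per-page record. The names (struct sketch_page,
sketch_get_type, sketch_put_type) are hypothetical and are not Xen's
internal data structures; validation of the page contents is only
hinted at in a comment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page types mirroring the list in the draft. */
enum sketch_page_type {
    PGT_SKETCH_NONE,
    PGT_SKETCH_L1, PGT_SKETCH_L2, PGT_SKETCH_L3, PGT_SKETCH_L4,
    PGT_SKETCH_SEG_DESC,
    PGT_SKETCH_WRITABLE
};

/* Hypothetical per-page bookkeeping: one type plus a reference count. */
struct sketch_page {
    enum sketch_page_type type;
    unsigned long type_count;
};

/* Take a type reference. A page may only change type while nobody
 * holds a reference on its current type. */
static bool sketch_get_type(struct sketch_page *pg, enum sketch_page_type want)
{
    if (pg->type != want) {
        if (pg->type_count != 0)
            return false;   /* still in use with a different type */
        /* A real hypervisor would validate the page contents here
         * before accepting the new type. */
        pg->type = want;
    }
    pg->type_count++;
    return true;
}

/* Drop a type reference previously taken with sketch_get_type(). */
static void sketch_put_type(struct sketch_page *pg)
{
    assert(pg->type_count > 0);
    pg->type_count--;
}
```

With this, a page that currently holds a writable reference cannot be
turned into an L1 page table until the writable reference is dropped,
which is the invariant the section describes.
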
>
> Whenever a set of page-tables is loaded into the hardware page-table
> base register ('cr3') the hypervisor must take an appropriate type
> reference with the root page-table type (that is, an L4 reference on
> 64-bit or an L3 reference on 32-bit). If the page is not already of
> the required type then in order to take the initial reference it must
> first have a type count of zero (remember, a page's type can only be
> changed while the type count is zero) and must be validated to ensure
> that it respects the invariants. This in turn means that the pages
> referenced by the root page-table must be validated as having the
> correct type (i.e. L3 or L2 on 64- or 32-bit respectively), and so on
> down to the data pages at the leaves of the page-table, thereby
> ensuring that the page table as a whole is safe to load into 'cr3'.
>
> XXX link to appropriate header.
>
> In order to maintain the necessary invariants Xen must be involved in
> all updates to the page tables, as well as various other privileged
> operations. These are covered in the following sections.
>
> In order to prevent guest operating systems from subverting these
> mechanisms it is also necessary for guest kernels to run without the
> normal privileges associated with running in processor ring-0. For this
> reason Xen PV guest kernels usually run in either ring-1 (32-bit
> guests) or ring-3 (64-bit guests).
>
> = Updating Page Tables =
>
> Since the page tables are not writable by the guest Xen provides
> several mechanisms by which the guest can update a page table entry.
>
> == mmu_update hypercall ==
>
> The first mechanism provided by Xen is the HYPERVISOR_mmu_update
> hypercall [XXX link]. This hypercall has the prototype:
>
>     struct mmu_update {
>         uint64_t ptr;       /* Machine address of PTE. */
>         uint64_t val;       /* New contents of PTE.    */
>     };
>
>     long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
>                                unsigned count, unsigned *done_out,
>                                unsigned foreigndom)
>
> The operation takes an array of 'count' requests 'reqs'. The
> 'done_out' parameter returns an indication of the number of successful
> operations. 'foreigndom' can be used by a suitably privileged domain
> to access memory belonging to other domains (this usage is not covered
> here).
>
> Each request is a ('ptr','val') pair. The 'ptr' field is further
> divided into 'ptr[1:0]' indicating the type of update to perform and
> 'ptr[:2]' which indicates the address to update.
>
> The valid values for 'ptr[1:0]' are:
>
>  * MMU_NORMAL_PT_UPDATE: A normal page table update. 'ptr[:2]' contains
>    the machine address of the entry to update while 'val' is the Page
>    Table Entry to write. This effectively implements '*ptr = val' with
>    checks to ensure that the required invariants are preserved.
>  * MMU_MACHPHYS_UPDATE: Update the machine to physical address
>    mapping. This is covered below, see [XXX link]
>  * MMU_PT_UPDATE_PRESERVE_AD: As per MMU_NORMAL_PT_UPDATE but
>    preserving the Accessed and Dirty bits in the page table entry. The
>    'val' here is almost a standard Page Table Entry but with some
>    special handling. See the [XXX link hypercall documentation] for more
>    information.
>
> == update_va_mapping hypercall ==
>
> The second mechanism provided by Xen is the
> HYPERVISOR_update_va_mapping hypercall [XXX link]. This hypercall has
> the prototype:
>
>     long
>     HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
>                                  enum update_va_mapping_flags flags)
>
> This operation simply updates the leaf PTE entry (called an L1 in
> Xen) which maps the virtual address 'va' with the given value
> 'val', while of course performing the expected checks to ensure that
> the invariants are maintained. This can be thought of as updating the
> PTE using a [XXX link linear mapping].
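
The ('ptr','val') encoding for mmu_update described above might be
easier to follow with a worked example. The following is a rough
sketch, not code from any guest: make_pt_update is a hypothetical
helper, the P2M is modelled as a plain array, and the numeric values
given to the MMU_* constants are what I believe the public header uses
and should be checked against it. It builds one request by translating
a pseudo-physical frame through the P2M (direct paging) and folding the
command into the low two bits of 'ptr'.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against the public Xen headers. */
#define MMU_NORMAL_PT_UPDATE      0
#define MMU_MACHPHYS_UPDATE       1
#define MMU_PT_UPDATE_PRESERVE_AD 2

#define SKETCH_PAGE_SHIFT 12   /* 4K pages */

struct mmu_update {
    uint64_t ptr;   /* Machine address of PTE, command in bits [1:0]. */
    uint64_t val;   /* New contents of PTE. */
};

/* Hypothetical helper: build a request which points the PTE at machine
 * address 'pte_maddr' to the machine frame backing pseudo-physical
 * frame 'pfn'. 'pte_flags' are the low PTE bits (present, writable...). */
static struct mmu_update make_pt_update(uint64_t pte_maddr, uint64_t pfn,
                                        const uint64_t *p2m,
                                        uint64_t pte_flags, unsigned cmd)
{
    struct mmu_update req;
    uint64_t mfn;

    assert((pte_maddr & 3) == 0);   /* low bits must be free for cmd */
    assert(cmd <= 3);

    mfn = p2m[pfn];                 /* pseudo-physical -> machine */
    req.ptr = pte_maddr | cmd;
    req.val = (mfn << SKETCH_PAGE_SHIFT) | pte_flags;
    return req;
}
```

An array of such requests is what would then be handed to
HYPERVISOR_mmu_update in a single call.
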
>
> The flags parameter can be used to request that Xen flush the TLB
> entries associated with the update. See the [XXX link hypercall
> documentation for more].
>
> == Trap and emulate of page table writes ==
>
> As well as the above Xen can also trap and emulate writes, but only
> to leaf (L1) page table entries. This trapping and emulating is
> relatively expensive and is best avoided, but for little-used code
> paths it can provide a reasonable trade-off vs. the requirement to
> modify the callsite in the guest OS.
>
> = Other privileged operations =
>
> As well as moderating page table updates in order to maintain the
> necessary invariants Xen must also be involved in certain other
> privileged operations, such as setting a new page table base
> ('cr3'). Because the guest kernel no longer runs in ring-0 certain
> other privileged operations must also be done by the hypervisor, such
> as flushing the TLB.
>
> These operations are performed via the HYPERVISOR_mmuext_op hypercall
> [XXX link]. This hypercall has the following prototype:
>
>     struct mmuext_op {
>         unsigned int cmd; /* => enum mmuext_cmd */
>         union {
>             /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
>              * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
>             xen_pfn_t mfn;
>             /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
>             unsigned long linear_addr;
>         } arg1;
>         union {
>             /* SET_LDT */
>             unsigned int nr_ents;
>             /* TLB_FLUSH_MULTI, INVLPG_MULTI */
>             const void *vcpumask;
>             /* COPY_PAGE */
>             xen_pfn_t src_mfn;
>         } arg2;
>     };
>
>     long
>     HYPERVISOR_mmuext_op(struct mmuext_op uops[],
>                          unsigned int count,
>                          unsigned int *pdone,
>                          unsigned int foreigndom)
>
> The hypercall takes an array of 'count' operations each specified by
> the 'mmuext_op' struct.
> This hypercall allows access to various
> operations which must be performed via the hypervisor either because
> the guest kernel is no longer privileged or because the hypervisor
> must be involved in order to maintain safety. In general each available
> command corresponds to a low-level processor function. These include
> NEW_BASEPTR (write cr3), various types of TLB and cache flush and
> setting the LDT address (see below). For more information on the
> available operations please see [XXX link the hypercall
> documentation].
>
> = Pinning Page Tables =
>
> As discussed above Xen ensures that various invariants are met
> concerning whether certain pages are mapped writable or not. This
> in turn means that Xen needs to validate the page tables whenever they
> are loaded into 'cr3'. However this is a potentially expensive
> operation since Xen needs to walk the complete set of page-tables and
> validate each one recursively.
>
> In order to avoid this expense every time 'cr3' changes (i.e. on every
> context switch), Xen allows a page to be explicitly ''pinned'' to a
> given type. This effectively means taking an extra reference of the
> relevant page table type, thereby forcing Xen to validate the
> page-table up front and to maintain the invariants for as long as the
> pin remains in place. By doing this the guest ensures that when a new
> 'cr3' is loaded the referenced page already has the appropriate type
> (L4 or L3) and therefore the type count can simply be incremented
> without the need to validate.
>
> For maximum performance a guest OS kernel will usually want to perform
> a pin operation as late as possible during the setup of a new set of
> page tables, so as to be able to construct them using normal writable
> mappings before blessing them as a set of page tables. Likewise on
> page-table teardown a guest OS will usually want to unpin the pages as
> soon as possible such that it can tear down the page tables without the
> use of hypercalls.
> These operations are usually referred to as 'late
> pin' and 'early unpin'.
>
> = The Physical-to-machine and machine-to-physical mapping tables =
>
> As discussed above direct paging requires that the guest Operating
> System be aware of the mapping between pseudo-physical and machine
> addresses (the P2M table). In addition, in order to be able to read PTE
> entries (which contain machine addresses) and convert them back into
> pseudo-physical addresses, a translation from machine to
> pseudo-physical addresses is required; this is done using the M2P
> table.
>
> Each table is a simple array of frame numbers, indexed by either
> physical or machine frames and looking up the other.
>
> Since the P2M is sized according to the guest's pseudo-physical
> address space it is left entirely up to the guest to provide and
> maintain in its own pages.
>
> However the M2P must be sized according to the total amount of RAM in
> the host and therefore could be of considerable size compared to the
> amount of RAM available to the guest, not to mention sparse from the
> guest's point of view since the majority of machine pages will not
> belong to it.
>
> For this reason Xen exposes a read-only M2P of the entire host to the
> guest and allows guests to update this table using the
> MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall [XXX
> link].
>
> = Descriptor Tables =
>
> As well as protecting page tables from being writable by the guest Xen
> also requires that various descriptor tables must be made unavailable
> to the guest.
>
> == Interrupt Descriptor Table ==
>
> A Xen guest cannot access the IDT directly. Instead Xen maintains its
> own IDT and allows guests to write entries using the
> HYPERVISOR_set_trap_table hypercall. This has the following prototype:
> XXX link.
>
>     struct trap_info {
>         uint8_t       vector;  /* exception vector                             */
>         uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable? */
>         uint16_t      cs;      /* code selector                                */
>         unsigned long address; /* code offset                                  */
>     };
>     long HYPERVISOR_set_trap_table(const struct trap_info traps[]);
>
> The entries of the ''trap_info'' struct correspond to the fields of a
> native IDT entry and each will be validated by Xen before it is
> used. The hypercall takes an array of traps terminated by an entry
> where ''address'' is zero.
>
> == Global/Local Descriptor Tables ==
>
> A Xen guest is not able to access the Global or Local descriptor
> tables directly. Pages which are in use as part of either table are
> given their own distinct type and must therefore be mapped as
> read-only in the guest.
>
> The guest is also not privileged to update the descriptor base
> registers and must therefore do so using a hypercall. The hypercall to
> update the GDT is:
>
>     long HYPERVISOR_set_gdt(const xen_pfn_t frames[],
>                             unsigned int entries);
>
> This takes an array of machine frame numbers which are validated and
> loaded into the virtual GDTR. Note that unlike native X86 these are
> machine frames and not virtual addresses. These frames will be mapped
> by Xen into the virtual address which it reserves for this purpose.
>
> The LDT is set using the MMUEXT_SET_LDT sub-op of the
> HYPERVISOR_mmuext_op hypercall. [XXX link.] XXX a single page?
>
> Finally since the pages cannot be mapped as writable by the guest the
> HYPERVISOR_update_descriptor hypercall is provided:
>
>     long HYPERVISOR_update_descriptor(u64 pa, u64 desc);
>
> It takes a machine physical address of a descriptor entry to update
> and the requested contents of the descriptor itself, in the same
> format as the native descriptors.
>
> = Start Of Day =
>
> The initial boot time environment of a Xen PV guest is somewhat
> different to the normal initial mode of an X86 processor.
> Rather than
> starting out in 16-bit mode with paging disabled a PV guest is
> started in either 32- or 64-bit mode with paging enabled, running on
> an initial set of page tables provided by the hypervisor. These pages
> will be set up so as to meet the required invariants and will be loaded
> into the 'cr3' register but will not be explicitly pinned (in other
> words their type count is effectively one).
>
> The initial virtual and pseudo-physical layout of a new guest is
> described in XXX
> file:///home/ijc/devel/xen-unstable.hg/docs/html/hypercall/include,public,xen.h.html#incontents_startofday
>
> = Virtual Address Space =
>
> Xen enforces certain restrictions on the virtual addresses which are
> available to PV guests. These are enforced as part of the machinery for
> typing and writing page tables.
>
> Xen uses this to reserve certain addresses for its own use. Certain
> areas are also read-only for guests and contain shared data structures
> such as the machine-to-physical address lookup table.
>
> For a 64-bit guest the virtual address space is set out as follows:
>
>   0x0000000000000000-0x00007fffffffffff Fully available to guests
>   0x0000800000000000-0xffff7fffffffffff Inaccessible (addresses are 48-bit
>                                         sign extended)
>   0xffff800000000000-0xffff807fffffffff Read only to guests.
>   0xffff808000000000-0xffff87ffffffffff Reserved for Xen use
>   0xffff880000000000-0xffffffffffffffff Fully available to guests
>
> For 32-bit guests running on a 64-bit hypervisor the virtual
> address space under 4G (which is all such guests can access) is:
>
>   0x00000000-0xf57fffff Fully available to guests
>   0xf5800000-0xffffffff Read only to guests.
>
> For more information see "Memory Layout" under [XXX link
> xen/include/asm-x86/config.h]
>
> = Batching =
>
> For some memory management operations the overhead of making many
> hypercalls can become prohibitively expensive. For this reason many of
> the hypercalls described above take a list of operations to
> perform.
> In addition Xen provides the concept of a multicall which can
> allow several different hypercalls to be batched
> together. HYPERVISOR_multicall has this prototype:
>
>     struct multicall_entry {
>         unsigned long op, result;
>         unsigned long args[6];
>     };
>     long HYPERVISOR_multicall(multicall_entry_t call_list[],
>                               unsigned int nr_calls);
>
> Each entry represents a hypercall and its associated arguments in the
> (hopefully) obvious way.
>
> = Guest Specific Details =
>
> == Linux paravirt_ops ==
>
> === General PV MMU operation ===
>
> The Linux ''paravirt_ops'' infrastructure provides a mechanism by
> which the low-level MMU operations are abstracted into function
> pointers, allowing the native operations to be overridden where
> necessary.
>
> From the point of view of MMU operations the main entry point is
> ''struct pv_mmu_ops''. This contains entry points for low level
> operations such as:
>
>  * Allocating/freeing page table entries. These allow the kernel to
>    mark the pages read-only and read-write as the pages are reused.
>  * Creating, writing and reading PTE entries. These allow the kernel
>    to make the necessary translations between pseudo-physical and
>    machine addressing as well as using hypercalls instead of direct
>    writes.
>  * Reading and writing of control registers, e.g. cr3, to allow
>    hypercalls to be inserted.
>  * Various TLB flush operations, again to allow their replacement by
>    hypercalls.
>
> As well as these the interface includes some higher-level operations
> which allow for more efficient batching of compound operations such as
> duplicating (forking) a memory map. This is achieved by using the
> ''lazy_mmu_ops'' hooks to implement buffering of operations
> and flushing of larger batches or upon completion.
>
> The Xen paravirt_ops backend uses an additional page flag,
> ''PG_pinned'', in order to track whether a page has been pinned or not
> and to implement the late-pin early-unpin scheme described above.
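
The buffering that the lazy MMU hooks make possible could perhaps be
illustrated with something like the sketch below. This is not the Linux
implementation: struct mmu_batch, batch_queue, batch_flush and the
BATCH_MAX limit are all hypothetical, and the flush callback merely
stands in for HYPERVISOR_mmu_update so that the example can run outside
a guest. Updates are queued until the buffer fills or the caller leaves
lazy mode, at which point one call submits the whole batch.

```c
#include <assert.h>
#include <stdint.h>

struct mmu_update {
    uint64_t ptr;   /* Machine address of PTE plus command bits. */
    uint64_t val;   /* New contents of PTE. */
};

#define BATCH_MAX 32   /* hypothetical buffer size */

/* Stand-in for the hypercall: submits 'count' requests at once. */
typedef int (*flush_fn)(const struct mmu_update *reqs, unsigned count);

struct mmu_batch {
    struct mmu_update reqs[BATCH_MAX];
    unsigned count;
    flush_fn flush;
};

/* Submit everything queued so far (e.g. when leaving lazy mode). */
static int batch_flush(struct mmu_batch *b)
{
    int rc = 0;

    if (b->count != 0) {
        rc = b->flush(b->reqs, b->count);
        b->count = 0;
    }
    return rc;
}

/* Queue one update, flushing first if the buffer is already full.
 * Returns the result of any flush that had to be performed. */
static int batch_queue(struct mmu_batch *b, uint64_t ptr, uint64_t val)
{
    int rc = 0;

    if (b->count == BATCH_MAX)
        rc = batch_flush(b);
    assert(b->count < BATCH_MAX);
    b->reqs[b->count].ptr = ptr;
    b->reqs[b->count].val = val;
    b->count++;
    return rc;
}

/* Test double which only counts what it was asked to submit. */
static unsigned flushed_calls, flushed_reqs;

static int counting_flush(const struct mmu_update *reqs, unsigned count)
{
    (void)reqs;
    flushed_calls++;
    flushed_reqs += count;
    return 0;
}
```

Queueing 70 updates through this buffer results in two full flushes of
32 requests while queueing, plus a final flush of the remaining 6 when
the caller finishes, so three calls instead of 70.
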
>
> === Start of Day issues ===
>
> XXX get someone to describe these...
>
> = References =
>
> [XXX Xen and the art of virtualisation.]
> [XXX The hypercall interface documentation.]
> [XXX others? Chisnall book?]
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
