Re: How does shadow page table work during migration?
On 22/02/2021 18:52, Kevin Negy wrote:
Your observation about "being like a TLB" is correct. Let's take the simplest case, of 4-on-4 shadows, i.e. Xen and the guest are both in 64bit mode and using 4-level paging.

Each domain also has a structure which Xen calls a P2M, for the guest physical => host physical mappings. (For PV guests, it's actually an identity transform, and for HVM it is a set of EPT or (N)PT pagetables, but the exact structure isn't important here.)

The other primitive required is an emulated pagewalk. I.e. we start at the guest's %cr3 value and walk through the guest's pagetables as hardware would. Each step involves a lookup in the P2M, as the guest PTEs are programmed with guest physical addresses, not host physical.

In reality, we always have a "top level shadow" per vcpu. In this example, it is a level-4 pagetable, which starts out clear (i.e. no guest entries present). We need *something* to point hardware at when we start running the guest.

Once we run the guest, we immediately take a pagefault. We look at %cr2 to find the linear address accessed, and perform a pagewalk. In the common case, we find that the linear address is valid in the guest, so we allocate a level-3 pagetable, again clear, point the appropriate L4e at it, then re-enter the guest. This takes an immediate pagefault again, so we allocate an L2 pagetable, re-enter, then allocate an L1 pagetable, and finally point an L1e at the host physical page. Now we can successfully fetch the instruction (if it doesn't cross a page boundary), then repeat the process for every subsequent memory access.

This example is simplified specifically to demonstrate the point. Everything is driven by pagefaults.

There is of course far more complexity. We typically populate all the way down to an L1e in one go, because this is far more efficient than taking 4 real pagefaults. If we walk the guest pagetables and find a violation, we have to hand #PF back to the guest kernel rather than change the shadows. To emulate dirty bits correctly, we need to leave the shadow read-only even if the guest PTE was read/write, so we can spot when hardware tries to set the D bit in the shadows and copy it back into the guest's view. Superpages are complicated to deal with (we have to splinter them into 4k pages), and 2-on-3 (a legacy 32bit OS with non-PAE paging) is a total nightmare because of the different format of pagetable entries.

Notice also that a guest TLB flush is implemented as "drop all shadows under this virtual cr3".
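To make the fault-driven flow concrete, here is a heavily simplified sketch in C. None of the helper names are Xen's real internals (the real shadow fault handler is far hairier); it only shows the shape: emulated guest walk, P2M lookup, then populating the shadow hierarchy down to the L1e.

/*
 * Hypothetical, heavily simplified sketch of the fault-driven shadow fill
 * described above.  guest_pagewalk(), p2m_lookup(), shadow_get_or_alloc()
 * and inject_pf_to_guest() are illustrative names, not Xen's real code.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Result of emulating the guest's own 4-level pagewalk from its %cr3. */
struct guest_walk {
    uint64_t gfn;        /* guest physical frame the linear address maps to */
    bool     writable;   /* guest PTE allows writes */
};

extern bool guest_pagewalk(uint64_t guest_cr3, uint64_t linear,
                           struct guest_walk *gw);       /* emulated walk */
extern uint64_t p2m_lookup(uint64_t gfn);                /* gfn -> mfn */
extern pte_t *shadow_get_or_alloc(pte_t *table, unsigned int slot);
extern void inject_pf_to_guest(uint64_t linear, uint32_t error_code);

/* Called on a real #PF while the guest runs under shadow paging. */
void shadow_page_fault(pte_t *top_shadow, uint64_t guest_cr3,
                       uint64_t cr2, uint32_t error_code)
{
    struct guest_walk gw;

    /* 1. Walk the guest's pagetables, translating each step via the P2M. */
    if ( !guest_pagewalk(guest_cr3, cr2, &gw) )
    {
        /* The violation is the guest's own: hand #PF back to its kernel. */
        inject_pf_to_guest(cr2, error_code);
        return;
    }

    /* 2. Populate the shadow hierarchy all the way down to L1 in one go. */
    pte_t *l3 = shadow_get_or_alloc(top_shadow, (cr2 >> 39) & 0x1ff);
    pte_t *l2 = shadow_get_or_alloc(l3,         (cr2 >> 30) & 0x1ff);
    pte_t *l1 = shadow_get_or_alloc(l2,         (cr2 >> 21) & 0x1ff);

    /* 3. Point the L1e at the *host* physical page from the P2M. */
    uint64_t mfn = p2m_lookup(gw.gfn);
    l1[(cr2 >> 12) & 0x1ff] = (mfn << 12) | 0x1 /* Present */
                            | (gw.writable ? 0x2 /* RW */ : 0);
}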
It would be a massive security vulnerability to let PV guests write to their own pagetables. PV guest pagetables are read-only, and all updates are made via hypercall, so they can be audited for safety. (We do actually have pagetable emulation for PV guests, for those which do write to their own pagetables; it feeds into the same logic as the hypercall, but is less efficient overall.)
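For reference, a guest-side PTE update via the mmu_update hypercall looks roughly like this. The wrapper signature is the one the Linux PV code uses; treat the details as illustrative rather than a definitive interface description.

/*
 * Rough sketch: a PV guest kernel updating one of its own PTEs via
 * mmu_update instead of writing the (read-only) pagetable directly.
 */
#include <stdint.h>

#define MMU_NORMAL_PT_UPDATE 0   /* low bits of 'ptr' select the update type */
#define DOMID_SELF           0x7FF0U

struct mmu_update {
    uint64_t ptr;   /* machine address of the PTE to update (+ update type) */
    uint64_t val;   /* desired new PTE contents */
};

/* Hypercall wrapper as exposed to the guest kernel. */
extern int HYPERVISOR_mmu_update(struct mmu_update *req, unsigned int count,
                                 unsigned int *success_count,
                                 uint16_t domid /* domid_t */);

static int set_guest_pte(uint64_t pte_machine_addr, uint64_t new_val)
{
    struct mmu_update u = {
        .ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE,
        .val = new_val,
    };

    /* Xen audits 'val' before applying it, e.g. refusing writable
     * mappings of pagetables or mappings of other domains' memory. */
    return HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
}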
PV guests share an address space with Xen. So actually the top level shadow for a PV guest is pre-populated with Xen's mappings, but all guest entries are faulted in on demand.
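Sketching that initialisation (the L4 slot range is the one reserved for Xen on x86-64 in current headers; treat the exact numbers and helper names as assumptions):

#include <stdint.h>

typedef uint64_t pte_t;
#define FIRST_XEN_L4_SLOT  256   /* assumption: Xen-reserved L4 range */
#define LAST_XEN_L4_SLOT   271

extern pte_t *alloc_shadow_table(void);    /* returns a zeroed 4K table */

pte_t *make_top_level_shadow(const pte_t *xen_idle_l4)
{
    pte_t *shadow_l4 = alloc_shadow_table();

    /* Guest slots start out empty: every guest access faults and is
     * shadowed on demand, exactly as for HVM. */

    /* Xen's slots are copied wholesale, so the hypervisor keeps working
     * while this shadow is loaded in %cr3. */
    for ( unsigned int i = FIRST_XEN_L4_SLOT; i <= LAST_XEN_L4_SLOT; i++ )
        shadow_l4[i] = xen_idle_l4[i];

    return shadow_l4;
}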
The first guest memory access is actually the instruction fetch at %cs:%rip. Once that address is shadowed, you further have to shadow any memory operands (which can be more than one: e.g. `PUSH ptr` has a regular memory operand and an implicit stack operand which needs shadowing, and with the AVX scatter/gather instructions you can have an almost-arbitrary number of memory operands). Also, be very careful with terminology. Linear and virtual addresses are different (they differ by the segment base, which is commonly but not always 0). Lots of Xen code uses va/vaddr when it means linear addresses.
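The distinction is just this (trivial, but worth having in front of you):

/*
 * Linear vs. virtual: the linear address is the effective (virtual)
 * address plus the segment base.  In 64-bit code most segment bases are
 * forced to 0, but %fs/%gs can still be non-zero, so the two terms are
 * not interchangeable.
 */
#include <stdint.h>

static inline uint64_t linear_address(uint64_t segment_base,
                                      uint64_t effective_address)
{
    return segment_base + effective_address;
}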
Yes. This is a combination of the pagewalk and P2M to identify the mfn in question for the linear address, along with suitable allocations/modifications to the shadow pagetables.
Actually, what we do when the VM is in global logdirty mode is to always start by writing all shadow L1e's as read-only, even if the guest has them read/write. This causes all writes to trap with #PF, which lets us see which frame is being written to, and lets us set the appropriate bit in the logdirty bitmap.
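Schematically (not the real shadow/logdirty code), the write-fault path looks like:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;
#define _PAGE_RW 0x2ULL

extern unsigned long *logdirty_bitmap;          /* one bit per guest frame */
extern void set_bit(unsigned long bit, unsigned long *bitmap);

/* Called for a #PF caused by a write hitting a read-only shadow L1e. */
void logdirty_write_fault(pte_t *shadow_l1e, uint64_t gfn,
                          bool guest_pte_writable)
{
    /* Record that this frame is dirty and must be resent in the next
     * round of the live migration. */
    set_bit(gfn, logdirty_bitmap);

    /* Re-grant write access (only if the guest's own PTE allows it) so
     * the guest doesn't fault on every subsequent write to this frame. */
    if ( guest_pte_writable )
        *shadow_l1e |= _PAGE_RW;
}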
Writeability of the guest's actual pagetables is complicated and guest-dependent. Under a strict TLB-like model, it's not actually required to restrict writeability. On real hardware, the TLB is an explicitly non-coherent cache, and software is required to issue a TLB flush to ensure that changes to the PTEs in memory subsequently get propagated into the TLB.
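I.e. the contract software already lives with on bare metal, roughly:

#include <stdint.h>

typedef uint64_t pte_t;

static inline void invlpg(const void *linear)
{
    asm volatile ( "invlpg (%0)" :: "r" (linear) : "memory" );
}

static void update_pte(pte_t *pte, pte_t new_val, const void *linear)
{
    *pte = new_val;     /* The TLB may still hold the stale translation... */
    invlpg(linear);     /* ...until software explicitly invalidates it.    */
}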
Changing processes involves writing to %cr3, which is a TLB flush, so in a strict TLB-like model, all shadows must be dropped. In reality, this is where we start using restricted writeability to our advantage. If we know that no writes to pagetables happened, we know "the TLB" (== the currently established shadows) isn't actually stale, so it may be retained and reused.

We do maintain hash lists of types of pagetable, so we can locate pre-existing shadows of a specific type. This is how we can switch between already-established shadows when the guest changes %cr3. In reality, the kernel half of the virtual address space doesn't change much after boot, so there is a substantial performance win from not dropping and reshadowing these entries. There are loads and loads of L4 pagetables (one per process), all pointing to common L3's which form the kernel half of the address space.

If I'm being honest, this is where my knowledge of exactly what Xen does breaks down - I'm not the author of the shadow code; I've merely debugged it a few times. I hope this is still informative.

~Andrew
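The lookup is conceptually just a hash table keyed by the guest frame being shadowed, something like the following (illustrative only, not Xen's actual hashing scheme or types):

#include <stddef.h>
#include <stdint.h>

#define SHADOW_HASH_BUCKETS 256

struct shadow {
    struct shadow *next;        /* hash-chain link */
    uint64_t       guest_gfn;   /* guest frame being shadowed (e.g. its L4) */
    unsigned int   type;        /* which level/format of shadow this is */
    void          *shadow_table;
};

static struct shadow *shadow_hash[SHADOW_HASH_BUCKETS];

static unsigned int hash(uint64_t gfn, unsigned int type)
{
    return (unsigned int)((gfn ^ type) % SHADOW_HASH_BUCKETS);
}

/* Find an already-established shadow of 'gfn' with the given type, e.g.
 * when the guest writes a %cr3 value we have shadowed before. */
struct shadow *shadow_hash_lookup(uint64_t gfn, unsigned int type)
{
    for ( struct shadow *s = shadow_hash[hash(gfn, type)]; s; s = s->next )
        if ( s->guest_gfn == gfn && s->type == type )
            return s;

    return NULL;    /* no existing shadow: allocate and populate a fresh one */
}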