
Re: [Xen-devel] PML (Page Modification Logging) design for Xen

On 02/11/2015 07:52 PM, Andrew Cooper wrote:
On 11/02/15 08:28, Kai Huang wrote:
Hi all,

PML (Page Modification Logging) is a new feature on Intel's Broadwell
server platform, intended to reduce the overhead of the dirty logging
mechanism. Below is the design for Xen. Would you help to review it and
give comments?
Thank you for this design.  It is a very good starting point!


Currently, dirty logging is done via write protection: guest memory we
want to log is set read-only, so that when the guest writes to that
memory a write fault (an EPT violation if EPT is in use) occurs, during
which we are able to log the dirty GFN. This mechanism works, but at
the cost of one write fault for each write from the
Strictly speaking, repeated writes to the same gfn after the first fault
are amortised until the logdirty is next queried, which makes typical
access patterns far less costly than a fault for every single write.
Indeed. I do mean first fault here.

PML Introduction

PML is a hardware-assisted, efficient mechanism for dirty logging, built
on EPT. Briefly, PML automatically logs the dirty GPA to a 4K PML
buffer when the CPU changes an EPT entry's D-bit from 0 to 1. To
accomplish this, a new PML buffer base address (machine address), a PML
index, and a new PML buffer full VMEXIT were added to the VMCS.
Initially the PML index can be set to 511 (the 4K buffer holds 512
entries of 8 bytes each) to indicate that the buffer is empty, and the
CPU decrements the PML index by 1 after logging a GPA. Before logging a
GPA, PML checks the PML index to see whether the PML buffer is already
full, in which case a PML buffer full VMEXIT happens, and the VMM
should flush the logged GPAs (to the data structure that keeps dirty
GPAs) and reset the PML index so that further GPAs can be logged.
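
As an illustration of the index handling described above, here is a
minimal Python model of one vcpu's PML buffer. It is a sketch, not Xen
code: the class and method names are hypothetical, and a negative index
stands in for whatever out-of-range value the hardware actually uses to
signal a full buffer.

```python
PML_ENTRIES = 512          # 4K buffer / 8 bytes per logged GPA
PAGE_MASK = ~0xFFF         # logged GPAs are 4K-aligned


class PmlBuffer:
    """Toy model of one vcpu's PML buffer and index (hypothetical, not Xen code)."""

    def __init__(self):
        self.entries = [0] * PML_ENTRIES
        self.index = PML_ENTRIES - 1   # 511 means "buffer empty"
        self.full_exits = 0            # simulated PML buffer full VMEXITs
        self.dirty = set()             # stand-in for the log-dirty structure

    def log_write(self, gpa):
        """Model the CPU logging a GPA whose EPT D-bit went from 0 to 1."""
        if self.index < 0:             # assumption: negative index == buffer full
            self.full_exits += 1       # the VMEXIT happens before logging
            self.flush()
        self.entries[self.index] = gpa & PAGE_MASK
        self.index -= 1

    def flush(self):
        """Model the VMM draining the logged GPAs and resetting the index."""
        for slot in range(self.index + 1, PML_ENTRIES):
            self.dirty.add(self.entries[slot] >> 12)   # record the GFN
        self.index = PML_ENTRIES - 1
```

In this model the buffer fills from slot 511 downwards, so a flush only
needs to walk the slots above the current index.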

The specification of PML can be found at:

With PML, we don't have to use write protection; we just clear the D-bit
of the EPT entries of guest memory to do dirty logging, at the cost of
one additional PML buffer full VMEXIT per 512 dirty GPAs. Theoretically,
this reduces hypervisor overhead when the guest is in dirty logging
mode, so more CPU cycles can be given to the guest, and benchmarks in
the guest are expected to perform better than without PML.
One issue with basic EPT A/D tracking was the scan of the EPT tables.
Here, hardware will give us a list of affected gfns, but how is Xen
supposed to efficiently clear the dirty bits again?  Using EPT
misconfiguration is no better than the existing fault path.
See my reply to Jan's email.
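
To put rough numbers on the "one VMEXIT per 512 dirty GPAs" claim, the
following back-of-the-envelope sketch compares exit counts for one
logging round. It assumes each page is dirtied at least once, that no
other VMEXIT flushes the buffer early, and it ignores the cost of
clearing D-bits again, which is exactly the open question raised above.
The function names are made up for this example.

```python
PML_ENTRIES = 512  # GPAs that fit in one 4K PML buffer


def write_protect_faults(distinct_dirty_pages):
    """One EPT violation per page: the first write after write-protection."""
    return distinct_dirty_pages


def pml_full_exits(distinct_dirty_pages):
    """Buffer-full VMEXITs, raised only when logging into an already full buffer."""
    if distinct_dirty_pages == 0:
        return 0
    return (distinct_dirty_pages - 1) // PML_ENTRIES


# Example: a guest dirtying 1 GiB worth of 4K pages in one round.
pages = (1 << 30) // 4096
print(write_protect_faults(pages), pml_full_exits(pages))
```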


- PML feature is used globally

A new Xen boot parameter, say 'opt_enable_pml', will be introduced to
control PML feature detection, and the PML feature will only be detected
if opt_enable_pml = 1. Once the PML feature is detected, it will be used
for dirty logging for all domains globally. Currently we don't support
using PML on a per-domain basis, as that would require additional
control from the XL tool.
Rather than adding in a new top level command line option for an ept
subfeature, it would be preferable to add an "ept=" option which has
"pml" as a sub boolean.
That is fine with me, if Jan agrees.

Jan, which do you prefer here?
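
For discussion, this is roughly how an "ept=" option with sub-booleans
could be parsed. The comma-separated syntax and the "no-" prefix for
disabling a sub-option are assumptions made for this sketch, not
something settled in this thread, and the function name is hypothetical.

```python
def parse_ept_param(value, defaults=None):
    """Parse a comma-separated list of sub-booleans such as 'pml' or 'no-pml'.

    The syntax is an assumption for illustration only.
    """
    opts = dict(defaults or {"pml": False})
    for token in filter(None, (t.strip() for t in value.split(","))):
        enabled = not token.startswith("no-")
        name = token if enabled else token[3:]
        if name not in opts:
            raise ValueError(f"unknown ept sub-option: {name!r}")
        opts[name] = enabled
    return opts
```

With this shape, further EPT sub-features could be added to the defaults
table later without introducing another top-level option.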

- PML enable/disable for particular Domain
I do not believe that this is an interesting use case at the moment.
Currently, PML would be an implementation detail of how Xen manages to
provide the logdirty bitmap to the toolstack or device model, and need
not be exposed at all.

If in the future, a toolstack component wishes to use the pml for other
purposes, there is more infrastructure which needs adjusting than just
per-domain PML.
I didn't mean to expose PML to the toolstack here. In fact, that is what I want to avoid: PML should be hidden completely inside the Xen hypervisor and be, as you said, just another mechanism to provide the logdirty bitmap to userspace. What I mean is that we need to enable PML for the domain manually (allocate the PML buffer, initialize the PML index, and turn PML on in the VMCS), as it is not turned on automatically after PML feature detection.
Sorry for the confusion.

PML needs to be enabled (allocate the PML buffer, initialize the PML
index and PML base address, turn PML on in the VMCS, etc.) for all vcpus
of the domain, as the PML buffer and PML index are per-vcpu but the EPT
table may be shared by vcpus. Enabling PML on only some vcpus of the
domain won't work. Also, PML will only be enabled for the domain when it
is switched to dirty logging mode, and it will be disabled when the
domain is switched back to normal mode. It looks like the number of
vcpus doesn't change dynamically while the guest is running (correct me
if I am wrong here), so we don't have to consider enabling PML for a
newly created vcpu when the guest is in dirty logging mode.
There are exactly d->max_vcpus worth of struct vcpus (and therefore
VMCSes) for a domain after creation, and will exist for the lifetime of
the domain.  There is no dynamic adjustment of numbers of vcpus during
Good to know.
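
Since PML has to be on for every vcpu of the domain or for none, the
enable path could be sketched as an all-or-nothing loop with rollback.
This is only a model under assumptions: the Vcpu class, the
can_allocate knob (which simulates a PML buffer allocation failure) and
the function name are all hypothetical.

```python
class Vcpu:
    """Hypothetical stand-in for a vcpu and its per-vcpu PML state."""

    def __init__(self, vcpu_id, can_allocate=True):
        self.vcpu_id = vcpu_id
        self.can_allocate = can_allocate   # models PML buffer allocation success
        self.pml_enabled = False


def enable_pml_for_domain(vcpus):
    """Enable PML on every vcpu, or on none if any vcpu fails."""
    done = []
    for v in vcpus:
        if not v.can_allocate:
            for u in done:                 # roll back the vcpus already enabled
                u.pml_enabled = False
            return False
        v.pml_enabled = True               # buffer, index and VMCS setup go here
        done.append(v)
    return True
```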

After PML is enabled for the domain, we only need to clear the EPT
entry's D-bit for guest memory in dirty logging mode. We achieve this by
checking whether PML is enabled for the domain when p2m_ram_rw is changed
to p2m_ram_logdirty, and updating the EPT entry accordingly. However,
superpages are still write protected even with PML, as we still
need to split a superpage into 4K pages in dirty logging mode.
How is a superpage write reflected in the PML?

According to the whitepaper, transitioning the D bit from 0 to 1 results
in an entry being written into the log.  I presume that in the case of a
superpage, 512 entries are not written to the log,
No, only the GPA being written will be logged, with the last 12 bits cleared. Whether hardware just clears the last 12 bits or aligns to 2M is not certain, as the specification doesn't say. I should probably confirm with the hardware people. But it doesn't affect the design anyway, as explained below.

which presumably
means that the PML buffer flush needs to be aware of which gfns are
mapped by superpages to be able to correctly set a block of bits in the
logdirty bitmap.

Unfortunately PML itself can't tell us whether the logged GPA comes from a superpage or not, but even with PML we still need to split superpages into 4K pages, just as the traditional write protection approach does. I think this is because live migration should work at 4K page granularity. Marking all 512 bits of a 2M page dirty because of a single write doesn't make sense in either the write protection or the PML case.
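
To make the 4K granularity point concrete, here is a small sketch of how
one logged GPA would map to exactly one bit in a dirty bitmap. The
bytearray bitmap and the helper names are assumptions for illustration;
Xen's real log-dirty structure is a radix tree, as mentioned later in
this mail.

```python
PAGE_SHIFT = 12  # 4K pages


def gpa_to_gfn(gpa):
    """Drop the low 12 bits; a logged GPA already has them cleared."""
    return gpa >> PAGE_SHIFT


def mark_dirty(bitmap, gfn):
    """Set the single bit for this 4K frame."""
    bitmap[gfn // 8] |= 1 << (gfn % 8)


def count_dirty(bitmap):
    return sum(bin(byte).count("1") for byte in bitmap)


# One write somewhere inside the 2M region starting at GPA 0x200000.
bitmap = bytearray(1024 // 8)            # covers GFNs 0..1023
mark_dirty(bitmap, gpa_to_gfn(0x203000))
print(count_dirty(bitmap))               # one bit set, not 512
```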

- PML buffer flush

There are two places where we need to flush the PML buffer. The first is
the PML buffer full VMEXIT handler (apparently), and the second is
paging_log_dirty_op (either peek or clean): vcpus run asynchronously
with respect to paging_log_dirty_op, which is called from userspace
via hypercall, so there may be dirty GPAs logged in vcpus' PML buffers
that are not yet full. Therefore we had better flush all vcpus'
PML buffers before reporting dirty GPAs to userspace.
Why apparently?  It would be quite easy for a guest to dirty 512 frames
without otherwise taking a vmexit.
The PML buffer full VMEXIT indicates the buffer is fully logged, so clearly we need to flush it and make it empty to be able to log GPA again. See my reply to Jan's email.

We handle the above two cases by flushing the PML buffer at the beginning
of all VMEXITs. This solves the first case above, and it also solves the
second case: prior to paging_log_dirty_op, domain_pause is called,
which kicks vcpus that are in guest mode out of guest mode by sending
them an IPI, which causes a VMEXIT.

This also keeps the log-dirty radix tree more up to date, as the PML
buffer is flushed on every VMEXIT rather than only on a PML buffer full
VMEXIT.
My gut feeling is that this is substantial overhead on a common path,
but this largely depends on how the dirty bits can be cleared efficiently.
Yes but I don't think the overhead will be substantial. See my reply to Jan's email.
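
The flush-on-every-VMEXIT idea can be modelled as follows. This is a
simplified sketch: pml_pending stands in for the not-yet-flushed part of
a vcpu's PML buffer, a Python set stands in for the log-dirty radix
tree, and the "external-interrupt" exit models the IPI sent when the
domain is paused. All names are hypothetical.

```python
class Vcpu:
    """Hypothetical vcpu holding GPAs logged by hardware but not yet flushed."""

    def __init__(self):
        self.pml_pending = []


def flush_pml(vcpu, dirty_gfns):
    """Move every logged GPA into the dirty set and empty the buffer."""
    dirty_gfns.update(gpa >> 12 for gpa in vcpu.pml_pending)
    vcpu.pml_pending.clear()


def handle_vmexit(vcpu, reason, dirty_gfns):
    flush_pml(vcpu, dirty_gfns)    # first step, whatever the exit reason
    return reason                  # reason-specific handling would follow


def log_dirty_peek(vcpus, dirty_gfns):
    """Model of a peek: pausing the domain forces a VMEXIT on each vcpu."""
    for v in vcpus:
        handle_vmexit(v, "external-interrupt", dirty_gfns)
    return sorted(dirty_gfns)
```

Because the flush is unconditional, the buffer-full exit needs no
special handling beyond being one more exit reason.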

- Video RAM tracking (and partial dirty logging for guest memory range)

Video RAM is in dirty logging mode unconditionally during the guest's
run-time, and it covers only part of the guest's memory. However, PML
operates on the whole guest memory (the whole valid EPT table, more
precisely), so we need to choose whether to use PML when only some
guest memory ranges are in dirty logging mode.

Currently, PML will be used as long as there is guest memory in dirty
logging mode, whether globally or partially. In the case of partial
dirty logging, we need to check whether each GPA logged in the PML
buffer is in a dirty logging range.
I am not sure this is a problem.  HAP vram tracking already leaks
non-vram frames into the dirty bitmap, caused by calls to
paging_mark_dirty() from paths which are not caused by a p2m_logdirty fault.
Hmm. Seems right. Probably this also depends on how userspace uses the dirty bitmap.

If this is not a problem, we can skip checking whether logged GPAs are in logdirty ranges and unconditionally add them to the log-dirty radix tree.

Jan, what are your comments here?
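
For reference, the range check under discussion (keeping only logged
GFNs that fall inside a dirty logging range) amounts to something like
the sketch below. The (start_gfn, nr_frames) tuple representation and
the function name are assumptions for this example, not Xen's actual
data structures.

```python
def filter_logged_gfns(logged_gfns, logdirty_ranges):
    """Keep only GFNs inside one of the (start_gfn, nr_frames) ranges."""
    return [
        gfn
        for gfn in logged_gfns
        if any(start <= gfn < start + nr for start, nr in logdirty_ranges)
    ]


# Hypothetical vram range: 0x20 frames starting at GFN 0xA0.
vram = [(0xA0, 0x20)]
print(filter_logged_gfns([0x9F, 0xA0, 0xBF, 0xC0], vram))
```

Dropping the check would mean skipping this filter and recording every
logged GFN.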



Xen-devel mailing list
