RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
OK, here's the long version (/me crosses
fingers and hopes to get away from this
for at least some of the weekend)...
Proposal ("pv rdtscp"):
The rdtscp instruction was added to the x86
architecture by AMD a couple of years ago, and
Intel added it starting with Nehalem. It is
essentially the same as rdtsc except that, in
addition, it copies the value of a privileged
MSR, "TSC_AUX", into the ECX register. There is
a CPUID bit that can be checked to determine
whether the processor supports the rdtscp
instruction. Xen currently does not expose
hardware support for rdtscp.
I propose to paravirtualize support for
rdtscp as follows:
If guest vm.cfg has vrdtscp=0 (default):
   rdtscp is emulated and returns nsec since guest
   boot (same as emulated rdtsc); value returned
   for TSC_AUX is -1
If guest vm.cfg has vrdtscp=1:
   If underlying hardware has rdtscp support:
      rdtscp is directly executed by hardware;
      value returned for TSC_AUX is non-zero
   Else (no hardware rdtscp support):
      rdtscp is emulated and returns nsec since
      guest boot; value returned for TSC_AUX is 0
How it works from the app point-of-view:
Guest app must have some capability of getting 64-bit
pvclock parameters directly from Xen without OS changes,
e.g. emulated userland wrmsr, userland hypercall,
or userland mapped shared page. (This will be done
rarely so need not be fast! But it does create
a new userland<->Xen ABI that must be kept compatible.)
On first rdtscp, app records the returned TSC_AUX
value. If TSC_AUX is 0 or -1, rdtscp is being
emulated and the tsc value IS nsec since guest boot.
Otherwise, app fetches pvclock parameters from Xen
and executes another rdtscp. If TSC_AUX matches the
recorded value, app applies pvclock algorithm to
tsc value to obtain nsec since guest boot.
If TSC_AUX differs from the recorded value, app
records the new value and fetches pvclock
parameters from Xen again.
On subsequent rdtscp's, app compares
returned TSC_AUX against the previous one,
and fetches pvclock parameters from Xen only
if it differs (which should be rare).
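
The app-side check above can be sketched roughly as follows.
This is only an illustration under stated assumptions: the
struct and function names are hypothetical, the raw tsc and
TSC_AUX values are passed in rather than read with a real
rdtscp, and how the parameters are fetched from Xen is left
open, as in the proposal. The scaling step follows the usual
pvclock formula (shift, multiply, take the upper bits, add
system time).

```c
#include <stdbool.h>
#include <stdint.h>

/* Field names mirror Xen's pvclock time info. */
struct pv_params {
    uint64_t tsc_timestamp;      /* TSC at the last parameter update */
    uint64_t system_time;        /* nsec since guest boot at that TSC */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* pre-multiply shift, may be negative */
};

/* Hypothetical app-side cache: parameters plus the TSC_AUX they go with. */
struct pv_rdtscp_state {
    uint32_t aux;
    bool valid;
    struct pv_params params;
};

/* pvclock scaling: ((tsc - timestamp) << shift) * mul >> 32 + system_time. */
static uint64_t pvclock_scale(const struct pv_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;
    if (p->tsc_shift < 0)
        delta >>= -p->tsc_shift;
    else
        delta <<= p->tsc_shift;
    /* 64x32 multiply, keeping bits 32..95, without a 128-bit type. */
    uint64_t hi = delta >> 32, lo = delta & 0xffffffffu;
    return p->system_time + hi * p->tsc_to_system_mul +
           ((lo * p->tsc_to_system_mul) >> 32);
}

/*
 * Convert one rdtscp result to nsec since guest boot.
 * Returns false when the cached parameters do not match TSC_AUX; the
 * caller must then fetch parameters from Xen, call pv_rdtscp_record(),
 * and execute rdtscp again.
 */
static bool pv_rdtscp_to_nsec(const struct pv_rdtscp_state *s,
                              uint64_t tsc, uint32_t aux, uint64_t *nsec)
{
    if (aux == 0 || aux == UINT32_MAX) {   /* emulated: tsc IS nsec */
        *nsec = tsc;
        return true;
    }
    if (s->valid && aux == s->aux) {       /* same version: scale */
        *nsec = pvclock_scale(&s->params, tsc);
        return true;
    }
    return false;                          /* first use or version changed */
}

/* Remember freshly fetched parameters and the TSC_AUX seen before the fetch. */
static void pv_rdtscp_record(struct pv_rdtscp_state *s, uint32_t aux,
                             const struct pv_params *p)
{
    s->aux = aux;
    s->params = *p;
    s->valid = true;
}
```

A caller would loop: rdtscp, try pv_rdtscp_to_nsec(), and on
failure fetch, record, and retry. Because the conversion only
succeeds when a later rdtscp returns the recorded TSC_AUX, a
version bump that races with the fetch just costs one more
trip around the loop.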
What Xen needs to do:
Xen must record the setting for each guest's vrdtscp
config variable and ensure that it persists across
save/restore and migration. If the guest has
vrdtscp=1, a vrdtscp "version" number is also
part of the guest's state and must persist.
Xen must know whether or not it is running on a
machine where TSC is reliable.
If TSC is NOT reliable AND rdtscp is supported by
hardware, Xen must ensure that TSC_AUX is -1 on all
pcpus that are running a guest with vrdtscp=0, and 0
on all pcpus that are running a guest with
vrdtscp=1 (and must enable CR4.TSD on those
pcpus if it wasn't already).
If TSC is NOT reliable AND rdtscp is NOT supported
by hardware, Xen must emulate rdtscp (e.g. return
Xen system time) and emulate the same behavior
for TSC_AUX.
If TSC IS reliable, Xen sets TSC_AUX to the guest's
vrdtscp version number on all pcpus that are
running the guest.
Finally, when a guest transitions from one
"TSC domain" to another (restore/migrate/NUMA),
Xen increments the guest's vrdtscp version number.
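
A rough sketch of the Xen-side bookkeeping described above,
combining it with the vm.cfg cases from the proposal. All
names here are hypothetical (none are from the Xen tree), and
skipping 0 and -1 when the version wraps is an assumption
added so the version can never be mistaken for one of the
two "emulated" markers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-guest state that must survive save/restore/migrate. */
struct guest_rdtscp {
    bool vrdtscp;       /* the vm.cfg setting */
    uint32_t version;   /* used only when vrdtscp=1; never 0 or -1 */
};

/* True when rdtscp cannot run natively for this guest and must be emulated. */
static bool rdtscp_emulated(const struct guest_rdtscp *g,
                            bool hw_rdtscp, bool tsc_reliable)
{
    return !(g->vrdtscp && hw_rdtscp && tsc_reliable);
}

/* TSC_AUX value the guest must observe while it runs on a pcpu. */
static uint32_t tsc_aux_for(const struct guest_rdtscp *g,
                            bool hw_rdtscp, bool tsc_reliable)
{
    if (!g->vrdtscp)
        return UINT32_MAX;              /* vrdtscp=0: -1 */
    if (rdtscp_emulated(g, hw_rdtscp, tsc_reliable))
        return 0;                       /* vrdtscp=1 but emulated: 0 */
    return g->version;                  /* native: current version */
}

/* Called on restore, migration, or a move to another NUMA node. */
static void tsc_domain_changed(struct guest_rdtscp *g)
{
    if (!g->vrdtscp)
        return;
    g->version++;
    /* Assumption: skip the reserved values on wraparound. */
    if (g->version == 0 || g->version == UINT32_MAX)
        g->version = 1;
}
```

Keeping the decision in one function means the same value can
be written to the hardware MSR in the native case and returned
by the emulation path otherwise.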
I think this will work even for a NUMA machine
provided Xen always schedules all the vcpus
for one guest on pcpus in the same NUMA node,
and increments the version number when
the guest is rescheduled from one NUMA node to
another (assuming TSC on each node is reliable).
I think this pv-rdtscp mechanism will work
for both PV and HVM (with minor additional work
in Xen for HVM); it will be very fast on any
hardware that supports rdtscp in hardware
(which for Intel only includes Nehalem+ but
that provides even more incentive for
customers to upgrade). Apps that currently
use rdtscp will continue to work (as long as
they don't have some wild use model that I
don't know about).
Pvclock algorithm in the OS would need to be
changed to use rdtscp (instead of rdtsc)
and check for TSC_AUX=0 to do the right thing.
If not changed, it will continue to work
but slower (whether or not rdtsc is emulated,
because when emulated it returns the hardware
TSC when the instruction was attempted in kernel
mode).
The only problem I can see is that when
vrdtscp==1, other apps that are running on that guest
that use rdtsc (no p) directly (i.e. haven't been
modified to use pv-rdtscp) will continue to
have the same kinds of failure on save/restore/
migration. But this is true of all the solutions
proposed so far: Xen can only turn on emulation
guest-wide, not per-app.
Also even on machines where TSC is reliable,
there is a small chance that consecutive
TSC values read will be from different
processors and so TSC might appear to go
backwards by some small amount. So apps
must still put raw TSC values through
a "monotonicity filter". (Xen already
does this for emulated reads of TSC.)
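
One possible shape for such an app-side filter, as a minimal
sketch: remember the largest value handed out so far and
never return anything smaller. The function and variable
names are made up for illustration, and C11 atomics are
assumed so that threads on different processors can share
the filter.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest value returned so far; shared by all threads of the app. */
static _Atomic uint64_t last_tsc;

/* Return raw if it moves time forward, otherwise the last value seen. */
static uint64_t monotonic_tsc(uint64_t raw)
{
    uint64_t prev = atomic_load(&last_tsc);
    while (raw > prev) {
        /* On failure prev is reloaded with the current stored value. */
        if (atomic_compare_exchange_weak(&last_tsc, &prev, raw))
            return raw;
    }
    return prev;
}
```

With this filter a reading that is slightly behind a previous
one is clamped to the previous value, so callers see time
stand still briefly rather than go backwards.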
> -----Original Message-----
> From: Dan Magenheimer
> Sent: Friday, September 18, 2009 10:30 AM
> To: Xen-Devel (E-mail)
> Subject: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've
> been looking for)
> Xen doesn't appear to support the rdtscp instruction.
> Should it? (And specifically I'm wondering whether
> it should be emulated whenever rdtsc is emulated
> but see below for another intriguing possibility.)
> Rdtscp is unprivileged and we have apps that are using it
> on bare metal, after validating that the CPU supports it.
> The instruction is available on most (all?) recent AMD
> CPUs and Intel's Nehalem supports it.
> For an OS to support rdtscp properly, the OS must (once at boot)
> wrmsr a different value for each cpu to a "TSC_AUX" register
> and this register is read along with the TSC when the rdtscp
> instruction is executed. This allows an app to determine
> if two consecutive rdtscp's are (or are not) executed on the
> same CPU.
> It appears that all recent RHEL kernels write to TSC_AUX if
> the CPU supports rdtscp. I'm told Windows 2008 notably does
> not. Don't know about SLES or other Windoze.
> It's not clear to me if/how rdtscp can/should be virtualized.
> To do it properly, the value written to the TSC_AUX msr
> would become part of the vcpu's state, and would need to
> be changed whenever a vcpu->pcpu mapping changes. To meet
> only the current use model of the instruction, Xen could write
> TSC_AUX for each pcpu on Xen boot and always ignore guest
> OS writes to TSC_AUX. (This assumes that no OS ever reads
> TSC_AUX and attempts to match it with the value that it
> thought it wrote to TSC_AUX; and assumes that
> One solution is for Xen to deny the existence of rdtscp even
> when Xen is running on hardware that supports it. Is that
> exactly what is happening?
> Now thinking creatively, could TSC_AUX be used similar
> to the pvclock version number... Xen bumps it whenever a
> migration occurs which would prompt an app to go out
> and reread new values for scaling and offset (possibly
> via specially-handled-by-Xen usermode rdmsr)? Hmmm...
> I think it might be the answer I've been looking for!
> (Go ahead, shoot me down :-)
> Xen-devel mailing list