[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.

OK, here's a slightly refined proposal.

To reiterate, the problem is that Xen's current mechanism for
handling the rdtsc instruction may silently provide incorrect
results while alternative mechanisms are too slow (vs VMware
which is both fast and correct).

My goal is to provide a paravirtualized tsc mechanism for apps
running on Xen that is reliably correct, is not dependent on a
particular OS or processor family, is approximately as fast as
rdtsc (or at least much faster than emulated rdtsc), provides
adequate (e.g. nanosecond) resolution, does not require
recompilation to work both on Xen and bare metal, and works
properly across: vcpu-to-pcpu rescheduling even on NUMA machines;
system sleep/hibernation; and save/restore/migration between
machines with dissimilar clock rates.

Implementation requires changes in Xen and "the app" but no OS
changes thus making it still viable on legacy OS's and
possibly(?) HVM domains. Note that only apps that need to sample
time on the order of >5-100K/core/second would use this; for
other apps, rdtsc emulation overhead is probably negligible
(<0.2%).

0) Xen implements rdtsc emulation by default
1) Guest OS is launched with pvtsc=1 in vm.cfg
2) App running on guest OS sets up a SIGILL handler
3) App executes a special rdmsr instruction or hypercall.
4a) If SIGILL results, not running on Xen at all, or on
    old Xen; app uses rdtsc at own risk. Done.
4b) Else, rdmsr/hypercall returns virtual address of
    special pvclock page ("pvclock_va").
5) App executes another special rdmsr instruction/hypercall
   to disable rdtsc emulation. This affects ALL execution
   for all processes in this VM.
6) Xen maintains mapping of pvclock_va to a different
   physical page for each processor and transparently
   handles TLB misses for pvclock_va
7) App uses (unemulated) rdtsc and applies pvclock algorithm
   (using values in memory at pvclock_va) resulting in
   pvtsc, which is nanoseconds since VM start. App can
   further apply local algorithms to enforce monotonicity
   or frequency scaling as desired.

Comments appreciated. I realize that this is hacky and ugly...
better alternatives gladly solicited.

Thanks,
Dan

P.S. While it would be nice if we could just tell apps to use
a fast vgettimeofday equivalent, this does not exist today and,
even if it did, would not be widely available for years in the
kernel running under most enterprise app deployments (and, even
then, only on 64-bit Linux.)

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Friday, August 28, 2009 11:50 AM
> To: Xen-Devel (E-mail)
> Cc: Jeremy Fitzhardinge; Keir Fraser; Alan Cox
> Subject: rdtsc: correctness vs performance on Xen (and KVM?)
>
>
> To summarize:
>
> Xen and KVM currently allow rdtsc to be executed
> directly by userland. As a result, apps that
> use rdtsc smartly and effectively on (some) physical
> machines may break badly in Xen or KVM because of
> the disassociation of physical and virtual cpus.
> (Readers not familiar with why rdtsc is a problem,
> can read e.g. http://en.wikipedia.org/wiki/Rdtsc)
>
> VMware always emulates rdtsc, both for kernel and
> userland rdtsc's. (I don't know what HyperV does.)
>
> Xen currently has a boot option to always emulate
> rdtsc in HVM guests and just added code such that
> the same boot option will always emulate rdtsc for
> userland-only in PVM guests. There is some agreement
> in the Xen community that rdtsc emulation should
> always be the default though the default is currently
> off. KVM is having a similar discussion and, I'm
> told, has also come to the conclusion that emulating
> rdtsc is a necessary evil.
>
> The problem is that emulating rdtsc is slow. On
> my dual-core Conroe, rdtsc is about 72 cycles and
> emulating rdtsc (returning a fixed frequency 1GHz
> Xen monotonic system time) is over 15x slower.
> This is a big hit for apps that do tens to hundreds
> of thousands of rdtsc's per processor per second.
> (And yes these apps are more common than one
> might think.)
>
> VMware has the advantage of binary translation;
> rdtsc can be translated to return a "conforming"
> value in ~200 cycles (on an older processor so
> probably faster if you are comparing against my
> dual-core Conroe numbers above). This value
> is "stale" (not linear with wallclock time).
> For VMs that need rdtsc to more accurately reflect
> wallclock time, full emulation can be optionally
> enabled for a VM.
>
> I'm searching for alternatives that provide the
> correctness of emulation, but better performance
> than emulation. Jeremy points out that the
> pvclock mechanism in upstream Linux works well,
> but the pvclock data is currently only exposed
> to kernel... and exposing it to userland still
> requires apps-using-rdtsc to be rewritten.
> But Jeremy claims that all apps-that-use-rdtsc
> MUST be rewritten because using rdtsc is unsafe,
> and that they should be rewritten to use
> gettimeofday (or actually vgettimeofday).
> But on older OS's (including the vast majority
> of installed units) and machines where tsc is
> "unsafe", gettimeofday can be MUCH slower than
> emulating rdtsc. So telling app writers to
> convert all uses of rdtsc to gettimeofday is
> not an acceptable solution for these apps in
> the shortterm.
>
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.
> The underlying implementation of this API would
> utilize Linux only to "register" the mechanism,
> preferably via a module so that it, like
> disk and network frontends, could easily be
> bolted on to shipping OS's.
> Individual uses of "pvclock_read" would need no syscall... like
> the kernel pvclock mechanism, they need only
> access memory to get the necessary scaling
> and offset data. Once instantiated, rdtsc
> is executed directly by the app as part of the
> pvclock protocol. If never registered,
> rdtsc would always be trapped and emulated.
>
> I realize this idea is half-baked, but would like
> to invite other TSC/time experts to determine
> if some or all of the idea might be used to
> achieve a fully-baked solution.
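
For readers who want to see what step 7 of the refined proposal might
look like in an app, here is a rough sketch of the read side. It is
only an illustration under assumptions: the proposal does not define
the layout of the page at pvclock_va, so the struct below is modeled
on the Linux kernel's pvclock_vcpu_time_info, and the names
pvtsc_info, pvtsc_to_ns, pvtsc_read_ns and read_tsc are hypothetical.
The idea is the usual version-checked read: sample the version, copy
the scaling data, execute rdtsc, and retry if the hypervisor updated
the page in the meantime.

```c
#include <stdint.h>
#include <stdatomic.h>

/* Assumed page layout, modeled on Linux's pvclock_vcpu_time_info.
 * The proposal itself does not specify one. */
struct pvtsc_info {
    uint32_t version;            /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* tsc value at the last update */
    uint64_t system_time;        /* ns since VM start at the last update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point cycles-to-ns factor */
    int8_t   tsc_shift;          /* pre-shift applied to the tsc delta */
    uint8_t  flags;
    uint8_t  pad[2];
};

/* Convert a raw tsc sample to nanoseconds using a consistent snapshot. */
uint64_t pvtsc_to_ns(const struct pvtsc_info *snap, uint64_t tsc)
{
    uint64_t delta = tsc - snap->tsc_timestamp;
    uint32_t mul = snap->tsc_to_system_mul;

    if (snap->tsc_shift >= 0)
        delta <<= snap->tsc_shift;
    else
        delta >>= -snap->tsc_shift;

    /* (delta * mul) >> 32, split in two so no 128-bit type is needed */
    uint64_t ns = (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);

    return snap->system_time + ns;
}

/* Version-checked read. read_tsc is a caller-supplied wrapper around
 * the (unemulated) rdtsc instruction; it is called inside the loop so
 * that the tsc sample and the scaling data belong to the same update. */
uint64_t pvtsc_read_ns(const volatile struct pvtsc_info *page,
                       uint64_t (*read_tsc)(void))
{
    struct pvtsc_info snap;
    uint64_t tsc;

    do {
        snap.version = page->version;
        atomic_thread_fence(memory_order_acquire);
        snap.tsc_timestamp = page->tsc_timestamp;
        snap.system_time = page->system_time;
        snap.tsc_to_system_mul = page->tsc_to_system_mul;
        snap.tsc_shift = page->tsc_shift;
        tsc = read_tsc();
        atomic_thread_fence(memory_order_acquire);
    } while ((snap.version & 1) || snap.version != page->version);

    return pvtsc_to_ns(&snap, tsc);
}
```

On x86 with GCC or Clang, read_tsc could be a one-line wrapper around
__rdtsc() from <x86intrin.h>. Any monotonicity enforcement or
frequency scaling mentioned in step 7 would sit on top of the value
returned here and is not shown.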