Xen project Mailing List

RE: [Xen-devel] write_tsc in a PV domain?

To: dan.magenheimer@xxxxxxxxxx, Jeremy Fitzhardinge <jeremy@xxxxxxxx>

From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>

Date: Mon, 31 Aug 2009 11:11:50 -0700 (PDT)

Cc: "Xen-Devel (E-mail)" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Alan Cox <alan@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 31 Aug 2009 11:12:42 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

I'm experimenting with clock_gettime(), gettimeofday(), and rdtsc
with a 2.6.30 64-bit pvguest.  I have tried both with
kernel.vsyscall64 equal to 0 and 1 (but haven't seen any
significant difference between the two).  I have confirmed
from sysfs that clocksource=xen.

I have yet to get a measurement of either syscall that is
better than 2.5x WORSE than emulating rdtsc.  On my dual-core
Conroe (Intel E6850) with 64-bit Xen and 32-bit dom0, I get
approximately:

rdtsc native:                   22ns
softtsc (rdtsc emulated):      360ns
gettime syscall w/softtsc:    1400ns
gettime syscall native tsc:    980ns
gettimeofday w/softtsc:       1750ns
gettimeofday native tsc:       900ns

I'm hoping this is either a bug in the 2.6.30 xen pvclock
implementation or in my measurement methodology, so would
welcome others measuring this.

A couple other minor observations:

1) The syscalls seem to be somewhat slower when usermode
   rdtscs are being emulated, by approximately the cost of
   emulating an rdtsc.  I suppose this makes sense since
   vsyscalls are executed in userland and since vgettimeofday
   does a rdtsc.  However it complicates strategy if emulating
   rdtsc is the default.

2) The syscall clock_getres() does not seem to reflect the fact that

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Saturday, August 29, 2009 11:52 AM
> To: Jeremy Fitzhardinge
> Cc: Alan Cox; Xen-Devel (E-mail); Keir Fraser
> Subject: RE: [Xen-devel] write_tsc in a PV domain?
>
>
> (Reordered with most important points first...)
>
> > You are talking about three different cases:
>
> I agree with your analysis for case 1 and case 3.
>
> > So, there's case 2: pv usermode.  There are four
> > classes of apps worth considering here:
>
> I agree with your classification.  But a key point
> is that VMware provides correctness for all
> of these classes.  AND provides it at much better
> performance than trap-and-emulate.  AND provides
> correctness+performance regardless of the underlying
> OS (e.g. even "old" OS's such as RHEL4 and RHEL5).
> AND provides it regardless whether the guest OS is
> 32-bit or 64-bit.  AND, by the way, provides it for
> your case 1 (PV OS) and case 3 (HVM) as well.
>
> > So if you want to address these problems, it seems to me
> > you'll get most bang for the buck by fixing (v)gettimeofday
> > to use pvclock, and convincing app writers to trust in
> > gettimeofday.
>
> (Partially irrelevant point, but gettimeofday returns
> microseconds which is not enough resolution for many
> cases where rdtsc has been used in apps.  Clock_gettime
> is the relevant API I think.)
>
> If we can come up with a way for a kernel-loadable module
> to handle some equivalent of clock_gettime so that
> the most widely used shipping PV OS's can provide a
> pvclock interface to apps, this might be workable.
> If we tell app providers and customers: "You
> can choose either performance OR correctness but
> not both, unless you upgrade to a new OS (that is
> not even available yet)", I don't think that will
> be acceptable.
>
> Any ideas on how pvclock might be provided through
> a module that could be added to, eg. RHEL4 or RHEL5?
>
> > > There ARE guaranteed properties specified by
> > > the Intel SDM for any _single_ processor...
> >
> > Yes, but those are fairly weak guarantees.  It does not
> > guarantee that the tsc won't change rate arbitrarily, or
> > stop outright between reads.
>
> They are weak guarantees only if one uses rdtsc
> to accurately track wallclock time.  They are
> perfectly useful guarantees if one simply wants to
> either timestamp data to record ordering (e.g.
> for journaling or transaction replay), or
> approximate the passing of time to provide
> approximate execution metrics (e.g. for
> performance tools).
>
> > > What is NOT guaranteed, but is widely and
> > > incorrectly assumed to be implied and has
> > > gotten us into this mess, is that
> > > the same properties applies across multiple
> > > processors.
> >
> > Yes, Linux offers even weaker guarantees than Intel.
> > Aside from the processor migration issue, the tsc can jump
> > arbitrarily as a result of suspend/resume (ie, it can be
> > non-monotonic).
>
> Please explain.  Suspend/resume is an S state isn't
> it?  Is it possible to suspend/resume one processor
> in an SMP system and not another processor?  I think
> not.  Your point is valid for C-states and P-states
> but those are what Intel/AMD has fixed in the most
> recent families of multi-core processors.
>
> So I don't see how (in the most recent families of
> processors) tsc can be non-monotonic.
>
> > Even very recent processors with "constant" tscs (ie, they
> > don't change rate with the core frequency) stop in certain
> > power states.
>
> For the most recent families of processors, the TSC
> continues to run at a fixed rate even for all the
> P-states and C-states.  We should confirm this with
> Intel and AMD.
>
> > Any motherboard design which runs packages in different
> > clock-domains will lose tsc-sync between those packages,
> > regardless of what's in the packages.
>
> I'm told this is not true for recent multi-socket systems
> where the sockets are on the same motherboard.  And at
> least one large vendor that ships a new
> one-socket-per-motherboard NUMA-ish system claims that it
> is not even true when the sockets are on different
> motherboards.
>
> Dan
>
> (no further replies below, remaining original text retained
> for context)
>
> > You are talking about three different cases:
> >
> >    1. the reliability of the tsc in a PV guest in kernel mode
> >    2. the reliability of the tsc in a PV guest in user mode
> >    3. the reliability of the tsc in an HVM guest
> >
> > I don't think 1. needs any attention.  The current scheme
> > works fine.
> >
> > The only option for 3 is to try to make a best-effort of tsc
> > quality, which ranges from trapping every rdtsc to make them
> > all give globally monotonic results, or use the other VT/SVM
> > features to apply an offset from the raw tsc to a guest tsc,
> > etc.
> > Either way the situation isn't much different from running
> > native (ie, apps will see basically the same tsc behaviour
> > as in the native case, to some degree of approximation).
> >
> > So, there's case 2: pv usermode.  There are four classes of
> > apps worth considering here:
> >
> >    1. Old apps which make unwarranted assumptions about the
> >       behaviour of the tsc.  They assume they're basically
> >       running on some equivalent of a P54, and so will get
> >       junk on any modernish system with SMP and/or power
> >       management.  If people are still using such apps, it
> >       probably means their performance isn't critically
> >       dependent on the tsc.
> >    2. More sophisticated apps which know the tsc has some
> >       limitations and try to mitigate them by filtering
> >       discontinuities, using rdtscp, etc.  They're
> >       best-effort, but they inherently lack enough
> >       information to do a complete job (they have to guess
> >       at where power transitions occurred, etc).
> >    3. New apps which know about modern processor
> >       capabilities, and attempt to rely on constant_tsc
> >       forgoing all the best-effort filtering, etc.
> >    4. Apps which use gettimeofday() and/or clock_gettime()
> >       for all time measurement.  They're guaranteed to get
> >       consistent time results, perhaps at the cost of a
> >       syscall.  On systems which support it, they'll get
> >       vsyscall implementations which avoid the syscall while
> >       still using the best-possible clocksource.  Even if
> >       they don't, a syscall will outperform an emulated
> >       rdtsc.
> >
> > Class 1 apps are just broken.  We can try to emulate a UP,
> > no-PM processor for them, and that's probably best done in
> > an HVM domain.  There's no need to go to extraordinary
> > efforts for them because the native hardware certainly
> > won't.
> >
> > Class 2 apps will work as well as ever in a Xen PV domain as-is.
> > If they use rdtscp then they will be able to correlate the
> > tsc to the underlying pcpu and manage consistency that way.
> > If they pin threads to VCPUs, then they may also require
> > VCPUs to be pinned to PCPUs.  But there's no need to make
> > deep changes to Xen's tsc handling to accommodate them.
> >
> > Class 3 apps will get a bit of a rude surprise in a PV Xen
> > domain.  But they're also new enough to use another
> > mechanism to get time.  They're new enough to "know" that
> > gettimeofday can be very efficient, and should not be going
> > down the rathole of using rdtsc directly.  And unless
> > they're going to be restricted to a very narrow class of
> > machines (for example, not my relatively new Core2 laptop
> > which stops the "constant" tsc in deep sleep modes), they
> > need to fall back to being a class 2 or 4 app anyway.
> >
> > Class 4 apps are not well-served under Xen.  I think the
> > vsyscall mechanism will be disabled and they'll always end
> > up doing a real syscall.  However, I think it would be
> > relatively easy to add a new vgettimeofday implementation
> > which directly uses the pvclock mechanism from usermode (the
> > same code would work equally well for Xen and KVM).  There's
> > no need to add a new usermode ABI to get quick, high-quality
> > time in usermode.  Performance-wise it would be more or less
> > indistinguishable from using a raw rdtsc, but it has the
> > benefit of getting full cooperation from the kernel and Xen,
> > and can take into account all tsc variations (if any).
> >
> >
> > So if you want to address these problems, it seems to me
> > you'll get most bang for the buck by fixing (v)gettimeofday
> > to use pvclock, and convincing app writers to trust in
> > gettimeofday.
> >
> >     J
> >

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
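
For readers following the thread, here is a minimal sketch of the arithmetic a usermode pvclock reader (the vgettimeofday idea in the quoted message) would have to perform. It is not code from this thread or from any kernel. The field names are assumptions modelled on the per-vcpu time record as commonly described (a version counter, a TSC timestamp, a system time in nanoseconds, a 32-bit multiplier and a signed shift), and read_info/read_tsc are hypothetical callbacks standing in for the shared page and the rdtsc instruction.

```python
from dataclasses import dataclass
from typing import Callable

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class TimeInfo:
    """Snapshot of a per-vcpu time record (field names are assumptions)."""
    version: int            # odd while the record is being updated
    tsc_timestamp: int      # TSC value when system_time was sampled
    system_time: int        # nanoseconds at tsc_timestamp
    tsc_to_system_mul: int  # 32-bit fixed-point multiplier (ns per tick << 32)
    tsc_shift: int          # signed shift applied to the delta before scaling


def scale_delta(delta: int, mul: int, shift: int) -> int:
    """Convert a TSC delta to nanoseconds: shift first, then multiply >> 32."""
    if shift < 0:
        delta >>= -shift
    else:
        delta = (delta << shift) & MASK64
    return ((delta * mul) >> 32) & MASK64


def pvclock_ns(read_info: Callable[[], TimeInfo],
               read_tsc: Callable[[], int]) -> int:
    """Return nanoseconds, retrying if the record changed during the read."""
    while True:
        before = read_info()
        if before.version & 1:
            continue  # update in progress, try again
        tsc = read_tsc()
        after = read_info()
        if after.version != before.version:
            continue  # record was rewritten while we were reading
        delta = (tsc - before.tsc_timestamp) & MASK64
        return before.system_time + scale_delta(
            delta, before.tsc_to_system_mul, before.tsc_shift)


if __name__ == "__main__":
    # Hypothetical values: 0.5 ns per tick, i.e. a 2 GHz TSC.
    info = TimeInfo(version=2, tsc_timestamp=1_000_000,
                    system_time=5_000_000,
                    tsc_to_system_mul=1 << 31, tsc_shift=0)
    print(pvclock_ns(lambda: info, lambda: 1_003_000))
```

The version check is what would let such a reader run without a syscall: the reader never blocks, it just retries when it observes an update in flight.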

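Since the message above asks others to repeat the measurements, here is one rough way to estimate per-call cost of the clock_gettime path. This is a sketch under stated assumptions, not the methodology used for the numbers in the email. It times a loop of calls and reports the best average over several repeats. The helper names (ns_per_call, monotonic_call, realtime_call, empty_call) are hypothetical, the interpreter's own call overhead is included in every figure, and rdtsc is not measured at all, so the absolute values are not comparable with the table in the message. Only the relative difference between runs (for example with and without rdtsc emulation) is meaningful.

```python
import time


def ns_per_call(fn, iterations=200_000, repeats=5):
    """Best-of-`repeats` average cost of one fn() call, in nanoseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            fn()
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / iterations)
    return best


def monotonic_call():
    return time.clock_gettime(time.CLOCK_MONOTONIC)


def realtime_call():
    return time.clock_gettime(time.CLOCK_REALTIME)


def empty_call():
    return None


if __name__ == "__main__":
    baseline = ns_per_call(empty_call)  # loop and call overhead only
    for name, fn in (("CLOCK_MONOTONIC", monotonic_call),
                     ("CLOCK_REALTIME", realtime_call)):
        cost = ns_per_call(fn)
        print(f"{name}: {cost:.0f} ns/call "
              f"({cost - baseline:.0f} ns above baseline)")
```

Taking the minimum over repeats rather than the mean is a deliberate choice: it discards runs disturbed by scheduling or interrupts, which matters more inside a guest than on bare metal.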