[Xen-devel] [RFC] Correct/fast timestamping in apps under Xen [1 of 4]: Reliable TSC
============= Premise 1: A large and growing percentage of servers running Xen have a "reliable" TSC, and Xen can determine conclusively whether a server does or does not have a reliable TSC. =============

The truth of this statement has been vociferously challenged in other threads, so I'd LOVE TO GET FEEDBACK OR CONFIRMATION FROM PROCESSOR AND SERVER VENDORS. The rest of this is long though hopefully educational, but if you have no interest in the rdtsc instruction or timestamping, please move on to [2 of 4].

Since my overall premise is a bit vague, I need to first very clearly define my terms. And to define those terms clearly, I need to provide some more background. As far as I can find, there is no publication which clearly describes all of these concepts.

The rdtsc instruction was at one time the easiest, cheapest, and most precise method for "approximating the passage of time"; as such, rdtsc was widely used by x86 performance practitioners and by high-end apps that needed to provide extensive metrics.

When commodity SMP x86 systems emerged, rdtsc fell into disfavor because: (a) it was difficult for different CPU packages to share a crystal, or to ensure that different crystals were synchronized and incrementing at precisely the same rate; and (b) SMP apps were oblivious to which CPU their thread(s) were running on, so two rdtsc instructions in the same thread might execute on different CPUs and thus unwittingly use different crystals, resulting in strange things like the appearance that time went backwards (sometimes by a large amount), or events appearing to take different amounts of time depending on whether they were running on processor A or processor B. We will call this the "inconsistent TSC" problem.

Processor and system vendors attempted to fix the inconsistent TSC problem by providing a new class of "platform timers" (e.g. HPET), but these proved to be slow and difficult to use, especially for apps that required frequent fine-grained metrics.

Processor and system vendors eventually figured out how to synchronize the TSCs by driving them all from the same crystal, but then a new set of problems emerged: power-management features sometimes caused the clock on one processor to slow down or even stop, thus destroying the synchrony with other processors. This was fixed first by ensuring that the tick rate did not change ("constant TSC") and later that it did not stop ("nonstop TSC"), unless ALL of the TSCs on all of the processors stopped.

Nearly all of the most recent generations of server processors support these capabilities, so on most recent servers the TSC on all processors/cores/sockets is driven by the same crystal, always ticks at the same rate, and doesn't stop independently of other processors' TSCs. This is what we call a "reliable TSC".

But we're not done yet. What does a reliable TSC provide? We need to define a few more terms.

A "perfect TSC" would be one where a magic logic analyzer with a cesium clock could confirm that the TSCs on every processor increment at precisely the same femtosecond. Both the speed of light and the pricing models of commodity processors make a perfect TSC unlikely :-)

How close is good enough? We define two TSCs as being "unobservably different" if code running on the two processors can never see time going backwards, because the difference between their TSCs is smaller than the memory access overhead due to cache synchronization. (This is sometimes called a "cache bounce".) To wit: suppose processor A does a rdtsc and writes the result into memory; meanwhile processor B is spinning until it sees that the memory location has changed, then reads A's value from memory and does its own rdtsc. If B's rdtsc is NEVER less than or equal to A's rdtsc, we will call this an "optimal TSC".
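Purely to make that definition concrete, here is a rough user-space sketch of the A/B experiment. This is my illustration only, not the GPL test code referenced below; it assumes Linux, gcc, pthreads, and that CPUs 0 and 1 are online, and a single pair of samples obviously proves nothing by itself -- a real test repeats this many times in both directions.

/*
 * Sketch of the A/B cross-CPU observation test described above.
 * Thread A (pinned to CPU 0) publishes its rdtsc value through shared
 * memory; thread B (pinned to CPU 1) spins until it sees the value,
 * then takes its own rdtsc and checks it is strictly greater.
 *
 * Build (Linux/gcc assumed):  gcc -O2 -pthread tsc_ab.c -o tsc_ab
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>           /* __rdtsc() */

static uint64_t a_tsc;           /* 0 means "not yet published" */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *cpu_a(void *arg)
{
    (void)arg;
    pin_to_cpu(0);
    uint64_t t = __rdtsc();
    __atomic_store_n(&a_tsc, t, __ATOMIC_SEQ_CST);   /* publish A's sample */
    return NULL;
}

static void *cpu_b(void *arg)
{
    (void)arg;
    pin_to_cpu(1);
    uint64_t a;
    while ((a = __atomic_load_n(&a_tsc, __ATOMIC_SEQ_CST)) == 0)
        ;                         /* spin until A has published */
    uint64_t b = __rdtsc();
    printf(b > a ? "ok: B (%llu) > A (%llu)\n"
                 : "OBSERVED: B (%llu) <= A (%llu)\n",
           (unsigned long long)b, (unsigned long long)a);
    return NULL;
}

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&tb, NULL, cpu_b, NULL);
    pthread_create(&ta, NULL, cpu_a, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}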
A reliable TSC is not guaranteed to be optimal; it may just be very close to optimal, meaning the difference between two TSCs may sometimes be observable, but it will always be very small. (As far as I know, processor and server vendors will not guarantee exactly how small.)

To simulate an optimal TSC with a reliable TSC, a software wrapper can be placed around reads of the reliable TSC to catch and "fix" the rare circumstances where time goes backwards. If this wrapper ensures that time never goes backwards AND ensures that time always moves forward, we call it a monotonically-increasing wrapper. If it instead ensures that time never goes backwards but may appear to stop, we call it a monotonically-non-decreasing wrapper. (A sketch of such a wrapper appears at the end of this note.)

Note also that a reliable TSC is not guaranteed to never stop; it is just guaranteed that if the TSC on one processor is stopped, the TSC on all other processors will also be stopped. As a result, a reliable TSC cannot be used as a wallclock, at least not without other software support that can properly adjust the TSC on all processors when all processors awaken.

Last, there is the issue of whether or not Xen can conclusively determine if the TSC is reliable. This is still an open challenge. There exists a CPUID bit which purports to indicate this, but it is not known with certainty whether there are exceptions. Notably, there is concern whether certain newer, larger NUMA servers will truly provide a reliable TSC across all system processors even if the CPUID bit on each CPU package says the package does provide a reliable TSC. One large server vendor claims that this is no longer a problem, but ideally we would like to test this dynamically, and there is GPL code available to do exactly that. This code is used in Linux in some circumstances, once at boot time, to test for an "optimal TSC". But in some cases the presence of the CPUID bit causes this test to be skipped. And in any case a boot-time test may not catch all problems, such as a power event that doesn't handle the TSC quite properly. So without some form of ongoing post-boot test, we just don't know.
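For reference, the CPUID bit in question is the "Invariant TSC" indication in CPUID leaf 0x80000007, EDX bit 8. A minimal probe from C might look like the following (sketch only, assuming gcc/clang on x86; and again, the bit is a vendor claim of constant+nonstop TSC, not proof of cross-socket synchronization):

/*
 * Probe the "Invariant TSC" bit: CPUID leaf 0x80000007, EDX bit 8.
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Make sure the extended leaf exists before querying it. */
    if (__get_cpuid_max(0x80000000, NULL) < 0x80000007 ||
        !__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 0x80000007 not available\n");
        return 1;
    }

    printf("Invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
    return 0;
}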
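Finally, to illustrate what I mean by a monotonically-non-decreasing wrapper, here is one simple-minded sketch (mine, not Xen code): keep a global "last value returned" and never hand back anything smaller. Note the irony that the compare-and-swap reintroduces a cache bounce on every read, so a real implementation would want to be cleverer.

/*
 * Sketch of a monotonically-non-decreasing wrapper around rdtsc:
 * never return a value smaller than any value previously returned
 * (on any CPU).  The global last_tsc is advanced with compare-and-swap.
 */
#include <stdint.h>
#include <x86intrin.h>           /* __rdtsc() */

static uint64_t last_tsc;        /* highest value handed out so far */

uint64_t monotonic_rdtsc(void)
{
    uint64_t now = __rdtsc();
    uint64_t seen = __atomic_load_n(&last_tsc, __ATOMIC_RELAXED);

    for (;;) {
        if (now <= seen)
            return seen;         /* time appeared to go backwards: clamp */
        /* Try to record the new maximum; retry if another CPU raced us. */
        if (__atomic_compare_exchange_n(&last_tsc, &seen, now,
                                        0, __ATOMIC_SEQ_CST,
                                        __ATOMIC_SEQ_CST))
            return now;
    }
}

A monotonically-increasing variant would instead bump and record seen+1 when time appears to stall, at the cost of handing out slightly fictional timestamps.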