RE: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default
On Tuesday, March 31, 2009 10:52 PM, Carsten Schiers wrote:
> Sorry for my ignorance, but as I find all this very interesting, prior to
> reading it all over, and as I suffer from skew after using cpuidle and
> also cpufreq management (on an AMD CPU whose TSC frequency is variant
> across freq/voltage scaling ;-), a few questions:
>
> - You mention lost ticks in some guests; does this include Dom0? It's
> where my messages mainly show up.

I haven't observed lost-tick warnings in Dom0 on the current Xen 3.4 tip so
far.

> - You recommend limiting cpuidle either to C1 or C2 (in case the APIC
> timer is not stopping). How to know that?

You may need to refer to the processor's spec.

> - xm debug-key c reports active C1, max_cstate C2, but only lists C1
> usage. C1 Clock Ramping seems to be disabled. The platform timer is a
> 25MHz HPET. Excuse my ignorance again, but doesn't that mean I am not
> using C-states at all?

In Xen 3.3 the C1 residency is not counted yet. max_cstate=C2 does not mean
your platform supports C2; it just means that if your platform supports
C-states deeper than C2, the deepest C-state used will be C2. I guess xm
debug-key c didn't report any C2 information (usage, residency) on your
platform, right? If so, that means your system only supports C1.

> I understand you speak about Xen 3.4. Currently I am at 3.3.1 and have to
> wait for a slot to test 3.4. I am curious to see what happens. Dan told
> me how to use xm debug-key t and said the max cycles skew is so much
> smaller than the max stime (xen system time) skew. This makes him believe
> 3.4 will help.

Yes, I also strongly suggest you give 3.4 a try. But I don't expect much
for the variant-TSC case, just as I said in the original mail. BTW, I
believe enabling cpuidle or not should have no impact on your case. Have
you checked the result with cpufreq disabled?

Thanks,
Jimmy

>
> BR,
> Carsten.
>
> ----- Original message -----
> From: "Wei, Gang" <gang.wei@xxxxxxxxx>
> Sent: Tue, 31.3.2009 16:00
> To: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
> Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx>; Keir Fraser
> <keir.fraser@xxxxxxxxxxxxx>; "Yu, Ke" <ke.yu@xxxxxxxxx>
> Subject: [Xen-devel] Potential side-effects and mitigations after cpuidle
> enabled by default
>
> In Xen 3.4, cpuidle is enabled by default as of c/s 19421. But some
> side-effects may exist under different h/w C-state implementations or h/w
> configurations, so users may occasionally observe latency or system
> time/TSC skew. Below are the conditions causing these side-effects and
> the means to mitigate them:
>
> 1. Latency
>
> Latency can be caused by two factors: C-state entry/exit latency, and
> extra latency caused by the broadcast mechanism.
>
> C-state entry/exit latency is inevitable, since powering gates on/off
> takes time. Normally a shallower C-state incurs lower latency but less
> power-saving capability, and vice versa for a deeper C-state. The cpuidle
> governor tries to balance the performance/power tradeoff at a high level,
> which is one area we'll continue to tune.
>
> Broadcast is necessary to handle the APIC timer stopping in deep C-states
> (>=C3) on some platforms. One platform timer source is chosen to carry
> the per-CPU timer deadlines and then wake up CPUs in deep C-states at the
> expected expiry. So far Xen 3.4 supports the PIT and the HPET as the
> broadcast source. In the current implementation, PIT broadcast runs in
> periodic mode (10ms), which means up to 10ms of extra latency can be
> added to an expiry expected by a sleeping CPU. This is just an initial
> implementation choice, which of course could be enhanced to an on-demand
> on/off mode in the future; we didn't go into that complexity in the
> current implementation, due to the PIT's slow access and short wrap
> interval. So HPET broadcast is always preferred once that facility is
> available, as it adds negligible overhead with timely wakeup. Then... the
> world is not always perfect, and some side-effects also exist with the
> HPET.
>
> The details are listed below:
>
> 1.1. For h/w supporting only ACPI C1 (halt) (as reported by the BIOS in
> the ACPI _CST method):
>
> It's immune to this side-effect, as only instruction execution is halted.
>
> 1.2. For h/w supporting ACPI C2 in which the TSC and APIC timer don't
> stop:
>
> The ACPI C2 type is a bit special: it is sometimes an alias for a deep
> CPU C-state, and thus current Xen 3.4 treats the ACPI C2 type in the same
> manner as the ACPI C3 type (i.e. broadcast is activated). If the user
> knows that the ACPI C2 type on their platform does not have that h/w
> limitation, 'lapic_timer_c2_ok' can be added in grub to deactivate the
> software mitigation.
>
> 1.3. For the remaining implementations, supporting ACPI C2+ in which the
> APIC timer is stopped:
>
> 1.3.1. HPET as broadcast timer source
>
> The HPET can deliver timely wakeup events to CPUs sleeping in deep
> C-states with negligible overhead, as stated earlier. But the HPET mode
> being used does make some differences worth noting:
>
> 1.3.1.1. If the h/w supports per-channel MSI delivery mode (interrupts
> via FSB), it's the best broadcast mechanism known so far. There is no
> side-effect regarding latency, and the IPIs used to broadcast the wakeup
> event can be reduced by a factor of the number of available channels
> (each channel can independently serve one or several sleeping CPUs).
>
> As long as this feature is available, it is always preferred
> automatically.
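To make the per-channel MSI check concrete, here is a minimal sketch of
probing an HPET timer channel for FSB interrupt delivery capability. The
register offset and bit position follow the HPET specification; 'hpet_base'
and the helper name are illustrative, not the actual Xen code:

    #include <stdint.h>

    #define HPET_Tn_CFG(n)   (0x100 + (n) * 0x20) /* Timer n config/cap reg */
    #define HPET_TN_FSB_CAP  (1ULL << 15)         /* Tn_FSB_INT_DEL_CAP bit */

    /* 'hpet_base' is assumed to be a virtual mapping of the HPET MMIO
     * region, set up elsewhere.  Returns nonzero if channel n can raise
     * its interrupt as an MSI over the FSB -- the best case above. */
    int hpet_channel_has_fsb(volatile char *hpet_base, unsigned int n)
    {
        uint64_t cfg = *(volatile uint64_t *)(hpet_base + HPET_Tn_CFG(n));
        return (cfg & HPET_TN_FSB_CAP) != 0;
    }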
> 1.3.1.2. When MSI delivery mode is absent, we have to use legacy
> replacement mode, with only one HPET channel available. Well, it's not
> that bad, as this single channel can serve all sleeping CPUs by using
> IPIs to wake them up. However, another side-effect occurs: the PIT/RTC
> interrupts (IRQ0/IRQ8) are replaced by the HPET channel. The RTC alarm
> feature in dom0 is then lost, unless we add RTC emulation between dom0's
> rtc module and Xen's HPET logic (which is not implemented so far).
>
> Due to the above side-effect, this broadcast option is disabled by
> default; in that case, PIT broadcast is the default. If the user is sure
> he doesn't need the RTC alarm, the 'hpetbroadcast' grub option can be
> used to force-enable it.
>
> 1.3.2. PIT as broadcast timer source
>
> If MSI-based HPET interrupt delivery is not available or the HPET is
> missing, PIT broadcast is the current default in all cases. As said
> earlier, PIT broadcast is implemented in 10ms periodic mode, which can
> thus incur up to 10ms of latency for each deep C-state entry/exit. One
> natural result is observing 'many lost ticks' in some guests.
>
> 1.4. Suggestions
>
> So, if the user doesn't care about power consumption while his platform
> does expose deep C-states, one mitigation is to add the 'max_cstate='
> boot option to restrict the maximum allowed C-state (if limited to C2,
> make sure to add 'lapic_timer_c2_ok' where applicable). Runtime
> modification of 'max_cstate' is allowed via xenpm (patch posted on
> 3/24/2009, not checked in yet).
>
> If the user does care about power consumption and has no requirement for
> the RTC alarm, then always using the HPET is preferred.
>
> Last, we could either add RTC emulation on top of the HPET or enhance PIT
> broadcast to use single-shot mode, but we would like to see comments from
> the community on whether it's worth doing. :-)
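Putting section 1 together, the broadcast logic around a deep C-state entry
looks roughly like the sketch below. All names are illustrative (they do
not match the Xen idle loop); it only shows the ordering: park the per-CPU
deadline with the broadcast source before the APIC timer stops, and
reprogram the local timer on wakeup.

    #include <stdint.h>

    extern int lapic_timer_reliable;  /* e.g. C1 only, or lapic_timer_c2_ok */
    extern uint64_t this_cpu_next_deadline(void);
    extern void broadcast_enter(uint64_t deadline); /* HPET channel or PIT */
    extern void broadcast_exit(void);
    extern void do_idle(unsigned int cx);  /* hlt/mwait into C-state 'cx' */

    void enter_cstate(unsigned int cx)
    {
        /* ACPI C2 is treated like C3 unless the user asserted that the
         * APIC timer keeps running (see 1.2). */
        int need_broadcast = (cx >= 2) && !lapic_timer_reliable;

        if (need_broadcast)
            broadcast_enter(this_cpu_next_deadline()); /* timer will stop */

        do_idle(cx);  /* woken by the broadcast interrupt or an IPI */

        if (need_broadcast)
            broadcast_exit();  /* reprogram the local APIC timer */
    }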
> 2. System time/TSC skew
>
> Similarly to the APIC timer stopping, the TSC is also stopped in deep
> C-states on some implementations, which thus requires Xen to recover the
> lost counts at exit from a deep C-state by software means. It's easy to
> imagine the kinds of errors such software methods can cause. For the
> details of how TSC skew can occur, its side effects and possible
> solutions, you can refer to our Xen summit presentation:
> http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf
>
> Below is a brief introduction to which algorithm is used on different
> implementations:
>
> 2.1. The best case is a non-stop TSC at the h/w implementation level. For
> example, Intel Core i7 processors support this green feature, which can
> be detected by CPUID. Xen will do nothing once this feature is detected,
> and thus there is no extra software-caused skew besides dozens of cycles
> due to crystal drift.
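For illustration, the non-stop ("invariant") TSC flag is reported in CPUID
leaf 0x80000007, EDX bit 8. Below is a small stand-alone user-space sketch
using GCC's <cpuid.h>; it performs the same check, but it is not the Xen
detection code itself:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        /* Leaf 0x80000007 (Advanced Power Management), EDX bit 8:
         * invariant TSC -- it neither stops in deep C-states nor
         * changes frequency with P-state transitions. */
        if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) &&
            (edx & (1u << 8)))
            printf("non-stop (invariant) TSC: yes\n");
        else
            printf("non-stop (invariant) TSC: no\n");

        return 0;
    }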
> 2.2. If the TSC frequency is invariant across freq/voltage scaling (true
> for all Intel processors supporting VT-x), Xen will sync the APs' TSCs to
> the BSP's at a 1-second interval in the per-cpu time calibration, and
> meanwhile do the recovery in a per-cpu style, where the platform counter
> ticks elapsed since the last calibration point are compensated into the
> local TSC with a boot-time-calculated scale factor. This global
> synchronization, along with the per-cpu compensation, limits TSC skew to
> the ns level in most cases.
>
> 2.3. If the TSC frequency is variant across freq/voltage scaling, Xen
> will only do the recovery in a per-cpu style, where the platform counter
> ticks elapsed since the last calibration point are compensated into the
> local TSC with a local scale factor. In this manner the TSC skew across
> CPUs accumulates and is easy to observe after the system has been up for
> some time.
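Conceptually, the per-cpu compensation in 2.2/2.3 scales the platform
counter ticks elapsed since the last calibration point into TSC cycles and
writes the result back. A rough sketch with illustrative names follows; the
actual Xen routines and data structures differ, and a real implementation
would use a fixed-point scale factor rather than a double:

    #include <stdint.h>

    struct calibration {
        uint64_t plt_count; /* platform counter at last calibration      */
        uint64_t tsc;       /* local TSC at last calibration             */
        double   scale;     /* TSC cycles per platform tick: global in   */
    };                      /* case 2.2, per-cpu/local in case 2.3       */

    extern uint64_t read_platform_counter(void); /* HPET or PIT           */
    extern void write_tsc(uint64_t val);         /* WRMSR to IA32_TSC     */

    /* On exit from a deep C-state where the TSC stopped: rebuild the
     * local TSC from the elapsed platform counter.  Any error in
     * 'scale' accumulates, which is why case 2.3 skews over time. */
    void restore_tsc_after_cstate(const struct calibration *cal)
    {
        uint64_t elapsed = read_platform_counter() - cal->plt_count;
        write_tsc(cal->tsc + (uint64_t)(elapsed * cal->scale));
    }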
> 2.4. Solution
>
> Once you observe obvious system time/TSC skew, and you don't especially
> care about power consumption, then, similarly to handling the broadcast
> latency:
>
> Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the
> 'lapic_timer_c2_ok' option.
>
> Or, better, run your workload on a newer platform with either constant
> TSC frequency or the non-stop TSC feature supported. :-)
>
> Jimmy
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel