Xen project Mailing List

RE: [Xen-devel] cpuidle causing Dom0 soft lockups

To: Jan Beulich <JBeulich@xxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>

From: "Yu, Ke" <ke.yu@xxxxxxxxx>

Date: Wed, 3 Feb 2010 01:07:14 +0800

Accept-language: en-US

Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

Delivery-date: Tue, 02 Feb 2010 09:07:43 -0800

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: Acqj3Q0s1oo/kfQzT7a3+g7dJJDr3gAQv8qw

Thread-topic: [Xen-devel] cpuidle causing Dom0 soft lockups

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@xxxxxxxxxx]
>Sent: Tuesday, February 02, 2010 3:55 PM
>To: Keir Fraser; Yu, Ke
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>>> Keir Fraser <keir.fraser@xxxxxxxxxxxxx> 21.01.10 12:03 >>>
>>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@xxxxxxxxxx> wrote:
>>> I can see your point. But how can you consider shipping with something
>>> apparently severely broken. As said before - the fact that this manifests
>>> itself by hanging many-vCPU Dom0 has the very likely implication that
>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>>> machine at hand that supports C3, I'd try to do some measurements
>>> with smaller domains...
>>
>>Well it's a fallback I guess. If we can't make progress on solving it then I
>>suppose I agree.
>
>Just fyi, we now also have seen an issue on a 24-CPU system that went
>away with cpuidle=0 (and static analysis of the hang hinted in that
>direction). All I can judge so far is that this likely has something to do
>with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>not waking up from the call despite there being pending unmasked or
>polled for events).
>
>Jan

Hi Jan,

We just identified the cause of this issue, and are trying to find an appropriate way to fix it.

This issue is the result of the following sequence:

1. Every dom0 vCPU has one 250HZ timer (i.e. 4ms period). The vCPU timer_interrupt handler will acquire a global ticket spin lock, xtime_lock. When xtime_lock is held by another vCPU, the vCPU will poll the event channel and become blocked. As a result, the pCPU where the vCPU runs will become idle. Later, when the lock holder releases xtime_lock, it will notify the event channel to wake up the vCPU. As a result, the pCPU will wake up from the idle state and schedule the vCPU to run.

From the above, we can see that the latency of the vCPU timer interrupt consists of the following items. The "latency" here means the time between beginning to acquire the lock and finally acquiring it.

T1 - CPU execution time (e.g. timer interrupt lock holding time, event channel notification time)
T2 - CPU idle wake up time, i.e. the time for the CPU to wake up from a deep C state (e.g. C3) to C0; usually it is on the order of several 10us or 100us

2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and 64 vCPUs in dom0, and assume the lock holding sequence is vCPU0 -> vCPU1 -> vCPU2 ... -> vCPU63. Then vCPU63 will spend 64*(T1 + T2) to acquire xtime_lock. If T1+T2 is 100us, the total latency would be ~6ms. Since the timer is 250HZ, i.e. a 4ms period, when the event channel notification is issued and the pCPU schedules vCPU63, the hypervisor will find that the timer is overdue and will send another TIMER_VIRQ to vCPU63 (see schedule()->vcpu_periodic_timer_work() for details). In this case, vCPU63 will always be busy handling timer interrupts and will not be able to update the watchdog, thus causing the soft lockup.

So from the above sequence, we can see:
- The cpuidle driver adds extra latency, making this issue more likely to occur.
- A large number of CPUs multiplies the latency.
- The ticket spin lock leads to a fixed lock acquiring sequence, so the latency is repeatedly 64*(T1+T2), making this issue more likely to occur.

The fundamental cause of this issue is that the vCPU timer interrupt handler does not scale well, due to the global xtime_lock.

From the cpuidle point of view, one thing we are trying to do is change the cpuidle driver to not enter a deep C state when there is a vCPU with local irq disabled and polling an event channel. In this case, the T2 latency will be eliminated.

Anyway, cpuidle is just one side. We can anticipate that if the CPU number is large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another way is to make dom0 scale well by not using xtime_lock, although this is pretty hard currently. Yet another way is to limit the dom0 vCPU number to a certain reasonable level.

Regards
Ke

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
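The latency estimate in the message above, 64*(T1+T2) compared against the 4ms period of the 250HZ timer, can be checked with a small back-of-the-envelope sketch. This is only an illustrative model, not code from Xen or the dom0 kernel: the function names are hypothetical, and the split of the 100us figure into T1 = 20us and T2 = 80us is an assumption made for the example (the message only gives the sum).

```python
# Illustrative model of the xtime_lock hand-off chain described in the message.
# All names here are hypothetical; nothing below is a Xen or Linux API.

TIMER_HZ = 250                            # dom0 vCPU timer frequency from the message
TIMER_PERIOD_US = 1_000_000 // TIMER_HZ   # 4000 us, i.e. the 4ms period


def worst_case_wait_us(n_vcpus, t1_us, t2_us):
    """Wait of the last vCPU in a fixed ticket order: n_vcpus * (T1 + T2)."""
    return n_vcpus * (t1_us + t2_us)


def timer_overdue(n_vcpus, t1_us, t2_us):
    """True if the last vCPU gets the lock only after the next tick is already due."""
    return worst_case_wait_us(n_vcpus, t1_us, t2_us) > TIMER_PERIOD_US


def max_vcpus_within_period(t1_us, t2_us):
    """Largest vCPU count whose worst-case wait still fits in one timer period."""
    return TIMER_PERIOD_US // (t1_us + t2_us)


if __name__ == "__main__":
    # Assumed split of the 100us total: T1 = 20us of execution, T2 = 80us of C-state wake-up.
    t1, t2 = 20, 80
    print(worst_case_wait_us(64, t1, t2))   # 64 vCPUs with deep C states
    print(timer_overdue(64, t1, t2))
    print(timer_overdue(64, t1, 0))         # same chain with T2 eliminated
    print(max_vcpus_within_period(t1, 0))   # where NR_CPU * T1 reaches 4ms
```

Under these assumed numbers, 64 vCPUs give a 6400us worst-case wait, which exceeds the 4000us period, while removing T2 brings it down to 1280us. The last call shows the point made at the end of the message: even with T2 gone, the wait grows linearly with the vCPU count, so a large enough dom0 would hit the same limit from T1 alone.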
