
RE: [Xen-devel] cpuidle causing Dom0 soft lockups



>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@xxxxxxxxxx]
>Sent: Tuesday, February 02, 2010 3:55 PM
>To: Keir Fraser; Yu, Ke
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>>> Keir Fraser <keir.fraser@xxxxxxxxxxxxx> 21.01.10 12:03 >>>
>>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@xxxxxxxxxx> wrote:
>>> I can see your point. But how can you consider shipping with something
>>> apparently severely broken. As said before - the fact that this manifests
>>> itself by hanging many-vCPU Dom0 has the very likely implication that
>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>>> machine at hand that supports C3, I'd try to do some measurements
>>> with smaller domains...
>>
>>Well it's a fallback I guess. If we can't make progress on solving it then I
>>suppose I agree.
>
>Just fyi, we now also have seen an issue on a 24-CPU system that went
>away with cpuidle=0 (and static analysis of the hang hinted in that
>direction). All I can judge so far is that this likely has something to do
>with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>not waking up from the call despite there being pending unmasked or
>polled for events).
>
>Jan

Hi Jan,

We just identified the cause of this issue and are trying to find an appropriate 
way to fix it.

This issue is the result of the following sequence:
1. Every dom0 vCPU has a 250Hz timer (i.e. a 4ms period). The vCPU's 
timer_interrupt handler acquires a global ticket spin lock, xtime_lock. When 
xtime_lock is held by another vCPU, the waiting vCPU polls its event channel and 
blocks. As a result, the pCPU on which that vCPU runs becomes idle. Later, when 
the lock holder releases xtime_lock, it notifies the event channel to wake up 
the waiting vCPU; the pCPU then wakes from its idle state and schedules the 
vCPU to run.
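
For illustration, here is a minimal sketch of that wait/wake pattern, assuming 
the poll-hypercall approach described above. The function names and the 
lock-holder wrapper are illustrative, not the actual dom0 ticket-lock code, and 
the exact headers vary by kernel tree:

    /* Illustrative sketch only -- not the real dom0 spinlock code. */
    #include <xen/interface/sched.h>          /* SCHEDOP_poll, struct sched_poll */
    #include <xen/interface/event_channel.h>  /* EVTCHNOP_send, struct evtchn_send */
    #include <asm/xen/hypercall.h>

    /* Waiter side: block this vCPU until its lock event channel fires.
     * The pCPU becomes idle and, with cpuidle enabled, may enter a deep
     * C state such as C3. */
    static void wait_for_lock_kick(evtchn_port_t port)
    {
        struct sched_poll poll;

        set_xen_guest_handle(poll.ports, &port);
        poll.nr_ports = 1;
        poll.timeout  = 0;                    /* wait until the event arrives */
        HYPERVISOR_sched_op(SCHEDOP_poll, &poll);
    }

    /* Holder side: after releasing xtime_lock, kick the next waiter's
     * event channel so its pCPU leaves the idle state (paying the T2
     * latency defined below) and reschedules the vCPU. */
    static void kick_next_waiter(evtchn_port_t port)
    {
        struct evtchn_send send = { .port = port };

        HYPERVISOR_event_channel_op(EVTCHNOP_send, &send);
    }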

From the above, we can see that the latency of the vCPU timer interrupt consists 
of the following items ("latency" here means the time from starting to acquire 
the lock until the lock is finally acquired):
T1 - CPU execution time (e.g. the time the timer interrupt holds the lock, plus 
the event channel notification time)
T2 - CPU idle wake-up time, i.e. the time for the CPU to return from a deep C 
state (e.g. C3) to C0, usually on the order of tens to hundreds of microseconds

2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and 64 vCPUs 
in dom0, and assume the lock acquisition order is vCPU0 -> vCPU1 -> vCPU2 
... -> vCPU63. 
Then vCPU63 spends 64*(T1 + T2) acquiring xtime_lock. If T1+T2 is 100us, the 
total latency is ~6ms.
Since the timer is 250Hz (a 4ms period), by the time the event channel 
notification is issued and the pCPU schedules vCPU63, the hypervisor finds the 
timer is already overdue and sends another TIMER_VIRQ to vCPU63 (see 
schedule()->vcpu_periodic_timer_work() for details). In this case, vCPU63 is 
permanently busy handling timer interrupts and never gets to touch the 
watchdog, which causes the soft lockup.
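
A quick back-of-the-envelope check of those numbers (the 100us value for T1+T2 
is an assumption for illustration, not a measurement):

    /* Rough model of the xtime_lock convoy described above. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned ncpus        = 64;    /* dom0 vCPUs in the ticket queue */
        const unsigned t1_plus_t2   = 100;   /* us, assumed per-waiter cost */
        const unsigned timer_period = 4000;  /* us, 250Hz periodic timer */

        /* The last waiter in the ticket queue sees the full convoy. */
        unsigned worst = ncpus * t1_plus_t2; /* latency seen by vCPU63: 6400us */

        printf("worst-case xtime_lock latency: %u us\n", worst);
        if (worst > timer_period)
            printf("timer already overdue when vCPU63 runs -> "
                   "another TIMER_VIRQ is raised immediately\n");
        return 0;
    }

Since 6400us > 4000us, vCPU63 re-enters timer_interrupt as soon as it returns 
from the previous one, and the cycle repeats.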

So from the above sequence, we can see:
- The cpuidle driver adds extra latency (T2), which makes this issue easier to 
trigger.
- A large number of CPUs multiplies the latency.
- The ticket spin lock enforces a fixed lock acquisition order, so the worst-case 
latency is repeatedly 64*(T1+T2), which again makes this issue easier to trigger.
The fundamental cause is that the vCPU timer interrupt handler does not scale, 
due to the global xtime_lock.

From the cpuidle point of view, one thing we are trying to do is to change the 
cpuidle driver so that it does not enter a deep C state when there is a vCPU 
with local irqs disabled that is polling an event channel. In that case the T2 
latency is eliminated.
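
Roughly, the idea looks like the sketch below. Note that 
deepest_allowed_cstate(), pcpu_has_polling_vcpu_with_irq_disabled() and 
enter_cstate() are placeholders for whatever we end up implementing, not 
existing Xen functions:

    /* Hypothetical sketch of the proposed cpuidle change, not real code. */
    static void cpu_idle_enter(unsigned int cpu)
    {
        unsigned int target_cstate = deepest_allowed_cstate(cpu);  /* placeholder */

        /*
         * If a vCPU on this pCPU is blocked in SCHEDOP_poll with event
         * delivery disabled, waking it out of a deep C state adds the
         * T2 latency described above, so cap the C state at C1.
         */
        if (pcpu_has_polling_vcpu_with_irq_disabled(cpu))          /* placeholder */
            target_cstate = 1;                                     /* C1 */

        enter_cstate(cpu, target_cstate);                          /* placeholder */
    }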

Anyway, cpuidle is only one side of this. We can anticipate that if the CPU 
count is large enough that NR_CPU * T1 > 4ms, the issue will occur again even 
without deep C states. So another approach is to make dom0 scale better by not 
using xtime_lock, although that is pretty hard currently; yet another is to 
limit the number of dom0 vCPUs to some reasonable level.

Regards
Ke

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

