Xen project Mailing List

Re: [Xen-devel] RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.

To: George Dunlap <george.dunlap@xxxxxxxxxxxxx>, keir.xen@xxxxxxxxx

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Tue, 6 May 2014 13:36:27 -0400

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, dario.faggioli@xxxxxxxxxx

Delivery-date: Tue, 06 May 2014 17:36:58 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Tue, Apr 29, 2014 at 08:42:06AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 29, 2014 at 10:16:39AM +0100, George Dunlap wrote:
> > On 04/23/2014 10:28 PM, Konrad Rzeszutek Wilk wrote:
> > >What we are observing is that if a domain is idle its steal
> > >time* goes up. My first thought was - well that is the initial
> > >domain taking the time - but after looking at the trace
> > >did not see the initial domain to be scheduled at all.
> > >
> > >(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
> >
> > "Up" like how much?
>
> 6.7msec. Or ~1/4 of the timeslice
> >
> > Steal time includes the time *being woken* up. It takes time to be
> > woken up; typically if it's being woken up from domain 0, for
> > instance, the wake (which sets it to RUNSTATE_runnable) will happen
> > on a different pcpu than the vcpu being woken is on, so there's the
> > delay of the IPI, waking up, going through the scheduler, &c.
>
> Right. In this case there are no IPIs. Just softirq handlers being

[edit: There is the IPI associated with raise_softirq_action being
broadcast to CPUs]

> triggered (by some other VCPU it seems) and they run.. And the
> time between the 'vcpu_wake' and the 'schedule' softirq are quite
> long.
> >
> > The more frequently a VM is already running, the lower probability
> > that an interrupt will actually wake it up.
>
> Right. But there are no interrupt here at all. It is just idling.

[edit: Just the IPI when it is halted and the idle guest has been
scheduled in]

> >
> > BTW, is there a reason you're using xentrace_format instead of xenalyze?
>
> I did use xenanalyze and it told me that the vCPU is busy spending most
> of its time in 'runnable' condition. The other vCPUs are doing other
> work.

I finally narrowed it down. We are contending for the 'tasklet_lock'
spinlock.

The steps that we take to get to this state are as follows (imagine
four 30-VCPU guests, each pinned to its own socket, with one socket
per guest):

a). Guest does 'HLT', we schedule in idle domain.

b). The guest's timer is triggered, an IPI comes in, we get out of
    hlt;pause

c). and softirq_pending has 'TIMER' (0) set. We end up doing this:

    idle_loop
     ->do_softirq
         timer_softirq_action
           vcpu_singleshot_timer_fn
             vcpu_periodic_timer_work
               send_timer_event->
                 send_guest_vcpu_virq->
                   evtchn_set_pending->
                     vcpu_mark_events_pending->
                       [here can call hvm_assert_evtchn_irq]

    which schedules a tasklet:

    tasklet_schedule(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);

d). We get back to do_softirq and softirq_pending has 'TASKLET' set.
    We call:

    idle_loop
     ->do_softirq
       ->tasklet_softirq_action
         -> spin_lock_irq(&tasklet_lock); [takes a while]
         -> do_tasklet_work
            -> unlock
            -> call the work function:
               ->hvm_assert_evtchn_irq
                   vcpu_kick->
                     vcpu_unblock->
                       vcpu_wake
                         [vcpu_runstate_change] [blocked->runnable]
                         [from here on any activity is accounted to
                          the guest]
                         [_runq_tickle sets the SCHEDULE_SOFTIRQ softirq]
            -> spin_lock_irq(&tasklet_lock); [takes also a bit of time]
         -> unlock

e). We get back to do_softirq and softirq_pending has 'SCHEDULE' set.
    We swap out the idle domain and stick in the new guest. The
    runtime in RUNNABLE includes the time to take the 'tasklet_lock'.

f). Call INJ_VIRQ with the 0xf3 to wake guest up.

N.B. The softirq handlers that are run end up being: TIMER, TASKLET,
and SCHEDULE. As in, low latency (TIMER), high latency (TASKLET), and
low latency (SCHEDULE).

The 'tasklet_lock' on this particular setup ends up being hit by three
different NUMA nodes, and of course at the same time. My belief is that
the application in question that sets the user-space times sets the
same 'alarm' timer in all the guests - and when they all go to sleep,
they are supposed to wake up at the same time. And I think this is done
on all of the guests, so it is a stampede of everybody waking up at the
same time from 'hlt'.

There is no oversubscription.

The reason we schedule a tasklet instead of continuing with a
'vcpu_kick' is not yet known to me.
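To make the accounting in steps d) and e) concrete, here is a minimal
toy model in C. It is not Xen code: only the RUNSTATE_* names come from
Xen, while the struct layout, the toy_* helpers and all the numbers are
assumptions made up for illustration. It only shows that once
vcpu_wake has flipped the vCPU to runnable, the wait for the second
spin_lock_irq(&tasklet_lock) is charged to RUNSTATE_runnable, and
therefore to steal time.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not Xen code. State names mirror Xen's RUNSTATE_* values;
 * the struct and helpers below are hypothetical. */
enum runstate {
    RUNSTATE_running,
    RUNSTATE_runnable,
    RUNSTATE_blocked,
    RUNSTATE_offline,
    RUNSTATE_count
};

struct toy_vcpu {
    enum runstate state;
    uint64_t entry_ns;                 /* when the current state began */
    uint64_t time_ns[RUNSTATE_count];  /* accumulated time per state */
};

/* Charge the elapsed time to the state being left, then switch state. */
void toy_runstate_change(struct toy_vcpu *v, enum runstate next,
                         uint64_t now_ns)
{
    assert(now_ns >= v->entry_ns);
    v->time_ns[v->state] += now_ns - v->entry_ns;
    v->state = next;
    v->entry_ns = now_ns;
}

/* Steal time as defined in this thread: runnable + offline. */
uint64_t toy_steal_ns(const struct toy_vcpu *v)
{
    return v->time_ns[RUNSTATE_runnable] + v->time_ns[RUNSTATE_offline];
}

/*
 * Replay one wakeup. The vCPU is blocked until the tasklet work function
 * runs (at an arbitrary t = 1000 ns) and wakes it. The CPU then has to
 * re-take the global lock and run the SCHEDULE softirq before the vCPU
 * actually runs, so both intervals land in RUNSTATE_runnable.
 */
uint64_t toy_wakeup_steal_ns(uint64_t relock_wait_ns, uint64_t sched_ns)
{
    struct toy_vcpu v = { .state = RUNSTATE_blocked, .entry_ns = 0 };
    uint64_t now = 1000;

    toy_runstate_change(&v, RUNSTATE_runnable, now);  /* vcpu_wake */
    now += relock_wait_ns;   /* second spin_lock_irq(&tasklet_lock) */
    now += sched_ns;         /* SCHEDULE softirq and context switch */
    toy_runstate_change(&v, RUNSTATE_running, now);

    return toy_steal_ns(&v);
}
```

With an uncontended lock, toy_wakeup_steal_ns(0, 5000) yields 5000 ns
of steal time; with a made-up 2 ms wait for the lock,
toy_wakeup_steal_ns(2000000, 5000) yields 2005000 ns, even though the
guest did nothing but idle.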
This commit added the mechanism to do it via the tasklet:

commit a5db2986d47fafc5e62f992616f057bfa43015d9
Author: Keir Fraser <keir.fraser@xxxxxxxxxx>
Date:   Fri May 8 11:50:12 2009 +0100

    x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
    context or with IRQs disabled. Ensure this by deferring to tasklet
    (softirq) context if required.

    Signed-off-by: Keir Fraser <keir.fraser@xxxxxxxxxx>

But I am not sure why:

a). 'must not be called in IRQ context or with IRQs disabled' is
    important - I haven't dug into the code yet to understand the
    crucial reasons for it - is there a known issue about this?

b). Why do we have per-cpu tasklet lists, but any manipulation of the
    items on them is protected by a global lock?

Looking at the code in Linux and Xen the major difference is that Xen
can schedule on specific CPUs (or even the tasklet can schedule itself
on another CPU). Linux's variants of tasklets are much simpler - and
don't have any spinlocks (except the atomic state of the tasklet
running or scheduled to be run).

I can see the need for the tasklets being on different CPUs for the
microcode, and I am digging through the other ones to get a feel for
it - but has anybody thought about improving this code? Have there been
any suggestions/ideas tossed around in the past (the mailing list
didn't help or my Google-fu sucks)?

Thanks.
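For question b), a minimal sketch in C of what "one lock per per-cpu
list" could look like, purely to illustrate the data structure being
asked about. Nothing here is Xen code: the toy_* names, TOY_NR_CPUS and
the atomic_flag spinlock are assumptions, IRQ disabling is not
modelled, and the sketch deliberately ignores moving an already-queued
tasklet from one CPU's list to another, which may well be the hard
part.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NR_CPUS 4   /* assumed for the example */

struct toy_tasklet {
    struct toy_tasklet *next;
    void (*func)(void *);
    void *data;
};

/* One list and one lock per CPU, rather than one global lock. */
struct toy_percpu_queue {
    atomic_flag lock;
    struct toy_tasklet *head, *tail;
    unsigned long lock_acquisitions;   /* instrumentation only */
};

static struct toy_percpu_queue toy_queues[TOY_NR_CPUS];

void toy_queues_init(void)
{
    for (int i = 0; i < TOY_NR_CPUS; i++) {
        atomic_flag_clear(&toy_queues[i].lock);
        toy_queues[i].head = toy_queues[i].tail = NULL;
        toy_queues[i].lock_acquisitions = 0;
    }
}

static void toy_lock(struct toy_percpu_queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
    q->lock_acquisitions++;
}

static void toy_unlock(struct toy_percpu_queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Queue t on a specific CPU; only that CPU's lock is taken. */
void toy_tasklet_schedule_on_cpu(struct toy_tasklet *t, unsigned int cpu)
{
    assert(cpu < TOY_NR_CPUS);
    struct toy_percpu_queue *q = &toy_queues[cpu];

    toy_lock(q);
    t->next = NULL;
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
    toy_unlock(q);
}

/* Run everything queued on cpu. The lock is dropped around each work
 * function and re-taken afterwards, as in step d). Returns the number
 * of tasklets run. */
unsigned int toy_do_tasklet_work(unsigned int cpu)
{
    assert(cpu < TOY_NR_CPUS);
    struct toy_percpu_queue *q = &toy_queues[cpu];
    unsigned int ran = 0;

    toy_lock(q);
    while (q->head) {
        struct toy_tasklet *t = q->head;

        q->head = t->next;
        if (!q->head)
            q->tail = NULL;
        toy_unlock(q);
        t->func(t->data);
        ran++;
        toy_lock(q);
    }
    toy_unlock(q);
    return ran;
}

/* Example work function: bump the counter that data points to. */
void toy_count_cb(void *data)
{
    ++*(unsigned int *)data;
}
```

In this toy, queueing and running tasklets on CPU 1 never touches the
lock of CPU 0, so CPUs on different sockets waking up at the same
moment would not serialize on a single cache line. Whether that can be
made safe for Xen's cross-CPU scheduling cases is exactly the open
question above.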
