
Re: [Xen-devel] RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.



On Tue, Apr 29, 2014 at 08:42:06AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 29, 2014 at 10:16:39AM +0100, George Dunlap wrote:
> > On 04/23/2014 10:28 PM, Konrad Rzeszutek Wilk wrote:
> > >What we are observing is that if a domain is idle its steal
> > >time* goes up. My first thought was - well that is the initial
> > >domain taking the time - but after looking at the trace
> > >I did not see the initial domain being scheduled at all.
> > >
> > >(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
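
[edit: for readers following along, that "steal time" is just the sum of
two entries in the per-vCPU runstate time[] array.  A minimal guest-side
sketch (the helper name is mine; the struct layout follows
xen/include/public/vcpu.h):

    #include <stdint.h>

    #define RUNSTATE_running  0
    #define RUNSTATE_runnable 1
    #define RUNSTATE_blocked  2
    #define RUNSTATE_offline  3

    struct vcpu_runstate_info {
        int      state;             /* current RUNSTATE_*               */
        uint64_t state_entry_time;  /* when that state was entered (ns) */
        uint64_t time[4];           /* cumulative ns spent in each state */
    };

    /* Hypothetical helper: "steal" = time spent neither running nor
     * voluntarily blocked. */
    static uint64_t steal_time_ns(const struct vcpu_runstate_info *ri)
    {
        return ri->time[RUNSTATE_runnable] + ri->time[RUNSTATE_offline];
    }
]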
> > 
> > "Up" like how much?
> 
> 6.7msec. Or ~1/4 of the timeslice
> > 
> > Steal time includes the time *being woken* up.  It takes time to be
> > woken up; typically if it's being woken up from domain 0, for
> > instance, the wake (which sets it to RUNSTATE_runnable) will happen
> > on a different pcpu than the vcpu being woken is on, so there's the
> > delay of the IPI, waking up, going through the scheduler, &c.
> 
> Right. In this case there are no IPIs. Just softirq handlers being

[edit: There is the IPI associated with raise_softirq_action being
broadcast to CPUs]

> triggered (by some other VCPU it seems) and they run. And the
> time between the 'vcpu_wake' and the 'schedule' softirq is quite
> long.
> > 
> > The more frequently a VM is already running, the lower probability
> > that an interrupt will actually wake it up.
> 
> Right. But there are no interrupts here at all. It is just idling.

[edit: Just the IPI when it is halted and the idle guest has been
scheduled in]
> > 
> > BTW, is there a reason you're using xentrace_format instead of xenalyze?
> 
> I did use xenalyze and it told me that the vCPU spends most of
> its time in the 'runnable' state. The other vCPUs are doing other
> work.

I finally narrowed it down: we are contending on the 'tasklet_lock' spinlock.

The steps that we take to get to this state are as follows (imagine four
30-vCPU guests, each pinned to its own socket - one socket per guest).

 a). Guest does 'HLT'; we schedule in the idle domain.
 b). The guest's timer fires, an IPI comes in, and we come out of
     'hlt;pause'.
 c). softirq_pending has 'TIMER' (0) set. We end up doing this:
        
        idle_loop
          -> do_softirq
             -> timer_softirq_action
                -> vcpu_singleshot_timer_fn
                   -> vcpu_periodic_timer_work
                      -> send_timer_event
                         -> send_guest_vcpu_virq
                            -> evtchn_set_pending
                               -> vcpu_mark_events_pending
                                  [here it can call hvm_assert_evtchn_irq,
                                   which schedules a tasklet:]
                                  tasklet_schedule(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);

 d). We get back to do_softirq and softirq_pending has 'TASKLET' set. We
     call (a code-level sketch of this handler follows after step f):
        idle_loop
          -> do_softirq
             -> tasklet_softirq_action
                -> spin_lock_irq(&tasklet_lock)
                   [takes a while]
                -> do_tasklet_work
                   -> unlock
                   -> call the work function:
                      -> hvm_assert_evtchn_irq
                         -> vcpu_kick
                            -> vcpu_unblock
                               -> vcpu_wake
                                  [vcpu_runstate_change: blocked->runnable;
                                   from here on any activity is accounted
                                   to the guest]
                                  [_runq_tickle sets the SCHEDULE_SOFTIRQ]
                   -> spin_lock_irq(&tasklet_lock)
                      [takes also a bit of time]
                   -> unlock

  e). We get back to do_softirq and softirq_pending has 'SCHEDULE' set. We
      swap out the idle domain and stick in the guest. The time spent in
      RUNNABLE includes the time it took to acquire the 'tasklet_lock'.
  f). Call INJ_VIRQ with vector 0xf3 to wake the guest up.
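
To make step d) concrete, the TASKLET softirq handler looks roughly like
this (a simplified paraphrase of xen/common/tasklet.c from memory, not the
verbatim code). The point is that the per-CPU list is only ever touched
under the single global 'tasklet_lock', and the work function - which
flips the vCPU from blocked to runnable - runs in the middle of it:

    /* Simplified sketch of the TASKLET softirq handler path. */
    static void tasklet_softirq_action(void)
    {
        unsigned int cpu = smp_processor_id();
        struct list_head *list = &per_cpu(tasklet_list, cpu);
        struct tasklet *t;

        spin_lock_irq(&tasklet_lock);     /* global lock: every CPU doing
                                           * tasklet work contends here   */
        if ( !list_empty(list) )
        {
            t = list_entry(list->next, struct tasklet, list);
            list_del_init(&t->list);      /* per-CPU list, global lock    */

            spin_unlock_irq(&tasklet_lock);
            t->func(t->data);             /* hvm_assert_evtchn_irq ->
                                           * vcpu_wake: blocked->runnable,
                                           * so everything from here on is
                                           * charged to the guest         */
            spin_lock_irq(&tasklet_lock); /* contended again on the way
                                           * out (the second long wait in
                                           * the trace)                   */
        }
        spin_unlock_irq(&tasklet_lock);
    }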

N.B.
The softirq handlers that are run end up being: TIMER, TASKLET, and SCHEDULE.
As in, low latency (TIMER), high latency (TASKLET), and low latency (SCHEDULE).


The 'tasklet_lock' on this particular setup ends up being hit from three
different NUMA nodes, and of course at the same time. My belief is that
the application in question sets the same user-space 'alarm' timer in all
of the guests - and when they all go to sleep, they are supposed to wake
up at the same time. Since this is done in every guest, it is a stampede
of everybody waking up from 'hlt' at the same time. There is no
oversubscription.
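
For illustration, the guest-side pattern I suspect is no more complicated
than something like this (hypothetical - I have not seen the application's
source):

    /* Hypothetical guest user-space loop: every guest arms the same
     * periodic alarm, blocks, and they all wake up at the same time. */
    #include <signal.h>
    #include <unistd.h>

    static void on_alarm(int sig) { (void)sig; }  /* just interrupt pause() */

    int main(void)
    {
        signal(SIGALRM, on_alarm);
        for ( ;; )
        {
            alarm(1);    /* same period armed in every guest            */
            pause();     /* vCPU blocks; Xen schedules the idle domain  */
            /* ... a tiny bit of work, then back to sleep ...           */
        }
    }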

The reason we schedule a tasklet instead of continuing with a
'vcpu_kick' is not yet known to me. This commit added the mechanism
to do it via the tasklet:

commit a5db2986d47fafc5e62f992616f057bfa43015d9
Author: Keir Fraser <keir.fraser@xxxxxxxxxx>
Date:   Fri May 8 11:50:12 2009 +0100

    x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
    context or with IRQs disabled. Ensure this by deferring to tasklet
    (softirq) context if required.
    
    Signed-off-by: Keir Fraser <keir.fraser@xxxxxxxxxx>
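
My reading of what that commit does, paraphrased (a simplified sketch of
the pattern it describes, not the verbatim code in xen/arch/x86/hvm/irq.c):

    void hvm_assert_evtchn_irq(struct vcpu *v)
    {
        if ( unlikely(in_irq() || !local_irq_is_enabled()) )
        {
            /* Not safe to poke the callback IRQ from here, so defer to
             * TASKLET softirq context - which is where we end up
             * contending on tasklet_lock in the trace above. */
            tasklet_schedule(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);
            return;
        }

        /* Otherwise raise the event-channel callback directly, which
         * ends in vcpu_kick()/vcpu_wake() as in step d) above. */
        hvm_set_callback_irq_level(v);
    }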

But I am not sure why:

 a). Why 'must not be called in IRQ context or with IRQs disabled' is
     important - I haven't dug into the code yet to understand the
     crucial reasons for it. Is there a known issue about this?

 b). Why do we have per-CPU tasklet lists when any manipulation of the
     items on them is protected by a single global lock? Looking at the
     code in Linux and Xen, the major difference is that Xen can schedule
     a tasklet on a specific CPU (or the tasklet can even reschedule
     itself on another CPU).

     Linux's variants of tasklets are much simpler - and don't take
     any spinlocks (just the atomic state of the tasklet running or
     being scheduled to run).

     I can see the need for tasklets running on specific CPUs for the
     microcode, and I am digging through the other users to get a feel
     for it - but has anybody thought about improving this code? Have
     any suggestions/ideas been tossed around in the past? (The mailing
     list archives didn't help, or my Google-fu is lacking.) A rough
     sketch of the direction I have in mind is below.
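
     To be concrete about what I mean in b), something along these lines
     (just an idea sketch, not a patch; it glosses over the cross-CPU
     scheduling case, which is presumably exactly the hard part):

         /* Idea: each CPU gets its own lock for its own tasklet list;
          * only cross-CPU operations would need anything heavier. */
         struct tasklet_queue {
             spinlock_t       lock;   /* protects only this CPU's list */
             struct list_head list;
         };
         static DEFINE_PER_CPU(struct tasklet_queue, tasklet_queue);

         static void tasklet_softirq_action(void)
         {
             struct tasklet_queue *q = &this_cpu(tasklet_queue);
             struct tasklet *t = NULL;

             spin_lock_irq(&q->lock);      /* no cross-socket contention */
             if ( !list_empty(&q->list) )
             {
                 t = list_entry(q->list.next, struct tasklet, list);
                 list_del_init(&t->list);
             }
             spin_unlock_irq(&q->lock);

             if ( t )
                 t->func(t->data);         /* run with no global lock held */
         }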

Thanks.



 

