[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] remove blocked time accounting from xen "clockchip"

On Thu, Nov 10, 2011 at 10:05:14AM -0800, Jeremy Fitzhardinge wrote:
> On 11/09/2011 05:35 AM, Jan Beulich wrote:
> >>>> On 18.10.11 at 22:42, Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
> >> ... because the "clock_event_device framework" already accounts for idle
> >> time through the "event_handler" function pointer in
> >> xen_timer_interrupt().
> >>
> >> The patch is intended as the completion of [1]. It should fix the double
> >> idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to
> >> stolen time accounting (the removed code seems to be isolated).
> > After some more looking around I still think it's incorrect, albeit for
> > a different reason: What tick_nohz_restart_sched_tick() accounts
> > as idle time is *all* time that passed while in cpu_idle(). What gets
> > accounted in do_stolen_accounting() (without your patch) is
> > different:
> > - time the vCPU was in RUNSTATE_blocked gets accounted as idle
> > - time the vCPU was in RUNSTATE_runnable and RUNSTATE_offline
> >   gets accounted as stolen.
> >
> > That is, on an overcommitted system (and without your patch) I
> > would expect you to not see the (full) double idle increment for a not
> > fully idle and not fully loaded vCPU.
> >
> > Now we can of course say that for most purposes it's fine to
> > ignore the difference between idle and steal time, but there must
> > be a reason these have been getting accounted separately in Linux
> > without regard to Xen.
> >
> > So I really think that what needs to be suppressed to address this
> > is tick_nohz_restart_sched_tick()'s call to account_idle_ticks(). But
> > simply forcing CONFIG_VIRT_CPU_ACCOUNTING is not an immediate
> > option (not even when considering a Xen-only kernel), as that has
> > further implications. Instead I wonder whether under Xen we should
> > e.g. update per_cpu(tick_cpu_sched, cpu).idle_jiffies prior to calling
> > tick_nohz_restart_sched_tick() (possibly implicitly by doing the
> > update in do_stolen_accounting(), though that wouldn't work when
> > the wakeup is due to an interrupt other than the timer one).
> >
> > The alternative of accounting the negative of the steal time as
> > idle time in do_stolen_accounting() doesn't look correct either,
> > since not all stolen time was necessarily accounted as idle (some
> > would have got - also incorrectly - accounted as process or system
> > time).
> >
> > So my conclusion is that only the full implementation of what
> > CONFIG_VIRT_CPU_ACCOUNTING expects would actually get
> > things sorted out properly (but that would imply Xen *and* native
> > are willing to pay the performance price, or the enabling of this
> > can be made dynamic so that at least native could remain
> > unaffected).
> >
> > Or the negative stolen time should be accounted to the current
> > process, the system, or as idle, depending on the context which
> > do_stolen_accounting() runs in (just like timer_interrupt() did in
> > the XenoLinux kernels for positive accounting).
> I was always suspicious of the timer-interrupt-based stolen time
> accounting code.  It's really a hold-over from when ticks were the main
> timekeeping mechanism, but with highres timers its massive overkill and
> probably a source of performance degradation.
> Overall, the stolen time accounting isn't really very important.  The
> kernel doesn't use it at all internally - it doesn't affect scheduling
> decisions, for example.  It's only exported in /proc/stat, and some
> tools like top will display it if its non-zero.  It could be useful for
> diagnosing performance problems on a heavily loaded host system, but
> other tools like "xentop" would give you as much information.
> So really, all this code is nice to have, but I'm not sure its worth a
> lot of time to fix, especially if it makes idle accounting hard (which
> *is* important).
> However, that said, PeterZ recently added some code to the scheduler so
> that time "stolen" by interrupt handling isn't accounted against the
> task running at the time, and the design is specifically intended to
> also be useful for virtualization stolen time as well.  There are some
> KVM patches floating around to implement this, and we should look at

I finally got some time to look at them and I think they are these ones:

git log --oneline 
3c404b5 KVM guest: Add a pv_ops stub for steal time
c9aaa89 KVM: Steal time implementation
9ddabbe KVM: KVM Steal time guest/host interface
4b6b35f KVM: Add constant to represent KVM MSRs enabled bit in guest/host 

What is interesting is that they end up inserting a bunch of:

+       if (steal_account_process_tick())
+               return;

in irqtime_account_process_tick and in account_process_tick.

The steal_account_process_tick just makes a pv_time_ops.steal_clock
based on the asm goto of paravirt_steal_enabled key.

I think if we were to persue this we could rip out in the
'do_stolen_accounting' the call to 'accont_idle_tick' and
'account_idle_time' completly. In essence making that function
behave as a pv_time_ops.steal_clock implementation.

Also that would mean we could remove in the 'xen_timer_interrupt'
function the call to 'do_stolen_accounting'

I think this would mean we would account _only_ for the
RUNSTATE_runnable and RUNSTATE_offline as Laszlo's earlier patch

And the RUNSTATE_blocked would be figured out by the existing
mechanism in the sched.c code?

> doing a Xen implementation.  That would be much more practically useful,
> since (I think) it allows the scheduler to be aware of stolen time and
> not penalize tasks for time they never got to use.  Or at the very least
> it could give you per-task stolen stats (maybe?) which would be useful.
>     J
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.