[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 5/5] xen: RCU: avoid busy waiting until the end of grace period.

On Mon, 2017-08-07 at 02:54 -0600, Jan Beulich wrote:
> > > > Dario Faggioli <dario.faggioli@xxxxxxxxxx> 07/27/17 10:01 AM
> > > > >>>
> > 
> > Instead of having the CPU where a callback is queued, busy
> > looping on rcu_pending(), use a timer.
> Isn't this rcu_needs_cpu(), according to patch 4?
I was referring to the rcu_pending() in do_softirq(). In fact, if this
(roughly) is idle_loop:

    for ( ; ; )
        if ( unlikely(tasklet_work_to_do(cpu)) )


we don't have tasklet work to do, so we call pm_idle(). However, if we
have a callback queued, rcu_needs_cpu() returns true (without calling
rcu_pending()), which means cpu_is_haltable() returns false, and hence
we exit pm_idle() without actually going idle, and we call
do_softirq(), which does:

    if ( rcu_pending(cpu) )

in a rather tight loop.

IAC, I certainly can rephrase the sentence above, and make all this
more clear.

> > --- a/xen/arch/x86/acpi/cpu_idle.c
> > +++ b/xen/arch/x86/acpi/cpu_idle.c
> > @@ -576,10 +576,10 @@ static void acpi_processor_idle(void)
> > return;
> > }
>  >
> > +    rcu_idle_timer_start();
> > cpufreq_dbs_timer_suspend();
> Is the ordering wrt cpufreq handling here arbitrary? 

> If so, wouldn't it be
> better to suspend cpufreq handling before starting the idle timer
> (and
> also invert the ordering when going back to normal operation below)?
Well, as said above, it's arbitrary. Therefore, yes, I'm ok with doing
things like you suggest.

> > -
> > sched_tick_suspend();
> At which point I'd then wonder whether this couldn't be integrated
> into
> sched_tick_{suspend,resume}(), avoiding arch-specific code to be
> altered.
Sure, I'm all for it. It's easier, cleaner, and makes a lot of sense!

Only problem may be if some new arch decide not/forget to call
sched_tick_suspend() and resume().

Perhaps I can add some checking in place, to make sure that this can't
happen (in debug builds only, of course).

> > @@ -756,6 +756,7 @@ static void mwait_idle(void)
> >        local_irq_enable();
> >        sched_tick_resume();
> >        cpufreq_dbs_timer_resume();
> > +                rcu_idle_timer_stop();
> Indentation looks odd here, but I can't exclude this being an effect
> produced
> by my mail web frontend.
Yep, it was indeed wrong. Fixed. Sorry. Thanks. :-)

> > --- a/xen/common/rcupdate.c
> > +++ b/xen/common/rcupdate.c
> > @@ -84,8 +84,14 @@ struct rcu_data {
> > 
> If I'm not mistaken someone else had already commented on this: If
> this
> is (mostly) arbitrary, please say so in a comment.
Indeed there is being a long 'debate'. :-)

For v2, I'm changing this to something more dynamic. And I'll sure add
a comment explaining the adopted solution.

> > @@ -402,7 +408,48 @@ int rcu_needs_cpu(int cpu)
> > {
> > struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
>  >
> > -    return (!!rdp->curlist || rcu_pending(cpu));
> > +    return (!!rdp->curlist || rcu_pending(cpu)) && !rdp-
> > >idle_timer_active;
> Please take the opportunity and drop the pointless !! here (unless
> it's needed
> for better matching up with the Linux original).
It's in the Linux code, indeed:

However, this very line is already different (because Linux has
rdp_bh->curlist, which we don't, and we have rdp->idle_timer_active).
So I guess I can drop '!!'.

Any patch touching this line coming from Linux will need manual
adjustment anyway.

> > +/*
> > + * Timer for making sure the CPU where a callback is queued does
> > + * periodically poke rcu_pedning(), so that it will invoke the
> > callback
> > + * not too late after the end of the grace period.
> > + */
> > +void rcu_idle_timer_start()
> > +{
> > +    struct rcu_data *rdp = &this_cpu(rcu_data);
> > +
> > +    if (likely(!rdp->curlist))
> > +        return;
> I would have expected this to be the inverse of the original
> condition in
> rcu_needs_cpu() - why is there no rcu_pending() invocation here?
The original rcu_needs_cpu() condition is:

 rcu->curlist || rcu_pending(cpu)

So, you're saying we need something like this:

 if (!rdp->curlist && !rcu_pending(cpu))

Well, if I do this, what happens is that the if evaluate to false (and
hence we don't exit the function, and we arm the timer), on CPUs that:
 - does not have a calback queued (curlist == NULL)
 - are in the process of quiescing, by going idle.

In fact, in that case, here's what happens (I compacted the trace, and
slightly edited some of the lines, for better readability, and for
keeping the highlighting the characteristics of the situation under

* complete_domain_destroy callback is queued on CPU 0. The CPU quiesces
  (goes through softirq), but stay budy, so no timer is armed (that's 
  not super relevant, but is what happens in this case):

x-|- d0v12 rcu_call fn=complete_domain_destroy
x-|- d0v12 rcu_pending? yes (ret=2): no pending entries but new entries
x-|- d0v12 raise_softirq nr 3
x-|- d0v12 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: yes
x-|- d0v12 do_softirq
x-|- d0v12 rcu_pending? yes (ret=5): waiting for CPU to quiesce 
x-|- d0v12 raise_softirq nr 3
x-|- d0v12 rcu_process_callbacks, rdp_curlist: yes, rdp_nxtlist: null
x-|- d0v12 rcu_check_quiesc_state, rdp_qs_pending: yes
x-|- d0v12 rcu_cpu_quiet, rcp_cpumask=0x00004100
x-|- d0v12 rcu_pending? no

* the vCPU running on CPU 2, which participates in the grace period, 
  blocks, and CPU 2 goes idle. That means that CPU 2 quiesces (goes 
  through softirq), and we can forget about it. Surely we don't need 
  the timer to fire on it, as the callback is not queued there...

|-x- d0v7 vcpu_block d0v7
|-x- d0v7 raise_softirq nr 1
|-x- d0v7 do_softirq
|-x- d0v7 rcu_pending? yes (ret=4): waiting for CPU to quiesce
|-x- d0v7 raise_softirq nr 3
|-x- d0v7 softirq_handler nr 1
|-x- d0v7 runstate_change d0v7 running->blocked
|-x- d?v? runstate_change d32767v14 runnable->running

  Now, this is cpufreq_dbs_timer_suspend():

|-x- d32767v14 timer_stop t=do_dbs_timer

  And this is sched_tick_suspend(), which now calls 
  rcu_idle_timer_start(), gated by the condition suggested above.

|-x- d32767v14 rcu_pending? yes (ret=4): waiting for CPU to quiesce 
|-x- d32767v14 timer_set t=rcu_idle_timer_handler, expires_in=9787us

  So, the CPU is actually quiescing, and since it has no callback
  queues, as we said, we wouldn't have needed the timer. But the
  condition is false, because, at this stage, rcu_pending() was still
  true. So, we don't bail early from rcu_idle_timer_start(), and we 
  actually have armed the timer.

|-x- d32767v14 rcu_pending? yes (ret=4): waiting for CPU to quiesce 
|-x- d32767v14 raise_softirq nr 3
|-x- d32767v14 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: null
|-x- d32767v14 rcu_check_quiesc_state, rdp_qs_pending: no
|-x- d32767v14 rcu_process_callbacks, rdp_curlist: null, rdp_nxtlist: null
|-x- d32767v14 rcu_check_quiesc_state, rdp_qs_pending: yes
|-x- d32767v14 rcu_cpu_quiet, rcp_cpumask=0x00000100
|-x- d32767v14 pm_idle_start c2

* The timer fires, for no useful reason, and cause a spurious wakeup:

|-x- d32767v14 raise_softirq nr 0
|-x- d32767v14 pm_idle_end c2
|-x- d32767v14 irq_enter
|-x- d32767v14 irq_direct, vec=0xfa, handler=apic_timer_interrupt
|-x- d32767v14 raise_softirq nr 0
|-x- d32767v14 irq_exit, in_irq = 0
|-x- d32767v14 softirq_handler nr 0
|-x- d32767v14 timer_exec t=rcu_idle_timer_handler
|-x- d32767v14 timer_reprogr deadline=954.566us
|-x- d32767v14 rcu_pending? no
|-x- d32767v14 pm_idle_start c3

So, this is why I don't want rcu_pending() in that if. If we don't like
this, I can see about moving around a bit the timer starting and
stopping helpers (I've a couple of ideas in mind already, but I need to

Actually, it's entirely possible that it is having rcu_pending(cpu) in
rcu_needs_cpu() is, for us, redundant. In fact, although it does make
sense in Linux, both code inspection and some investigation I've just
done, makes me believe that there won't be cases where a CPU is denied
going offline because it sees rcu_pending() returning 1.

In fact, when we call rcu_pending(), within cpu_is_haltable(), we have
already gone through it before. And if there were pending work, we've
raised the softirq and dealt with it. If there weren't, neither there
is now.

I'm therefore leaning toward removing rcu_pending() from the
rcu_needs_cpu() check as well. At that point, we'll indeed have the
check inside rcu_start_idle_timer(), be the opposite of the original
check in rcu_needs_cpu(). :-)

> > @@ -451,6 +500,7 @@ static void rcu_init_percpu_data(int cpu,
> > struct rcu_ctrlblk *rcp,
> > rdp->qs_pending = 0;
> > rdp->cpu = cpu;
> > rdp->blimit = blimit;
> > +    init_timer(&rdp->idle_timer, rcu_idle_timer_handler, (void*)
> > rdp, cpu);
> Again, unless it is this bogus way in the Linux original, please drop
> the
> pointless cast, or at least correct its style.
Ah, no, this one, I can kill it.

Thanks and Regards,
<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Attachment: signature.asc
Description: This is a digitally signed message part

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.