
[Xen-devel] Xen PVM: Strange lockups when running PostgreSQL load



I am currently looking at a bug report[1] which is happening when
a Xen PVM guest with multiple VCPUs is running a high IO database
load (a test script is available in the bug report).

From my experiments it seems that this happens (or becomes more
likely) when the number of VCPUs is 8 or higher (though I have
not tried 6, only 2 and 4). Having autogroup enabled also seems
to make it more likely (at some point we thought it would
actually prevent it, but we were wrong), as does having pv
spinlocks enabled.
It has happened with older (3.4.3) and newer (4.1.2) versions of
Xen as the host, and with 3.2 and 3.5 kernels as guests.

I will try to get the pv spinlock assumption re-verified by asking
the reporter to reproduce under a real load with a kernel that
simply disables them. However, the dumps I am looking at really do
look odd in that area.

The first dump I looked at had the spinlock of runqueue[0] placed
into the per-cpu lock_spinners variable for cpu#0 and cpu#7 (I am
doing my tests with 8 VCPUs). So apparently both cpus were waiting
in the slow path for it to become free. Though actually it was
free! Now, here is one issue I have in understanding the dump: the
back traces produced by crash in the normal form do not show any
cpu inside the poll_irq HV call. Only when using the form that
starts from the last known stack location and displays all text
symbols found does cpu#7 get anywhere close; cpu#0 still does not
seem to be near it at all. This could be a problem with crash, or
with the way PVM works, I am not sure.
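
(For anyone not staring at that code right now, this is roughly what
I mean by the slow path. It is a simplified sketch from memory of
arch/x86/xen/spinlock.c, not the exact source; lock_spinners,
lock_kicker_irq, xen_clear_irq_pending and xen_poll_irq are the real
names, the rest is paraphrased.)

static void xen_spin_lock_slow_sketch(struct xen_spinlock *xl)
{
        int irq = __this_cpu_read(lock_kicker_irq);

        /* announce which lock we are waiting for */
        __this_cpu_write(lock_spinners, xl);
        xl->spinners++;                         /* the "spinners" count */
        smp_wmb();

        do {
                /* clear a possibly pending kick, then re-check the
                 * lock so a kick sent before we poll is not lost */
                xen_clear_irq_pending(irq);
                if (xchg(&xl->lock, 1) == 0)
                        break;                  /* lock became free     */
                xen_poll_irq(irq);              /* block in the HV call */
        } while (1);

        /* done: stop advertising the lock */
        xl->spinners--;
        __this_cpu_write(lock_spinners, NULL);
}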

Anyway, from what I could take from that situation, it looked like
cpu#1 (the one that had soft lockup warnings in the log) was trying
to do a wake_up on a task that appeared to have just done an
io_schedule on cpu#7 (but the waitqueue_head spinlock was locked at
the time of the dump). cpu#7 was trying to get the runqueue[0]
spinlock to do an idle migration (though that lock was actually
free). So I guessed that maybe cpu#0 had just been woken for the
free lock but had not grabbed it yet.

From there I wondered whether something that usually holds a
spinlock for longer than the quick path timeout (which causes new
lockers to go into the slow path) could cause a stall on a
high-numbered cpu when the lock is contended (as the search and
signalling on unlock only kicks the first waiter found, starting
from cpu 0). That led to the patch below. The good thing about it:
it does not break things immediately. The bad thing: it does not
help with the problem. :/
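
(To illustrate what the patch changes, using my own trace with 8
online vcpus and the unlock happening on cpu#3: the kick search now
probes cpus 4, 5, 6, 7, 0, 1, 2 in that order and kicks the first
one whose lock_spinners entry matches, instead of always scanning
upwards from cpu 0.)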

The interesting thing in a second dump, taken with a test kernel
using the patch below, was that again two cpus seemed to be
spinning in the slow path on a lock that was free in the dump. And
again at least one of them did not seem to be doing anything
spinlock related (anymore?).
One other detail (though that could just be coincidence) was that
with the unmodified kernel I would usually end up with soft lockup
messages, while with the patched kernel it was a stall detected by
rcu_bh (cpus 0 and 1 stalled, detected by 3).

Now I am a bit confused and wonder about some basic things:
1. When a cpu goes into the slow lock path and into poll_irq,
   shouldn't I expect that cpu to show up in hypercall_page in
   the dump?
2. When does the whole handling for an interrupted spinlock wait
   apply? I assume only when a spinlock is taken with irqs
   enabled and we then try to acquire another one in an
   interrupt handler. Is that correct?
3. Not really related, but I just stumbled over it: in
   xen_spin_trylock there is asm("xchgb %b0,%1"...) (the
   surrounding code is sketched below). What does the "b" part
   of %b0 do? I thought xchgb already says it is a byte value...
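
(For context, this is the trylock the last question refers to, as I
remember it from arch/x86/xen/spinlock.c; reconstructed from memory,
so treat it as a sketch rather than the exact source.)

static int xen_spin_trylock(struct arch_spinlock *lock)
{
        struct xen_spinlock *xl = (struct xen_spinlock *)lock;
        u8 old = 1;

        /* Atomically swap 1 into the Xen lock byte; if the old value
         * was 0 the lock was free and now belongs to us. */
        asm("xchgb %b0,%1"
            : "+q" (old), "+m" (xl->lock) : : "memory");

        return old == 0;
}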

But putting those questions aside, the fact that twice there were
two cpus waiting on the same lock which, judging from the lock
count (=0), was free seems a bit too much of a coincidence.
Oh, and the spinners count in the lock was 2, as one would
expect.

struct rq {
  lock = {
    raw_lock = {
      {
        head_tail = 512, 
        tickets = {
          head = 0 '\000', 
          tail = 2 '\002'
        }
      }
    }
  }, 
  nr_running = 1,
  ...
}
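
(Spelling out how I read that: head_tail = 512 = 0x0200, stored
little-endian, so byte 0, shown as head, is 0x00 and byte 1, shown
as tail, is 0x02. The Xen byte-lock only uses the first byte of the
lock word as its lock value, so the 0 there is the "lock count" I
mean above, i.e. free, while the 2 lines up with the two waiters
advertised in lock_spinners.)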

I really don't know how this happens. Especially since cpu#0 seems
to be at least past the wakeup and should have removed itself from
the spinners list...
 
-Stefan

[1] http://bugs.launchpad.net/bugs/1011792

From 635a4e101ccbc9a324c8000f7e264ed4e646592c Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>
Date: Tue, 16 Oct 2012 17:46:09 +0200
Subject: [PATCH] xen/spinlocks: Make wakeup fairer

Instead of sending the IPI signalling the free lock to the first
online CPU found waiting for it, start the search from the CPU
that had the lock last.

Signed-off-by: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>
---
 arch/x86/xen/spinlock.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..8b86efb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -320,17 +320,23 @@ static void xen_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
 static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
 {
        int cpu;
+       int this_cpu = smp_processor_id();
 
        ADD_STATS(released_slow, 1);
 
-       for_each_online_cpu(cpu) {
-               /* XXX should mix up next cpu selection */
-               if (per_cpu(lock_spinners, cpu) == xl) {
-                       ADD_STATS(released_slow_kicked, 1);
-                       xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-                       break;
-               }
-       }
+       cpu = cpumask_next(this_cpu, cpu_online_mask);
+       do {
+               if (cpu < nr_cpu_ids) {
+                       if (per_cpu(lock_spinners, cpu) == xl) {
+                               ADD_STATS(released_slow_kicked, 1);
+                               xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+                               break;
+                       }
+               } else
+                       cpu = -1;
+
+               cpu = cpumask_next(cpu, cpu_online_mask);
+       } while (cpu != this_cpu);
 }
 
 static void xen_spin_unlock(struct arch_spinlock *lock)
-- 
1.7.9.5




 

