
Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
From: Andre Przywara <andre.przywara@xxxxxxx>
Date: Fri, 11 Feb 2011 08:39:10 +0100
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>
Delivery-date: Thu, 10 Feb 2011 23:43:32 -0800
List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Juergen Gross wrote:

On 02/10/11 15:18, Andre Przywara wrote:

Andre Przywara wrote:

On 02/10/2011 07:42 AM, Juergen Gross wrote:

On 02/09/11 15:21, Juergen Gross wrote:

Andre, George,


What seems to be interesting: I think the problem did always occur when
a new cpupool was created and the first cpu was moved to it.

I think my previous assumption regarding the master_ticker was not
too bad.
I think somehow the master_ticker of the new cpupool is becoming active
before the scheduler is really initialized properly. This could happen
if enough time is spent between alloc_pdata for the cpu to be moved and
the critical section in schedule_cpu_switch().

The solution should be to activate the timers only if the scheduler is
ready for them.

George, do you think the master_ticker should be stopped in
suspend_ticker as well? I still see potential problems for entering
deep C-States. I think I'll prepare a patch which will keep the
master_ticker active for the C-State case and migrate it for the
schedule_cpu_switch() case.

Okay, here is a patch for this. It ran on my 4-core machine without any
problems.
Andre, could you give it a try?

Did, but unfortunately it crashed as always. Tried twice and made sure
I booted the right kernel. Sorry.
The idea of a race between the timer and the state change sounded very
appealing; actually, that was suspicious to me from the beginning.

I will add some code to dump the state of all cpupools to the BUG_ON
to see in which situation we are when the bug triggers.

OK, here is a first try at this. The patch iterates over all CPU pools
and outputs some data if the
BUG_ON((sdom->weight * sdom->active_vcpu_count) > weight_left)
condition triggers:
(XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
(XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
(XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
(XEN) Xen BUG at sched_credit.c:1010
....
The masks look proper (6 cores per node), the bug triggers when the
first CPU is about to be(?) inserted.
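
The BUG_ON quoted above compares one domain's weighted share with the
weight that is still unallocated. A rough model of that check could look
like the sketch below. It assumes that weight_left starts from a
pool-wide total and is decremented per domain, which is an assumption
about the surrounding accounting loop and is not shown in this message.
The struct and function names are hypothetical, not Xen's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-domain scheduler data.
 * Only the two fields named in the BUG_ON condition are modelled. */
struct sdom_model {
    unsigned int weight;
    unsigned int active_vcpu_count;
};

/*
 * Walk the active domains and subtract each domain's share from
 * weight_left, which is assumed to start at weight_total.
 * Returns false at the point where the quoted BUG_ON condition,
 * (weight * active_vcpu_count) > weight_left, would be true.
 */
bool weights_consistent(const struct sdom_model *doms, size_t n,
                        unsigned int weight_total)
{
    unsigned int weight_left = weight_total;

    assert(doms != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        unsigned int share = doms[i].weight * doms[i].active_vcpu_count;

        if (share > weight_left)
            return false;   /* this is where the BUG_ON would fire */

        weight_left -= share;
    }
    return true;
}
```

In this model the check fails as soon as a domain is on the active list
but its share was never added to the total, which would fit a pool whose
bookkeeping is not fully set up when accounting first runs.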


Sure? I'm missing the cpu with mask 2000.
I'll try to reproduce the problem on a larger machine here (24 cores, 4 numa
nodes).
Andre, can you give me your xen boot parameters? Which xen changeset are you
running, and do you have any additional patches in use?
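
To check the pool masks from the dump above by hand, a small helper like
the following can OR them together and report which CPUs are in no pool.
This is only a sketch: the function names are made up for illustration,
and the total of 48 CPUs used in the note below is an assumption read
off the width of pool #0's mask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Return a bitmask of the CPUs that belong to none of the given pools.
 * nr_cpus is the number of CPUs in the machine (at most 64 here). */
uint64_t unassigned_cpus(const uint64_t *pool_masks, int nr_pools,
                         int nr_cpus)
{
    uint64_t all, used = 0;

    assert(nr_cpus > 0 && nr_cpus <= 64);

    /* Build a mask with the low nr_cpus bits set. */
    all = (nr_cpus == 64) ? ~UINT64_C(0)
                          : ((UINT64_C(1) << nr_cpus) - 1);

    for (int i = 0; i < nr_pools; i++)
        used |= pool_masks[i];

    return all & ~used;
}

/* Print the CPU numbers whose bits are set in mask. */
void print_cpus(const char *label, uint64_t mask)
{
    printf("%s:", label);
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            printf(" %d", cpu);
    printf("\n");
}
```

Called with the three masks fffffffc003f, fc0 and 1000 and nr_cpus = 48,
unassigned_cpus() returns 0x3e000, that is CPUs 13 to 17. This includes
the 0x2000 bit asked about above.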


The grub lines:
kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200

module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0 console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0


All of my experiments use c/s 22858 as a base.

If you use an AMD Magny-Cours box for your experiments (socket C32 or
G34), you should add the following patch (removing the line):

--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
         __clear_bit(X86_FEATURE_SKINIT % 32, &c);
         __clear_bit(X86_FEATURE_WDT % 32, &c);
         __clear_bit(X86_FEATURE_LWP % 32, &c);
-        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
         __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
         break;
     case 5: /* MONITOR/MWAIT */

This is not necessary (in fact it reverts my patch c/s 22815), but it
raises the probability of triggering the bug, probably because it
increases the pressure on the Dom0 scheduler. If you cannot trigger it
with Dom0, try to create a guest with many VCPUs and squeeze it into a
small CPU pool.


Good luck ;-)
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

