Xen project Mailing List

[Xen-devel] Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4

To: Matthias <matthias.kannenberg@xxxxxxxxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Fri, 27 Sep 2013 13:53:27 -0400

Cc: Ian Campbell <Ian.Campbell@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Fri, 27 Sep 2013 17:53:57 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote: > Hi Konrad, > > good call! I was able to reproduce the error with the 3.12-rc2 kernel, got > a lot of information with the new NMI traces (log attached), but since I'm > not a xen hacker I don't really know how to continue from here. So I might > add this to the original post and maybe someone can help me. After all the > error persists for half a year now and besides 2 kernel version / .config > Combinations (a 3.8.2 and a 3.6.something) I could never trace this issue > back (even with bisecting the .config because at some point it seemed > random). Can you tell me a bit on how this happens? Is it happening after you boot the machine? Does it happen after a specific workload? It looks like something in the RCU is taking far too long and the RCU callback mechanism starts complaining. The CPU0 is when the RCU mechanism detects that something is off and starts sending NMI to all CPUs. CPU2 is the only one that looks to be doing RCU callback: NMI backtrace for cpu 1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2 Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012 task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000 RIP: e030:[<ffffffff8125b2b2>] [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411 RSP: e02b:ffff88007de439f0 EFLAGS: 00000046 RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003 RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000 RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0 R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000 FS: 00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000 CS: e033 DS: 002b ES: 002b CR0: 000000008005003b CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660 Stack: 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b Call Trace: <IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8 [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290 [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306 [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8 [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d [<ffffffff813db9c8>] ? printk+0x4f/0x51 [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598 <================== [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239 [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e [<ffffffff81084c35>] ? update_process_times+0x30/0x5b [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159 [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa [<ffffffff8103df22>] ? check_events+0x12/0x20 [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5 [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52 [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6 [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51 [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32 [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30 <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13 [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160 Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3 Which looks to be printing something on the VT console (which is running in KMS mode as it uses framebuffer calls). So is there something on the screen scrolling widly in a loop? But then there are also complains about INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs this taking too long. I am wondering if there is some time issue on your box. What version of Xen do you have? > > > 2013/9/27 Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote: > > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4 > > > kernel I found which doesn't give me this issue: > > > http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html > > > > So v3.12 (or rather the latest and greaters of the Linus) has the mechanism > > for the NMI - so you can actually see what is causing the stall. > > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.