[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote: > Hi Konrad, > > sorry I missed your entry, google mail might not be the best software to > view mailing lists ;) > > The RCU stall happens roughly 2 minutes after the machine is fully booted, > and I'm usually working via SSH by then.. > > I basically have two cases where the stall happens: > > 1) Without the no-cpuidle function, It happens when I start xencommons > 2) With or without no-cpuidle, this happens sometimes and arbitrary and I > have the feeling that logging in via SSH (or network traffic in general?) > will increase the chance of the rcu stall and (and this is only a guess) > in most cases this actually happens when I enter a command of more then 16 > chars in the ssh command prompt. (I don't really think that this is really > causing the issue, I just noticed that when entering the usual commands to > start all the xen stuff / boot the domUs, it stalls mostly on the same > commands / when ssh freezes I came to the same part of the command). But > more ssh-intensive commands like 'dmesg' or 'htop' don't cause it.. > > Also, I can't really say what is on the screen because my dom0 does not > have a vga card / both vga cards in the server are passed to different > domUs and when I don't hide the vga cards on boot via xen-pciback.hide, > the rcu usually does not stall and everything is fine.. > For debugging you should have a serial console.. so maybe get a pci serial card, if you don't have any management processors offering SOL ? -- Pasi > 2013/9/27 Konrad Rzeszutek Wilk <[1]konrad.wilk@xxxxxxxxxx> > > On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote: > > Hi Konrad, > > > > good call! I was able to reproduce the error with the 3.12-rc2 kernel, > got > > a lot of information with the new NMI traces (log attached), but since > I'm > > not a xen hacker I don't really know how to continue from here. So I > might > > add this to the original post and maybe someone can help me. After all > the > > error persists for half a year now and besides 2 kernel version / > .config > > Combinations (a 3.8.2 and a 3.6.something) I could never trace this > issue > > back (even with bisecting the .config because at some point it seemed > > random). > > Can you tell me a bit on how this happens? Is it happening after you > boot the machine? Does it happen after a specific workload? > > It looks like something in the RCU is taking far too long and > the RCU callback mechanism starts complaining. The CPU0 is when the > RCU mechanism detects that something is off and starts sending NMI to > all CPUs. CPU2 is the only one that looks to be doing RCU callback: > > NMI backtrace for cpu 1 > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2 > Hardware name: System manufacturer System Product Name/Crosshair IV > Formula, BIOS 3029 10/09/2012 > task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000 > RIP: e030:[<ffffffff8125b2b2>] [<ffffffff8125b2b2>] > cfb_imageblit+0x1b3/0x411 > RSP: e02b:ffff88007de439f0 EFLAGS: 00000046 > RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003 > RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000 > RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0 > R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d > R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000 > FS: 00007fb294ab4900(0000) GS:ffff88007de40000(0000) > knlGS:0000000000000000 > CS: e033 DS: 002b ES: 002b CR0: 000000008005003b > CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660 > Stack: > 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa > ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800 > 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b > Call Trace: > <IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d > [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8 > [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d > [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc > [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290 > [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc > [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306 > [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb > [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8 > [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d > [<ffffffff813db9c8>] ? printk+0x4f/0x51 > [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598 > <================== > [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239 > [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e > [<ffffffff81084c35>] ? update_process_times+0x30/0x5b > [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a > [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c > [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159 > [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca > [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b > [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d > [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa > [<ffffffff8103df22>] ? check_events+0x12/0x20 > [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5 > [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52 > [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c > [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb > [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6 > [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51 > [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32 > [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30 > <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 > [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 > [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13 > [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e > [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160 > Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29 > c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89 > c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3 > > Which looks to be printing something on the VT console (which is running > in KMS mode as it uses framebuffer calls). So is there something on the > screen scrolling widly in a loop? > > But then there are also complains about > > INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long > to run: 1.115 msecs > > this taking too long. I am wondering if there is some time issue > on your box. > > What version of Xen do you have? > > > > > > 2013/9/27 Konrad Rzeszutek Wilk <[2]konrad.wilk@xxxxxxxxxx> > > > > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote: > > > > I'm currently on a vanilla 3.8.2 kernel because this is the only > >3.4 > > > > kernel I found which doesn't give me this issue: > > > > > [3]http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html > > > > > > So v3.12 (or rather the latest and greaters of the Linus) has the > mechanism > > > for the NMI - so you can actually see what is causing the stall. > > > > > _______________________________________________ > Xen-devel mailing list > [4]Xen-devel@xxxxxxxxxxxxx > [5]http://lists.xen.org/xen-devel > > References > > Visible links > 1. mailto:konrad.wilk@xxxxxxxxxx > 2. mailto:konrad.wilk@xxxxxxxxxx > 3. http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html > 4. mailto:Xen-devel@xxxxxxxxxxxxx > 5. http://lists.xen.org/xen-devel > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |