[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] 4.9.52: INFO: task XXX blocked for more than 300 seconds.



Hello Jan,

thank you for you answer.

Am 05.10.2017 um 12:12 schrieb Jan Beulich:
>>>> On 04.10.17 at 20:26, <hahn@xxxxxxxxxxxxx> wrote:
>> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
>> several stuck processes: Here is one (truncated) dump of the Linux
>> kernel messages:
>>
>>>  [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
>>>  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>  [<ffffffff8160e032>] ? schedule+0x32/0x80
>>>  [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
>>>  [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below
>>>  [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
>>>  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>  [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
>>>  [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
>>>  [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
>>>  [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
>>>  [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>  [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
>>>  [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
>>>  [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
>>>  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>  [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
>>>  [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
>>>  [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
>>>  [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
>>>  [<ffffffff81097660>] ? kthread+0xf0/0x110
>>>  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>  [<ffffffff81097570>] ? kthread_park+0x60/0x60
>>>  [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
>>> NMI backtrace for cpu 2
>>> CPU: 2 PID: 35 Comm: khungtaskd Not tainted 4.9.0-ucs105-amd64 #1 Debian 
>>> 4.9.30-2A~4.2.0.201709271649
>>>  0000000000000000 ffffffff81331935 0000000000000000 0000000000000002
>>>  ffffffff81335e60 0000000000000002 ffffffff8104cb70 ffff8801f0c90e80
>>>  ffffffff81335f6a ffff8801f0c90e80 00000000003fffbc ffffffff81128048
>>> Call Trace:
>>>  [<ffffffff81331935>] ? dump_stack+0x5c/0x77
>>>  [<ffffffff81335e60>] ? nmi_cpu_backtrace+0x90/0xa0
>>>  [<ffffffff8104cb70>] ? irq_force_complete_move+0x140/0x140
>>>  [<ffffffff81335f6a>] ? nmi_trigger_cpumask_backtrace+0xfa/0x130
>>>  [<ffffffff81128048>] ? watchdog+0x2b8/0x330
>>>  [<ffffffff81127d90>] ? reset_hung_task_detector+0x10/0x10
>>>  [<ffffffff81097660>] ? kthread+0xf0/0x110
>>>  [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>  [<ffffffff81097570>] ? kthread_park+0x60/0x60
>>>  [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
...
>> Looking at the dis-assembly of xen_clocksource_get_cycles() in
>> arch/x86/xen/time.c I see no path how that should call
>> __blk_mq_run_hw_queue():
> 
> Hence the question marks ahead of the stack entries: What you see
> there are likely leftovers from prior call trees. It just so happens
> that the old return address slots haven't got overwritten yet. You
> need to first sanitize the stack trace e.g. by having the kernel
> dump more of the stack in raw hex form, and then looking at the
> disassembly to figure out how large each stack frame is, starting
> at the top-most address (i.e. the one in RIP).

That explains the strange call trace for me, thank you for the
enlightenment.

> On Wed, Oct 04, 2017 at 06:26:27PM +0000, Philipp Hahn wrote:
>>> INFO: task btrfs-transacti:522 blocked for more than 300 seconds.
> [...] 
>> And another one:
>>> INFO: task smbd:20101 blocked for more than 300 seconds.
> [...] 
>> This does not look normal to me or did I miss something?
> 
> So I see that both of the stuck processes listed above (smbd and
> btrfs-*) are disk related processes. Might I ask how many disk/nics
> (PV) do you have attached to this DomU, and how many queues does each
> have?

Nothing special configured, how would I best fetch that info?


Which leads me back to my original problem: How can I diagnose *why* the
task is blocked for that time? From my understanding this can happen if
IO is too slow and task just have to wait for too long. ¹
Even if IO is slow the system should stabilize itself when no new IO is
generated and the old one has been processed, right? So looking at
`vmstat` or `blktrace` should tell me, that Xen/Linux/whatever is busy
with IO and it is simply not fast enough to keep up with the load.

Thanks again, but any hint how to diagnose this does help.

Philipp

¹
<https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.