
Re: [Xen-devel] 4.9.52: INFO: task X blocked for more than 300 seconds.



Hello Ankur,

Am 05.10.2017 um 19:59 schrieb Ankur Arora:
> On 2017-10-05 06:20 AM, Konrad Rzeszutek Wilk wrote:
>> On Wed, Oct 04, 2017 at 08:26:27PM +0200, Philipp Hahn wrote:
...
>> Adding Ankur to this as I think he saw something similar.
>>
>> But in the meantime - do you see this with the latest version of Linux?
>>> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
>>> several stuck processes: Here is one (truncated) dump of the Linux
>>> kernel messages:
>>>
>>>>   [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
>>>>   [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>>   [<ffffffff8160e032>] ? schedule+0x32/0x80
>>>>   [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
>>>>   [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below
>>>>   [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
>>>>   [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>>   [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
>>>>   [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
>>>>   [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
>>>>   [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
>>>>   [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
>>>>   [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
>>>>   [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
>>>>   [<ffffffffc00abd18>] ?
>>>> jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
>>>>   [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>>   [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
>>>>   [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
>>>>   [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
>>>>   [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
>>>>   [<ffffffff81097660>] ? kthread+0xf0/0x110
>>>>   [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
>>>>   [<ffffffff81097570>] ? kthread_park+0x60/0x60
>>>>   [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> This looks like this race: https://patchwork.kernel.org/patch/9853443/

I built a new kernel with that patch cherry-picked on top of 4.9.56. We
are currently testing it, but the hang occurred again yesterday evening.
Here is the dmesg output:

> INFO: task systemd:1 blocked for more than 120 seconds.
>       Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201710161640
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> systemd         D    0     1      0 0x00000000
>  ffff8801f1797c00 0000000000000000 ffffffff81c0e540 ffff8801f4956040
>  ffff8801f5a187c0 ffffc90040c4f880 ffffffff8160ebbd ffffffff81186295
>  ffffffff81350b39 0000000002cebf80 ffffffff811863c8 ffff8801f4956040
> Call Trace:
>  [<ffffffff8160ebbd>] ? __schedule+0x23d/0x6d0
>  [<ffffffff81186295>] ? move_freepages+0x95/0xd0
>  [<ffffffff81350b39>] ? list_del+0x9/0x20
>  [<ffffffff811863c8>] ? __rmqueue+0x88/0x3e0
>  [<ffffffff8160f082>] ? schedule+0x32/0x80
>  [<ffffffff8161255c>] ? schedule_timeout+0x1ec/0x360
>  [<ffffffff81187530>] ? get_page_from_freelist+0x350/0xad0
>  [<ffffffff8160e904>] ? io_schedule_timeout+0xb4/0x130
>  [<ffffffff81364fd4>] ? __sbitmap_queue_get+0x24/0x90
>  [<ffffffff813114f9>] ? bt_get.isra.6+0x129/0x1c0
>  [<ffffffff81350b39>] ? list_del+0x9/0x20
>  [<ffffffff810bba00>] ? wake_up_atomic_t+0x30/0x30
>  [<ffffffff81311843>] ? blk_mq_get_tag+0x23/0x90
>  [<ffffffff8130d1aa>] ? __blk_mq_alloc_request+0x1a/0x220
>  [<ffffffff8130dffd>] ? blk_mq_map_request+0xcd/0x170
>  [<ffffffff8131016a>] ? blk_sq_make_request+0xca/0x4c0
>  [<ffffffff81303e1a>] ? generic_make_request_checks+0x22a/0x4f0
>  [<ffffffff81304761>] ? generic_make_request+0x121/0x2c0
>  [<ffffffff8117e823>] ? __add_to_page_cache_locked+0x183/0x230
>  [<ffffffff81304976>] ? submit_bio+0x76/0x150
>  [<ffffffff8117e964>] ? add_to_page_cache_lru+0x84/0xe0
>  [<ffffffffc017b9d9>] ? ext4_mpage_readpages+0x2b9/0x8b0 [ext4]
>  [<ffffffff811d94fa>] ? alloc_pages_current+0x8a/0x110
>  [<ffffffff8118e745>] ? __do_page_cache_readahead+0x195/0x240
>  [<ffffffff8117f627>] ? pagecache_get_page+0x27/0x2b0
>  [<ffffffff811818e6>] ? filemap_fault+0x276/0x590
>  [<ffffffffc0138491>] ? ext4_filemap_fault+0x31/0x50 [ext4]
>  [<ffffffff811b23a4>] ? __do_fault+0x84/0x190
>  [<ffffffff811b783e>] ? handle_mm_fault+0xede/0x1680
>  [<ffffffff8124d25e>] ? ep_poll+0x13e/0x360
>  [<ffffffff8105feda>] ? __do_page_fault+0x26a/0x500
>  [<ffffffff81205ba2>] ? SyS_read+0x52/0xc0
>  [<ffffffff81614d58>] ? page_fault+0x28/0x30

I have not yet been able to get the address of the queue object to
inspect its state. (Out of time for today.)

> Can you dump the output of: cat /sys/block/$xen-frontend-device/mq/*/tags
> 
> If you've hit this bug, one or more of the MQs would be wedged and
> the nr_free in one or more of the queues would be 0 and will not
> change.
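For the record, before the guest wedges completely, something along these lines could snapshot the tag state of all frontend devices in one go (a sketch only: the `xvd*` glob and the `nr_free=` line format are assumed from a 4.9-era kernel's mq sysfs layout):

```shell
# check_tags FILE...: print each blk-mq "tags" file whose nr_free is 0,
# i.e. a queue that may be wedged with all tags in flight.
check_tags() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        # the tags file contains a line like "nr_free=31, nr_reserved=0"
        free=$(awk -F'[=, ]+' '/nr_free/ {print $2; exit}' "$f")
        if [ "$free" = "0" ]; then
            echo "possibly wedged: $f"
        fi
    done
}

# typical invocation on the guest:
# check_tags /sys/block/xvd*/mq/*/tags
```

Running this from a watchdog loop (e.g. once a minute to a log file) would show whether nr_free drops to 0 and stays there, as described above.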

As soon as the bug occurs, we can no longer access the VM via ssh or the
Xen (serial) console: the connection stalls after entering 2-3 characters.

I have a Xen crash-dump file of one such hang, but manually following
/sys/block/xvd?/mq/*/tags in "crash" (gdb) to get the kobject address
is very time consuming. So far I have only done it once, for xvda, but
I have to repeat it for the other 15 block devices to find the culprit.
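In case it is quicker than walking the kobject hierarchy: crash can usually get from the gendisk straight to the tag structures by chasing pointers (a sketch only; the member names below are from a 4.9 tree and the addresses are placeholders):

```text
crash> dev -d                                  # lists request_queue address per disk
crash> struct request_queue.queue_hw_ctx <request_queue-addr>
crash> rd <queue_hw_ctx-value>                 # first entry of the hctx pointer array
crash> struct blk_mq_hw_ctx.tags <hctx-addr>
crash> struct blk_mq_tags <tags-addr>          # nr_tags, bitmap_tags, ...
```

That would at least avoid resolving the sysfs path once per device.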

Philipp Hahn
-- 
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@xxxxxxxxxxxxx

http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-0287

Attachment: blk_sq_make_request.txt
Description: Text document

Attachment: blk_sq_make_request.s
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

