Re: [Xen-devel] [BUG] xl devd segmentation fault on xl block-detach

On 03.05.2017 13:27, Wei Liu wrote:
CC Ian

On Wed, May 03, 2017 at 03:04:44AM +0300, Reinis Martinsons wrote:

I would like to report a problem with a storage driver domain. When detaching
2 virtual block devices from the same domain provided by the same driver
domain, this generates a segmentation fault in the driver domain `xl devd`
process. I observed the same problem both when manually detaching block
devices from Dom0 and when shutting down guest domains with several block

For ease of demonstration I am sharing my test results on a simple scenario
where virtual block devices are provided from a storage driver domain (DomD)
back to Dom0, but I observed identical results for other DomUs.

Both my Dom0 and DomD run Arch Linux (kernel 4.10.11-1-ARCH). I built Xen
from the Arch Linux User Repository (https://aur.archlinux.org/xen.git), latest
commit 16894c15a19bfef23550ba09d58e097fe16c4792, which uses Xen 4.8.0
(commit b03cee73197f4a37bf2941b9367105187355e638). Please see the output of
`xl info` attached in "xl info (Dom0).txt". When building Xen for DomD, I
enabled debugging symbols (`debug ?= y` in /Config.mk). I enabled
xendriverdomain.service in DomD. The DomD configuration file is attached in

After two consecutive `xl block-attach` commands followed by two
`xl block-detach` commands in Dom0, I observe the following output:

[root@arch-test-dom0 ~]# xl block-attach 0 
[root@arch-test-dom0 ~]# xl block-attach 0 
[root@arch-test-dom0 ~]# xl block-detach 0 51712
[root@arch-test-dom0 ~]# xl block-detach 0 51728
libxl: error: libxl_device.c:1264:device_destroy_be_watch_cb: timed out while waiting for /local/domain/1/backend/vbd/0/51728 to be removed
libxl: error: libxl.c:2009:device_addrm_aocomplete: unable to remove vbd with id 51728
libxl_device_disk_remove failed.
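
For readers unfamiliar with the numeric device IDs above: under the conventional Xen numbering for xvd disks (major 202, `(202 << 8) | (disk << 4) | partition` for the first 16 disks), 51712 and 51728 correspond to the first and second xvd disk. The following is a minimal sketch of that decoding; the function name is hypothetical, and it assumes the devices were attached with xvd names, which the truncated `xl block-attach` lines do not show.

```c
#include <stdio.h>

/* Conventional major number for Xen "xvd" virtual disks. */
#define XVD_MAJOR 202

/*
 * Hypothetical helper: write the conventional name (e.g. "xvdb" or
 * "xvdb1") for a classic xvd device ID into buf.
 * Returns 0 on success, -1 if devid is not in the 202-major range.
 */
int xvd_name_from_devid(unsigned int devid, char *buf, size_t len)
{
    unsigned int disk, part;

    if ((devid >> 8) != XVD_MAJOR)
        return -1;

    disk = (devid >> 4) & 0xf;  /* disk index 0..15 -> 'a'..'p' */
    part = devid & 0xf;         /* 0 means the whole disk */

    if (part == 0)
        snprintf(buf, len, "xvd%c", 'a' + disk);
    else
        snprintf(buf, len, "xvd%c%u", 'a' + disk, part);
    return 0;
}
```

Calling `xvd_name_from_devid(51712, buf, sizeof buf)` fills `buf` with "xvda", and 51728 gives "xvdb".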

The 2nd `xl block-detach` command generates a segmentation fault in the DomD
`xl devd` process (in search_for_guest, libxenlight.so.4.8); please see the full
DomD log output attached in "journalctl (domD).txt".

I am also attaching "xenstored-access.log" and output of `xenstore-ls -fp`
in "xenstore-ls.txt". In addition, I am attaching output of gdb `backtrace
full` command on the generated coredump in DomD as "coredumpctl gdb

Please let me know if I should provide any other information for debugging
this problem.

Kind regards

Reinis Martinsons

# After the 2nd `xl block-detach` command:

[20170502T23:30:38.176Z]  A37.2        rm        
[20170502T23:30:38.177Z]  A37.2        rm        /local/domain/0/device/vbd
[20170502T23:30:38.177Z]  A37.2        rm        /local/domain/0/device
[20170502T23:30:38.178Z]  A37.2        rm        /libxl/0/device/vbd/51728
[20170502T23:30:38.178Z]  A37.2        rm        /libxl/0/device/vbd
[20170502T23:30:38.179Z]  A37.2        rm        /libxl/0/device
[20170502T23:30:38.179Z]  A37.2        rm        /libxl/0
[20170502T23:30:38.180Z]  A37.2        commit
[20170502T23:30:38.180Z]  D0           w event   device/vbd/51728 
[20170502T23:30:38.180Z]  D0           w event   device/vbd FFFFFFFF81AA8180
[20170502T23:30:38.180Z]  D0           w event   device FFFFFFFF81AA8180
[20170502T23:30:38.181Z]  D0           unwatch   /local/domain/1/backend/vbd/0/51728/state FFFF88017F40CC20
[20170502T23:30:38.181Z]  A37          endconn
[20170502T23:31:17.867Z]  A38          newconn
[20170502T23:31:17.957Z]  A38          endconn
Core was generated by `/usr/bin/xl devd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f49bf42519d in search_for_guest (ddomain=0x7ffc601e7130, domid=0)
     at libxl.c:3688
3688            if (dguest->domid == domid)
[Current thread is 1 (Thread 0x7f49bfa75fc0 (LWP 1403))]
(gdb) backtrace full
#0  0x00007f49bf42519d in search_for_guest (ddomain=0x7ffc601e7130, domid=0)
     at libxl.c:3688
         dguest = 0x31352f302f646276

This seems to suggest dguest is used after being freed.
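
One way to check this reading is to look at the bogus pointer value itself. Read as little-endian bytes (the in-memory order on x86-64), 0x31352f302f646276 is the ASCII text "vbd/0/51", which looks like a fragment of a xenstore path such as the backend/vbd/0/51728 path seen in the logs. That would be consistent with the list node's memory having been freed and reused for string data. The helper below is only an illustrative sketch; its name is hypothetical and it is not part of libxl.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy the 8 bytes of `value` into `out` in
 * little-endian order, replacing non-printable bytes with '.'.
 * `out` must have room for 9 bytes (8 characters plus the terminator).
 */
void pointer_bytes_as_text(uint64_t value, char out[9])
{
    for (size_t i = 0; i < 8; i++) {
        /* Byte i of the value, least significant byte first. */
        unsigned char c = (unsigned char)(value >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}
```

Calling `pointer_bytes_as_text(0x31352f302f646276ULL, buf)` leaves "vbd/0/51" in `buf`.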

But looking at the code of backend_watch_callback, dguest shouldn't be
on the list.

3927         /* If this was the last device in the domain, remove it from the list */
3928         num_devs = dguest->num_vifs + dguest->num_vbds + 
3929         if (num_devs == 0) {
3930             LIBXL_SLIST_REMOVE(&ddomain->guests, dguest, 
3931                                next);
3932             LOG(DEBUG, "removed domain %u from the list of active guests",
3933                        dguest->domid);
3934             /* Clear any leftovers in libxl/<domid> */
3935             libxl__xs_rm_checked(gc, XBT_NULL,
3936                                  GCSPRINTF("libxl/%u", dguest->domid));
3937             free(dguest);
3938         }
3939     }
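
The invariant this code is meant to maintain (unlink the guest from the list before freeing it, so that a later search never follows a pointer into freed memory) can be sketched in isolation. The sketch below uses the BSD `<sys/queue.h>` SLIST macros, on the assumption that the LIBXL_SLIST_* wrappers behave the same way. The struct and function names are hypothetical stand-ins, not libxl's actual definitions, and a single `num_devs` counter replaces the per-type counters.

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for libxl's per-guest bookkeeping. */
struct guest {
    unsigned int domid;
    int num_devs;
    SLIST_ENTRY(guest) next;
};
SLIST_HEAD(guest_list, guest);

/* Walk the list looking for domid, as search_for_guest does. */
struct guest *find_guest(struct guest_list *guests, unsigned int domid)
{
    struct guest *g;

    SLIST_FOREACH(g, guests, next) {
        if (g->domid == domid)
            return g;
    }
    return NULL;
}

/* Allocate a guest entry and link it at the head of the list. */
struct guest *add_guest(struct guest_list *guests, unsigned int domid)
{
    struct guest *g = calloc(1, sizeof(*g));

    if (!g)
        return NULL;
    g->domid = domid;
    SLIST_INSERT_HEAD(guests, g, next);
    return g;
}

/*
 * Account for one removed device. When the last device is gone,
 * unlink the guest first and only then free it.
 * Returns 1 if the guest was freed, 0 otherwise.
 */
int remove_device(struct guest_list *guests, struct guest *g)
{
    if (g->num_devs > 0)
        g->num_devs--;
    if (g->num_devs == 0) {
        SLIST_REMOVE(guests, g, guest, next);
        free(g);
        return 1;
    }
    return 0;
}
```

With two devices attached, the first `remove_device` call leaves the guest on the list and the second unlinks and frees it, after which `find_guest` returns NULL. A crash inside the search loop, as in the backtrace above, would mean some path reached a freed entry that was still linked, or kept using a stale pointer to it.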

There is no logging unfortunately. But the xenstore log suggests this
path is taken. Can you do a quick retest? Modify the unit file for xl
devd to make it `xl -vvv devd` to grab more output.

I modified the xendriverdomain.service unit file to execute `xl -vvv devd`. This produced the following journalctl output when the service was started:

[root@arch-zfs-test ~]# journalctl -b "_SYSTEMD_UNIT=xendriverdomain.service"
-- Logs begin at Sat 2017-04-15 01:20:58 EEST, end at Wed 2017-05-03 15:32:12 EEST. --
May 03 14:53:46 arch-zfs-test xl[1396]: xencall:buffer: debug: total allocations:7 total releases:7
May 03 14:53:46 arch-zfs-test xl[1396]: xencall:buffer: debug: current allocations:0 maximum allocations:1
May 03 14:53:46 arch-zfs-test xl[1396]: xencall:buffer: debug: cache current size:1
May 03 14:53:46 arch-zfs-test xl[1396]: xencall:buffer: debug: cache hits:6 misses:1 toobig:0

In addition, a full xldevd log was generated; please see the attached "xldevd.log.1" from the respective session.


I am also attaching the results of the repeated test, in the same form as before.


