Xen project Mailing List

Re: [Xen-devel] [libvirt test] 55257: regressions - FAIL

From: Ian Campbell <ian.campbell@xxxxxxxxxx>

Date: Fri, 15 May 2015 09:44:28 +0100

Cc: Anthony PERARD <anthony.perard@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, ian.jackson@xxxxxxxxxxxxx

Delivery-date: Fri, 15 May 2015 08:44:35 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Thu, 2015-05-14 at 15:21 -0600, Jim Fehlig wrote:
> > FWIW http://logs.test-lab.xenproject.org/osstest/logs/55443/ seems to
> > have two more instances of this (amd64 and i386)
> 
> More cases of qemu not starting. I'm not sure how we can get more
> details about that.

FWIW I dug into this a bit more yesterday, having discussed it with Ian and others.

We wondered if qemu had crashed, but the logs show a timeout, and libxl has code in the parent process which receives SIGCHLD and logs and errors out. So I think it probably isn't that, unless the monitoring code is buggy somehow (not out of the question; it's probably not exercised much). We would also expect a crash to produce a segfault message on the kernel console, and none appeared.

We also considered where stderr was going. libxl redirects std{out,err} for qemu to the qemu-dm-debian.guest.osstest.log file, which is captured and is empty. There was some question about where libvirt's own stderr was going (/dev/null or perhaps the console), but it doesn't appear as if anything is going wrong in libvirt itself, and as above we capture the std* of processes which we spawn ourselves.

Lastly, libvirtd is still running and is shown in the captured ps logs.
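For readers less familiar with the two mechanisms relied on above, here is a minimal sketch in C. It assumes a plain fork/exec model and is not libxl's actual spawn code. It shows a child's stdout and stderr being pointed at a log file with dup2(), and the parent using waitpid() to tell a normal exit apart from death by a signal (a qemu crash would show up as, e.g., SIGSEGV). The names spawn_with_log(), wait_outcome() and enum child_outcome are hypothetical, for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical outcome categories, for illustration only. */
enum child_outcome { CHILD_EXITED, CHILD_KILLED_BY_SIGNAL, CHILD_ERROR };

/* Fork and exec argv[0] (searched in PATH), with the child's stdout and
 * stderr redirected to logpath (created or truncated).
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_with_log(char *const argv[], const char *logpath)
{
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = fork();
    if (pid != 0)
        return pid; /* parent, or -1 on fork failure */

    /* Child: open the log and make it fds 1 and 2 before exec. */
    int fd = open(logpath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        _exit(127);
    if (dup2(fd, STDOUT_FILENO) < 0 || dup2(fd, STDERR_FILENO) < 0)
        _exit(127);
    if (fd > STDERR_FILENO)
        close(fd);
    execvp(argv[0], argv);
    _exit(127); /* only reached if exec failed */
}

/* Block until pid terminates and classify how it ended. On success,
 * *detail receives the exit status or the terminating signal number. */
static enum child_outcome wait_outcome(pid_t pid, int *detail)
{
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return CHILD_ERROR;
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status); /* e.g. SIGSEGV for a crash */
        return CHILD_KILLED_BY_SIGNAL;
    }
    return CHILD_ERROR;
}
```

The sketch blocks in waitpid() for simplicity. In an event-driven program a SIGCHLD handler would typically just wake the event loop, which then reaps with waitpid(..., WNOHANG); if that wakeup were lost, a dead child would go unnoticed, which is the kind of monitoring bug mentioned above.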
> 
> > but with no
> > interesting logs still and a different one on ARM:
> > 
> > http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/11.ts-guest-start.log:
> > 2015-05-13 09:23:32.193+0000: 16389: info : libvirt version: 1.2.16
> > 2015-05-13 09:23:32.193+0000: 16389: warning :
> > virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6
> > keepalive messages in 35 seconds
> > 2015-05-13 09:23:32.193+0000: 16390: warning :
> > virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6
> > keepalive messages in 35 seconds
> > error: Failed to create domain from /etc/xen/debian.guest.osstest.cfg.xml
> > error: internal error: received hangup / error event on socket
> > 
> 
> In this case it seems libvirtd crashed.

http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/arndale-lakeside-output-ps_wwwaxf_-eo_pid%2Ctty%2Cstat%2Ctime%2Cnice%2Cpsr%2Cpcpu%2Cpmem%2Cnwchan%2Cwchan%2325%2Cargs
includes:

     2301 ? DLl  00:00:00 0 0 0.0 1.6 ffffff fdget_pos             /usr/local/sbin/libvirtd -d
    16395 ? S    00:00:00 0 0 0.0 0.5 24b6dc wait                  \_ /usr/local/sbin/libvirtd -d
    16396 ? Ssl  00:00:00 0 0 0.0 1.9 ffffff poll_schedule_timeout \_ /usr/local/lib/xen/bin/qemu-system-i386 -xen-domid 1 -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-1,server,nowait -no-shutdown -mon chardev=libxl-cmd,mode=control -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-1,server,nowait -mon chardev=libxenstat-cmd,mode=control -nodefaults -xen-attach -name debian.guest.osstest -vnc none -display none -nographic -machine xenpv -m 512

So I don't think it has crashed; it even seems to have successfully spawned a qemu.
Comparing the libxl-driver.log here with the amd64 case:

    libxl: debug: libxl_event.c:537:watchfd_callback: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: event epath=/local/domain/0/device-model/1/state
    [arm stops here, amd64 continues with the remainder]
    libxl: debug: libxl_aoutils.c:87:xswait_timeout_callback: domain 1 device model startup: xswait timeout (path=/local/domain/0/device-model/1/state)
    libxl: debug: libxl_event.c:638:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: deregister slotnum=3
    libxl: error: libxl_exec.c:393:spawn_watch_event: domain 1 device model: startup timed out
    libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
    libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
    libxl: error: libxl_dm.c:1565:device_model_spawn_outcome: domain 1 device model: spawn failed (rc=-3)
    libxl: error: libxl_create.c:1362:domcreate_devmodel_started: device model did not start: -3
    libxl: debug: libxl_dm.c:1678:kill_device_model: Device Model signaled
    libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d702f3c0: deregister unregistered
    libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d7031290: deregister unregistered
    libxl: debug: libxl.c:1701:devices_destroy_cb: forked pid 18588 for destroy of domain 1
    libxl: debug: libxl_event.c:1768:libxl__ao_complete: ao 0x7ff4d702ed60: complete, rc=-3
    libxl: debug: libxl_event.c:1740:libxl__ao__destroy: ao 0x7ff4d702ed60: destroy

I wonder if we are somehow losing an event or getting the event loop screwed up. Perhaps in the amd64 case we are somehow losing the xenstore watch, while in the armhf case we are losing some other fd which interferes with libvirt's own event loop?

So I think we are looking at either a hang or an event-processing SNAFU rather than a crash.
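The "xswait timeout" in the amd64 log is an instance of a common pattern: wait for a notification, but give up after a deadline. A minimal sketch with poll(), assuming a pipe as a stand-in for the xenstore watch fd; this is not libxl's xswait code, and wait_for_event() and enum wait_result are hypothetical names. It illustrates why a lost notification surfaces as a timeout even while the process that should have sent it is still alive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes, for illustration only. */
enum wait_result { WAIT_READY, WAIT_TIMEOUT, WAIT_ERROR };

/* Wait up to timeout_ms for fd to become readable. Here fd stands in for
 * a watch on a path such as /local/domain/0/device-model/<domid>/state.
 * If the notification is never delivered (or is consumed elsewhere), this
 * returns WAIT_TIMEOUT regardless of whether the writer is still running. */
static enum wait_result wait_for_event(int fd, int timeout_ms)
{
    assert(fd >= 0);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc > 0)
            /* POLLHUP or POLLNVAL without POLLIN is treated as an error. */
            return (pfd.revents & POLLIN) ? WAIT_READY : WAIT_ERROR;
        if (rc == 0)
            return WAIT_TIMEOUT;
        if (errno != EINTR)
            return WAIT_ERROR;
        /* Interrupted by a signal: retry. Simplification: the full
         * timeout restarts rather than counting the time already spent. */
    }
}
```

Usage: create a pipe and call wait_for_event() on its read end. With nothing written it reports WAIT_TIMEOUT; after the writer sends something (e.g. "running"), it reports WAIT_READY.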
BTW, in the above there is "Device Model signaled", which indicates that kill(pid, SIGHUP) returned 0 and not e.g. ESRCH (when it would log "Device Model already exited") or anything else (when it would log "failed to kill..."). So the qemu process was actually present.

The host is doing nothing other than running this one test case, so it doesn't seem likely that we are really hitting the 30s qemu startup timeout.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
