
Re: [Xen-devel] [libvirt test] 55257: regressions - FAIL



On Thu, 2015-05-14 at 15:21 -0600, Jim Fehlig wrote:

> > FWIW http://logs.test-lab.xenproject.org/osstest/logs/55443/ seems to
> > have two more instances of this (amd64 and i386)
> 
> More cases of qemu not starting.  I'm not sure how we can get more
> details about that.

FWIW I dug into this a bit more yesterday, after discussing it with Ian
and others.

We wondered if qemu had crashed, but the logs show a timeout, and libxl
has code in the parent process which receives SIGCHLD, logs it and errors
out. So I think it probably isn't a crash, unless the monitoring code is
buggy somehow (not out of the question, since it's probably not exercised
much).

Also we expect that a crash would produce a segfault message on the
kernel console, which didn't appear.

We also considered where stderr was going. libxl redirects qemu's
std{out,err} to the qemu-dm-debian.guest.osstest.log file, which is
captured and is empty.

There was some question about where libvirt's own stderr was going
(/dev/null or perhaps the console) but it doesn't appear as if anything
is going wrong in libvirt itself and as above we capture the std* for
processes which we spawn ourselves.

Lastly libvirtd is still running and is shown in the ps logs captured.

> 
> >  but with no 
> > interesting logs still and a different one on ARM:
> >
> > http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/11.ts-guest-start.log:
> > 2015-05-13 09:23:32.193+0000: 16389: info : libvirt version: 1.2.16
> > 2015-05-13 09:23:32.193+0000: 16389: warning : virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6 keepalive messages in 35 seconds
> > 2015-05-13 09:23:32.193+0000: 16390: warning : virKeepAliveTimerInternal:143 : No response from client 0xb7000c38 after 6 keepalive messages in 35 seconds
> > error: Failed to create domain from /etc/xen/debian.guest.osstest.cfg.xml
> > error: internal error: received hangup / error event on socket
> >   
> 
> In this case it seems libvirtd crashed.

http://logs.test-lab.xenproject.org/osstest/logs/55443/test-armhf-armhf-libvirt/arndale-lakeside-output-ps_wwwaxf_-eo_pid%2Ctty%2Cstat%2Ctime%2Cnice%2Cpsr%2Cpcpu%2Cpmem%2Cnwchan%2Cwchan%2325%2Cargs
 

includes:
 2301 ?        DLl  00:00:00   0   0  0.0  1.6 ffffff fdget_pos                 /usr/local/sbin/libvirtd -d
16395 ?        S    00:00:00   0   0  0.0  0.5 24b6dc wait                       \_ /usr/local/sbin/libvirtd -d
16396 ?        Ssl  00:00:00   0   0  0.0  1.9 ffffff poll_schedule_timeout          \_ /usr/local/lib/xen/bin/qemu-system-i386 -xen-domid 1 -chardev socket,id=libxl-cmd,path=/var/run/xen/qmp-libxl-1,server,nowait -no-shutdown -mon chardev=libxl-cmd,mode=control -chardev socket,id=libxenstat-cmd,path=/var/run/xen/qmp-libxenstat-1,server,nowait -mon chardev=libxenstat-cmd,mode=control -nodefaults -xen-attach -name debian.guest.osstest -vnc none -display none -nographic -machine xenpv -m 512

So I don't think it has crashed; it even seems to have successfully
spawned a qemu.

Comparing the libxl-driver.log here with the amd64 case:

libxl: debug: libxl_event.c:537:watchfd_callback: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: event epath=/local/domain/0/device-model/1/state

[arm stops here, amd64 continues with the remainder]

libxl: debug: libxl_aoutils.c:87:xswait_timeout_callback: domain 1 device model startup: xswait timeout (path=/local/domain/0/device-model/1/state)
libxl: debug: libxl_event.c:638:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0 wpath=/local/domain/0/device-model/1/state token=3/0: deregister slotnum=3
libxl: error: libxl_exec.c:393:spawn_watch_event: domain 1 device model: startup timed out
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d70595e0: deregister unregistered
libxl: error: libxl_dm.c:1565:device_model_spawn_outcome: domain 1 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1362:domcreate_devmodel_started: device model did not start: -3
libxl: debug: libxl_dm.c:1678:kill_device_model: Device Model signaled
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d702f3c0: deregister unregistered
libxl: debug: libxl_event.c:652:libxl__ev_xswatch_deregister: watch w=0x7ff4d7031290: deregister unregistered
libxl: debug: libxl.c:1701:devices_destroy_cb: forked pid 18588 for destroy of domain 1
libxl: debug: libxl_event.c:1768:libxl__ao_complete: ao 0x7ff4d702ed60: complete, rc=-3
libxl: debug: libxl_event.c:1740:libxl__ao__destroy: ao 0x7ff4d702ed60: destroy
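
For anyone wanting to repeat this comparison on other flights, here is a rough sketch of how the two logs could be lined up mechanically. It assumes each entry is on a single line (mail-wrapped lines rejoined) and follows the "libxl: LEVEL: FILE:LINE:FUNCTION: MESSAGE" shape seen above. The names parse, LogEntry and first_divergence are made up for this sketch and are not part of libxl or osstest. Comparing by function name rather than by full message avoids spurious differences from pointer values and pids.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as:
#   libxl: error: libxl_exec.c:393:spawn_watch_event: domain 1 device model: ...
LINE_RE = re.compile(
    r"^libxl: (?P<level>\w+): (?P<file>[\w.]+):(?P<line>\d+):"
    r"(?P<func>\w+): (?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    level: str
    file: str
    line: int
    func: str
    msg: str


def parse(line: str) -> Optional[LogEntry]:
    """Parse one libxl log line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["level"], m["file"], int(m["line"]), m["func"], m["msg"])


def first_divergence(log_a: List[str], log_b: List[str]) -> int:
    """Index of the first entry whose function name differs between the logs.

    If one log is a prefix of the other, the length of the shorter one is
    returned, i.e. the point where the shorter log stops.
    """
    funcs_a = [e.func for e in map(parse, log_a) if e is not None]
    funcs_b = [e.func for e in map(parse, log_b) if e is not None]
    for i, (fa, fb) in enumerate(zip(funcs_a, funcs_b)):
        if fa != fb:
            return i
    return min(len(funcs_a), len(funcs_b))
```

Feeding it the armhf and amd64 libxl-driver.log contents would report the index of the first entry after the watchfd_callback line, assuming the logs match up to that point as described above.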

I wonder if we are somehow losing an event or getting the event loop screwed
up.

Perhaps in the amd64 case we are somehow losing the xenstore watch, while
in the armhf case we are losing some other fd, which interferes with
libvirt's own event loop?

So I think we are looking at either a hang or an event processing SNAFU
rather than a crash.
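
To illustrate why a lost event and a slow start look the same to the waiter, here is a minimal sketch, not libxl's actual code. A pipe stands in for the xenstore watch fd, and wait_for_state is a hypothetical helper invented for this example. If the notification never reaches the fd, the waiter reports a timeout even though the process being waited for is alive and well.

```python
import os
import select


def wait_for_state(fd: int, timeout_s: float) -> str:
    """Block until fd becomes readable or timeout_s seconds elapse."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    if readable:
        return os.read(fd, 64).decode()
    # Nothing arrived: the caller cannot tell "never sent" from "lost".
    return "timed out"


# A pipe stands in for the watch fd (an assumption for illustration only).
r, w = os.pipe()
os.write(w, b"running")
print(wait_for_state(r, 0.1))  # the event was delivered
print(wait_for_state(r, 0.1))  # no event pending: reported as a timeout
```

The second call is the interesting one: from inside the waiter there is no way to distinguish an event that was dropped from one that was never generated.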

BTW, in the above there is "Device Model signaled", which indicates that
kill(pid, SIGHUP) returned 0 and not e.g. ESRCH (when it would log
"Device Model already exited") or anything else (when it would log
"failed to kill..."). So the qemu process was actually present.
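
The three outcomes can be sketched as follows on a POSIX system. This is an illustration of the kill() semantics only, not the libxl source: signal_device_model is a hypothetical name, and the returned strings simply echo the log messages quoted above (the exact wording of the "failed to kill" message is not reproduced here).

```python
import errno
import os
import signal


def signal_device_model(pid: int, sig: int = signal.SIGHUP) -> str:
    """Send sig to pid and describe which of the three outcomes occurred."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # kill() failed with ESRCH: no process with this pid exists.
        return "Device Model already exited"
    except OSError as exc:
        # Any other failure, e.g. EPERM.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return f"failed to kill: {name}"
    # kill() returned 0: a process with this pid existed at that moment.
    return "Device Model signaled"
```

Note that kill() also succeeds against a zombie which has not yet been reaped, so "signaled" strictly shows that the pid still existed at the time of the call.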

The host is doing nothing other than running this one test case, so it
doesn't seem likely that we are really hitting the 30s qemu startup
timeout.

Ian.


