[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] libxl crash in event loop accessing fds array
Roger reported to me (pers.comm.) this crash on NetBSD: Program terminated with signal 11, Segmentation fault. #0 0x00007f7ff743131b in afterpoll_check_fd (poller=<optimized out>, fds=0x7f7ff7b241c0, nfds=7, fd=-1, events=1) at libxl_event.c:856 856 if (fds[slot].fd != fd) More information: http://dpaste.org/E5sw2/raw/ http://dpaste.org/7zzRT/raw/ (with #define DEBUG in libxl_event.c) http://dpaste.org/jeCRG/raw/ (strace, seems a bit broken) There are, I think, at least three things going wrong here: - Firstly, the bootloader is failing for some unknown reason. I think this is (first) manifesting on NetBSD as POLLHUP on the pty master fd. I don't know why this might be but a good way to find out would be to run the bootloader by hand. The debug log lists the arguments it's been given and you can just run it by hand and see what it does. - Secondly, this causes a crash in the libxl event handling machinery. This is due to a mishandled reentrancy hazard. Patch below. - Thirdly, for some unknown reason libxl is not able to kill and reap the bootloader child process: libxl: warning: libxl_bootloader.c:292:bootloader_abort: after failure, failed to kill bootloader [16432]: No such process This is a mystery to me. I asked for a system call trace to try to see what libxl was actually doing but unfortunately strace on Roger's NetBSD install doesn't seem to work. ISTR NetBSD having some other system instead which might produce better information. It seems to me that libxl ought to be able to kill() an unreaped child so I looked to see whether the child might have been reaped. There are only three calls to wait* functions in xl/libxl. The one in xl.c can be ruled out because we are constantly within libxl rather than libxl, and all the callbacks into xl are accompanied by debugging output which we do not see in the trace. The one in libxl_exec.c is in the middle process of libxl__spawn_spawn and can be disregarded. That leaves the one in libxl__fork_selfpipe_woken. This is not invariably accompanied by debugging output so it is possible that something has gone wrong with the handling of the bootloader's death. However looking at the flow after waitpid, in childproc_reaped, there seem to be these possibilities: - the child was not found, in which case childproc_reaped returns ERROR_UNKNOWN_CHILD, and we end up calling xl_reaped_callback which will (we hope!) also give ERROR_UNKNOWN_CHILD, so that we will call libxl_report_child_exitstatus complaining about the unknown child. We don't see this in the trace. - the child was found and is still associated with the bootloader callback, in which case we call bootloader_finished which also always logs. - somehow we call some other function which throws the information awaw. I think this is unlikely. I therefore conclude that the child has died but been auto-reaped somehow. Perhaps SIGCHLD is set to SIG_IGN ? That would be Evil, Bad and Wrong. I think a proper system call trace will be necessary to resolve this. Naturally it is important not to fix the first of these problems before debugging the error handling :-). Thanks, Ian. Here is a patch to reproduce the problem on Linux. touch the file "boom" while the bootloader is running. You need to run the whole thing under valgrind to spot the illegal access. diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c index 99972a2..f7df0b1 100644 --- a/tools/libxl/libxl_aoutils.c +++ b/tools/libxl/libxl_aoutils.c @@ -102,6 +102,9 @@ static void datacopier_readable(libxl__egc *egc, libxl__ev_fd *ev, libxl__datacopier_state *dc = CONTAINER_OF(ev, *dc, toread); STATE_AO_GC(dc->ao); +if (!access("boom",R_OK)) + revents = 0x10; + if (revents & ~POLLIN) { LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)" " on %s during copy of %s", revents, dc->readwhat, dc->copywhat); Doing this does not (for me, at least), seem to lose track of the bootloader process. Instead, I see (when combined with the patch below): libxl: error: libxl_aoutils.c:110:datacopier_readable: unexpected poll event 0x10 (should be POLLIN) on bootloader pty during copy o bootloader output for domain 22 libxl: debug: libxl_event.c:120:libxl__ev_fd_deregister: ev_fd=0x42b13a8 deregister fd=13 libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0 deregister unregistered libxl: debug: libxl_event.c:120:libxl__ev_fd_deregister: ev_fd=0x42b1348 deregister fd=17 libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1360 deregister unregistered libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13a8 deregister unregistered libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0 deregister unregistered libxl: debug: libxl_event.c:1610:libxl__ao_inprogress: ao 0x42b1038: not ready, waiting libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1348 deregister unregistered libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1360 deregister unregistered libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13a8 deregister unregistered libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0 deregister unregistered libxl: error: libxl_bootloader.c:557:bootloader_finished: bootloader failed - consult logfile /var/log/xen/bootloader.22.log libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: bootloader [3615] died due to fatal signal Terminated The fix I have come up with is below. Ian. Subject: [PATCH] libxl: Deal properly with reentrant changes to fd events In afterpoll_internal, the callback functions may register and deregister events arbitrarily. This means that we need to consider the reentrancy-safety of the event machinery state variables. Most of the code is safe but the fd handling is not. Fix this by arranging to restart the fd scan loop every time we call one of these callback functions. Another possible solution would be simply to return from afterpoll_internal after calling efd->func. That would be a small and more obviously correct change but would prevent the process from handling more than one fd event with a single call to poll. Reported-by: Roger Pau Monne <roger.pau@xxxxxxxxxx> Signed-off-by: Ian Jackson <ian.jackson@xxxxxxxxxxxxx> diff --git a/tools/libxl/libxl_event.c b/tools/libxl/libxl_event.c index 1957505..991261e 100644 --- a/tools/libxl/libxl_event.c +++ b/tools/libxl/libxl_event.c @@ -838,15 +838,19 @@ int libxl_osevent_beforepoll(libxl_ctx *ctx, int *nfds_io, static int afterpoll_check_fd(libxl__poller *poller, const struct pollfd *fds, int nfds, - int fd, int events) + int fd, int events, + int *(rindices[3])) /* returns mask of events which were requested and occurred */ { if (fd >= poller->fd_rindices_allocd) /* added after we went into poll, have to try again */ return 0; + events |= POLLERR | POLLHUP; + int i, revents = 0; for (i=0; i<3; i++) { + if (rindices) rindices[i] = 0; int slot = poller->fd_rindices[fd][i]; if (slot >= nfds) @@ -858,11 +862,16 @@ static int afterpoll_check_fd(libxl__poller *poller, continue; assert(!(fds[slot].revents & POLLNVAL)); - revents |= fds[slot].revents; - } - /* we mask in case requested events have changed */ - revents &= (events | POLLERR | POLLHUP); + /* we mask in case requested events have changed */ + int slot_revents = fds[slot].revents & events; + if (!slot_revents) + /* this slot is for a different set of events */ + continue; + + revents |= slot_revents; + if (rindices) rindices[i] = &poller->fd_rindices[fd][i]; + } return revents; } @@ -876,26 +885,78 @@ static void afterpoll_internal(libxl__egc *egc, libxl__poller *poller, EGC_GC; libxl__ev_fd *efd; - LIBXL_LIST_FOREACH(efd, &CTX->efds, entry) { - if (!efd->events) - continue; + /* + * Warning! Reentrancy hazards! + * + * Many parts of this function eventually call arbitrary callback + * functions which may modify the event handling data structures. + * + * Of the data structures used here: + * + * egc, poller, now + * are allocated by our caller and relate to the + * current thread and its call stack down into the + * event machinery; it is not freed until we return. + * So it is safe. + * + * fds is either what application passed into + * libxl_osevent_afterpoll (which, although this + * isn't explicitly stated, clearly must remain + * valid until libxl_osevent_afterpoll returns) or + * it's poller->fd_polls which is modified only by + * our (non-recursive) caller eventloop_iteration. + * + * CTX comes from our caller, and applications are + * forbidden from destroying it while we are running. + * So the ctx pointer itself is safe to use; now + * for its contents: + * + * CTX->etimes is used in a simple reentrancy-safe manner. + * + * CTX->efds is more complicated; see below. + */ - int revents = afterpoll_check_fd(poller,fds,nfds, efd->fd,efd->events); - if (revents) { - DBG("ev_fd=%p occurs fd=%d events=%x revents=%x", - efd, efd->fd, efd->events, revents); - efd->func(egc, efd, efd->fd, efd->events, revents); + for (;;) { + /* We restart our scan of fd events whenever we call a + * callback function. This is necessary because such + * a callback might make arbitrary changes to CTX->efds. + * We invalidate the fd_rindices[] entries which were used + * so that we don't call the same function again. */ + int revents, i, *(rindices[3]); + + LIBXL_LIST_FOREACH(efd, &CTX->efds, entry) { + + if (!efd->events) + continue; + + revents = afterpoll_check_fd(poller,fds,nfds, + efd->fd,efd->events, + rindices); + if (revents) + goto found_fd_event; } + /* no ordinary fd events, then */ + break; + + found_fd_event: + DBG("ev_fd=%p occurs fd=%d events=%x revents=%x", + efd, efd->fd, efd->events, revents); + + efd->func(egc, efd, efd->fd, efd->events, revents); + + for (i=0; i<3; i++) + if (rindices[i]) + *(rindices[i]) = INT_MAX; } - if (afterpoll_check_fd(poller,fds,nfds, poller->wakeup_pipe[0],POLLIN)) { + if (afterpoll_check_fd(poller,fds,nfds, poller->wakeup_pipe[0],POLLIN, 0)) { int e = libxl__self_pipe_eatall(poller->wakeup_pipe[0]); if (e) LIBXL__EVENT_DISASTER(egc, "read wakeup", e, 0); } int selfpipe = libxl__fork_selfpipe_active(CTX); if (selfpipe >= 0 && - afterpoll_check_fd(poller,fds,nfds, selfpipe, POLLIN)) { + afterpoll_check_fd(poller,fds,nfds, selfpipe, POLLIN, 0)) { int e = libxl__self_pipe_eatall(selfpipe); if (e) LIBXL__EVENT_DISASTER(egc, "read sigchld pipe", e, 0); libxl__fork_selfpipe_woken(egc); diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 7a655b3..647cc7c 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -278,7 +278,7 @@ struct libxl__poller { int fd_polls_allocd; int fd_rindices_allocd; - int (*fd_rindices)[3]; /* see libxl_osevent_beforepoll */ + int (*fd_rindices)[3]; /* see libxl_event.c:beforepoll_internal */ int wakeup_pipe[2]; /* 0 means no fd allocated */ }; _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |