|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [Xen-devel] libxl crash in event loop accessing fds array
Roger reported to me (pers.comm.) this crash on NetBSD:
Program terminated with signal 11, Segmentation fault.
#0 0x00007f7ff743131b in afterpoll_check_fd (poller=<optimized out>,
fds=0x7f7ff7b241c0, nfds=7, fd=-1, events=1)
at libxl_event.c:856
856 if (fds[slot].fd != fd)
More information:
http://dpaste.org/E5sw2/raw/
http://dpaste.org/7zzRT/raw/ (with #define DEBUG in libxl_event.c)
http://dpaste.org/jeCRG/raw/ (strace, seems a bit broken)
There are, I think, at least three things going wrong here:
- Firstly, the bootloader is failing for some unknown reason. I
think this is (first) manifesting on NetBSD as POLLHUP on the pty
master fd. I don't know why this might be but a good way to find
out would be to run the bootloader by hand. The debug log lists
the arguments it's been given and you can just run it by hand and
see what it does.
- Secondly, this causes a crash in the libxl event handling
machinery. This is due to a mishandled reentrancy hazard. Patch
below.
- Thirdly, for some unknown reason libxl is not able to kill and
reap the bootloader child process:
libxl: warning: libxl_bootloader.c:292:bootloader_abort:
after failure, failed to kill bootloader [16432]: No such process
This is a mystery to me. I asked for a system call trace to try to
see what libxl was actually doing but unfortunately strace on
Roger's NetBSD install doesn't seem to work. ISTR NetBSD having
some other system instead which might produce better information.
It seems to me that libxl ought to be able to kill() an unreaped
child so I looked to see whether the child might have been reaped.
There are only three calls to wait* functions in xl/libxl. The one
in xl.c can be ruled out because we are constantly within libxl
rather than libxl, and all the callbacks into xl are accompanied by
debugging output which we do not see in the trace. The one in
libxl_exec.c is in the middle process of libxl__spawn_spawn and can
be disregarded.
That leaves the one in libxl__fork_selfpipe_woken. This is not
invariably accompanied by debugging output so it is possible that
something has gone wrong with the handling of the bootloader's
death. However looking at the flow after waitpid, in
childproc_reaped, there seem to be these possibilities:
- the child was not found, in which case childproc_reaped
returns ERROR_UNKNOWN_CHILD, and we end up calling
xl_reaped_callback which will (we hope!) also give
ERROR_UNKNOWN_CHILD, so that we will call
libxl_report_child_exitstatus complaining about the unknown
child. We don't see this in the trace.
- the child was found and is still associated with the bootloader
callback, in which case we call bootloader_finished which
also always logs.
- somehow we call some other function which throws the
information awaw. I think this is unlikely.
I therefore conclude that the child has died but been auto-reaped
somehow. Perhaps SIGCHLD is set to SIG_IGN ? That would be Evil,
Bad and Wrong.
I think a proper system call trace will be necessary to resolve this.
Naturally it is important not to fix the first of these problems
before debugging the error handling :-).
Thanks,
Ian.
Here is a patch to reproduce the problem on Linux. touch the file
"boom" while the bootloader is running. You need to run the whole
thing under valgrind to spot the illegal access.
diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index 99972a2..f7df0b1 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -102,6 +102,9 @@ static void datacopier_readable(libxl__egc *egc,
libxl__ev_fd *ev,
libxl__datacopier_state *dc = CONTAINER_OF(ev, *dc, toread);
STATE_AO_GC(dc->ao);
+if (!access("boom",R_OK))
+ revents = 0x10;
+
if (revents & ~POLLIN) {
LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)"
" on %s during copy of %s", revents, dc->readwhat, dc->copywhat);
Doing this does not (for me, at least), seem to lose track of the
bootloader process. Instead, I see (when combined with the patch
below):
libxl: error: libxl_aoutils.c:110:datacopier_readable: unexpected poll event
0x10 (should be POLLIN) on bootloader pty during copy o bootloader output for
domain 22
libxl: debug: libxl_event.c:120:libxl__ev_fd_deregister: ev_fd=0x42b13a8
deregister fd=13
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0
deregister unregistered
libxl: debug: libxl_event.c:120:libxl__ev_fd_deregister: ev_fd=0x42b1348
deregister fd=17
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1360
deregister unregistered
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13a8
deregister unregistered
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0
deregister unregistered
libxl: debug: libxl_event.c:1610:libxl__ao_inprogress: ao 0x42b1038: not ready,
waiting
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1348
deregister unregistered
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b1360
deregister unregistered
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13a8
deregister unregistered
libxl: debug: libxl_event.c:116:libxl__ev_fd_deregister: ev_fd=0x42b13c0
deregister unregistered
libxl: error: libxl_bootloader.c:557:bootloader_finished: bootloader failed -
consult logfile /var/log/xen/bootloader.22.log
libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: bootloader [3615]
died due to fatal signal Terminated
The fix I have come up with is below.
Ian.
Subject: [PATCH] libxl: Deal properly with reentrant changes to fd events
In afterpoll_internal, the callback functions may register and
deregister events arbitrarily. This means that we need to consider
the reentrancy-safety of the event machinery state variables.
Most of the code is safe but the fd handling is not. Fix this by
arranging to restart the fd scan loop every time we call one of these
callback functions.
Another possible solution would be simply to return from
afterpoll_internal after calling efd->func. That would be a small and
more obviously correct change but would prevent the process from
handling more than one fd event with a single call to poll.
Reported-by: Roger Pau Monne <roger.pau@xxxxxxxxxx>
Signed-off-by: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
diff --git a/tools/libxl/libxl_event.c b/tools/libxl/libxl_event.c
index 1957505..991261e 100644
--- a/tools/libxl/libxl_event.c
+++ b/tools/libxl/libxl_event.c
@@ -838,15 +838,19 @@ int libxl_osevent_beforepoll(libxl_ctx *ctx, int *nfds_io,
static int afterpoll_check_fd(libxl__poller *poller,
const struct pollfd *fds, int nfds,
- int fd, int events)
+ int fd, int events,
+ int *(rindices[3]))
/* returns mask of events which were requested and occurred */
{
if (fd >= poller->fd_rindices_allocd)
/* added after we went into poll, have to try again */
return 0;
+ events |= POLLERR | POLLHUP;
+
int i, revents = 0;
for (i=0; i<3; i++) {
+ if (rindices) rindices[i] = 0;
int slot = poller->fd_rindices[fd][i];
if (slot >= nfds)
@@ -858,11 +862,16 @@ static int afterpoll_check_fd(libxl__poller *poller,
continue;
assert(!(fds[slot].revents & POLLNVAL));
- revents |= fds[slot].revents;
- }
- /* we mask in case requested events have changed */
- revents &= (events | POLLERR | POLLHUP);
+ /* we mask in case requested events have changed */
+ int slot_revents = fds[slot].revents & events;
+ if (!slot_revents)
+ /* this slot is for a different set of events */
+ continue;
+
+ revents |= slot_revents;
+ if (rindices) rindices[i] = &poller->fd_rindices[fd][i];
+ }
return revents;
}
@@ -876,26 +885,78 @@ static void afterpoll_internal(libxl__egc *egc,
libxl__poller *poller,
EGC_GC;
libxl__ev_fd *efd;
- LIBXL_LIST_FOREACH(efd, &CTX->efds, entry) {
- if (!efd->events)
- continue;
+ /*
+ * Warning! Reentrancy hazards!
+ *
+ * Many parts of this function eventually call arbitrary callback
+ * functions which may modify the event handling data structures.
+ *
+ * Of the data structures used here:
+ *
+ * egc, poller, now
+ * are allocated by our caller and relate to the
+ * current thread and its call stack down into the
+ * event machinery; it is not freed until we return.
+ * So it is safe.
+ *
+ * fds is either what application passed into
+ * libxl_osevent_afterpoll (which, although this
+ * isn't explicitly stated, clearly must remain
+ * valid until libxl_osevent_afterpoll returns) or
+ * it's poller->fd_polls which is modified only by
+ * our (non-recursive) caller eventloop_iteration.
+ *
+ * CTX comes from our caller, and applications are
+ * forbidden from destroying it while we are running.
+ * So the ctx pointer itself is safe to use; now
+ * for its contents:
+ *
+ * CTX->etimes is used in a simple reentrancy-safe manner.
+ *
+ * CTX->efds is more complicated; see below.
+ */
- int revents = afterpoll_check_fd(poller,fds,nfds, efd->fd,efd->events);
- if (revents) {
- DBG("ev_fd=%p occurs fd=%d events=%x revents=%x",
- efd, efd->fd, efd->events, revents);
- efd->func(egc, efd, efd->fd, efd->events, revents);
+ for (;;) {
+ /* We restart our scan of fd events whenever we call a
+ * callback function. This is necessary because such
+ * a callback might make arbitrary changes to CTX->efds.
+ * We invalidate the fd_rindices[] entries which were used
+ * so that we don't call the same function again. */
+ int revents, i, *(rindices[3]);
+
+ LIBXL_LIST_FOREACH(efd, &CTX->efds, entry) {
+
+ if (!efd->events)
+ continue;
+
+ revents = afterpoll_check_fd(poller,fds,nfds,
+ efd->fd,efd->events,
+ rindices);
+ if (revents)
+ goto found_fd_event;
}
+ /* no ordinary fd events, then */
+ break;
+
+ found_fd_event:
+ DBG("ev_fd=%p occurs fd=%d events=%x revents=%x",
+ efd, efd->fd, efd->events, revents);
+
+ efd->func(egc, efd, efd->fd, efd->events, revents);
+
+ for (i=0; i<3; i++)
+ if (rindices[i])
+ *(rindices[i]) = INT_MAX;
}
- if (afterpoll_check_fd(poller,fds,nfds, poller->wakeup_pipe[0],POLLIN)) {
+ if (afterpoll_check_fd(poller,fds,nfds, poller->wakeup_pipe[0],POLLIN, 0))
{
int e = libxl__self_pipe_eatall(poller->wakeup_pipe[0]);
if (e) LIBXL__EVENT_DISASTER(egc, "read wakeup", e, 0);
}
int selfpipe = libxl__fork_selfpipe_active(CTX);
if (selfpipe >= 0 &&
- afterpoll_check_fd(poller,fds,nfds, selfpipe, POLLIN)) {
+ afterpoll_check_fd(poller,fds,nfds, selfpipe, POLLIN, 0)) {
int e = libxl__self_pipe_eatall(selfpipe);
if (e) LIBXL__EVENT_DISASTER(egc, "read sigchld pipe", e, 0);
libxl__fork_selfpipe_woken(egc);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 7a655b3..647cc7c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -278,7 +278,7 @@ struct libxl__poller {
int fd_polls_allocd;
int fd_rindices_allocd;
- int (*fd_rindices)[3]; /* see libxl_osevent_beforepoll */
+ int (*fd_rindices)[3]; /* see libxl_event.c:beforepoll_internal */
int wakeup_pipe[2]; /* 0 means no fd allocated */
};
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |