Re: [Xen-devel] [PATCH 1/3] libxl: Fix libxl_postfork_child_noexec deadlock etc.

On 02/24/2014 02:19 PM, Ian Jackson wrote:
libxl_postfork_child_noexec would nestedly reacquire the non-recursive
"no_forking" mutex: atfork_lock uses it, as does sigchld_user_remove.
The result on Linux is that the process always deadlocks before
returning from this function.
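
To illustrate the nested acquisition described above, here is a minimal sketch. It is not libxl code: demo_no_forking, demo_inner_remove and demo_nested_lock are hypothetical names, and the mutex is deliberately created as PTHREAD_MUTEX_ERRORCHECK so that the second lock attempt returns EDEADLK instead of hanging the way a default mutex does on Linux.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the "no_forking" mutex; not the libxl object. */
static pthread_mutex_t demo_no_forking;

/* Hypothetical inner helper that takes the lock itself, playing the role
 * the commit message ascribes to sigchld_user_remove. */
static int demo_inner_remove(void)
{
    int rc = pthread_mutex_lock(&demo_no_forking);
    if (rc == 0)
        pthread_mutex_unlock(&demo_no_forking);
    return rc;
}

/* Takes the lock, then calls a helper that tries to take it again.
 * Returns the error code of the nested attempt. */
int demo_nested_lock(void)
{
    pthread_mutexattr_t attr;
    int rc;

    /* Error-checking type: a relock by the owner fails rather than blocks. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&demo_no_forking, &attr);
    pthread_mutexattr_destroy(&attr);

    /* Outer acquisition (the role of atfork_lock in the description). */
    rc = pthread_mutex_lock(&demo_no_forking);
    assert(rc == 0);

    /* Nested acquisition by the same thread. */
    rc = demo_inner_remove();

    pthread_mutex_unlock(&demo_no_forking);
    pthread_mutex_destroy(&demo_no_forking);
    return rc;
}
```

With a default (non-error-checking) mutex the nested pthread_mutex_lock call would typically never return on Linux, which matches the symptom reported here.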

This is used by xl's console child.  So, the ultimate effect is that
xl with pygrub does not manage to connect to the pygrub console.
This behaviour was reported by Michael Young in Xen 4.4.0 RC5.

Also, the use of sigchld_user_remove in libxl_postfork_child_noexec is
not correct with SIGCHLD sharing.  libxl_postfork_child_noexec is
documented to suffice if called only on one ctx.  So deregistering the
ctx it's called on is not sufficient.  Instead, we need a new approach
which discards the whole sigchld_user list and unconditionally removes
our SIGCHLD handler if we had one.
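
The "discard everything" approach described above might look roughly like the following sketch. None of these names come from libxl: demo_user, demo_register, demo_postfork_child_reset and the other demo_* helpers are hypothetical, and the real data structures and locking are assumed to differ. It only shows the shape of the idea: drop the whole registration list at once and restore the previous SIGCHLD disposition if a handler had been installed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-ctx registration record (not libxl's structure). */
struct demo_user {
    struct demo_user *next;
};

static struct demo_user *demo_users;        /* list of registered users */
static int demo_handler_installed;          /* did we install a handler? */
static struct sigaction demo_old_action;    /* disposition before ours */

static void demo_handler(int sig) { (void)sig; }

/* Register a user; install the shared SIGCHLD handler on first use. */
int demo_register(struct demo_user *u)
{
    assert(u != NULL);
    if (!demo_handler_installed) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = demo_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGCHLD, &sa, &demo_old_action) != 0)
            return -1;
        demo_handler_installed = 1;
    }
    u->next = demo_users;
    demo_users = u;
    return 0;
}

/* Child-side reset: forget every registration, not only the one belonging
 * to the ctx we were called on, and put the old disposition back. */
void demo_postfork_child_reset(void)
{
    demo_users = NULL;
    if (demo_handler_installed) {
        sigaction(SIGCHLD, &demo_old_action, NULL);
        demo_handler_installed = 0;
    }
}

/* Number of users currently registered. */
int demo_user_count(void)
{
    int n = 0;
    for (struct demo_user *u = demo_users; u != NULL; u = u->next)
        n++;
    return n;
}

/* Returns 1 if the current SIGCHLD handler is demo_handler, else 0. */
int demo_sigchld_is_ours(void)
{
    struct sigaction cur;
    if (sigaction(SIGCHLD, NULL, &cur) != 0)
        return 0;
    return cur.sa_handler == demo_handler;
}
```

Because the reset never walks the list on behalf of a single ctx, it does not matter which ctx the caller happened to pass in, which is the property the documentation promises.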

Prompted by this, clarify the semantics of
libxl_postfork_child_noexec.  Specifically, expand on the meaning of
"quickly" by explaining what operations are not permitted; and
document the fact that the function doesn't reclaim the resources in
the ctxs.

And add a comment in libxl_postfork_child_noexec explaining the
internal concurrency situation.

This is an important bugfix.  IMO the bug is a blocker for Xen 4.4.

Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
Reported-by: M A Young <m.a.young@xxxxxxxxxxxx>
CC: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>

So it looks like this path gets called from a number of other places in xl:

libxl_postfork_child_noexec() is called by xl.c:postfork().

postfork() is called in xl_cmdimpl.c by autoconnect_vncviewer(), autoconnect_console(), and do_daemonize().

do_daemonize() is called during "xl create" and "xl devd".

Was this deadlock not triggered for those, or was it triggered and nobody noticed?


