Re: [Xen-devel] [PATCH 1/3] libxl: Fix libxl_postfork_child_noexec deadlock etc.
George Dunlap wrote:
> On Mon, Feb 24, 2014 at 3:47 PM, George Dunlap
> <George.Dunlap@xxxxxxxxxxxxx> wrote:
>
>> On Mon, Feb 24, 2014 at 3:17 PM, George Dunlap
>> <george.dunlap@xxxxxxxxxxxxx> wrote:
>>
>>> On 02/24/2014 02:19 PM, Ian Jackson wrote:
>>>
>>>> libxl_postfork_child_noexec would nestedly reacquire the non-recursive
>>>> "no_forking" mutex: atfork_lock uses it, as does sigchld_user_remove.
>>>> The result on Linux is that the process always deadlocks before
>>>> returning from this function.
>>>>
>>>> This is used by xl's console child. So, the ultimate effect is that
>>>> xl with pygrub does not manage to connect to the pygrub console.
>>>> This behaviour was reported by Michael Young in Xen 4.4.0 RC5.
>>>>
>>>> Also, the use of sigchld_user_remove in libxl_postfork_child_noexec is
>>>> not correct with SIGCHLD sharing. libxl_postfork_child_noexec is
>>>> documented to suffice if called only on one ctx. So deregistering the
>>>> ctx it's called on is not sufficient. Instead, we need a new approach
>>>> which discards the whole sigchld_user list and unconditionally removes
>>>> our SIGCHLD handler if we had one.
>>>>
>>>> Prompted by this, clarify the semantics of
>>>> libxl_postfork_child_noexec. Specifically, expand on the meaning of
>>>> "quickly" by explaining what operations are not permitted; and
>>>> document the fact that the function doesn't reclaim the resources in
>>>> the ctxs.
>>>>
>>>> And add a comment in libxl_postfork_child_noexec explaining the
>>>> internal concurrency situation.
>>>>
>>>> This is an important bugfix. IMO the bug is a blocker for Xen 4.4.
>>>>
>>>> Signed-off-by: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
>>>> Reported-by: M A Young <m.a.young@xxxxxxxxxxxx>
>>>> CC: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
>>>> CC: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
>>>>
>>> So it looks like this path gets called from a number of other places in xl:
>>>
>>> libxl_postfork_child_noexec() is called by xl.c:postfork().
>>>
>>> postfork() is called in xl_cmdimpl.c by autoconnect_vncviewer(),
>>> autoconnect_console(), and do_daemonize().
>>>
>>> do_daemonize() is called during "xl create", and "xl devd".
>>>
>>> Was this deadlock not triggered for those, or was it triggered and nobody
>>> noticed?
>>>
>> In any case, I do think we need to fix this; the main question is, do
>> we need to delay the release a bit further to make sure it gets
>> sufficient testing?
>>
>
> Also, it would be nice to get a Tested-by: from someone using it with
> libvirt (before the release at least, if not before the check-in).
>
> Jim / Dario?
>

I'll update my test system to rc6 tomorrow and restart my tests. FYI,
the tests were running over the weekend on rc5 + libvirt 1.2.2 rc1.
Over 25,000 domains started, shutdown, created, saved, restored, etc.
with no problems noted.

Regards,
Jim

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
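
For reference, the failure mode described in the quoted commit message (a function that holds a non-recursive mutex calling a helper that takes the same mutex) can be sketched in isolation. This is only an illustration, not libxl's code: demo_lock, demo_init, inner_remove and nested_acquire are hypothetical names, and the mutex is deliberately created as PTHREAD_MUTEX_ERRORCHECK so that the nested lock reports EDEADLK rather than hanging. With an ordinary non-recursive mutex the second pthread_mutex_lock would block forever, which matches the deadlock reported above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a "no_forking"-style mutex; not libxl's code. */
static pthread_mutex_t demo_lock;

/* Create the mutex as error-checking so that a nested lock attempt by the
 * same thread returns EDEADLK instead of blocking forever. */
static void demo_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    rc = pthread_mutex_init(&demo_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    assert(rc == 0); /* initialisation is not expected to fail in this sketch */
}

/* Helper that takes the lock itself, playing the role the commit message
 * attributes to sigchld_user_remove. Returns the result of the lock call. */
static int inner_remove(void)
{
    int rc = pthread_mutex_lock(&demo_lock);

    if (rc == 0)
        pthread_mutex_unlock(&demo_lock);
    return rc;
}

/* Caller that already holds the lock when it invokes the helper, which is
 * the nested acquisition pattern described in the patch. */
static int nested_acquire(void)
{
    int rc;

    pthread_mutex_lock(&demo_lock);   /* outer acquisition succeeds */
    rc = inner_remove();              /* inner acquisition is refused */
    pthread_mutex_unlock(&demo_lock);
    return rc;
}
```

Calling demo_init() once and then nested_acquire() yields EDEADLK from the inner lock attempt. The usual ways out are to split the helper into a variant that expects the lock to be held already, or to avoid calling the locking helper from the locked region at all; the quoted patch takes the latter route by replacing the sigchld_user_remove call with a different approach.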