
Re: [Xen-users] "xl restore" leaks a file descriptor?



On Tue, 2015-08-11 at 18:07 +0100, Wei Liu wrote:
> On Tue, Aug 11, 2015 at 04:48:13PM +0100, Ian Campbell wrote:
> > On Tue, 2015-08-11 at 11:13 -0400, Andrew Armenia wrote:
> > > It's the checkpoint file - i.e. the command line argument to xl
> > > restore - that is being leaked.
> > 
> > Thanks.
> > 
> > [...]
> > > So the checkpoint file is clearly being leaked.
> > 
> > Indeed. I confirmed this even with the current development version using
> > ls -l /proc/<pid>/fd, which shows an fd open on a deleted file:
> > 
> > # ps aux| grep xl
> > root     20465  0.0  0.2 106036   984 ?        SLsl 15:42   0:00 xl restore save
> > # ls -l /proc/20465/fd
> > [...]
> > lr-x------. 1 root root 64 Aug 11 15:42 7 -> /root/save
> > [...]
> > # rm /root/save
> > # ls -l /proc/20465/fd
> > [...]
> > lr-x------. 1 root root 64 Aug 11 15:42 7 -> /root/save (deleted)
> > [...]
> > 
> > >  Its space is not freed
> > > until the 'xl restore' process is ended by shutting down the domain:
> > [...]
> > > 
> > > It seems like xl restore should close the checkpoint file as soon as
> > > it's done restoring the domain, allowing the space to be freed, but
> > > that's clearly not happening.
> > 
> > Right. In fact xl sets the file to be close-on-exec right after opening it,
> > which is before the daemonisation step, so it ought to be closed
> > automatically, but isn't for some reason.
> > 
> > My working theory is that something in the machinery which spawns the save
> > helper is defeating the use of CLOEXEC, perhaps by dup2() or perhaps by
> > unsetting CLOEXEC.
> > 
> > Anyway, thanks for reporting. I've copied the devel list and 4.6 RM. Wei,
> > this probably ought to be a blocker for 4.6 (and the fix ought ultimately
> > to be backported to 4.4 onwards at least).
> > 
> > NB: This leak seems to be independent of the switch to migration v2.
> > 
> > Ian.
> 
> Maybe this is just because we leak a fd.
> 
> I don't see how CLOEXEC would be of any use if xl doesn't actually exec
> anything.

Duh, for some reason I thought daemonize would activate the CLOEXEC, but
it's just fork without exec. Silly me.
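To illustrate the point, here is a minimal standalone sketch (not xl code; the function name is made up for this example). It marks an fd close-on-exec, forks, and has the child report whether the fd is still open. Since no exec happens, the flag has no effect and the child still holds the fd.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper, not part of xl or libxl.
 * Returns 1 if a close-on-exec fd is still open in a forked child,
 * 0 if it was closed, and -1 on error.
 */
int cloexec_fd_survives_fork(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Mark the fd close-on-exec, as xl does right after opening the file. */
    if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }

    if (pid == 0) {
        /* Child: fork without exec, so FD_CLOEXEC has not been acted on.
         * F_GETFD succeeds only if the fd is still open. */
        _exit(fcntl(fd, F_GETFD) != -1 ? 0 : 1);
    }

    /* Parent: closing our copy does not affect the child's copy. */
    close(fd);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status) == 0;
}
```

Calling cloexec_fd_survives_fork("/dev/null") should return 1 on a typical Linux system, which matches what we see with the daemonized xl process.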

> 
> Below is a PoC patch which seems to fix the problem for me.
> 
> ---8<---
> commit 7b5f466d5977dc9f41991ca0c2227023ac07709d
> Author: Wei Liu <wei.liu2@xxxxxxxxxx>
> Date:   Tue Aug 11 18:02:25 2015 +0100
> 
>     xl: close restore_fd when we finish with it
>     
>     Signed-off-by: Wei Liu <wei.liu2@xxxxxxxxxx>
> 
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 499a05c..525cd24 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -2846,6 +2846,10 @@ start:
>          ret = libxl_domain_create_new(ctx, &d_config, &domid,
>                                        0, autoconnect_console_how);
>      }
> +
> +    if (migrate_fd < 0)
> +        close(restore_fd);

As Andy says, I think we want restore_fd in the check; I can't see any
reason we wouldn't want to close the socket too.

For reboot handling you would need to reset the fd to < 0 after closing it,
otherwise when we come back around on reboot we will close this again.

Would it be less error prone to put this in the if (restoring) just above,
i.e. exactly where restore_fd is used and which already has the reboot
logic in place with restoring = 0?
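
Whichever place the close ends up in, the "close once, then mark invalid" idea could look roughly like the sketch below. The helper name is hypothetical and this is not the actual xl change; it only shows how resetting the fd to -1 makes a second pass through the same code (e.g. after a reboot) harmless.

```c
#include <unistd.h>

/*
 * Hypothetical helper, not part of xl: close *fd if it is valid and
 * reset it to -1 so that a later call becomes a no-op instead of
 * closing an unrelated fd that may have reused the same number.
 */
void close_and_invalidate(int *fd)
{
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}
```

With something like this, the restore path would call close_and_invalidate(&restore_fd) once the domain has been created, and coming back around the loop on reboot would not touch the fd again.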

Ian.
