[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()



On Wed, May 21, 2014 at 6:27 AM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
> On Tue, 2014-05-20 at 05:37 -0700, Luis R. Rodriguez wrote:
>> From: "Luis R. Rodriguez" <mcgrof@xxxxxxxx>
>>
>> On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
>> and today's latest xen tip from the git tree strace -f reveals
>> we end up on a never ending wait shortly after
>>
>> write(20, "backend/console/5\0", 18 <unfinished ...>
>>
>> This is right before we just wait on the qemu process which we
>> had mmap'd for. Without this you'll end up getting stuck on a
>> loop if mmap() worked but madvise() did not. While at it I noticed
>> even the mmap() error fail was not being checked, fix that too.
>>
>> Signed-off-by: Luis R. Rodriguez <mcgrof@xxxxxxxx>
>
> Acked-by: Ian Campbell <ian.campbell@xxxxxxxxxx> and applied.
>
> OOI why was madvise failing? (should be quite unusual I think?)

For the sake of Mel the context of the patch was:

>  tools/libxc/xc_linux_osdep.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/tools/libxc/xc_linux_osdep.c b/tools/libxc/xc_linux_osdep.c
> index 73860a2..86bff3e 100644
> --- a/tools/libxc/xc_linux_osdep.c
> +++ b/tools/libxc/xc_linux_osdep.c
> @@ -92,14 +92,32 @@ static void 
> *linux_privcmd_alloc_hypercall_buffer(xc_interface *xch, xc_osdep_ha
>  {
>      size_t size = npages * XC_PAGE_SIZE;
>      void *p;
> +    int rc, saved_errno;
>
>      /* Address returned by mmap is page aligned. */
>      p = mmap(NULL, size, PROT_READ|PROT_WRITE, 
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0);
> +    if ( p == MAP_FAILED )
> +    {
> +        PERROR("xc_alloc_hypercall_buffer: mmap failed");
> +        return NULL;
> +    }
>
>      /* Do not copy the VMA to child process on fork. Avoid the page being COW
>          on hypercall. */
> -    madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> +    rc = madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK);
> +    if ( rc < 0 )
> +    {
> +        PERROR("xc_alloc_hypercall_buffer: madvise failed");
> +        goto out;
> +    }
> +
>      return p;
> +
> +out:
> +    saved_errno = errno;
> +    (void)munmap(p, size);
> +    errno = saved_errno;
> +    return NULL;
>  }

I tried to dive deep into all the possible worst case scenarios of the
above but moved on after a bit, but I'll give my brain dump so far as
the question of whether or not madvise() failed is just one of the
questions that should be considered here.

Although MADV_HUGEPAGE is not *explicitly* used here strace shows its
also used. For example my traces showed the madvise() was called for
qemu, this is part of strace -f -o stuff.txt xl create
/etc/xen/debian.hvm

1735  execve("/usr/bin/qemu-system-x86_64",
["/usr/bin/qemu-system-x86_64", "-xen-domid", "1", "-chardev",
"socket,id=libxl-cmd,path=/var/ru"..., "-mon",
"chardev=libxl-cmd,mode=control", "-nodefaults", "-name",
"debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-serial", "pty",
"-device", "cirrus-vga", ...], [/* 59 vars */] <unfinished ...>

Further down I see:

1735  mmap(NULL, 1576960, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f50116000
1735  madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
1735  madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)

I believe that MADV_HUGEPAGE is used although we didn't ask for it
explicitly because of the size mmap()'d, in this case it had turned
out to be the memory that we'd want to use for our qemu process. I am
not certain but I believe this can happen automatically on kernels
that have huge page support and when the mmap()'d page is huge page
aligned. Now, the madvise() call for MADV_HUGEPAGE should not be
expected to succeed though, for example I suspect this can fail when a
system with huge page tables don't have it set up properly yet [0].
The important piece here however for us is the use of MADV_DONTFORK
given it was passed explicitly and that *can* also fail but its
important for us that is does not fail given the context under which
its being used. The consequences of MADV_DONTFORK failing and not
checking for the error vary, let's explore this particular case a bit
more. In this case its for qemu -- its unclear to me what would happen
if MADV_DONTFORK was used and it failed and we still continued but I
do have some information as to what can happen if MADV_DONTFORK is
*not* used and what gave birth to the flag -- if not used and the
parent process forks its possible that "the driver retains a reference
to the page which now belongs exclusively to the child process(es).
The parent process and the driver will no longer be able to
communicate with each other." [1]. This does not necessarily mean that
failure to check for madvise() with MADV_DONTFORK implicates the same
dangers -- I am not certain of this.

What's curious however is that since MADV_HUGEPAGE was pegged
automatically for us and since a system could not have huge page table
set up properly this madvise() can fail, but -- calls to madvise()
with MADV_HUGEPAGE should not be considered fatal, however since we
were using MADV_DONTFORK and we don't want that to fail the addition
of MADV_HUGEPAGE is perhaps not welcomed here unless we know that was
set up. Its unclear if in such cases using MADV_DONTFORK with
MADV_NOHUGEPAGE is advisable -- but my strace shows these are split up
-- we still however never checked its return value and consequences of
this are still a bit unclear to me.

In the now working situation I see now that madvise() with
MADV_HUGEPAGE and then MADV_DONTFORK fails and then see a mremap() on
the first returned address followed by the same madvise() failing
calls, after that I see a new mmap() with different mmap() flags
passed.

31318 execve("/usr/local/lib/xen/bin/qemu-system-i386",
["/usr/local/lib/xen/bin/qemu-syst"..., "-xen-domid", "2", "-chardev",
"socket,id=libxl-cmd,path=/var/ru"..., "-mon",
"chardev=libxl-cmd,mode=control", "-nodefaults", "-name",
"debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-device",
"cirrus-vga,vgamem_mb=8", "-boot", "order=cda", ...], [/* 60 vars */]
<unfinished ...>

And then:

31318 mmap(NULL, 262144, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fceade29000
31318 madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
31318 madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)

Followed by:

31318 mremap(0x7fceade29000, 262144, 266240, MREMAP_MAYMOVE) = 0x7fcea2a8f000
31318 madvise(0, 8388608, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory)
31318 madvise(0, 8388608, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory)
31318 mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) =
0x7fcea228f000

Note that my patch also covers a failure to check mmap() too --
however I believe if that failed write access to the mmap()'d area
would trigger a quick enough failure. My hope is that this is the same
for the failure to check for madvise() with MADV_DONTFORK. Given my
uncertainty I did not include all this information in the commit log.

[0] http://lwn.net/Articles/376606/
[1] http://lwn.net/Articles/171941/

  Luis

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.