Xen project Mailing List

Re: [Xen-devel] [PATCH] libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()

To: Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Mel Gorman <mgorman@xxxxxxx>

From: "Luis R. Rodriguez" <mcgrof@xxxxxxxxxxxxxxxx>

Date: Wed, 21 May 2014 09:05:57 -0700

Delivery-date: Wed, 21 May 2014 16:06:24 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, May 21, 2014 at 6:27 AM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote: > On Tue, 2014-05-20 at 05:37 -0700, Luis R. Rodriguez wrote: >> From: "Luis R. Rodriguez" <mcgrof@xxxxxxxx> >> >> On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4 >> and today's latest xen tip from the git tree strace -f reveals >> we end up on a never ending wait shortly after >> >> write(20, "backend/console/5\0", 18 <unfinished ...> >> >> This is right before we just wait on the qemu process which we >> had mmap'd for. Without this you'll end up getting stuck on a >> loop if mmap() worked but madvise() did not. While at it I noticed >> even the mmap() error fail was not being checked, fix that too. >> >> Signed-off-by: Luis R. Rodriguez <mcgrof@xxxxxxxx> > > Acked-by: Ian Campbell <ian.campbell@xxxxxxxxxx> and applied. > > OOI why was madvise failing? (should be quite unusual I think?) For the sake of Mel the context of the patch was: > tools/libxc/xc_linux_osdep.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/tools/libxc/xc_linux_osdep.c b/tools/libxc/xc_linux_osdep.c > index 73860a2..86bff3e 100644 > --- a/tools/libxc/xc_linux_osdep.c > +++ b/tools/libxc/xc_linux_osdep.c > @@ -92,14 +92,32 @@ static void > *linux_privcmd_alloc_hypercall_buffer(xc_interface *xch, xc_osdep_ha > { > size_t size = npages * XC_PAGE_SIZE; > void *p; > + int rc, saved_errno; > > /* Address returned by mmap is page aligned. */ > p = mmap(NULL, size, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS|MAP_LOCKED, -1, 0); > + if ( p == MAP_FAILED ) > + { > + PERROR("xc_alloc_hypercall_buffer: mmap failed"); > + return NULL; > + } > > /* Do not copy the VMA to child process on fork. Avoid the page being COW > on hypercall. */ > - madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK); > + rc = madvise(p, npages * XC_PAGE_SIZE, MADV_DONTFORK); > + if ( rc < 0 ) > + { > + PERROR("xc_alloc_hypercall_buffer: madvise failed"); > + goto out; > + } > + > return p; > + > +out: > + saved_errno = errno; > + (void)munmap(p, size); > + errno = saved_errno; > + return NULL; > } I tried to dive deep into all the possible worst case scenarios of the above but moved on after a bit, but I'll give my brain dump so far as the question of whether or not madvise() failed is just one of the questions that should be considered here. Although MADV_HUGEPAGE is not *explicitly* used here strace shows its also used. For example my traces showed the madvise() was called for qemu, this is part of strace -f -o stuff.txt xl create /etc/xen/debian.hvm 1735 execve("/usr/bin/qemu-system-x86_64", ["/usr/bin/qemu-system-x86_64", "-xen-domid", "1", "-chardev", "socket,id=libxl-cmd,path=/var/ru"..., "-mon", "chardev=libxl-cmd,mode=control", "-nodefaults", "-name", "debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-serial", "pty", "-device", "cirrus-vga", ...], [/* 59 vars */] <unfinished ...> Further down I see: 1735 mmap(NULL, 1576960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f50116000 1735 madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory) 1735 madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory) I believe that MADV_HUGEPAGE is used although we didn't ask for it explicitly because of the size mmap()'d, in this case it had turned out to be the memory that we'd want to use for our qemu process. I am not certain but I believe this can happen automatically on kernels that have huge page support and when the mmap()'d page is huge page aligned. Now, the madvise() call for MADV_HUGEPAGE should not be expected to succeed though, for example I suspect this can fail when a system with huge page tables don't have it set up properly yet [0]. The important piece here however for us is the use of MADV_DONTFORK given it was passed explicitly and that *can* also fail but its important for us that is does not fail given the context under which its being used. The consequences of MADV_DONTFORK failing and not checking for the error vary, let's explore this particular case a bit more. In this case its for qemu -- its unclear to me what would happen if MADV_DONTFORK was used and it failed and we still continued but I do have some information as to what can happen if MADV_DONTFORK is *not* used and what gave birth to the flag -- if not used and the parent process forks its possible that "the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other." [1]. This does not necessarily mean that failure to check for madvise() with MADV_DONTFORK implicates the same dangers -- I am not certain of this. What's curious however is that since MADV_HUGEPAGE was pegged automatically for us and since a system could not have huge page table set up properly this madvise() can fail, but -- calls to madvise() with MADV_HUGEPAGE should not be considered fatal, however since we were using MADV_DONTFORK and we don't want that to fail the addition of MADV_HUGEPAGE is perhaps not welcomed here unless we know that was set up. Its unclear if in such cases using MADV_DONTFORK with MADV_NOHUGEPAGE is advisable -- but my strace shows these are split up -- we still however never checked its return value and consequences of this are still a bit unclear to me. In the now working situation I see now that madvise() with MADV_HUGEPAGE and then MADV_DONTFORK fails and then see a mremap() on the first returned address followed by the same madvise() failing calls, after that I see a new mmap() with different mmap() flags passed. 31318 execve("/usr/local/lib/xen/bin/qemu-system-i386", ["/usr/local/lib/xen/bin/qemu-syst"..., "-xen-domid", "2", "-chardev", "socket,id=libxl-cmd,path=/var/ru"..., "-mon", "chardev=libxl-cmd,mode=control", "-nodefaults", "-name", "debian.hwm", "-vnc", "127.0.0.1:0,to=99", "-device", "cirrus-vga,vgamem_mb=8", "-boot", "order=cda", ...], [/* 60 vars */] <unfinished ...> And then: 31318 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fceade29000 31318 madvise(0, 1069547520, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory) 31318 madvise(0, 1069547520, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory) Followed by: 31318 mremap(0x7fceade29000, 262144, 266240, MREMAP_MAYMOVE) = 0x7fcea2a8f000 31318 madvise(0, 8388608, MADV_HUGEPAGE) = -1 ENOMEM (Cannot allocate memory) 31318 madvise(0, 8388608, MADV_DONTFORK) = -1 ENOMEM (Cannot allocate memory) 31318 mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_SHARED, 12, 0) = 0x7fcea228f000 Note that my patch also covers a failure to check mmap() too -- however I believe if that failed write access to the mmap()'d area would trigger a quick enough failure. My hope is that this is the same for the failure to check for madvise() with MADV_DONTFORK. Given my uncertainty I did not include all this information in the commit log. [0] http://lwn.net/Articles/376606/ [1] http://lwn.net/Articles/171941/ Luis _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.