Re: Understanding osdep_xenforeignmemory_map mmap behaviour
(add Arnd to CC)
Juergen Gross <jgross@xxxxxxxx> writes:
> On 24.03.22 02:42, Stefano Stabellini wrote:
>> I am pretty sure the reasons have to do with old x86 PV guests, so I am
>> CCing Juergen and Boris.
>>
>>> Hi,
>>>
>>> While we've been working on the rust-vmm virtio backends on Xen we
>>> obviously have to map guest memory into the userspace of the daemon.
>>> However, following the logic of what is going on is a little
>>> confusing. For example in the Linux backend we have this:
>>>
>>> void *osdep_xenforeignmemory_map(xenforeignmemory_handle *fmem,
>>>                                  uint32_t dom, void *addr,
>>>                                  int prot, int flags, size_t num,
>>>                                  const xen_pfn_t arr[/*num*/],
>>>                                  int err[/*num*/])
>>> {
>>>     int fd = fmem->fd;
>>>     privcmd_mmapbatch_v2_t ioctlx;
>>>     size_t i;
>>>     int rc;
>>>
>>>     addr = mmap(addr, num << XC_PAGE_SHIFT, prot, flags | MAP_SHARED,
>>>                 fd, 0);
>>>     if ( addr == MAP_FAILED )
>>>         return NULL;
>>>
>>>     ioctlx.num = num;
>>>     ioctlx.dom = dom;
>>>     ioctlx.addr = (unsigned long)addr;
>>>     ioctlx.arr = arr;
>>>     ioctlx.err = err;
>>>
>>>     rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);
>>>
>>> Where the fd passed down is associated with the /dev/xen/privcmd
>>> device for issuing hypercalls on userspace's behalf. What is confusing
>>> is why the function does its own mmap - one would assume the passed
>>> addr would be associated with an anonymous or file-backed mmap region
>>> that the calling code has already set up. Applying an mmap to a
>>> special device seems a little odd.
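(As an aside for anyone following along: the public entry point callers
actually use hides all of this. A rough sketch of the consumer side,
assuming the usual libxenforeignmemory API:

  #include <sys/mman.h>          /* PROT_* */
  #include <xenforeignmemory.h>

  void *map_guest_pages(uint32_t domid, const xen_pfn_t *gfns,
                        size_t num, int *err)
  {
      xenforeignmemory_handle *fmem = xenforeignmemory_open(NULL, 0);

      /* note: no address argument - the library mmap()s
       * /dev/xen/privcmd itself and then populates the range via
       * the IOCTL_PRIVCMD_MMAPBATCH_V2 ioctl shown above */
      return xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                  num, gfns, err);
  }

so the double life of the fd is invisible to the caller.)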
>>>
>>> Looking at the implementation on the kernel side it seems the mmap
>>> handler only sets a few flags:
>>>
>>> static int privcmd_mmap(struct file *file, struct vm_area_struct *vma)
>>> {
>>>     /* DONTCOPY is essential for Xen because copy_page_range doesn't
>>>      * know how to recreate these mappings */
>>>     vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTCOPY |
>>>                      VM_DONTEXPAND | VM_DONTDUMP;
>>>     vma->vm_ops = &privcmd_vm_ops;
>>>     vma->vm_private_data = NULL;
>>>
>>>     return 0;
>>> }
>>>
>>> So can I confirm that the mmap of /dev/xen/privcmd is being called
>>> for its side effects? Is it so that, when the actual ioctl is called,
>>> the correct flags are set on the pages associated with the userspace
>>> virtual address range?
>>>
>>> Can I confirm there shouldn't be any limitation on where and how the
>>> userspace virtual address space is set up for mapping the guest
>>> memory?
>>>
>>> Is there a reason why this isn't done in the ioctl path itself?
>
> For a rather long time we were using "normal" user pages for this purpose,
> which were just locked into memory for doing the hypercall.
Was this using the normal mlock() semantics to stop pages being swapped
out of RAM?
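i.e. something like the following (a sketch of the scheme I'm imagining,
not the actual old implementation):

  #include <stdlib.h>
  #include <sys/mman.h>

  void *buf = malloc(len);
  /* pin the buffer so it can't be paged out while the hypervisor
   * reads/writes it during the hypercall... */
  mlock(buf, len);
  /* ...issue the hypercall referencing buf, then release */
  munlock(buf, len);
  free(buf);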
> Unfortunately there have been very rare problems with that approach, as
> the Linux kernel can set a user page related PTE to invalid for short
> periods of time, which led to EFAULT in the hypervisor when trying to
> access the hypercall data.
I must admit I'm not super familiar with the internals of page table
handling with Linux+Xen. Doesn't the kernel need to delegate the
tweaking of page tables to the hypervisor or is it allowed to manipulate
the page tables itself?
> In Linux this can be avoided only by using kernel memory, which is the
> reason why the hypercall buffers are allocated and mmap()-ed through the
> privcmd driver.
>
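(Side note: I assume this is what libxencall's buffer helpers are doing
under the hood, i.e. roughly:

  #include <xencall.h>

  xencall_handle *xcall = xencall_open(NULL, 0);
  /* buffer is kernel memory mmap()-ed into our address space via
   * the privcmd driver, not ordinary malloc()-ed user memory */
  void *hbuf = xencall_alloc_buffer(xcall, 4096);
  /* ... use hbuf as the hypercall argument buffer ... */
  xencall_free_buffer(xcall, hbuf);

correct me if I have the wrong library.)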
>>>
>>> I'm trying to understand the differences between Xen and KVM in the
>>> API choices here. I think the equivalent is the
>>> KVM_SET_USER_MEMORY_REGION ioctl for KVM, which brings a section of
>>> the guest physical address space into the userspace's vaddr range.
>
> The main difference is just that the consumer of the hypercall buffer is
> NOT the kernel, but the hypervisor. In the KVM case both are the same, so
> a brief period of an invalid PTE can be handled just fine in KVM, while
> the Xen hypervisor has no idea that this situation will be over very
> soon.
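(For comparison, the KVM flow I have in mind is just registering an
ordinary userspace range with the VM - a rough sketch, assuming a vm_fd
from KVM_CREATE_VM and a size chosen by the caller:

  #include <stdint.h>
  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* userspace owns the memory; any old anonymous mapping will do */
  void *host = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

  struct kvm_userspace_memory_region region = {
      .slot            = 0,
      .guest_phys_addr = 0,
      .memory_size     = size,
      .userspace_addr  = (uint64_t)host,
  };
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

no second mmap of a special device required.)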
I still don't follow the details of why we have the separate mmap. Is it
purely because the VM flags of the special file can be changed in a way
that can't be done with a traditional file-backed mmap?
I can see various other devices have their own setting of vm flags but
VM_DONTCOPY for example can be set with the appropriate madvise call:
  MADV_DONTFORK (since Linux 2.6.16)
         Do not make the pages in this range available to the child
         after a fork(2). This is useful to prevent copy-on-write
         semantics from changing the physical location of a page if
         the parent writes to it after a fork(2). (Such page
         relocations cause problems for hardware that DMAs into the
         page.)
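i.e. if the region were an ordinary mapping the caller had set up, it
could in principle opt out of fork() copying itself - a sketch:

  #include <sys/mman.h>

  /* an ordinary anonymous region created by the calling code... */
  void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  /* ...which then opts out of being copied into children, much as
   * privcmd_mmap() forces VM_DONTCOPY on its vma */
  madvise(addr, len, MADV_DONTFORK);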
For the vhost-user work we need to be able to share the guest memory
between the xen-vhost-master (which is doing the ioctls to talk to Xen)
and the vhost-user daemon (which doesn't know about hypervisors but just
deals in memory and events).
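(Concretely, vhost-user wants the memory behind an fd it can pass over
the control socket - a sketch assuming a memfd-based region, which is
how we would like to arrange things:

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <unistd.h>

  int mfd = memfd_create("guest-ram", MFD_CLOEXEC);
  ftruncate(mfd, size);
  void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, mfd, 0);
  /* mfd is later handed to the vhost-user daemon over the unix
   * socket (SCM_RIGHTS) and mmap()-ed there independently */

which doesn't fit naturally with the mapping having to come from
mmap()-ing /dev/xen/privcmd.)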
Would it be enough to loosen the API and just have xen_remap_pfn()
verify the kernel's VM flags are appropriately set before requesting
that Xen update the page tables?
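Something along these lines (entirely hypothetical, just to illustrate
the shape of the loosened API):

  /* caller prepares its own region with suitable properties... */
  void *addr = mmap(NULL, num << XC_PAGE_SHIFT, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  madvise(addr, num << XC_PAGE_SHIFT, MADV_DONTFORK);

  /* ...and the ioctl path merely verifies the vma flags are
   * acceptable before asking Xen to update the page tables,
   * instead of requiring the vma to come from mmap()-ing
   * /dev/xen/privcmd itself */
  ioctlx.addr = (unsigned long)addr;
  rc = ioctl(fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &ioctlx);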
--
Alex Bennée