[Xen-devel] RFC: vNUMA project
# What's already implemented?

PV vNUMA support in libxl/xl and the Linux kernel.

# What's planned but not yet implemented?

NUMA-aware ballooning, HVM vNUMA.

# How is vNUMA used in toolstack and Xen?

At the libxl level, the user (xl or another higher-level toolstack) can specify the number of vnodes, the size of each vnode, the vnode-to-pnode mapping, the vcpu-to-vnode mapping, and the distances between local and remote nodes. Libxl then generates one or more vmemranges for each vnode. More than one vmemrange per vnode may be needed to accommodate memory holes; one example is a PV guest with e820_host=1 in its config file and more than 4G of RAM.

The generated information is also stored in Xen. It is used in two scenarios: to be retrieved by a PV guest, and to implement NUMA-aware ballooning.

# How is vNUMA used in guest?

When a PV guest boots, it issues a hypercall to retrieve its vNUMA information: the number of vnodes, the size of each vnode, the vcpu-to-vnode mapping and finally an array of vmemranges. The guest can then massage these pieces of information for its own use (the guest-side data is sketched further down, before the planning section).

An HVM guest will still use ACPI to initialise NUMA; the ACPI tables are arranged by hvmloader.

# NUMA-aware ballooning

It's agreed that NUMA-aware ballooning should be achieved solely in the hypervisor. Everything should happen under the hood, without the guest knowing the vnode-to-pnode mapping.

As far as I can tell, existing guests (Linux and FreeBSD) use XENMEM_populate_physmap to balloon up. There is also a hypercall called XENMEM_increase_reservation, but it is not used by Linux or FreeBSD.

I can think of two options to implement NUMA-aware ballooning:

1. Modify XENMEM_populate_physmap to take the vNUMA hint into account when it allocates a page for the guest.
2. Introduce a new hypercall dedicated to vNUMA ballooning. Its functionality would be similar to XENMEM_populate_physmap, but it would only be used for ballooning so that we don't break XENMEM_populate_physmap.

Option #1 requires less modification to the guest, because the guest won't need to switch to a new hypercall (sketched further down). It's unclear at this point what Xen should do if a guest asks to populate a gpfn that doesn't belong to any vnode. Should it be permissive or strict? If Xen is strict (say, it refuses to populate a gpfn that doesn't belong to a vnode), that makes HVM vNUMA harder to implement: hvmloader may try to populate firmware pages which live in a memory hole, and a memory hole doesn't belong to any vnode.

For option #2, the question is whether Xen should be permissive or strict towards a guest that uses vNUMA but doesn't use the new hypercall to balloon up.

# HVM vNUMA

HVM vNUMA is implemented as follows:

1. Libxl generates vNUMA information and passes it to hvmloader.
2. Hvmloader builds the SRAT table (sketched further down).

Note that hvmloader is capable of relocating memory, which means the toolstack and the guest can have different ideas of the memory layout. This makes NUMA-aware ballooning for HVM guests tricky to implement, because toolstack-to-hvmloader communication is one way and the hypervisor shares the toolstack's view of the guest memory layout. Hvmloader should not be allowed to adjust the memory layout; otherwise Xen will use the wrong hinting information and the end result is certainly wrong.

To have basic HVM vNUMA support, we should disallow memory relocation and discourage ballooning if vNUMA is enabled for an HVM guest. We also need to disable populate-on-demand, as the PoD pool in Xen is not NUMA-aware. We can then gradually lift these limits as we decide what to do about them.
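Before moving on to planning, here are a few rough sketches for the sections above; none of this is final code or ABI. The first one shows the guest-side data described in "How is vNUMA used in guest?": the structure layout and names (vnuma_info, vmemrange, account_vnode_sizes) are illustrative only, and the point is just how a guest could fold the vmemrange array back into per-node sizes before feeding its own NUMA init (e.g. numa_add_memblk()-style calls on Linux).

/* Illustrative only: names and layout are NOT a final hypercall ABI. */
#include <stdint.h>
#include <stdio.h>

struct vmemrange {
    uint64_t start, end;   /* guest physical range [start, end) */
    uint32_t nid;          /* vnode this range belongs to */
};

struct vnuma_info {
    uint32_t nr_vnodes;
    uint32_t nr_vcpus;
    uint32_t nr_vmemranges;
    uint32_t *vdistance;      /* nr_vnodes * nr_vnodes matrix */
    uint32_t *vcpu_to_vnode;  /* nr_vcpus entries */
    struct vmemrange *vmemrange;
};

/* Sum up the memory assigned to each vnode from the vmemrange array. */
static void account_vnode_sizes(const struct vnuma_info *vi, uint64_t *size)
{
    for (uint32_t i = 0; i < vi->nr_vmemranges; i++)
        size[vi->vmemrange[i].nid] +=
            vi->vmemrange[i].end - vi->vmemrange[i].start;
}

int main(void)
{
    /* Two vnodes; vnode 1 is split into two vmemranges by a hole at 3G-4G,
     * as can happen with e820_host=1 and more than 4G of RAM. */
    struct vmemrange vmr[] = {
        { .start = 0,          .end = 2ULL << 30, .nid = 0 },
        { .start = 2ULL << 30, .end = 3ULL << 30, .nid = 1 },
        { .start = 4ULL << 30, .end = 5ULL << 30, .nid = 1 },
    };
    struct vnuma_info vi = { .nr_vnodes = 2, .nr_vmemranges = 3,
                             .vmemrange = vmr };
    uint64_t size[2] = { 0 };

    account_vnode_sizes(&vi, size);
    for (uint32_t n = 0; n < vi.nr_vnodes; n++)
        printf("vnode %u: %llu MiB\n", n,
               (unsigned long long)(size[n] >> 20));
    return 0;
}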
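The second sketch is for option #1 of NUMA-aware ballooning: the lookup Xen would have to do per populated gpfn, i.e. map the gpfn to a vnode via the stored vmemranges, then map that vnode to a pnode and use it as an allocation hint. All names here (guest_vnuma, gpfn_to_vnode, vnuma_alloc_hint) are hypothetical; this is not the actual populate_physmap code, and it deliberately leaves open the permissive-vs-strict question for gpfns that fall into holes.

/* Hypothetical sketch, not actual Xen code. */
#include <stdint.h>

#define INVALID_NODE ((uint32_t)~0)

struct vmemrange {
    uint64_t start, end;  /* gpfn range [start, end), in frames */
    uint32_t nid;         /* vnode */
};

struct guest_vnuma {
    uint32_t nr_vnodes;
    uint32_t nr_vmemranges;
    const struct vmemrange *vmemrange;
    const uint32_t *vnode_to_pnode;   /* set up by the toolstack */
};

/* Map a gpfn to the vnode whose vmemrange contains it, or INVALID_NODE
 * if it falls into a hole (this is the permissive-vs-strict case). */
static uint32_t gpfn_to_vnode(const struct guest_vnuma *v, uint64_t gpfn)
{
    for (uint32_t i = 0; i < v->nr_vmemranges; i++)
        if (gpfn >= v->vmemrange[i].start && gpfn < v->vmemrange[i].end)
            return v->vmemrange[i].nid;
    return INVALID_NODE;
}

/* Turn the vNUMA hint into a physical node to pass to the page
 * allocator; fall back to "no preference" when the gpfn is unassigned. */
static uint32_t vnuma_alloc_hint(const struct guest_vnuma *v, uint64_t gpfn)
{
    uint32_t vnode = gpfn_to_vnode(v, gpfn);

    return vnode == INVALID_NODE ? INVALID_NODE : v->vnode_to_pnode[vnode];
}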
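The third sketch is for the HVM side: the SRAT entries hvmloader would need to emit are plain ACPI structures, one memory affinity entry per vmemrange handed over by libxl. The structure layout follows the ACPI spec's Memory Affinity structure (type 1, 40 bytes); how hvmloader actually receives the vNUMA information, and the function names below, are made up for illustration. Processor affinity entries (one per vcpu, from vcpu_to_vnode) and the SLIT distances (from vdistance) would be built in the same spirit.

/* Sketch of filling an ACPI SRAT Memory Affinity structure.
 * Function and variable names are illustrative. */
#include <stdint.h>
#include <string.h>

struct srat_mem_affinity {
    uint8_t  type;              /* 1 = Memory Affinity */
    uint8_t  length;            /* 40 */
    uint32_t proximity_domain;  /* vnode number */
    uint16_t reserved1;
    uint32_t base_address_lo;
    uint32_t base_address_hi;
    uint32_t length_lo;
    uint32_t length_hi;
    uint32_t reserved2;
    uint32_t flags;             /* bit 0: enabled */
    uint64_t reserved3;
} __attribute__((packed));

/* One SRAT memory entry per vmemrange. */
static void fill_srat_mem(struct srat_mem_affinity *e,
                          uint32_t vnode, uint64_t base, uint64_t size)
{
    memset(e, 0, sizeof(*e));
    e->type = 1;
    e->length = sizeof(*e);
    e->proximity_domain = vnode;
    e->base_address_lo = (uint32_t)base;
    e->base_address_hi = (uint32_t)(base >> 32);
    e->length_lo = (uint32_t)size;
    e->length_hi = (uint32_t)(size >> 32);
    e->flags = 1;               /* enabled */
}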
# Planning

There are many moving parts that don't fit well together. I think a valid strategy is to impose some limitations on vNUMA and other features, either by restricting them in the toolstack or in documentation, and then lift these limitations in stages.

First stage:

             Basic   PoD   Ballooning   Mem_relocation
    PV/PVH     Y     n/a       X             n/a
    HVM        Y      X        X              X

  Implement the basic functionality of vNUMA, i.e. booting a guest
  (PV/HVM) with vNUMA support.

Second stage:

             Basic   PoD   Ballooning   Mem_relocation
    PV/PVH     Y     n/a       Y             n/a
    HVM        Y      X        Y              X

  Implement NUMA-aware ballooning.

Third stage:

             Basic   PoD   Ballooning   Mem_relocation
    PV/PVH     Y     n/a       Y             n/a
    HVM        Y      Y        Y              X

  NUMA-aware PoD?

Fourth stage:

             Basic   PoD   Ballooning   Mem_relocation
    PV/PVH     Y     n/a       Y             n/a
    HVM        Y      Y        Y              Y

  Implement a bi-directional communication mechanism so that we can
  allow memory relocation in hvmloader?

The third stage onwards is less concrete at this point.

Thoughts?

Wei.