Xen project Mailing List

Re: [Xen-devel] [PATCH 00/11] PV NUMA Guests

To: "Cui, Dexuan" <dexuan.cui@xxxxxxxxx>

Date: Thu, 15 Apr 2010 13:19:48 -0400

Cc: Andre Przywara <andre.przywara@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>

Delivery-date: Thu, 15 Apr 2010 10:20:48 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On Wed, Apr 14, 2010 at 1:18 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx> wrote:
> Dulloor wrote:
>> On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx>
>> wrote:
>>> Keir Fraser wrote:
>>>> I would like Acks from the people working on HVM NUMA for this patch
>>>> series. At the very least it would be nice to have a single user
>>>> interface for setting this up, regardless of whether for a PV or HVM
>>>> guest. Hopefully code in the toolstack also can be shared. So I'm
>>> Yes, I strongly agree we should share one interface, e.g., the
>>> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used
>>> in the hvm numa case and some parts of the toolstack could be
>>> shared, I think. I also replied in another thread and supplied some
>>> similarities I found in Andre/Dulloor's patches.
>>>
>> IMO PV NUMA guests and HVM NUMA guests could share most of the code
>> from toolstack - for instance, getting the current state of machine,
>> deciding on a strategy for domain memory allocation, selection of
>> nodes, etc. They diverge only at the actual point of domain
>> construction. PV NUMA uses enlightenments, whereas HVM would need
>> working with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
>> that we need to converge.
> Hi Dulloor,
> In your patches, the toolstack tries to figure out the "best fit nodes"
> for a PV guest and invokes a hypercall set_domain_numa_layout to tell
> the hypervisor to remember the info, and later the PV guest invokes a
> hypercall get_domain_numa_layout to retrieve the info from the
> hypervisor.
> Can this be changed to: the toolstack writes the guest numa info
> directly into a new field in the start_info (or the shared_info) (maybe
> in the standard format of the SRAT/SLIT) and later the PV guest reads
> the info and uses acpi_numa_init() to parse the info?
> I think in this way the new hypercalls can be avoided and the pv numa
> enlightenment code in the guest kernel can be minimized.
> I'm asking this because this is how Andre's HVM numa patches do it (the
> toolstack passes the info to hvmloader and the latter builds SRAT/SLIT
> for the guest)

Hi Cui,

In my first version of patches (for making dom0 a numa guest), I had
put this information into start_info
(http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00630.html).
But, after that I thought this new approach is better (for pv numa and
maybe even hvm numa) for the following reasons:

- For PV NUMA guests, there are more places where the enlightenment
  might be useful. For instance, in the attached (refreshed) patch, I
  have used the enlightenment to support ballooning (without changing
  node mappings) for PV NUMA guests. Similarly, there are other places
  within the hypervisor as well as in the VM where I plan to use the
  domain_numa_layout. That's the main reason for choosing this approach.
  Although I am not sure, I think this could be useful for HVM too
  (maybe with PV on HVM).

- Using the hypercall interface is equally simple. And, also with
  start-info, I wasn't sure if it looks clean to add feature-specific
  variables (useful only with PV NUMA guests) to start-info (or even
  shared info), changing the xen-vm interface, adding (unnecessary)
  changes for compat, etc.

Please let me know your thoughts.

>
> xc_select_best_fit_nodes() decides the "min-set" of host nodes that
> will be used for the guest. It only considers the current memory usage
> of the system. Maybe we should also consider the cpu load? And the
> number of the nodes must be 2^n? And how to handle the case
> #vcpu < #vnode?
> And it looks like your patches only consider the guest's memory
> requirement -- guest's vcpu requirement is neglected? e.g., a guest may
> not need a very large amount of memory while it needs many vcpus.
> xc_select_best_fit_nodes() should consider this when determining the
> number of vnodes.

I agree with you. I was planning to consider vcpu load as the next
step. Also, I am looking for a good heuristic.
I looked at the nodeload heuristic (currently in xen), but found it too
naive. But, if you/Andre think it is a good heuristic, I will add the
support. Actually, I think in future we should do away with strict
vcpu-affinities and rely more on a scheduler with necessary NUMA support
to complement our placement strategies.

As of now, we don't SPLIT, if #vcpu < #vnode. We use STRIPING in that
case.

>
>>>> On 04/04/2010 20:30, "Dulloor" <dulloor@xxxxxxxxx> wrote:
>>>>
>>>>> The set of patches implements virtual NUMA-enlightenment to support
>>>>> NUMA-aware PV guests. In more detail, the patch implements the
>>>>> following :
>>>>>
>>>>> * For the NUMA systems, the following memory allocation strategies
>>>>>   are implemented :
>>>>>   - CONFINE : Confine the VM memory allocation to a single node. As
>>>>>     opposed to the current method of doing this in python, the patch
>>>>>     implements this in libxc (along with other strategies) and with
>>>>>     assurance that the memory actually comes from the selected node.
>>>>>   - STRIPE : If the VM memory doesn't fit in a single node and if
>>>>>     the VM is not compiled with guest-numa-support, the memory is
>>>>>     allocated striped across a selected max-set of nodes.
>>>>>   - SPLIT : If the VM memory doesn't fit in a single node and if the
>>>>>     VM is compiled with guest-numa-support, the memory is allocated
>>>>>     split (equally for now) from the min-set of nodes. The VM is
>>>>>     then made aware of this NUMA allocation (virtual NUMA
>>>>>     enlightenment).
>>>>>   - DEFAULT : This is the existing allocation scheme.
>>>>>
>>>>> * If the numa-guest support is compiled into the PV guest, we add
>>>>>   numa-guest-support to xen features elfnote. The xen tools use this
>>>>>   to determine if SPLIT strategy can be applied.
>>>>>
>>> I think this looks too complex to allow a real user to easily
>>> determine which one to use...
>> I think you misunderstood this.
>> For the first version, I have implemented an automatic global domain
>> memory allocation scheme, which (when enabled) applies to all domains
>> on a NUMA machine. I am of the opinion that users are seldom in a
>> state to determine which strategy to use. They would want the best
>> possible performance for their VM at any point of time, and we can only
>> guarantee the best possible performance, given the current state of the
>> system (how the free memory is scattered across nodes, distance between
>> those nodes, etc). In that regard, this solution is the simplest.
> Ok, I see.
> BTW: I think actually currently Xen can handle the case CONFINE pretty
> well, e.g., when no vcpu affinity is explicitly specified, the toolstack
> tries to choose a "best" host node for the guest and pins all vcpus of
> the guest to the host node.

But, currently it is done in python code and also it doesn't use the
exact_node interface. I added this to the libxc toolstack for the sake
of completeness (CONFINE is just a special case of SPLIT). Also, with
libxl catching up, we might anyway want to do these things in libxc,
where it is accessible to both xm and xl.

>
>>> About the CONFINE strategy -- looks this is not a useful usage model
>>> to me -- do we really think it's a typical usage model to
>>> ensure a VM's memory can only be allocated on a specified node?
>> Not all VMs are large enough not to fit into a single node (note that
>> user doesn't specify a node). And, if a VM can be fit into a single
>> node, that is obviously the best possible option for a VM.
>>
>>> The definitions of STRIPE and SPLIT also don't sound like typical
>>> usage models to me.
>> There are only two possibilities. Either the VM fits in a single node
>> or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
>> optimize the solution when the VM doesn't fit in a single node. The
>> aim is to reduce the number of inter-node accesses (SPLIT) and/or
>> provide a more predictable performance (STRIPE).
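
The strategy choice described above can be sketched roughly as follows.
This is not taken from the patches: choose_strategy() and its parameters
are hypothetical names, and the node-selection heuristic (order nodes by
free memory, treat the STRIPE "max-set" as every node with free memory)
is an assumption made only for illustration. The rules drawn from the
thread are that a VM which fits in one node is confined, that SPLIT
needs a NUMA-aware guest and is not used when #vcpu < #vnode, and that
STRIPE is used otherwise.

```python
from typing import Dict, List, Tuple


def choose_strategy(vm_mem_mb: int,
                    node_free_mb: Dict[int, int],
                    guest_numa_aware: bool,
                    vcpus: int) -> Tuple[str, List[int]]:
    """Return (strategy, host nodes) for a new domain (hypothetical helper)."""
    # Assumed heuristic: prefer nodes with the most free memory.
    by_free = sorted(node_free_mb, key=lambda n: node_free_mb[n], reverse=True)

    # CONFINE: the whole VM fits in a single node.
    if by_free and node_free_mb[by_free[0]] >= vm_mem_mb:
        return "CONFINE", [by_free[0]]

    # SPLIT: smallest set of nodes that can each hold an equal share.
    if guest_numa_aware:
        for count in range(2, len(by_free) + 1):
            share = -(-vm_mem_mb // count)  # ceiling division
            chosen = by_free[:count]
            if all(node_free_mb[n] >= share for n in chosen):
                if vcpus >= count:  # no SPLIT when #vcpu < #vnode
                    return "SPLIT", chosen
                break

    # STRIPE: spread over every node with free memory (assumed max-set).
    usable = [n for n in by_free if node_free_mb[n] > 0]
    if usable and sum(node_free_mb[n] for n in usable) >= vm_mem_mb:
        return "STRIPE", usable

    # Nothing fits: fall back to the existing allocation scheme.
    return "DEFAULT", []
```

With free memory of 4096, 4096 and 1024 MB on three nodes, a 6000 MB
NUMA-aware guest with four vcpus would be split over the first two
nodes, while the same guest with a single vcpu would be striped instead.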
>>
>>> Why must tools know if the PV kernel is built with guest numa
>>> support or not?
>> What is the point of arranging the memory amenable for construction of
>> nodes in guest if the guest itself is not compiled to do so.
> I meant: to simplify the implementation, the toolstack can always supply
> the numa config info to the guest *if necessary*, no matter if the guest
> kernel is numa-enabled or not (even if the guest kernel isn't
> numa-enabled, the guest performance may be better if the toolstack
> decides to supply a numa config to the guest)
> About the "*if necessary*": Andre and I think the user should supply an
> option "guestnode" in the guest config file, and you think the toolstack
> should be able to automatically determine a "best" value. I raised some
> questions about xc_select_best_fit_nodes() in the above paragraph.
> Hi Andre, would you like to comment on this?

How about an "automatic" global option along with a VM-level
"guestnode" option. These options could work independently or with
each other ("guestnode" would take preference over global "automatic"
option). We can work out finer details.

>
>>
>>> If a user configures guest numa to "on" for a pv guest, the tools
>>> can supply the numa info to PV kernel even if the pv kernel is not
>>> built with guest numa support -- the pv kernel will neglect the info
>>> safely;
>>> If a user configures guest numa to "off" for a pv guest and the
>>> tools don't supply the numa info to PV kernel, and if the pv kernel
>>> is built with guest numa support, the pv kernel can easily detect
>>> this by your new hypercall and will not enable numa.
>> These error checks are done even now. But, by checking if the PV
>> kernel is built with guest numa support, we don't require the user to
>> configure yet another parameter. Wasn't that your concern too in the
>> very first point ?
>>
>>>
>>> When a user finds the computing capability of a single node can't
>>> satisfy the actual need and hence wants to use guest numa,
>>> since the user has specified the amount of guest memory and the
>>> number of vcpus in guest config file, I think the user only needs
>>> to specify how many guest nodes (the "guestnodes" option in Andre's
>>> patch) the guest will see, and the tools and the hypervisor
>>> should co-work to distribute guest memory and vcpus uniformly among
>>> the guest nodes (I think we may not want to support non-uniform
>>> nodes as that doesn't look like a typical usage model) -- of
>>> course, maybe a specified node doesn't have the expected
>>> amount of memory -- in this case, the guest can continue to run with
>>> a slower speed (we can print a warning message to the
>>> user); or, if the user does care about predictable guest
>>> performance, the guest creation should fail.
>>
>> Please observe that the patch does all these things plus some more.
>> For one, "guestnodes" option doesn't make sense, since as you observe,
>> it needs the user to carefully read the state of the system when
>> starting the domain and also the user needs to make sure that the
>> guest itself is compiled with numa support. The aim should be to
> I think it's not difficult for a user to specify "guestnodes" and to
> check if a PV/HVM guest kernel is numa-enabled or not (anyway, a user
> needs to ensure that to achieve the optimal performance). "xm
> info/list/vcpu-list" should already supply enough info. I think it's
> reasonable to assume a numa user has more knowledge than a preliminary
> user. :-)
>
> I suppose Andre would argue more for the "guestnodes" option.
>
> PV guest can use the ELFnote as a hint to the toolstack. This may be
> used as a kind of optimization.
> HVM guest can't use this.

As mentioned above, I think we have a good case for both global and
VM-level options. What do you think ?
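
The proposed interplay of the two options, with the VM-level setting
taking preference over the global one, could look something like the
sketch below. Nothing here comes from either patch series:
resolve_guest_nodes() is a hypothetical helper, and passing the
toolstack's own choice in as auto_pick is an assumption made to keep
the example self-contained.

```python
from typing import Optional


def resolve_guest_nodes(guestnode: Optional[int],
                        automatic: bool,
                        auto_pick: int) -> Optional[int]:
    """Number of virtual nodes to expose, or None for no guest NUMA.

    guestnode -- per-VM value from the guest config, None if not set
    automatic -- global "automatic" placement option
    auto_pick -- node count the toolstack would choose by itself (assumed input)
    """
    if guestnode is not None:
        return guestnode      # VM-level option takes preference
    if automatic:
        return auto_pick      # otherwise fall back to the global option
    return None               # neither option set: keep existing behaviour
```

A guest config that sets the VM-level option to 4 gets four virtual
nodes whatever the global setting is; a config without it follows the
global option.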
>
>> automate this part and provide the best performance, given the current
>> state. The patch attempts to do that. Secondly, when the guests are
>> not compiled with numa support, they would still want a more
>> predictable (albeit average) performance. And, by striping the memory
>> across the nodes and by pinning the domain vcpus to the union of those
>> nodes' processors, applications (of substantial sizes) could be
>> expected to see more predictable performance.
>>>
>>> How do you like this? My thought is we can make things simple in the
>>> first step. :-)
>> Please let me know if my comments are not clear. I agree that we
>> should shoot for simplicity and also for a common interface. Hope we
>> will get there :)
> Thanks a lot for all the explanation and discussion.
> Yes, we need to agree on a common interface to avoid confusion.
> And I still think the "guestnodes/uniform_nodes" idea is more
> straightforward and the implementation is simpler. :-)
>
> Thanks,
> -- Dexuan

thanks
dulloor
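
For the STRIPE case mentioned in the thread, pinning the domain's vcpus
to the union of the selected nodes' processors amounts to a set union
over a node-to-CPU map. The sketch below assumes such a map is
available as a plain dictionary; stripe_affinity() and node_cpus are
hypothetical names, not libxc interfaces.

```python
from typing import Dict, Iterable, List, Set


def stripe_affinity(nodes: Iterable[int],
                    node_cpus: Dict[int, Set[int]]) -> List[int]:
    """CPUs that every vcpu of a striped domain may run on."""
    allowed: Set[int] = set()
    for node in nodes:
        allowed |= node_cpus[node]  # add this node's processors
    return sorted(allowed)
```

Every vcpu gets the same mask, so the scheduler can move vcpus freely
among the nodes that hold the striped memory but not onto other nodes.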

Attachment: numa-ballooning.patch
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
