Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Cui, Dexuan wrote:
> Hi Andre,
> I'm also looking into hvm guest's numa support and I'd like to share my
> thoughts and supply my understanding about your patches.
>
> 1) Besides SRAT, I think we should also build guest SLIT according to host
> SLIT.

That is probably right, though currently low priority. Let's get the basics upstream first.

> 2) I agree we should supply the user a way to specify which guest node should
> have how much memory, namely, the "nodemem" parameter in your patch02.
> However, I can't find where it is assigned a value in your patches. I guess
> you missed it in image.py.

Omitted for now. I wanted to keep the first patches clean and had a hard time propagating arrays from the config files down to libxc. Is there a good explanation of the different kinds of config file options? I see different classes (like HVM only) along with some legacy parts that appear quite confusing to me.

> And what if xen can't allocate memory from the specified host node (e.g.,
> not enough free memory on that host node)?
> -- currently xen *silently* tries to allocate memory from other host nodes --
> this would hurt guest performance while the user doesn't know that at all!
> I think we should add an option in the guest config file: if it's set, the
> guest creation should fail if xen cannot allocate memory from the specified
> host node.

Exactly that scenario I also had in mind: provide some kind of numa=auto option in the config file to let Xen automatically split up the memory allocation across different nodes if needed. I think we need an upper limit here, or maybe something like:
numa={force,allow,deny}
numanodes=2
The numa=allow option would only spread the allocation over up to 2 nodes if no single node can satisfy the memory request.

> 3) In your patch02:
> + for (i = 0; i < numanodes; i++)
> +     numainfo.guest_to_host_node[i] = i % 2;
> As you said in the mail "[PATCH 5/5]", at present it is "simply round robin
> until the code for automatic allocation is in place".
> I think "simply round robin" is not acceptable and we should implement
> "automatic allocation".

Right, but this depends on the one part I missed. The first part of this is the xc_nodeload() function. I will try to provide the missing part this week.

> 4) Your patches try to sort the host nodes using a node load evaluation
> algorithm, and require the user to specify how many guest nodes the guest
> should see, and distribute guest vcpus equally across the guest nodes.
> I don't think the algorithm could be wise enough every time, and it's not
> flexible. Requiring the user to specify the number of guest nodes and
> distributing vcpus equally across the guest nodes also doesn't sound wise
> or flexible enough.

Another possible extension. I had some draft with node_cpus=[1,2,1] to put one vCPU in the first and third node and two vCPUs in the second node, although I omitted it from the first "draft" release.
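To make the proposed options more concrete, a guest config fragment could look like the sketch below. None of these NUMA options exist in current xm/xend; numa=, numanodes=, node_cpus= are only the names proposed in this thread, nodemem comes from patch02, and the values are made up for illustration:

memory    = 4096
vcpus     = 4
# proposed NUMA options (sketch only, not implemented):
numa      = "allow"          # one of force / allow / deny
numanodes = 2                # upper limit of guest NUMA nodes
nodemem   = [ 2048, 2048 ]   # guest memory (MB) per guest node
node_cpus = [ 2, 2 ]         # number of vCPUs per guest node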
> Since guest numa needs vcpu pinning to work as expected, how about my
> thoughts below?
>
> a) ask the user to use the "cpus" option to pin each vcpu to a physical cpu
> (or node);
> b) find out how many physical nodes (host nodes) are involved and use that
> number as the number of guest nodes;
> c) each guest node corresponds to a host node found in step b); use this
> info to fill numainfo.guest_to_host_node[] in 3).

My idea is:
1) use xc_nodeload() to get a list of host nodes with the respective amount of free memory
2) either use the user-provided number of guest nodes or determine the number based on the memory availability (=n)
3) select the <n> best nodes from the list (algorithm still to be discussed, but a simple approach is sufficient for the first time)
4) populate numainfo.guest_to_host_node accordingly
5) pin vCPUs based on this array
This is basically the missing function (TM) I described earlier.
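A very rough sketch of what that missing function could look like follows. The node_info structure and select_host_nodes() are hypothetical stand-ins, not code from the patch series; a real implementation would take the per-node free memory from xc_nodeload() and would probably also weigh the node's CPU load, but the basic idea of picking the <n> host nodes with the most free memory is the same:

/*
 * Sketch only: pick the <numanodes> host nodes with the most free
 * memory and map the guest nodes onto them.  struct node_info and
 * select_host_nodes() are made-up names; in the patch series the
 * data would come from xc_nodeload().
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct node_info {
    int      node_id;    /* host node number */
    uint64_t free_mem;   /* free memory on that node, in KiB */
};

/* sort host nodes by free memory, largest first */
static int cmp_free_mem(const void *a, const void *b)
{
    const struct node_info *na = a, *nb = b;

    if (na->free_mem == nb->free_mem)
        return 0;
    return (na->free_mem > nb->free_mem) ? -1 : 1;
}

/*
 * Fill guest_to_host_node[0..numanodes-1] with the host nodes that
 * currently have the most free memory.  Returns 0 on success, -1 if
 * the host has fewer nodes than requested.
 */
static int select_host_nodes(struct node_info *host, int nr_host_nodes,
                             int numanodes, int *guest_to_host_node)
{
    int i;

    if (numanodes > nr_host_nodes)
        return -1;

    qsort(host, nr_host_nodes, sizeof(*host), cmp_free_mem);

    for (i = 0; i < numanodes; i++)
        guest_to_host_node[i] = host[i].node_id;

    return 0;
}

int main(void)
{
    /* made-up example host: 4 nodes with different amounts of free memory */
    struct node_info host[] = {
        { 0, 1024 * 1024 }, { 1, 4096 * 1024 },
        { 2,  512 * 1024 }, { 3, 2048 * 1024 },
    };
    int guest_to_host_node[2];
    int i;

    if (select_host_nodes(host, 4, 2, guest_to_host_node) == 0)
        for (i = 0; i < 2; i++)
            printf("guest node %d -> host node %d\n",
                   i, guest_to_host_node[i]);
    return 0;
}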
> 5) I think we also need to present the numa guest with a virtual cpu topology,
> e.g., through the initial APIC ID. In current xen, apic_id = vcpu_id * 2;
> even if we have the guest SRAT support and use 2 guest nodes for a vcpus=n
> guest, the guest would still think it's on a package with n cores, without
> knowledge of the vcpu and cache topology, and this would do harm to the
> performance of the guest.
> I think we can use each guest node as a guest package and, by giving the
> guest a proper APIC ID (consisting of guest SMT_ID/Core_ID/Package_ID),
> show the vcpu topology to the guest.
> This needs changes to the hvmloader's SRAT/MADT APIC ID fields and xen's
> cpuid/vlapic emulation.

The APIC ID scenario does not work on AMD CPUs, which don't have a bit-field based association between compute units and APIC IDs. For NUMA purposes SRAT should be sufficient, as it overrides APIC based decisions. But you are right in that it needs more CPUID / ACPI tweaking to get the topology right, although this should be addressed in separate patches: currently(?) it is very cumbersome to inject a specific "cores per socket" number into Xen (by tweaking those ugly CPUID bit masks). For QEMU/KVM I introduced an easy config scheme (smp=8,cores=2,threads=2) to allow this (purely CPUID based). If only I had time for this I would do it for Xen, too.
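For reference, on the QEMU/KVM command line the scheme mentioned above looks roughly like the following; the exact set of -smp sub-options depends on the QEMU version:

qemu-system-x86_64 ... -smp 8,sockets=2,cores=2,threads=2 ...

i.e. 2 sockets x 2 cores x 2 threads = 8 vCPUs, expressed purely through CPUID.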
> 6) HVM vcpu hot add/remove functionality was added into xen recently. The
> guest numa support should take this into consideration.

Are you volunteering? ;-)

> 7) I don't see the live migration support in your patches. It looks hard for
> an hvm numa guest to do live migration, as the src/dest hosts could be very
> different in HW configuration.

I don't think this is a problem. We need to separate guest specific options (like VCPUs to guest nodes or guest memory to guest nodes mapping) from host specific parts (guest nodes to host nodes). I haven't tested it yet, but I assume that the config file options specifying the guest specific parts are already sent right now, resulting in the new guest being set up with the proper guest config. The guest node to host node association is determined dynamically by the new host, depending on that host's current resources. This can turn out to be sub-optimal, like migrating a "4 guest nodes on 4 host nodes" guest to a dual node host, which would currently map to a 0-1-0-1 setup, where two guest nodes are assigned the same host node. I don't see much of a problem here.

Thanks for your thoughts and looking forward to future collaboration.

Regards,
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel