RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi Andre, have you returned to the office now? :-)

Thanks,
-- Dexuan

-----Original Message-----
From: Cui, Dexuan
Sent: February 6, 2010 0:36
To: 'Andre Przywara'; Keir Fraser; Kamble, Nitin A
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support

Hi Andre,
I'm also looking into HVM guest NUMA support, so I'd like to share my thoughts and my understanding of your patches.

1) Besides the SRAT, I think we should also build a guest SLIT according to the host SLIT. (A sketch of what I mean follows at the end of this mail.)

2) I agree we should give the user a way to specify how much memory each guest node should have, namely the "nodemem" parameter in your patch02. However, I can't find where it is assigned a value in your patches; I guess you missed it in image.py. And what if Xen can't allocate memory from the specified host node (e.g., not enough free memory on that node)? Currently Xen *silently* falls back to allocating memory from other host nodes -- this hurts guest performance while the user doesn't know about it at all! I think we should add an option to the guest config file: if it is set, guest creation should fail if Xen cannot allocate memory from the specified host node. (See the second sketch below.)

3) In your patch02:
+    for (i = 0; i < numanodes; i++)
+        numainfo.guest_to_host_node[i] = i % 2;
As you said in the "[PATCH 5/5]" mail, at present this is "simply round robin until the code for automatic allocation is in place". I think "simply round robin" is not acceptable and we should implement the automatic allocation.

4) Your patches sort the host nodes using a node load evaluation algorithm, require the user to specify how many guest nodes the guest should see, and distribute the guest VCPUs equally across the guest nodes. I don't think the algorithm can be wise enough every time, and it's not flexible. Requiring the user to specify the number of guest nodes and distributing the VCPUs equally across them doesn't sound wise or flexible enough either. Since guest NUMA needs VCPU pinning to work as expected, how about the following?
a) ask the user to use the "cpus" option to pin each VCPU to a physical CPU (or node);
b) find out how many physical (host) nodes are involved and use that number as the number of guest nodes;
c) map each guest node to one of the host nodes found in step b), and use this info to fill the numainfo.guest_to_host_node[] of 3).
(A sketch of this also follows at the end of this mail.)

5) I think we also need to present the NUMA guest with a virtual CPU topology, e.g., through the initial APIC ID. In current Xen, apic_id = vcpu_id * 2; even with guest SRAT support and 2 guest nodes for a vcpus=n guest, the guest would still think it's on one package with n cores, with no knowledge of VCPU and cache topology, and this would harm guest performance. I think we can treat each guest node as a guest package and show the VCPU topology to the guest by giving it a proper APIC ID (consisting of guest SMT_ID/Core_ID/Package_ID). This needs changes to hvmloader's SRAT/MADT APIC ID fields and to Xen's CPUID/vlapic emulation. (A sketch follows below.)

6) HVM VCPU hot add/remove functionality was added to Xen recently. The guest NUMA support should take this into consideration.

7) I don't see live migration support in your patches. It looks hard for an HVM NUMA guest to live migrate, as the source and destination hosts could have very different hardware configurations.
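For 1), I imagine something along the lines of the sketch below in hvmloader: derive each guest-node-to-guest-node distance from the host distance between the two backing host nodes. This is only a sketch; the names (guest_to_host_node, host_slit) and the omitted ACPI table header are assumptions, not actual hvmloader interfaces.

    #include <stdint.h>

    #define MAX_GUEST_NODES 8

    struct acpi_slit {
        /* 36-byte ACPI table header omitted for brevity */
        uint64_t locality_count;
        uint8_t  entry[MAX_GUEST_NODES * MAX_GUEST_NODES];
    };

    /* host_slit is the flat host matrix: host_nodes * host_nodes bytes. */
    static void build_guest_slit(struct acpi_slit *slit,
                                 unsigned int numanodes,
                                 const unsigned int *guest_to_host_node,
                                 const uint8_t *host_slit,
                                 unsigned int host_nodes)
    {
        unsigned int i, j;

        slit->locality_count = numanodes;
        for ( i = 0; i < numanodes; i++ )
            for ( j = 0; j < numanodes; j++ )
                /* The distance between two guest nodes is the host
                 * distance between the host nodes backing them. */
                slit->entry[i * numanodes + j] =
                    host_slit[guest_to_host_node[i] * host_nodes +
                              guest_to_host_node[j]];
    }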
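For 2), the strict behaviour could look roughly like the snippet below in the libxc allocation path. A sketch only: it assumes a "numa_strict" config option has been plumbed down, and that an exact-node flag such as XENMEMF_exact_node is available; neither is in the patches today.

    /* Sketch: refuse to fall back to other host nodes when numa_strict
     * is set; numa_strict and the exact-node flag are assumptions. */
    rc = xc_domain_populate_physmap_exact(xc_handle, dom, nr_pages,
                                          0 /* order */,
                                          numa_strict ?
                                              XENMEMF_exact_node(node) :
                                              XENMEMF_node(node),
                                          page_array);
    if ( rc != 0 && numa_strict )
    {
        PERROR("Could not allocate memory on host node %u", node);
        goto error_out; /* fail guest creation instead of silently
                           degrading memory locality */
    }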
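For 4), steps a)-c) boil down to collecting the distinct host nodes behind the pinned VCPUs -- a minimal sketch, assuming vcpu_to_pcpu[] comes from the "cpus" config option and pcpu_to_node[] from the physical topology (neither is an existing libxc interface):

    /* Returns the number of guest nodes (the distinct host nodes found),
     * or -1 if more host nodes are involved than we can expose. */
    static int derive_guest_nodes(const unsigned int *vcpu_to_pcpu,
                                  unsigned int nr_vcpus,
                                  const unsigned int *pcpu_to_node,
                                  unsigned int *guest_to_host_node,
                                  unsigned int max_guest_nodes)
    {
        unsigned int nr_guest_nodes = 0, v, n;

        for ( v = 0; v < nr_vcpus; v++ )
        {
            unsigned int host_node = pcpu_to_node[vcpu_to_pcpu[v]];

            for ( n = 0; n < nr_guest_nodes; n++ )
                if ( guest_to_host_node[n] == host_node )
                    break;
            if ( n == nr_guest_nodes ) /* first VCPU on this host node */
            {
                if ( nr_guest_nodes == max_guest_nodes )
                    return -1;
                guest_to_host_node[nr_guest_nodes++] = host_node;
            }
        }
        return nr_guest_nodes;
    }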
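For 5), the APIC ID could encode the topology roughly as below. Purely illustrative: the bit positions are made up here, and real code would have to size the fields consistently with the CPUID topology leaves the guest sees.

    #include <stdint.h>

    /* Sketch: one guest package per guest node, cores numbered within
     * the node; no SMT exposed. Field widths are illustrative only. */
    static uint32_t vcpu_apic_id(unsigned int vcpu_id,
                                 unsigned int vcpus_per_node)
    {
        unsigned int package_id = vcpu_id / vcpus_per_node; /* = guest node */
        unsigned int core_id    = vcpu_id % vcpus_per_node;
        unsigned int smt_id     = 0;

        return (package_id << 4) | (core_id << 1) | smt_id;
    }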
Thanks,
-- Dexuan

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Andre Przywara
Sent: February 5, 2010 5:51
To: Keir Fraser; Kamble, Nitin A
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support

Hi,

to avoid duplicated work in the community on the same topic, to help us sync up on the subject, and as I am not in the office next week, I would like to send out the NUMA guest support patches I have so far.

These patches introduce NUMA support for guests. This can be handy if either the guest's resources (VCPUs and/or memory) exceed one node's capacity, or the host is already loaded so that the requirement cannot be satisfied from one node alone. Some applications may also benefit from the aggregated bandwidth of multiple memory controllers. Even if the guest has only a single node, this code replaces the current NUMA placement mechanism by moving it into libxc.

I have changed some things lately, so there are some loose ends, but it should suffice as a discussion base.

The patches are primarily for HVM guests; as I don't deal much with PV, I am not sure whether a port would be straightforward or of higher complexity. One thing I was not sure about is how to communicate the NUMA topology to PV guests. Reusing the existing code base and injecting a generated ACPI table seems smart, but this would mean enabling the ACPI parsing code in PV Linux, which currently seems to be disabled (?). If someone wants to step in and implement PV support, I will be glad to help.

I have reworked the (guest node to) host node assignment part; this is currently unfinished. I decided to move the node-rating part from XendDomainInfo.py:find_relaxed_node() into libxc (should this eventually go into libxenlight?) to avoid passing too much information between the layers and to include libxl support. This code snippet (patch 5/5) basically scans all VCPUs on all domains and generates an array holding the node load metric for future sorting. The missing part here is a static function in xc_hvm_build.c to pick the <n> best nodes and populate the numainfo->guest_to_host_node array with the result. I will do this when I am back. (A rough sketch of the node-rating pass is appended below.)

For more details see the following email bodies.

Thanks and Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
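For reference, a rough sketch of the node-rating pass described above: every VCPU adds 1/nr_allowed_nodes to each node its affinity covers, so lightly loaded nodes end up with the lowest score and sort first. The two helpers are hypothetical stand-ins for the libxc domain/VCPU enumeration, not the actual patch 5/5 code.

    #define MAX_NODES 64

    extern unsigned int nr_all_vcpus(void);            /* hypothetical */
    extern void vcpu_allowed_nodes(unsigned int vcpu,  /* hypothetical */
                                   unsigned char allowed[MAX_NODES],
                                   unsigned int *nr_allowed);

    static void rate_nodes(unsigned int nr_nodes, float node_load[MAX_NODES])
    {
        unsigned int v, n, nr_allowed;
        unsigned char allowed[MAX_NODES];

        for ( n = 0; n < nr_nodes; n++ )
            node_load[n] = 0.0f;

        for ( v = 0; v < nr_all_vcpus(); v++ )
        {
            vcpu_allowed_nodes(v, allowed, &nr_allowed);
            if ( nr_allowed == 0 )
                continue;
            for ( n = 0; n < nr_nodes; n++ )
                if ( allowed[n] )
                    node_load[n] += 1.0f / nr_allowed;
        }
        /* Sorting node_load[] ascending then yields the <n> best nodes
         * for numainfo->guest_to_host_node[]. */
    }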