[Xen-ia64-devel] How to support NUMA?
We discussed this a little bit at Xen Summit, but we didn't leave with a plan to move forward. Jes is now to the point where he's got Altix booting to some extent, and we need to be in agreement on what NUMA support in Xen/ia64 is going to look like.

First, there are a couple of ways that NUMA is described and implemented in Linux. Many of us are more familiar with the ACPI approach (or "DIG" as Jes might call it). This is comprised of ACPI static tables and methods in namespace. The SRAT static table defines processors and memory ranges and assigns each to a proximity domain. The SLIT table defines the locality between proximity domains. ACPI namespace also provides _PXM methods on objects that allow us to place things like PCI buses and iommu hardware into the right locality. (A rough sketch of an SRAT memory entry appears below.)

Another approach is the one used on the SGI Altix systems. I'm no expert here, but as I understand it, a range of bits within the physical address defines which node the physical address resides on. I haven't looked in the SN code base, but presumably PCI root buses, iommus, and perhaps other hardware, including processors, are associated with nodes in a similar way. Maybe Jes can expand on this a bit for us. Also, is there a way to describe multiple levels of locality in the Altix scheme, or is it simply local vs. non-local?

In order to incur minimal changes to the Linux code base, Jes has proposed a P==M model, where the guest physical (or meta/pseudo-physical) address is equal to the machine physical address. This might seem like a step backwards, since we just transitioned from P==M to a virtual physical (VP) model about a year ago. However, I think this might be a more loosely interpreted P==M model than we had previously; see below. The obvious benefit to this approach is that the NUMA layout of the system is plain to see in the metaphysical addresses provided to the guest. The downside is that we think this might break the grant table API that we worked so hard to fix with the VP transition.

An alternative might be available using the current VP approach. One could imagine that a contiguous chunk of metaphysical memory could be allocated out of memory from a given node. Xen could then rewrite the SLIT & SRAT tables for the domain. Perhaps this is more of a VP with P->node==M->node model. The actual metaphysical addresses are irrelevant, but the node that metaphysical memory is assigned from must match the node of the machine memory, and we must not re-arrange proximity domains (unless someone wants to volunteer to rewrite AML from within Xen). This approach helps the ACPI NUMA systems, but obviously doesn't work for the Altix systems, since they need specific bits in their metaphysical addresses for locality.

Will this latter approach eventually devolve/evolve into the former? I think all that Jes really needs is a way to get the node info from a metaphysical address. To support NUMA, there's no way to get around P->node==M->node, correct? We simply can't do per-page lookups in the mm code to get a node ID and expect any kind of performance. The guest needs to be able to assume that contiguous metaphysical addresses come from the same locality (except, of course, at the edges of a node). We have to assign some kind of metaphysical address to a guest, so why shouldn't at least the node ID bits of the metaphysical address match the machine physical address? The part that I think we're missing is that pages within a node don't need to map 1:1, P==M. Effectively we end up with a pool of VP memory for each node. (A sketch of what such a lookup might look like follows below.)
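For anyone who hasn't stared at the ACPI tables lately, here's a minimal sketch of the SRAT memory affinity entry described above. The field layout follows the ACPI spec's 40-byte type-1 structure, but the struct and function names here are my own shorthand for illustration, not any real Xen or Linux interface:

#include <stdint.h>
#include <stdio.h>

/* Simplified ACPI SRAT Memory Affinity structure (spec type 1 entry).
 * Names are mine; real headers differ. */
struct srat_mem_affinity {
    uint8_t  type;             /* 1 == memory affinity */
    uint8_t  length;           /* 40 bytes */
    uint32_t proximity_domain; /* node this range belongs to */
    uint16_t reserved1;
    uint64_t base_address;     /* start of the memory range */
    uint64_t range_length;     /* size of the memory range */
    uint32_t reserved2;
    uint32_t flags;            /* bit 0: entry enabled */
    uint64_t reserved3;
} __attribute__((packed));

/* Walk an array of SRAT memory entries and report which proximity
 * domain (node) each memory range lives in. */
static void dump_mem_affinity(const struct srat_mem_affinity *e, int n)
{
    for (int i = 0; i < n; i++, e++) {
        if (!(e->flags & 1))   /* skip disabled entries */
            continue;
        printf("mem %#llx-%#llx -> proximity domain %u\n",
               (unsigned long long)e->base_address,
               (unsigned long long)(e->base_address + e->range_length - 1),
               e->proximity_domain);
    }
}

This is also the table Xen would have to rewrite for a domain under the VP approach: same entry format, just with the proximity domains left in place and the ranges adjusted to the domain's metaphysical layout.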
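To make the P->node==M->node idea concrete, here's the kind of lookup I think Jes needs. The shift width and names are made up for the example (they're not actual Altix or Xen definitions), but the point is that when each node's metaphysical pool is contiguous and the node ID sits in fixed high-order bits, the lookup is O(1) with no per-page table walk:

#include <stdint.h>

#define NODE_SHIFT 38          /* hypothetical: 256GB of address space per node */

/* Derive the node ID from a metaphysical address.  Under
 * P->node==M->node, the node bits of the metaphysical address match
 * the node bits of the machine address, so this is a shift, not a
 * per-page lookup -- the property the mm code needs for performance. */
static inline int maddr_to_node(uint64_t maddr)
{
    return (int)(maddr >> NODE_SHIFT);
}

/* The converse: the base of a node's metaphysical pool.  Pages
 * *within* the pool are free to map to any machine page on that
 * node, which is what makes this looser than strict P==M. */
static inline uint64_t node_to_maddr_base(int node)
{
    return (uint64_t)node << NODE_SHIFT;
}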
In the SGI case, a few high-order bits in the metaphysical address will happen to match the machine physical high-order bits. In the ACPI NUMA case, we might choose to do something similar so that we have to modify the SRAT table a little less.

Even with this as the base model, there are still a lot of questions:

 - Is this model only for dom0, or can we specify it for domU also? There are obvious performance advantages to a NUMA-aware domU if it's running on a NUMA box and doesn't entirely fit within a node.
 - How do we specify which resources go to which domains, for both the dom0 and domU cases?
 - Can NUMA-aware domains be migrated or restored?
 - Do non-NUMA-aware domains have zero-based metaphysical memory (below 4G)?
 - Does a non-NUMA-aware domain that spans nodes have a discontiguous address map?
 - How do driver domains fit into the picture?
 - How can a NUMA-aware domain be told the locality of a PCI device?
 - Will we make an attempt to allocate non-NUMA-aware guests within a node?

Please comment and discuss. Let me know if I'm way off base. If this doesn't meet our needs or is not feasible, let's come up with something that is.

Thanks,

	Alex

--
Alex Williamson
HP Open Source & Linux Org.