Re: [Xen-ia64-devel] How to support NUMA?
>>>>> "Alex" == Alex Williamson <alex.williamson@xxxxxx> writes: Alex> First, there are a couple ways that NUMA is described and Alex> implemented in Linux. Many of us are more familiar with the Alex> ACPI approach (or "DIG" as Jes might call it). This is Alex> comprised of ACPI static tables and methods in namespace. The Alex> SRAT static table defines processors and memory ranges and Alex> assigns each into a proximity domain. Hi Alex, Sorry I'm so behind on this, but let me try and add a few bits. First of all, the ACPI approach vs the Altix approach are not incompatible, the issue is that we are not DIG compliant and so for certain things like TLB invalidation and sending IPI's I need to know on what node a processor is located to be able to do it, as we don't use the standard ia64 instructions for this but go via the SHUB chip, which is located on each node. This is what really makes us different from DIG. Alex> Another approach is that used on the SGI Altix systems. I'm Alex> no expert here, but as I understand it, a range of bits within Alex> the physical address defines which node the physical address Alex> resides. I haven't looked in the SN code base, but presumably Alex> PCI root buses, iommus, and perhaps other hardware including Alex> processors are associated with nodes in a similar way. Maybe Alex> Jes can expand on this a bit for us. Also, is there a way to Alex> describe multiple levels of locality in the Altix scheme, or is Alex> it simply local vs non-local? So on Altix we could/can also describe all the memory regions via the ACPI tables and thats not the problem here. However we have the knowledge that the physical address contains the node ID, but also in addition I need the node ID to figure out how to program the IOMMU for a given PCI device as the IOMMU is in the SHUB chip for one series of systems and in the TIO chip on on I/O only blades (I haven't gotten anywhere near looking at support for those yet though). The real problem with relying on the ACPI tables are the following: ACPI 2.x only supports up to 256 nodes if I remember correctly. Thats kinda small :-) Second, if we boot a system with say 64 nodes, the lookup time is going to go through the rough if we are to traverse a table on every lookup instead of just being able to do a few bit shifts. As for the multiple levels of locality, then we have that issue, ie. the Altix is basically a routed network. The further away you go there more expensive it is. I don't know all the details of this though, but once we get there we can look at it. In our experience, what really makes the performance difference is node-local vs off-node memory. Alex> Will this latter approach eventually devolve/evolve into the Alex> former? I think all that Jes really needs is a way to get the Alex> node info from a metaphysical address. To support NUMA, there's Alex> no way to get around P-> node==M->node, correct? We simply Alex> can't do per page lookups in P-> the Alex> mm code to get a node ID and expect any kind of performance. Thats correct. As we discussed on IRC, it's key for Altix that the node ID bits in the metaphysical address matches the node ID bits on the real physical node. Otherwise I am going to have to rewrite a pretty serious chunk of dom0's memory management and I/O code. In addition performance is going to go through the toilet as I mentioned above. However, just to make it more clear. 
Alex> Will this latter approach eventually devolve/evolve into the
Alex> former?  I think all that Jes really needs is a way to get the
Alex> node info from a metaphysical address.  To support NUMA, there's
Alex> no way to get around P->node == M->node, correct?  We simply
Alex> can't do per page lookups in the mm code to get a node ID and
Alex> expect any kind of performance.

That's correct.  As we discussed on IRC, it's key for Altix that the
node ID bits in the metaphysical address match the node ID bits of
the real physical node.  Otherwise I am going to have to rewrite a
pretty serious chunk of dom0's memory management and I/O code.  In
addition, performance is going to go through the toilet, as I
mentioned above.

However, just to make it more clear: it's perfectly legitimate for us
(at least that's my current belief :-) to present the metaphysical
chunks within a node as one virtually contiguous chunk of
metaphysical memory.  I.e. it doesn't have to be that P=M exactly,
just that P[36:48]=M[36:48].  Hope I got the bit numbers right here,
but basically that's the idea.

Alex> The guest needs to be able to assume contiguous metaphysical
Alex> addresses come from the same locality (except of course at the
Alex> edges of a node).  We have to assign some kind of metaphysical
Alex> address to a guest, so why shouldn't at least the Node ID bits
Alex> of the metaphysical address match the machine physical
Alex> addresses?  The part that I think we're missing is that pages
Alex> within a node don't need to map 1:1, P==M.  Effectively we end
Alex> up with a pool of VP memory for each node.  In the SGI case, a
Alex> few high order bits in the metaphysical address will happen to
Alex> match the machine physical high order bits.  In the ACPI NUMA
Alex> case, we might choose to do something similar so that we have to
Alex> modify the SRAT table a little less.

Yes, sounds good to me.  In fact I suspect that on most NUMA systems,
even the non-SGI ones, you would be able to benefit from this.  But
obviously I don't know how the memory layout is on the zx1000 and
other non-SGI systems.

Alex> Even if this is the base, there are still a lot of questions.
Alex> Is this model only for dom0, or can we specify it for domU also?
Alex> There are obvious performance advantages to a NUMA aware domU if
Alex> it's running on a NUMA box and doesn't entirely fit within a
Alex> node.  How do we specify which resources go to which domains for
Alex> both the dom0 and domU cases?

I think I mentioned this a long time ago (and it was in my Xen Summit
slides), but yes, I'd very much like to see this as an option for
creating dom0's.  By being able to fake a non-NUMA system for domU's,
we'd be able to run certain non-NUMA aware operating systems under
Xen, which would be interesting.  However, for performance I'd very
much like to see domU get proper NUMA info in its memory placement,
as otherwise its performance will be practically useless.

The thing is that user applications on a NUMA system need to be NUMA
aware to perform optimally.  That's why we have libnuma on Linux, and
if we start presenting an incorrect NUMA layout to the domU and the
app uses it, then performance is going to get even worse.  For the
same reason I'd like to be able to bind vCPUs to specific physical
CPUs, to avoid ending up with off-node memory.
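To make the libnuma point a bit more concrete, this is roughly what a
NUMA-aware application does on Linux today (plain libnuma calls,
nothing Xen specific; node 0 and the 1 MB size are just example
values).  If the domU gets handed a NUMA layout that doesn't match
reality, this kind of explicit placement makes things worse rather
than better:

/* link with -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
        size_t size = 1 << 20;          /* 1 MB working set */
        void *buf;

        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support\n");
                return 1;
        }

        /* Keep this thread on node 0 and allocate its working set
         * there, so all accesses stay node-local. */
        if (numa_run_on_node(0) < 0)
                perror("numa_run_on_node");

        buf = numa_alloc_onnode(size, 0);
        if (!buf)
                return 1;

        /* ... do the real work on buf ... */

        numa_free(buf, size);
        return 0;
}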
Alex> Can NUMA aware domains be migrated or restored?

That's tricky to do.  I guess it can be done, but it's not going to
be easy.  Personally I consider this a low priority item.

Alex> Do non-NUMA aware domains have zero-based
Alex> metaphysical memory (below 4G)?

Why not, I don't see why they shouldn't.  However, it leaves open the
issue of what happens if you try to do I/O without an IOMMU.

Alex> Does a non-NUMA aware domain
Alex> that spans nodes have a discontiguous address map?

If the code is non-NUMA aware, then it really doesn't matter.  It
could be made an option, but if the OS is not trying to do anything
with it, it probably makes little difference.

Alex> How do driver domains fit into the picture?  How can a NUMA
Alex> aware domain be told the locality of a PCI device?

Well, if it's part of the metaphysical address range, it should show
up automatically to the dom :-)

Alex> Will we make an attempt to allocate non-NUMA aware guests within
Alex> a node?

That would be good for performance - I don't see it causing any
problems to try and do this.

Alex> Please comment and discuss.  Let me know if I'm way off base.
Alex> If this doesn't meet our needs or is not feasible, let's come up
Alex> with something that is.  Thanks,

Sounds good to me so far, thanks for trying to guide the discussion
in the right direction.

Cheers,
Jes

_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel