
Re: [Xen-ia64-devel] How to support NUMA?



>>>>> "Alex" == Alex Williamson <alex.williamson@xxxxxx> writes:

Alex>    First, there are a couple ways that NUMA is described and
Alex> implemented in Linux.  Many of us are more familiar with the
Alex> ACPI approach (or "DIG" as Jes might call it).  This is
Alex> comprised of ACPI static tables and methods in namespace.  The
Alex> SRAT static table defines processors and memory ranges and
Alex> assigns each into a proximity domain.

Hi Alex,

Sorry I'm so behind on this, but let me try and add a few bits.

First of all, the ACPI approach and the Altix approach are not
incompatible. The issue is that we are not DIG compliant, so for
certain things like TLB invalidation and sending IPIs I need to know
which node a processor is located on. We don't use the standard ia64
instructions for this but go via the SHUB chip, which is located on
each node. This is what really makes us different from DIG.

Alex>    Another approach is that used on the SGI Altix systems.  I'm
Alex> no expert here, but as I understand it, a range of bits within
Alex> the physical address defines which node the physical address
Alex> resides.  I haven't looked in the SN code base, but presumably
Alex> PCI root buses, iommus, and perhaps other hardware including
Alex> processors are associated with nodes in a similar way.  Maybe
Alex> Jes can expand on this a bit for us.  Also, is there a way to
Alex> describe multiple levels of locality in the Altix scheme, or is
Alex> it simply local vs non-local?

So on Altix we could/can also describe all the memory regions via the
ACPI tables, and that's not the problem here. However, we know that
the physical address contains the node ID, and in addition I need the
node ID to figure out how to program the IOMMU for a given PCI device,
as the IOMMU is in the SHUB chip for one series of systems and in the
TIO chip on I/O-only blades (I haven't gotten anywhere near looking at
support for those yet, though).

The real problems with relying on the ACPI tables are the following:
First, ACPI 2.x only supports up to 256 nodes, if I remember
correctly. That's kinda small :-) Second, if we boot a system with,
say, 64 nodes, the lookup time is going to go through the roof if we
have to traverse a table on every lookup instead of just being able to
do a few bit shifts.
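
To illustrate the difference between the two lookups, here is a
minimal sketch in C. The bit positions (node ID in bits 36..48) are an
assumption borrowed from the tentative numbers given later in this
message, and struct mem_range, node_from_table() and node_from_bits()
are hypothetical names, not code from the SN or Xen trees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the node ID occupies physical address bits 36..48.
 * Real SHUB address layouts may differ. */
#define NODE_SHIFT 36
#define NODE_BITS  13
#define NODE_MASK  ((UINT64_C(1) << NODE_BITS) - 1)

/* Hypothetical SRAT-style entry: one memory range per node. */
struct mem_range {
    uint64_t base;
    uint64_t length;
    int      node;
};

/* Table approach: linear scan, cost grows with the number of ranges. */
static int node_from_table(const struct mem_range *tbl, size_t n,
                           uint64_t paddr)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (paddr >= tbl[i].base && paddr - tbl[i].base < tbl[i].length)
            return tbl[i].node;
    }
    return -1; /* address not covered by any range */
}

/* Bit-field approach: constant cost, one shift and one mask. */
static int node_from_bits(uint64_t paddr)
{
    return (int)((paddr >> NODE_SHIFT) & NODE_MASK);
}
```

The table version needs a populated table and a scan on every call,
while the bit-field version needs nothing but the address itself.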

As for the multiple levels of locality, yes, we have that issue,
i.e. the Altix is basically a routed network. The further away you go,
the more expensive it is. I don't know all the details of this
though, but once we get there we can look at it. In our experience,
what really makes the performance difference is node-local vs off-node
memory.

Alex>    Will this latter approach eventually devolve/evolve into the
Alex> former?  I think all that Jes really needs is a way to get the
Alex> node info from a metaphysical address.  To support NUMA, there's
Alex> no way to get around P->node==M->node, correct?  We simply
Alex> can't do per page lookups in the mm code to get a node ID and
Alex> expect any kind of performance.

That's correct. As we discussed on IRC, it's key for Altix that the
node ID bits in the metaphysical address match the node ID bits of
the real physical node. Otherwise I am going to have to rewrite a
pretty serious chunk of dom0's memory management and I/O code. In
addition, performance is going to go down the toilet, as I mentioned
above.

However, just to make it more clear: it's perfectly legitimate for us
(at least that's my current belief :-) to present the metaphysical
chunks within a node as one virtually contiguous chunk of
metaphysical memory. I.e. it doesn't have to be that P=M exactly,
just that P[36:48]=M[36:48]. Hope I got the bit numbers right here,
but basically that's the idea.
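
The constraint can be sketched in C as follows. This is only an
illustration under the assumption that the node ID really sits in
bits 36..48 (the numbers above are explicitly tentative); same_node()
and metaphys_for() are hypothetical helpers, not existing Xen
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: node ID in address bits 36..48, as tentatively stated. */
#define NODE_SHIFT 36
#define NODE_BITS  13
#define NODE_FIELD (((UINT64_C(1) << NODE_BITS) - 1) << NODE_SHIFT)

/* True when metaphysical address p and machine address m carry the
 * same node ID bits, i.e. P[36:48] == M[36:48]. */
static bool same_node(uint64_t p, uint64_t m)
{
    return ((p ^ m) & NODE_FIELD) == 0;
}

/* Build a metaphysical address that keeps the node bits of the machine
 * address but uses an independently chosen offset within the node, so
 * P != M is allowed below the node field. */
static uint64_t metaphys_for(uint64_t maddr, uint64_t offset_in_node)
{
    assert(offset_in_node < (UINT64_C(1) << NODE_SHIFT));
    return (maddr & NODE_FIELD) | offset_in_node;
}
```

With this, scattered machine pages on one node can be handed to the
guest at consecutive offsets, and the guest can still recover the node
from the address with a shift and a mask.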

Alex> The guest needs to be able to assume contiguous metaphysical
Alex> addresses come from the same locality (except of course at the
Alex> edges of a node).  We have to assign some kind of metaphysical
Alex> address to a guest, so why shouldn't at least the Node ID bits
Alex> of the metaphysical address match the machine physical
Alex> addresses?  The part that I think we're missing is that pages
Alex> within a node don't need to map 1:1, P==M.  Effectively we end
Alex> up with a pool of VP memory for each node.  In the SGI case, a
Alex> few high order bits in the metaphysical address will happen to
Alex> match the machine physical high order bits.  In the ACPI NUMA
Alex> case, we might choose to do something similar so that we have to
Alex> modify the SRAT table a little less.

Yes, sounds good to me. In fact I suspect that on most NUMA systems,
even the non-SGI ones, you would be able to benefit from this. But
obviously I don't know what the memory layout is on the zx1000 and
other non-SGI systems.

Alex>    Even if this is the base, there are still a lot of questions.
Alex> Is this model only for dom0, or can we specify it for domU also?
Alex> There are obvious performance advantages to a NUMA aware domU if
Alex> its running on a NUMA boxes and doesn't entirely fit within a
Alex> node.  How do we specify which resources go to which domains for
Alex> both the dom0 and domU cases?

I think I mentioned this a long time ago (and it was in my Xen Summit
slides), but yes I'd very much like to see this as an option for
creating dom0's. By being able to fake a non-NUMA system for domU's,
we'd be able to run certain non-NUMA aware operating systems under Xen
which would be interesting. However for performance I'd very much like
to see domU get proper NUMA info in its memory placement, as otherwise
its performance will be practically useless.

The thing is that user applications on a NUMA system need to be NUMA
aware to perform optimally. That's why we have libnuma on Linux, and
if we start presenting an incorrect NUMA layout to the domU and the
app uses that, then performance is going to get even worse.

For the same reason I'd like to be able to bind vCPUs to specific
physical CPUs to avoid ending up running out of off-node memory.

Alex> Can NUMA aware domains be migrated or restored?

That's tricky to do. I guess it can be done, but it's not going to be
easy. Personally I consider this a low priority item.

Alex> Do non-NUMA aware domains have zero-based
Alex> metaphysical memory (below 4G)?

Why not? I don't see why they shouldn't. However, it leaves open the
issue of what happens if you try to do I/O without an IOMMU.

Alex> Does a non-NUMA aware domain
Alex> that spans nodes have a discontiguous address map?

If the code is non-NUMA aware then it really doesn't matter. It could
be made an option, but if the OS is not trying to do anything with it,
it probably makes little difference.

Alex> How do driver domains fit into the picture?  How can a NUMA
Alex> aware domain be told the locality of a PCI device?

Well, if it's part of the metaphysical address range, it should show
up automatically to the dom :-)

Alex> Will we make an attempt to allocate non-NUMA aware guests within
Alex> a node?

That would be good for performance - I don't see it causing any
problems to try and do this.
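
One simple way to attempt this is a first-fit pass over per-node free
page counts, falling back to a node-spanning allocation only when no
single node is big enough. The sketch below assumes the hypervisor
tracks free pages per node; pick_single_node() and its parameters are
hypothetical names, not an existing Xen interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first node whose free page count can hold
 * the whole guest, or -1 if the guest has to span nodes.
 * free_pages[i] is the (assumed) number of free pages on node i. */
static int pick_single_node(const uint64_t *free_pages, size_t nr_nodes,
                            uint64_t guest_pages)
{
    assert(free_pages != NULL || nr_nodes == 0);
    for (size_t i = 0; i < nr_nodes; i++) {
        if (free_pages[i] >= guest_pages)
            return (int)i;
    }
    return -1; /* caller falls back to allocating across nodes */
}
```

First-fit is chosen here only for brevity; a best-fit or
load-aware policy would slot into the same place.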

Alex>    Please comment and discuss.  Let me know if I'm way off base.
Alex> If this doesn't meet our needs or is not feasible, let's come up
Alex> with something that is.  Thanks,

Sounds good to me so far, thanks for trying to guide the discussion in
the right direction.

Cheers,
Jes

_______________________________________________
Xen-ia64-devel mailing list
Xen-ia64-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-ia64-devel