
Re: [Xen-devel] [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning

On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> On 16/08/13 05:13, Yechen Li wrote:
> > 
> > +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> > +
> > +This service is provided by the hypervisor (and wired, if necessary, all the
> > +way up to the proper toolstack layer or guest kernel), since it is only Xen
> > +that knows both the virtual and the physical topologies.
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.

I think exposing any NUMA topology to a guest - regardless of whether it is
based on real NUMA or not - is OK, and actually a pretty neat thing.

Meaning you could tell a PV guest that it is running on a 16-socket NUMA
box while in reality it is running on a single-socket box. Or vice versa.
It can serve as a way to increase (or decrease) performance - and also to
do resource capping (this PV guest will only get 1G of real fast
memory and then 7GB of slow memory) and let the OS handle the details
of it (which it does nowadays).
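
For illustration, the fast/slow split above could be expressed per virtual
node in the guest config. The syntax below is only a sketch (later xl
versions grew a vnuma option along these lines; the exact keys here are
assumptions, not a reference):

```
# Hypothetical guest config sketch: two virtual NUMA nodes, one small/fast
# and one large/slow, each mapped to a physical node by the toolstack.
vnuma = [ [ "pnode=0", "size=1024", "vcpus=0-1" ],
          [ "pnode=1", "size=7168", "vcpus=2-3" ] ]
```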

The mapping though - of which PV pages should belong to which fake
PV NUMA node, and how those bind to the real NUMA topology - that part
I am not sure how to solve. More on this later.
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Correct. And that is OK - it just means that the performance can suck
horribly while the guest is there. Or the guest can be migrated to an even
better NUMA machine where it will perform even better.

That is nothing new, and it is no different whether or not the guest
has PV NUMA.

> > +## Description of the problem ##

I think you have to back this up with a problem description. That is, you
need to think of:
 - How a PV guest will allocate pages at bootup based on this
 - How it will balloon up/down within those "buckets".

If you are using the guest's NUMA hints, they usually come in the form of
'allocate pages on this node', and the node information is of the type
'pfn X to pfn Y are on this NUMA node'. That does not work very well with
ballooning, as ballooned pages can be scattered across various nodes. But
that is mostly b/c the balloon driver is not even trying to use the NUMA
APIs. It could use them, do the best it can, and
perhaps balloon round-robin across the NUMA pools. Or
perhaps a better option would be to use the hotplug memory mechanism
(which is implemented in the balloon driver) and do large swaths of
memory at a time.
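A round-robin pass over per-node pools, as suggested above, could look
something like this sketch (all names here are hypothetical; a real driver
would pull from the actual per-node balloon lists):

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical per-node pool sizes: pages still available to balloon. */
unsigned long pool_avail[MAX_NODES] = { 100, 100, 100, 100 };
static int next_node; /* round-robin cursor */

/*
 * Pick the next node to balloon a page from, skipping exhausted pools.
 * Returns the node id, or -1 if every pool is empty.
 */
int balloon_pick_node(void)
{
    for (int i = 0; i < MAX_NODES; i++) {
        int node = (next_node + i) % MAX_NODES;
        if (pool_avail[node] > 0) {
            next_node = (node + 1) % MAX_NODES;
            pool_avail[node]--;
            return node;
        }
    }
    return -1;
}
```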

But more problematic is the migration. If you migrate a guest
to a host that has a different NUMA topology, what you really really
want is:
        - unplug all of the memory in the guest
        - replug the memory with the new NUMA topology
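
The two-step sequence above, sketched with stubbed stand-ins for the real
balloon/hotplug calls (every function and variable here is hypothetical):

```c
#include <assert.h>

/* Hypothetical guest-side state; real code would talk to the balloon
 * driver and the memory hotplug machinery. */
unsigned long guest_pages = 4096;
int current_vnodes = 1;

static void unplug_all_memory(void) { guest_pages = 0; }
static void replug_memory(int vnodes, unsigned long pages)
{
    current_vnodes = vnodes;
    guest_pages = pages;
}

/* On migration: drop every page, then repopulate under the new topology. */
void renumber_after_migration(int new_vnodes, unsigned long pages)
{
    unplug_all_memory();
    replug_memory(new_vnodes, pages);
}
```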

Obviously this means you need some dynamic NUMA system - and I don't
know of such a thing. The unplug/plug can be done via the balloon driver
and/or the hotplug memory system. But then the boundaries of the NUMA
pools are set at boot time, and you would want to change them.
Are SRAT/SLIT dynamic? Could they change during runtime?

Then there is the concept of AutoNUMA, where you would migrate
pages from one node to another. With a PV guest that would imply
that the hypervisor would poke the guest and say: "ok, time to
alter your P2M table". Which I guess right now is done best
via the balloon driver - so what you really want is a callback
to tell the balloon driver: hey, balloon down and up this
PFN block on NUMA node X.
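
Such a callback could be shaped roughly like this (all names are made up;
the real plumbing would go through an event channel into the balloon
driver):

```c
#include <assert.h>

/* Hypothetical counters standing in for the real balloon operations. */
unsigned long pages_ballooned_out, pages_ballooned_in;

static void balloon_out(unsigned long n)          { pages_ballooned_out += n; }
static void balloon_in(unsigned long n, int node) { (void)node;
                                                    pages_ballooned_in  += n; }

/* Hypothetical handler for a "move this PFN block to node X" poke:
 * release the old copies to Xen, then repopulate from node X's pool. */
void handle_numa_move(unsigned long pfn_start, unsigned long count, int node)
{
    (void)pfn_start;          /* a real driver would walk this PFN range */
    balloon_out(count);       /* balloon down: hand the pages back */
    balloon_in(count, node);  /* balloon up: refill on the target node */
}
```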

Perhaps what could be done is to set up, across the cluster of hosts,
the worst-case NUMA topology and force it on all the guests.
Then when migrating, the "pools" can be filled/unfilled
depending on which host the guest is on - and on whether it can
fill up the NUMA pools properly. For example, it migrates
from a 1-node box to a 16-node box and all the memory
is remote. It will empty the PV NUMA pool of the "closest"
memory down to zero and fill up the PV NUMA pool of the "farthest"
memory with all of it, to balance things out and keep some real
sense of the PV to machine host memory.
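
The fill/empty accounting for such fixed worst-case pools is simple enough;
a sketch with invented names:

```c
#include <assert.h>

#define NNODES 2

/* Hypothetical per-vnode accounting: pages currently attributed to each
 * pool of the fixed worst-case topology. Start with everything "close". */
unsigned long pool[NNODES] = { 1024, 0 };

/* Shift up to 'count' pages' worth of accounting from one pool to
 * another, e.g. from "closest" to "farthest" after a migration. */
void rebalance(int from, int to, unsigned long count)
{
    if (count > pool[from])
        count = pool[from];   /* clamp: can't move more than we have */
    pool[from] -= count;
    pool[to]   += count;
}
```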

Xen-devel mailing list