Re: [Xen-devel] [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning

On Fri, 2013-08-23 at 13:53 -0700, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> > On 16/08/13 05:13, Yechen Li wrote:
> > > 
> > > +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> > > +
> > > +This service is provided by the hypervisor (and wired, if necessary, all the
> > > +way up to the proper toolstack layer or guest kernel), since it is only Xen
> > > +that knows both the virtual and the physical topologies.
> > 
> > The physical NUMA topology must not be exposed to guests that have a
> > virtual NUMA topology -- only the toolstack and Xen should know the
> > mapping between the two.
> I think exposing any NUMA topology to the guest - regardless of whether it is
> based on real NUMA or not - is OK, and actually a pretty neat thing.
Yes, that is exactly how Elena, who is doing this work for PV guests,
is approaching it.

> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).
Yes, exactly... Again. :-)

> The mapping though - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.
That is fine too. Again, Elena is working on both how to build up a
virtual topology and how to somehow map it to the real topology, for the
sake of performance.

However, this series is about NUMA-aware ballooning, which is something
that makes sense _ONLY_ once all the virtual NUMA support is in place.
That being said, I told Yechen that submitting what he already had as an
RFC could be helpful anyway, i.e., he could get some comments on the
design, the approach, the interface, etc., which is actually what has
happened. :-)

He should have been clearer, in the first submission, about the fact
that some preliminary work was missing. For the second submission, I
tried to help him make that clearer... If that still did not work, and
generated confusion instead, I am sorry about that.

About the technical part of this comment (guest knowledge about the real
NUMA topology), as I said already, I'm fine with leaving the guest
completely in the dark, as long as we can provide a suitable interface
between Xen and the guest that allows ballooning up to work (as
George pointed out in his e-mails).

> > 
> > > +## Description of the problem ##
> I think you have to back up with the problem description. That is, you
> need to think of:
>  - How a PV guest will allocate pages at bootup based on this
That's not this series' job...

>  - How it will balloon up/down within those "buckets".
That, I'm not sure I got (more below)...

> If you are using the guests NUMA hints it usually is in the form of
> 'allocate pages on this node' and the node information is of type
> 'pfn X to pfn Y are on this NUMA'. That does not work very well with
> ballooning as it can be scattered across various nodes. But that
> is mostly b/c the balloon driver is not even trying to use NUMA
> APIs. 
Indeed. The whole point is this: "If it has been somehow established, at
boot time, that pfn X is from virtual NUMA node 2, and that all the
pfn-s from virtual node 2 are allocated, on the host, on hardware NUMA
node 0, then, when ballooning pfn X down and then ballooning it back
up, let's make sure that: 1) in the guest, it still belongs to virtual
node 2, and 2) on the host, it is still backed by a page on hardware
node 0."

Does that make sense?
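
To make that invariant concrete, here is a minimal sketch in Python. It
is only a toy model: the `Guest` class, the `VNODE_TO_PNODE` table and
the `balloon_down()`/`balloon_up()` helpers are hypothetical names made
up for illustration, not Xen or Linux interfaces, and the node numbers
are just the ones from the example above.

```python
from dataclasses import dataclass, field

# Assumed, fixed at boot: virtual node -> hardware node backing it.
# (Hypothetical table; in reality only Xen/the toolstack would know it.)
VNODE_TO_PNODE = {0: 1, 1: 1, 2: 0}


@dataclass
class Guest:
    # pfn -> virtual node, established at boot and never changed here
    pfn_to_vnode: dict
    # pfn -> hardware node of the backing page (absent while ballooned out)
    backing: dict = field(default_factory=dict)


def balloon_down(guest, pfn, free_pages):
    """Give the host page behind pfn back to its hardware node."""
    pnode = guest.backing.pop(pfn)
    free_pages[pnode] += 1


def balloon_up(guest, pfn, free_pages):
    """Back pfn again, using the hardware node its virtual node maps to."""
    vnode = guest.pfn_to_vnode[pfn]      # 1) guest-side node is unchanged
    pnode = VNODE_TO_PNODE[vnode]        # 2) host-side node is derived from it
    if free_pages[pnode] == 0:
        raise MemoryError(f"no free page on hardware node {pnode}")
    free_pages[pnode] -= 1
    guest.backing[pfn] = pnode


# pfn 42 lives on virtual node 2, which is backed by hardware node 0.
guest = Guest(pfn_to_vnode={42: 2}, backing={42: 0})
free_pages = {0: 0, 1: 8}

balloon_down(guest, 42, free_pages)
balloon_up(guest, 42, free_pages)
```

After the down/up cycle, pfn 42 is still on virtual node 2 in the guest
and is again backed by hardware node 0, even though hardware node 1 had
more free pages. What to do when the right hardware node has no free
memory is a policy question that this sketch sidesteps by raising an
error.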

> It could use it and then it would do the best it can and
> perhaps balloon round-robin across the NUMA pools.
Exactly, that is what David suggested and what I also think would be
a nice first step (without any need to add xenstore keys).
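
Just to illustrate what "round-robin across the NUMA pools" could mean,
here is a small Python sketch. The function name and the representation
of the pools (a dict from virtual node to a list of pfn-s) are
assumptions made for the example; the real balloon driver would go
through the kernel's NUMA allocation APIs instead.

```python
def pick_pages_round_robin(pools, count):
    """Pick up to `count` pages, one per virtual node in turn.

    pools: dict mapping virtual node -> list of candidate pfn-s.
    Returns a list of (vnode, pfn) pairs; exhausted nodes are skipped.
    """
    # Work on copies so the caller's lists are left untouched.
    pools = {node: list(pfns) for node, pfns in pools.items()}
    nodes = sorted(pools)
    chosen = []
    while len(chosen) < count and any(pools[n] for n in nodes):
        for node in nodes:
            if len(chosen) == count:
                break
            if pools[node]:
                chosen.append((node, pools[node].pop(0)))
    return chosen


pools = {0: [10, 11, 12], 1: [20], 2: [30, 31]}
picked = pick_pages_round_robin(pools, 5)
# picked == [(0, 10), (1, 20), (2, 30), (0, 11), (2, 31)]
```

The point is only that the balloon target gets spread evenly over the
virtual nodes, instead of draining one node first; no per-node target
(and hence no new xenstore key) is needed for that.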

>  Or 
> perhaps a better option would be to use the hotplug memory mechanism
> (which is implemented in the balloon driver) and do large swaths of
> memory.
Mmm... I think it should all be possible without bothering with memory
hotplug, but I may be wrong (I don't really know much about memory

> But more problematic is the migration. If you migrate a guest
> to node that has different NUMA topologies what you really really
> want is:
>       - unplug all of the memory in the guest
>       - replug the memory with the new NUMA topology
> Obviously this means you need some dynamic NUMA system - and I don't
> know of such. 
We don't plan to support dynamically varying virtual NUMA topologies in
the short term. :-)

> The unplug/plug can be done via the balloon driver
> and/or hotplug memory system. But then - the boundaries of the NUMA
> pools are set at bootup time. And you would want to change them.
> Is SRAT/SLIT dynamic? Could it change during runtime?
I don't know if the real hw tables can actually change, but again,
support for varying the virtual topology is not a priority right now.

> Then there is the concept of AutoNUMA where you would migrate
> pages from one node to another. With a PV guest that would imply
> that the hypervisor would poke the guest and say: "ok, time
> to alter your P2M table". 
Yes, and that's what I have been working on for a while. It's
particularly tricky for a PV guest and, although very similar in
principle, it's going to be different from AutoNUMA in Linux, since a
migration is way more expensive for us than it is for them.

> Which I guess right now is done best
> via the balloon driver - so what you really want is a callback
> to tell the balloon driver: Hey, balloon down and up this
> PFN block on NUMA node X.
I'm currently doing it via some sort of "lightweight suspend/resume
cycle". I like the idea of trying to exploit the ballooning driver for
that, but that will probably happen in a subsequent step (I want it
working that way, before starting to think about how to improve it! :-P).

Anyway, the above is just to say that this is also not this series' job,
and although there surely are points of contact, I think things can be
considered (and hence worked on/implemented) pretty independently.

> Perhaps what could be done is to setup in the cluster of hosts
> the worst case NUMA topology and force it on all the guests.
> Then when migrating the "pools" can be filled/unfilled
> depending on which host the guest is - and whether it can
> fill up the NUMA pools properly. For example it migrates
> from a 1 node box to a 16 node box and all the memory
> is remote. It will empty out the PV NUMA box of the "closest"
> memory to zero and fill up the PV NUMA pool of the "farthest"
> with all memory to balance it out and have some real
> sense of the PV to machine host memory.
That is also nice... Perhaps this is meat for some sort of high (for
sure higher than xl/libxl) level management/orchestration layer, isn't

Anyway... I hope this helped clarify things a bit.

Thanks for having a look and Regards,

<<This happens because I choose it to happen!>> (Raistlin Majere)
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
