
Re: [Xen-devel] Linux Xen Balloon Driver Improvement (Draft 2)

On Mon, Oct 27, 2014 at 02:23:22PM +0000, David Vrabel wrote:
> On 27/10/14 12:33, Wei Liu wrote:
> > 
> > Changes in this version:
> > 
> > 1. Style, grammar and typo fixes.
> > 2. Make this document Linux centric.
> > 3. Add a new section for NUMA-aware ballooning.
> You've not included the required changes to the toolstack and
> autoballoon driver to always use 2M multiples when creating VMs and
> setting targets.

When creating VM, toolstack already tries to use as many huge pages as

Setting a target doesn't use 2M multiples, but I don't think that is
necessary. To balloon in / out X MB of memory:

  nr_2m = X / 2M
  nr_4k = (X % 2M) / 4k

The remainder just goes to 4K queue.
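As a minimal sketch of that split (whole 2M pages by division, with the
remainder counted in 4K pages), assuming the request is given in bytes.
The names `balloon_split` and `split_request` are made up for
illustration and are not part of the balloon driver:

```c
#include <stdint.h>

#define SZ_4K (4ULL << 10)
#define SZ_2M (2ULL << 20)

/* Hypothetical result type: how many pages go to each balloon queue. */
struct balloon_split {
    uint64_t nr_2m; /* whole 2M pages */
    uint64_t nr_4k; /* 4K pages covering the remainder */
};

/* Split a request of `bytes` into 2M pages plus a 4K remainder. */
static struct balloon_split split_request(uint64_t bytes)
{
    struct balloon_split s;

    s.nr_2m = bytes / SZ_2M;
    s.nr_4k = (bytes % SZ_2M) / SZ_4K;
    return s;
}
```

For example, a 5 MB request gives two 2M pages and 256 4K pages.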

And what do you mean by "autoballoon" driver? Do you mean functionality
of xl? In the end the request is still fulfilled by Xen balloon driver
in kernel. So if dom0 is using the new balloon driver proposed here, it
should balloon down in 2M multiples automatically.

> > ## Introduction
> > 
> > This document describes a design to improve Xen balloon driver in Linux.
> "Linux balloon driver for Xen guests"?


> > ## Goal of improvement
> > 
> > The balloon driver makes use of as many huge pages as possible,
> > defragmenting guest address space. Contiguous guest address space
> > permits huge page ballooning which helps prevent host address space
> > fragmentation.
> > 
> > This should be achieved without any particular hypervisor side
> > feature.
> I really think you need to be taking whole-system view and not focusing
> on just the guest balloon driver.

I don't think there's a terribly tight linkage between the hypervisor
side change and the guest side change. This design doesn't involve any
new hypervisor interface and I intend it to remain so.

The aim is to have the guest automatically defragment its address space
while at the same time helping to prevent hypervisor memory from
fragmenting (at least this is what the design aims for; how it works in
practice needs to be prototyped and benchmarked).

The above reasoning is good enough to justify this change, isn't it?

I think Ian Campbell explained this better than I did in another email.
To quote him verbatim:

Compaction on the guest side serves two purposes immediately even
without hypervisor side compaction: Firstly it increases the chances of
being able to allocate a 2M page when required to balloon one out,
either right now or at some point in the future, IOW it helps towards
the goal of doing as much ballooning as possible in 2M chunks.

Secondly it means that we will end up with contiguous 2M holes which
will give the opportunity for future balloon operations to end up with 2M
mappings, this is useful in its own right even if it is neutral wrt the
fragmentation of the populated 2M regions right now (and we know it
can't make things worse in that regard).

If you have specific concerns we can talk about them case by case. If
you have any concern about the linkage between guest and hypervisor we
can also analyse it further.

> > ### Make use of balloon page compaction
> > 
> > The core of migration callback is XENMEM\_exchange hypercall. This
> > makes sure that inflation of old page and deflation of new page is
> > done atomically, so even if a domain is beyond its memory target and
> > the target is being enforced, it can still compact memory.
> Having looked at what XENMEM_exchange actually does, I can't see how
> you're using it to give this behaviour.


Doesn't it guarantee atomicity (a single hypercall)? Isn't it able to
exchange pages even if target is enforced? (Note the MEMF_no_refcount
when calling steal_page / assign_pages).

So which aspect do you think doesn't work? Can you make this clearer
so that I can answer your question better?

> IMO, XENMEM_exchange should probably be renamed XENMEM_repopulate or
> something.

I will leave that to the hypervisor maintainers.  TBH I don't think
XENMEM_repopulate reflects the nature of this hypercall either.

> > ### Periodically exchange normal size pages with huge pages
> > 
> > Worker thread wakes up periodically to check if there are enough pages
> > in normal size page queue to coalesce into a huge page. If so, it will
> > try to exchange that huge page into a number of normal size pages with
> > XENMEM\_exchange hypercall.
> I don't see what this is supposed to achieve.  This is going to take a
> (potentially) non-fragmented superpage and fragment it.

Let's look at this from start of day.

The guest always tries to balloon in / out as many 2M pages as possible.
So if we have a long list of 4K pages, it means the underlying host super
frames are already fragmented.

So if 1) there are enough 4K pages in ballooned out list, 2) there is a
spare 2M page, it means that the 2M page comes from the result of
balloon page compaction, which means the underlying host super frame is

What this tries to achieve is to build up a cycle that creates chances
to balloon in / out 2M pages. If you release a 2M page that is backed by
512 4K pages and then balloon it back in, that 2M page can be backed by
a 2M host frame.

> Your set of 512 4k ballooned pages needs to be ordered, contiguous and
> superpage aligned, for this to be any use.

The idea of promoting sorted, aligned pages from the 4K list to the 2M
list is of course achievable and probably easier to reason about. But it
won't directly help prevent hypervisor side fragmentation, as promotion
doesn't involve exchanging memory. In the end, however, it might still
be able to build up a cycle that helps prevent host fragmentation.
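To make the promotion condition concrete, here is a rough sketch of the
check it implies: scan a sorted list of ballooned-out 4K page frame
numbers for 512 consecutive PFNs starting on a 2M boundary. It assumes
the list is a sorted array with no duplicates; `find_promotable_run` is
a hypothetical helper, not existing driver code:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_2M 512u

/*
 * Return the index of the first run of 512 consecutive PFNs in the
 * sorted array `pfns` that starts on a 2M boundary, or -1 if none.
 */
static long find_promotable_run(const uint64_t *pfns, size_t n)
{
    size_t i = 0;

    while (i + PAGES_PER_2M <= n) {
        size_t k;

        if (pfns[i] % PAGES_PER_2M != 0) {
            i++; /* not superpage aligned, try the next entry */
            continue;
        }
        for (k = 1; k < PAGES_PER_2M; k++)
            if (pfns[i + k] != pfns[i] + k)
                break;
        if (k == PAGES_PER_2M)
            return (long)i;
        /* Entries i+1 .. i+k-1 lie inside the same 2M frame, so none
         * of them can be an aligned start; skip past them. */
        i += k;
    }
    return -1;
}
```

A run found this way could be moved from the 4K list to the 2M list as a
unit; whether that is worth doing compared to the exchange-based
approach is exactly what would need prototyping.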

I plan to prototype both and choose the one that works better.  In any
case, this is an implementation detail.

> > ## Relationship with NUMA-aware ballooning
> > 
> > Another orthogonal improvement to Linux balloon driver is NUMA-aware
> > ballooning.
> > 
> > The use of balloon page compaction will not interfere with NUMA-aware
> > ballooning because balloon compaction, which is part of Linux's memory
> > subsystem, is already NUMA-aware.
> > 
> > All the changes proposed in this design can be made NUMA-aware
> > provided virtual NUMA topology information is in place.
> How?

The exchange hypercall accepts node information. So it's potentially the
same level of work as making the balloon driver NUMA-aware (the increase
/ decrease hypercalls).


> David

Xen-devel mailing list


