
Re: [Xen-devel] Analysis of using balloon page compaction in Xen balloon driver



On Thu, 2014-10-16 at 18:46 +0100, David Vrabel wrote:
> On 16/10/14 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
> 
> Thanks for writing this.  This is an excellent starting point for a
> productive design discussion.
> 
> > ## Benefit for auto-translated guest
> > 
> > HVM/PVH/ARM guests can have a contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance provided the guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Pages (THP).
> > 
> > Consider the memory access pattern of these guests: one access to a
> > guest physical address involves several accesses to machine memory.
> > The total number of memory accesses can be represented as:
> > 
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> > 
> > where Hx denotes the number of second-stage page table walk levels
> > and Gx the number of guest page table walk levels.
> > 
> > By having a contiguous guest physical address space, the guest can
> > make use of huge pages. This can reduce the number of G's in the
> > formula.
> > 
> > Reducing the number of H's is a separate hypervisor-side improvement
> > and should be decoupled from the Linux-side changes.
> 
> Whilst this analysis is fine, I don't think this is the real benefit of
> using superpages, which is reducing TLB usage and reducing the number of
> TLB misses.

It depends a bit on whether the TLB caches partial walks etc., but more
importantly using super pages reduces the cost of a TLB miss, by
requiring fewer memory accesses on the walk.

> With fragmented stage 2 tables I don't think you will see much
> improvement in TLB usage.

I think you will save some: by cutting off a level of the stage 1 walk
you avoid the need for a stage 2 walk at that level, which might be 3-4
levels of lookup.

I expect it is not as significant a benefit as stage 2 superpages (which
save you accesses at every level of the stage 1 walk), but it will be
there.
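
To put some rough numbers on that, here is a toy model of the nested
walk cost (it assumes no partial-walk caching at all, which real TLBs do
have, so treat it as an upper bound rather than a measurement):

    /* Toy model: every guest table pointer and the final guest physical
     * address each need a full stage-2 walk, so a miss with a G-level
     * guest walk and an H-level stage-2 walk costs roughly
     * (G + 1) * (H + 1) - 1 memory accesses. */
    #include <stdio.h>

    static unsigned int walk_cost(unsigned int g_levels, unsigned int h_levels)
    {
        return (g_levels + 1) * (h_levels + 1) - 1;
    }

    int main(void)
    {
        printf("4-level guest, 4-level stage 2: %u\n", walk_cost(4, 4)); /* 24 */
        printf("2M guest pages (3-level walk) : %u\n", walk_cost(3, 4)); /* 19 */
        printf("2M at stage 2 as well         : %u\n", walk_cost(3, 3)); /* 15 */
        return 0;
    }

So guest-side 2M pages shave a chunk off every miss, and stage 2
superpages shave a bit more off every level of the remaining walk.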

> > ## HAP table fragmentation is not made worse
> 
> This reasoning looks reasonable to me.  But it suggests that the balloon
> compaction isn't doing its job properly.  It seems like it should be
> much more proactive in resolving fragmentation.

Speaking to some KVM folks here at Plumbers, it seems they find
compaction to be working pretty well for them, but there are some /proc
knobs one has to twiddle to make it more aggressive (sadly none of them
are around right now so I can't ask for pointers to said knobs).

> > ## Beyond Linux balloon compaction infrastructure
> > 
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all the
> > discrete entries belong to one huge page, are in the correct order
> > and are in the correct state.
> 
> I would like to see a more detailed description of the Xen-side
> solution, so we can be sure the Linux half is compatible with it.

I believe Dario has hacked up some prototypes (on x86) at some point.
But I don't believe there will be terribly much linkage between the
guest and hypervisor halves beyond having the guest side arrange for as
many 2MB slots as possible to be completely populated, such that the
hypervisor has opportunities to do compaction.

The compaction (both guest and Xen side) is not the first order issue
here though.

The first thing we should be doing is trying to balloon up and down in
2M increments in the first place wherever possible. That includes things
like alloc_xenballooned_pages operating in 2M increments under the hood,
so that things like grant maps, which require 4K p2m entries, are
condensed into the smallest number of 2M regions possible. IOW, having
been forced to fragment a region, use it for as many other 4K mappings
as possible (see the strawman sketch below).
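
As a strawman of what I mean by that last point (none of these names
exist in the real driver; it is only there to illustrate the allocation
policy):

    /* Hypothetical sketch: hand out 4K "special" pages (grant maps etc.)
     * from a 2M chunk that is already fragmented before breaking a
     * pristine one, so the damage is confined to as few 2M regions as
     * possible. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGES_PER_2M 512

    struct chunk {
        uint64_t base_pfn;            /* 2M-aligned guest pfn */
        bool used[PAGES_PER_2M];      /* which 4K slots are handed out */
        unsigned int nr_used;
    };

    /* Prefer a partially used chunk; fall back to an empty one. */
    static struct chunk *pick_chunk(struct chunk *chunks, size_t n)
    {
        struct chunk *fallback = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
            if (chunks[i].nr_used > 0 && chunks[i].nr_used < PAGES_PER_2M)
                return &chunks[i];    /* reuse an already fragmented chunk */
            if (chunks[i].nr_used == 0 && !fallback)
                fallback = &chunks[i];
        }
        return fallback;              /* only now break a fresh 2M chunk */
    }

    static uint64_t alloc_4k_pfn(struct chunk *chunks, size_t n)
    {
        struct chunk *c = pick_chunk(chunks, n);
        unsigned int slot;

        if (!c)
            return 0;                 /* pool exhausted */
        for (slot = 0; slot < PAGES_PER_2M; slot++) {
            if (!c->used[slot]) {
                c->used[slot] = true;
                c->nr_used++;
                return c->base_pfn + slot;
            }
        }
        return 0;                     /* unreachable: pick_chunk found space */
    }

The real interface would obviously have to cope with chunks coming and
going, but the policy is the interesting bit: 4K fragmentation gets
concentrated rather than sprinkled across the p2m.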

However we know that we are not always going to be able to allocate a 2M
page to balloon out, and we need to be prepared to mitigate this, which
is where compaction comes in.

Compaction on the guest side serves two purposes immediately, even
without hypervisor-side compaction. Firstly, it increases the chances of
being able to allocate a 2M page when we need to balloon one out, either
right now or at some point in the future; IOW it helps towards the goal
of doing as much ballooning as possible in 2M chunks.

Secondly, it means that we will end up with contiguous 2M holes, which
gives future balloon-up operations the opportunity to end up with 2M
mappings. This is useful in its own right even if it is neutral wrt the
fragmentation of the currently populated 2M regions (and we know it
can't make things worse in that regard).

I think it is important to realise that this is an independently useful
change which is also a prerequisite for some interesting future work. I
think blocking this work now pending the completion of that future
interesting work is unreasonable.

> Before accepting any series I would also need to see real world
> performance improvements, not just theoretical ones.

I think the interesting statistics here will be:

      * The numbers of 4K and 2M mappings used by the domain's p2m
        (since it is well established that 2M mappings improve
        performance in multiple workloads on multiple architectures,
        there is no need to reproduce that result yet again IMHO).
      * The numbers of completely depopulated 2M regions, which
        represent the potential for improved mappings when ballooning
        back up.
      * The numbers of completely populated 2M regions in the p2m, which
        represent opportunities for the hypervisor to make further
        improvements *in the future*.
      * The numbers of 2M regions which consist either solely of 4K
        mappings of RAM + holes or solely of "special" 4K mappings
        (grant mappings etc.) + holes.

Those last three are somewhat complementary I think.

The ARM p2m tracks the numbers of each size of mapping for a given
domain. The number of holes/full regions is not tracked, but a debug
hypercall or console keyhandler could quite easily scan for them based
on the p2m type associated with each entry, along the lines of the
sketch below.
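
Something like this (the p2m type names and accessors here are made up
for illustration; the real Xen interfaces differ) is the classification
I have in mind:

    /* Hypothetical sketch of classifying one 2M-aligned group of 512
     * p2m entries into the buckets listed above. */
    #define ENTRIES_PER_2M 512

    enum entry_type { ENTRY_HOLE, ENTRY_RAM, ENTRY_SPECIAL /* grant map etc. */ };

    enum region_class {
        REGION_EMPTY,          /* fully depopulated: 2M mapping candidate on balloon-up */
        REGION_FULL_RAM,       /* fully populated with RAM: coalescing candidate */
        REGION_RAM_SPARSE,     /* only 4K RAM mappings + holes */
        REGION_SPECIAL_SPARSE, /* only "special" 4K mappings + holes */
        REGION_MIXED,
    };

    static enum region_class classify_2m(const enum entry_type e[ENTRIES_PER_2M])
    {
        unsigned int ram = 0, special = 0, holes = 0, i;

        for (i = 0; i < ENTRIES_PER_2M; i++) {
            switch (e[i]) {
            case ENTRY_RAM:     ram++;     break;
            case ENTRY_SPECIAL: special++; break;
            default:            holes++;   break;
            }
        }

        if (holes == ENTRIES_PER_2M)
            return REGION_EMPTY;
        if (ram == ENTRIES_PER_2M)
            return REGION_FULL_RAM;
        if (special == 0)
            return REGION_RAM_SPARSE;
        if (ram == 0)
            return REGION_SPECIAL_SPARSE;
        return REGION_MIXED;
    }

A key handler would then just walk the p2m in 2M strides, bump a counter
per class and dump the totals on the console.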

Ian.

