
Re: [Xen-devel] Analysis of using balloon page compaction in Xen balloon driver



On Thu, 2014-10-16 at 18:46 +0100, David Vrabel wrote:
> On 16/10/14 18:12, Wei Liu wrote:
> > This document analyses the impact of using balloon compaction
> > infrastructure in Xen balloon driver.
> 
> Thanks for writing this.  This is an excellent starting point for a
> productive design discussion.
> 
> > ## Benefit for auto-translated guest
> > 
> > HVM/PVH/ARM guests can have a contiguous guest physical address space
> > after balloon pages are compacted, which potentially improves memory
> > performance provided the guest makes use of huge pages, either via
> > Hugetlbfs or Transparent Huge Pages (THP).
> > 
> > Consider the memory access pattern of these guests: one access to a
> > guest physical address involves several accesses to machine memory.
> > The total number of memory accesses can be represented as:
> > 
> >> X = H1 * G1 + H2 * G2 + ... + Hn * Gn + 1
> > 
> > where Hx denotes the number of second-stage page table walk levels
> > and Gx the number of guest page table walk levels.
> > 
> > By having a contiguous guest physical address space, the guest can
> > make use of huge pages. This can reduce the number of G's in the
> > formula.
> > 
> > Reducing the number of H's is a separate hypervisor-side improvement
> > and should be decoupled from the Linux-side changes.
> 
> Whilst this analysis is fine, I don't think this is the real benefit of
> using superpages, which is reducing TLB usage and reducing the number of
> TLB misses.

It depends a bit on whether the TLB caches partial walks etc., but more
importantly using super pages reduces the cost of a TLB miss, by
requiring fewer memory accesses on the walk.

> With fragmented stage 2 tables I don't think you will see much
> improvement in TLB usage.

I think you will save some: by cutting off a level of the stage 1 walk
you avoid the need for a stage 2 walk at that level, which might be 3-4
levels of lookup.

I expect it is not as significant a benefit as stage 2 superpages (which
save you accesses at every level of the stage 1 walk), but it will be
there.
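
To put some rough numbers on that, here is a toy model of the nested
walk cost (it assumes no partial-walk caching at all, which real TLBs do
have, so treat it as an upper bound rather than a measurement):

    /* Toy model: every guest table pointer and the final guest physical
     * address each need a full stage-2 walk, so a miss with a G-level
     * guest walk and an H-level stage-2 walk costs roughly
     * (G + 1) * (H + 1) - 1 memory accesses. */
    #include <stdio.h>

    static unsigned int walk_cost(unsigned int g_levels, unsigned int h_levels)
    {
        return (g_levels + 1) * (h_levels + 1) - 1;
    }

    int main(void)
    {
        printf("4-level guest, 4-level stage 2: %u\n", walk_cost(4, 4)); /* 24 */
        printf("2M guest pages (3-level walk) : %u\n", walk_cost(3, 4)); /* 19 */
        printf("2M at stage 2 as well         : %u\n", walk_cost(3, 3)); /* 15 */
        return 0;
    }

So guest-side 2M pages shave a chunk off every miss, and stage 2
superpages shave a bit more off every level of the remaining walk.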

> > ## HAP table fragmentation is not made worse
> 
> This reasoning looks reasonable to me.  But it suggests that the balloon
> compaction isn't doing its job properly.  It seems like it should be
> much more proactive in resolving fragmentation.

Speaking to some KVM folks here at Plumbers, it seems they find
compaction to be working pretty well for them, but there are some /proc
knobs one has to twiddle to make it more aggressive (sadly none of them
are around right now so I can't ask for pointers to said knobs).

> > ## Beyond Linux balloon compaction infrastructure
> > 
> > Currently there's no mechanism in Xen to coalesce HAP table
> > entries. To coalesce HAP entries we would need to make sure all the
> > discrete entries belong to one huge page, are in the correct order
> > and are in the correct state.
> 
> I would like to see a more detailed description of the Xen-side
> solution, so we can be sure the Linux half is compatible with it.

I believe Dario has hacked up some prototypes (on x86) at some point.
But I don't believe there will be terribly much linkage between the
guest and hypervisor halves beyond having the guest side arrange for as
many 2MB slots as possible to be completely populated, such that the
hypervisor has opportunities to do compaction.

The compaction (both guest and Xen side) is not the first order issue
here though.

The first thing we should be doing is trying to balloon up and down in
2M increments in the first place wherever possible. That includes things
like alloc_xenballooned_pages operating in 2M increments under the hood,
so that things like grant maps, which require 4K p2m entries, are
condensed into the smallest number of 2M regions possible. IOW, having
been forced to fragment a region, use it for as many other 4K mappings
as possible (see the strawman sketch below).
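
As a strawman of what I mean by that last point (none of these names
exist in the real driver; it is only there to illustrate the allocation
policy):

    /* Hypothetical sketch: hand out 4K "special" pages (grant maps etc.)
     * from a 2M chunk that is already fragmented before breaking a
     * pristine one, so the damage is confined to as few 2M regions as
     * possible. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGES_PER_2M 512

    struct chunk {
        uint64_t base_pfn;            /* 2M-aligned guest pfn */
        bool used[PAGES_PER_2M];      /* which 4K slots are handed out */
        unsigned int nr_used;
    };

    /* Prefer a partially used chunk; fall back to an empty one. */
    static struct chunk *pick_chunk(struct chunk *chunks, size_t n)
    {
        struct chunk *fallback = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
            if (chunks[i].nr_used > 0 && chunks[i].nr_used < PAGES_PER_2M)
                return &chunks[i];    /* reuse an already fragmented chunk */
            if (chunks[i].nr_used == 0 && !fallback)
                fallback = &chunks[i];
        }
        return fallback;              /* only now break a fresh 2M chunk */
    }

    static uint64_t alloc_4k_pfn(struct chunk *chunks, size_t n)
    {
        struct chunk *c = pick_chunk(chunks, n);
        unsigned int slot;

        if (!c)
            return 0;                 /* pool exhausted */
        for (slot = 0; slot < PAGES_PER_2M; slot++) {
            if (!c->used[slot]) {
                c->used[slot] = true;
                c->nr_used++;
                return c->base_pfn + slot;
            }
        }
        return 0;                     /* unreachable: pick_chunk found space */
    }

The real interface would obviously have to cope with chunks coming and
going, but the policy is the interesting bit: 4K fragmentation gets
concentrated rather than sprinkled across the p2m.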

However we know that we are not always going to be able to allocate a 2M
page to balloon out, and we need to be prepared to mitigate this, which
is where compaction comes in.

Compaction on the guest side serves two purposes immediately, even
without hypervisor-side compaction. Firstly, it increases the chances of
being able to allocate a 2M page when we need to balloon one out, either
right now or at some point in the future; IOW it helps towards the goal
of doing as much ballooning as possible in 2M chunks.

Secondly, it means that we will end up with contiguous 2M holes, which
gives future balloon-up operations the opportunity to end up with 2M
mappings. This is useful in its own right even if it is neutral wrt the
fragmentation of the currently populated 2M regions (and we know it
can't make things worse in that regard).

I think it is important to realise that this is an independently useful
change which is also a prerequisite for some interesting future work. I
think blocking this work now pending the completion of that future
interesting work is unreasonable.

> Before accepting any series I would also need to see real world
> performance improvements, not just theoretical ones.

I think the interesting statistics here will be:

      * The numbers of 4K and 2M mappings used by the domain's p2m
        (since it is well established that 2M mappings improve
        performance in multiple workloads on multiple architectures,
        there is no need to reproduce that result yet again IMHO).
      * The numbers of completely depopulated 2M regions, which
        represent the potential for improved mappings when ballooning
        back up.
      * The numbers of completely populated 2M regions in the p2m, which
        represent opportunities for the hypervisor to make further
        improvements *in the future*.
      * The numbers of 2M regions which consist either solely of 4K
        mappings of RAM + holes or solely of "special" 4K mappings
        (grant mappings etc.) + holes.

Those last three are somewhat complementary I think.

The ARM p2m tracks the numbers of each size of mapping for a given
domain. The number of holes/full regions is not tracked, but a debug
hypercall or console keyhandler could quite easily scan for them based
on the p2m type associated with each entry, along the lines of the
sketch below.
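
Something like this (the p2m type names and accessors here are made up
for illustration; the real Xen interfaces differ) is the classification
I have in mind:

    /* Hypothetical sketch of classifying one 2M-aligned group of 512
     * p2m entries into the buckets listed above. */
    #define ENTRIES_PER_2M 512

    enum entry_type { ENTRY_HOLE, ENTRY_RAM, ENTRY_SPECIAL /* grant map etc. */ };

    enum region_class {
        REGION_EMPTY,          /* fully depopulated: 2M mapping candidate on balloon-up */
        REGION_FULL_RAM,       /* fully populated with RAM: coalescing candidate */
        REGION_RAM_SPARSE,     /* only 4K RAM mappings + holes */
        REGION_SPECIAL_SPARSE, /* only "special" 4K mappings + holes */
        REGION_MIXED,
    };

    static enum region_class classify_2m(const enum entry_type e[ENTRIES_PER_2M])
    {
        unsigned int ram = 0, special = 0, holes = 0, i;

        for (i = 0; i < ENTRIES_PER_2M; i++) {
            switch (e[i]) {
            case ENTRY_RAM:     ram++;     break;
            case ENTRY_SPECIAL: special++; break;
            default:            holes++;   break;
            }
        }

        if (holes == ENTRIES_PER_2M)
            return REGION_EMPTY;
        if (ram == ENTRIES_PER_2M)
            return REGION_FULL_RAM;
        if (special == 0)
            return REGION_RAM_SPARSE;
        if (ram == 0)
            return REGION_SPECIAL_SPARSE;
        return REGION_MIXED;
    }

A key handler would then just walk the p2m in 2M strides, bump a counter
per class and dump the totals on the console.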

Ian.

