
Re: [Xen-devel] Proposed new "memory capacity claim" hypercall/feature

> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@xxxxxxxx]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!

I've slightly re-ordered the email to focus on the problem
(moved tmem-specific discussion to the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing. So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough.  I will do some measurement and analysis of this
code.  However, let me ask something of you and Keir as well:
Please estimate how long (in usec) you think it is acceptable
to hold the heap_lock.  If your limit is very small (as I expect),
doing anything "N" times in a loop with the lock held (for N==2^26,
the number of 4KB pages in a 256GB domain) may make the analysis moot.
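
To make the N==2^26 figure concrete, here is a rough back-of-envelope
sketch.  It assumes 4KB pages; the per-page costs are purely
hypothetical placeholders (nothing here has been measured yet), and
the helper names are mine, not anything in the Xen tree.

```python
PAGE_SIZE = 4096  # bytes; assumes 4KB pages


def pages_in_domain(gib):
    """Number of 4KB pages needed to back a domain of `gib` gibibytes."""
    return (gib << 30) // PAGE_SIZE


def lock_hold_usec(n_pages, nsec_per_page):
    """Total time (usec) a lock is held if each page costs nsec_per_page."""
    return n_pages * nsec_per_page / 1000.0


n = pages_in_domain(256)
assert n == 2**26  # matches the N quoted above

# Hypothetical per-page costs in nanoseconds; real numbers need measuring.
for cost in (1, 10, 100):
    print(f"{cost:>3} ns/page -> {lock_hold_usec(n, cost):,.0f} usec held")
```

Even at an optimistic 1 ns per page the loop holds the lock for
roughly 67 msec, so any acceptable limit in the tens of usec rules
out a per-page loop under the lock regardless of how well it is tuned.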

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... proof-of-concept.
I'm envisioning simple arithmetic, but maybe you are
right and arithmetic will not be sufficient.
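
As a first approximation of what I mean by "simple arithmetic", here
is a toy model (not Xen code, and not a proposed interface).  The
class and method names are hypothetical.  The only point is that
staking a claim is a constant-time comparison and an addition under
the lock, while other domains can keep allocating from whatever is
not claimed.

```python
import threading

PAGE_SIZE = 4096  # bytes; assumes 4KB pages


def gib(n):
    """Convert gibibytes to a count of 4KB pages."""
    return (n << 30) // PAGE_SIZE


class ToyHeap:
    """Toy page heap with claims; a sketch, not Xen's allocator."""

    def __init__(self, total_pages):
        self.lock = threading.Lock()
        self.free_pages = total_pages
        self.total_claimed = 0  # pages promised but not yet allocated
        self.claimed = {}       # domain id -> outstanding claim

    def claim(self, dom, n_pages):
        """Reserve capacity for dom in O(1); no pages are touched."""
        with self.lock:
            old = self.claimed.get(dom, 0)
            if n_pages > self.free_pages - (self.total_claimed - old):
                return False
            self.total_claimed += n_pages - old
            self.claimed[dom] = n_pages
            return True

    def alloc(self, dom, n_pages):
        """Allocate pages; dom may draw on its own claim, not on others'."""
        with self.lock:
            own = self.claimed.get(dom, 0)
            if n_pages > self.free_pages - self.total_claimed + own:
                return False
            used = min(own, n_pages)
            self.claimed[dom] = own - used
            self.total_claimed -= used
            self.free_pages -= n_pages
            return True


heap = ToyHeap(gib(256))
assert heap.alloc("A", gib(4))        # A has ballooned down to 4GB
assert heap.claim("B", gib(64))       # B stakes its claim up front
assert heap.alloc("B", gib(16))       # B's launch is partway done...
assert heap.alloc("A", gib(60))       # ...and A can still balloon up
assert heap.alloc("B", gib(48))       # B's remaining pages are guaranteed
assert not heap.claim("C", gib(200))  # only 128GB left, so C fails fast
```

The usage at the bottom replays the 256GB scenario quoted further
down: "A" balloons back up in the middle of launching "B" without
either being frozen, and an oversized request fails immediately
rather than after a long partial allocation.  Whether the real
allocator can keep such counters consistent is exactly what the
proof-of-concept needs to show.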

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A".  The guest is idle for now, so slowly
> > selfballoons down to maybe 4GB.  You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible or it can't balloon fast
> > enough (or at all if "frozen" as suggested) so starts swapping (and,
> > thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> > memory).  But ballooning and tmem are both blocked and so the
> > guest swaps its poor little butt off even though there's >100GB
> > of free physical memory available.

> That's only one side of the overcommit situation you're striving
> to get to work right here: That same self ballooning guest, after
> sufficiently more guests got started so that the rest of the memory
> got absorbed by them, would suffer the very same problems in
> the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case
anyway... the guest swaps.  And, unlike page-sharing, the guest
determines which pages to swap, not the host, and there is no
possibility of double-paging.

In your scenario, the host memory is truly oversubscribed.  This
scenario is ultimately a weakness of virtualization in general;
trying to statistically share an oversubscribed fixed resource
among a number of guests will sometimes cause a performance
degradation, whether the resource is CPU or LAN bandwidth or,
in this case, physical memory.  That very generic problem
is, I think, not one any of us can solve.  Toolstacks need to
be able to recognize the problem (whether CPU, LAN, or memory)
and act accordingly (report, or auto-migrate).

In my scenario, guest performance is hammered only because of
the unfortunate deficiency in the existing hypervisor memory
allocation mechanisms, namely that small allocations must
be artificially "frozen" until a large allocation can complete.
That specific problem is one I am trying to solve.

BTW, with tmem, some future toolstack might monitor various
available tmem statistics and predict/avoid your scenario.

