
Re: [Xen-devel] Proposed new "memory capacity claim" hypercall/feature

> From: Keir Fraser [mailto:keir.xen@xxxxxxxxx]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> On 30/10/2012 00:03, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx> wrote:
> >> From: Keir Fraser [mailto:keir@xxxxxxx]
> >> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> >>
> >> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx> wrote:
> >>
> >> Well it does depend how scalable domain creation actually is as an
> >> operation. If it is spending most of its time allocating memory then it is
> >> quite likely that parallel creations will spend a lot of time competing for
> >> the heap spinlock, and actually there will be little/no speedup compared
> >> with serialising the creations. Further, if domain creation can take
> >> minutes, it may be that we simply need to go optimise that -- we already
> >> found one stupid thing in the heap allocator recently that was burning
> >> loads of time during large-memory domain creations, and fixed it for a
> >> massive speedup in that particular case.
> >
> > I suppose ultimately it is a scalability question.  But Oracle's
> > measure of success here is based on how long a human or a tool
> > has to wait for confirmation to ensure that a domain will
> > successfully launch.  If two domains are launched in parallel
> > AND an indication is given that both will succeed, spinning on
> > the heaplock a bit just makes for a longer "boot" time, which is
> > just a cost of virtualization.  If they are launched in parallel
> > and, minutes later (or maybe even 20 seconds later), one or
> > both say "oops, I was wrong, there wasn't enough memory, so
> > try again", that's not OK for data center operations, especially if
> > there really was enough RAM for one, but not for both. Remember,
> > in the Oracle environment, we are talking about an administrator/automation
> > overseeing possibly hundreds of physical servers, not just a single
> > user/server.
> >
> > Does that make more sense?
> Yes, that makes sense.


So, not to beat a dead horse, but let me re-emphasize that the problem
exists even without considering tmem.  I wish to solve the problem,
but would like to do it in a way which also resolves a similar problem
for tmem.  I think the "claim" approach does that.
> > The "claim" approach immediately guarantees success or failure.
> > Unless there are enough "stupid things/optimisations" found that
> > you would be comfortable putting memory allocation for a domain
> > creation in a hypervisor spinlock, there will be a race unless
> > an atomic mechanism exists such as "claiming" where
> > only simple arithmetic must be done within a hypervisor lock.
> >
> > Do you disagree?
> >
> >>> and (2) tmem and/or other dynamic
> >>> memory mechanisms may be asynchronously absorbing small-but-significant
> >>> portions of RAM for other purposes during an attempted domain launch.
> >>
> >> This is an argument against allocate-rather-than-reserve? I don't think that
> >> makes sense -- so is this instead an argument against
> >> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
> >> need reservations *at all*, before we get down to where it should be
> >> implemented.
> >
> > I'm not sure if we are defining terms the same, so that's hard
> > to answer.  If you define "allocation" as "a physical RAM page frame
> > number is selected (and possibly the physical page is zeroed)",
> > then I'm not sure how your definition of "reservation" differs
> > (because that's how increase/decrease_reservation are implemented
> > in the hypervisor, right?).
> >
> > Or did you mean "allocate-rather-than-claim" (where "allocate" is
> > select a specific physical pageframe and "claim" means do accounting
> > only?  If so, see the atomicity argument above.
> >
> > I'm not just arguing against reservation-as-a-toolstack-mechanism,
> > I'm stating I believe unequivocally that reservation-as-a-toolstack-
> > only-mechanism and tmem are incompatible.  (Well, not _totally_
> > incompatible... the existing workaround, tmem freeze/thaw, works
> > but is also single-threaded and has fairly severe unnecessary
> > performance repercussions.  So I'd like to solve both problems
> > at the same time.)
> Okay, so why is tmem incompatible with implementing claims in the toolstack?

(Hmmm... maybe I could schedule the equivalent of a PhD qual exam
for tmem with all the core Xen developers as examiners?)

The short answer is tmem moves memory capacity around far too
frequently to be managed by a userland toolstack, especially if
the "controller" lives on a central "manager machine" in a
data center (Oracle's model).  The ebb and flow of memory supply
and demand for each guest is instead managed entirely dynamically.

The somewhat longer answer (and remember all of this is
implemented and upstream in Xen and Linux today):

First, in the tmem model, each guest is responsible for driving
its memory utilization (what the Xen tools call "current" and the
hypervisor calls "tot_pages") as low as it can.  This is done
in Linux with selfballooning.  At 50Hz (the default), the guest
kernel attempts to expand or contract the balloon to match its
current demand for memory.  Agreed, one guest requesting changes
at 50Hz could probably be handled by a userland toolstack, but
what about 100 guests?  Maybe... but there's more.
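To make the feedback loop concrete, here is a minimal sketch of one
selfballooning tick.  This is illustrative only -- the function name,
units, and step bound are hypothetical, not the actual Linux
selfballoon driver code -- but it shows why capacity ebbs and flows
continuously: each interval the guest nudges its reservation a bounded
step toward its own demand estimate.

```c
#include <assert.h>

/*
 * Hypothetical sketch of one selfballooning tick (not the real
 * Linux driver).  Each interval, move the guest's current memory
 * reservation a bounded step toward its estimated demand: shrinking
 * returns pages to Xen, growing reclaims pages from Xen.
 */
unsigned long selfballoon_step(unsigned long current_kb,
                               unsigned long demand_kb,
                               unsigned long max_step_kb)
{
    if (current_kb > demand_kb) {
        unsigned long shrink = current_kb - demand_kb;
        if (shrink > max_step_kb)
            shrink = max_step_kb;
        return current_kb - shrink;   /* give pages back to Xen */
    } else {
        unsigned long grow = demand_kb - current_kb;
        if (grow > max_step_kb)
            grow = max_step_kb;
        return current_kb + grow;     /* take pages back from Xen */
    }
}
```

Run at 50Hz across 100 guests, even this trivial arithmetic implies
thousands of capacity adjustments per second machine-wide.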

Second, in the tmem model, each guest is making tmem hypercalls
at a rate of perhaps thousands per second, driven by the kernel's
memory management internals.  Each call deals with a single page
of memory, and each may remove a page from (or return a page to)
Xen's free list.  Interacting with a userland toolstack for every
page is simply not feasible at such a high frequency, even for a
single guest.

Third, tmem in Xen implements both compression and deduplication
so each attempt to put a page of data from the guest into
the hypervisor may or may not require a new physical page.
Only the hypervisor knows.
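The point can be sketched as a decision tree.  The names below are
hypothetical (this is not the actual tmem code path), but they show
why the toolstack cannot predict the outcome of a put: whether a
fresh physical page is consumed depends on dedup and compression
results known only inside the hypervisor at the moment of the call.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of a tmem "put" (illustrative sketch only). */
enum put_result {
    PUT_DEDUPED,   /* identical page already stored: share it, no allocation */
    PUT_PACKED,    /* compressed data fits in a partly-used pool page        */
    PUT_NEW_PAGE   /* must take a fresh page off Xen's free list             */
};

/*
 * Hypothetical decision logic: only the hypervisor, at put time,
 * knows whether the page deduplicates or how well it compresses.
 */
enum put_result tmem_put_outcome(bool duplicate_found,
                                 unsigned long compressed_len,
                                 unsigned long pool_space_left)
{
    if (duplicate_found)
        return PUT_DEDUPED;
    if (compressed_len <= pool_space_left)
        return PUT_PACKED;
    return PUT_NEW_PAGE;
}
```

Two of the three outcomes consume no new page at all, so free-list
pressure from tmem is inherently unpredictable from userland.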

So, even on a single machine, tmem is moving memory capacity
around at a very high frequency.  A userland toolstack can't
possibly keep track, let alone hope to control it; that would
entirely defeat the value of tmem.  It would be like requiring
the toolstack to participate in every vcpu->pcpu transition in
the Xen cpu scheduler.

Does that make sense and answer your question?

Anyway, I think the proposed "claim" hypercall/subop neatly
solves the problem of races between large-chunk memory demands
(i.e. large domain launches) and small-chunk memory demands
(i.e. small domain launches and single-page tmem allocations).
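For clarity, here is a minimal sketch of the accounting I have in
mind.  The struct and function names are hypothetical, not the
proposed hypercall's actual interface; the point is that success or
failure is decided immediately with simple arithmetic under one lock,
with no page allocation inside the critical section.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical heap accounting for the "claim" idea (sketch only). */
struct heap_account {
    unsigned long total_pages;    /* all RAM pages managed by the heap */
    unsigned long free_pages;     /* pages currently unallocated       */
    unsigned long claimed_pages;  /* pages promised but not yet given out */
    /* lock omitted: single-threaded sketch; in Xen this arithmetic
     * would run under the heap spinlock */
};

/* Atomically claim capacity for a domain about to be built. */
bool claim_pages(struct heap_account *h, unsigned long nr)
{
    if (h->free_pages - h->claimed_pages < nr)
        return false;             /* immediate, race-free failure */
    h->claimed_pages += nr;
    return true;                  /* immediate guarantee of success */
}

/* As the builder actually allocates pages, the claim is drawn down. */
void consume_claim(struct heap_account *h, unsigned long nr)
{
    h->claimed_pages -= nr;
    h->free_pages    -= nr;
}
```

Single-page demands (tmem puts, small ballooning adjustments) simply
allocate against free_pages - claimed_pages, so they can never eat
the capacity a large launch has already been promised.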


Xen-devel mailing list


