
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> Sent: Thursday, January 10, 2013 3:32 AM
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions

Hi Ian --

Your email contains what I think is the most detailed description
of the mechanism of your proposal that I've seen yet, so I now
understand it better than before.  Thanks for that.

I'm still quite concerned about the policy issues, however, as
well as the unintended consequences of interactions between your
proposal and existing guest->hypervisor mechanisms, including
tmem, in-guest ballooning, and (possibly) page-sharing.

So thanks much for continuing the discussion and please read on...

> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> > Then a second premise that I would like to check to ensure we
> > agree:  In the Oracle model, as I said, "open source guest kernels
> > can intelligently participate in optimizing their own memory usage...
> > such guests are now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).
> > With these mechanisms, there is direct guest->hypervisor interaction
> > that, without knowledge of the toolstack, causes d->tot_pages
> > to increase.  This interaction may (and does) occur from several
> > domains simultaneously and the increase for any domain may occur
> > frequently, unpredictably and sometimes dramatically.
> 
> Agreed.

OK, for brevity, I'm going to call these (guest->hypervisor interactions
that cause d->tot_pages to increase) "dynamic allocations".

> > Ian, do you agree with this premise and that a "capacity allocation
> > solution" (whether hypervisor-based or toolstack-based) must work
> > properly in this context?
> 
> > Or are you maybe proposing to eliminate all such interactions?
> 
> I think these interactions are fine. They are obviously a key part of
> your model. My intention is to suggest a possible userspace solution to
> the claim proposal which continues to allow this behaviour.

Good.  I believe George suggested much earlier in this thread that
such interactions should simply be disallowed, which made me a bit cross.
(I may also have misunderstood.)
 
> > Or are you maybe proposing to insert the toolstack in the middle of
> > all such interactions?
> 
> Not at all.

Good.  I believe Ian Jackson's proposal much earlier in a related thread
was something along these lines.  (Again, I may have misunderstood.)

So, Ian, for the sake of argument below, please envision a domain
in which d->tot_pages varies across time like a high-frequency,
high-amplitude sine wave.  By bad luck, when d->tot_pages is sampled
at t=0, it is at the minimum point of the sine wave.
For brevity, let's call this a "worst-case domain."  (I realize
it is contrived, but neither is it completely unrealistic.)

And, as we've agreed, the toolstack is completely unaware of this
sine wave behavior.
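
(A concrete, if invented, illustration: a selfballooning guest whose
d->tot_pages swings between 512MB and 2GB every few seconds as its
workload spikes and subsides.  A toolstack that samples it at a trough
sees 512MB and has no hint that the guest will legitimately climb back
toward 2GB moments later.)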

> > Next, in your most recent reply, I think you skipped replying to my
> > comment of "[in your proposal] the toolstack must make intelligent
> > policy decisions about how to vary current_maxmem relative to
> > lifetime_maxmem, across all the domains on the system [1]".  We
> > seem to disagree on whether this need only be done twice per domain
> > launch (once at domain creation start and once at domain creation
> > finish, in your proposal) vs. more frequently.  But in either case,
> > do you agree that the toolstack is not equipped to make policy
> > decisions across multiple guests to do this
> 
> No, I don't agree.

OK, so then this is an important point of discussion.  You believe
the toolstack IS equipped to make policy decisions across multiple
guests.  Let's get back to that in a minute.

> > and that poor choices may have dire consequences (swapstorm, OOM) on a
> > guest?
> 
> Setting maxmem on a domain does not immediately force a domain to that
> amount of RAM and so the act of doing setting maxmem is not going to
> cause a swap storm. (I think this relates to the "distinction between
> current_maxmem and lifetime_maxmem was added for Citrix DMC support"
> patch you were referring too below, previously to that Xen would reject
> attempts to set max < current)

Agreed that it doesn't "immediately force a domain", but let's
leave "not going to cause a swap storm" open as a possible
point of disagreement.

> Setting maxmem doesn't even ask the domain to try and head for that
> limit (that is the target which is a separate thing). So the domain
> won't react to setting maxmem at all and unless it goes specifically
> looking I don't think it would even be aware that its maximum has been
> temporarily reduced.

Agreed, _except_ that during the period where its max_pages is temporarily
reduced (which, we've demonstrated earlier in a related thread, may
be a period of many minutes), there are now two differences:

1) if d->max_pages is set below d->tot_pages, all dynamic allocations
of the type that would otherwise cause d->tot_pages to increase will
now fail, and
2) if d->max_pages is set "somewhat" higher than d->tot_pages, the
possible increase of d->tot_pages has now been constrained; some
dynamic allocations will succeed and some will fail.

Do you agree that there is a possibility that these differences
may result in unintended consequences?
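
To spell out what I mean by "fail" here, the existing hypervisor check
behaves roughly like the following.  This is a simplified sketch with
stand-in names (dom_sketch, sketch_assign_pages), not the actual Xen
allocator code, just to show where a dynamic allocation bounces off a
temporarily lowered d->max_pages:

/* Simplified sketch, not the actual Xen allocator code. */
struct dom_sketch {
    unsigned long tot_pages;   /* pages currently allocated to the domain */
    unsigned long max_pages;   /* current (possibly temporary) ceiling */
};

/* Conceptually called for every balloon/tmem/CoW page the guest gains. */
static int sketch_assign_pages(struct dom_sketch *d, unsigned long nr_pages)
{
    if (d->tot_pages + nr_pages > d->max_pages)
        return -1;              /* case 1: the dynamic allocation fails */
    d->tot_pages += nr_pages;   /* case 2: it succeeds, invisibly to the
                                   toolstack, until the ceiling is hit */
    return 0;
}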

> Having set all the maxmem's on the domains you would then immediately
> check if each domain has tot_pages under or over the temporary maxmem
> limit.
>
> If all domains are under then the claim has succeeded and you may
> proceed to build the domain. If any one domain is over then the claim
> has failed and you need to reset all the maxmems back to the lifetime
> value and try again on another host (I understand that this is an
> accepted possibility with the h/v based claim approach too).

NOW you are getting into policy.  You say "set all the maxmem's on
the domains" and "immediately check each domain tot_pages".  Let me
interpret this as a policy statement and try to define it more precisely:

1) For the N domains running on the system (and N may be measured in
   the hundreds), you must select L domains (where 1<=L<=N) and, for
   each, make a hypercall to change d->max_pages.  How do you
   propose to select these L?  Or, in your proposal, is L==N
   (i.e. L may also be >100)?
2) For each of the L domains, you must decide _how much_ to
   decrease d->max_pages.  (How do you propose to do this?  Maybe
   decrease each by the same amount, M-divided-by-L?)
3) You now make L (or is it N?) hypercalls to read each d->tot_pages.
4) I may be wrong, but I assume _before_ you decrease d->max_pages
   you will likely want to sample d->tot_pages for each of the L
   domains to inform your selection process in (1) and (2) above.
   If so, for each of the L (possibly N?) domains, a hypercall is
   required to check d->tot_pages, and a TOCTOU race is introduced
   because tot_pages may change unless and until you set d->max_pages
   lower than d->tot_pages.
5) Since the toolstack is unaware of dynamic allocations, your
   proposal might unwittingly decrease d->max_pages on a worst-case
   domain to the point where max_pages is much lower than the
   peak of the sine wave, and this constraint may be imposed for
   several minutes, potentially causing swapping or OOMs for our
   worst-case domains.  (Do you still disagree?)
6) You are imposing the above constraints on _all_ toolstacks.

Also, I'm not positive I understand, but it appears that your
solution as outlined will have false negatives; i.e. your
algorithm will cause some claims to fail when there is
actually sufficient RAM (in the case of "if any ONE domain is
over").  But unless you specify your selection criteria more
precisely, I don't know.
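
So that we are critiquing the same algorithm, here is my reading of your
sequence as a rough C sketch.  The helpers get_tot_pages(),
get_max_pages(), set_max_pages() and lifetime_max() are hypothetical
stand-ins for the real toolstack plumbing, and the policy in steps (1)
and (2) is deliberately the naive one (L==N, M-divided-by-L each):

/* Hypothetical helpers standing in for the real toolstack plumbing. */
unsigned long get_tot_pages(int domid);
unsigned long get_max_pages(int domid);
unsigned long lifetime_max(int domid);
void set_max_pages(int domid, unsigned long max_pages);

/* My reading of the toolstack-based claim of claim_pages for a new domain. */
int sketch_toolstack_claim(const int *doms, int n, unsigned long claim_pages)
{
    int i, j;

    /* Steps (1),(2),(4): pick the L domains (naively, L == N here),
     * decide how much each gives up (naively, claim_pages / n), and
     * lower its ceiling.  Any tot_pages sampled to inform this choice
     * is already stale (TOCTOU) by the time the new ceiling is written. */
    for (i = 0; i < n; i++)
        set_max_pages(doms[i], lifetime_max(doms[i]) - claim_pages / n);

    /* Step (3): if any single domain already exceeds its new ceiling,
     * the claim fails and every ceiling must be rolled back. */
    for (i = 0; i < n; i++) {
        if (get_tot_pages(doms[i]) > get_max_pages(doms[i])) {
            for (j = 0; j < n; j++)
                set_max_pages(doms[j], lifetime_max(doms[j]));
            return -1;   /* possible false negative: the host may still
                            have had enough free RAM in total */
        }
    }
    return 0;   /* claim held; build the domain, then restore the
                   lifetime maximums on all n domains */
}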

In sum, this all seems like a very high price to pay to avoid
less than a hundred lines of code (plus comments) in the
hypervisor.

> I forgot to say but you'd obviously want to use whatever controls tmem
> provides to ensure it doesn't just gobble up the M bytes needed for the
> new domain. It can of course continue to operate as normal on the
> remainder of the spare RAM.

Hmmm... so you want to shut off _all_ dynamic allocations for
a period of possibly several minutes?  And how does tmem know
what the "remainder of the spare RAM" is... isn't that information
now held only by the toolstack?  Forgive me if I am missing something
obvious, but in any case...

Tmem does have a gross, ham-handed freeze/thaw mechanism to do this
via tmem hypercalls.  But AFAIK there is no equivalent mechanism for
controlling in-guest ballooning (nor for shared-page CoW resolution).
Reserving the M bytes in the hypervisor (as the proposed
XENMEM_claim_pages does) is atomic, so it avoids any TOCTOU races,
eliminates the need for tmem freeze/thaw, and solves the problem for
in-guest-kernel selfballooning, all at the same time.  (And, I think,
for shared-page CoW resolution as well.)
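
To show what I mean by "atomic", here is roughly what the in-hypervisor
reservation boils down to.  This is a user-space-flavored sketch with
stand-in names, not the actual XENMEM_claim_pages patch, and the pthread
mutex merely plays the role of the allocator's heap lock:

#include <pthread.h>

/* Stand-in names and a stand-in lock, not the actual patch. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_free_pages;    /* maintained by the allocator */
static unsigned long outstanding_claims;  /* pages promised to a domain
                                             being built, not yet allocated */

/* The claim: one check-and-reserve under the lock, so there is no
 * TOCTOU window and no per-domain max_pages needs to be touched. */
int sketch_claim_pages(unsigned long nr_pages)
{
    int rc = -1;

    pthread_mutex_lock(&heap_lock);
    if (total_free_pages - outstanding_claims >= nr_pages) {
        outstanding_claims += nr_pages;
        rc = 0;
    }
    pthread_mutex_unlock(&heap_lock);
    return rc;
}

/* Dynamic allocations (ballooning, tmem, CoW) keep working; they are
 * simply tested against free-minus-claimed rather than being frozen
 * or squeezed by temporary per-domain ceilings. */
int sketch_dynamic_alloc(unsigned long nr_pages)
{
    int rc = -1;

    pthread_mutex_lock(&heap_lock);
    if (total_free_pages - outstanding_claims >= nr_pages) {
        total_free_pages -= nr_pages;
        rc = 0;
    }
    pthread_mutex_unlock(&heap_lock);
    return rc;
}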
 
One more subtle but very important point, especially in the
context of memory overcommit:  Your toolstack-based proposal
explicitly constrains the growth of L independent domains.
This is a sum-of-maxes constraint.  The hypervisor-based proposal
constrains only the _total_ growth of the N domains and is thus
a max-of-sums constraint.  Statistically, for any resource
management problem, a max-of-sums constraint allows much, much
more flexibility.  So even academically speaking, the
hypervisor solution is superior.  (If that's clear as mud,
please let me know and I can try to explain further.)
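
A made-up example of the difference: suppose a 16GB host runs four
guests whose tot_pages each wander between 2GB and 4GB, and we want to
claim 2GB for a new domain.  The toolstack scheme must cap each guest,
say at 3.5GB apiece (4 x 3.5GB + 2GB = 16GB), so one guest spiking
toward 4GB fails even when the other three are idling at 2GB each,
total usage is under 10GB, and the host has over 4GB free beyond the
2GB claim.  The hypervisor claim only insists that the four guests
together stay under 14GB, so that same spike succeeds.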

Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

