
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Ian Campbell [mailto:Ian.Campbell@xxxxxxxxxx]
> Sent: Thursday, January 10, 2013 3:32 AM
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate
> solutions

Hi Ian --

Your email contains what I think is the most detailed description
of the mechanism of your proposal that I've seen yet, so I now
understand it better than before.  Thanks for that.

I'm still quite concerned about the policy issues, however, as
well as the unintended consequences of interactions between your
proposal and existing guest->hypervisor mechanisms, including
tmem, in-guest ballooning, and (possibly) page-sharing.

So thanks much for continuing the discussion and please read on...

> On Tue, 2013-01-08 at 19:41 +0000, Dan Magenheimer wrote:
> > Then a second premise that I would like to check to ensure we
> > agree:  In the Oracle model, as I said, "open source guest kernels
> > can intelligently participate in optimizing their own memory usage...
> > such guests are now shipping" (FYI Fedora, Ubuntu, and Oracle Linux).
> > With these mechanisms, there is direct guest->hypervisor interaction
> > that, without knowledge of the toolstack, causes d->tot_pages
> > to increase.  This interaction may (and does) occur from several
> > domains simultaneously and the increase for any domain may occur
> > frequently, unpredictably and sometimes dramatically.
> 
> Agreed.

OK, for brevity, I'm going to call these (guest->hypervisor interactions
that cause d->tot_pages to increase) "dynamic allocations".

> > Ian, do you agree with this premise and that a "capacity allocation
> > solution" (whether hypervisor-based or toolstack-based) must work
> > properly in this context?
> 
> > Or are you maybe proposing to eliminate all such interactions?
> 
> I think these interactions are fine. They are obviously a key part of
> your model. My intention is to suggest a possible userspace solution to
> the claim proposal which continues to allow this behaviour.

Good.  I believe George suggested much earlier in this thread that
such interactions should simply be disallowed, which made me a bit cross.
(I may also have misunderstood.)
 
> > Or are you maybe proposing to insert the toolstack in the middle of
> > all such interactions?
> 
> Not at all.

Good.  I believe Ian Jackson's proposal much earlier in a related thread
was something along these lines.  (Again, I may have misunderstood.)

So, Ian, for the sake of argument below, please envision a domain
in which d->tot_pages varies across time like a high-frequency,
high-amplitude sine wave.  By bad luck, when d->tot_pages is sampled
at t=0, it is at the minimum point of the sine wave.
For brevity, let's call this a "worst-case domain."  (I realize
it is contrived, but neither is it completely unrealistic.)

And, as we've agreed, the toolstack is completely unaware of this
sine wave behavior.
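
(A concrete, if invented, illustration: a selfballooning guest whose
d->tot_pages swings between 512MB and 2GB every few seconds as its
workload spikes and subsides.  A toolstack that samples it at a trough
sees 512MB and has no hint that the guest will legitimately climb back
toward 2GB moments later.)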

> > Next, in your most recent reply, I think you skipped replying to my
> > comment of "[in your proposal] the toolstack must make intelligent
> > policy decisions about how to vary current_maxmem relative to
> > lifetime_maxmem, across all the domains on the system [1]".  We
> > seem to disagree on whether this need only be done twice per domain
> > launch (once at domain creation start and once at domain creation
> > finish, in your proposal) vs. more frequently.  But in either case,
> > do you agree that the toolstack is not equipped to make policy
> > decisions across multiple guests to do this
> 
> No, I don't agree.

OK, so then this is an important point of discussion.  You believe
the toolstack IS equipped to make policy decisions across multiple
guests.  Let's get back to that in a minute.

> > and that poor choices may have dire consequences (swapstorm, OOM) on a
> > guest?
> 
> Setting maxmem on a domain does not immediately force a domain to that
> amount of RAM and so the act of doing setting maxmem is not going to
> cause a swap storm. (I think this relates to the "distinction between
> current_maxmem and lifetime_maxmem was added for Citrix DMC support"
> patch you were referring too below, previously to that Xen would reject
> attempts to set max < current)

Agreed that it doesn't "immediately force a domain", but let's
leave "not going to cause a swap storm" open as a possible
point of disagreement.

> Setting maxmem doesn't even ask the domain to try and head for that
> limit (that is the target which is a separate thing). So the domain
> won't react to setting maxmem at all and unless it goes specifically
> looking I don't think it would even be aware that its maximum has been
> temporarily reduced.

Agreed, _except_ that during the period where its max_pages is temporarily
reduced (which, we've demonstrated earlier in a related thread, may
be a period of many minutes), there are now two differences:

1) if d->max_pages is set below d->tot_pages, all dynamic allocations
of the type that would otherwise cause d->tot_pages to increase will
now fail, and
2) if d->max_pages is set "somewhat" higher than d->tot_pages, the
possible increase of d->tot_pages has now been constrained; some
dynamic allocations will succeed and some will fail.

Do you agree that there is a possibility that these differences
may result in unintended consequences?
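
To spell out what I mean by "fail" here, the existing hypervisor check
behaves roughly like the following.  This is a simplified sketch with
stand-in names (dom_sketch, sketch_assign_pages), not the actual Xen
allocator code, just to show where a dynamic allocation bounces off a
temporarily lowered d->max_pages:

/* Simplified sketch, not the actual Xen allocator code. */
struct dom_sketch {
    unsigned long tot_pages;   /* pages currently allocated to the domain */
    unsigned long max_pages;   /* current (possibly temporary) ceiling */
};

/* Conceptually called for every balloon/tmem/CoW page the guest gains. */
static int sketch_assign_pages(struct dom_sketch *d, unsigned long nr_pages)
{
    if (d->tot_pages + nr_pages > d->max_pages)
        return -1;              /* case 1: the dynamic allocation fails */
    d->tot_pages += nr_pages;   /* case 2: it succeeds, invisibly to the
                                   toolstack, until the ceiling is hit */
    return 0;
}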

> Having set all the maxmem's on the domains you would then immediately
> check if each domain has tot_pages under or over the temporary maxmem
> limit.
>
> If all domains are under then the claim has succeeded and you may
> proceed to build the domain. If any one domain is over then the claim
> has failed and you need to reset all the maxmems back to the lifetime
> value and try again on another host (I understand that this is an
> accepted possibility with the h/v based claim approach too).

NOW you are getting into policy.  You say "set all the maxmem's on
the domains" and "immediately check each domain tot_pages".  Let me
interpret this as a policy statement and try to define it more precisely:

1) For the N domains running on the system (and N may be measured in
   the hundreds), you must select L domains (where 1<=L<=N) and, for
   each, make a hypercall to change d->max_pages.  How do you
   propose to select these L?  Or, in your proposal, is L==N
   (i.e. L may also be >100)?
2) For each of the L domains, you must decide _how much_ to
   decrease d->max_pages.  (How do you propose to do this?  Maybe
   decrease each by the same amount, M-divided-by-L?)
3) You now make L (or is it N?) hypercalls to read each d->tot_pages.
4) I may be wrong, but I assume _before_ you decrease d->max_pages
   you will likely want to sample d->tot_pages for each of the L
   domains to inform your selection process in (1) and (2) above.
   If so, for each of the L (possibly N?) domains, a hypercall is
   required to check d->tot_pages, and a TOCTOU race is introduced
   because tot_pages may change unless and until you set d->max_pages
   lower than d->tot_pages.
5) Since the toolstack is unaware of dynamic allocations, your
   proposal might unwittingly decrease d->max_pages on a worst-case
   domain to the point where max_pages is much lower than the
   peak of the sine wave, and this constraint may be imposed for
   several minutes, potentially causing swapping or OOMs for our
   worst-case domains.  (Do you still disagree?)
6) You are imposing the above constraints on _all_ toolstacks.

Also, I'm not positive I understand, but it appears that your
solution as outlined will have false negatives; i.e. your
algorithm will cause some claims to fail when there is
actually sufficient RAM (in the case of "if any ONE domain is
over").  But unless you specify your selection criteria more
precisely, I don't know.
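
So that we are critiquing the same algorithm, here is my reading of your
sequence as a rough C sketch.  The helpers get_tot_pages(),
get_max_pages(), set_max_pages() and lifetime_max() are hypothetical
stand-ins for the real toolstack plumbing, and the policy in steps (1)
and (2) is deliberately the naive one (L==N, M-divided-by-L each):

/* Hypothetical helpers standing in for the real toolstack plumbing. */
unsigned long get_tot_pages(int domid);
unsigned long get_max_pages(int domid);
unsigned long lifetime_max(int domid);
void set_max_pages(int domid, unsigned long max_pages);

/* My reading of the toolstack-based claim of claim_pages for a new domain. */
int sketch_toolstack_claim(const int *doms, int n, unsigned long claim_pages)
{
    int i, j;

    /* Steps (1),(2),(4): pick the L domains (naively, L == N here),
     * decide how much each gives up (naively, claim_pages / n), and
     * lower its ceiling.  Any tot_pages sampled to inform this choice
     * is already stale (TOCTOU) by the time the new ceiling is written. */
    for (i = 0; i < n; i++)
        set_max_pages(doms[i], lifetime_max(doms[i]) - claim_pages / n);

    /* Step (3): if any single domain already exceeds its new ceiling,
     * the claim fails and every ceiling must be rolled back. */
    for (i = 0; i < n; i++) {
        if (get_tot_pages(doms[i]) > get_max_pages(doms[i])) {
            for (j = 0; j < n; j++)
                set_max_pages(doms[j], lifetime_max(doms[j]));
            return -1;   /* possible false negative: the host may still
                            have had enough free RAM in total */
        }
    }
    return 0;   /* claim held; build the domain, then restore the
                   lifetime maximums on all n domains */
}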

In sum, this all seems like a very high price to pay to avoid
less than a hundred lines of code (plus comments) in the
hypervisor.

> I forgot to say but you'd obviously want to use whatever controls tmem
> provides to ensure it doesn't just gobble up the M bytes needed for the
> new domain. It can of course continue to operate as normal on the
> remainder of the spare RAM.

Hmmm... so you want to shut off _all_ dynamic allocations for
a period of possibly several minutes?  And how does tmem know
what the "remainder of the spare RAM" is... isn't that information
now held only by the toolstack?  Forgive me if I am missing something
obvious, but in any case...

Tmem does have a gross, ham-handed freeze/thaw mechanism to do this
via tmem hypercalls.  But AFAIK there is no equivalent mechanism for
controlling in-guest ballooning (nor for shared-page CoW resolution).
Reserving the M bytes in the hypervisor (as the proposed
XENMEM_claim_pages does) is atomic, so it avoids any TOCTOU races,
eliminates the need for tmem freeze/thaw, and solves the problem for
in-guest-kernel selfballooning, all at the same time.  (And, I think,
for shared-page CoW resolution as well.)
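
To show what I mean by "atomic", here is roughly what the in-hypervisor
reservation boils down to.  This is a user-space-flavored sketch with
stand-in names, not the actual XENMEM_claim_pages patch, and the pthread
mutex merely plays the role of the allocator's heap lock:

#include <pthread.h>

/* Stand-in names and a stand-in lock, not the actual patch. */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_free_pages;    /* maintained by the allocator */
static unsigned long outstanding_claims;  /* pages promised to a domain
                                             being built, not yet allocated */

/* The claim: one check-and-reserve under the lock, so there is no
 * TOCTOU window and no per-domain max_pages needs to be touched. */
int sketch_claim_pages(unsigned long nr_pages)
{
    int rc = -1;

    pthread_mutex_lock(&heap_lock);
    if (total_free_pages - outstanding_claims >= nr_pages) {
        outstanding_claims += nr_pages;
        rc = 0;
    }
    pthread_mutex_unlock(&heap_lock);
    return rc;
}

/* Dynamic allocations (ballooning, tmem, CoW) keep working; they are
 * simply tested against free-minus-claimed rather than being frozen
 * or squeezed by temporary per-domain ceilings. */
int sketch_dynamic_alloc(unsigned long nr_pages)
{
    int rc = -1;

    pthread_mutex_lock(&heap_lock);
    if (total_free_pages - outstanding_claims >= nr_pages) {
        total_free_pages -= nr_pages;
        rc = 0;
    }
    pthread_mutex_unlock(&heap_lock);
    return rc;
}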
 
One more subtle but very important point, especially in the
context of memory overcommit:  Your toolstack-based proposal
explicitly constrains the growth of L independent domains.
This is a sum-of-maxes constraint.  The hypervisor-based proposal
constrains only the _total_ growth of the N domains and is thus
a max-of-sums constraint.  Statistically, for any resource
management problem, a max-of-sums constraint allows much, much
more flexibility.  So even academically speaking, the
hypervisor solution is superior.  (If that's clear as mud,
please let me know and I can try to explain further.)
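
A made-up example of the difference: suppose a 16GB host runs four
guests whose tot_pages each wander between 2GB and 4GB, and we want to
claim 2GB for a new domain.  The toolstack scheme must cap each guest,
say at 3.5GB apiece (4 x 3.5GB + 2GB = 16GB), so one guest spiking
toward 4GB fails even when the other three are idling at 2GB each,
total usage is under 10GB, and the host has over 4GB free beyond the
2GB claim.  The hypervisor claim only insists that the four guests
together stay under 14GB, so that same spike succeeds.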

Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

