Xen project Mailing List

Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

To: Tim Deegan <tim@xxxxxxx>

From: Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>

Date: Thu, 17 Jan 2013 10:26:46 -0500

Cc: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, "Keir (Xen.org)" <keir@xxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Thu, 17 Jan 2013 15:27:09 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Jan 17, 2013, at 10:12 AM, Tim Deegan <tim@xxxxxxx> wrote:

> Hi,
>
> At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
>>> From: Tim Deegan [mailto:tim@xxxxxxx]
>>> Not quite. I think there are other viable options, and I don't
>>> particularly like the reservation hypercall.
>>
>> Are you suggesting an alternative option other than the max_pages
>> toolstack-based proposal that Ian and I are discussing in a parallel
>> subthread?
>
> Yes, I suggested three just below in that email.
>
>> Are there reasons other than "incompleteness" (see below) that
>> you dislike the reservation hypercall?
>
> Yes. Mostly it strikes me as treating a symptom. That is, it solves
> the specific problem of delayed build failure rather than looking at the
> properties of the system that caused it.
>
> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect. For example:
> - the lack of policy. If we assume all VMs have the same admin,
>   so we can ignore malicious attackers, a buggy guest or guests
>   can still starve out well-behaved ones. And because it implicitly
>   relies on all OSes having an equivalent measure of how much they
>   'need' memory, on a host with a mix of guest OSes, the aggressive
>   ones will starve the others.
> - the lack of fairness: when a storm of activity hits an idle system,
>   whichever VMs get busy first will get all the memory.
> - allocating _all_ memory with no slack makes the system more vulnerable
>   to any bugs in the rest of xen where allocation failure isn't handled
>   cleanly. There shouldn't be any, but I bet there are.
> - there's no way of forcing a new VM into a 'full' system; the admin must
>   wait and hope for the existing VMs to shrink.
>   (If there were such
>   a system, it would solve the delayed-failure problem because you'd
>   just use it to enforce the
>
> Now, of course, I don't want to dictate what you do in your own system,
> and in any case I haven't time to get involved in a long discussion
> about it. And as I've said this reservation hypercall seems harmless
> enough.
>
>>> That could be worked around with an upcall to a toolstack
>>> agent that reshuffles things on a coarse granularity based on need. I
>>> agree that's slower than having the hypervisor make the decisions but
>>> I'm not convinced it'd be unmanageable.
>>
>> "Based on need" begs a number of questions, starting with how
>> "need" is defined and how conflicting needs are resolved.
>> Tmem balances need as a self-adapting system. For your upcalls,
>> you'd have to convince me that, even if "need" could be communicated
>> to an guest-external entity (i.e. a toolstack), that the entity
>> would/could have any data to inform a policy to intelligently resolve
>> conflicts.
>
> It can easily have all the information that Xen has -- that is, some VMs
> are asking for more memory. It can even make the same decision about
> what to do that Xen might, though I think it can probably do better.
>
>> I also don't see how it could be done without either
>> significant hypervisor or guest-kernel changes.
>
> The only hypervisor change would be a ring (or even an eventchn) to
> notify the tools when a guest's XENMEM_populate_physmap fails.

We already have a notification ring for ENOMEM on unshare. It's named
"sharing" ring, but frankly it's more like an "enomem" ring. It can be
easily generalized. I hope…

Andres

>
>>> Or, how about actually moving towards a memory scheduler like you
>>> suggested -- for example by integrating memory allocation more tightly
>>> with tmem. There could be an xsm-style hook in the allocator for
>>> tmem-enabled domains.
>>> That way tmem would have complete control over
>>> all memory allocations for the guests under its control, and it could
>>> implement a shared upper limit. Potentially in future the tmem
>>> interface could be extended to allow it to force guests to give back
>>> more kinds of memory, so that it could try to enforce fairness (e.g. if
>>> two VMs are busy, why should the one that spiked first get to keep all
>>> the RAM?) or other nice scheduler-like properties.
>>
>> Tmem (plus selfballooning), unchanged, already does some of this.
>> While I would be interested in discussing better solutions, the
>> now four-year odyssey of pushing what I thought were relatively
>> simple changes upstream into Linux has left a rather sour taste
>> in my mouth, so rather than consider any solution that requires
>> more guest kernel changes [...]
>
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.
>
>>> Or, you could consider booting the new guest pre-ballooned so it doesn't
>>> have to allocate all that memory in the build phase. It would boot much
>>> quicker (solving the delayed-failure problem), and join the scramble for
>>> resources on an equal footing with its peers.
>>
>> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
>> guests already boot pre-ballooned, in that, from the vm.cfg file,
>> "mem=" is allocated, not "maxmem=".
>
> Absolutely.
>
>> Tmem, with self-ballooning, launches the guest with "mem=", and
>> then the guest kernel "self adapts" to (dramatically) reduce its usage
>> soon after boot.
>> It can be fun to "watch(1)", meaning using the
>> Linux "watch -d 'head -1 /proc/meminfo'" command.
>
> If it were to launch the same guest with mem= a much smaller number and
> then let it selfballoon _up_ to its chosen amount, vm-building failures
> due to allocation races could be (a) much rarer and (b) much faster.
>
>>>>> My own position remains that I can live with the reservation hypercall,
>>>>> as long as it's properly done - including handling PV 32-bit and PV
>>>>> superpage guests.
>>>>
>>>> Tim, would you at least agree that "properly" is a red herring?
>>>
>>> I'm not quite sure what you mean by that. To the extent that this isn't
>>> a criticism of the high-level reservation design, maybe. But I stand by
>>> it as a criticism of the current implementation.
>>
>> Sorry, I was just picking on word usage. IMHO, the hypercall
>> does work "properly" for the classes of domains it was designed
>> to work on (which I'd estimate in the range of 98% of domains
>> these days).
>
> But it's deliberately incorrect for PV-superpage guests, which are a
> feature developed and maintained by Oracle. I assume you'll want to
> make them work with your own toolstack -- why would you not?
>
> Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
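
As a rough illustration of two ideas from the thread above (a hook that lets
tmem veto allocations for its client VMs under a shared upper limit, and a
ring that tells the tools when an allocation was refused), here is a toy
Python model. It is not Xen code: TmemPool, veto_hook, allocate and
enomem_ring are all hypothetical names invented for this sketch, and the
real allocator, XSM hooks and sharing ring work differently.

```python
from collections import deque


class TmemPool:
    """Toy model: a group of client domains sharing one page cap (hypothetical)."""

    def __init__(self, shared_limit_pages):
        self.shared_limit = shared_limit_pages
        self.used = {}              # domid -> pages currently held
        self.enomem_ring = deque()  # stand-in for a notification ring to the tools

    def add_client(self, domid):
        self.used[domid] = 0

    def veto_hook(self, domid, nr_pages):
        """Return True to veto: the request would push the group past its cap."""
        if domid not in self.used:
            return False  # not a client of this pool, so the hook has no opinion
        return sum(self.used.values()) + nr_pages > self.shared_limit


def allocate(pool, domid, nr_pages):
    """Toy allocator entry point that consults the hook before allocating."""
    if pool.veto_hook(domid, nr_pages):
        # Record which domain was refused, and for how much, for the tools.
        pool.enomem_ring.append((domid, nr_pages))
        return False
    if domid in pool.used:
        pool.used[domid] += nr_pages
    return True


pool = TmemPool(shared_limit_pages=1000)
pool.add_client(1)
pool.add_client(2)
first = allocate(pool, 1, 800)    # fits under the shared cap
second = allocate(pool, 2, 300)   # would exceed the cap, so it is vetoed
third = allocate(pool, 2, 200)    # brings the group exactly to the cap
```

In this model a refused request is only recorded on a queue; what a
toolstack agent would do after draining it (reshuffle memory, fail the
build, retry) is the policy question debated above and is left out.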
