Xen project Mailing List

Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

To: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>

From: Tim Deegan <tim@xxxxxxx>

Date: Thu, 17 Jan 2013 15:12:08 +0000

Cc: "Keir (Xen.org)" <keir@xxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Thu, 17 Jan 2013 15:12:28 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hi,

At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
> > From: Tim Deegan [mailto:tim@xxxxxxx]
> > Not quite.  I think there are other viable options, and I don't
> > particularly like the reservation hypercall.
>
> Are you suggesting an alternative option other than the max_pages
> toolstack-based proposal that Ian and I are discussing in a parallel
> subthread?

Yes, I suggested three just below in that email.

> Are there reasons other than "incompleteness" (see below) that
> you dislike the reservation hypercall?

Yes.  Mostly it strikes me as treating a symptom.  That is, it solves
the specific problem of delayed build failure rather than looking at the
properties of the system that caused it.

If I were given a self-ballooning system and asked to support it, I'd be
looking at other things first, and probably solving the delayed failure
of VM creation as a side-effect.  For example:
 - the lack of policy.  If we assume all VMs have the same admin,
   so we can ignore malicious attackers, a buggy guest or guests
   can still starve out well-behaved ones.  And because it implicitly
   relies on all OSes having an equivalent measure of how much they
   'need' memory, on a host with a mix of guest OSes, the aggressive
   ones will starve the others.
 - the lack of fairness: when a storm of activity hits an idle system,
   whichever VMs get busy first will get all the memory.
 - allocating _all_ memory with no slack makes the system more vulnerable
   to any bugs in the rest of xen where allocation failure isn't handled
   cleanly.  There shouldn't be any, but I bet there are.
 - there's no way of forcing a new VM into a 'full' system; the admin must
   wait and hope for the existing VMs to shrink.  (If there were such a
   system, it would solve the delayed-failure problem because you'd
   just use it to enforce the

Now, of course, I don't want to dictate what you do in your own system,
and in any case I haven't time to get involved in a long discussion
about it.  And as I've said this reservation hypercall seems harmless
enough.

> > That could be worked around with an upcall to a toolstack
> > agent that reshuffles things on a coarse granularity based on need.  I
> > agree that's slower than having the hypervisor make the decisions but
> > I'm not convinced it'd be unmanageable.
>
> "Based on need" begs a number of questions, starting with how
> "need" is defined and how conflicting needs are resolved.
> Tmem balances need as a self-adapting system. For your upcalls,
> you'd have to convince me that, even if "need" could be communicated
> to an guest-external entity (i.e. a toolstack), that the entity
> would/could have any data to inform a policy to intelligently resolve
> conflicts.

It can easily have all the information that Xen has -- that is, some VMs
are asking for more memory.  It can even make the same decision about
what to do that Xen might, though I think it can probably do better.

> I also don't see how it could be done without either
> significant hypervisor or guest-kernel changes.

The only hypervisor change would be a ring (or even an eventchn) to
notify the tools when a guest's XENMEM_populate_physmap fails.

> > Or, how about actually moving towards a memory scheduler like you
> > suggested -- for example by integrating memory allocation more tightly
> > with tmem.  There could be an xsm-style hook in the allocator for
> > tmem-enabled domains.  That way tmem would have complete control over
> > all memory allocations for the guests under its control, and it could
> > implement a shared upper limit.  Potentially in future the tmem
> > interface could be extended to allow it to force guests to give back
> > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > two VMs are busy, why should the one that spiked first get to keep all
> > the RAM?) or other nice scheduler-like properties.
>
> Tmem (plus selfballooning), unchanged, already does some of this.
> While I would be interested in discussing better solutions, the
> now four-year odyssey of pushing what I thought were relatively
> simple changes upstream into Linux has left a rather sour taste
> in my mouth, so rather than consider any solution that requires
> more guest kernel changes [...]

I don't mean that you'd have to do all of that now, but if you were
considering moving in that direction, an easy first step would be to add
a hook allowing tmem to veto allocations for VMs under its control.
That would let tmem have proper control over its client VMs (so it can
solve the delayed-failure race for you), while at the same time being a
constructive step towards a more complete memory scheduler.

> > Or, you could consider booting the new guest pre-ballooned so it doesn't
> > have to allocate all that memory in the build phase.  It would boot much
> > quicker (solving the delayed-failure problem), and join the scramble for
> > resources on an equal footing with its peers.
>
> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
> guests already boot pre-ballooned, in that, from the vm.cfg file,
> "mem=" is allocated, not "maxmem=".

Absolutely.

> Tmem, with self-ballooning, launches the guest with "mem=", and
> then the guest kernel "self adapts" to (dramatically) reduce its usage
> soon after boot.  It can be fun to "watch(1)", meaning using the
> Linux "watch -d 'head -1 /proc/meminfo'" command.

If it were to launch the same guest with mem= a much smaller number and
then let it selfballoon _up_ to its chosen amount, vm-building failures
due to allocation races could be (a) much rarer and (b) much faster.

> > > > My own position remains that I can live with the reservation hypercall,
> > > > as long as it's properly done - including handling PV 32-bit and PV
> > > > superpage guests.
> > >
> > > Tim, would you at least agree that "properly" is a red herring?
> >
> > I'm not quite sure what you mean by that.  To the extent that this isn't
> > a criticism of the high-level reservation design, maybe.  But I stand by
> > it as a criticism of the current implementation.
>
> Sorry, I was just picking on word usage.  IMHO, the hypercall
> does work "properly" for the classes of domains it was designed
> to work on (which I'd estimate in the range of 98% of domains
> these days).

But it's deliberately incorrect for PV-superpage guests, which are a
feature developed and maintained by Oracle.  I assume you'll want to
make them work with your own toolstack -- why would you not?

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
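
The veto-hook idea in the message above (a hook in the allocator that lets
tmem refuse allocations for the domains it manages, so that it can enforce
a shared upper limit) can be illustrated with a toy model. This is only a
sketch under stated assumptions: it is plain Python, not Xen code, and
every name in it (ToyAllocator, shared_limit, managed, veto) is
hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAllocator:
    """Toy page allocator with an optional veto hook.

    Hypothetical model for illustration only; none of these names
    correspond to real Xen or tmem interfaces.
    """
    total_pages: int
    shared_limit: int                           # cap on pages held by all managed domains
    managed: set = field(default_factory=set)   # domains under the hook's control
    held: dict = field(default_factory=dict)    # domain name -> pages currently held

    def free_pages(self) -> int:
        return self.total_pages - sum(self.held.values())

    def managed_pages(self) -> int:
        return sum(n for d, n in self.held.items() if d in self.managed)

    def veto(self, domain: str, pages: int) -> bool:
        """Refuse if a managed domain would push the group past the shared limit."""
        if domain not in self.managed:
            return False
        return self.managed_pages() + pages > self.shared_limit

    def allocate(self, domain: str, pages: int) -> bool:
        # Fail on genuine exhaustion or on a veto from the hook.
        if pages > self.free_pages() or self.veto(domain, pages):
            return False
        self.held[domain] = self.held.get(domain, 0) + pages
        return True


alloc = ToyAllocator(total_pages=1000, shared_limit=600, managed={"vm1", "vm2"})
print(alloc.allocate("vm1", 500))    # fits under the shared limit
print(alloc.allocate("vm2", 200))    # would take the managed group to 700: vetoed
print(alloc.allocate("newvm", 400))  # unmanaged domain being built still finds free pages
```

In this model the managed guests can never collectively grow past
shared_limit, so a new domain's build is not racing them for the last
free pages. How a real hook would choose the limit, or reclaim memory
from guests, is outside what this sketch covers.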
