
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

On Jan 17, 2013, at 10:12 AM, Tim Deegan <tim@xxxxxxx> wrote:

> Hi,
> At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
>>> From: Tim Deegan [mailto:tim@xxxxxxx]
>>> Not quite.  I think there are other viable options, and I don't
>>> particularly like the reservation hypercall.
>> Are you suggesting an alternative option other than the max_pages
>> toolstack-based proposal that Ian and I are discussing in a parallel
>> subthread?
> Yes, I suggested three just below in that email.
>> Are there reasons other than "incompleteness" (see below) that
>> you dislike the reservation hypercall?
> Yes.  Mostly it strikes me as treating a symptom.  That is, it solves
> the specific problem of delayed build failure rather than looking at the
> properties of the system that caused it. 
> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect.  For example:
> - the lack of policy.  If we assume all VMs have the same admin,
>   so we can ignore malicious attackers, a buggy guest or guests
>   can still starve out well-behaved ones.  And because it implicitly
>   relies on all OSes having an equivalent measure of how much they
>   'need' memory, on a host with a mix of guest OSes, the aggressive
>   ones will starve the others.
> - the lack of fairness: when a storm of activity hits an idle system,
>   whichever VMs get busy first will get all the memory.
> - allocating _all_ memory with no slack makes the system more vulnerable
>   to any bugs in the rest of xen where allocation failure isn't handled
>   cleanly.  There shouldn't be any, but I bet there are. 
> - there's no way of forcing a new VM into a 'full' system; the admin must
>   wait and hope for the existing VMs to shrink.  (If there were such
>   a system, it would solve the delayed-failure problem because you'd
>   just use it to enforce the 
> Now, of course, I don't want to dictate what you do in your own system,
> and in any case I haven't time to get involved in a long discussion
> about it.  And as I've said this reservation hypercall seems harmless
> enough.
>>> That could be worked around with an upcall to a toolstack
>>> agent that reshuffles things on a coarse granularity based on need.  I
>>> agree that's slower than having the hypervisor make the decisions but
>>> I'm not convinced it'd be unmanageable.
>> "Based on need" begs a number of questions, starting with how
>> "need" is defined and how conflicting needs are resolved.
>> Tmem balances need as a self-adapting system. For your upcalls,
>> you'd have to convince me that, even if "need" could be communicated
>> to a guest-external entity (i.e. a toolstack), that the entity
>> would/could have any data to inform a policy to intelligently resolve
>> conflicts. 
> It can easily have all the information that Xen has -- that is, some VMs
> are asking for more memory.  It can even make the same decision about
> what to do that Xen might, though I think it can probably do better.
>> I also don't see how it could be done without either
>> significant hypervisor or guest-kernel changes.
> The only hypervisor change would be a ring (or even an eventchn) to
> notify the tools when a guest's XENMEM_populate_physmap fails.

We already have a notification ring for ENOMEM on unshare. It's named the
"sharing" ring, but frankly it's more like an "enomem" ring. It can be easily
generalized. I hope…
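
To make the notification idea concrete, here is a toy model in Python of a bounded ring that a host posts to whenever an allocation fails for lack of memory. It is only a sketch: NotificationRing, EnomemEvent, and ToyHost are hypothetical names invented for illustration, not Xen interfaces, and the real sharing ring is a shared-memory ring signalled over an event channel rather than a Python queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class EnomemEvent:
    """Hypothetical record posted when an allocation fails."""
    domid: int
    pages_requested: int


class NotificationRing:
    """Bounded FIFO standing in for a hypervisor-to-toolstack ring (assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()
        self.dropped = 0  # events lost because the ring was full

    def post(self, event):
        # Producer side: refuse the event if the consumer has fallen behind.
        if len(self.slots) >= self.capacity:
            self.dropped += 1
            return False
        self.slots.append(event)
        return True

    def drain(self):
        # Consumer side: the toolstack agent takes every pending event.
        events = list(self.slots)
        self.slots.clear()
        return events


class ToyHost:
    """Toy allocator that reports out-of-memory failures on the ring."""

    def __init__(self, free_pages, ring):
        self.free_pages = free_pages
        self.ring = ring

    def populate_physmap(self, domid, pages):
        # Named after the hypercall discussed above; the behaviour is simplified.
        if pages > self.free_pages:
            self.ring.post(EnomemEvent(domid, pages))
            return False
        self.free_pages -= pages
        return True
```

A toolstack agent in this model would call drain() periodically and decide, at coarse granularity, which guests to shrink before the failing guest retries.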


>>> Or, how about actually moving towards a memory scheduler like you
>>> suggested -- for example by integrating memory allocation more tightly
>>> with tmem.  There could be an xsm-style hook in the allocator for
>>> tmem-enabled domains.  That way tmem would have complete control over
>>> all memory allocations for the guests under its control, and it could
>>> implement a shared upper limit.  Potentially in future the tmem
>>> interface could be extended to allow it to force guests to give back
>>> more kinds of memory, so that it could try to enforce fairness (e.g. if
>>> two VMs are busy, why should the one that spiked first get to keep all
>>> the RAM?) or other nice scheduler-like properties.
>> Tmem (plus selfballooning), unchanged, already does some of this.
>> While I would be interested in discussing better solutions, the
>> now four-year odyssey of pushing what I thought were relatively
>> simple changes upstream into Linux has left a rather sour taste
>> in my mouth, so rather than consider any solution that requires
>> more guest kernel changes [...]
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.
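
As a rough illustration of the veto hook described above, the following Python toy model puts a policy check in front of an allocator. Everything in it is an assumption made for the sketch: TmemPolicy, Allocator, and the shared_limit_pages parameter are hypothetical names, and the real hook would live in Xen's C allocator in whatever form the maintainers choose.

```python
class TmemPolicy:
    """Hypothetical policy enforcing one shared upper limit across its client VMs."""

    def __init__(self, shared_limit_pages):
        self.shared_limit = shared_limit_pages
        self.in_use = {}  # domid -> pages currently charged to that client

    def total_in_use(self):
        return sum(self.in_use.values())

    def allow(self, domid, pages):
        # Veto any request that would push the clients past the shared limit.
        return self.total_in_use() + pages <= self.shared_limit

    def charge(self, domid, pages):
        self.in_use[domid] = self.in_use.get(domid, 0) + pages


class Allocator:
    """Toy page allocator that consults the policy for managed domains only."""

    def __init__(self, free_pages, policy, managed_domids):
        self.free_pages = free_pages
        self.policy = policy
        self.managed = set(managed_domids)

    def alloc(self, domid, pages):
        is_managed = domid in self.managed
        if is_managed and not self.policy.allow(domid, pages):
            return False  # vetoed by the policy hook
        if pages > self.free_pages:
            return False  # genuinely out of memory
        self.free_pages -= pages
        if is_managed:
            self.policy.charge(domid, pages)
        return True
```

In this model the memory above the shared limit stays out of reach of the self-ballooning clients, so a domain builder that is not under the policy can still find pages for a new VM.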
>>> Or, you could consider booting the new guest pre-ballooned so it doesn't
>>> have to allocate all that memory in the build phase.  It would boot much
>>> quicker (solving the delayed-failure problem), and join the scramble for
>>> resources on an equal footing with its peers.
>> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
>> guests already boot pre-ballooned, in that, from the vm.cfg file,
>> "mem=" is allocated, not "maxmem=".
> Absolutely.
>> Tmem, with self-ballooning, launches the guest with "mem=", and
>> then the guest kernel "self adapts" to (dramatically) reduce its usage
>> soon after boot.  It can be fun to "watch(1)", meaning using the
>> Linux "watch -d 'head -1 /proc/meminfo'" command.
> If it were to launch the same guest with mem= a much smaller number and
> then let it selfballoon _up_ to its chosen amount, vm-building failures
> due to allocation races could be (a) much rarer and (b) much faster.  
>>>>> My own position remains that I can live with the reservation hypercall,
>>>>> as long as it's properly done - including handling PV 32-bit and PV
>>>>> superpage guests.
>>>> Tim, would you at least agree that "properly" is a red herring?
>>> I'm not quite sure what you mean by that.  To the extent that this isn't
>>> a criticism of the high-level reservation design, maybe.  But I stand by
>>> it as a criticism of the current implementation.
>> Sorry, I was just picking on word usage.  IMHO, the hypercall
>> does work "properly" for the classes of domains it was designed
>> to work on (which I'd estimate in the range of 98% of domains
>> these days).
> But it's deliberately incorrect for PV-superpage guests, which are a
> feature developed and maintained by Oracle.  I assume you'll want to
> make them work with your own toolstack -- why would you not?
> Tim.
