
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions


At 13:43 -0800 on 10 Jan (1357825433), Dan Magenheimer wrote:
> > From: Tim Deegan [mailto:tim@xxxxxxx]
> > Not quite.  I think there are other viable options, and I don't
> > particularly like the reservation hypercall.
> Are you suggesting an alternative option other than the max_pages
> toolstack-based proposal that Ian and I are discussing in a parallel
> subthread?

Yes, I suggested three just below in that email.

> Are there reasons other than "incompleteness" (see below) that
> you dislike the reservation hypercall?

Yes.  Mostly it strikes me as treating a symptom.  That is, it solves
the specific problem of delayed build failure rather than looking at the
properties of the system that caused it. 

If I were given a self-ballooning system and asked to support it, I'd be
looking at other things first, and probably solving the delayed failure
of VM creation as a side-effect.  For example:
 - the lack of policy.  If we assume all VMs have the same admin,
   so we can ignore malicious attackers, a buggy guest or guests
   can still starve out well-behaved ones.  And because it implicitly
   relies on all OSes having an equivalent measure of how much they
   'need' memory, on a host with a mix of guest OSes, the aggressive
   ones will starve the others.
 - the lack of fairness: when a storm of activity hits an idle system,
   whichever VMs get busy first will get all the memory.
 - allocating _all_ memory with no slack makes the system more vulnerable
   to any bugs in the rest of Xen where allocation failure isn't handled
   cleanly.  There shouldn't be any, but I bet there are. 
 - there's no way of forcing a new VM into a 'full' system; the admin must
   wait and hope for the existing VMs to shrink.  (If there were such
   a system, it would solve the delayed-failure problem because you'd
   just use it to enforce the 

Now, of course, I don't want to dictate what you do in your own system,
and in any case I haven't time to get involved in a long discussion
about it.  And as I've said this reservation hypercall seems harmless

> > That could be worked around with an upcall to a toolstack
> > agent that reshuffles things on a coarse granularity based on need.  I
> > agree that's slower than having the hypervisor make the decisions but
> > I'm not convinced it'd be unmanageable.
> "Based on need" begs a number of questions, starting with how
> "need" is defined and how conflicting needs are resolved.
> Tmem balances need as a self-adapting system. For your upcalls,
> you'd have to convince me that, even if "need" could be communicated
> to a guest-external entity (i.e. a toolstack), the entity
> would/could have any data to inform a policy to intelligently resolve
> conflicts. 

It can easily have all the information that Xen has -- that is, some VMs
are asking for more memory.  It can even make the same decision about
what to do that Xen might, though I think it can probably do better.

> I also don't see how it could be done without either
> significant hypervisor or guest-kernel changes.

The only hypervisor change would be a ring (or even an event channel) to
notify the tools when a guest's XENMEM_populate_physmap fails.
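
As a rough illustration of that notification path, here is a toy model in
Python (not Xen code).  Every name in it (Host, AllocFailure,
toolstack_agent, reclaim) is hypothetical, the ring size is arbitrary, and
the "take pages from the largest other domain" policy is only a
placeholder for whatever the tools would really decide.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AllocFailure:
    """Hypothetical notification record placed on the ring."""
    domid: int
    pages_wanted: int


@dataclass
class Host:
    free_pages: int
    allocated: dict = field(default_factory=dict)  # domid -> pages held
    ring: deque = field(default_factory=lambda: deque(maxlen=64))

    def populate_physmap(self, domid, pages):
        """Toy stand-in for a guest's allocation request."""
        if pages > self.free_pages:
            # The only "hypervisor" change: tell the tools about the failure.
            self.ring.append(AllocFailure(domid, pages))
            return False
        self.free_pages -= pages
        self.allocated[domid] = self.allocated.get(domid, 0) + pages
        return True

    def reclaim(self, domid, pages):
        """Toy stand-in for the tools ballooning a guest down."""
        taken = min(pages, self.allocated.get(domid, 0))
        self.allocated[domid] = self.allocated.get(domid, 0) - taken
        self.free_pages += taken
        return taken


def toolstack_agent(host):
    """Drain the ring and free up memory for each failed request.

    Placeholder policy (an assumption): shrink the largest other domain
    by the shortfall.  The guest is expected to retry afterwards.
    """
    handled = []
    while host.ring:
        note = host.ring.popleft()
        shortfall = max(0, note.pages_wanted - host.free_pages)
        donors = [d for d in host.allocated if d != note.domid]
        if donors and shortfall:
            donor = max(donors, key=lambda d: host.allocated[d])
            host.reclaim(donor, shortfall)
        handled.append(note.domid)
    return handled
```

In this model the agent sees exactly what the hypervisor saw (which domain
asked, and for how much), and is free to apply any policy it likes before
the guest retries.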

> > Or, how about actually moving towards a memory scheduler like you
> > suggested -- for example by integrating memory allocation more tightly
> > with tmem.  There could be an xsm-style hook in the allocator for
> > tmem-enabled domains.  That way tmem would have complete control over
> > all memory allocations for the guests under its control, and it could
> > implement a shared upper limit.  Potentially in future the tmem
> > interface could be extended to allow it to force guests to give back
> > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > two VMs are busy, why should the one that spiked first get to keep all
> > the RAM?) or other nice scheduler-like properties.
> Tmem (plus selfballooning), unchanged, already does some of this.
> While I would be interested in discussing better solutions, the
> now four-year odyssey of pushing what I thought were relatively
> simple changes upstream into Linux has left a rather sour taste
> in my mouth, so rather than consider any solution that requires
> more guest kernel changes [...]

I don't mean that you'd have to do all of that now, but if you were
considering moving in that direction, an easy first step would be to add
a hook allowing tmem to veto allocations for VMs under its control.
That would let tmem have proper control over its client VMs (so it can
solve the delayed-failure race for you), while at the same time being a
constructive step towards a more complete memory scheduler.
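
A minimal sketch of such a veto hook, again as a Python toy model rather
than Xen code.  TmemPolicy, may_allocate, note_allocated and the
shared_limit parameter are all hypothetical names chosen for illustration;
the real allocator and tmem interfaces look nothing like this.

```python
class TmemPolicy:
    """Hypothetical policy: one shared upper limit for all client domains."""

    def __init__(self, shared_limit):
        self.shared_limit = shared_limit
        self.clients = {}  # domid -> pages currently held

    def add_client(self, domid):
        self.clients.setdefault(domid, 0)

    def may_allocate(self, domid, pages):
        """The veto hook: return False to refuse the allocation."""
        if domid not in self.clients:
            return True  # not under this policy's control; no opinion
        return sum(self.clients.values()) + pages <= self.shared_limit

    def note_allocated(self, domid, pages):
        if domid in self.clients:
            self.clients[domid] += pages


class Allocator:
    """Toy page allocator that consults the policy before allocating."""

    def __init__(self, free_pages, policy):
        self.free_pages = free_pages
        self.policy = policy

    def alloc(self, domid, pages):
        if not self.policy.may_allocate(domid, pages):
            return False  # vetoed by the policy
        if pages > self.free_pages:
            return False  # genuinely out of memory
        self.free_pages -= pages
        self.policy.note_allocated(domid, pages)
        return True
```

Under this model, one way to avoid the delayed-failure race would be to
lower shared_limit by the size of the new VM before its build starts, so
that existing clients cannot grow into the memory the build needs.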

> > Or, you could consider booting the new guest pre-ballooned so it doesn't
> > have to allocate all that memory in the build phase.  It would boot much
> > quicker (solving the delayed-failure problem), and join the scramble for
> > resources on an equal footing with its peers.
> I'm not positive I understand "pre-ballooned" but IIUC, all Linux
> guests already boot pre-ballooned, in that, from the vm.cfg file,
> "mem=" is allocated, not "maxmem=".


> Tmem, with self-ballooning, launches the guest with "mem=", and
> then the guest kernel "self adapts" to (dramatically) reduce its usage
> soon after boot.  It can be fun to "watch(1)", meaning using the
> Linux "watch -d 'head -1 /proc/meminfo'" command.

If it were to launch the same guest with "mem=" set to a much smaller
number and then let it selfballoon _up_ to its chosen amount, vm-building
failures due to allocation races could be (a) much rarer and (b) detected
much sooner.

> > > > My own position remains that I can live with the reservation hypercall,
> > > > as long as it's properly done - including handling PV 32-bit and PV
> > > > superpage guests.
> > >
> > > Tim, would you at least agree that "properly" is a red herring?
> > 
> > I'm not quite sure what you mean by that.  To the extent that this isn't
> > a criticism of the high-level reservation design, maybe.  But I stand by
> > it as a criticism of the current implementation.
> Sorry, I was just picking on word usage.  IMHO, the hypercall
> does work "properly" for the classes of domains it was designed
> to work on (which I'd estimate in the range of 98% of domains
> these days).

But it's deliberately incorrect for PV-superpage guests, which are a
feature developed and maintained by Oracle.  I assume you'll want to
make them work with your own toolstack -- why would you not?




