Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of 
> problem and alternate solutions
> 
> Hi,

Hi Tim --

It's probably worth correcting a few of your points below,
even if only for the xen-devel archives and posterity...
 
> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect.

Agreed.  These other things were looked at in 2009 when tmem
was added to Xen and prototyped for Linux.  The delayed failure
was solved poorly with a hack in 2010 and is being looked at
again in 2012/2013 with the intent of solving it correctly.

> For example:
>  - the lack of policy.  If we assume all VMs have the same admin,
>    so we can ignore malicious attackers, a buggy guest or guests
>    can still starve out well-behaved ones.  And because it implicitly
>    relies on all OSes having an equivalent measure of how much they
>    'need' memory, on a host with a mix of guest OSes, the aggressive
>    ones will starve the others.

With tmem, a malicious attacker can never get more memory than
the original maxmem assigned by the host administrator when the
guest is launched.  The same is true of any non-tmem guests
(e.g. proprietary Windows) running on the host.

And the architecture of tmem takes into account the difference
between memory a guest "needs" and memory it merely "wants".
Though this is a basic OS concept that exists in some form in
every OS, AFAIK it has never been exposed outside the OS (e.g.
to a hypervisor) because, on a physical system, RAM is RAM and
the only limit is the total amount of physical RAM in the machine.
The tmem changes in the guest kernel expose this needs/wants
information, and tmem in the hypervisor defines very simple
carrots and sticks that keep guests in line by offering, under
well-defined constraints, to keep and manage certain pages of
data for the guest.
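
To make the needs/wants distinction concrete, here is a minimal
guest-side sketch.  All names are illustrative stand-ins
(page_is_working_set, tmem_put, etc.), not the real tmem or
Linux APIs:

    /* Minimal sketch, illustrative names only -- not the real tmem
     * ABI.  Pages the guest "needs" (its working set) stay ordinary
     * guest RAM; pages it merely "wants" (clean page cache, swap
     * candidates) are offered to the hypervisor, which may keep or
     * drop them under its own policy. */

    #include <stdbool.h>

    struct page;                            /* opaque guest page */

    extern bool page_is_working_set(struct page *pg);  /* hypothetical */
    extern void keep_resident(struct page *pg);        /* hypothetical */
    extern void tmem_put(struct page *pg);             /* hypothetical */

    void evict_candidate(struct page *pg)
    {
        if (page_is_working_set(pg))
            keep_resident(pg);    /* "needs": never given up */
        else
            tmem_put(pg);         /* "wants": hypervisor may reclaim */
    }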

It is true of any resource-sharing mechanism (including CPU and
I/O scheduling under Xen) that the "must" demand for the resource
may exceed the total available resource.  But, just as with CPU
scheduling, demand can be controlled by a few simple policy
variables that default to reasonable values and are enforced,
as necessary, in the hypervisor.  And just as with CPU schedulers
and I/O schedulers, different workloads may expose weaknesses
over time, but that doesn't mean we throw away our CPU and I/O
schedulers and statically partition those resources instead.
Nor should we do so with RAM.

All this has been implemented in Xen for years, and the Linux
side is now shipping.  I would very much welcome input and
improvements.
But it is very frustrating when people say, on the one hand,
that "it can't be done" or "it won't work" or "it's too hard",
while on the other hand those same people are saying "I don't
have time to understand tmem".

> For example:
>  - the lack of fairness: when a storm of activity hits an idle system,
>    whichever VMs get busy first will get all the memory.

True, but only up to the policy limits built into tmem (i.e.
not "all").  The same is true of CPU scheduling, up to the
policy limits built into the CPU scheduler.

(BTW, tmem optionally supports caps and weights too.)
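
For the curious, here is a hedged sketch of how a cap and a
weight could bound a client's tmem usage; the field names are
mine, not Xen's:

    /* Hedged sketch, illustrative names -- not the Xen implementation.
     * A hard cap bounds one client absolutely; a weight bounds its
     * share of tmem space only while that space is contended. */

    #include <stdbool.h>

    struct tmem_client {
        unsigned long weight;   /* relative share under contention */
        unsigned long cap;      /* hard ceiling in pages, 0 = none */
        unsigned long used;     /* pages currently held in tmem    */
    };

    bool tmem_may_grow(const struct tmem_client *c,
                       unsigned long total_used,    /* all clients     */
                       unsigned long total_weight,  /* sum of weights  */
                       unsigned long tmem_space)    /* pages available */
    {
        if (c->cap && c->used >= c->cap)
            return false;                  /* hard cap reached */
        if (total_used < tmem_space)
            return true;                   /* not contended */
        /* contended: allow growth only below the weighted fair
         * share, i.e. used/tmem_space < weight/total_weight */
        return c->used * total_weight < c->weight * tmem_space;
    }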

> For example:
>  - allocating _all_ memory with no slack makes the system more vulnerable
>    to any bugs in the rest of xen where allocation failure isn't handled
>    cleanly.  There shouldn't be any, but I bet there are.

Once tmem has been running for a while, it works in an eternal
state of "no slack".  IIRC there was a bug or two worked through
years ago.  The real issues have always been fragmentation, and
code paths that didn't gracefully handle failed higher-order
allocations, but Jan (as of 4.1?) has removed those issues
from Xen.

So tmem is using ALL the memory in the system.  Keir (and Jan)
wrote a very solid memory manager, and it holds up very well
even under stress.

> For example:
>  - there's no way of forcing a new VM into a 'full' system; the admin must
>    wait and hope for the existing VMs to shrink.  (If there were such
>    a system, it would solve the delayed-failure problem because you'd
>    just use it to enforce the

Not true at all.  With tmem, the "want" pages of all the guests
(plus any "fallow" pages that happen to be truly free at the
moment for various reasons) are the source of pages for a new VM.
By definition, the hypervisor can "free" any or all of these pages
the moment the toolstack tells it to allocate memory for a new
guest.  No waiting necessary.  That's how the claim_pages
hypercall works so cleanly and quickly.
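
To illustrate (a hedged sketch with made-up names, simplified
from the idea rather than copied from the patch): because every
freeable tmem page can be reclaimed on demand, a claim is pure
bookkeeping taken under the allocator's existing lock, and it
succeeds or fails immediately:

    /* Hedged sketch of a claim check -- not the exact Xen code.
     * Freeable tmem ("want") pages count as available because the
     * hypervisor can reclaim them on demand; the claim reserves an
     * amount but allocates nothing. */

    #include <errno.h>

    extern void heap_lock(void);    /* stand-ins for the allocator's */
    extern void heap_unlock(void);  /* existing spinlock             */

    unsigned long total_free_pages;     /* truly free right now */
    unsigned long tmem_freeable_pages;  /* reclaimable "wants"  */
    unsigned long outstanding_claims;   /* already reserved     */

    int claim_pages(unsigned long pages)
    {
        int rc = -ENOMEM;

        heap_lock();
        if (pages <= total_free_pages + tmem_freeable_pages
                     - outstanding_claims) {
            outstanding_claims += pages;  /* reserve, don't allocate */
            rc = 0;
        }
        heap_unlock();
        return rc;   /* success or failure is known immediately */
    }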

(And, sorry to sound like a broken record, but I think it's worth
emphasizing and re-emphasizing: this is not a blue-sky proposal.
All of this code is already working in the Xen hypervisor today.)

> > > Or, how about actually moving towards a memory scheduler like you
> > > suggested -- for example by integrating memory allocation more tightly
> > > with tmem.  There could be an xsm-style hook in the allocator for
> > > tmem-enabled domains.  That way tmem would have complete control over
> > > all memory allocations for the guests under its control, and it could
> > > implement a shared upper limit.  Potentially in future the tmem
> > > interface could be extended to allow it to force guests to give back
> > > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > > two VMs are busy, why should the one that spiked first get to keep all
> > > the RAM?) or other nice scheduler-like properties.
> >
> > Tmem (plus selfballooning), unchanged, already does some of this.
> > While I would be interested in discussing better solutions, the
> > now four-year odyssey of pushing what I thought were relatively
> > simple changes upstream into Linux has left a rather sour taste
> > in my mouth, so rather than consider any solution that requires
> > more guest kernel changes [...]
> 
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.

While you are using different words, you are describing what
tmem does today.  Tmem does have control: it uses the existing
hypervisor mechanisms and the existing hypervisor lock for memory
allocation.  That's why the "delayed-failure race" can be solved
so cleanly using the same lock.

Dan
