
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



On 18/12/12 22:17, Konrad Rzeszutek Wilk wrote:
Hi Dan, an issue with your reasoning throughout has been the constant invocation of the multi host environment as a justification for your proposal. But this argument is not used in your proposal below beyond this mention in passing. Further, there is no relation between what you are changing (the hypervisor) and what you are claiming it is needed for (multi host VM management).
Heh. I hadn't realized that these emails need to conform to
the way legal briefs are written in the US :-) Meaning that
each topic must be addressed.

Every time we try to suggest alternatives, Dan goes on some rant about how we're on different planets, how we're all old-guard stuck in static-land thinking, and how we're focused on single-server use cases while multi-server use cases are so different. That's not a one-off: Dan has brought up the multi-server case several times as a reason that a user-space version won't work. But when it comes down to it, he (apparently) has barely mentioned it. If it's such a key point, why does he not bring it up here? It turns out we were right all along -- the whole multi-server thing has nothing to do with it. That's the point Andres is getting at, I think.

(FYI I'm not wasting my time reading mail from Dan anymore on this subject. As far as I can tell in this entire discussion he has never changed his mind or his core argument in response to anything anyone has said, nor has he understood better our ideas or where we are coming from. He has only responded by generating more verbiage than anyone has the time to read and understand, much less respond to. That's why I suggested to Dan that he ask someone else to take over the conversation.)

Anyhow, a multi-host env or a single-host env has the same
issue - you try to launch multiple guests and some of
them might not launch.

The changes that Dan is proposing (the claim hypercall)
would provide the functionality to fix this problem.

A fairly bizarre limitation of a balloon-based approach to memory management. 
Why on earth should the guest be allowed to change the size of its balloon, and 
therefore its footprint on the host? This may be justified with arguments 
pertaining to the stability of the in-guest workload. What they really reveal 
are limitations of ballooning. But the inadequacy of the balloon in itself 
doesn't automatically translate into justifying the need for a new hypercall.
Why is this a limitation? Why shouldn't the guest be allowed to change
its memory usage? It can go up and down as it sees fit.
And if it goes down and it gets better performance - well, why shouldn't
it do it?

I concur it is odd - but it has been like that for decades.

Well, it shouldn't be allowed to do it because it causes this problem you're having with creating guests in parallel. Ultimately, that is the core of your problem. So if you want us to solve the problem by implementing something in the hypervisor, then you need to justify why "Just don't have guests balloon down" is an unacceptable option. Saying "why shouldn't it", and "it's been that way for decades*" isn't a good enough reason.

* Xen is only just 10, so "decades" is a bit of a hyperbole. :-)



the hypervisor, which adjusts the domain memory footprint, which changes the 
number of free pages _without_ the toolstack's knowledge.
The toolstack controls constraints (essentially a minimum and maximum)
which the hypervisor enforces.  The toolstack can ensure that the
minimum and maximum are identical to essentially disallow Linux from
using this functionality.  Indeed, this is precisely what Citrix's
Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has 
complete control and, so, knowledge of any domain memory
footprint changes.  But DMC is not prescribed by the toolstack,
Neither is enforcing min==max. This was my argument when previously commenting 
on this thread. The fact that you have enforcement of a maximum domain 
allocation gives you an excellent tool to keep a domain's unsupervised growth 
at bay. The toolstack can choose how fine-grained that control is, and how often 
it wants to be alerted and stall the domain.
There is a down-call (so, events) to the toolstack from the hypervisor when
the guest tries to balloon in/out? So the need to deal with this problem arose,
but the mechanism to deal with it has been shifted to user-space
then? What to do when the guest does this in/out ballooning at frequent
intervals?

I am actually missing the reasoning behind wanting to stall the domain.
Is that to compress/swap the pages that the guest requests? Meaning
a user-space daemon that does "things" and has ownership
of the pages?

and some real Oracle Linux customers use and depend on the flexibility
provided by in-guest ballooning.   So guest-privileged-user-driven-
ballooning is a potential issue for toolstack-based capacity allocation.

[IIGT: This is why I have brought up DMC several times and have
called this the "Citrix model"... I'm not trying to be snippy
or impugn your morals as maintainers.]

B) Xen's page sharing feature has slowly been completed over a number
of recent Xen releases.  It takes advantage of the fact that many
pages often contain identical data; the hypervisor merges them to save
Great care has been taken for this statement to not be exactly true. The hypervisor 
discards one of two pages that the toolstack tells it to (and patches the physmap of the 
VM previously pointing to the discard page). It doesn't merge, nor does it look into 
contents. The hypervisor doesn't care about the page contents. This is deliberate, so as 
to avoid spurious claims of "you are using technique X!"

Is the toolstack (or a daemon in userspace) doing this? I would
have thought that there would be some optimization to do this
somewhere?

physical RAM.  When any "shared" page is written, the hypervisor
"splits" the page (aka, copy-on-write) by allocating a new physical
page.  There is a long history of this feature in other virtualization
products and it is known to be possible that, under many circumstances, 
thousands of splits may occur in any fraction of a second.  The
hypervisor does not notify or ask permission of the toolstack.
So, page-splitting is an issue for toolstack-based capacity
allocation, at least as currently coded in Xen.

[Andre: Please hold your objection here until you read further.]
Name is Andres. And please cc me if you'll be addressing me directly!

Note that I don't disagree with your previous statement in itself. Although 
"page-splitting" is fairly unique terminology, and confusing (at least to me). 
CoW works.
<nods>
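
(For illustration, here is a minimal conceptual sketch of that
unshare-on-write "split" path. All names are placeholders, not the
actual Xen mem_sharing code; the point is just that the write fault
forces the hypervisor to allocate a fresh page on its own, with no
toolstack round-trip.)

/* Conceptual sketch only: illustrative placeholder names, not the real
 * Xen mem_sharing internals. */
static int split_shared_page(struct domain *d, unsigned long gfn,
                             struct page_info *shared_pg)
{
    struct page_info *new_pg = alloc_page_for(d);   /* can fail under pressure */

    if ( new_pg == NULL )
        return -ENOMEM;       /* exactly the allocation the toolstack never sees */

    copy_page_contents(new_pg, shared_pg);   /* give the domain a private copy */
    remap_gfn(d, gfn, new_pg);               /* patch the physmap to the copy  */
    put_shared_page_ref(shared_pg);          /* drop one ref on the shared page */

    return 0;
}

Thousands of such splits in a burst mean thousands of allocations that
the toolstack only discovers after the fact.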
C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
toolstack for over three years.  It depends on an in-guest-kernel
adaptive technique to constantly adjust the domain memory footprint as
well as hooks in the in-guest-kernel to move data to and from the
hypervisor.  While the data is in the hypervisor's care, interesting
memory-load balancing between guests is done, including optional
compression and deduplication.  All of this has been in Xen since 2009
and has been awaiting changes in the (guest-side) Linux kernel. Those
changes are now merged into the mainstream kernel and are fully
functional in shipping distros.

While a complete description of tmem's guest<->hypervisor interaction
is beyond the scope of this document, it is important to understand
that any tmem-enabled guest kernel may unpredictably request thousands
or even millions of pages directly via hypercalls from the hypervisor in a 
fraction of a second with absolutely no interaction with the toolstack.  
Further, the guest-side hypercalls that allocate pages
via the hypervisor are done in "atomic" code deep in the Linux mm
subsystem.
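
(For a flavour of the calling context, a conceptual sketch only - the
function and hypercall names below are placeholders, not the exact
Linux frontswap/cleancache or Xen tmem entry points.)

/* Conceptual sketch: called from page reclaim, possibly with spinlocks
 * held and preemption disabled.  The hypercall can cause Xen to allocate
 * or free memory on the spot, with no toolstack round-trip possible. */
static int guest_swap_out_to_tmem(struct page *page, pgoff_t offset)
{
    if (tmem_put_page_hypercall(SWAP_POOL_ID, offset, page_to_pfn(page)))
        return -1;   /* hypervisor refused (e.g. no memory); fall back to disk */
    return 0;        /* page now lives in hypervisor-managed memory */
}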

Indeed, if one truly understands tmem, it should become clear that
tmem is fundamentally incompatible with toolstack-based capacity
allocation. But let's stop discussing tmem for now and move on.
You have not discussed tmem pool thaw and freeze in this proposal.
Oooh, you know about it :-) Dan didn't want to go too verbose on
people. It is a bit of a rathole - and this hypercall would
allow us to deprecate said freeze/thaw calls.

OK.  So with existing code both in Xen and Linux guests, there are
three challenges to toolstack-based capacity allocation.  We'd
really still like to do capacity allocation in the toolstack.  Can
something be done in the toolstack to "fix" these three cases?

Possibly.  But let's first look at hypervisor-based capacity
allocation: the proposed "XENMEM_claim_pages" hypercall.

HYPERVISOR-BASED CAPACITY ALLOCATION

The posted patch for the claim hypercall is quite simple, but let's
look at it in detail.  The claim hypercall is actually a subop
of an existing hypercall.  After checking parameters for validity,
a new function is called in the core Xen memory management code.
This function takes the hypervisor heaplock, checks for a few
special cases, does some arithmetic to ensure a valid claim, stakes
the claim, releases the hypervisor heaplock, and then returns.  To
review from earlier, the hypervisor heaplock protects _all_ page/slab
allocations, so we can be absolutely certain that there are no other
page allocation races.  This new function is about 35 lines of code,
not counting comments.
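
To make that concrete, here is a minimal sketch of what such a
claim-staking function could look like. The names (heap_lock,
total_avail_pages, outstanding_claims, outstanding_pages) are
placeholders standing in for the bookkeeping the patch adds; this is
not the literal patch code.

static long stake_claim(struct domain *d, unsigned long pages)
{
    long rc = -ENOMEM;
    unsigned long claim, avail;

    spin_lock(&heap_lock);

    /* Pages the domain already holds count toward the claim. */
    if ( d->tot_pages >= pages )
    {
        rc = 0;                      /* nothing further to reserve */
        goto out;
    }
    claim = pages - d->tot_pages;

    /* Only free memory not already claimed by others may back a new claim. */
    avail = total_avail_pages - outstanding_claims;
    if ( claim > avail )
        goto out;

    d->outstanding_pages = claim;    /* stake the claim ...         */
    outstanding_claims  += claim;    /* ... and record it host-wide */
    rc = 0;

 out:
    spin_unlock(&heap_lock);
    return rc;
}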

The patch includes two other significant changes to the hypervisor:
First, when any adjustment to a domain's memory footprint is made
(either through a toolstack-aware hypercall or one of the three
toolstack-unaware methods described above), the heaplock is
taken, arithmetic is done, and the heaplock is released.  This
is 12 lines of code.  Second, when any memory is allocated within
Xen, a check must be made (with the heaplock already held) to
determine if, given a previous claim, the domain has exceeded
its upper bound, maxmem.  This code is a single conditional test.
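
Again as a sketch, with the same placeholder names as above (not the
literal patch): the footprint-adjustment hook retires part of the claim
as the domain actually receives pages, and the allocator check is a
single conditional - shown here in one plausible shape, keeping
unclaimed allocations from eating into memory claimed by others.

/* (1) Whenever a domain's footprint changes (toolstack hypercall,
 *     ballooning, page sharing, tmem), retire that much of its claim: */
static void claim_adjust(struct domain *d, unsigned long got_pages)
{
    spin_lock(&heap_lock);
    if ( d->outstanding_pages )
    {
        unsigned long retired = min(got_pages, d->outstanding_pages);

        d->outstanding_pages -= retired;
        outstanding_claims   -= retired;
    }
    spin_unlock(&heap_lock);
}

/* (2) In the allocator, with heap_lock already held, the single conditional: */
    if ( outstanding_claims + request > total_avail_pages &&
         d->outstanding_pages < request )
        goto not_found;   /* would break someone else's claim: refuse */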

With some declarations, but not counting the copious comments,
all told, the new code provided by the patch is well under 100 lines.

What about the toolstack side?  First, it's important to note that
the toolstack changes are entirely optional.  If any toolstack
wishes either to not fix the original problem, or avoid toolstack-
unaware allocation completely by ignoring the functionality provided
by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
not use the new hypercall.
You are ruling out any other possibility here. In particular, but not limited 
to, use of max_pages.
The one max_pages check that comes to my mind is the one that Xapi
uses. That is, it has a daemon that sets the max_pages of all the
guests at some value so that it can squeeze in as many guests as
possible. It also balloons pages out of a guest to make space if
needed to launch one. The heuristic for how many pages, or the ratio
of max/min, looks to be proportional (so to make space for 1GB
for a new guest, with say 10 existing guests, we subtract
101MB from each guest - the extra 1MB is for extra overhead).
This depends on one hypercall that the 'xl' or 'xm' toolstacks do not
use - the one which sets max_pages.
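
As a rough sketch of that proportional squeeze (hypothetical helper
and type names; the real xapi/squeezed logic is considerably more
involved):

struct guest { int domid; unsigned long target_kb; };

/* To make room for a new 1GB guest across 10 existing guests, shave
 * roughly 1000MB/10 = 100MB plus ~1MB of overhead, i.e. ~101MB, off each. */
static void squeeze_for_new_guest(struct guest *g, int nr_guests,
                                  unsigned long need_kb,
                                  unsigned long per_guest_overhead_kb)
{
    unsigned long shave_kb = need_kb / nr_guests + per_guest_overhead_kb;

    for ( int i = 0; i < nr_guests; i++ )
    {
        g[i].target_kb -= shave_kb;
        set_max_pages(g[i].domid, kb_to_pages(g[i].target_kb));  /* cap growth */
        set_balloon_target(g[i].domid, g[i].target_kb);          /* ask nicely */
    }
    /* ...then wait for the balloons to actually deflate before building. */
}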

That code makes certain assumptions - that the guest will not go up/down
in its ballooning once the toolstack has decreed how much
memory the guest should use. It also assumes that the operations
are semi-atomic - and to make them so as much as it can, it executes
these operations serially.

No, the xapi code makes no such assumptions. After it tells a guest to balloon down, it watches to see what actually happens, and has heuristics to deal with "non-cooperative guests". It does assume that if it sets max_pages lower than or equal to the current amount of used memory, the hypervisor will not allow the guest to balloon up -- but that's a pretty safe assumption. A guest can balloon down if it wants to, but as xapi does not consider that memory free, it will never use it.

BTW, I don't know if you realize this: Originally Xen would return an error if you tried to set max_pages below tot_pages. But as a result of the DMC work, it was seen as useful to allow the toolstack to tell the hypervisor once, "Once the VM has ballooned down to X, don't let it balloon up above X anymore."

This goes back to the problem statement - if we try to parallelize
this we run into the problem that the amount of memory we thought
was free is not true anymore. The start of this email has a good
description of some of the issues.

In essence, max_pages does work - _if_ one does these operations
in serial. We are trying to make this work in parallel and without
any failures - for that, one way that is quite simplistic
is the claim hypercall. It sets up a 'stake' on the amount of
memory that the hypervisor should reserve. This way other
guest creations/ballooning do not infringe on the 'claimed' amount.

I'm not sure what you mean by "do these operations in serial" in this context. Each of your "reservation hypercalls" has to happen in serial. If we had a user-space daemon that was in charge of freeing up or reserving memory, each request to that daemon would happen in serial as well. But once the allocation / reservation happened, the domain builds could happen in parallel.

I believe that with this hypercall Xapi can be made to do its operations
in parallel as well.

xapi can already boot guests in parallel when there's enough memory to do so -- what operations did you have in mind?

I haven't followed all of the discussion (for reasons mentioned above), but I think the alternative to Dan's solution is something like below. Maybe you can tell me why it's not very suitable:

Have one place in the user-space -- either in the toolstack, or a separate daemon -- that is responsible for knowing all the places where memory might be in use. Memory can be in use either by Xen, or by one of several VMs, or in a tmem pool.

In your case, when not creating VMs, it can remove all limitations -- allow the guests or tmem to grow or shrink as much as they want.

When a request comes in for a certain amount of memory, it will go and set each VM's max_pages, and the max tmem pool size. It can then check whether there is enough free memory to complete the allocation or not (since there's a race between checking how much memory a guest is using and setting max_pages). If that succeeds, it can return "success". If, while that VM is being built, another request comes in, it can again go around and set the max sizes lower. It has to know how much of the memory is "reserved" for the first guest being built, but if there's enough left after that, it can return "success" and allow the second VM to start being built.

After the VMs are built, the toolstack can remove the limits again if it wants, again allowing the free flow of memory.

Do you see any problems with this scheme? All it requires is for the toolstack to be able to temporarily set limits on both guests ballooning up and on tmem allocating more than a certain amount of memory. We already have mechanisms for the first, so if we had a "max_pages" for tmem, then you'd have all the tools you need to implement it.
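
For concreteness, here's a minimal sketch of that scheme. Everything in it is hypothetical - the helper names, the tmem cap, and the data model are placeholders, not an existing libxl/xapi interface:

#define MAX_DOMS 256   /* arbitrary for the sketch */

struct host_state {
    int nr_domains;
    struct { int domid; unsigned long tot_pages, configured_max; } dom[MAX_DOMS];
    unsigned long reserved_pages;    /* sum of not-yet-built reservations */
};

/* Reserve memory for a VM about to be built.  Returns 0 on success. */
static int reserve_for_new_vm(struct host_state *h, unsigned long need_pages)
{
    unsigned long free_pages;

    /* 1. Clamp every consumer so "free" cannot shrink underneath us:
     *    pin each guest's max_pages at its current usage, and cap tmem. */
    for ( int i = 0; i < h->nr_domains; i++ )
        set_max_pages(h->dom[i].domid, h->dom[i].tot_pages);
    set_tmem_max_pages(current_tmem_pool_pages());    /* the missing knob */

    /* 2. With the caps in place, check free memory minus what earlier,
     *    still-in-progress builds have already reserved. */
    free_pages = host_free_pages() - h->reserved_pages;
    if ( free_pages < need_pages )
        return -1;             /* not enough; caller may squeeze guests first */

    h->reserved_pages += need_pages;   /* builds can now run in parallel */
    return 0;
}

/* Once the build completes (or is aborted), drop the reservation and lift
 * the caps so guests and tmem can grow freely again. */
static void finish_build(struct host_state *h, unsigned long pages)
{
    h->reserved_pages -= pages;
    for ( int i = 0; i < h->nr_domains; i++ )
        set_max_pages(h->dom[i].domid, h->dom[i].configured_max);
}

The serialisation lives entirely in this daemon; the domain builds themselves can proceed in parallel once their reservations are recorded.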

This is the point at which Dan says something about giant multi-host deployments, which has absolutely no bearing on the issue -- the reservation happens at a host level, whether it's in userspace or the hypervisor.

It's also where he goes on about how we're stuck in an old stodgy static world and he lives in a magical dynamic hippie world of peace and free love... er, free memory. Which is also not true -- in the scenario I describe above, tmem is actively being used, and guests can actively balloon down and up, while the VM builds are happening. In Dan's proposal, tmem and guests are prevented from allocating "reserved" memory by some complicated scheme inside the allocator; in the above proposal, tmem and guests are prevented from allocating "reserved" memory by simple hypervisor-enforced max_page settings. The end result looks the same to me.

 -George



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

