
[Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



I earlier promised a complete analysis of the problem
addressed by the proposed claim hypercall as well as
an analysis of the alternate solutions.  I had not
yet provided these analyses when I asked for approval
to commit the hypervisor patch, so there was still
a good amount of misunderstanding, and I am trying
to fix that here.

I had hoped this essay could be both concise and complete
but quickly found it to be impossible to be both at the
same time.  So I have erred on the side of verbosity,
but also have attempted to ensure that the analysis
flows smoothly and is understandable to anyone interested
in learning more about memory allocation in Xen.
I'd appreciate feedback from other developers to understand
if I've also achieved that goal.

Ian, Ian, George, and Tim -- I have tagged a few
out-of-flow questions to you with [IIGT].  If I lose
you at any point, I'd especially appreciate your feedback
at those points.  I trust that, first, you will read
this completely.  As I've said, I understand that
Oracle's paradigm may differ in many ways from your
own, so I also trust that you will read it completely
with an open mind.

Thanks,
Dan

PROBLEM STATEMENT OVERVIEW

The fundamental problem is a race; two entities are
competing for part or all of a shared resource: in this case,
physical system RAM.  Normally, a lock is used to mediate
a race.

For memory allocation in Xen, there are two significant
entities, the toolstack and the hypervisor.  And, in
general terms, there are currently two important locks:
one used in the toolstack for domain creation;
and one in the hypervisor used for the buddy allocator.

Considering first only domain creation, the toolstack
lock is taken to ensure that domain creation is serialized.
The lock is taken when domain creation starts, and released
when domain creation is complete.

As system and domain memory requirements grow, the amount
of time to allocate all necessary memory to launch a large
domain is growing and may now exceed several minutes, so
this serialization is increasingly problematic.  The result
is a customer reported problem:  If a customer wants to
launch two or more very large domains, the "wait time"
required by the serialization is unacceptable.

Oracle would like to solve this problem.  And Oracle
would like to solve this problem not just for a single
customer sitting in front of a single machine console, but
for the very complex case of a large number of machines,
with the "agent" on each machine taking independent
actions including automatic load balancing and power
management via migration.  (This complex environment
is sold by Oracle today; it is not a "future vision".)

[IIGT] Completely ignoring any possible solutions to this
problem, is everyone in agreement that this _is_ a problem
that _needs_ to be solved with _some_ change in the Xen
ecosystem?

SOME IMPORTANT BACKGROUND INFORMATION

In the subsequent discussion, it is important to
understand a few things:

While the toolstack lock is held, allocating memory for
the domain creation process is done as a sequence of one
or more hypercalls, each asking the hypervisor to allocate
one or more -- "X" -- slabs of physical RAM, where a slab
is 2**N contiguous aligned pages, also known as an
"order N" allocation.  While the hypercall is defined
to work with any value of N, common values are N=0
(individual pages), N=9 ("hugepages" or "superpages"),
and N=18 ("1GiB pages").  So, for example, if the toolstack
requires 201MiB of memory, it will make two hypercalls:
One with X=100 and N=9, and one with X=256 and N=0.
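
For concreteness, here is a tiny standalone sketch -- not actual
toolstack code -- of how such a request decomposes into extents,
assuming 4KiB pages as on x86:

#include <stdio.h>

#define PAGE_SHIFT       12    /* 4KiB pages */
#define SUPERPAGE_ORDER   9    /* one order-9 slab == 2MiB */

int main(void)
{
    unsigned long target_kib = 201 * 1024;   /* the 201MiB example */
    unsigned long pages = target_kib >> (PAGE_SHIFT - 10);
    /* X for the N=9 hypercall: */
    unsigned long slabs = pages >> SUPERPAGE_ORDER;
    /* X for the N=0 hypercall: */
    unsigned long singles = pages & ((1UL << SUPERPAGE_ORDER) - 1);

    printf("%lu KiB -> %lu order-9 slabs + %lu order-0 pages\n",
           target_kib, slabs, singles);
    return 0;
}

For the 201MiB example, this prints 100 order-9 slabs plus 256
order-0 pages.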

While the toolstack may ask for a smaller number X of
order==9 slabs, system fragmentation may unpredictably
cause the hypervisor to fail the request, in which case
the toolstack will fall back to a request for 512*X
individual pages.  If there is sufficient RAM in the system,
this request for order==0 pages is guaranteed to succeed.
Thus for a 1TiB domain, the hypervisor must be prepared
to allocate up to 256Mi individual pages.
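
The fallback itself is just a retry at order 0.  Here is a sketch of
that pattern, with a hypothetical populate() standing in for the real
populate_physmap wrapper (whose name and signature differ in the
actual toolstack):

#include <stdio.h>

/* Hypothetical stand-in for the populate_physmap wrapper: returns 0
 * on success, -1 on failure.  Here it pretends the heap is too
 * fragmented to satisfy any order-9 request. */
static int populate(unsigned int domid, unsigned long count,
                    unsigned int order)
{
    (void)domid; (void)count;
    return (order == 9) ? -1 : 0;
}

static int allocate_chunk(unsigned int domid,
                          unsigned long nr_superpages)
{
    if (populate(domid, nr_superpages, 9) == 0)
        return 0;                      /* got the 2MiB slabs */

    /* Fragmentation: fall back to 512*X order-0 pages, which
     * succeeds as long as enough individual pages remain. */
    return populate(domid, nr_superpages << 9, 0);
}

int main(void)
{
    printf("fallback result: %d\n", allocate_chunk(1, 100));
    return 0;
}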

Note carefully that when the toolstack hypercall asks for
100 slabs, the hypervisor "heaplock" is currently taken
and released 100 times.  Similarly, for 256Mi individual
pages... 256Mi spin_lock-alloc_page-spin_unlock cycles.
This means that domain creation is not "atomic" inside
the hypervisor, which means that races can and will still
occur.
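
To illustrate, here is a sketch of that locking pattern, with a
pthread mutex standing in for Xen's heaplock; the point is that
between any two iterations some other caller is free to take the
lock and consume the remaining pages:

#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for Xen's heap lock and free-page count. */
static pthread_mutex_t heaplock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1UL << 28;   /* 1TiB of 4KiB pages */

static bool alloc_one_extent(unsigned int order)
{
    bool ok = false;

    pthread_mutex_lock(&heaplock);
    if (free_pages >= (1UL << order)) {
        free_pages -= 1UL << order;
        ok = true;
    }
    pthread_mutex_unlock(&heaplock);
    /* Here, any other caller may take the lock and consume the
     * remaining pages; the domain build as a whole is not atomic. */
    return ok;
}

int main(void)
{
    /* 100 order-9 extents == 100 lock round trips. */
    for (int i = 0; i < 100 && alloc_one_extent(9); i++)
        ;
    return 0;
}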

RULING OUT SOME SIMPLE SOLUTIONS

Is there an elegant simple solution here?

Let's first consider the possibility of removing the toolstack
serialization entirely and/or the possibility that two
independent toolstack threads (or "agents") can simultaneously
request a very large domain creation in parallel.  As described
above, the hypervisor's heaplock is insufficient to serialize RAM
allocation, so the two domain creation processes race.  If there
is sufficient resource for either one to launch, but insufficient
resource for both to launch, the winner of the race is indeterminate,
and one or both launches will fail, possibly after one or both 
domain creation threads have been working for several minutes.
This is a classic "TOCTOU" (time-of-check-time-of-use) race.
If a customer is unhappy waiting several minutes to launch
a domain, they will be even more unhappy waiting for several
minutes to be told that one or both of the launches has failed.
Multi-minute failure is even more unacceptable for an automated
agent trying to, for example, evacuate a machine that the
data center administrator needs to powercycle.

[IIGT: Please hold your objections for a moment... the paragraph
above is discussing the simple solution of removing the serialization;
your suggested solution will be discussed soon.]
 
Next, let's consider the possibility of changing the heaplock
strategy in the hypervisor so that the lock is held not
for one slab but for the entire request of N slabs.  As with
any core hypervisor lock, holding the heaplock for a "long time"
is unacceptable.  To a hypervisor, several minutes is an eternity.
And, in any case, by serializing domain creation in the hypervisor,
we have really only moved the problem from the toolstack into
the hypervisor, not solved the problem.

[IIGT] Are we in agreement that these simple solutions can be
safely ruled out?

CAPACITY ALLOCATION VS RAM ALLOCATION

Looking for a creative solution, one may realize that it is the
page allocation -- especially in large quantities -- that is very
time-consuming.  But, thinking outside of the box, it is not
the actual pages of RAM that we are racing on, but the quantity of pages 
required to launch a domain!  If we instead have a way to
"claim" a quantity of pages cheaply now and then allocate the actual
physical RAM pages later, we have changed the race to require only 
serialization of the claiming process!  In other words, if some entity
knows the number of pages available in the system, and can "claim"
N pages for the benefit of a domain being launched, the successful launch of 
the domain can be ensured.  Well... the domain launch may
still fail for an unrelated reason, but not due to a memory TOCTOU
race.  But, in this case, if the cost (in time) of the claiming
process is very small compared to the cost of the domain launch,
we have solved the memory TOCTOU race with hardly any delay added
to a non-memory-related failure that would have occurred anyway.

This "claim" sounds promising.  But we have made an assumption that
an "entity" has certain knowledge.  In the Xen system, that entity
must be either the toolstack or the hypervisor.  Or, in the Oracle
environment, an "agent"... but an agent and a toolstack are similar
enough for our purposes that we will just use the more broadly-used
term "toolstack".  In using this term, however, it's important to
remember that this toolstack may contain multiple threads acting
concurrently.

Now I quote Ian Jackson: "It is a key design principle of a system
like Xen that the hypervisor should provide only those facilities
which are strictly necessary.  Any functionality which can be
reasonably provided outside the hypervisor should be excluded
from it."

So let's examine the toolstack first.

[IIGT] Still all on the same page (pun intended)?

TOOLSTACK-BASED CAPACITY ALLOCATION

Does the toolstack know how many physical pages of RAM are available?
Yes, it can use a hypercall to find out this information after Xen and
dom0 launch, but before it launches any domain.  Then if it subtracts
the number of pages used when it launches a domain and is aware of
when any domain dies, and adds them back, the toolstack has a pretty
good estimate.  In actuality, the toolstack doesn't _really_ know the
exact number of pages used when a domain is launched, but there
is a poorly-documented "fuzz factor"... the toolstack knows the
number of pages within a few megabytes, which is probably close enough.

This is a fairly good description of how the toolstack works today
and the accounting seems simple enough, so does toolstack-based
capacity allocation solve our original problem?  It would seem so.
Even if there are multiple threads, the accounting -- not the extended
sequence of page allocation for the domain creation -- can be
serialized by a lock in the toolstack.  But note carefully: either
the toolstack and the hypervisor must always stay in sync on the
number of available pages (within an acceptable margin of error),
or any query to the hypervisor _and_ the subsequent toolstack-based
claim must be paired atomically, i.e. the toolstack lock must be
held across both.  Otherwise we have yet another TOCTOU race.
Interesting, but probably not really a problem.
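
A sketch of that toolstack-side accounting, assuming a single
toolstack process and a free_page_estimate seeded by a hypervisor
query (both names are mine, for illustration only):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
/* Seeded from a hypervisor query at start of day. */
static unsigned long free_page_estimate = 1UL << 28;

/* The check and the claim sit under one lock, so no competing
 * toolstack thread can sneak in between them. */
static bool claim_pages(unsigned long nr_pages)
{
    bool ok = false;

    pthread_mutex_lock(&acct_lock);
    if (free_page_estimate >= nr_pages) {
        free_page_estimate -= nr_pages;
        ok = true;
    }
    pthread_mutex_unlock(&acct_lock);
    return ok;
}

/* Called when a domain dies or is shrunk. */
static void release_pages(unsigned long nr_pages)
{
    pthread_mutex_lock(&acct_lock);
    free_page_estimate += nr_pages;
    pthread_mutex_unlock(&acct_lock);
}

int main(void)
{
    if (claim_pages(1UL << 18))        /* 1GiB worth of pages */
        printf("claim ok, launch can proceed\n");
    release_pages(1UL << 18);
    return 0;
}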

Wait, isn't it possible for the toolstack to dynamically change the
number of pages assigned to a domain?  Yes, this is often called
ballooning and the toolstack can do this via a hypercall.  But
that's still OK because each call goes through the toolstack and
it simply needs to add more accounting for when it uses ballooning
to adjust the domain's memory footprint.  So we are still OK.

But wait again... that brings up an interesting point.  Are there
any significant allocations that are done in the hypervisor without
the knowledge and/or permission of the toolstack?  If so, the
toolstack may be missing important information.

So are there any such allocations?  Well... yes. There are a few.
Let's take a moment to enumerate them:

A) In Linux, a privileged user can write to a sysfs file that feeds
the balloon driver, which in turn makes hypercalls from the guest
kernel to the hypervisor; this adjusts the domain memory footprint
and changes the number of free pages _without_ the toolstack's
knowledge.
The toolstack controls constraints (essentially a minimum and maximum)
which the hypervisor enforces.  The toolstack can ensure that the
minimum and maximum are identical to essentially disallow Linux from
using this functionality.  Indeed, this is precisely what Citrix's
Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has 
complete control, and thus knowledge, of any domain memory
footprint changes.  But DMC is not prescribed by the toolstack,
and some real Oracle Linux customers use and depend on the flexibility
provided by in-guest ballooning.   So guest-privileged-user-driven-
ballooning is a potential issue for toolstack-based capacity allocation.
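
To make the point concrete, the following illustration-only program
is all it takes inside the guest; the sysfs path is the one exposed
by the mainline Linux balloon driver and may differ on other kernels:

#include <stdio.h>

int main(void)
{
    const char *node =
        "/sys/devices/system/xen_memory/xen_memory0/target_kb";
    FILE *f = fopen(node, "w");

    if (!f) {
        perror(node);
        return 1;
    }
    /* Ask the balloon driver to shrink (or grow) the guest to
     * 512MiB; the resulting hypercalls never pass through the
     * toolstack. */
    fprintf(f, "%lu\n", 512UL * 1024);
    fclose(f);
    return 0;
}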

[IIGT: This is why I have brought up DMC several times and have
called this the "Citrix model"... I'm not trying to be snippy
or impugn your morals as maintainers.]

B) Xen's page sharing feature has slowly been completed over a number
of recent Xen releases.  It takes advantage of the fact that many
pages often contain identical data; the hypervisor merges them to save
physical RAM.  When any "shared" page is written, the hypervisor
"splits" the page (aka, copy-on-write) by allocating a new physical
page.  There is a long history of this feature in other virtualization
products and it is known that, under many circumstances,
thousands of splits may occur within a fraction of a second.  The
hypervisor does not notify or ask permission of the toolstack.
So, page-splitting is an issue for toolstack-based capacity
allocation, at least as currently coded in Xen.

[Andres: Please hold your objection here until you read further.]

C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
toolstack for over three years.  It depends on an in-guest-kernel
adaptive technique to constantly adjust the domain memory footprint as
well as hooks in the in-guest-kernel to move data to and from the
hypervisor.  While the data is in the hypervisor's care, interesting
memory-load balancing between guests is done, including optional
compression and deduplication.  All of this has been in Xen since 2009
and has been awaiting changes in the (guest-side) Linux kernel. Those
changes are now merged into the mainline kernel and are fully
functional in shipping distros.

While a complete description of tmem's guest<->hypervisor interaction
is beyond the scope of this document, it is important to understand
that any tmem-enabled guest kernel may unpredictably request thousands
or even millions of pages directly from the hypervisor via hypercalls,
within a fraction of a second, with absolutely no interaction with the
toolstack.
Further, the guest-side hypercalls that allocate pages
via the hypervisor are done in "atomic" code deep in the Linux mm
subsystem.

Indeed, if one truly understands tmem, it should become clear that
tmem is fundamentally incompatible with toolstack-based capacity
allocation. But let's stop discussing tmem for now and move on.

OK.  So with existing code both in Xen and Linux guests, there are
three challenges to toolstack-based capacity allocation.  We'd
really still like to do capacity allocation in the toolstack.  Can
something be done in the toolstack to "fix" these three cases?

Possibly.  But let's first look at hypervisor-based capacity
allocation: the proposed "XENMEM_claim_pages" hypercall.

HYPERVISOR-BASED CAPACITY ALLOCATION

The posted patch for the claim hypercall is quite simple, but let's
look at it in detail.  The claim hypercall is actually a subop
of an existing hypercall.  After checking parameters for validity,
a new function is called in the core Xen memory management code.
This function takes the hypervisor heaplock, checks for a few
special cases, does some arithmetic to ensure a valid claim, stakes
the claim, releases the hypervisor heaplock, and then returns.  To
review from earlier, the hypervisor heaplock protects _all_ page/slab
allocations, so we can be absolutely certain that there are no other
page allocation races.  This new function is about 35 lines of code,
not counting comments.

The patch includes two other significant changes to the hypervisor:
First, when any adjustment to a domain's memory footprint is made
(either through a toolstack-aware hypercall or one of the three
toolstack-unaware methods described above), the heaplock is
taken, arithmetic is done, and the heaplock is released.  This
is 12 lines of code.  Second, when any memory is allocated within
Xen, a check must be made (with the heaplock already held) to
determine if, given a previous claim, the domain has exceeded
its upper bound, maxmem.  This code is a single conditional test.

With some declarations, but not counting the copious comments,
all told, the new code provided by the patch is well under 100 lines.
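
For those who have not yet read the patch, here is a much-simplified
sketch of the staking step; the lock, counters, and field names are
stand-ins for illustration, not the hypervisor's real ones:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Stand-ins for the hypervisor's heap lock and counters. */
static pthread_mutex_t heaplock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_free = 1UL << 28;  /* free heap pages */
static unsigned long total_outstanding;       /* unfulfilled claims */

struct domain_acct {
    unsigned long tot_pages;     /* pages already allocated */
    unsigned long outstanding;   /* claimed but not yet allocated */
};

static int stake_claim(struct domain_acct *d, unsigned long nr_pages)
{
    int rc = -ENOMEM;

    pthread_mutex_lock(&heaplock);
    if (nr_pages <= d->tot_pages || d->outstanding) {
        /* Nothing left to claim, or a claim is already staked. */
        rc = -EINVAL;
    } else if (nr_pages - d->tot_pages <=
               total_free - total_outstanding) {
        /* Enough unclaimed free memory: stake the claim and
         * reserve it globally.  Later allocations for this domain
         * are debited against it; a claim of 0 cancels it. */
        d->outstanding = nr_pages - d->tot_pages;
        total_outstanding += d->outstanding;
        rc = 0;
    }
    pthread_mutex_unlock(&heaplock);
    return rc;
}

int main(void)
{
    struct domain_acct d = { 0, 0 };
    printf("claim 1TiB of pages: %d\n", stake_claim(&d, 1UL << 28));
    return 0;
}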

What about the toolstack side?  First, it's important to note that
the toolstack changes are entirely optional.  If a toolstack
chooses either not to fix the original problem, or to avoid toolstack-
unaware allocation entirely by forgoing the functionality provided
by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
not use the new hypercall.  Second, it's very relevant to note that the Oracle
product uses a combination of a proprietary "manager"
which oversees many machines, and the older open-source xm/xend
toolstack, for which the current Xen toolstack maintainers are no
longer accepting patches.

The preface of the published patch does suggest, however, some
straightforward pseudo-code, as follows:

Current toolstack domain creation memory allocation code fragment:

1. call populate_physmap repeatedly to achieve mem=N memory
2. if any populate_physmap call fails, report -ENOMEM up the stack
3. memory is held until domain dies or the toolstack decreases it

Proposed toolstack domain creation memory allocation code fragment
(new code marked with "+"):

+  call claim for mem=N amount of memory
+  if claim succeeds:
1.   call populate_physmap repeatedly to achieve mem=N memory (failsafe)
+  else
2.   report -ENOMEM up the stack
+  claim is held until mem=N is achieved, or the domain dies, or
+    the claim is forced to 0 by a second hypercall
3. memory is held until domain dies or the toolstack decreases it
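
Rendered as C with hypothetical stubs (claim, populate_all,
cancel_claim -- not the real libxc entry points), the flow is simply:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stubs, not the real libxc entry points. */
static int claim(unsigned int domid, unsigned long nr_pages)
{ (void)domid; (void)nr_pages; return 0; }
static int populate_all(unsigned int domid, unsigned long nr_pages)
{ (void)domid; (void)nr_pages; return 0; }
static void cancel_claim(unsigned int domid)
{ (void)domid; }

static int build_domain_memory(unsigned int domid,
                               unsigned long nr_pages)
{
    if (claim(domid, nr_pages))
        return -ENOMEM;   /* fail in milliseconds, not minutes */

    /* With the claim staked, populate_physmap becomes a failsafe
     * rather than the point of failure. */
    int rc = populate_all(domid, nr_pages);

    cancel_claim(domid);  /* claim lapses once mem=N is in place */
    return rc;
}

int main(void)
{
    printf("build: %d\n", build_domain_memory(1, 1UL << 28));
    return 0;
}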

Reviewing the pseudo-code, one can readily see that the toolstack
changes required to implement the hypercall are quite small.

To complete this discussion, it has been pointed out that
the proposed hypercall doesn't solve the original problem
for certain classes of legacy domains... but also neither
does it make the problem worse.  It has also been pointed
out that the proposed patch is not (yet) NUMA-aware.

Now let's return to the earlier question:  There are three 
challenges to toolstack-based capacity allocation, which are
all handled easily by in-hypervisor capacity allocation. But we'd
really still like to do capacity allocation in the toolstack.
Can something be done in the toolstack to "fix" these three cases?

The answer is, of course, certainly... anything can be done in
software.  So, recalling Ian Jackson's stated requirement:

 "Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

we are now left to evaluate the subjective term "reasonably".

CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?

In earlier discussion on this topic, when page-splitting was raised
as a concern, some of the authors of Xen's page-sharing feature
pointed out that a mechanism could be designed such that "batches"
of pages were pre-allocated by the toolstack and provided to the
hypervisor to be utilized as needed for page-splitting.  Should the
batch run dry, the hypervisor could pause the domain that provoked
the page-split until the toolstack, consulted at its leisure, asks the
hypervisor to refill the batch, at which point the page-split-causing
domain may proceed.

But this batch page-allocation isn't implemented in Xen today.

Andres Lagar-Cavilla says "... this is because of shortcomings in the
[Xen] mm layer and its interaction with wait queues, documented
elsewhere."  In other words, this batching proposal requires
significant changes to the hypervisor, which I think we
all agreed we were trying to avoid.

[Note to Andres: I'm not objecting to the need for this functionality
for page-sharing to work with proprietary kernels and DMC; just
pointing out that it, too, is dependent on further hypervisor changes.]

Such an approach makes sense in the min==max model enforced by
DMC but, again, DMC is not prescribed by the toolstack.

Further, this waitqueue solution for page-splitting handles in-guest
ballooning only awkwardly (probably requiring still more hypervisor
changes, TBD) and would be useless for tmem.  [IIGT: Please argue
this last point only if you feel confident you truly understand how
tmem works.]

So this as-yet-unimplemented solution only really solves a part
of the problem.

Are there any other possibilities proposed?  Ian Jackson has
suggested a somewhat different approach:

Let me quote Ian Jackson again:

"Of course if it is really desired to have each guest make its own
decisions and simply for them to somehow agree to divvy up the
available resources, then even so a new hypervisor mechanism is
not needed.  All that is needed is a way for those guests to
synchronise their accesses and updates to shared records of the
available and in-use memory."

Ian then goes on to say:  "I don't have a detailed counter-proposal
design of course..."

This proposal is certainly possible, but I think most would agree that
it would require some fairly massive changes in OS memory management
design that would run contrary to many years of computing history.
It requires guest OS's to cooperate with each other about basic memory
management decisions.  And to work for tmem, it would require
communication from atomic code in the kernel to user-space, then communication 
from user-space in a guest to user-space-in-domain0
and then (presumably... I don't have a design either) back again.
One must also wonder what the performance impact would be.

CONCLUDING REMARKS

"Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

I think this document has described a real customer problem and
a good solution that could be implemented either in the toolstack
or in the hypervisor.  Memory allocation in existing Xen functionality
has been shown to interfere significantly with the toolstack-based
solution, and the suggested partial solutions to those issues either
require even more hypervisor work, or are completely undesigned and,
at least, call into question the definition of "reasonably".

The hypervisor-based solution has been shown to be extremely
simple, fits very logically with existing Xen memory management
mechanisms/code, and has been reviewed through several iterations
by Xen hypervisor experts.

While I understand completely the Xen maintainers' desire to
fend off unnecessary additions to the hypervisor, I believe
XENMEM_claim_pages is a reasonable and natural hypervisor feature
and I hope you will now Ack the patch.

Acknowledgements: Thanks very much to Konrad for his thorough
read-through and for suggestions on how to soften my combative
style, which may have alienated the maintainers more than the
proposal itself.
