
[Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



I earlier promised a complete analysis of the problem
addressed by the proposed claim hypercall as well as
an analysis of the alternate solutions.  I had not
yet provided these analyses when I asked for approval
to commit the hypervisor patch, so there was still
a good amount of misunderstanding, and I am trying
to fix that here.

I had hoped this essay could be both concise and complete
but quickly found it to be impossible to be both at the
same time.  So I have erred on the side of verbosity,
but also have attempted to ensure that the analysis
flows smoothly and is understandable to anyone interested
in learning more about memory allocation in Xen.
I'd appreciate feedback from other developers to understand
if I've also achieved that goal.

Ian, Ian, George, and Tim -- I have tagged a few
out-of-flow questions to you with [IIGT].  If I lose
you at any point, I'd especially appreciate your feedback
at those points.  I trust that, first, you will read
this completely.  As I've said, I understand that
Oracle's paradigm may differ in many ways from your
own, so I also trust that you will read it completely
with an open mind.

Thanks,
Dan

PROBLEM STATEMENT OVERVIEW

The fundamental problem is a race; two entities are
competing for part or all of a shared resource: in this case,
physical system RAM.  Normally, a lock is used to mediate
a race.

For memory allocation in Xen, there are two significant
entities, the toolstack and the hypervisor.  And, in
general terms, there are currently two important locks:
one used in the toolstack for domain creation;
and one in the hypervisor used for the buddy allocator.

Considering first only domain creation, the toolstack
lock is taken to ensure that domain creation is serialized.
The lock is taken when domain creation starts, and released
when domain creation is complete.

As system and domain memory requirements grow, the amount
of time to allocate all necessary memory to launch a large
domain is growing and may now exceed several minutes, so
this serialization is increasingly problematic.  The result
is a customer reported problem:  If a customer wants to
launch two or more very large domains, the "wait time"
required by the serialization is unacceptable.

Oracle would like to solve this problem.  And Oracle
would like to solve this problem not just for a single
customer sitting in front of a single machine console, but
for the very complex case of a large number of machines,
with the "agent" on each machine taking independent
actions including automatic load balancing and power
management via migration.  (This complex environment
is sold by Oracle today; it is not a "future vision".)

[IIGT] Completely ignoring any possible solutions to this
problem, is everyone in agreement that this _is_ a problem
that _needs_ to be solved with _some_ change in the Xen
ecosystem?

SOME IMPORTANT BACKGROUND INFORMATION

In the subsequent discussion, it is important to
understand a few things:

While the toolstack lock is held, allocating memory for
the domain creation process is done as a sequence of one
or more hypercalls, each asking the hypervisor to allocate
one or more -- "X" -- slabs of physical RAM, where a slab
is 2**N contiguous aligned pages, also known as an
"order N" allocation.  While the hypercall is defined
to work with any value of N, common values are N=0
(individual pages), N=9 ("hugepages" or "superpages"),
and N=18 ("1GiB pages").  So, for example, if the toolstack
requires 201MiB of memory, it will make two hypercalls:
One with X=100 and N=9, and one with X=256 and N=0.
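
For concreteness, here is a tiny standalone sketch -- not actual
toolstack code -- of how such a request decomposes into extents,
assuming 4KiB pages as on x86:

#include <stdio.h>

#define PAGE_SHIFT       12    /* 4KiB pages */
#define SUPERPAGE_ORDER   9    /* one order-9 slab == 2MiB */

int main(void)
{
    unsigned long target_kib = 201 * 1024;   /* the 201MiB example */
    unsigned long pages = target_kib >> (PAGE_SHIFT - 10);
    /* X for the N=9 hypercall: */
    unsigned long slabs = pages >> SUPERPAGE_ORDER;
    /* X for the N=0 hypercall: */
    unsigned long singles = pages & ((1UL << SUPERPAGE_ORDER) - 1);

    printf("%lu KiB -> %lu order-9 slabs + %lu order-0 pages\n",
           target_kib, slabs, singles);
    return 0;
}

For the 201MiB example, this prints 100 order-9 slabs plus 256
order-0 pages.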

While the toolstack may ask for a smaller number X of
order==9 slabs, system fragmentation may unpredictably
cause the hypervisor to fail the request, in which case
the toolstack will fall back to a request for 512*X
individual pages.  If there is sufficient RAM in the system,
this request for order==0 pages is guaranteed to succeed.
Thus for a 1TiB domain, the hypervisor must be prepared
to allocate up to 256Mi individual pages.
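
The fallback itself is just a retry at order 0.  Here is a sketch of
that pattern, with a hypothetical populate() standing in for the real
populate_physmap wrapper (whose name and signature differ in the
actual toolstack):

#include <stdio.h>

/* Hypothetical stand-in for the populate_physmap wrapper: returns 0
 * on success, -1 on failure.  Here it pretends the heap is too
 * fragmented to satisfy any order-9 request. */
static int populate(unsigned int domid, unsigned long count,
                    unsigned int order)
{
    (void)domid; (void)count;
    return (order == 9) ? -1 : 0;
}

static int allocate_chunk(unsigned int domid,
                          unsigned long nr_superpages)
{
    if (populate(domid, nr_superpages, 9) == 0)
        return 0;                      /* got the 2MiB slabs */

    /* Fragmentation: fall back to 512*X order-0 pages, which
     * succeeds as long as enough individual pages remain. */
    return populate(domid, nr_superpages << 9, 0);
}

int main(void)
{
    printf("fallback result: %d\n", allocate_chunk(1, 100));
    return 0;
}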

Note carefully that when the toolstack hypercall asks for
100 slabs, the hypervisor "heaplock" is currently taken
and released 100 times.  Similarly, for 256Mi individual
pages... 256Mi spin_lock-alloc_page-spin_unlock cycles.
This means that domain creation is not "atomic" inside
the hypervisor, which means that races can and will still
occur.
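
To illustrate, here is a sketch of that locking pattern, with a
pthread mutex standing in for Xen's heaplock; the point is that
between any two iterations some other caller is free to take the
lock and consume the remaining pages:

#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for Xen's heap lock and free-page count. */
static pthread_mutex_t heaplock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long free_pages = 1UL << 28;   /* 1TiB of 4KiB pages */

static bool alloc_one_extent(unsigned int order)
{
    bool ok = false;

    pthread_mutex_lock(&heaplock);
    if (free_pages >= (1UL << order)) {
        free_pages -= 1UL << order;
        ok = true;
    }
    pthread_mutex_unlock(&heaplock);
    /* Here, any other caller may take the lock and consume the
     * remaining pages; the domain build as a whole is not atomic. */
    return ok;
}

int main(void)
{
    /* 100 order-9 extents == 100 lock round trips. */
    for (int i = 0; i < 100 && alloc_one_extent(9); i++)
        ;
    return 0;
}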

RULING OUT SOME SIMPLE SOLUTIONS

Is there an elegant simple solution here?

Let's first consider the possibility of removing the toolstack
serialization entirely and/or the possibility that two
independent toolstack threads (or "agents") can simultaneously
request a very large domain creation in parallel.  As described
above, the hypervisor's heaplock is insufficient to serialize RAM
allocation, so the two domain creation processes race.  If there
is sufficient resource for either one to launch, but insufficient
resource for both to launch, the winner of the race is indeterminate,
and one or both launches will fail, possibly after one or both 
domain creation threads have been working for several minutes.
This is a classic "TOCTOU" (time-of-check-time-of-use) race.
If a customer is unhappy waiting several minutes to launch
a domain, they will be even more unhappy waiting for several
minutes to be told that one or both of the launches has failed.
Multi-minute failure is even more unacceptable for an automated
agent trying to, for example, evacuate a machine that the
data center administrator needs to powercycle.

[IIGT: Please hold your objections for a moment... the paragraph
above is discussing the simple solution of removing the serialization;
your suggested solution will be discussed soon.]
 
Next, let's consider the possibility of changing the heaplock
strategy in the hypervisor so that the lock is held not
for one slab but for the entire request of N slabs.  As with
any core hypervisor lock, holding the heaplock for a "long time"
is unacceptable.  To a hypervisor, several minutes is an eternity.
And, in any case, by serializing domain creation in the hypervisor,
we have really only moved the problem from the toolstack into
the hypervisor, not solved the problem.

[IIGT] Are we in agreement that these simple solutions can be
safely ruled out?

CAPACITY ALLOCATION VS RAM ALLOCATION

Looking for a creative solution, one may realize that it is the
page allocation -- especially in large quantities -- that is very
time-consuming.  But, thinking outside of the box, it is not
the actual pages of RAM that we are racing on, but the quantity of pages 
required to launch a domain!  If we instead have a way to
"claim" a quantity of pages cheaply now and then allocate the actual
physical RAM pages later, we have changed the race to require only 
serialization of the claiming process!  In other words, if some entity
knows the number of pages available in the system, and can "claim"
N pages for the benefit of a domain being launched, the successful launch of 
the domain can be ensured.  Well... the domain launch may
still fail for an unrelated reason, but not due to a memory TOCTOU
race.  But, in this case, if the cost (in time) of the claiming
process is very small compared to the cost of the domain launch,
we have solved the memory TOCTOU race with hardly any delay added
to a non-memory-related failure that would have occurred anyway.

This "claim" sounds promising.  But we have made an assumption that
an "entity" has certain knowledge.  In the Xen system, that entity
must be either the toolstack or the hypervisor.  Or, in the Oracle
environment, an "agent"... but an agent and a toolstack are similar
enough for our purposes that we will just use the more broadly-used
term "toolstack".  In using this term, however, it's important to
remember that this toolstack may contain multiple threads acting
concurrently.

Now I quote Ian Jackson: "It is a key design principle of a system
like Xen that the hypervisor should provide only those facilities
which are strictly necessary.  Any functionality which can be
reasonably provided outside the hypervisor should be excluded
from it."

So let's examine the toolstack first.

[IIGT] Still all on the same page (pun intended)?

TOOLSTACK-BASED CAPACITY ALLOCATION

Does the toolstack know how many physical pages of RAM are available?
Yes, it can use a hypercall to find out this information after Xen and
dom0 launch, but before it launches any domain.  Then if it subtracts
the number of pages used when it launches a domain and is aware of
when any domain dies, and adds them back, the toolstack has a pretty
good estimate.  In actuality, the toolstack doesn't _really_ know the
exact number of pages used when a domain is launched, but there
is a poorly-documented "fuzz factor"... the toolstack knows the
number of pages within a few megabytes, which is probably close enough.

This is a fairly good description of how the toolstack works today
and the accounting seems simple enough, so does toolstack-based
capacity allocation solve our original problem?  It would seem so.
Even if there are multiple threads, the accounting -- not the extended
sequence of page allocation for the domain creation -- can be
serialized by a lock in the toolstack.  But note carefully: either
the toolstack and the hypervisor must always stay in sync on the
number of available pages (within an acceptable margin of error),
or any query to the hypervisor _and_ the subsequent toolstack-based
claim must be paired atomically, i.e. the toolstack lock must be
held across both.  Otherwise we have yet another TOCTOU race.
Interesting, but probably not really a problem.
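
A sketch of that toolstack-side accounting, assuming a single
toolstack process and a free_page_estimate seeded by a hypervisor
query (both names are mine, for illustration only):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
/* Seeded from a hypervisor query at start of day. */
static unsigned long free_page_estimate = 1UL << 28;

/* The check and the claim sit under one lock, so no competing
 * toolstack thread can sneak in between them. */
static bool claim_pages(unsigned long nr_pages)
{
    bool ok = false;

    pthread_mutex_lock(&acct_lock);
    if (free_page_estimate >= nr_pages) {
        free_page_estimate -= nr_pages;
        ok = true;
    }
    pthread_mutex_unlock(&acct_lock);
    return ok;
}

/* Called when a domain dies or is shrunk. */
static void release_pages(unsigned long nr_pages)
{
    pthread_mutex_lock(&acct_lock);
    free_page_estimate += nr_pages;
    pthread_mutex_unlock(&acct_lock);
}

int main(void)
{
    if (claim_pages(1UL << 18))        /* 1GiB worth of pages */
        printf("claim ok, launch can proceed\n");
    release_pages(1UL << 18);
    return 0;
}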

Wait, isn't it possible for the toolstack to dynamically change the
number of pages assigned to a domain?  Yes, this is often called
ballooning and the toolstack can do this via a hypercall.  But
that's still OK because each call goes through the toolstack and
it simply needs to add more accounting for when it uses ballooning
to adjust the domain's memory footprint.  So we are still OK.

But wait again... that brings up an interesting point.  Are there
any significant allocations that are done in the hypervisor without
the knowledge and/or permission of the toolstack?  If so, the
toolstack may be missing important information.

So are there any such allocations?  Well... yes. There are a few.
Let's take a moment to enumerate them:

A) In Linux, a privileged user can write to a sysfs file that feeds
the balloon driver, which in turn makes hypercalls from the guest
kernel to the hypervisor; this adjusts the domain memory footprint
and changes the number of free pages _without_ the toolstack's
knowledge.
The toolstack controls constraints (essentially a minimum and maximum)
which the hypervisor enforces.  The toolstack can ensure that the
minimum and maximum are identical to essentially disallow Linux from
using this functionality.  Indeed, this is precisely what Citrix's
Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always has 
complete control, and thus knowledge, of any domain memory
footprint changes.  But DMC is not prescribed by the toolstack,
and some real Oracle Linux customers use and depend on the flexibility
provided by in-guest ballooning.   So guest-privileged-user-driven-
ballooning is a potential issue for toolstack-based capacity allocation.
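
To make the point concrete, the following illustration-only program
is all it takes inside the guest; the sysfs path is the one exposed
by the mainline Linux balloon driver and may differ on other kernels:

#include <stdio.h>

int main(void)
{
    const char *node =
        "/sys/devices/system/xen_memory/xen_memory0/target_kb";
    FILE *f = fopen(node, "w");

    if (!f) {
        perror(node);
        return 1;
    }
    /* Ask the balloon driver to shrink (or grow) the guest to
     * 512MiB; the resulting hypercalls never pass through the
     * toolstack. */
    fprintf(f, "%lu\n", 512UL * 1024);
    fclose(f);
    return 0;
}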

[IIGT: This is why I have brought up DMC several times and have
called this the "Citrix model"... I'm not trying to be snippy
or impugn your morals as maintainers.]

B) Xen's page sharing feature has slowly been completed over a number
of recent Xen releases.  It takes advantage of the fact that many
pages often contain identical data; the hypervisor merges them to save
physical RAM.  When any "shared" page is written, the hypervisor
"splits" the page (aka, copy-on-write) by allocating a new physical
page.  There is a long history of this feature in other virtualization
products and it is known that, under many circumstances,
thousands of splits may occur within a fraction of a second.  The
hypervisor does not notify or ask permission of the toolstack.
So, page-splitting is an issue for toolstack-based capacity
allocation, at least as currently coded in Xen.

[Andres: Please hold your objection here until you read further.]

C) Transcendent Memory ("tmem") has existed in the Xen hypervisor and
toolstack for over three years.  It depends on an in-guest-kernel
adaptive technique to constantly adjust the domain memory footprint as
well as hooks in the in-guest-kernel to move data to and from the
hypervisor.  While the data is in the hypervisor's care, interesting
memory-load balancing between guests is done, including optional
compression and deduplication.  All of this has been in Xen since 2009
and has been awaiting changes in the (guest-side) Linux kernel. Those
changes are now merged into the mainline kernel and are fully
functional in shipping distros.

While a complete description of tmem's guest<->hypervisor interaction
is beyond the scope of this document, it is important to understand
that any tmem-enabled guest kernel may unpredictably request thousands
or even millions of pages directly from the hypervisor via hypercalls,
within a fraction of a second, with absolutely no interaction with the
toolstack.
Further, the guest-side hypercalls that allocate pages
via the hypervisor are done in "atomic" code deep in the Linux mm
subsystem.

Indeed, if one truly understands tmem, it should become clear that
tmem is fundamentally incompatible with toolstack-based capacity
allocation. But let's stop discussing tmem for now and move on.

OK.  So with existing code both in Xen and Linux guests, there are
three challenges to toolstack-based capacity allocation.  We'd
really still like to do capacity allocation in the toolstack.  Can
something be done in the toolstack to "fix" these three cases?

Possibly.  But let's first look at hypervisor-based capacity
allocation: the proposed "XENMEM_claim_pages" hypercall.

HYPERVISOR-BASED CAPACITY ALLOCATION

The posted patch for the claim hypercall is quite simple, but let's
look at it in detail.  The claim hypercall is actually a subop
of an existing hypercall.  After checking parameters for validity,
a new function is called in the core Xen memory management code.
This function takes the hypervisor heaplock, checks for a few
special cases, does some arithmetic to ensure a valid claim, stakes
the claim, releases the hypervisor heaplock, and then returns.  To
review from earlier, the hypervisor heaplock protects _all_ page/slab
allocations, so we can be absolutely certain that there are no other
page allocation races.  This new function is about 35 lines of code,
not counting comments.

The patch includes two other significant changes to the hypervisor:
First, when any adjustment to a domain's memory footprint is made
(either through a toolstack-aware hypercall or one of the three
toolstack-unaware methods described above), the heaplock is
taken, arithmetic is done, and the heaplock is released.  This
is 12 lines of code.  Second, when any memory is allocated within
Xen, a check must be made (with the heaplock already held) to
determine if, given a previous claim, the domain has exceeded
its upper bound, maxmem.  This code is a single conditional test.

With some declarations, but not counting the copious comments,
all told, the new code provided by the patch is well under 100 lines.
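
For those who have not yet read the patch, here is a much-simplified
sketch of the staking step; the lock, counters, and field names are
stand-ins for illustration, not the hypervisor's real ones:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Stand-ins for the hypervisor's heap lock and counters. */
static pthread_mutex_t heaplock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long total_free = 1UL << 28;  /* free heap pages */
static unsigned long total_outstanding;       /* unfulfilled claims */

struct domain_acct {
    unsigned long tot_pages;     /* pages already allocated */
    unsigned long outstanding;   /* claimed but not yet allocated */
};

static int stake_claim(struct domain_acct *d, unsigned long nr_pages)
{
    int rc = -ENOMEM;

    pthread_mutex_lock(&heaplock);
    if (nr_pages <= d->tot_pages || d->outstanding) {
        /* Nothing left to claim, or a claim is already staked. */
        rc = -EINVAL;
    } else if (nr_pages - d->tot_pages <=
               total_free - total_outstanding) {
        /* Enough unclaimed free memory: stake the claim and
         * reserve it globally.  Later allocations for this domain
         * are debited against it; a claim of 0 cancels it. */
        d->outstanding = nr_pages - d->tot_pages;
        total_outstanding += d->outstanding;
        rc = 0;
    }
    pthread_mutex_unlock(&heaplock);
    return rc;
}

int main(void)
{
    struct domain_acct d = { 0, 0 };
    printf("claim 1TiB of pages: %d\n", stake_claim(&d, 1UL << 28));
    return 0;
}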

What about the toolstack side?  First, it's important to note that
the toolstack changes are entirely optional.  If a toolstack
chooses either not to fix the original problem, or to avoid toolstack-
unaware allocation entirely by forgoing the functionality provided
by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
not use the new hypercall.  Second, it's very relevant to note that the Oracle
product uses a combination of a proprietary "manager"
which oversees many machines, and the older open-source xm/xend
toolstack, for which the current Xen toolstack maintainers are no
longer accepting patches.

The preface of the published patch does suggest, however, some
straightforward pseudo-code, as follows:

Current toolstack domain creation memory allocation code fragment:

1. call populate_physmap repeatedly to achieve mem=N memory
2. if any populate_physmap call fails, report -ENOMEM up the stack
3. memory is held until domain dies or the toolstack decreases it

Proposed toolstack domain creation memory allocation code fragment
(new code marked with "+"):

+  call claim for mem=N amount of memory
+  if claim succeeds:
1.   call populate_physmap repeatedly to achieve mem=N memory (failsafe)
+  else
2.   report -ENOMEM up the stack
+  claim is held until mem=N is achieved, or the domain dies, or
+    the claim is forced to 0 by a second hypercall
3. memory is held until domain dies or the toolstack decreases it
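
Rendered as C with hypothetical stubs (claim, populate_all,
cancel_claim -- not the real libxc entry points), the flow is simply:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stubs, not the real libxc entry points. */
static int claim(unsigned int domid, unsigned long nr_pages)
{ (void)domid; (void)nr_pages; return 0; }
static int populate_all(unsigned int domid, unsigned long nr_pages)
{ (void)domid; (void)nr_pages; return 0; }
static void cancel_claim(unsigned int domid)
{ (void)domid; }

static int build_domain_memory(unsigned int domid,
                               unsigned long nr_pages)
{
    if (claim(domid, nr_pages))
        return -ENOMEM;   /* fail in milliseconds, not minutes */

    /* With the claim staked, populate_physmap becomes a failsafe
     * rather than the point of failure. */
    int rc = populate_all(domid, nr_pages);

    cancel_claim(domid);  /* claim lapses once mem=N is in place */
    return rc;
}

int main(void)
{
    printf("build: %d\n", build_domain_memory(1, 1UL << 28));
    return 0;
}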

Reviewing the pseudo-code, one can readily see that the toolstack
changes required to implement the hypercall are quite small.

To complete this discussion, it has been pointed out that
the proposed hypercall doesn't solve the original problem
for certain classes of legacy domains... but also neither
does it make the problem worse.  It has also been pointed
out that the proposed patch is not (yet) NUMA-aware.

Now let's return to the earlier question:  There are three 
challenges to toolstack-based capacity allocation, which are
all handled easily by in-hypervisor capacity allocation. But we'd
really still like to do capacity allocation in the toolstack.
Can something be done in the toolstack to "fix" these three cases?

The answer is, of course, certainly... anything can be done in
software.  So, recalling Ian Jackson's stated requirement:

 "Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

we are now left to evaluate the subjective term "reasonably".

CAN TOOLSTACK-BASED CAPACITY ALLOCATION OVERCOME THE ISSUES?

In earlier discussion on this topic, when page-splitting was raised
as a concern, some of the authors of Xen's page-sharing feature
pointed out that a mechanism could be designed such that "batches"
of pages were pre-allocated by the toolstack and provided to the
hypervisor to be utilized as needed for page-splitting.  Should the
batch run dry, the hypervisor could pause the domain that provoked
the page-split until the toolstack, consulted at its leisure, asks the
hypervisor to refill the batch, at which point the page-split-causing
domain may proceed.

But this batch page-allocation isn't implemented in Xen today.

Andres Lagar-Cavilla says "... this is because of shortcomings in the
[Xen] mm layer and its interaction with wait queues, documented
elsewhere."  In other words, this batching proposal requires
significant changes to the hypervisor, which I think we
all agreed we were trying to avoid.

[Note to Andres: I'm not objecting to the need for this functionality
for page-sharing to work with proprietary kernels and DMC; just
pointing out that it, too, is dependent on further hypervisor changes.]

Such an approach makes sense in the min==max model enforced by
DMC but, again, DMC is not prescribed by the toolstack.

Further, this waitqueue solution for page-splitting handles in-guest
ballooning only awkwardly (probably requiring still more hypervisor
changes, TBD) and would be useless for tmem.  [IIGT: Please argue
this last point only if you feel confident you truly understand how
tmem works.]

So this as-yet-unimplemented solution only really solves a part
of the problem.

Are there any other possibilities proposed?  Ian Jackson has
suggested a somewhat different approach:

Let me quote Ian Jackson again:

"Of course if it is really desired to have each guest make its own
decisions and simply for them to somehow agree to divvy up the
available resources, then even so a new hypervisor mechanism is
not needed.  All that is needed is a way for those guests to
synchronise their accesses and updates to shared records of the
available and in-use memory."

Ian then goes on to say:  "I don't have a detailed counter-proposal
design of course..."

This proposal is certainly possible, but I think most would agree that
it would require some fairly massive changes in OS memory management
design that would run contrary to many years of computing history.
It requires guest OS's to cooperate with each other about basic memory
management decisions.  And to work for tmem, it would require
communication from atomic code in the kernel to user-space, then communication 
from user-space in a guest to user-space-in-domain0
and then (presumably... I don't have a design either) back again.
One must also wonder what the performance impact would be.

CONCLUDING REMARKS

"Any functionality which can be reasonably provided outside the
  hypervisor should be excluded from it."

I think this document has described a real customer problem and
a good solution that could be implemented either in the toolstack
or in the hypervisor.  Memory allocation in existing Xen functionality
has been shown to interfere significantly with the toolstack-based
solution, and the suggested partial solutions to those issues either
require even more hypervisor work, or are completely undesigned and,
at least, call into question the definition of "reasonably".

The hypervisor-based solution has been shown to be extremely
simple, fits very logically with existing Xen memory management
mechanisms/code, and has been reviewed through several iterations
by Xen hypervisor experts.

While I understand completely the Xen maintainers' desire to
fend off unnecessary additions to the hypervisor, I believe
XENMEM_claim_pages is a reasonable and natural hypervisor feature
and I hope you will now Ack the patch.

Acknowledgements: Thanks very much to Konrad for his thorough
read-through and for suggestions on how to soften my combative
style, which may have alienated the maintainers more than the
proposal itself.
