Xen project Mailing List

Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

To: George Dunlap <george.dunlap@xxxxxxxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Tue, 22 Jan 2013 16:57:48 -0500

Cc: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, "Keir \(Xen.org\)" <keir@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>, "Tim \(Xen.org\)" <tim@xxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>

Delivery-date: Tue, 22 Jan 2013 21:58:32 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hey George,

Sorry for taking so long to answer.

On Mon, Jan 14, 2013 at 06:28:48PM +0000, George Dunlap wrote:
> On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
> > Thanks for the clarification. I am not that fluent in the OCaml code.
>
> I'm not fluent in OCaml either, I'm mainly going from memory based
> on the discussions I had with the author when it was being designed,
> as well as discussions with the xapi team when dealing with bugs at
> later points.

I was looking at xen-api/ocaml/xenops/squeeze.ml, just reading the
comments and feebly trying to understand how the OCaml code works. As
best I could understand, it takes various measurements, makes the
appropriate hypercalls, and waits for everything to stabilize before
allowing the guest to start.

N.B.: With tmem, the 'stabilization' might never happen.

> > > When a request comes in for a certain amount of memory, it will go
> > > and set each VM's max_pages, and the max tmem pool size. It can
> > > then check whether there is enough free memory to complete the
> > > allocation or not (since there's a race between checking how much
> > > memory a guest is using and setting max_pages). If that succeeds,
> > > it can return "success". If, while that VM is being built, another
> > > request comes in, it can again go around and set the max sizes
> > > lower. It has to know how much of the memory is "reserved" for the
> > > first guest being built, but if there's enough left after that, it
> > > can return "success" and allow the second VM to start being built.
> > >
> > > After the VMs are built, the toolstack can remove the limits again
> > > if it wants, again allowing the free flow of memory.
> > This sounds to me like what Xapi does?
>
> No, AFAIK xapi always sets the max_pages to what it wants the guest
> to be using at any given time. I talked about removing the limits
> (and about operating without limits in the normal case) because it
> seems like something that Oracle wants (having to do with tmem).

We still have the limits in the hypervisor (and we do want them as much
as possible). The guest can't go above max_pages, which is absolutely
fine. We don't want guests going above max_pages. Conversely, we also
do not want to reduce max_pages. It is risky to do so.

> > > Do you see any problems with this scheme? All it requires is for
> > > the toolstack to be able to temporarily set limits on both guests
> > > ballooning up and on tmem allocating more than a certain amount of
> > > memory. We already have mechanisms for the first, so if we had a
> > > "max_pages" for tmem, then you'd have all the tools you need to
> > > implement it.
> > Off the top of my head, the things that come to mind are:
> >  - The 'lock' over the memory usage (so the tmem freeze + maxpages
> >    set) looks to solve the launching in parallel of guests.
> >    It will allow us to launch multiple guests - but it will also
> >    suppress the tmem asynchronous calls and require ballooning the
> >    guests up/down. The claim hypercall does not do any of those and
> >    gives a definite 'yes' or 'no'.
>
> So when you say, "tmem freeze", are you specifically talking about
> not allowing tmem to allocate more memory (what I called a
> "max_pages" for tmem)? Or is there more to it?

I think I am going to confuse you here a bit.

>
> Secondly, just to clarify: when a guest is using memory from the
> tmem pool, is that added to tot_pages?

Yes and no. It depends on what type of tmem page it is (there are only
two). Pages that must persist (such as swap pages) are accounted in
d->tot_pages. Pages of the cache type are not accounted in
d->tot_pages. These are called ephemeral or temporary pages. Note that
they utilize the balloon system - so their content could be thrown
out, but the pages themselves might need to be put back in the guest
(increasing d->tot_pages).
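
The two accounting rules above can be illustrated with a toy model. This is only a sketch, not Xen code: the `Domain` class and its method names are made up for illustration, and the assumption that a persistent put is refused once it would push tot_pages past max_pages is inferred from the max_pages limit described earlier in this mail.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Toy stand-in for a domain's page accounting (hypothetical names)."""
    max_pages: int
    tot_pages: int = 0        # pages accounted to the domain
    ephemeral_pages: int = 0  # cache-type tmem pages, NOT part of tot_pages

    def put_persistent(self, n: int) -> bool:
        # Persistent tmem pages (e.g. swap) are accounted in tot_pages.
        # Assumption: they are refused if they would exceed max_pages.
        if self.tot_pages + n > self.max_pages:
            return False
        self.tot_pages += n
        return True

    def put_ephemeral(self, n: int) -> None:
        # Ephemeral (cache) pages are tracked separately from tot_pages.
        self.ephemeral_pages += n

    def drop_ephemeral(self, n: int) -> int:
        # The content of ephemeral pages may be thrown out at any time;
        # returns how many pages were actually dropped.
        dropped = min(n, self.ephemeral_pages)
        self.ephemeral_pages -= dropped
        return dropped
```

In this model only persistent puts move tot_pages, so a toolstack that watches tot_pages alone would not see the ephemeral pages at all.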

The tmem_freeze is basically putting a plug on the current activity of
a guest trying to put more pages into the ephemeral pool and into the
pool of pages that is accounted for using d->tot_pages. It has a
similar bad effect to setting d->max_pages == d->tot_pages. The
hypercall would replace this bandaid.

As said, there are two types. The temporary pages are subtracted from
d->tot_pages and end up in the heap memory (and if there is memory
pressure in the hypervisor, it can happily usurp them). In essence the
"pages" move from a domain's accounting to this "pool". If the guest
needs them back, the pool size decreases and d->tot_pages increases.

N.B.: The pool can be usurped by the Xen hypervisor - so the pages are
not locked in and can be re-used for the launch of a new guest.

The persistent ones do not end up in that pool. Rather, they are
accounted for in d->tot_pages.

The amount of memory that is "flowing" for a guest remains constant -
it is just that it can be in a pool or in d->tot_pages. (I am ignoring
the de-duplication and compression that tmem can do.)

The idea behind the claim call is that we do not want to put pressure
on this "flow", as the guest might suddenly need that memory back - as
much of it as it can get. Pressure is applied by altering
d->max_pages.

>
> I'm not sure what "gives a definite yes or no" is supposed to mean
> -- the scheme I described also gives a definite yes or no.
>
> In any case, your point about ballooning is taken: if we set
> max_pages for a VM and just leave it there while VMs are being
> built, then VMs cannot balloon up, even if there is "free" memory
> (i.e., memory that will not be used for the currently-building VM),
> and cannot be moved *between* VMs either (i.e., by ballooning down
> one and ballooning the other up). Both of these can be done by
> extending the toolstack with a memory model (see below), but that
> adds an extra level of complication.
>
> >  - Complex code that has to keep track of this in the user-space.
> >    It also has to know of the extra 'reserved' space that is
> >    associated with a guest. I am not entirely sure how that would
> >    couple with PCI passthrough. The claim hypercall is fairly
> >    simple - albeit having it extended to do Super pages and 32-bit
> >    guests could make this longer.
>
> What do you mean by the extra 'reserved' space? And what potential
> issues are there with PCI passthrough?

I was thinking about space for VIRQ, VCPUs, IOMMU entries to cover a
PCI device's permissions, and grant-tables. I think the IOMMU entries
consume the most bulk - but maybe all of this is under 1MB.

>
> To be accepted, the reservation hypercall will certainly have to be
> extended to do superpages and 32-bit guests, so that's the case we
> should be considering.

OK. That sounds to me like you are OK with the idea - you would like
the claim hypercall to take into account the lesser-used cases. The
reason Dan stopped looking at expanding it is b/c it seemed that folks
would like to understand the usage scenarios in depth - and that has
taken a bit of time to explain.

I believe the corner cases in the claim hypercall are mostly tied in
with PV (specifically the super-pages and 32-bit guests with more than
a certain amount of memory).

>
> >  - I am not sure whether the toolstack can manage all the memory
> >    allocation. It sounds like it could but I am just wondering if
> >    there are some extra corners that we hadn't thought of.
>
> Wouldn't the same argument apply to the reservation hypercall?
> Suppose that there was enough domain memory but not enough Xen heap
> memory, or enough of some other resource -- the hypercall might
> succeed, but then the domain build might still fail at some later
> point when the other resource allocation failed.

This is referring to the 1MB that I mentioned above.

Anyhow, if the hypercall fails and the domain build fails, then we are
back at the toolstack making a choice whether it wants to allocate the
guest on a different node.
Or for that matter balloon the existing guests.

>
> >  - Latency. With the locks being placed on the pools of memory the
> >    existing workload can be negatively affected. Say that this
> >    means we need to balloon down a couple hundred guests, then
> >    launch the new guest. This process of 'lower all of them by X',
> >    let's check the 'free amount'. Oh nope - not enough - let's do
> >    this again. That would delay the creation process.
> >
> >    The claim hypercall will avoid all of that by just declaring:
> >    "This is how much you will get." without having to balloon the
> >    rest of the guests.
> >
> >    Here is how I see what your toolstack would do:
> >
> >      [serial]
> >        1). Figure out how much memory we need for X guests.
> >        2). round-robin existing guests to decrease their memory
> >            consumption (if they can be ballooned down). Or this
> >            can be executed in parallel for the guests.
> >        3). check if the amount of free memory is at least X
> >            [this check has to be done in serial]
> >      [parallel]
> >        4). launch multiple guests at the same time.
> >
> >    The claim hypercall would avoid the '3' part b/c it is
> >    inherently part of the Xen's MM bureaucracy. It would allow:
> >
> >      [parallel]
> >        1). claim hypercall for X guest.
> >        2). if any of the claims return 0 (so success), then launch
> >            guest
> >        3). if the errno was -ENOMEM then:
> >      [serial]
> >        3a). round-robin existing guests to decrease their memory
> >             consumption if allowed. Goto 1).

And here I forgot about the other way of fixing this - that is, launch
the guest on another node altogether, as at least in our product we
don't want to change the initial d->max_pages. This is due in part to
the issues that were pointed out - the guest might suddenly need that
memory, or otherwise it will OOM.

> >
> >    So the 'error-case' only has to run in the slow-serial case.
> Hmm, I don't think what you wrote about mine is quite right. Here's
> what I had in mind for mine (let me call it "limit-and-check"):
>
> [serial]
> 1).
>     Set limits on all guests, and tmem, and see how much memory is
>     left.
> 2) Read free memory
> [parallel]
> 2a) Claim memory for each guest from freshly-calculated pool of free
>     memory.
> 3) For each claim that can be satisfied, launch a guest
> 4) If there are guests that can't be satisfied with the current free
>    memory, then:
> [serial]
> 4a) round-robin existing guests to decrease their memory consumption
>     if allowed. Goto 2.
> 5) Remove limits on guests.
>
> Note that 1 would only be done for the first such "request", and 5
> would only be done after all such requests have succeeded or failed.
> Also note that steps 1 and 5 are only necessary if you want to go
> without such limits -- xapi doesn't do them, because it always keeps
> max_pages set to what it wants the guest to be using.
>
> Also, note that the "claiming" (2a for mine above and 1 for yours)
> has to be serialized with other "claims" in both cases (in the
> reservation hypercall case, with a lock inside the hypervisor), but
> that the building can begin in parallel with the "claiming" in both
> cases.

Sure. The claim call has a very short duration, as it has to take a
lock in the hypervisor. It would be a bunch of super-fast calls. Heck,
you could even use the multicall for this to batch it up.

The problem we are trying to fix is that launching a guest can take
minutes. During that time other guests are artificially blocked from
growing and might OOM.

>
> But I think I do see what you're getting at. The "free memory"
> measurement has to be taken when the system is in a "quiescent"
> state -- or at least a "grow only" state -- otherwise it's
> meaningless. So #4a should really be:

Exactly! With tmem running, the quiescent state might never happen.

>
> 4a) Round-robin existing guests to decrease their memory consumption
>     if allowed.

I believe this is what Xapi does. The question is: how does the
toolstack decide that properly and on the spot 100% of the time?

I believe that the source of that knowledge lies with the guest kernel
- and it can determine when it needs more or less. We have set the
boundaries (d->max_pages), which haven't changed since bootup, and we
let the guest decide where it wants to be within that spectrum.

> 4b) Wait for currently-building guests to finish building (if any),
>     then go to #2.
>
> So suppose the following cases, in which several requests for guest
> creation come in over a short period of time (not necessarily all at
> once):
> A. There is enough memory for all requested VMs to be built without
>    ballooning / something else
> B. There is enough for some, but not all of the VMs to be built
>    without ballooning / something else
>
> In case A, then I think "limit-and-check" and "reservation
> hypercall" should perform the same. For each new request that comes
> in, the toolstack can say, "Well, when I checked I had 64GiB free;
> then I started to build a 16GiB VM. So I should have 48GiB left,
> enough to build this 32GiB VM." "Well, when I checked I had 64GiB
> free; then I started to build a 16GiB VM and a 32GiB VM, so I should
> have 16GiB left, enough to be able to build this 16GiB VM."

For case A, I assume all the guests are launched with mem=maxmem and
there is no PoD, no PCI passthrough and no tmem. Then yes.

For case B, "limit-and-check" requires "limiting" one (or more) of the
guests. Which one is chosen, and by what criteria, means more
heuristics (or just take the shotgun approach and limit all of the
guests by some number). In other words: d->max_pages -= some X value.

The other way is limiting the total growth of all guests (so
d->tot_pages can't reach d->max_pages). We don't set d->max_pages and
let the guests balloon up.
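
For illustration, the running tally from the case A example and the lock-protected check-and-reserve that a claim would perform can be modelled in a few lines. This is a sketch under stated assumptions, not the hypercall itself: the `Host` class, its fields, and the page-size constant are hypothetical; only the 0 / -ENOMEM return convention and the "serialized under a lock" property come from the discussion above.

```python
import errno
import threading

PAGES_PER_GIB = 262144  # assumes 4 KiB pages


class Host:
    """Toy model of host free memory with claim-style reservations."""

    def __init__(self, free_pages: int) -> None:
        self._lock = threading.Lock()
        self.free_pages = free_pages   # pages currently free on the host
        self.claimed_pages = 0         # pages promised to guests being built

    def claim(self, pages: int) -> int:
        # Check and reserve in one step, under a lock, so that two
        # concurrent launches cannot both be promised the same memory.
        with self._lock:
            if pages > self.free_pages - self.claimed_pages:
                return -errno.ENOMEM
            self.claimed_pages += pages
            return 0
```

With 64 GiB free, claims of 16 GiB, 32 GiB and 16 GiB each return 0, and any further claim returns -ENOMEM, without any other guest having been limited in the meantime.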

(Note that with tmem in here you can "move" the temporary pages back
into the guest so that d->tot_pages can increase by some Y, and the
total free amount of heap space increases by Y as well - b/c the Y
value has moved.)

Now back to your question: accounting for this in user-space is
possible, but there are latency issues, and the toolstack would
struggle to catch its breath, as there might be millions of these
updates on a heavily used machine. There might not be any "quiescent"
state ever.

>
> The main difference comes in case B. The "reservation hypercall"
> method will not have to wait until all existing guests have finished
> building to be able to start subsequent guests; but
> "limit-and-check" would have to wait until the currently-building
> guests are finished before doing another check.

Correct. And the check is imprecise b/c the moment it gets the value,
the system might have changed dramatically. The right time to get the
value is when the host is in a "quiescent" state, but who knows when
that is going to happen. Perhaps never, at which point you might be
spinning for a long time trying to get that value.

>
> This limitation doesn't apply to xapi, because it doesn't use the
> hypervisor's free memory as a measure of the memory it has available
> to it. Instead, it keeps an internal model of the free memory the
> hypervisor has available. This is based on MAX(current_target,
> tot_pages) of each guest (where "current_target" for a domain in the
> process of being built is the amount of memory it will have
> eventually). We might call this the "model" approach.
>

OK. I think it actually checks how much memory the guest has consumed.
This is what one of the comments says:

  (* Some VMs are considered by us (but not by xen) to have an
     "initial-reservation". For VMs which have never run (eg which are
     still being built or restored) we take the difference between
     memory_actual_kib and the reservation and subtract this manually
     from the host's free memory.
     Note that we don't get an atomic snapshot of system state so there
     is a natural race between the hypercalls. Hopefully the memory is
     being consumed fairly slowly and so the error is small. *)

So that would imply that a check against "current" memory consumption
is done. But you know comments - sometimes they do not match what the
code is doing. But if they do match, then it looks like this system
would hit issues with self-ballooning and tmem. I believe that the
claim hypercall would fix that easily. It probably would also make the
OCaml code much, much simpler.

> We could extend "limit-and-check" to "limit-check-and-model" (i.e.,
> estimate how much memory is really free after ballooning based on
> how much the guests' tot_pages), or "limit-model" (basically, fully
> switch to a xapi-style "model" approach while you're doing domain
> creation). That would be significantly more complicated. On the
> other hand, a lot of the work has already been done by the XenServer
> team, and (I believe) the code in question is all GPL'ed, so Oracle
> could just take the algorithms and adapt them with just a bit of
> tweaking (and a bit of code translation). It seems to me that the
> "model" approach brings a lot of other benefits as well.

It is hard for me to be convinced by that, since the code is in OCaml
and I am having a hard time understanding it. If it was in C, it would
have been much easier to get it and make that evaluation.

The other part of this that I am not sure I am explaining well is that
the kernel with self-balloon and tmem is very self-adaptive. It seems
to me that having the toolstack be minutely aware of the guests'
memory changes, so that it can know exactly how much free memory there
is, is duplicating effort.

>
> But at any rate -- without debating the value or cost of the "model"
> approach, would you agree with my analysis and conclusions? Namely:
>
> a. "limit-and-check" and "reservation hypercall" are similar wrt
>    guest creation when there is enough memory currently free to
>    build all requested guests

Not 100%. When there is enough memory free "for the entire period of
time that it takes to build all the requested guests", then yes.

> b. "limit-and-check" may be slower if some guests can succeed in
>    being built but others must wait for memory to be freed up, since
>    the "check" has to wait for current guests to finish building

No. The check also races with the amount of memory that the hypervisor
reports as free - and that might be altered by the existing guests (so
not the guests that are being built).

> c. (From further back) One downside of a pure "limit-and-check"
>    approach is that while VMs are being built, VMs cannot increase
>    in size, even if there is "free" memory (not being used to build
>    the currently-building domain(s)) or if another VM can be
>    ballooned down.

Ah, yes. We really want to avoid that.

> d. "model"-based approaches can mitigate b and c, at the cost of a
>    more complicated algorithm

Correct. And also more work done in the userspace to track this.

>
> >  - This still has the race issue - how much memory you see vs the
> >    moment you launch it. Granted you can avoid it by having a
> >    "fudge" factor (so when a guest says it wants 1G you know it
> >    actually needs an extra 100MB on top of the 1GB or so). The
> >    claim hypercall would count all of that for you so you don't
> >    have to race.
> I'm sorry, what race / fudge factor are you talking about?

The scenario when the host is not in a "quiescent" state.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
