
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

.. snip..
> >Heh. I hadn't realized that the emails need to conform to
> >the way legal briefs are written in the US :-) Meaning that
> >each topic must be addressed.
> Every time we try to suggest alternatives, Dan goes on some rant
> about how we're on different planets, how we're all old-guard stuck
> in static-land thinking, and how we're focused on single-server use

.. snip..

> than anyone has the time to read and understand, much less respond
> to.  That's why I suggested to Dan that he ask someone else to take
> over the conversation.)

First off, let's leave the characterization of people out of this.
I have great respect for Dan and I am hurt that you would treat him
so cavalierly. But that is your choice; let's keep this thread to a
purely technical discussion.

> >Anyhow, the multi-host env or a single-host env has the same
> >issue - you try to launch multiple guests and some of
> >them might not launch.
> >
> >The changes that Dan is proposing (the claim hypercall)
> >would provide the functionality to fix this problem.
> >
> >>A fairly bizarre limitation of a balloon-based approach to memory 
> >>management. Why on earth should the guest be allowed to change the size of 
> >>its balloon, and therefore its footprint on the host. This may be justified 
> >>with arguments pertaining to the stability of the in-guest workload. What 
> >>they really reveal are limitations of ballooning. But the inadequacy of the 
> >>balloon in itself doesn't automatically translate into justifying the need
> >>for a new hypercall.
> >Why is this a limitation? Why shouldn't the guest be allowed to change
> >its memory usage? It can go up and down as it sees fit.
> >And if it goes down and it gets better performance - well, why shouldn't
> >it do it?
> >
> >I concur it is odd - but it has been like that for decades.
> Well, it shouldn't be allowed to do it because it causes this
> problem you're having with creating guests in parallel.  Ultimately,
> that is the core of your problem.  So if you want us to solve the
> problem by implementing something in the hypervisor, then you need
> to justify why "Just don't have guests balloon down" is an
> unacceptable option.  Saying "why shouldn't it", and "it's been that
> way for decades*" isn't a good enough reason.

We find the balloon usage very flexible and see no problems with it.

.. snip..

> >>>What about the toolstack side?  First, it's important to note that
> >>>the toolstack changes are entirely optional.  If any toolstack
> >>>wishes either to not fix the original problem, or avoid toolstack-
> >>>unaware allocation completely by ignoring the functionality provided
> >>>by in-guest ballooning, page-sharing, and/or tmem, that toolstack need
> >>>not use the new hypercall.
> >>You are ruling out any other possibility here. In particular, but not 
> >>limited to, use of max_pages.
> >The one max_page check that comes to my mind is the one that Xapi
> >uses. That is it has a daemon that sets the max_pages of all the
> >guests at some value so that it can squeeze in as many guests as
> >possible. It also balloons pages out of a guest to make space if
> >it needs to launch one. The heuristic of how many pages or the ratio
> >of max/min looks to be proportional (so to make space for 1GB
> >for a guest, and say we have 10 guests, we will subtract
> >101MB from each guest - the extra 1MB is for extra overhead).
> >This depends on one hypercall that 'xl' or 'xm' toolstack do not
> >use - which sets the max_pages.
> >
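The proportional squeeze heuristic described above could be sketched roughly like this (illustrative Python; the per-guest overhead cushion and all numbers are my assumptions, not xapi's actual code):

```python
# Rough sketch of the proportional max_pages squeeze described above.
# Numbers and the 1MB overhead cushion are illustrative assumptions,
# not xapi's actual heuristic.

def squeeze(needed_mb, guests, overhead_mb=1):
    """Return {guest: new target in MB}, lowering each guest's target
    proportionally so that needed_mb (plus a small per-guest overhead
    cushion) becomes free."""
    per_guest = needed_mb // len(guests) + overhead_mb
    return {name: cur_mb - per_guest for name, cur_mb in guests.items()}

# To make space for a 1GB guest across 10 guests, each gives up ~101MB:
guests = {"vm%d" % i: 2048 for i in range(10)}
targets = squeeze(1000, guests)
```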
> >That code makes certain assumptions - that the guest will not go/up down
> >in the ballooning once the toolstack has decreed how much
> >memory the guest should use. It also assumes that the operations
> >are semi-atomic - and to make it so as much as it can - it executes
> >these operations in serial.
> No, the xapi code makes no such assumptions.  After it tells a guest
> to balloon down, it watches to see  what actually happens, and has
> heuristics to deal with "non-cooperative guests".  It does assume
> that if it sets max_pages lower than or equal to the current amount
> of used memory, that the hypervisor will not allow the guest to
> balloon up -- but that's a pretty safe assumption.  A guest can
> balloon down if it wants to, but as xapi does not consider that
> memory free, it will never use it.

Thanks for the clarification. I am not that fluent in the
OCaml code.

> BTW, I don't know if you realize this: Originally Xen would return
> an error if you tried to set max_pages below tot_pages.  But as a
> result of the DMC work, it was seen as useful to allow the toolstack
> to tell the hypervisor once, "Once the VM has ballooned down to X,
> don't let it balloon up above X anymore."
> >This goes back to the problem statement - if we try to parallelize
> >this we run into the problem that the amount of memory we thought
> >was free is not true anymore. The start of this email has a good
> >description of some of the issues.
> >
> >In essence, max_pages does work - _if_ one does these operations
> >in serial. We are trying to make this work in parallel and without
> >any failures - one way to do that, quite simplistic, is the
> >claim hypercall. It sets up a 'stake' of the amount of
> >memory that the hypervisor should reserve. This way other
> >guest creations/ballooning do not infringe on the 'claimed' amount.
> I'm not sure what you mean by "do these operations in serial" in
> this context.  Each of your "reservation hypercalls" has to happen
> in serial.  If we had a user-space daemon that was in charge of
> freeing up or reserving memory, each request to that daemon would
> happen in serial as well.  But once the allocation / reservation
> happened, the domain builds could happen in parallel.
> >I believe with this hypercall the Xapi can be made to do its operations
> >in parallel as well.
> xapi can already boot guests in parallel when there's enough memory
> to do so -- what operations did you have in mind?

That - the booting. My understanding (wrongly) was that it did it
in serial.
> I haven't followed all of the discussion (for reasons mentioned
> above), but I think the alternative to Dan's solution is something
> like below.  Maybe you can tell me why it's not very suitable:
> Have one place in the user-space -- either in the toolstack, or a
> separate daemon -- that is responsible for knowing all the places
> where memory might be in use.  Memory can be in use either by Xen,
> or by one of several VMs, or in a tmem pool.
> In your case, when not creating VMs, it can remove all limitations
> -- allow the guests or tmem to grow or shrink as much as they want.

We don't have those limitations right now.
> When a request comes in for a certain amount of memory, it will go
> and set each VM's max_pages, and the max tmem pool size.  It can
> then check whether there is enough free memory to complete the
> allocation or not (since there's a race between checking how much
> memory a guest is using and setting max_pages).  If that succeeds,
> it can return "success".  If, while that VM is being built, another
> request comes in, it can again go around and set the max sizes
> lower.  It has to know how much of the memory is "reserved" for the
> first guest being built, but if there's enough left after that, it
> can return "success" and allow the second VM to start being built.
> After the VMs are built, the toolstack can remove the limits again
> if it wants, again allowing the free flow of memory.

This sounds to me like what Xapi does?
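If I understand the scheme, a minimal sketch might look like the following (pure illustration: the class and field names are mine, and it does not model the hypervisor actually handing out memory or the tmem pool limit):

```python
# Sketch of the user-space reservation scheme described above.
# "Reservations" exist only inside this daemon; the hypervisor would
# only see the max_pages clamps.  All names are illustrative.

class MemoryManager:
    def __init__(self, host_free_mb):
        self.host_free_mb = host_free_mb   # what the hypervisor reports
        self.reserved_mb = 0               # memory promised to builds in flight

    def clamp_guests(self, guests):
        # Set each guest's max_pages to its current usage so none
        # can balloon up while we measure free memory.
        for g in guests:
            g["max_mb"] = g["cur_mb"]

    def request(self, amount_mb, guests):
        self.clamp_guests(guests)
        # Re-check free memory *after* clamping to close the race
        # between measuring usage and setting max_pages.
        if self.host_free_mb - self.reserved_mb >= amount_mb:
            self.reserved_mb += amount_mb
            return True      # the domain build may now proceed in parallel
        return False

    def build_done(self, amount_mb, guests):
        self.reserved_mb -= amount_mb
        for g in guests:     # lift the limits again; memory flows freely
            g["max_mb"] = None
```

A second request arriving while the first build is in flight is checked against `host_free_mb - reserved_mb`, which is the "has to know how much is reserved for the first guest" part of the scheme.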
> Do you see any problems with this scheme?  All it requires is for
> the toolstack to be able to temporarily set limits on both guests
> ballooning up and on tmem allocating more than a certain amount of
> memory.  We already have mechanisms for the first, so if we had a
> "max_pages" for tmem, then you'd have all the tools you need to
> implement it.

Off the top of my head, the things that come to mind are:
 - The 'lock' over the memory usage (so the tmem freeze + max_pages set)
   looks to solve launching guests in parallel.
   It will allow us to launch multiple guests - but it will also
   suppress the tmem asynchronous calls and require ballooning the
   guests up/down. The claim hypercall does none of that and
   gives a definite 'yes' or 'no'.

 - Complex code that has to keep track of this in user-space.
   It also has to know of the extra 'reserved' space that is associated
   with a guest. I am not entirely sure how that would couple with
   PCI passthrough. The claim hypercall is fairly simple - albeit
   extending it to handle superpages and 32-bit guests could make it
   more complex.

 - I am not sure whether the toolstack can manage all the memory
   allocation. It sounds like it could, but I am just wondering if there
   are some extra corners that we hadn't thought of.

 - Latency. With the locks placed on the pools of memory, the
   existing workload can be negatively affected. Say this means we
   need to balloon down a couple hundred guests before launching the
   new guest. The process becomes: 'lower all of them by X', then
   check the free amount. Oh no - not enough - let's do it again.
   That would delay the creation process.

   The claim hypercall avoids all of that by simply declaring:
   "This is how much you will get," without having to balloon the rest
   of the guests.

   Here is how I see what your toolstack would do:

        1). Figure out how much memory we need for X guests.
        2). round-robin existing guests to decrease their memory
            consumption (if they can be ballooned down). Or this
            can be executed in parallel for the guests.
        3). check if the amount of free memory is at least X
            [this check has to be done in serial]
        4). launch multiple guests at the same time.

   The claim hypercall would avoid the '3' part because it is inherently
   part of Xen's MM bureaucracy. It would allow:

        1). claim hypercall for X guests.
        2). if any of the claims return 0 (so success), then launch the guest.
        3). if the errno was -ENOMEM then:
        3a). round-robin existing guests to decrease their memory
             consumption if allowed. Goto 1).

   So the 'error-case' only has to run in the slow-serial case.

 - This still has the race issue - how much memory you see free vs. the
   moment you launch the guest. Granted, you can avoid it by having a
   'fudge' factor (so when a guest says it wants 1GB you know it actually
   needs an extra 100MB on top of that). The claim hypercall
   accounts for all of that for you, so you don't have to race.
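To make the comparison concrete, here is a toy sketch of the claim-based flow above (`claim()` stands in for the proposed XENMEM_claim_pages hypercall; the balloon-down step and all data structures are illustrative assumptions):

```python
# Toy sketch contrasting the claim-based launch flow described above.
# claim() stands in for the proposed XENMEM_claim_pages hypercall;
# the balloon-down step is a stub.  Illustrative only.
import errno

def claim(host, amount_mb):
    """Atomically stake out amount_mb, as the claim hypercall would."""
    if host["free_mb"] - host["claimed_mb"] >= amount_mb:
        host["claimed_mb"] += amount_mb
        return 0
    return -errno.ENOMEM

def launch_with_claim(host, amount_mb, balloon_down):
    # 1) claim; 2) on success, build the guest; 3) on -ENOMEM,
    # balloon other guests down and retry - the slow serial path
    # only runs in the error case.
    for _ in range(2):                 # one retry after ballooning
        if claim(host, amount_mb) == 0:
            return "launched"
        balloon_down(host, amount_mb)
    return "failed"
```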
