
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

> From: Tim Deegan [mailto:tim@xxxxxxx]
> Subject: Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions
>
> Hi,

Hi Tim --

It's probably worth correcting a few of your points below,
even if only for the xen-devel archives and posterity...

> If I were given a self-ballooning system and asked to support it, I'd be
> looking at other things first, and probably solving the delayed failure
> of VM creation as a side-effect.

Agreed.  Those other things were looked at in 2009, when tmem
was added to Xen and prototyped for Linux.  And the delayed
failure was papered over with a hack in 2010 and is being
revisited in 2012/2013 with the intent of solving it properly.

> For example:
>  - the lack of policy.  If we assume all VMs have the same admin,
>    so we can ignore malicious attackers, a buggy guest or guests
>    can still starve out well-behaved ones.  And because it implicitly
>    relies on all OSes having an equivalent measure of how much they
>    'need' memory, on a host with a mix of guest OSes, the aggressive
>    ones will starve the others.

With tmem, a malicious attacker can never get more memory than
the original maxmem assigned by the host administrator when the
guest is launched.  This is also true of any non-tmem guests
running (e.g. proprietary Windows).

And the architecture of tmem takes into account the difference
between memory a guest "needs" and memory it "wants".  Though this
is a basic OS concept that exists in some form in every OS,
AFAIK it has never been exposed outside the OS (e.g. to
a hypervisor) because, on a physical system, RAM is RAM and
the only limit is the total amount of physical RAM installed.
Tmem changes in the guest kernel expose the needs/wants information
and tmem in the hypervisor defines very simple carrots and
sticks to keep guests in line by offering, under well-defined
constraints, to keep and manage certain pages of data for the guest.
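To make the needs/wants distinction concrete, here is a minimal toy
model (all names are hypothetical, not the actual tmem data
structures): each guest reports the memory it must keep (its working
set) separately from memory it would merely like to keep (e.g. clean
page cache), and only the latter is reclaimable by the hypervisor.

```python
# Toy model of the "needs" vs "wants" split -- a sketch only,
# not the tmem implementation.

class Guest:
    def __init__(self, name, needs_mb, wants_mb):
        self.name = name
        self.needs_mb = needs_mb   # must keep: working set
        self.wants_mb = wants_mb   # nice to have: reclaimable cache

    def reclaimable(self):
        # Only the "wants" portion may be taken back under pressure.
        return self.wants_mb

guests = [Guest("dom1", 512, 256), Guest("dom2", 1024, 768)]
total_reclaimable = sum(g.reclaimable() for g in guests)
print(total_reclaimable)  # 1024
```

The point of the split is that the hypervisor can treat the "wants"
pool as a host-wide reserve without ever touching any guest's
working set.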

It is true of any resource-sharing mechanism (including CPU
and I/O scheduling under Xen) that the "must" demand for the
resource can exceed the total available resource.  But, just as
with CPU scheduling, demand can be controlled by a few simple
policy variables that default to reasonable values and are
enforced, as necessary, in the hypervisor.  And just as with CPU
and I/O schedulers, different workloads may expose weaknesses
over time, but that doesn't mean we throw away our CPU and
I/O schedulers and statically partition those resources instead.
Nor should we do so with RAM.

All this has been implemented in Xen for years and the Linux-side
is now shipping.  I would very much welcome input and improvements.
But it is very frustrating when people say, on the one hand,
that "it can't be done" or "it won't work" or "it's too hard",
while on the other hand those same people are saying "I don't
have time to understand tmem".

> For example:
>  - the lack of fairness: when a storm of activity hits an idle system,
>    whichever VMs get busy first will get all the memory.

True, but only up to the policy limits built into tmem (i.e.
not "all").  The same is true of CPU scheduling, up to the
policy limits built into the CPU scheduler.

(BTW, tmem optionally supports caps and weights too.)
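A caps-and-weights policy of the kind mentioned above can be
sketched as follows (a hypothetical policy sketch, not the tmem
code): a shared reclaimable pool is split in proportion to each
guest's weight, with a hard per-guest cap applied on top.

```python
def share_pool(pool_mb, guests):
    """Split a pool by weight, honouring per-guest caps.
    Sketch only; names and units are illustrative."""
    total_weight = sum(weight for _, weight, _ in guests)
    alloc = {}
    for name, weight, cap_mb in guests:
        fair_share = pool_mb * weight // total_weight
        alloc[name] = min(fair_share, cap_mb)  # cap trumps fair share
    return alloc

# guests: (name, weight, cap_mb)
alloc = share_pool(900, [("dom1", 2, 700), ("dom2", 1, 200)])
print(alloc)  # {'dom1': 600, 'dom2': 200}
```

The weight keeps a busy guest from starving the others; the cap
bounds even a well-weighted guest, which addresses the fairness
concern about "whichever VM gets busy first".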

> For example:
>  - allocating _all_ memory with no slack makes the system more vulnerable
>    to any bugs in the rest of xen where allocation failure isn't handled
>    cleanly.  There shouldn't be any, but I bet there are.

Once tmem has been running for a while, it operates in a
perpetual state of "no slack".  IIRC a bug or two were worked
through years ago.  The real issue has always been fragmentation
and the non-resilience of failed higher-order allocations, but
Jan (as of 4.1?) has removed those issues from Xen.

So tmem is using ALL the memory in the system.  Keir (and Jan) wrote
a very solid memory manager and it works very well even under stress.

> For example:
>  - there's no way of forcing a new VM into a 'full' system; the admin must
>    wait and hope for the existing VMs to shrink.  (If there were such
>    a system, it would solve the delayed-failure problem because you'd
>    just use it to enforce the

Not true at all.  With tmem, the "want" pages of all the guests (plus
any "fallow" pages that happen to be truly free at the moment for
various reasons) are the source of pages for adding a new VM.  By
definition, the hypervisor can "free" any or all of these pages when
the toolstack tells it to allocate memory for a new guest.  No waiting
necessary.  That's how the claim_pages hypercall works so cleanly
and quickly.
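The idea behind the claim check can be sketched in a few lines
(function and parameter names are illustrative, not the hypercall's
actual interface): a claim succeeds immediately if the truly free
pages plus the reclaimable "want"/fallow pages cover the request,
and fails fast otherwise, avoiding a delayed launch failure.

```python
def try_claim(request_pages, free_pages, freeable_pages):
    """Sketch of the claim idea: succeed up front if free plus
    reclaimable pages cover the request; the actual reclaim of
    "want" pages can then proceed lazily."""
    if free_pages + freeable_pages >= request_pages:
        return True   # claim staked; no waiting on ballooning
    return False      # fail immediately, not minutes into launch

print(try_claim(1000, 200, 900))  # True
print(try_claim(1000, 200, 300))  # False
```

Note the check is purely arithmetic over existing allocator counters,
which is why it can be both fast and race-free when done under the
allocator's lock.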

(And, sorry to sound like a broken record, but I think it's worth
emphasizing and re-emphasizing, this is not a blue sky proposal.
All of this code is already working in the Xen hypervisor today.)

> > > Or, how about actually moving towards a memory scheduler like you
> > > suggested -- for example by integrating memory allocation more tightly
> > > with tmem.  There could be an xsm-style hook in the allocator for
> > > tmem-enabled domains.  That way tmem would have complete control over
> > > all memory allocations for the guests under its control, and it could
> > > implement a shared upper limit.  Potentially in future the tmem
> > > interface could be extended to allow it to force guests to give back
> > > more kinds of memory, so that it could try to enforce fairness (e.g. if
> > > two VMs are busy, why should the one that spiked first get to keep all
> > > the RAM?) or other nice scheduler-like properties.
> >
> > Tmem (plus selfballooning), unchanged, already does some of this.
> > While I would be interested in discussing better solutions, the
> > now four-year odyssey of pushing what I thought were relatively
> > simple changes upstream into Linux has left a rather sour taste
> > in my mouth, so rather than consider any solution that requires
> > more guest kernel changes [...]
> I don't mean that you'd have to do all of that now, but if you were
> considering moving in that direction, an easy first step would be to add
> a hook allowing tmem to veto allocations for VMs under its control.
> That would let tmem have proper control over its client VMs (so it can
> solve the delayed-failure race for you), while at the same time being a
> constructive step towards a more complete memory scheduler.

While you are using different words, you are describing what
tmem does today.  Tmem does have control, and it uses the existing
hypervisor mechanisms and the existing hypervisor lock for memory
allocation.  That's why the "delayed-failure race" can be solved
so cleanly using the same lock.
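Why the shared lock matters can be shown with a small model (a
sketch with invented names; `threading.Lock` here stands in for
Xen's heap lock): because the claim is checked and recorded under
the same lock the allocator takes, a concurrent allocation cannot
consume pages a staked claim is counting on.

```python
import threading

heap_lock = threading.Lock()   # stand-in for the allocator's lock
free_pages = 1000
claimed_pages = 0

def claim(n):
    """Stake a claim under the allocator's lock, so the check and
    the bookkeeping update are atomic with respect to allocations."""
    global claimed_pages
    with heap_lock:
        if free_pages - claimed_pages >= n:
            claimed_pages += n
            return True
        return False

print(claim(600))  # True: 1000 unclaimed pages available
print(claim(600))  # False: only 400 unclaimed pages remain
```

Without the shared lock, two domain builds could both pass the
check and then race each other to exhaustion, which is exactly the
delayed-failure scenario the hypercall avoids.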


Xen-devel mailing list