Xen project Mailing List

Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>

Date: Fri, 11 Jan 2013 11:13:07 -0500

Cc: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, "Keir (Xen.org)" <keir@xxxxxxx>, Ian Campbell <ian.campbell@xxxxxxxxxx>, George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Tim Deegan <tim@xxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Fri, 11 Jan 2013 16:14:33 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Jan 11, 2013, at 11:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:

> Heya,
>
> Much appreciate your input, and below are my responses.
>>>>> A) In Linux, a privileged user can write to a sysfs file which writes
>>>>> to the balloon driver which makes hypercalls from the guest kernel to
>>>>
>>>> A fairly bizarre limitation of a balloon-based approach to memory
>>>> management. Why on earth should the guest be allowed to change the size of
>>>> its balloon, and therefore its footprint on the host? This may be
>>>> justified with arguments pertaining to the stability of the in-guest
>>>> workload. What they really reveal are limitations of ballooning. But the
>>>> inadequacy of the balloon in itself doesn't automatically translate into
>>>> justifying the need for a new hypercall.
>>>
>>> Why is this a limitation? Why shouldn't the guest be allowed to change
>>> its memory usage? It can go up and down as it sees fit.
>>
>> No no. Can the guest change its cpu utilization outside scheduler
>> constraints? NIC/block dev quotas? Why should an unprivileged guest be able
>> to take a massive s*it over the host controller's memory allocation, at the
>> guest's whim?
>
> There is a limit to what it can do. It is not an uncontrolled guest
> going mayhem - it does its stuff within the parameters of the guest config.
> "Within", in my mind, also implies 'tmem' doing extra things in the
> hypervisor.
>
>>
>> I'll be happy with a balloon the day I see an OS that can't be rooted :)
>>
>> Obviously this points to a problem with sharing & paging. And this is why I
>> still spam this thread. More below.
>>
>>> And if it goes down and it gets better performance - well, why shouldn't
>>> it do it?
>>>
>>> I concur it is odd - but it has been like that for decades.
>>
>> Heh. Decades … one?
>
> Still - a decade.
>>>
>>>>
>>>>> the hypervisor, which adjusts the domain memory footprint, which changes
>>>>> the number of free pages _without_ the toolstack's knowledge.
>>>>> The toolstack controls constraints (essentially a minimum and maximum)
>>>>> which the hypervisor enforces. The toolstack can ensure that the
>>>>> minimum and maximum are identical to essentially disallow Linux from
>>>>> using this functionality. Indeed, this is precisely what Citrix's
>>>>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always
>>>>> has complete control and, so, knowledge of any domain memory
>>>>> footprint changes. But DMC is not prescribed by the toolstack,
>>>>
>>>> Neither is enforcing min==max. This was my argument when previously
>>>> commenting on this thread. The fact that you have enforcement of a maximum
>>>> domain allocation gives you an excellent tool to keep a domain's
>>>> unsupervised growth at bay. The toolstack can choose how fine-grained, how
>>>> often to be alerted and stall the domain.
>
> That would also do the trick - but there are penalties to it.
>
> If one just wants to launch multiple guests and "freeze" all the other guests
> from using the balloon driver - that can certainly be done.
>
> But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
> even need that and can just allocate without having to worry about the other
> guests at all - b/c you have reserved enough memory in the hypervisor (host)
> to launch the guest.

Konrad:
Ok, what happens when a guest is stalled because it cannot allocate more pages
due to existing claims? Exactly the same thing that happens when it can't grow
because it has hit d->max_pages.

>
>>>
>>> There is a down-call (so events) to the tool-stack from the hypervisor when
>>> the guest tries to balloon in/out? So the need for this problem arose
>>> but the mechanism to deal with it has been shifted to the user-space
>>> then? What to do when the guest does this in/out balloon at frequent
>>> intervals?
>>>
>>> I am actually missing the reasoning behind wanting to stall the domain.
>>> Is that to compress/swap the pages that the guest requests? Meaning
>>> a user-space daemon that does "things" and has ownership
>>> of the pages?
>>
>> The (my) reasoning is that this enables control over unsupervised growth. I
>> was being facetious a couple lines above. Paging and sharing also have the
>> same problem with badly behaved guests. So this is where you stop these
>> guys, allow the toolstack to catch a breath, and figure out what to do with
>> this domain (more RAM? page out? foo?).
>
> But what if we do not even need the toolstack to catch a breath? The goal
> here is for it not to be involved in this and let the hypervisor deal with
> unsupervised growth as it is better equipped to do so - and it is the ultimate
> judge whether the guest can grow wildly or not.
>
> I mean why make the toolstack become CPU bound when you can just set
> the hypervisor to take this extra information into account and you avoid
> the CPU-bound problem altogether?
>
>>
>> All your questions are very valid, but they are policy in toolstack-land.
>> Luckily the hypervisor needs no knowledge of that.
>
> My thinking is that some policy (say how much the guests can grow) is
> something that the host sets. And the hypervisor is the engine that takes
> these values into account and runs with it.
>
> I think you are advocating that the "engine" and policy should be both
> in the user-land.
>
> .. snip..
>>>> Great care has been taken for this statement to not be exactly true. The
>>>> hypervisor discards one of two pages that the toolstack tells it to (and
>>>> patches the physmap of the VM previously pointing to the discard page). It
>>>> doesn't merge, nor does it look into contents. The hypervisor doesn't care
>>>> about the page contents. This is deliberate, so as to avoid spurious
>>>> claims of "you are using technique X!"
>>>>
>>>
>>> Is the toolstack (or a daemon in userspace) doing this? I would
>>> have thought that there would be some optimization to do this
>>> somewhere?
>>
>> You could optimize but then you are baking policy where it does not belong.
>> This is what KSM did, which I dislike. Seriously, does the kernel need to
>> scan memory to find duplicates? Can't something else do it given suitable
>> interfaces? Now any other form of sharing policy that tries to use
>> VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried
>> to avoid this. I don't know that it was a conscious or deliberate effort,
>> but it worked out that way.
>
> OK, I think I understand you - you are advocating for user-space
> because the combination of policy/engine can be done there.
>
> Dan's and my thinking is to piggyback on the hypervisor's MM engine
> and just provide a means of tweaking one value. In some ways that
> is similar to making sysctls in the kernel to tell the MM how to
> behave.
>
> .. snip..
>>> That code makes certain assumptions - that the guest will not go up/down
>>> in the ballooning once the toolstack has decreed how much
>>> memory the guest should use. It also assumes that the operations
>>> are semi-atomic - and to make it so as much as it can - it executes
>>> these operations in serial.
>>>
>>> This goes back to the problem statement - if we try to parallelize
>>> this we run into the problem that the amount of memory we thought
>>> was free is not true anymore. The start of this email has a good
>>> description of some of the issues.
>>
>> Just set max_pages (bad name...) everywhere as needed to make room. Then
>> kick tmem (everywhere, in parallel) to free memory. Wait until enough is
>> free …. Allocate your domain(s, in parallel). If any vcpus become stalled
>> because a tmem guest driver is trying to allocate beyond max_pages, you need
>> to adjust your allocations. As usual.
>
>
> Versus just one "reserve" that would remove the need for most of this.
> That is - if we can not "reserve" we would fall-back to the mechanism you
> stated, but if there is enough memory we do not have to do the "wait"
> game (which on a 1TB takes forever and makes launching guests sometimes
> take minutes) - and can launch the guest without having to worry
> about slow-path.
> .. snip.

The "wait" could be literally zero in a common case. And if not, because there
is not enough free ram, the claim would have failed.

>
>>>>
>>>
>>> I believe what Dan is saying is that it is not enabled by default.
>>> Meaning it does not get executed by /etc/init.d/xencommons and
>>> as such it never gets run (or does it now?) - unless one knows
>>> about it - or it is enabled by default in a product. But perhaps
>>> we are both mistaken? Is it enabled by default now on xen-unstable?
>>
>> I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging
>> daemon? Neither daemon requires wait queue work, batch allocations, etc. I
>> can't figure out what this portion of the conversation is about.
>
> The xenshared daemon.

That's not in the tree. Unbeknownst to me. I would appreciate knowing more. Or
is it a symbolic placeholder in this conversation?

Andres

>>
>> Having said that, thanks for the thoughtful follow-up
>
> Thank you for your response!

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
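
Editorial sketch (not part of the original message): the two mechanisms argued over in this thread, a per-domain ceiling (d->max_pages) and a hypervisor-side reservation ("claim"), can be compared with a toy model. This is not Xen code. The names Host, Domain, claim and allocate are hypothetical, and the accounting rule (free pages minus outstanding claims) is an assumption about how such a reservation might behave, based only on the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    max_pages: int      # toolstack-set ceiling (d->max_pages in the thread)
    tot_pages: int = 0  # pages the domain currently owns
    claimed: int = 0    # hypothetical: pages reserved but not yet allocated


@dataclass
class Host:
    free_pages: int
    domains: list = field(default_factory=list)

    def unclaimed_free(self):
        # Free pages that no outstanding claim has reserved.
        return self.free_pages - sum(d.claimed for d in self.domains)

    def claim(self, dom, pages):
        # Reserve pages up front; fail at once if the host cannot cover them.
        if pages > self.unclaimed_free():
            return False
        dom.claimed += pages
        return True

    def allocate(self, dom, pages):
        # Refuse growth past the per-domain ceiling.
        if dom.tot_pages + pages > dom.max_pages:
            return False
        # Use the domain's own claim first, then unclaimed free memory.
        from_claim = min(pages, dom.claimed)
        if pages - from_claim > self.unclaimed_free():
            return False  # the remaining free memory is claimed by others
        dom.claimed -= from_claim
        dom.tot_pages += pages
        self.free_pages -= pages
        return True


if __name__ == "__main__":
    new = Domain("new", max_pages=600)
    old = Domain("old", max_pages=2000, tot_pages=500)
    host = Host(free_pages=1000, domains=[new, old])
    print(host.claim(new, 600))     # reservation for the guest being launched
    print(host.allocate(old, 500))  # ballooning guest would eat into the claim
    print(host.allocate(new, 600))  # launch proceeds out of the claim
```

In this model a guest refused because of another domain's claim sees the same result as one refused at max_pages, which is the equivalence Andres points out; the difference Konrad argues for is that the launch path never has to wait once the claim has succeeded.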
