
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions



On Jan 11, 2013, at 11:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> 
wrote:

> Heya,
> 
> Much appreciate your input, and below are my responses.
>>>>> A) In Linux, a privileged user can write to a sysfs file which writes
>>>>> to the balloon driver which makes hypercalls from the guest kernel to
>>>> 
>>>> A fairly bizarre limitation of a balloon-based approach to memory 
>>>> management. Why on earth should the guest be allowed to change the size of 
>>>> its balloon, and therefore its footprint on the host? This may be 
>>>> justified with arguments pertaining to the stability of the in-guest 
>>>> workload. What they really reveal are limitations of ballooning. But the 
>>>> inadequacy of the balloon in itself doesn't automatically translate into 
>>>> justifying the need for a new hypercall.
>>> 
>>> Why is this a limitation? Why shouldn't the guest be allowed to change
>>> its memory usage? It can go up and down as it sees fit.
>> 
>> No no. Can the guest change its cpu utilization outside scheduler 
>> constraints? NIC/block dev quotas? Why should an unprivileged guest be able 
>> to take a massive s*it over the host controller's memory allocation, at the 
>> guest's whim?
> 
> There is a limit to what it can do. It is not an uncontrolled guest
> wreaking mayhem - it does its stuff within the parameters of the guest config.
> 'Within', in my mind, also implies the 'tmem' doing extra things in the 
> hypervisor.
> 
>> 
>> I'll be happy with a balloon the day I see an OS that can't be rooted :)
>> 
>> Obviously this points to a problem with sharing & paging. And this is why I 
>> still spam this thread. More below.
>> 
>>> And if it goes down and it gets better performance - well, why shouldn't
>>> it do it?
>>> 
>>> I concur it is odd - but it has been like that for decades.
>> 
>> Heh. Decades … one?
> 
> Still - a decade.
>>> 
>>> 
>>>> 
>>>>> the hypervisor, which adjusts the domain memory footprint, which changes 
>>>>> the number of free pages _without_ the toolstack's knowledge.
>>>>> The toolstack controls constraints (essentially a minimum and maximum)
>>>>> which the hypervisor enforces.  The toolstack can ensure that the
>>>>> minimum and maximum are identical to essentially disallow Linux from
>>>>> using this functionality.  Indeed, this is precisely what Citrix's
>>>>> Dynamic Memory Controller (DMC) does: enforce min==max so that DMC always 
>>>>> has complete control and, so, knowledge of any domain memory
>>>>> footprint changes.  But DMC is not prescribed by the toolstack,
>>>> 
>>>> Neither is enforcing min==max. This was my argument when previously 
>>>> commenting on this thread. The fact that you have enforcement of a maximum 
>>>> domain allocation gives you an excellent tool to keep a domain's 
>>>> unsupervised growth at bay. The toolstack can choose how fine-grained to be, 
>>>> how often to be alerted, and when to stall the domain.
> 
> That would also do the trick - but there are penalties to it.
> 
> If one just wants to launch multiple guests and "freeze" all the other guests
> from using the balloon driver - that can certainly be done.
> 
> But that is a half-way solution (in my mind). Dan's idea is that you wouldn't
> even need that and can just allocate without having to worry about the other
> guests at all - b/c you have reserved enough memory in the hypervisor (host)
> to launch the guest.

Konrad:
Ok, what happens when a guest is stalled because it cannot allocate more pages 
due to existing claims? Exactly the same thing that happens when it can't grow 
because it has hit d->max_pages.
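
To make the comparison concrete, here is a rough sketch of the two checks an
allocation request runs into. Illustrative only, not the actual Xen code:
d->tot_pages and d->max_pages are real fields, but outstanding_claims and
d->claimed_pages are just stand-ins for whatever the claim patch would add.

/*
 * Simplified model of the hypervisor-side checks.  Field names other than
 * tot_pages/max_pages are made up for illustration.
 */
struct domain {
    unsigned long tot_pages;     /* pages currently owned by the domain     */
    unsigned long max_pages;     /* toolstack-imposed ceiling               */
    unsigned long claimed_pages; /* unfulfilled part of this domain's claim */
};

extern unsigned long total_free_pages;   /* free heap pages on the host     */
extern unsigned long outstanding_claims; /* sum of all unfulfilled claims   */

/* Conceptually called when a guest balloons up / asks to be populated. */
static int can_allocate(const struct domain *d, unsigned long nr)
{
    /* Existing check: a guest ballooning up stalls here once it hits
     * d->max_pages. */
    if (d->tot_pages + nr > d->max_pages)
        return 0;

    /* Claim check: free pages reserved for *other* domains are off limits.
     * From the guest's point of view a stall here is indistinguishable
     * from a stall at max_pages above. */
    if (total_free_pages < nr + (outstanding_claims - d->claimed_pages))
        return 0;

    return 1;
}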

> 
>>> 
>>> There is a down-call (so events) to the tool-stack from the hypervisor when
>>> the guest tries to balloon in/out? So the need to deal with this problem arose,
>>> but the mechanism for dealing with it has been shifted to user-space
>>> then? What to do when the guest does this balloon in/out at frequent
>>> intervals?
>>> 
>>> I am actually missing the reasoning behind wanting to stall the domain.
>>> Is that to compress/swap the pages that the guest requests? Meaning
>>> a user-space daemon that does "things" and has ownership
>>> of the pages?
>> 
>> The (my) reasoning is that this enables control over unsupervised growth. I 
>> was being facetious a couple lines above. Paging and sharing also have the 
>> same problem with badly behaved guests. So this is where you stop these 
>> guys, allow the toolstack to catch a breath, and figure out what to do with 
>> this domain (more RAM? page out? foo?).
> 
> But what if we do not even need the toolstack to catch a breath? The goal
> here is for it not to be involved in this at all and to let the hypervisor deal
> with unsupervised growth, as it is better equipped to do so - and it is the
> ultimate judge of whether the guest can grow wildly or not.
> 
> I mean, why make the toolstack become CPU-bound when you can just have
> the hypervisor take this extra information into account and avoid
> the CPU-bound problem altogether?
> 
>> 
>> All your questions are very valid, but they are policy in toolstack-land. 
>> Luckily the hypervisor needs no knowledge of that.
> 
> My thinking is that some policy (say, how much the guests can grow) is
> something that the host sets. And the hypervisor is the engine that takes
> these values into account and runs with them.
> 
> I think you are advocating that the "engine" and the policy should both
> be in user-land.
> 
> .. snip..
>>>> Great care has been taken for this statement to not be exactly true. The 
>>>> hypervisor discards one of two pages that the toolstack tells it to (and 
>>>> patches the physmap of the VM previously pointing to the discarded page). It 
>>>> doesn't merge, nor does it look into contents. The hypervisor doesn't care 
>>>> about the page contents. This is deliberate, so as to avoid spurious 
>>>> claims of "you are using technique X!"
>>>> 
>>> 
>>> Is the toolstack (or a daemon in userspace) doing this? I would
>>> have thought that there would be some optimization to do this
>>> somewhere?
>> 
>> You could optimize but then you are baking policy where it does not belong. 
>> This is what KSM did, which I dislike. Seriously, does the kernel need to 
>> scan memory to find duplicates? Can't something else do it given suitable 
>> interfaces? Now any other form of sharing policy that tries to use 
>> VMA_MERGEABLE is SOL. Tim, Gregor and I, at different points in time, tried 
>> to avoid this. I don't know that it was a conscious or deliberate effort, 
>> but it worked out that way.
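
(To illustrate "suitable interfaces": a userspace policy daemon can drive
sharing entirely through libxc, keeping the duplicate-finding policy out of the
hypervisor. A rough sketch - the xc_memshr_* signatures here are from memory
and may not match xenctrl.h exactly:)

#include <xenctrl.h>

/*
 * Pair up two guest frames that the daemon has decided are duplicates.
 * How the daemon finds them (scanning, guest hints, whatever) is policy
 * and stays in userspace; the hypervisor only discards one of the two
 * pages when told to.
 */
static int share_pair(xc_interface *xch,
                      domid_t dom_a, unsigned long gfn_a,
                      domid_t dom_b, unsigned long gfn_b)
{
    uint64_t handle_a, handle_b;

    /* Nominate both gfns; the hypervisor hands back opaque handles. */
    if (xc_memshr_nominate_gfn(xch, dom_a, gfn_a, &handle_a))
        return -1;
    if (xc_memshr_nominate_gfn(xch, dom_b, gfn_b, &handle_b))
        return -1;

    /* Back both gfns with a single page; one of the two is freed. */
    return xc_memshr_share_gfns(xch, dom_a, gfn_a, handle_a,
                                dom_b, gfn_b, handle_b);
}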
> 
> OK, I think I understand you - you are advocating for user-space
> because the combination of policy/engine can be done there.
> 
> Dan's and my thinking is to piggyback on the hypervisor's MM engine
> and just provide a means of tweaking one value. In some ways that
> is similar to adding sysctls in the kernel to tell the MM how to
> behave.
> 
> .. snip..
>>> That code makes certain assumptions - that the guest will not balloon up/down
>>> once the toolstack has decreed how much
>>> memory the guest should use. It also assumes that the operations
>>> are semi-atomic - and to make them so, as much as it can, it executes
>>> these operations serially.
>>> 
>>> This goes back to the problem statement - if we try to parallelize
>>> this we run into the problem that the amount of memory we thought
>>> was free is no longer free. The start of this email has a good
>>> description of some of the issues.
>> 
>> Just set max_pages (bad name...) everywhere as needed to make room. Then 
>> kick tmem (everywhere, in parallel) to free memory. Wait until enough is 
>> free …. Allocate your domain(s, in parallel). If any vcpus become stalled 
>> because a tmem guest driver is trying to allocate beyond max_pages, you need 
>> to adjust your allocations. As usual.
> 
> 
> Versus just one "reserve" that would remove the need for most of this.
> That is - if we cannot "reserve" we would fall back to the mechanism you
> stated, but if there is enough memory we do not have to play the "wait"
> game (which on a 1TB host takes forever and makes launching guests sometimes
> take minutes) - and can launch the guest without having to worry
> about the slow path.
> .. snip.

The "wait" could be literally zero in a common case. And if not, because there 
is not enough free ram, the claim would have failed.
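
To spell out the comparison, the flow would look roughly like the following.
A sketch only: xc_domain_claim_pages() stands for the proposed libxc wrapper
for XENMEM_claim_pages (exact name/signature guessed), and the three helpers
are hypothetical stand-ins for the existing "set max_pages everywhere, kick
tmem, poll free memory" path. The claim only short-circuits the case where
there is already enough free memory - which is exactly the case where the
"wait" in the existing flow is near zero anyway.

#include <unistd.h>
#include <xenctrl.h>

/* Hypothetical helpers: the existing make-room-and-poll machinery and the
 * actual domain builder. */
extern int  build_domain(xc_interface *xch, uint32_t domid, unsigned long nr_pages);
extern void shrink_other_domains(xc_interface *xch, unsigned long nr_pages);
extern int  enough_free_memory(xc_interface *xch, unsigned long nr_pages);

int launch_domain(xc_interface *xch, uint32_t domid, unsigned long nr_pages)
{
    /* Fast path: stake the claim up front. If it succeeds, the "wait" is
     * literally zero and building can start immediately, regardless of
     * what other guests' balloon drivers do in the meantime. */
    if (xc_domain_claim_pages(xch, domid, nr_pages) == 0)
        return build_domain(xch, domid, nr_pages);

    /* Slow path: the claim failed, i.e. there really is not enough free
     * memory right now. Fall back to making room and polling. */
    shrink_other_domains(xch, nr_pages);
    while (!enough_free_memory(xch, nr_pages))
        sleep(1);

    return build_domain(xch, domid, nr_pages);
}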

> 
>>>> 
>>> 
>>> I believe what Dan is saying is that it is not enabled by default.
>>> Meaning it does not get executed by /etc/init.d/xencommons and
>>> as such it never gets run (or does it now?) - unless one knows
>>> about it - or it is enabled by default in a product. But perhaps
>>> we are both mistaken? Is it enabled by default now on xen-unstable?
>> 
>> I'm a bit lost … what is supposed to be enabled? A sharing daemon? A paging 
>> daemon? Neither daemon requires wait queue work, batch allocations, etc. I 
>> can't figure out what this portion of the conversation is about.
> 
> The xenshared daemon.
That's not in the tree - unbeknownst to me, at least. I would appreciate knowing 
more. Or is it a symbolic placeholder in this conversation?

Andres

>> 
>> Having said that, thanks for the thoughtful follow-up
> 
> Thank you for your response!


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

