
Re: [Xen-devel] domain creation vs querying free memory (xend and xl)

Hi Andres --

First, the primary target of page-sharing is HVM proprietary/legacy
guests, correct?  So, as I said, we are starting from different
planets.  I'm not arguing that a toolstack-memory-controller
won't be sufficient for your needs, especially in a single server
environment, only that the work required to properly ensure that:

> >> The toolstack has (or should definitely have) a non-racy view
> >> of the memory of the host

is unnecessary if you (and the toolstack) take a slightly broader
dynamic view of memory management.  IMHO that broader view
(which requires the "memory reservation" hypercall) both encompasses
tmem and greatly simplifies memory management in the presence
of page-unsharing.  I.e. it allows the toolstack to NOT have
a non-racy view of the memory of the host.

So, if you don't mind, I will take this opportunity to
ask some questions about page-sharing stuff, in the context
of the toolstack-memory-controller and/or memory reservation:

> >> Domains can be cajoled into obedience via the max_pages tweak -- which I 
> >> profoundly dislike. If
> >> anything we should change the hypervisor to have a "current_allowance" or 
> >> similar field with a more
> >> obvious meaning. The abuse of max_pages makes me cringe. Not to say I 
> >> disagree with its usefulness.
> >
> > Me cringes too.  Though I can see from George's view that it makes
> > perfect sense.  Since the toolstack always controls exactly how
> > much memory is assigned to a domain and since it can cache the
> > "original max", current allowance and the hypervisor's view of
> > max_pages must always be the same.
> No. There is room for slack. max_pages (or current_allowance) simply sets an 
> upper bound, which if met
> will trigger the need for memory management intervention.

I think we agree if we change my "must always be the same" to
"must always be essentially the same, ignoring some fudge factor".

Which raises the questions: How does one determine how big the
fudge factor is, what happens if it is not big enough, and if
it is too big, doesn't that potentially add up to a lot of
wasted space?

> > By "ex machina" do you mean "without the toolstack's knowledge"?
> >
> > Then how does page-unsharing work?  Does every page-unshare done by
> > the hypervisor require serial notification/permission of the toolstack?
> No of course not. But if you want to keep a domain at bay you keep its 
> max_pages where you want it to
> stop growing. And at that point the domain will fall asleep (not 100% there 
> hypervisor-wise yet but
> Real Soon Now (T)), and a synchronous notification will be sent to a listener.
> At that point it's again a memory management decision. Should I increase the 
> domain's reservation,
> page something out, etc? There is a range of possibilities that are not 
> germane to the core issue of
> enforcing memory limits.

Maybe we need to dive deep into page-sharing accounting for
a moment here:

When a page is shared, say, by 1000 different VMs, does it get
"billed" to all VMs?  If no (which makes the most sense to me),
how is the toolstack informed that there are now 999 free
pages available so that it can use them in, say, a new domain?
Does the hypervisor notification wait until there are sufficient
pages (say, a bucket's worth)?  If yes, what's the point of
sharing if the hypervisor now has some free memory but
the freed memory is still "billed"; and are there data
structures in the hypervisor to track this so that unsharing
does proper accounting too?

Now suppose 10000 pages are shared by 1000 different VMs at
domain launch (scenario: an online class is being set up by
a cloud user) and then the VMs suddenly get very active
and require a lot of CoWing (say the online class just
got underway).  What's the profile of interaction between
the hypervisor and toolstack?
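
To put rough numbers on that scenario, here is a back-of-the-envelope
sketch in Python.  It assumes 4 KiB pages and that a shared page is
backed by exactly one physical frame until a VM writes to it; the
function name is made up for illustration and is not a Xen interface.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages


def extra_frames(shared_pages: int, num_vms: int) -> int:
    """Frames saved while `shared_pages` pages stay shared by `num_vms` VMs.

    The same number is the worst-case CoW demand: if every VM but one
    writes to every shared page, each write needs a fresh frame, and the
    last sharer keeps the original.
    """
    return shared_pages * (num_vms - 1)


# One page shared by 1000 VMs frees 999 frames.
saved_single = extra_frames(1, 1000)

# 10000 pages shared by 1000 VMs: worst-case unshare demand.
demand_frames = extra_frames(10_000, 1000)
demand_gib = demand_frames * PAGE_SIZE / 2**30

print(saved_single, demand_frames, round(demand_gib, 1))
```

Under those assumptions the worst case is 9,990,000 frames, roughly
38 GiB, which all has to come from somewhere once the CoWing starts.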

Maybe you've got this all figured out (whether implemented or
not) and are convinced it is scalable (or don't care because the
target product is a small single system), but I'd imagine the internal
hypervisor vs toolstack accounting/notifications will get very
very messy, and I have concerns about scalability and memory waste.

> > Or is this "batched", in which case a pool is necessary, isn't it?
> > (Not sure what you mean by "no need for a pool" and then "toolstack
> > ensures there is something set apart"... what's the difference?)
> I am under the impression there is a proposal floating for a 
> hypervisor-maintained pool of pages to
> immediately relieve un-sharing. Much like there is now for PoD (the pod 
> cache). This is what I think is
> not necessary.

I agree it is not necessary, but don't understand who manages
the "slop" (unallocated free pages) and how a pool is different
from a "bucket" (to use your term from further down in your reply).

> > My point is, whether there is no pool or a pool that sometimes
> > runs dry, are you really going to put the toolstack in the hypervisor's
> > path for allocating a page so that the hypervisor can allocate
> > a new page for CoW to fulfill an unshare?
> Absolutely not.

Good to hear.  But this calls for answers to the previous questions.
Mainly: How does it all work then so that the toolstack and
hypervisor are "in sync" about the number of available pages
such that the toolstack never wrongly determines that there
is enough free space to launch a domain and (by the time
it tries to use the free space) there really isn't?

If they can't remain in sync (at least within a single "bucket",
across the entire system, not one bucket per domain), then
isn't something like the proposed "memory reservation"
hypercall still required?
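
The race in question can be shown with a toy model.  This is only a
sketch in Python: `Host`, `query_free`, `allocate` and `reserve` are
hypothetical names, not Xen or toolstack interfaces, and `reserve`
stands in for whatever a "memory reservation" hypercall would do
(check and claim in a single step).

```python
import threading


class Host:
    """Toy model of a host's free-page counter (not a Xen interface)."""

    def __init__(self, free_pages: int) -> None:
        self.free_pages = free_pages
        self._lock = threading.Lock()

    def query_free(self) -> int:
        # A snapshot only: it may be stale by the time the caller acts.
        with self._lock:
            return self.free_pages

    def allocate(self, pages: int) -> bool:
        # Models any allocation the toolstack does not see coming,
        # e.g. a CoW unshare or a tmem put.
        with self._lock:
            if pages > self.free_pages:
                return False
            self.free_pages -= pages
            return True

    def reserve(self, pages: int) -> bool:
        # Hypothetical reservation: check and claim atomically.
        return self.allocate(pages)


# Racy pattern: query first, act later.
host = Host(free_pages=1000)
snapshot = host.query_free()      # looks sufficient for an 800-page domain
host.allocate(300)                # unshare burst in between
racy_ok = snapshot >= 800 and host.allocate(800)

# Reservation pattern: claim up front.
host2 = Host(free_pages=1000)
reserved = host2.reserve(800)     # succeeds, 200 pages left
burst_ok = host2.allocate(300)    # the burst now fails instead

print(racy_ok, reserved, burst_ok)
```

In the first pattern the domain launch fails after the toolstack
already decided it would fit.  In the second the launch is safe, and
the shortfall shows up at the unshare, where it has to be handled by
some other policy.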

> >> Something that I struggle with here is the notion that we need to extend
> >> the hypervisor for any aspect
> >> of the discussion we've had so far. I just don't see that. The toolstack
> >> has (or should definitely
> >> have) a non-racy view of the memory of the host. Reservations are
> >> therefore notions the toolstack
> >> manages.
> >
> > In a perfect world where the toolstack has an oracle for the
> > precise time-varying memory requirements for all guests, I
> > would agree.
> With the mechanism outlined, the toolstack needs to make coarse-grained 
> infrequent decisions. There is
> a possibility for pathological misbehavior -- I think there is always that 
> possibility. Correctness is
> preserved, at worst, performance will be hurt.

IMHO, performance will be hurt not only for the pathological cases.
Memory will also needlessly be wasted.  But, for Windows, I don't
have a better solution, and it will probably be no worse than Microsoft's

> It's really important to keep things separate in this discussion. The 
> toolstack+hypervisor are
> enabling (1) control over how memory is allocated to what (2) control over a 
> domain's ability to grow
> its footprint unsupervised (3) control over a domain's footprint with PV 
> mechanisms from within, or
> externally.
> Performance is not up to the toolstack but to the memory manager magic the 
> toolstack enables with (3).

Good dichotomy (though not entirely perfect on my planet).

> > In that world, there's no need for a CPU scheduler either...
> > the toolstack can decide exactly when to assign each VCPU for
> > each VM onto each PCPU, and when to stop and reassign.
> > And then every PCPU would be maximally utilized, right?
> >
> > My point: Why would you resource-manage CPUs differently from
> > memory?  The demand of real-world workloads varies dramatically
> > for both... don't you want both to be managed dynamically,
> > whenever possible?
> >
> > If yes (dynamic is good), in order for the toolstack's view of
> > memory to be non-racy, doesn't every hypervisor page allocation
> > need to be serialized with the toolstack granting notification/permission?
> Once you bucketize RAM and know you will get synchronous kicks as buckets 
> fill up, then you have a
> non-racy view. If you choose buckets of width one...

 ... e.g. tmem, which is saving one page of data at high frequency
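
As a rough illustration of the bucket trade-off (a Python sketch with
made-up names, not code from Xen or any toolstack): a domain grows
unsupervised until its bucket is full, then a listener gets a
synchronous kick and, in this toy, simply widens the bucket.

```python
from typing import Callable


class DomainBucket:
    """Toy model: a domain may use up to `limit` pages unsupervised."""

    def __init__(self, limit: int,
                 on_full: Callable[["DomainBucket"], None]) -> None:
        self.limit = limit
        self.used = 0
        self.kicks = 0
        self._on_full = on_full

    def allocate_page(self) -> bool:
        if self.used >= self.limit:
            # Bucket full: synchronous kick to the listener, which may
            # raise the limit or leave it alone (the domain then stalls).
            self.kicks += 1
            self._on_full(self)
            if self.used >= self.limit:
                return False
        self.used += 1
        return True


def grow_by(width: int) -> Callable[[DomainBucket], None]:
    """Listener policy (an assumption): always grant one more bucket."""
    def handler(bucket: DomainBucket) -> None:
        bucket.limit += width
    return handler


def kicks_for(pages: int, width: int) -> int:
    """Number of kicks needed for a domain to allocate `pages` pages."""
    bucket = DomainBucket(limit=width, on_full=grow_by(width))
    for _ in range(pages):
        bucket.allocate_page()
    return bucket.kicks


print(kicks_for(1000, 100), kicks_for(1000, 1))
```

With buckets of 100 pages, 1000 allocations cost 9 kicks but up to 99
pages of slack per domain; with buckets of width one there is no slack
but 999 kicks.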

> >> I further think the pod cache could be converted to this model. Why have
> >> specific per-domain lists of
> >> cached pages in the hypervisor? Get them back from the heap! Obviously
> >> places a decoupled requirement
> >> of certain toolstack features. But allows to throw away a lot of complex
> >> code.
> >
> > IIUC in George's (Xapi) model (or using Tim's phrase, "balloon-to-fit")
> > the heap is "always" empty because the toolstack has assigned all memory.
> I don't think that's what they mean. Nor is it what I mean. The toolstack may 
> chunk memory up into
> abstract buckets. It can certainly assert that its bucketized view matches 
> the hypervisor view. Pages
> flow from the heap to each domain -- but the bucket "domain X" will not 
> overflow unsupervised.

Right, but it is the "underflow" I am concerned with.

I don't know if that is what they mean by "balloon-to-fit" (or exactly
what you mean), but I think we are all trying to optimize the use of
a fixed amount of RAM among some number of VMs.  To me, a corollary
of that is that the size of the heap is always as small "as possible".
And another corollary is that there aren't a bunch of empty pools
of free pages lying about waiting for rare events to happen.  And
one more corollary is that, to the extent possible, guests aren't
"wasting" memory.

> > So I'm still confused... where does "page unshare" get memory from
> > and how does it notify and/or get permission from the toolstack?
> Re sharing, as it should be clear by now, the answer is "it doesn't matter". 
> If unsharing cannot be
> satisfied from the heap, then memory management in dom0 is invoked. 
> Heavy-weight, but it means you've
> hit an admin-imposed limit.

Well it *does* matter if that fallback (unsharing cannot be
satisfied from the heap) happens too frequently.

> Please note that this notion of limits and enforcement is sparingly applied 
> today, to the best of my
> knowledge. But imho it'd be great to meaningfully work towards it.

Agreed.  There are lots of policy questions around all of our different
mechanism "planets", so I hope this discussion meaningfully helps!

Thanks for the great discussion!

