
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:
Thanks for the clarification. I am not that fluent in the OCaml code.

I'm not fluent in OCaml either; I'm mainly going from memory based on the discussions I had with the author when it was being designed, as well as discussions with the xapi team when dealing with bugs at later points.

When a request comes in for a certain amount of memory, it will go
and set each VM's max_pages, and the max tmem pool size.  It can
then check whether there is enough free memory to complete the
allocation or not (since there's a race between checking how much
memory a guest is using and setting max_pages).  If that succeeds,
it can return "success".  If, while that VM is being built, another
request comes in, it can again go around and set the max sizes
lower.  It has to know how much of the memory is "reserved" for the
first guest being built, but if there's enough left after that, it
can return "success" and allow the second VM to start being built.

After the VMs are built, the toolstack can remove the limits again
if it wants, again allowing the free flow of memory.
This sounds to me like what Xapi does?

No, AFAIK xapi always sets the max_pages to what it wants the guest to be using at any given time. I talked about removing the limits (and about operating without limits in the normal case) because it seems like something that Oracle wants (having to do with tmem).
Do you see any problems with this scheme?  All it requires is for
the toolstack to be able to temporarily set limits on both guests
ballooning up and on tmem allocating more than a certain amount of
memory.  We already have mechanisms for the first, so if we had a
"max_pages" for tmem, then you'd have all the tools you need to
implement it.
Off the top of my head, the things that come to mind are:
  - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
    looks to solve the launching in parallel of guests.
    It will allow us to launch multiple guests - but it will also mean
    suppressing the tmem asynchronous calls and having to balloon up/down
    the guests. The claim hypercall does not do any of those and
    gives a definite 'yes' or 'no'.

So when you say, "tmem freeze", are you specifically talking about not allowing tmem to allocate more memory (what I called a "max_pages" for tmem)? Or is there more to it?

Secondly, just to clarify: when a guest is using memory from the tmem pool, is that added to tot_pages?

I'm not sure what "gives a definite yes or no" is supposed to mean -- the scheme I described also gives a definite yes or no.

In any case, your point about ballooning is taken: if we set max_pages for a VM and just leave it there while VMs are being built, then VMs cannot balloon up, even if there is "free" memory (i.e., memory that will not be used for the currently-building VM), and memory cannot be moved *between* VMs either (i.e., by ballooning down one and ballooning the other up). Both of these can be done by extending the toolstack with a memory model (see below), but that adds an extra level of complication.

  - Complex code that has to keep track of this in the user-space.
    It also has to know of the extra 'reserved' space that is associated
    with a guest. I am not entirely sure how that would couple with
    PCI passthrough. The claim hypercall is fairly simple - albeit
    having it extended to do Super pages and 32-bit guests could make this

What do you mean by the extra 'reserved' space? And what potential issues are there with PCI passthrough?

To be accepted, the reservation hypercall will certainly have to be extended to do superpages and 32-bit guests, so that's the case we should be considering.

  - I am not sure whether the toolstack can manage all the memory
    allocation. It sounds like it could but I am just wondering if there
    are some extra corners that we hadn't thought of.

Wouldn't the same argument apply to the reservation hypercall? Suppose that there was enough domain memory but not enough Xen heap memory, or not enough of some other resource -- the hypercall might succeed, but then the domain build would still fail at some later point when the other resource allocation failed.

  - Latency. With the locks being placed on the pools of memory the
    existing workload can be negatively affected. Say that this means we
    need to balloon down a couple hundred guests, then launch the new
    guest. This process of 'lower all of them by X', let's check the
    'free amount'. Oh nope - not enough - let's do this again. That would
    delay the creation process.

    The claim hypercall will avoid all of that by just declaring:
    "This is how much you will get." without having to balloon the rest
    of the guests.

    Here is how I see what your toolstack would do:

        1). Figure out how much memory we need for X guests.
        2). round-robin existing guests to decrease their memory
            consumption (if they can be ballooned down). Or this
            can be executed in parallel for the guests.
        3). check if the amount of free memory is at least X
            [this check has to be done in serial]
        4). launch multiple guests at the same time.

    The claim hypercall would avoid the '3' part b/c it is inherently
    part of Xen's MM bureaucracy. It would allow:

        1). claim hypercall for X guest.
        2). if any of the claims returns 0 (so success), then launch that guest
        3). if the errno was -ENOMEM then:
         3a). round-robin existing guests to decrease their memory
              consumption if allowed. Goto 1).

    So the 'error-case' only has to run in the slow-serial case.
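
For reference, the claim-based flow quoted above can be sketched roughly as follows. This is only a toy model: `FakeHypervisor`, `claim_pages`, `balloon_down` and `launch_with_claims` are hypothetical names made up for illustration, not the proposed XENMEM_claim_pages interface or any libxc call. It assumes a claim succeeds only if the free memory not already claimed by someone else covers it.

```python
# Toy model only: FakeHypervisor and its methods are hypothetical stand-ins,
# not the real Xen or libxc interface.
import errno


class FakeHypervisor:
    """Tracks free pages and outstanding claims, as a claim hypercall might."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self.claimed = {}  # guest name -> pages claimed but not yet allocated

    def claim_pages(self, guest, pages):
        """Return 0 on success, -ENOMEM if unclaimed free memory is too small."""
        unclaimed = self.free_pages - sum(self.claimed.values())
        if pages > unclaimed:
            return -errno.ENOMEM
        self.claimed[guest] = pages
        return 0

    def balloon_down(self, pages):
        """Pretend existing guests gave back `pages` pages."""
        self.free_pages += pages


def launch_with_claims(hv, requests, balloon_step, max_rounds=10):
    """Claim for each guest; balloon other guests down only after -ENOMEM."""
    launched = []
    pending = dict(requests)
    for _ in range(max_rounds):
        for guest, pages in list(pending.items()):
            if hv.claim_pages(guest, pages) == 0:
                launched.append(guest)  # the build can start right away
                del pending[guest]
        if not pending:
            break
        hv.balloon_down(balloon_step)  # slow path, only after a failed claim
    return launched, sorted(pending)
```

With 100 free pages and two 60-page requests, the first claim succeeds immediately and the second only after enough ballooning rounds have freed the missing 20 pages; no separate "check free memory" step appears in the toolstack code.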
Hmm, I don't think what you wrote about mine is quite right. Here's what I had in mind for mine (let me call it "limit-and-check"):

1) Set limits on all guests, and tmem, and see how much memory is left.
2) Read free memory
2a) Claim memory for each guest from freshly-calculated pool of free memory.
3) For each claim that can be satisfied, launch a guest
4) If there are guests that can't be satisfied with the current free memory, then:
4a) round-robin existing guests to decrease their memory consumption if allowed. Goto 2.
5) Remove limits on guests.

Note that 1 would only be done for the first such "request", and 5 would only be done after all such requests have succeeded or failed. Also note that steps 1 and 5 are only necessary if you want to go without such limits -- xapi doesn't do them, because it always keeps max_pages set to what it wants the guest to be using.

Also, note that the "claiming" (2a for mine above and 1 for yours) has to be serialized with other "claims" in both cases (in the reservation hypercall case, with a lock inside the hypervisor), but that the building can begin in parallel with the "claiming" in both cases.

But I think I do see what you're getting at. The "free memory" measurement has to be taken when the system is in a "quiescent" state -- or at least a "grow only" state -- otherwise it's meaningless. So #4a should really be:

4a) Round-robin existing guests to decrease their memory consumption if allowed.
4b) Wait for currently-building guests to finish building (if any), then go to #2.
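
To make the revised loop concrete, here is a minimal sketch of "limit-and-check" with the 4a/4b change. Everything in it is an assumption for illustration: `ToyHost` and `limit_and_check` are hypothetical names, sizes are in GiB, and steps 1 and 5 (setting and removing the limits) are left out because they only bracket the loop.

```python
class ToyHost:
    """Hypothetical host: free memory plus a set of in-progress builds."""

    def __init__(self, free_gib):
        self.free_gib = free_gib
        self.building = {}  # guest -> GiB the builder has yet to allocate

    def start_build(self, guest, gib):
        self.building[guest] = gib

    def finish_builds(self):
        """Step 4b: wait for builders; their memory leaves the free pool."""
        for gib in self.building.values():
            self.free_gib -= gib
        self.building.clear()

    def balloon_down(self, gib):
        """Step 4a: existing guests release memory."""
        self.free_gib += gib


def limit_and_check(host, requests, balloon_gib, max_rounds=10):
    """Return (guests started in each round, requests still unsatisfied)."""
    started_per_round = []
    pending = list(requests)
    for _ in range(max_rounds):
        free = host.free_gib  # step 2: only meaningful while quiescent
        started, still_pending = [], []
        for guest, gib in pending:  # step 2a: serialized "claims"
            if gib <= free:
                free -= gib
                host.start_build(guest, gib)  # step 3
                started.append(guest)
            else:
                still_pending.append((guest, gib))
        started_per_round.append(started)
        pending = still_pending
        if not pending:
            break
        host.balloon_down(balloon_gib)  # step 4a
        host.finish_builds()  # step 4b, then back to step 2
    return started_per_round, pending
```

The call to `finish_builds()` before re-reading free memory is the point of 4b: the second read happens only once the earlier builds have taken their memory, so the number is not inflated by memory that is already spoken for.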

So consider the following cases, in which several requests for guest creation come in over a short period of time (not necessarily all at once):

A. There is enough memory for all requested VMs to be built without ballooning / something else
B. There is enough for some, but not all of the VMs to be built without ballooning / something else

In case A, I think "limit-and-check" and "reservation hypercall" should perform the same. For each new request that comes in, the toolstack can say, "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM. So I should have 48GiB left, enough to build this 32GiB VM." And for the request after that: "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM and a 32GiB VM, so I should have 16GiB left, enough to be able to build this 16GiB VM."
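
The bookkeeping in that example is just a snapshot minus a running total. A small sketch, with hypothetical helper names and sizes in GiB:

```python
def remaining_after(snapshot_gib, started_gib):
    """Free memory at the last check minus everything started since then."""
    return snapshot_gib - sum(started_gib)


def can_start(snapshot_gib, started_gib, request_gib):
    """True if the request fits in what the snapshot says should be left."""
    return request_gib <= remaining_after(snapshot_gib, started_gib)


# The numbers from the example: 64GiB at the check, then 16GiB and 32GiB VMs.
print(remaining_after(64, [16]))      # 48
print(can_start(64, [16], 32))        # True
print(remaining_after(64, [16, 32]))  # 16
print(can_start(64, [16, 32], 16))    # True
```

This only holds while nothing else is consuming free memory, which is why the limits on guests and tmem have to be in place for the snapshot to stay valid.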

The main difference comes in case B. The "reservation hypercall" method will not have to wait until all existing guests have finished building to be able to start subsequent guests; but "limit-and-check" would have to wait until the currently-building guests are finished before doing another check.

This limitation doesn't apply to xapi, because it doesn't use the hypervisor's free memory as a measure of the memory it has available to it. Instead, it keeps an internal model of the free memory the hypervisor has available. This is based on MAX(current_target, tot_pages) of each guest (where "current_target" for a domain in the process of being built is the amount of memory it will have eventually). We might call this the "model" approach.
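
As a rough illustration of the "model" idea (not xapi's actual code, which I haven't reproduced here), the estimate could look something like the sketch below. The function name, the `host_total` parameter and the subtraction from a host total are my assumptions; the only part taken from the description above is charging each guest MAX(current_target, tot_pages).

```python
def modelled_free(host_total, guests):
    """Estimate free pages without asking the hypervisor.

    `guests` is an iterable of (current_target, tot_pages) pairs, in pages.
    For a domain still being built, current_target is its eventual size.
    Subtracting from `host_total` is an assumption made for this sketch.
    """
    return host_total - sum(max(target, tot) for target, tot in guests)


guests = [
    (300, 350),  # asked to balloon down to 300, still holding 350
    (200, 50),   # being built: will reach 200, has 50 so far
    (100, 100),  # steady state
]
print(modelled_free(1000, guests))  # 350
```

Because a building domain is charged its full target from the start, the estimate does not need the system to be quiescent, which is what lets this approach avoid waiting for in-flight builds.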

We could extend "limit-and-check" to "limit-check-and-model" (i.e., estimate how much memory is really free after ballooning based on the guests' tot_pages), or "limit-model" (basically, fully switch to a xapi-style "model" approach while you're doing domain creation). That would be significantly more complicated. On the other hand, a lot of the work has already been done by the XenServer team, and (I believe) the code in question is all GPL'ed, so Oracle could just take the algorithms and adapt them with just a bit of tweaking (and a bit of code translation). It seems to me that the "model" approach brings a lot of other benefits as well.

But at any rate -- without debating the value or cost of the "model" approach, would you agree with my analysis and conclusions? Namely:

a. "limit-and-check" and "reservation hypercall" are similar wrt guest creation when there is enough memory currently free to build all requested guests
b. "limit-and-check" may be slower if some guests can succeed in being built but others must wait for memory to be freed up, since the "check" has to wait for current guests to finish building
c. (From further back) One downside of a pure "limit-and-check" approach is that while VMs are being built, VMs cannot increase in size, even if there is "free" memory (not being used to build the currently-building domain(s)) or if another VM can be ballooned down.
d. "model"-based approaches can mitigate b and c, at the cost of a more complicated algorithm

  - This still has the race issue - how much memory you see vs the
    moment you launch it. Granted you can avoid it by having a "fudge"
    factor (so when a guest says it wants 1G you know it actually
    needs an extra 100MB on top of the 1GB or so). The claim hypercall
    would count all of that for you so you don't have to race.
I'm sorry, what race / fudge factor are you talking about?

