
Re: [Xen-devel] Proposed XENMEM_claim_pages hypercall: Analysis of problem and alternate solutions

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
Date: Mon, 14 Jan 2013 18:28:48 +0000
Cc: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, "Keir \(Xen.org\)" <keir@xxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx>, "Tim \(Xen.org\)" <tim@xxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
Delivery-date: Mon, 14 Jan 2013 18:35:56 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 02/01/13 21:59, Konrad Rzeszutek Wilk wrote:

Thanks for the clarification. I am not that fluent in the OCaml code.

I'm not fluent in OCaml either, I'm mainly going from memory based on the discussions I had with the author when it was being designed, as well as discussions with the xapi team when dealing with bugs at later points.

When a request comes in for a certain amount of memory, it will go
and set each VM's max_pages, and the max tmem pool size.  It can
then check whether there is enough free memory to complete the
allocation or not (since there's a race between checking how much
memory a guest is using and setting max_pages).  If that succeeds,
it can return "success".  If, while that VM is being built, another
request comes in, it can again go around and set the max sizes
lower.  It has to know how much of the memory is "reserved" for the
first guest being built, but if there's enough left after that, it
can return "success" and allow the second VM to start being built.

After the VMs are built, the toolstack can remove the limits again
if it wants, again allowing the free flow of memory.

This sounds to me like what Xapi does?

No, AFAIK xapi always sets the max_pages to what it wants the guest to be using at any given time. I talked about removing the limits (and about operating without limits in the normal case) because it seems like something that Oracle wants (having to do with tmem).

Do you see any problems with this scheme?  All it requires is for
the toolstack to be able to temporarily set limits on both guests
ballooning up and on tmem allocating more than a certain amount of
memory.  We already have mechanisms for the first, so if we had a
"max_pages" for tmem, then you'd have all the tools you need to
implement it.

Off the top of my head, the things that come to mind are:
  - The 'lock' over the memory usage (so the tmem freeze + maxpages set)
    looks to solve the launching in parallel of guests.
    It will allow us to launch multiple guests - but it will also mean
    suppressing the tmem asynchronous calls and having to balloon up/down
    the guests. The claim hypercall does not do any of those and
    gives a definite 'yes' or 'no'.

So when you say, "tmem freeze", are you specifically talking about not allowing tmem to allocate more memory (what I called a "max_pages" for tmem)? Or is there more to it?

Secondly, just to clarify: when a guest is using memory from the tmem pool, is that added to tot_pages?

I'm not sure what "gives a definite yes or no" is supposed to mean -- the scheme I described also gives a definite yes or no.

In any case, your point about ballooning is taken: if we set max_pages for a VM and just leave it there while VMs are being built, then VMs cannot balloon up, even if there is "free" memory (i.e., memory that will not be used for the currently-building VM), and memory cannot be moved *between* VMs either (i.e., by ballooning down one and ballooning the other up). Both of these can be done by extending the toolstack with a memory model (see below), but that adds an extra level of complication.

  - Complex code that has to keep track of this in the user-space.
    It also has to know of the extra 'reserved' space that is associated
    with a guest. I am not entirely sure how that would couple with
    PCI passthrough. The claim hypercall is fairly simple - albeit
    having it extended to do Super pages and 32-bit guests could make this
    longer.

What do you mean by the extra 'reserved' space? And what potential issues are there with PCI passthrough?

To be accepted, the reservation hypercall will certainly have to be extended to do superpages and 32-bit guests, so that's the case we should be considering.

  - I am not sure whether the toolstack can manage all the memory
    allocation. It sounds like it could but I am just wondering if there
    are some extra corners that we hadn't thought of.

Wouldn't the same argument apply to the reservation hypercall? Suppose that there was enough domain memory but not enough Xen heap memory, or not enough of some other resource -- the hypercall might succeed, but then the domain build would still fail at some later point when the other resource allocation failed.

  - Latency. With the locks being placed on the pools of memory the
    existing workload can be negatively affected. Say that this means we
    need to balloon down a couple hundred guests, then launch the new
    guest. This process of 'lower all of them by X', let's check the
    'free amount'. Oh nope - not enough - let's do this again. That would
    delay the creation process.

    The claim hypercall will avoid all of that by just declaring:
    "This is how much you will get." without having to balloon the rest
    of the guests.

    Here is how I see what your toolstack would do:

      [serial]
        1). Figure out how much memory we need for X guests.
        2). round-robin existing guests to decrease their memory
            consumption (if they can be ballooned down). Or this
            can be executed in parallel for the guests.
        3). check if the amount of free memory is at least X
            [this check has to be done in serial]
      [parallel]
        4). launch multiple guests at the same time.

    The claim hypercall would avoid the '3' part b/c it is inherently
    part of the Xen's MM bureaucracy. It would allow:

      [parallel]
        1). claim hypercall for X guest.
        2). if any of the claims return 0 (so success), then launch guest
        3). if the errno was -ENOMEM then:
      [serial]
         3a). round-robin existing guests to decrease their memory
              consumption if allowed. Goto 1).

    So the 'error-case' only has to run in the slow-serial case.
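
To make the quoted claim flow concrete, here is a rough sketch of it as a toy simulation. Everything in it is an assumption made for illustration: `ToyHypervisor`, `claim_pages`, `balloon_down` and `launch_with_claims` are hypothetical names, not the proposed XENMEM_claim_pages interface or any toolstack API. It only shows the shape of the loop: claims are serialized by a lock inside the "hypervisor", and the ballooning step runs only when some claim came back with -ENOMEM.

```python
import errno
import threading


class ToyHypervisor:
    """Toy stand-in for the hypervisor's memory accounting (not Xen code)."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self.claimed = {}
        self._lock = threading.Lock()  # claims are serialized in here

    def claim_pages(self, domid, nr_pages):
        """Return 0 on success, -ENOMEM if the claim cannot be satisfied."""
        with self._lock:
            if nr_pages > self.free_pages:
                return -errno.ENOMEM
            self.free_pages -= nr_pages
            self.claimed[domid] = nr_pages
            return 0

    def balloon_down(self, nr_pages):
        """Pretend existing guests handed back nr_pages."""
        with self._lock:
            self.free_pages += nr_pages


def launch_with_claims(hv, requests, reclaimable):
    """requests maps domid -> pages; returns the domids whose claim succeeded."""
    launched = []
    pending = dict(requests)
    while pending:
        failed = {}
        for domid, pages in pending.items():   # step 1: could run in parallel
            if hv.claim_pages(domid, pages) == 0:
                launched.append(domid)         # step 2: launch the guest
            else:
                failed[domid] = pages          # step 3: got -ENOMEM
        if not failed or reclaimable == 0:
            break
        # step 3a (serial): balloon existing guests down, then go to 1
        need = sum(failed.values()) - hv.free_pages
        give = min(need, reclaimable)
        hv.balloon_down(give)
        reclaimable -= give
        pending = failed
    return launched
```

With 100 free pages, two 60-page requests and 50 reclaimable pages, the first request is claimed straight away and the second only after one pass through the serial step.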

Hmm, I don't think what you wrote about mine is quite right. Here's what I had in mind for mine (let me call it "limit-and-check"):


[serial]
1) Set limits on all guests, and tmem, and see how much memory is left.
2) Read free memory.
[parallel]
2a) Claim memory for each guest from freshly-calculated pool of free memory.
3) For each claim that can be satisfied, launch a guest.
4) If there are guests that can't be satisfied with the current free memory, then:
[serial]
4a) Round-robin existing guests to decrease their memory consumption if allowed. Goto 2.
5) Remove limits on guests.
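
The steps above can be sketched roughly as follows. This is a toy model under stated assumptions, not toolstack code: `Guest`, `min_pages` and `limit_and_check` are hypothetical names, "free memory" is computed from the toy state rather than read from the hypervisor, tmem is left out, and step 4a is simplified to ballooning every guest all the way down to its minimum in one pass.

```python
class Guest:
    def __init__(self, name, tot_pages, min_pages):
        self.name = name
        self.tot_pages = tot_pages   # pages currently held
        self.min_pages = min_pages   # lowest size it may be ballooned to
        self.max_pages = None        # None means "no limit"


def limit_and_check(total_pages, guests, requests):
    """Return (names to build now, names that still cannot be satisfied)."""
    # 1) Set limits so no existing guest can grow while we work.
    for g in guests:
        g.max_pages = g.tot_pages

    building, waiting = [], dict(requests)
    reserved = 0  # pages promised to guests that are already being built
    while waiting:
        # 2) Read free memory; pages promised to earlier builds are not free.
        free = total_pages - sum(g.tot_pages for g in guests) - reserved
        # 2a/3) Claim from the freshly calculated pool and launch what fits.
        for name, pages in list(waiting.items()):
            if pages <= free:
                free -= pages
                reserved += pages
                building.append(name)
                del waiting[name]
        if not waiting:
            break
        # 4a) Balloon existing guests down where allowed, then go to 2.
        reclaimed = 0
        for g in guests:
            spare = g.tot_pages - g.min_pages
            if spare > 0:
                g.tot_pages -= spare
                g.max_pages = g.tot_pages
                reclaimed += spare
        if reclaimed == 0:
            break  # nothing left to reclaim; the rest must wait

    # 5) Remove limits on guests.
    for g in guests:
        g.max_pages = None
    return building, sorted(waiting)
```

The `reserved` counter is the piece that stands in for "knowing how much of the memory is reserved for the guest being built"; without it, a second read of free memory would count pages that an in-progress build has not allocated yet.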

Note that 1 would only be done for the first such "request", and 5 would only be done after all such requests have succeeded or failed. Also note that steps 1 and 5 are only necessary if you want to go without such limits -- xapi doesn't do them, because it always keeps max_pages set to what it wants the guest to be using.

Also, note that the "claiming" (2a for mine above and 1 for yours) has to be serialized with other "claims" in both cases (in the reservation hypercall case, with a lock inside the hypervisor), but that the building can begin in parallel with the "claiming" in both cases.

But I think I do see what you're getting at. The "free memory" measurement has to be taken when the system is in a "quiescent" state -- or at least a "grow only" state -- otherwise it's meaningless. So #4a should really be:

4a) Round-robin existing guests to decrease their memory consumption if allowed.
4b) Wait for currently-building guests to finish building (if any), then go to #2.

So suppose the following cases, in which several requests for guest creation come in over a short period of time (not necessarily all at once):

A. There is enough memory for all requested VMs to be built without ballooning / something else
B. There is enough for some, but not all of the VMs to be built without ballooning / something else

In case A, then I think "limit-and-check" and "reservation hypercall" should perform the same. For each new request that comes in, the toolstack can say, "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM. So I should have 48GiB left, enough to build this 32GiB VM." "Well, when I checked I had 64GiB free; then I started to build a 16GiB VM and a 32GiB VM, so I should have 16GiB left, enough to be able to build this 16GiB VM."
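
The bookkeeping in that example is just one measurement followed by subtraction. As a minimal sketch (the `FreeMemoryLedger` class and its method name are made up for illustration):

```python
GIB = 1 << 30


class FreeMemoryLedger:
    """Toolstack-side bookkeeping: measure once, subtract each build started."""

    def __init__(self, measured_free):
        self.remaining = measured_free

    def try_start_build(self, size):
        """Reserve `size` bytes if they fit in what is left; report the outcome."""
        if size > self.remaining:
            return False
        self.remaining -= size
        return True


ledger = FreeMemoryLedger(64 * GIB)          # "when I checked I had 64GiB free"
results = [ledger.try_start_build(n * GIB) for n in (16, 32, 16)]
```

All three requests fit, leaving nothing over; a fourth request of any size would be refused until a fresh measurement is taken.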

The main difference comes in case B. The "reservation hypercall" method will not have to wait until all existing guests have finished building to be able to start subsequent guests; but "limit-and-check" would have to wait until the currently-building guests are finished before doing another check.

This limitation doesn't apply to xapi, because it doesn't use the hypervisor's free memory as a measure of the memory it has available to it. Instead, it keeps an internal model of the free memory the hypervisor has available. This is based on MAX(current_target, tot_pages) of each guest (where "current_target" for a domain in the process of being built is the amount of memory it will have eventually). We might call this the "model" approach.
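
As I understand it, the core of the "model" is a one-line calculation. The sketch below is my reading of the description above, not xapi's actual code; `modelled_free`, `raw_free` and the tuple layout are assumptions for illustration.

```python
def modelled_free(host_pages, domains):
    """Free memory according to the model.

    domains is an iterable of (current_target, tot_pages) pairs. Each domain
    is charged for whichever is larger, so a half-built domain already counts
    at its final size and a domain still ballooning down counts at its
    present size.
    """
    return host_pages - sum(max(target, tot) for target, tot in domains)


def raw_free(host_pages, domains):
    """What a naive read of free memory would see: only pages allocated so far."""
    return host_pages - sum(tot for _, tot in domains)


domains = [
    (30, 30),  # steady-state guest
    (20, 25),  # guest asked to balloon down to 20, still holding 25
    (40, 10),  # guest being built: will end up with 40, has 10 so far
]
```

On a 100-page host the raw figure is 35 pages free, but the model reports 5, which is why the model stays usable while builds are still in progress.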

We could extend "limit-and-check" to "limit-check-and-model" (i.e., estimate how much memory is really free after ballooning based on the guests' tot_pages), or "limit-model" (basically, fully switch to a xapi-style "model" approach while you're doing domain creation). That would be significantly more complicated. On the other hand, a lot of the work has already been done by the XenServer team, and (I believe) the code in question is all GPL'ed, so Oracle could just take the algorithms and adapt them with just a bit of tweaking (and a bit of code translation). It seems to me that the "model" approach brings a lot of other benefits as well.

But at any rate -- without debating the value or cost of the "model" approach, would you agree with my analysis and conclusions? Namely:

a. "limit-and-check" and "reservation hypercall" are similar wrt guest creation when there is enough memory currently free to build all requested guests
b. "limit-and-check" may be slower if some guests can succeed in being built but others must wait for memory to be freed up, since the "check" has to wait for current guests to finish building
c. (From further back) One downside of a pure "limit-and-check" approach is that while VMs are being built, VMs cannot increase in size, even if there is "free" memory (not being used to build the currently-building domain(s)) or if another VM can be ballooned down.
d. "model"-based approaches can mitigate b and c, at the cost of a more complicated algorithm

  - This still has the race issue - how much memory you see vs the
    moment you launch it. Granted you can avoid it by having a "fudge"
    factor (so when a guest says it wants 1G you know it actually
    needs an extra 100MB on top of the 1GB or so). The claim hypercall
    would count all of that for you so you don't have to race.

I'm sorry, what race / fudge factor are you talking about?

 -George
