
Re: [Xen-devel] [PATCH 1/6] mmu: Introduce XENMEM_claim_pages (subop of memory ops).



This patch is fine by me at this point.

Acked-by: Keir Fraser <keir@xxxxxxx>

On 04/03/2013 17:47, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:

> From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
> 
> When guests' memory consumption is volatile (multiple guests
> ballooning up/down) we are presented with the problem of
> determining exactly how much memory there is
> for allocation of new guests without negatively impacting
> existing guests. Note that the existing models (xapi, xend)
> drive the memory consumption from the tool-stack and assume
> that the guest will eventually hit the memory target. Other
> models, such as the dynamic memory utilized by tmem, do this
> differently - the guest drives the memory consumption (up
> to the d->max_pages ceiling). With the dynamic memory model, the
> guest can frequently balloon up and down as it sees fit.
> This presents the problem that the toolstack does not
> know atomically how much free memory there is (the information
> goes stale the moment the d->tot_pages information is provided
> to the tool-stack), and hence starting a guest can fail
> during the memory creation process, especially if the process
> is done in parallel. In a nutshell, what we need is an atomic
> view of all domains' tot_pages during the allocation of guests.
> Naturally holding a lock for such a long time is unacceptable.
> Hence the goal of this hypercall is to attempt to atomically and very
> quickly determine if there are sufficient pages available in the
> system and, if so, "set aside" that quantity of pages for future
> allocations by that domain.  Unlike an existing hypercall such as
> increase_reservation or populate_physmap, specific physical
> pageframes are not assigned to the domain because this
> cannot be done sufficiently quickly (especially for very large
> allocations in an arbitrarily fragmented system) and so the
> existing mechanisms result in classic time-of-check-time-of-use
> (TOCTOU) races.  One can think of claiming as similar to a
> "lazy" allocation, but subsequent hypercalls are required
> to do the actual physical pageframe allocation.
> 
> Note that one of the effects of this hypercall is that, from the
> perspective of other running guests, suddenly there is
> a new guest occupying X amount of pages. This means that when
> they try to balloon up they will hit the system-wide ceiling of
> available free memory (if the total sum of the existing d->max_pages
> >= host memory). This is OK, as that is part of the overcommit.
> What we DO NOT want to do is dictate what their ceiling should be
> (d->max_pages) as that is risky and can lead to guests OOM-ing.
> It is something the guest needs to figure out.
> 
> In order for a toolstack to "get" information about whether
> a domain has a claim and, if so, how large, and also for
> the toolstack to measure the total system-wide claim, a
> second subop has been added and exposed through domctl
> and libxl (see "xen: XENMEM_claim_pages: xc").
> 
> == Alternative solutions ==
> There has been a variety of discussion on whether the problem
> this hypercall solves can be handled in user-space, such as:
>  - For all the existing guests, set their d->max_pages temporarily
>    to d->tot_pages and create the domain. This forces those
>    domains to stay at their current consumption level (fyi, this is what
>    the tmem freeze call is doing). The disadvantage is that it
>    needlessly forces the guests to stay at their current memory usage
>    instead of allowing them to decide the optimal target.
>  - Account for free memory using only d->max_pages.
>    This ignores ballooning changes and any over-commit scenario. It
>    only works in the scenario where the sum of all d->max_pages (including
>    the one to be allocated now) on the host is smaller than the available
>    free memory; as such it ignores the over-commit problem.
>  - Provide a ring/FIFO along with an event channel to notify a userspace
>    daemon of guests' memory consumption. This daemon can then provide
>    up-to-date information to the toolstack about how much free memory
>    there is. This duplicates what the hypervisor is already doing and
>    introduces latency issues, leaving the toolstack struggling to catch
>    up, as there might be millions of these updates on a heavily used
>    machine. There might not be any quiescent state ever, so the toolstack
>    would heavily consume CPU cycles and never provide up-to-date information.
> 
> It has been noted that this claim mechanism solves the
> underlying problem (slow failure of domain creation) for
> a large class of domains but not all, specifically not
> handling (but also not making the problem worse for) PV
> domains that specify the "superpages" flag, and 32-bit PV
> domains on large RAM systems.  These will be addressed at a
> later time.
> 
> Code overview:
> 
> Though the hypercall simply does arithmetic within locks,
> some of the semantics in the code may be a bit subtle.
> 
> The key variables (d->unclaimed_pages and total_unclaimed_pages)
> start at zero if no claim has yet been staked for any domain.
> (Perhaps a better name is "claimed_but_not_yet_possessed" but that's
> a bit unwieldy.)  If no claim hypercalls are executed, there
> should be no impact on existing usage.
> 
> When a claim is successfully staked by a domain, it is like a
> watermark but there is no record kept of the size of the claim.
> Instead, d->unclaimed_pages is set to the difference between
> the claim and d->tot_pages.  When d->tot_pages increases or decreases,
> d->unclaimed_pages atomically decreases or increases.  Once
> d->unclaimed_pages reaches zero, the claim is satisfied and
> d->unclaimed pages stays at zero -- unless a new claim is
> subsequently staked.
> 
> The systemwide variable total_unclaimed_pages is always the sum
> of d->unclaimed_pages, across all domains.  A non-domain-
> specific heap allocation will fail if total_unclaimed_pages
> exceeds free (plus, on tmem enabled systems, freeable) pages.
> 
> Claim semantics may be modified by flags.  The initial
> implementation has three flags, which discern whether the
> caller would like tmem freeable pages to be considered
> in determining whether or not the claim can be successfully
> staked. Future flags may, for example, specify that the
> caller would like the claim to be successful only if there
> are sufficient pages available on a single node (per Dario's
> suggestion).
> 
> A claim can be cancelled by requesting a claim with the
> XENMEM_CLAIMF_cancel flag.
> 
> A second subop returns the total outstanding claimed pages
> systemwide.
> 
> Note: Save/restore/migrate may need to be modified,
> else it can be documented that all claims are cancelled.
> 
> ==
> This patch of the proposed XENMEM_claim_pages hypercall/subop, takes
> into account review feedback from Jan and Keir and IanC and Matthew Daley,
> plus some fixes found via runtime debugging.
> 
> v9->v10:
>  - Altered the API a bit (new flags, etc).
> 
> v8->v9:
> - Drop snippets already merged by Keir
>   (26269:21a5b181f8ad and 26270:03cb71bc32f9)
> 
> v7->v8:
> - Replace missing tot_pages adjustment [JBeulich]
> - Combine domain_inc/dec_tot_pages into one function [JBeulich]
> - Added a few more clarifying comments [djm]
> 
> v6->v7:
> - Need co-existence with low_mem_virq [JBeulich]
> 
> v5->v6:
> - Fix post-inc problem [mattjd]
> 
> v4->v5:
> - Split tools part into separate patch [JBeulich]
> - Minor coding style fixups [JBeulich]
> - Use rcu_lock_domain_by_id, not _target version [JBeulich]
> 
> v3->v4: (please ignore v3)
> - Process error, sent stale patch, sorry [djm]
> 
> v2->v3:
> - Add per-domain and global "get" for unclaimed info [JBeulich]
> - New hypercall(s) should fail for unpriv callers [IanC]
> - Non-zero extent_order disallowed [JBeulich]
> - Remove bonehead ASSERTs [JBeulich]
> - Need not hold heaplock for decrease case too [JBeulich]
> - More descriptive failure return values [JBeulich]
> - Don't allow a claim to exceed max_pages [IanC]
> - Subops must be in correct ifdef block in memory.c [keir]
> 
> v1->v2:
> - Add reset-to-zero page claim in domain_kill [JBeulich]
> - Proper handling of struct passed to hypercall [JBeulich]
> - Fix alloc_heap_pages when a domain has a claim [JBeulich]
> - Need not hold heap_lock if !d->unclaimed_pages [keir]
> - Fix missed tot_pages call in donate_page [djm]
> - Remove domain_reset_unclaimed_pages; use set with zero [djm]
> - Bugfixes found through testing in set_unclaimed [djm]
> - More comments in code [djm]
> - Code formatting fixes [djm]
> ===
> 
> Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  xen/common/domain.c         |   1 +
>  xen/common/domctl.c         |   1 +
>  xen/common/memory.c         |  31 +++++++++++
>  xen/common/page_alloc.c     | 126 +++++++++++++++++++++++++++++++++++++++++++-
>  xen/include/public/domctl.h |   3 +-
>  xen/include/public/memory.h |  47 ++++++++++++++++-
>  xen/include/xen/mm.h        |   4 ++
>  xen/include/xen/sched.h     |   1 +
>  8 files changed, 209 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index b360de1..90ce40b 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -507,6 +507,7 @@ int domain_kill(struct domain *d)
>          evtchn_destroy(d);
>          gnttab_release_mappings(d);
>          tmem_destroy(d->tmem);
> +        domain_set_unclaimed_pages(d, 0, 0);
>          d->tmem = NULL;
>          /* fallthrough */
>      case DOMDYING_dying:
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index b7f6619..98d6f50 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -154,6 +154,7 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
>  
>      info->tot_pages         = d->tot_pages;
>      info->max_pages         = d->max_pages;
> +    info->unclaimed_pages   = d->unclaimed_pages;
>      info->shr_pages         = atomic_read(&d->shr_pages);
>      info->paged_pages       = atomic_read(&d->paged_pages);
>      info->shared_info_frame = mfn_to_gmfn(d, virt_to_mfn(d->shared_info));
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 08550ef..c27bba5 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -712,6 +712,37 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          break;
>      }
>  
> +    case XENMEM_claim_pages:
> +        if ( !IS_PRIV(current->domain) )
> +            return -EPERM;
> +
> +        if ( copy_from_guest(&reservation, arg, 1) )
> +            return -EFAULT;
> +
> +        if ( !guest_handle_is_null(reservation.extent_start) )
> +            return -EINVAL;
> +
> +        if ( reservation.extent_order != 0 )
> +            return -EINVAL;
> +
> +        d = rcu_lock_domain_by_id(reservation.domid);
> +        if ( d == NULL )
> +            return -EINVAL;
> +
> +        rc = domain_set_unclaimed_pages(d, reservation.nr_extents,
> +                                        reservation.mem_flags);
> +
> +        rcu_unlock_domain(d);
> +
> +        break;
> +
> +    case XENMEM_get_unclaimed_pages:
> +        if ( !IS_PRIV(current->domain) )
> +            return -EPERM;
> +
> +        rc = get_total_unclaimed_pages();
> +        break;
> +
>      default:
>          rc = arch_memory_op(op, arg);
>          break;
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 9e9fb15..2442aba 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -252,11 +252,124 @@ static long midsize_alloc_zone_pages;
>  #define MIDSIZE_ALLOC_FRAC 128
>  
>  static DEFINE_SPINLOCK(heap_lock);
> +static long total_unclaimed_pages; /* total outstanding claims by all domains */
>  
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>  {
> +    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> +
>      ASSERT(spin_is_locked(&d->page_alloc_lock));
> -    return d->tot_pages += pages;
> +    d->tot_pages += pages;
> +
> +    /*
> +     * can test d->unclaimed_pages race-free because it can only change
> +     * if d->page_alloc_lock and heap_lock are both held, see also
> +     * domain_set_unclaimed_pages below
> +     */
> +    if ( !d->unclaimed_pages )
> +        goto out;
> +
> +    spin_lock(&heap_lock);
> +    /* adjust domain unclaimed_pages; may not go negative */
> +    dom_before = d->unclaimed_pages;
> +    dom_after = dom_before - pages;
> +    if ( (dom_before > 0) && (dom_after < 0) )
> +        dom_claimed = 0;
> +    else
> +        dom_claimed = dom_after;
> +    d->unclaimed_pages = dom_claimed;
> +    /* flag accounting bug if system unclaimed_pages would go negative */
> +    sys_before = total_unclaimed_pages;
> +    sys_after = sys_before - (dom_before - dom_claimed);
> +    BUG_ON((sys_before > 0) && (sys_after < 0));
> +    total_unclaimed_pages = sys_after;
> +    spin_unlock(&heap_lock);
> +
> +out:
> +    return d->tot_pages;
> +}
> +
> +int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
> +                                unsigned long flags)
> +{
> +    int ret = -ENOMEM;
> +    unsigned long claim, avail_pages;
> +
> +    /*
> +     * Sanity check.
> +     */
> +    switch ( flags ) {
> +    case XENMEM_CLAIMF_normal:
> +    case XENMEM_CLAIMF_tmem:
> +        if ( pages == 0 )
> +            return -EINVAL;
> +        break;
> +    case XENMEM_CLAIMF_cancel:
> +        pages = 0;
> +        break;
> +    default:
> +        return -ENOSYS;
> +    }
> +
> +    /*
> +     * take the domain's page_alloc_lock, else all d->tot_page adjustments
> +     * must always take the global heap_lock rather than only in the much
> +     * rarer case that d->unclaimed_pages is non-zero
> +     */
> +    spin_lock(&d->page_alloc_lock);
> +    spin_lock(&heap_lock);
> +
> +    /* pages==0 means "unset" the claim. */
> +    if ( pages == 0 )
> +    {
> +        total_unclaimed_pages -= d->unclaimed_pages;
> +        d->unclaimed_pages = 0;
> +        ret = 0;
> +        goto out;
> +    }
> +
> +    /* only one active claim per domain please */
> +    if ( d->unclaimed_pages )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> +    /* disallow a claim that does not exceed current tot_pages or exceeds max_pages */
> +    if ( (pages <= d->tot_pages) || (pages > d->max_pages) )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> +    /* how much memory is available? */
> +    avail_pages = total_avail_pages;
> +    if ( flags & XENMEM_CLAIMF_tmem )
> +        avail_pages += tmem_freeable_pages();
> +    avail_pages -= total_unclaimed_pages;
> +
> +    /*
> +     * note, if domain has already allocated memory before making a claim
> +     * then the claim must take tot_pages into account
> +     */
> +    claim = pages - d->tot_pages;
> +    if ( claim > avail_pages )
> +        goto out;
> +
> +    /* yay, claim fits in available memory, stake the claim, success! */
> +    d->unclaimed_pages = claim;
> +    total_unclaimed_pages += d->unclaimed_pages;
> +    ret = 0;
> +
> +out:
> +    spin_unlock(&heap_lock);
> +    spin_unlock(&d->page_alloc_lock);
> +    return ret;
> +}
> +
> +long get_total_unclaimed_pages(void)
> +{
> +    return total_unclaimed_pages;
>  }
>  
>  static unsigned long init_node_heap(int node, unsigned long mfn,
> @@ -397,7 +510,7 @@ static void __init setup_low_mem_virq(void)
>  static void check_low_mem_virq(void)
>  {
>      unsigned long avail_pages = total_avail_pages +
> -        (opt_tmem ? tmem_freeable_pages() : 0);
> +        (opt_tmem ? tmem_freeable_pages() : 0) - total_unclaimed_pages;
>  
>      if ( unlikely(avail_pages <= low_mem_virq_th) )
>      {
> @@ -466,6 +579,15 @@ static struct page_info *alloc_heap_pages(
>      spin_lock(&heap_lock);
>  
>      /*
> +     * Claimed memory is considered unavailable unless the request
> +     * is made by a domain with sufficient unclaimed pages.
> +     */
> +    if ( (total_unclaimed_pages + request >
> +          total_avail_pages + tmem_freeable_pages()) &&
> +          (d == NULL || d->unclaimed_pages < request) )
> +        goto not_found;
> +
> +    /*
>       * TMEM: When available memory is scarce due to tmem absorbing it, allow
>       * only mid-size allocations to avoid worst of fragmentation issues.
>       * Others try tmem pools then fail.  This is a workaround until all
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index deb19db..efae421 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -36,7 +36,7 @@
>  #include "grant_table.h"
>  #include "hvm/save.h"
>  
> -#define XEN_DOMCTL_INTERFACE_VERSION 0x00000008
> +#define XEN_DOMCTL_INTERFACE_VERSION 0x00000009
>  
>  /*
>   * NB. xen_domctl.domain is an IN/OUT parameter for this operation.
> @@ -95,6 +95,7 @@ struct xen_domctl_getdomaininfo {
>      uint32_t flags;              /* XEN_DOMINF_* */
>      uint64_aligned_t tot_pages;
>      uint64_aligned_t max_pages;
> +    uint64_aligned_t unclaimed_pages;
>      uint64_aligned_t shr_pages;
>      uint64_aligned_t paged_pages;
>      uint64_aligned_t shared_info_frame; /* GMFN of shared_info struct */
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 1c5ca19..5c264ca 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -68,6 +68,8 @@ struct xen_memory_reservation {
>       *   IN:  GPFN bases of extents to populate with memory
>       *   OUT: GMFN bases of extents that were allocated
>       *   (NB. This command also updates the mach_to_phys translation table)
> +     * XENMEM_claim_pages:
> +     *   IN: must be zero
>       */
>      XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
>  
> @@ -430,10 +432,51 @@ typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>  
>  /*
> - * Reserve ops for future/out-of-tree "claim" patches (Oracle)
> + * Attempt to stake a claim for a domain on a quantity of pages
> + * of system RAM, but _not_ assign specific pageframes.  Only
> + * arithmetic is performed so the hypercall is very fast and need
> + * not be preemptible, thus sidestepping time-of-check-time-of-use
> + * races for memory allocation.  Returns 0 if the hypervisor page
> + * allocator has atomically and successfully claimed the requested
> + * number of pages, else non-zero.
> + *
> + * Any domain may have only one active claim.  When sufficient memory
> + * has been allocated to resolve the claim, the claim silently expires.
> + * Claiming zero pages effectively resets any outstanding claim and
> + * is always successful.
> + *
> + * Note that a valid claim may be staked even after memory has been
> + * allocated for a domain.  In this case, the claim is not incremental,
> + * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
> + * only 7 additional pages are claimed.
> + *
> + * Caller must be privileged or the hypercall fails.
>   */
>  #define XENMEM_claim_pages                  24
> -#define XENMEM_get_unclaimed_pages          25
> +
> +/*
> + * XENMEM_claim_pages flags:
> + *  normal: claim is successful only if sufficient free pages
> + *    are available.
> + *  tmem: claim is successful only if sufficient free pages
> + *    are available and (if tmem is enabled) hypervisor
> + *    may also consider tmem "freeable" pages to satisfy the claim.
> + *  cancel: cancel the outstanding claim for the domain.
> + */
> +#define XENMEM_CLAIMF_ignored                0
> +
> +#define _XENMEM_CLAIMF_normal                1
> +#define XENMEM_CLAIMF_normal                 (1U<<_XENMEM_CLAIMF_normal)
> +#define _XENMEM_CLAIMF_tmem                  2
> +#define XENMEM_CLAIMF_tmem                   (1U<<_XENMEM_CLAIMF_tmem)
> +#define _XENMEM_CLAIMF_cancel                3
> +#define XENMEM_CLAIMF_cancel                 (1U<<_XENMEM_CLAIMF_cancel)
> +/*
> + * Get the number of pages currently claimed (but not yet "possessed")
> + * across all domains.  The caller must be privileged but otherwise
> + * the call never fails.
> + */
> +#define XENMEM_get_unclaimed_pages            25
>  
>  #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>  
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index 2f701f5..f4ab991 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -49,7 +49,11 @@ void free_xenheap_pages(void *v, unsigned int order);
>  #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
>  #define free_xenheap_page(v) (free_xenheap_pages(v,0))
>  
> +/* Claim handling */
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages);
> +int domain_set_unclaimed_pages(
> +    struct domain *d, unsigned long pages, unsigned long flags);
> +long get_total_unclaimed_pages(void);
>  
>  /* Domain suballocator. These functions are *not* interrupt-safe.*/
>  void init_domheap_pages(paddr_t ps, paddr_t pe);
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index e108436..9c5263f 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -242,6 +242,7 @@ struct domain
>      struct page_list_head page_list;  /* linked list */
>      struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
>      unsigned int     tot_pages;       /* number of pages currently possessed */
>      unsigned int     unclaimed_pages; /* pages claimed but not possessed */
>      unsigned int     max_pages;       /* maximum value for tot_pages */
>      atomic_t         shr_pages;       /* number of shared pages */
>      atomic_t         paged_pages;     /* number of paged-out pages */



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

