Re: [Xen-devel] [PATCH 1/6] mmu: Introduce XENMEM_claim_pages (subop of memory ops).
This patch is fine by me at this point.

Acked-by: Keir Fraser <keir@xxxxxxx>

On 04/03/2013 17:47, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:

> From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
>
> When guests' memory consumption is volatile (multiple guests
> ballooning up/down) we are presented with the problem of
> determining exactly how much memory there is
> for allocation of new guests without negatively impacting
> existing guests. Note that the existing models (xapi, xend)
> drive the memory consumption from the tool-stack and assume
> that the guest will eventually hit the memory target. Other
> models, such as the dynamic memory utilized by tmem, do this
> differently - the guest drives the memory consumption (up
> to the d->max_pages ceiling). With the dynamic memory model, the
> guest can frequently balloon up and down as it sees fit.
> This presents a problem for the toolstack: it does not
> know atomically how much free memory there is (the information
> gets stale the moment the d->tot_pages information is provided
> to the tool-stack), and hence starting a guest can fail
> during the memory creation process, especially if guests
> are created in parallel. In a nutshell, what we need is an atomic
> value of all domains' tot_pages during the allocation of guests.
> Naturally, holding a lock for such a long time is unacceptable.
> Hence the goal of this hypercall is to attempt to atomically and very
> quickly determine if there are sufficient pages available in the
> system and, if so, "set aside" that quantity of pages for future
> allocations by that domain. Unlike an existing hypercall such as
> increase_reservation or populate_physmap, specific physical
> pageframes are not assigned to the domain because this
> cannot be done sufficiently quickly (especially for very large
> allocations in an arbitrarily fragmented system) and so the
> existing mechanisms result in classic time-of-check-time-of-use
> (TOCTOU) races.
> One can think of claiming as similar to a
> "lazy" allocation, but subsequent hypercalls are required
> to do the actual physical pageframe allocation.
>
> Note that one of the effects of this hypercall is that, from the
> perspective of other running guests, suddenly there is
> a new guest occupying X amount of pages. This means that when
> they try to balloon up they will hit the system-wide ceiling of
> available free memory (if the total sum of the existing d->max_pages
> >= host memory). This is OK - it is part of the overcommit.
> What we DO NOT want to do is dictate what their ceiling should be
> (d->max_pages) as that is risky and can lead to guests OOM-ing.
> It is something the guest needs to figure out.
>
> In order for a toolstack to "get" information about whether
> a domain has a claim and, if so, how large, and also for
> the toolstack to measure the total system-wide claim, a
> second subop has been added and exposed through domctl
> and libxl (see "xen: XENMEM_claim_pages: xc").
>
> == Alternative solutions ==
> There has been a variety of discussion about whether the problem
> this hypercall is solving can be done in user-space, such as:
> - For all the existing guests, set their d->max_pages temporarily
>   to d->tot_pages and create the domain. This forces those
>   domains to stay at their current consumption level (fyi, this is what
>   the tmem freeze call is doing). The disadvantage of this is
>   that it needlessly forces the guests to stay at their current memory
>   usage instead of allowing them to decide the optimal target.
> - Account for free memory using only d->max_pages.
>   This ignores ballooning changes and any over-commit scenario. This
>   is similar to the scenario where the sum of all d->max_pages (and
>   the one to be allocated now) on the host is smaller than the available
>   free memory. As such it ignores the over-commit problem.
> - Provide a ring/FIFO along with an event channel to notify a userspace
>   daemon of guests' memory consumption. This daemon can then provide
>   up-to-date information to the toolstack of how much free memory
>   there is. This duplicates what the hypervisor is already doing and
>   introduces latency issues, and leaves the toolstack catching its
>   breath, as there might be millions of these updates on a heavily
>   used machine. There might not be any quiescent state ever, and the
>   toolstack will heavily consume CPU cycles and never provide
>   up-to-date information.
>
> It has been noted that this claim mechanism solves the
> underlying problem (slow failure of domain creation) for
> a large class of domains but not all, specifically not
> handling (but also not making the problem worse for) PV
> domains that specify the "superpages" flag, and 32-bit PV
> domains on large RAM systems. These will be addressed at a
> later time.
>
> Code overview:
>
> Though the hypercall simply does arithmetic within locks,
> some of the semantics in the code may be a bit subtle.
>
> The key variables (d->unclaimed_pages and total_unclaimed_pages)
> start at zero if no claim has yet been staked for any domain.
> (Perhaps a better name is "claimed_but_not_yet_possessed" but that's
> a bit unwieldy.) If no claim hypercalls are executed, there
> should be no impact on existing usage.
>
> When a claim is successfully staked by a domain, it is like a
> watermark, but there is no record kept of the size of the claim.
> Instead, d->unclaimed_pages is set to the difference between
> the claim and d->tot_pages. When d->tot_pages increases or decreases,
> d->unclaimed_pages atomically decreases or increases. Once
> d->unclaimed_pages reaches zero, the claim is satisfied and
> d->unclaimed_pages stays at zero - unless a new claim is
> subsequently staked.
>
> The systemwide variable total_unclaimed_pages is always the sum
> of d->unclaimed_pages across all domains. A non-domain-
> specific heap allocation will fail if total_unclaimed_pages
> exceeds free (plus, on tmem-enabled systems, freeable) pages.
> Claim semantics may be modified by flags. The initial
> implementation has three flags, one of which discerns whether the
> caller would like tmem freeable pages to be considered
> in determining whether or not the claim can be successfully
> staked. Future flags may, for example, specify that the
> caller would like the claim to be successful only if there
> are sufficient pages available on a single node (per Dario's
> suggestion).
>
> A claim can be cancelled by requesting a claim with the
> XENMEM_CLAIMF_cancel flag.
>
> A second subop returns the total outstanding claimed pages
> systemwide.
>
> Note: Save/restore/migrate may need to be modified,
> else it can be documented that all claims are cancelled.
>
> ==
> This patch, the proposed XENMEM_claim_pages hypercall/subop, takes
> into account review feedback from Jan, Keir, IanC and Matthew Daley,
> plus some fixes found via runtime debugging.
>
> v9->v10:
>  - Altered the API a bit (new flags, etc).
>
> v8->v9:
>  - Drop snippets already merged by Keir
>    (26269:21a5b181f8ad and 26270:03cb71bc32f9)
>
> v7->v8:
>  - Replace missing tot_pages adjustment [JBeulich]
>  - Combine domain_inc/dec_tot_pages into one function [JBeulich]
>  - Added a few more clarifying comments [djm]
>
> v6->v7:
>  - Need co-existence with low_mem_virq [JBeulich]
>
> v5->v6:
>  - Fix post-inc problem [mattjd]
>
> v4->v5:
>  - Split tools part into separate patch [JBeulich]
>  - Minor coding style fixups [JBeulich]
>  - Use rcu_lock_domain_by_id, not _target version [JBeulich]
>
> v3->v4: (please ignore v3)
>  - Process error, sent stale patch, sorry [djm]
>
> v2->v3:
>  - Add per-domain and global "get" for unclaimed info [JBeulich]
>  - New hypercall(s) should fail for unpriv callers [IanC]
>  - Non-zero extent_order disallowed [JBeulich]
>  - Remove bonehead ASSERTs [JBeulich]
>  - Need not hold heaplock for decrease case too [JBeulich]
>  - More descriptive failure return values [JBeulich]
>  - Don't allow a claim to exceed max_pages [IanC]
>  - Subops must be in correct ifdef block in memory.c [keir]
>
> v1->v2:
>  - Add reset-to-zero page claim in domain_kill [JBeulich]
>  - Proper handling of struct passed to hypercall [JBeulich]
>  - Fix alloc_heap_pages when a domain has a claim [JBeulich]
>  - Need not hold heap_lock if !d->unclaimed_pages [keir]
>  - Fix missed tot_pages call in donate_page [djm]
>  - Remove domain_reset_unclaimed_pages; use set with zero [djm]
>  - Bugfixes found through testing in set_unclaimed [djm]
>  - More comments in code [djm]
>  - Code formatting fixes [djm]
> ===
>
> Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  xen/common/domain.c         |   1 +
>  xen/common/domctl.c         |   1 +
>  xen/common/memory.c         |  31 +++++++++++
>  xen/common/page_alloc.c     | 126 +++++++++++++++++++++++++++++++++++++++++++-
>  xen/include/public/domctl.h |   3 +-
>  xen/include/public/memory.h |  47 ++++++++++++++++-
>  xen/include/xen/mm.h        |   4 ++
>  xen/include/xen/sched.h     |   1 +
>  8 files changed, 209 insertions(+), 5 deletions(-)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index b360de1..90ce40b 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -507,6 +507,7 @@ int domain_kill(struct domain *d)
>          evtchn_destroy(d);
>          gnttab_release_mappings(d);
>          tmem_destroy(d->tmem);
> +        domain_set_unclaimed_pages(d, 0, 0);
>          d->tmem = NULL;
>          /* fallthrough */
>      case DOMDYING_dying:
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index b7f6619..98d6f50 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -154,6 +154,7 @@ void getdomaininfo(struct domain *d, struct xen_domctl_getdomaininfo *info)
>
>      info->tot_pages = d->tot_pages;
>      info->max_pages = d->max_pages;
> +    info->unclaimed_pages = d->unclaimed_pages;
>      info->shr_pages = atomic_read(&d->shr_pages);
>      info->paged_pages = atomic_read(&d->paged_pages);
>      info->shared_info_frame = mfn_to_gmfn(d, virt_to_mfn(d->shared_info));
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 08550ef..c27bba5 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -712,6 +712,37 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          break;
>      }
>
> +    case XENMEM_claim_pages:
> +        if ( !IS_PRIV(current->domain) )
> +            return -EPERM;
> +
> +        if ( copy_from_guest(&reservation, arg, 1) )
> +            return -EFAULT;
> +
> +        if ( !guest_handle_is_null(reservation.extent_start) )
> +            return -EINVAL;
> +
> +        if ( reservation.extent_order != 0 )
> +            return -EINVAL;
> +
> +        d = rcu_lock_domain_by_id(reservation.domid);
> +        if ( d == NULL )
> +            return -EINVAL;
> +
> +        rc = domain_set_unclaimed_pages(d, reservation.nr_extents,
> +                                        reservation.mem_flags);
> +
> +        rcu_unlock_domain(d);
> +
> +        break;
> +
> +    case XENMEM_get_unclaimed_pages:
> +        if ( !IS_PRIV(current->domain) )
> +            return -EPERM;
> +
> +        rc = get_total_unclaimed_pages();
> +        break;
> +
>      default:
>          rc = arch_memory_op(op, arg);
>          break;
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 9e9fb15..2442aba 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -252,11 +252,124 @@ static long midsize_alloc_zone_pages;
>  #define MIDSIZE_ALLOC_FRAC 128
>
>  static DEFINE_SPINLOCK(heap_lock);
> +static long total_unclaimed_pages; /* total outstanding claims by all domains */
>
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>  {
> +    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> +
>      ASSERT(spin_is_locked(&d->page_alloc_lock));
> -    return d->tot_pages += pages;
> +    d->tot_pages += pages;
> +
> +    /*
> +     * can test d->claimed_pages race-free because it can only change
> +     * if d->page_alloc_lock and heap_lock are both held, see also
> +     * domain_set_unclaimed_pages below
> +     */
> +    if ( !d->unclaimed_pages )
> +        goto out;
> +
> +    spin_lock(&heap_lock);
> +    /* adjust domain unclaimed_pages; may not go negative */
> +    dom_before = d->unclaimed_pages;
> +    dom_after = dom_before - pages;
> +    if ( (dom_before > 0) && (dom_after < 0) )
> +        dom_claimed = 0;
> +    else
> +        dom_claimed = dom_after;
> +    d->unclaimed_pages = dom_claimed;
> +    /* flag accounting bug if system unclaimed_pages would go negative */
> +    sys_before = total_unclaimed_pages;
> +    sys_after = sys_before - (dom_before - dom_claimed);
> +    BUG_ON((sys_before > 0) && (sys_after < 0));
> +    total_unclaimed_pages = sys_after;
> +    spin_unlock(&heap_lock);
> +
> +out:
> +    return d->tot_pages;
> +}
> +
> +int domain_set_unclaimed_pages(struct domain *d, unsigned long pages,
> +                               unsigned long flags)
> +{
> +    int ret = -ENOMEM;
> +    unsigned long claim, avail_pages;
> +
> +    /*
> +     * Sanity check.
> +     */
> +    switch ( flags ) {
> +    case XENMEM_CLAIMF_normal:
> +    case XENMEM_CLAIMF_tmem:
> +        if ( pages == 0 )
> +            return -EINVAL;
> +        break;
> +    case XENMEM_CLAIMF_cancel:
> +        pages = 0;
> +        break;
> +    default:
> +        return -ENOSYS;
> +    }
> +
> +    /*
> +     * take the domain's page_alloc_lock, else all d->tot_page adjustments
> +     * must always take the global heap_lock rather than only in the much
> +     * rarer case that d->unclaimed_pages is non-zero
> +     */
> +    spin_lock(&d->page_alloc_lock);
> +    spin_lock(&heap_lock);
> +
> +    /* pages==0 means "unset" the claim. */
> +    if ( pages == 0 )
> +    {
> +        total_unclaimed_pages -= d->unclaimed_pages;
> +        d->unclaimed_pages = 0;
> +        ret = 0;
> +        goto out;
> +    }
> +
> +    /* only one active claim per domain please */
> +    if ( d->unclaimed_pages )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> +    /* disallow a claim not exceeding current tot_pages or above max_pages */
> +    if ( (pages <= d->tot_pages) || (pages > d->max_pages) )
> +    {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +
> +    /* how much memory is available? */
> +    avail_pages = total_avail_pages;
> +    if ( flags & XENMEM_CLAIMF_tmem )
> +        avail_pages += tmem_freeable_pages();
> +    avail_pages -= total_unclaimed_pages;
> +
> +    /*
> +     * note, if domain has already allocated memory before making a claim
> +     * then the claim must take tot_pages into account
> +     */
> +    claim = pages - d->tot_pages;
> +    if ( claim > avail_pages )
> +        goto out;
> +
> +    /* yay, claim fits in available memory, stake the claim, success! */
> +    d->unclaimed_pages = claim;
> +    total_unclaimed_pages += d->unclaimed_pages;
> +    ret = 0;
> +
> +out:
> +    spin_unlock(&heap_lock);
> +    spin_unlock(&d->page_alloc_lock);
> +    return ret;
> +}
> +
> +long get_total_unclaimed_pages(void)
> +{
> +    return total_unclaimed_pages;
>  }
>
>  static unsigned long init_node_heap(int node, unsigned long mfn,
> @@ -397,7 +510,7 @@ static void __init setup_low_mem_virq(void)
>  static void check_low_mem_virq(void)
>  {
>      unsigned long avail_pages = total_avail_pages +
> -        (opt_tmem ? tmem_freeable_pages() : 0);
> +        (opt_tmem ? tmem_freeable_pages() : 0) - total_unclaimed_pages;
>
>      if ( unlikely(avail_pages <= low_mem_virq_th) )
>      {
> @@ -466,6 +579,15 @@ static struct page_info *alloc_heap_pages(
>      spin_lock(&heap_lock);
>
>      /*
> +     * Claimed memory is considered unavailable unless the request
> +     * is made by a domain with sufficient unclaimed pages.
> +     */
> +    if ( (total_unclaimed_pages + request >
> +           total_avail_pages + tmem_freeable_pages()) &&
> +          (d == NULL || d->unclaimed_pages < request) )
> +        goto not_found;
> +
> +    /*
>       * TMEM: When available memory is scarce due to tmem absorbing it, allow
>       * only mid-size allocations to avoid worst of fragmentation issues.
>       * Others try tmem pools then fail. This is a workaround until all
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index deb19db..efae421 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -36,7 +36,7 @@
>  #include "grant_table.h"
>  #include "hvm/save.h"
>
> -#define XEN_DOMCTL_INTERFACE_VERSION 0x00000008
> +#define XEN_DOMCTL_INTERFACE_VERSION 0x00000009
>
>  /*
>   * NB. xen_domctl.domain is an IN/OUT parameter for this operation.
> @@ -95,6 +95,7 @@ struct xen_domctl_getdomaininfo {
>      uint32_t flags;              /* XEN_DOMINF_* */
>      uint64_aligned_t tot_pages;
>      uint64_aligned_t max_pages;
> +    uint64_aligned_t unclaimed_pages;
>      uint64_aligned_t shr_pages;
>      uint64_aligned_t paged_pages;
>      uint64_aligned_t shared_info_frame; /* GMFN of shared_info struct */
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 1c5ca19..5c264ca 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -68,6 +68,8 @@ struct xen_memory_reservation {
>       * IN: GPFN bases of extents to populate with memory
>       * OUT: GMFN bases of extents that were allocated
>       * (NB. This command also updates the mach_to_phys translation table)
> +     * XENMEM_claim_pages:
> +     * IN: must be zero
>       */
>      XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
>
> @@ -430,10 +432,51 @@ typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>
>  /*
> - * Reserve ops for future/out-of-tree "claim" patches (Oracle)
> + * Attempt to stake a claim for a domain on a quantity of pages
> + * of system RAM, but _not_ assign specific pageframes. Only
> + * arithmetic is performed so the hypercall is very fast and need
> + * not be preemptible, thus sidestepping time-of-check-time-of-use
> + * races for memory allocation. Returns 0 if the hypervisor page
> + * allocator has atomically and successfully claimed the requested
> + * number of pages, else non-zero.
> + *
> + * Any domain may have only one active claim. When sufficient memory
> + * has been allocated to resolve the claim, the claim silently expires.
> + * Claiming zero pages effectively resets any outstanding claim and
> + * is always successful.
> + *
> + * Note that a valid claim may be staked even after memory has been
> + * allocated for a domain. In this case, the claim is not incremental,
> + * i.e. if the domain's tot_pages is 3, and a claim is staked for 10,
> + * only 7 additional pages are claimed.
> + *
> + * Caller must be privileged or the hypercall fails.
>   */
>  #define XENMEM_claim_pages 24
> -#define XENMEM_get_unclaimed_pages 25
> +
> +/*
> + * XENMEM_claim_pages flags:
> + *  normal: claim is successful only if sufficient free pages
> + *          are available.
> + *  tmem:   claim is successful only if sufficient free pages
> + *          are available and (if tmem is enabled) hypervisor
> + *          may also consider tmem "freeable" pages to satisfy the claim.
> + *  cancel: cancel the outstanding claim for the domain.
> + */
> +#define XENMEM_CLAIMF_ignored 0
> +
> +#define _XENMEM_CLAIMF_normal 1
> +#define XENMEM_CLAIMF_normal  (1U<<_XENMEM_CLAIMF_normal)
> +#define _XENMEM_CLAIMF_tmem   2
> +#define XENMEM_CLAIMF_tmem    (1U<<_XENMEM_CLAIMF_tmem)
> +#define _XENMEM_CLAIMF_cancel 3
> +#define XENMEM_CLAIMF_cancel  (1U<<_XENMEM_CLAIMF_cancel)
> +
> +/*
> + * Get the number of pages currently claimed (but not yet "possessed")
> + * across all domains. The caller must be privileged but otherwise
> + * the call never fails.
> + */
> +#define XENMEM_get_unclaimed_pages 25
>
>  #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index 2f701f5..f4ab991 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -49,7 +49,11 @@ void free_xenheap_pages(void *v, unsigned int order);
>  #define alloc_xenheap_page() (alloc_xenheap_pages(0,0))
>  #define free_xenheap_page(v) (free_xenheap_pages(v,0))
>
> +/* Claim handling */
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages);
> +int domain_set_unclaimed_pages(
> +    struct domain *d, unsigned long pages, unsigned long flags);
> +long get_total_unclaimed_pages(void);
>
>  /* Domain suballocator. These functions are *not* interrupt-safe.*/
>  void init_domheap_pages(paddr_t ps, paddr_t pe);
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index e108436..9c5263f 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -242,6 +242,7 @@ struct domain
>      struct page_list_head page_list;    /* linked list */
>      struct page_list_head xenpage_list; /* linked list (size xenheap_pages) */
>      unsigned int tot_pages;             /* number of pages currently possesed */
> +    unsigned int unclaimed_pages;       /* pages claimed but not possessed */
>      unsigned int max_pages;             /* maximum value for tot_pages */
>      atomic_t shr_pages;                 /* number of shared pages */
>      atomic_t paged_pages;               /* number of paged-out pages */

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel