[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v6 04/10] xen: Introduce XENMEM_soft_reset operation



"Jan Beulich" <JBeulich@xxxxxxxx> writes:

>>>> On 13.05.15 at 11:49, <vkuznets@xxxxxxxxxx> wrote:
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -580,6 +580,234 @@ static long 
>> memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
>>      return rc;
>>  }
>>  
>> +static long 
>> memory_soft_reset(XEN_GUEST_HANDLE_PARAM(xen_memory_soft_reset_t) arg)
>> +{
>> +    long rc = 0;
>
> The way it's coded right now, I don't see the need for this or the
> function return type being "long" (but see the comments on the
> public header change).
>
>> +    struct xen_memory_soft_reset softr;
>> +    struct domain *source_d, *dest_d;
>> +    unsigned long mfn, mfn_new, gmfn, last_gmfn, count;
>> +    unsigned int order;
>> +    p2m_type_t p2mt;
>> +    struct page_info *page, *new_page, *next_page;
>> +    int drop_dom_ref;
>> +
>> +    if ( copy_from_guest(&softr, arg, 1) )
>> +        return -EFAULT;
>> +
>> +    if ( softr.source_domid == softr.dest_domid )
>> +        return -EINVAL;
>> +
>> +    source_d = rcu_lock_domain_by_any_id(softr.source_domid);
>> +    if ( source_d == NULL )
>> +    {
>> +        rc = -ESRCH;
>> +        goto fail_early;
>> +    }
>> +
>> +    dest_d = rcu_lock_domain_by_any_id(softr.dest_domid);
>> +    if ( dest_d == NULL )
>> +    {
>> +        rc = -ESRCH;
>> +        rcu_unlock_domain(source_d);
>> +        goto fail_early;
>> +    }
>> +
>> +    if ( dest_d->is_dying )
>> +    {
>> +        rc = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    if ( !source_d->is_dying )
>> +    {
>> +        /*
>> +         * Make sure no allocation/remapping for the source domain is 
>> ongoing
>> +         * and set is_dying flag to prevent such actions in future.
>> +         */
>> +        spin_lock(&source_d->page_alloc_lock);
>> +        source_d->is_dying = DOMDYING_locked;
>
> You need to repeat the is_dying check with the lock held, and act
> accordingly if it isn't zero anymore.

Yep, missed that.

>
> Furthermore I don't see how this prevents there being further
> changes to the GFN <-> MFN mappings for the guest (yet you rely
> on them to be stable below afaict).
>

As you suggested below we can complement that by pausing both source and
destination domains here. In that case these domains won't be changing
their mappings by themselves but it would still be possible for the
hypervisor to change something. We do have !d->is_dying check in many
places though ... In theory we could have taken the p2m_lock() for both
domains but I'm not sure all stuff I need here will cope with p2m_lock()
already being held, this lock is x86-specific and I'm not sure we want
it in the first place. I'd be very grateful for some additional ideas on
how to make it safer.

>> +        spin_unlock(&source_d->page_alloc_lock);
>> +    }
>> +
>> +    last_gmfn = domain_get_maximum_gpfn(source_d);
>> +    gmfn = softr.gmfn_start;
>> +    while ( gmfn <= last_gmfn )
>
> By now you should have done a privilege check.
>

I'm adding xsm_memory_soft_reset() in PATCH 05/10, hope it's sufficient.

>> +    {
>> +        page = get_page_from_gfn(source_d, gmfn, &p2mt, 0);
>> +        if ( unlikely(page == NULL) )
>> +        {
>> +#ifdef CONFIG_X86
>> +            count = 0;
>> +            while ( p2m_is_pod(p2mt) )
>> +            {
>> +                count++;
>> +                if ( gmfn + count > last_gmfn )
>> +                    break;
>> +                page = get_page_from_gfn(source_d, gmfn + count, &p2mt, 0);
>> +                if ( page )
>> +                {
>> +                    put_page(page);
>> +                    page = NULL;
>
> Why? You don't use the value further down.
>

Leftovers, will get rid of it in v7.

>> +                    break;
>> +                }
>> +            }
>> +
>> +            if ( !count )
>> +                gmfn++;
>> +
>> +            while ( count )
>> +            {
>> +                order = get_order_from_pages(count);
>> +                if ( (1ul << order) > count )
>> +                    order -= 1;
>
> --
>
>> +                rc = guest_physmap_mark_populate_on_demand(dest_d, gmfn, 
>> order);
>> +                if ( rc )
>> +                    goto fail;
>> +                count -= 1ul << order;
>> +                gmfn += 1ul << order;
>> +            }
>> +#else
>> +            gmfn++;
>> +#endif
>> +            goto preempt_check;
>> +        }
>> +
>> +        mfn = page_to_mfn(page);
>> +        if ( unlikely(!mfn_valid(mfn)) )
>> +        {
>> +            put_page(page);
>> +            gmfn++;
>> +            goto preempt_check;
>
> I guess this should never happen, and hence issuing a warning may be
> warranted?
>

Yea, not sure what should is be though. "page with invalid MFN!" or something.

>> +        }
>> +
>> +        next_page = page;
>> +
>> +        /*
>> +         * A normal page is supposed to have count_info = 2 (1 from the 
>> domain
>> +         * and 1 from get_page_from_gfn() above). All other pages need to be
>> +         * copied.
>> +         */
>
> I'm afraid this isn't precise enough: _PGC_allocated implies one
> refcount, i.e. if that flag turns out to be clear, you should expect
> just 1. And then I'd wonder whether a !_PGC_allocated page
> needs any handling here in the first place (or perhaps, once the
> GFN <-> MFN mapping is ensured to be stable, this case can't
> happen anyway).

A !_PGC_allocated page should never happen here I suppose but to be on
the safe side we can add _PGC_allocated check here, I think we need to
skip a page or even fail the hypercall in case this check fails.

>
>> +        count = 0;
>> +        while ( next_page && !is_xen_heap_page(next_page) &&
>> +                (next_page->count_info & PGC_count_mask) == 2 &&
>> +                page_to_mfn(next_page) == mfn + count )
>> +        {
>> +            count++;
>> +            drop_dom_ref = 0;
>> +            spin_lock(&source_d->page_alloc_lock);
>> +            page_set_owner(next_page, NULL);
>> +            page_list_del(next_page, &source_d->page_list);
>> +            source_d->tot_pages -= 1;
>
> --
>
>> +            if ( unlikely(source_d->tot_pages == 0) )
>> +                drop_dom_ref = 1;
>> +            spin_unlock(&source_d->page_alloc_lock);
>> +            put_page(next_page);
>> +            if ( drop_dom_ref )
>> +                put_domain(source_d);
>> +
>> +            if ( unlikely(assign_pages(dest_d, next_page, 0, 0)) )
>> +            {
>> +                printk(XENLOG_G_INFO "Failed to assign Dom%d's MFN %lx"
>> +                       " to Dom%d\n", source_d->domain_id, mfn,
>> +                       dest_d->domain_id);
>> +                rc = -EFAULT;
>
> A better (more distinct) error code should be used here. And I
> suppose printing the GFN would be more helpful for debugging
> possible issues than printing the MFN.

As I can see assign_pages() fails in two cases:
1) Over-allocating
2) Domain is dying (this is never supposed to happen as we're holding
the lock).
So actually both MFN and GFN are a bit irrelevant here.

>> +                goto fail;
>> +            }
>> +
>> +            if ( unlikely(gmfn + count > last_gmfn) )
>> +            {
>> +                next_page = NULL;
>> +                break;
>> +            }
>> +
>> +            next_page = get_page_from_gfn(source_d, gmfn + count, &p2mt, 0);
>> +        }
>> +
>> +        if ( next_page && count )
>> +                put_page(next_page);
>> +
>> +        if ( !count && (page->count_info & PGC_count_mask) != 2 )
>
> Re-reading the field seems wrong here: What if it was > 2 above,
> but is == 2 now?
>
> Also together with the "else if" below I think this would better be
> structured
>
>         if ( count )
>         {
>             if ( next_page && count )
>                 put_page(next_page);
>         }
>         else if ( (page->count_info & PGC_count_mask) != 2 )
>         ...
>         else
>
>> +        {
>> +            new_page = alloc_domheap_page(dest_d, 0);
>> +            if ( unlikely(new_page == NULL) )
>> +            {
>> +                printk(XENLOG_G_INFO "Failed to alloc a page to replace"
>> +                       " Dom%d's GFN %lx (MFN %lx) for Dom %d\n",
>> +                       source_d->domain_id, gmfn, mfn, dest_d->domain_id);
>> +                rc = -ENOMEM;
>> +                put_page(page);
>> +                goto fail;
>> +            }
>> +            mfn_new = page_to_mfn(new_page);
>> +            copy_domain_page(mfn_new, mfn);
>> +            mfn = mfn_new;
>> +            put_page(page);
>> +            if ( guest_physmap_add_page(dest_d, gmfn, mfn, 0) )
>> +            {
>> +                printk(XENLOG_G_INFO "Failed to add new GFN %lx"
>> +                       " (MFN %lx) to Dom%d\n",
>> +                       gmfn, mfn, dest_d->domain_id);
>> +                rc = -EFAULT;
>> +                goto fail;
>> +            }
>> +            gmfn++;
>> +            softr.nr_transferred++;
>> +        }
>> +        else if ( !count )
>> +        {
>> +            put_page(page);
>> +            gmfn++;
>> +            goto preempt_check;
>> +        }
>> +
>> +        while ( count )
>> +        {
>> +            order = get_order_from_pages(count);
>> +            if ( (1ul << order) > count )
>> +                order -= 1;
>> +
>> +            guest_physmap_remove_page(source_d, gmfn, mfn, order);
>> +
>> +            if ( guest_physmap_add_page(dest_d, gmfn, mfn, order) )
>
> Considering the non-atomicity between the assign_pages() much
> further up and this call, shouldn't the domain be paused while its
> state is inconsistent?

Yes, actually they're already paused by the toolstack but we'd better
enforce this in the hypervisor. Will fix in v7.

>
>> --- a/xen/include/public/memory.h
>> +++ b/xen/include/public/memory.h
>> @@ -573,7 +573,39 @@ struct xen_vnuma_topology_info {
>>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
>>  
>> -/* Next available subop number is 27 */
>> +/*
>> + * Transfer all memory pages from one domain to the other. Pages are 
>> unmapped
>> + * from the source domain and mapped at the same GFNs to the destination
>> + * domain. This hypercall has a side-effect of moving the source domain to 
>> the
>> + * 'dying' state.
>
> Namely due to this, shouldn't this rather be a domctl?
>

Probably, it looked like a memory op in the very beginning but unless we
decide to ensure that domain's mappings don't change by some other
mechanism we'd better make this a domctl destroying the original domain
at the very end.

>> + * If a particular page is mapped more then once in the source domain a new
>> + * empty page is being allocated for the destination domain and the content 
>> is
>> + * being copied. The original page remains mapped to the source domain.
>> + *
>> + * The caller has to be priviliged and it is supposed to set gmfn_start to 
>> 0,
>> + * this field is required for the hypercall continuation.
>
> If it was kept to be a mem-op - mem-ops have a mechanism to be
> continued without custom structure fields.
>
>> +#define XENMEM_soft_reset                   27
>> +struct xen_memory_soft_reset {
>> +    /*
>> +     * [IN] Memory transfer details.
>> +     */
>> +    domid_t source_domid; /* steal pages from */
>> +    domid_t dest_domid;   /* assign pages to */
>> +
>> +    xen_pfn_t gmfn_start; /* start from gmfn */
>> +
>> +    /*
>> +     * [OUT] Number of transfered pages including new allocations.
>> +     */
>> +    xen_ulong_t nr_transferred;
>
> It's not really clear to me what use this is.

I use it to detect a PoD domain for which I need to populate cache but I
think I can live without it.

>
> Jan

-- 
  Vitaly

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.