
Re: [Xen-devel] [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access



On Apr 30, 2014, at 6:20 PM, "Aravindh Puthiyaparambil (aravindp)" 
<aravindp@xxxxxxxxx> wrote:

>>>>> +        default:
>>>>> +            *flags &= ~(_PAGE_NX_BIT);
>>>>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>>>> +            break;
>>>>> +    }
>>>> 
>>>> Tim will have the final say here, but I'd suggest dropping all these
>>>> needless parentheses (I see only one case where they're really
>>>> needed).
>>> 
>>> OK, I will drop all the needless ones.  Could you please point out the
>>> case they are needed? I couldn't find anything in the CODING_STYLE
>>> that dictates what should be done.
>> 
>> Convention is to use them when operator precedence isn't considered
>> obvious. What counts as "obvious" differs between people, but at least all
>> precedence rules defined by school mathematics should generally not require
>> parenthesizing. In the case above that means no parentheses around the
>> right-side expression of assignment operators.
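>>
>> For the snippet quoted above that would mean, roughly (a sketch):
>>
>>     default:
>>         *flags &= ~_PAGE_NX_BIT;
>>         *flags |= _PAGE_RW | _PAGE_PRESENT;
>>         break;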
> 
> OK, I understand and will fix them appropriately. 
> 
>>>>> +
>>>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>>>> +        return -ENOENT;
>>>>> +
>>>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>>> 
>>>> You get "gfn" passed in and blindly overwrite it here? _If_ you need
>>>> to do this lookup, shouldn't you instead check it matches the passed in
>> one?
>>> 
>>> The "gfn" and "mfn" passed in to the function should be the same as it
>>> is a PV guest. That is why I am overwriting "gfn" to get the "gpfn"
>>> from the machine_to_phys_mapping.
>> 
>> Since this overwrites/discards a function parameter, this surely is worth a
>> comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).
> 
> I will add the ASSERT and a comment for it.
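> Something along these lines (a sketch):
>
>     /* For PV guests the gfn and mfn passed in are expected to be equal. */
>     ASSERT(gfn == mfn_x(mfn));
>     gfn = get_gpfn_from_mfn(mfn_x(mfn));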
> 
>>>>> +
>>>>> +    /*
>>>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>>>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>>>>> +     * Nothing to do here.
>>>>> +     */
>>>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>>>> +        return 0;
>>>>> +
>>>>> +    if ( gfn > (d->tot_pages - 1) )
>>>>> +        return -EINVAL;
>>>> 
>>>> Hardly - a PV domain can easily make its address space sparse, and
>>>> iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>>> on MMIO space. (And as a side note, this would better be "gfn >=
>>>> d->tot_pages".)
>>> 
>>> I was using this as a guard against going out of bounds in the access
>>> lookup table, which is created based on the domain's tot_pages. Please
>>> let me know what I should be using instead of tot_pages when doing this.
>> 
>> Afaict, for PV guests there's absolutely nothing you can use to bound the
>> range.
>>>>> +    paging_lock(d);
>>>>> +
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +    access_table_page[page_idx] = p2ma;
>>>>> +    unmap_domain_page(access_table_page);
>>>>> +
>>>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>>> 
>>>> Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-NULL?
>>> 
>>> I am not sure. I see shadow_track_dirty_vram() calling this without a check.
>>> Maybe I should check and call it only if d->vcpu is not NULL.
>> 
>> You ought to check both, at least via ASSERT().
> 
> OK, I will add the ASSERTs.
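> E.g. (a sketch), just before the call:
>
>     ASSERT(d->vcpu != NULL && d->vcpu[0] != NULL);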
> 
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +
>>>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +
>>>>> +    /* This is a hint to take the default permissions */
>>>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>>>> +        access_table_page[page_idx] = p2m->default_access;
>>>> 
>>>> We're in "get" here - why does that modify any global state?
>>> 
>>> I do not initialize the access lookup table to the default value, but
>>> the table is zeroed as part of initialization when each page is
>>> allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n"
>>> / 0 value as a hint that the default value should be returned, as
>>> p2m_access_n is not valid for PV domains. If you prefer that I
>>> initialize the table to the default value, I will gladly do so.
>> 
>> I think implying _n to mean "default" is at least a latent bug anyway
>> (since that precludes setting that type). And _if_ "get" really does any
>> modification, that minimally would require a big comment.
> 
> For PV domains access_n is precluded.
> 
>> Furthermore, with it being possible for the default type to change, the net
>> effect might easily be different between setting the default type at
>> initialization time or when the first lookup happens.
> 
> I am not sure that I follow your concern here. Setting the default access
> does not mean that the access table gets cleared. So if a particular MFN
> entry's access value was changed and the default value was then changed,
> the MFN should retain the access value it was changed to and not the
> default value. The new default value will only be returned for MFNs on
> which set_entry was not called, and I think that is correct.
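>
> If the modification in "get" is the main concern, the lookup could return
> the default without writing it back, e.g. (a sketch, assuming the access
> value is handed back through a pointer as in the other p2m get_entry
> implementations):
>
>     /*
>      * p2m_access_n is not a valid type for PV domains; a zeroed slot
>      * therefore means "no explicit access set, use the current default".
>      * Note this picks up the default at lookup time, not the one in
>      * effect when the table page was allocated.
>      */
>     if ( access_table_page[page_idx] == p2m_access_n )
>         *a = p2m->default_access;
>     else
>         *a = access_table_page[page_idx];
>     unmap_domain_page(access_table_page);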
> 
>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>>>> +{
>>>>> +    struct domain *d = p2m->domain;
>>>>> +    mfn_t *access_lookup_table;
>>>>> +    uint32_t nr_access_table_pages;
>>>>> +    uint32_t ctr;
>>>>> +
>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>>> 
>>>> This surely is going to be an order > 0 allocation, and you even risk
>>>> it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>> by convention/agreement, the latter is an outright bug.
>>> 
>>> Sorry, I was not aware of the convention. What should I be doing here
>>> instead?
>> 
>> Especially with the above note about GFN space being effectively unbounded
>> for PV guests, you need to find a different data structure to represent the
>> information you need. Perhaps something similar to what log-dirty mode
>> uses, except that you would need more levels in the extreme case, and hence
>> it might be worthwhile making the actually used number of levels dynamic?
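>>
>> Purely for illustration, a dynamic-depth radix lookup might be shaped
>> like this (a sketch only, not the log-dirty code; the names and the
>> 9-bits-per-level split are assumptions):
>>
>>     #define ACCESS_TABLE_LEVEL_BITS 9 /* 512 mfn_t slots per 4K page */
>>
>>     static mfn_t access_table_descend(mfn_t root, unsigned long gfn,
>>                                       unsigned int nr_levels)
>>     {
>>         /* One access byte per gfn: a 4K leaf covers PAGE_SIZE gfns. */
>>         unsigned long idx = gfn >> PAGE_SHIFT;
>>         mfn_t mfn = root;
>>         unsigned int i;
>>
>>         for ( i = nr_levels; i-- && mfn_valid(mfn); )
>>         {
>>             mfn_t *node = map_domain_page(mfn_x(mfn));
>>
>>             mfn = node[(idx >> (i * ACCESS_TABLE_LEVEL_BITS)) &
>>                        ((1UL << ACCESS_TABLE_LEVEL_BITS) - 1)];
>>             unmap_domain_page(node);
>>         }
>>
>>         return mfn; /* leaf page holding the per-gfn access bytes */
>>     }
>>
>> with nr_levels derived from the highest GFN the guest has used, so the
>> common case stays shallow.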
> 
> I am a little confused here. In my initial discussions about this I was told 
> that PV guests have bounded and contiguous memory. This is why I took the 
> approach of an indexable array. 
> 
> " OTOH, all you need is a byte per pfn, and the great thing is that in PV 
> domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, 
> etc, which demand the sparse tree structure. So you can allocate an easily 
> indexable array, notwithstanding super page concerns (I think/hope)."
> 
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html
> 
> Did I misunderstand something here? 
> PS: I have CCed Andres too so that he can clarify.
I can muddy things up further.

In the good olden days that was the case. It seems that since Linux 3.x, PV
dom0 is defined with a physmap as large as the host, with plenty of holes in
it which are plugged with the m2p override. I am unfortunately and
unhelpfully hazy on the details. A situation like this will lead to max_pfn
being very different from tot_pages. This is common in HVM.

Apologies for lack of precision
Andres
> 
>>>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>>>>>     if ( pfn == ~0ul )
>>>>>     {
>>>>>         p2m->default_access = a;
>>>>> +        if ( is_pv_domain(d) )
>>>>> +            return p2m_mem_access_set_default(p2m);
>>>> 
>>>> Is it really correct to set p2m->default_access _before_ calling that
>>>> function? Perhaps it wouldn't be correct doing it the other way around
>>>> either - I suppose you need to hold the necessary lock across both
>>>> operations.
>>> 
>>> I do not see the problem here. p2m_mem_access_set_default() just blows
>>> the shadows away after taking the paging_lock. There is no actual
>>> setting of default permission for all pages in the PV case.
>> 
>> But the function having a return value means it might fail, in which case you
>> shouldn't have set the new default.
> 
> Good point, I will fix that.
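> I.e., roughly (a sketch; modulo your point about holding the lock across
> both operations):
>
>     if ( pfn == ~0ul )
>     {
>         if ( is_pv_domain(d) )
>         {
>             long rc = p2m_mem_access_set_default(p2m);
>
>             if ( rc )
>                 return rc;
>         }
>         p2m->default_access = a;
>         return 0;
>     }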
> 
>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>>>>>         if ( !(shadow_mode_external(v->domain)
>>>>>                && (page->count_info & PGC_count_mask) <= 3
>>>>>                && ((page->u.inuse.type_info & PGT_count_mask)
>>>>> -                   == !!is_xen_heap_page(page))) )
>>>>> +                       == !!is_xen_heap_page(page)))
>>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>> 
>>>> You're breaking indentation, there are pointless parentheses again,
>>>> but most importantly - why?
>>> 
>>> Sorry, I meant to ask a question about this in my patch message and
>>> mark this as a workaround. I am seeing the message on all
>>> sh_remove_all_mappings() calls and was not able to figure out why this
>>> was happening. I just added this as a workaround. I was hoping you or
>>> Tim would shed more light on this.
>> 
>> I'm afraid that without detail on which pages this triggers on, and
>> without you at least having spent some time finding out where the
>> stray/extra references may be coming from, it's going to be hard to help.
> 
> Here are some of the prints I saw. I typically saw it for every set_entry() 
> call.
> 
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13344: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13449: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1394b: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 137b5: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33fa6: c=8000000000000002 t=7400000000000000
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1345d: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33410: c=8000000000000034 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33412: c=8000000000000039 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33953: c=8000000000000011 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13a63: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f2: c=800000000000003a t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1335b: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bc: c=8000000000000036 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f0: c=8000000000000014 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bd: c=8000000000000008 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1354e: c=8000000000000010 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33443: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 39f74: c=8000000000000008 t=7400000000000001
> 
>>>>> +            /*
>>>>> +             * Do not police writes to guest memory emanating from the Xen
>>>>> +             * kernel. Trying to do so will cause the same pagefault to occur
>>>>> +             * over and over again with an event being sent to the access
>>>>> +             * listener for each fault. If the access listener's vcpu is not
>>>>> +             * scheduled during this time, the violation is never resolved and
>>>>> +             * will eventually end with the host crashing.
>>>>> +             */
>>>>> +            if ( (violation && access_w) &&
>>>>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>>>>> +            {
>>>>> +                violation = 0;
>>>>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>>>> +                                    p2m_ram_rw, p2m_access_rw);
>>>>> +            }
>>>> 
>>>> This looks more like a hack than a proper solution - shouldn't the
>>>> listener be allowed to know of hypervisor side accesses?
>>> 
>>> Ideally the listener should know of the hypervisor side accesses, but I
>>> ran into cases where the listener would not get to run and the
>>> pagefault would be retried over and over, eventually causing the host to
>>> crash. I guess more detail about the problem is needed. I will send
>>> out a separate email regarding this, CCing you and Tim.
>> 
>> And if it's intended as a workaround until proper resolution only, you 
>> probably
>> should say so explicitly in the comment, avoiding the need for reviewers to
>> comment on this being a problem.
> 
> Sorry, I will be more explicit about such things in the future.
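> E.g., the comment above could start with something like (a sketch):
>
>     /*
>      * XXX: Temporary workaround until hypervisor-side accesses can be
>      * reported to the listener properly. Do not police writes to guest
>      * memory emanating from the Xen kernel. ...
>      */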
> 
>>>>> +/* Number of access table pages for a PV domain */
>>>>> +#define get_domain_nr_access_table_pages(d) \
>>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>>> 
>>>> And once again. I wonder whether this code was tested on a suitably
>>>> big pv-ops PV guest.
>>> 
>>> The largest PV domain I tried this with was 28GB.
>> 
>> Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>> worked.
> 
> It would have had MFNs larger than tot_pages. I don't understand how it would 
> have had GFNs larger than tot_pages.
> 
> Thanks,
> Aravindh
> 

