[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing



> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxx [mailto:xen-devel-
> bounces@xxxxxxxxxxxxx] On Behalf Of Ian Campbell
> Sent: Thursday, October 17, 2013 7:30 PM
> To: Jaeyong Yoo
> Cc: 'Stefano Stabellini'; xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:
> 
> > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > > >  }
> > > >
> > > > +/* routine for dirty-page tracing
> > > > + *
> > > > + * On first write, it page faults, its entry is changed to
> > > > +read-write,
> > > > + * and on retry the write succeeds.
> > > > + *
> > > > + * for locating p2m of the faulting entry, we use virtual-linear
> > > > +page
> > > table.
> > > > + */
> > > > +void handle_page_fault(struct domain *d, paddr_t addr) {
> > > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> > >
> > > I think you need to distinguish p2m entries which are r/o due to log
> > > dirty from those which are r/o due to wanting them to be r/o.
> > >
> > > Maybe we don't have the latter right now? We will eventually need
> > > p2m types I think.
> >
> > Yes, that should be distinguished. Actually, that was in my mind and
> > apparently I failed to remember. For doing this, I'm thinking to use
> > un-used field in pte as an indicator of 'wanting to be r/o' and check
> this indicator in permission fault.
> 
> Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
> discussing this before, but perhaps it wasn't with you!
> 
> >
> > >
> > > > +    {
> > > > +        lpae_t pte = *vlp2m_pte;
> > > > +        pte.p2m.write = 1;
> > > > +        write_pte(vlp2m_pte, pte);
> > > > +        flush_tlb_local();
> > > > +
> > > > +        /* in order to remove mappings in cleanup stage */
> > > > +        add_mapped_vaddr(d, addr);
> > >
> > > No locks or atomic operations here? How are races with the tools
> > > reading the dirty bitmap addressed?  If it is by clever ordering of
> > > the checks and pte writes then I think a comment would be in order.
> >
> > Actually, the lock is inside the add_mapped_vaddr function.
> 
> But that only covers the bitmap update, not the pte frobbing.

I also locked with the same lock at the get_dirty_bitmap which is reading
the bitmap.

> 
> >
> > >
> > > > +    }
> > > > +}
> > > > +
> > > >  /*
> > > >   * Local variables:
> > > >   * mode: C
> > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c index
> > > > 2d09fef..3b2b11a 100644
> > > > --- a/xen/arch/arm/p2m.c
> > > > +++ b/xen/arch/arm/p2m.c
> > > > @@ -6,6 +6,8 @@
> > > >  #include <xen/bitops.h>
> > > >  #include <asm/flushtlb.h>
> > > >  #include <asm/gic.h>
> > > > +#include <xen/guest_access.h>
> > > > +#include <xen/pfn.h>
> > > >
> > > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)  { @@ -408,6
> > > > +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned
> > > > +long
> > > gpfn)
> > > >      return p >> PAGE_SHIFT;
> > > >  }
> > > >
> > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) -
> sizeof(int))\
> > > > +                        / sizeof(unsigned long)
> > > > +
> > > > +/* An array-based linked list for storing virtual addresses
> > > > +(i.e., the 3rd
> > > > + * level table mapping) that should be destroyed after live
> > > > +migration */ struct mapped_va_node {
> > > > +    struct page_info *next;
> > > > +    struct mapped_va_node *mvn_next;
> > > > +    int items;
> > > > +    unsigned long vaddrs[MAX_VA_PER_NODE]; };
> > > > +
> > > > +int add_mapped_vaddr(struct domain *d, unsigned long va) {
> > >
> > > This function seems awefuly expensive (contains allocations etc) for
> > > a function called on every page fault during a migration.
> > >
> > > Why are you remembering the full va of every fault when a single bit
> > > per page would do?
> > >
> > > I think you can allocate a bitmap when logdirty is enabled. At a bit
> > > per page it shouldn't be too huge.
> > >
> > > You might be able to encode this in the p2m which you walk when the
> > > tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 =>
> > > dirty), but I think a bitmap would do the trick and be easier to
> implement.
> >
> > There are three candidates:
> >
> > 1) bitmap
> > 2) encoding into p2m
> > 3) linked list of VAs.
> >
> > And, two functions for dirty-page tracing:
> >
> > A) dirty-page handling (At trap)
> > B) getting dirty bitmap (for toolstack)
> >
> > 1) and 2) have advantage for A) but have to search the full range of
> > pages at B) to find out which page is dirtied and set the page as
> > read-only for next dirty detection. On the otherhand, 3) does not need
> to search the full range at B).
> > I prefer 3) due to the 'non-full range search' but I understand your
> concern.
> > Maybe I can do some measurement for each method to see which one better.
> 
> If you have a bitmap then you can do a scan for each set bit, and the
> offset of each bit corresponds to the guest pfn, which allows you to
> directly calculate the address of the pte in the vlpt.
> 
> I don't think doing a scan for each set bit is going to compare
> unfavourably to walking a linked list. Certainly not once the memory
> allocator gets involved.

Oh, do you mean using something like "find_next_zero_bit"? At the first
time,
I thought searching every bit-by-bit but, I just see findbit.S and it looks
quite optimized. 

> 
> > > > +/* get the dirty bitmap from the linked list stored by
> > > > +add_mapped_vaddr
> > > > + * and also clear the mapped vaddrs linked list */ static void
> > > > +get_dirty_bitmap(struct domain *d, uint8_t *bitmap[]) {
> > > > +    struct page_info *page_head;
> > > > +    struct mapped_va_node *mvn_head;
> > > > +    struct page_info *tmp_page;
> > > > +    struct mapped_va_node *tmp_mvn;
> > > > +    vaddr_t gma_start = 0;
> > > > +    vaddr_t gma_end = 0;
> > > > +
> > > > +    /* this hypercall is called from domain 0, and we don't know
> > > > + which
> > > guest's
> > > > +     * vlpt is mapped in xen_second, so, to be sure, we restore
> > > > + vlpt
> > > here */
> > > > +    restore_vlpt(d);
> > >
> > > AFAICT you don't actually use the vlpt here, just the domains list
> > > of dirty addresses (which as above could become a bitmap)
> >
> > I think vlpt is still needed because at this stage, we have to reset
> > the write permission of dirtied pages. Maybe some tricks for efficiently
> doing this?
> 
> Ah, of course.
> 
> Do you need to be careful about the context switch of a dom0 vcpu which
> has a foreign vlpt loaded? I suppose you unload it at the end of this
> function so it is safe because Xen doesn't preempt vcpus in hypervisor
> mode.


That is my concern now. For safety, unloading is necessary but it requires
flushing vlpt which is not a cheap operation.

Actually, there is one more concern:
Since we only context switch the vlpt area with 'migrating domains' when two
domains (one migrating and the other not) are context switched, the
non-migrating
domain have access to the foreign vlpt. In order to prevent the access of
foreign vlpt area, do you think unloading of vlpt when switching into
non-migrating
domains also necessary?

> 
> > >
> > > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > > +    vaddr_t ram_base;
> > > > +    int i1, i2, i3;
> > > > +    int first_index, second_index, third_index;
> > > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > > +
> > > > +    get_gma_start_end(d, &ram_base, NULL);
> > > > +
> > > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > > +
> > > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > > +
> > > > +    spin_lock(&p2m->lock);
> > > > +
> > > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > > +    {
> > > > +        lpae_walk_t first_pte = first[i1].walk;
> > > > +        if ( !first_pte.valid || !first_pte.table )
> > > > +            goto out;
> > > > +
> > > > +        second = map_domain_page(first_pte.base);
> > > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > > +        {
> > > > +            lpae_walk_t second_pte = second[i2].walk;
> > > > +
> > > > +            /* a nasty hack for vlpt due to the difference
> > > > +             * of page table entry layout between p2m and pt */
> > > > +            second[i2].pt.ro = 0;
> > >
> > > What is the hack here?
> > >
> > > The issue is that the p2m second level, which is a table entry in
> > > the p2m is installed into the Xen page tables where it ends up the
> > > third level, treated as a block entry.
> > >
> > > That's OK, because for both PT and P2M bits 2..10 are ignored for
> > > table entries. So I think rather than this hack here we should
> > > simply ensure that our non-leaf p2m's have the correct bits in them
> > > to be used as p2m table entries.
> >
> > Oh, I see. So, should I put something like this in create_p2m_entries
> function?
> 
> Probably p2m_create_table or maybe mfn_to_p2m entry would be the right
> place.

OK.

> 
> Ian.
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.