
Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing



On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:

> > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > >  }
> > >
> > > +/* routine for dirty-page tracing
> > > + *
> > > + * On first write, it page faults, its entry is changed to read-write,
> > > + * and on retry the write succeeds.
> > > + *
> > > + * for locating p2m of the faulting entry, we use virtual-linear page
> > > + * table.
> > > + */
> > > +void handle_page_fault(struct domain *d, paddr_t addr)
> > > +{
> > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> > 
> > I think you need to distinguish p2m entries which are r/o due to log dirty
> > from those which are r/o due to wanting them to be r/o.
> > 
> > Maybe we don't have the latter right now? We will eventually need p2m
> > types I think.
> 
> Yes, that should be distinguished. Actually, that was on my mind and
> apparently I forgot to do it. For this, I'm thinking of using an unused
> field in the pte as an indicator of 'wanting to be r/o' and checking that
> indicator on a permission fault.

Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
discussing this before, but perhaps it wasn't with you!
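
As a very rough sketch of what I mean (the type names here are purely
illustrative, not a proposal for what we'd actually call them):

    /* Illustrative p2m types kept in the 4 "avail" bits of an LPAE
     * p2m entry; the hardware walker ignores these bits. */
    typedef enum {
        p2m_invalid = 0,
        p2m_ram_rw,        /* normal read/write RAM */
        p2m_ram_ro,        /* intentionally read-only */
        p2m_ram_logdirty,  /* read-only only while log-dirty is active */
    } p2m_type_t;

    static inline void p2m_set_type(lpae_t *pte, p2m_type_t t)
    {
        pte->p2m.avail = t;
    }

    static inline p2m_type_t p2m_get_type(const lpae_t *pte)
    {
        return (p2m_type_t)pte->p2m.avail;
    }

handle_page_fault() would then only restore write access when
p2m_get_type() returns p2m_ram_logdirty, and would leave genuinely
read-only mappings alone.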

> 
> > 
> > > +    {
> > > +        lpae_t pte = *vlp2m_pte;
> > > +        pte.p2m.write = 1;
> > > +        write_pte(vlp2m_pte, pte);
> > > +        flush_tlb_local();
> > > +
> > > +        /* in order to remove mappings in cleanup stage */
> > > +        add_mapped_vaddr(d, addr);
> > 
> > No locks or atomic operations here? How are races with the tools reading
> > the dirty bitmap addressed?  If it is by clever ordering of the checks and
> > pte writes then I think a comment would be in order.
> 
> Actually, the lock is inside the add_mapped_vaddr function. 

But that only covers the bitmap update, not the pte frobbing.
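
As a sketch, something along these lines would put both updates under
one lock (the lock name is made up; it would be whatever lock
get_dirty_bitmap() ends up taking on the toolstack side):

    void handle_page_fault(struct domain *d, paddr_t addr)
    {
        lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);

        /* Serialise against the toolstack reading and clearing the
         * dirty state, so a fault cannot slip in between the read
         * and the re-protection of the page. */
        spin_lock(&d->arch.dirty.lock);

        if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
        {
            lpae_t pte = *vlp2m_pte;
            pte.p2m.write = 1;
            write_pte(vlp2m_pte, pte);
            flush_tlb_local();

            /* in order to remove mappings in cleanup stage */
            add_mapped_vaddr(d, addr); /* now called with the lock held */
        }

        spin_unlock(&d->arch.dirty.lock);
    }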

> 
> > 
> > > +    }
> > > +}
> > > +
> > >  /*
> > >   * Local variables:
> > >   * mode: C
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > index 2d09fef..3b2b11a 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -6,6 +6,8 @@
> > >  #include <xen/bitops.h>
> > >  #include <asm/flushtlb.h>
> > >  #include <asm/gic.h>
> > > +#include <xen/guest_access.h>
> > > +#include <xen/pfn.h>
> > >
> > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)
> > >  {
> > > @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
> > >      return p >> PAGE_SHIFT;
> > >  }
> > >
> > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> > > +                        / sizeof(unsigned long)
> > > +
> > > +/* An array-based linked list for storing virtual addresses (i.e., the
> > > + * 3rd level table mapping) that should be destroyed after live
> > > + * migration */
> > > +struct mapped_va_node {
> > > +    struct page_info *next;
> > > +    struct mapped_va_node *mvn_next;
> > > +    int items;
> > > +    unsigned long vaddrs[MAX_VA_PER_NODE];
> > > +};
> > > +
> > > +int add_mapped_vaddr(struct domain *d, unsigned long va)
> > > +{
> > 
> > This function seems awfully expensive (it contains allocations etc.) for
> > a function called on every page fault during a migration.
> > 
> > Why are you remembering the full va of every fault when a single bit per
> > page would do?
> > 
> > I think you can allocate a bitmap when logdirty is enabled. At a bit per
> > page it shouldn't be too huge.
> > 
> > You might be able to encode this in the p2m which you walk when the tools
> > ask for the bit map (i.e. p2m.type==writeable p2m.write==0 => dirty), but
> > I think a bitmap would do the trick and be easier to implement.
> 
> There are three candidates:
> 
> 1) bitmap
> 2) encoding into p2m
> 3) linked list of VAs. 
> 
> And two functions for dirty-page tracing:
> 
> A) dirty-page handling (at trap time)
> B) getting the dirty bitmap (for the toolstack)
> 
> 1) and 2) have an advantage for A) but have to search the full range of
> pages at B) to find out which pages are dirtied and to set those pages
> read-only again for the next dirty detection. On the other hand, 3) does
> not need to search the full range at B). I prefer 3) due to the 'non-full
> range search', but I understand your concern. Maybe I can do some
> measurements for each method to see which one is better.

If you have a bitmap then you can do a scan for each set bit, and the
offset of each bit corresponds to the guest pfn, which allows you to
directly calculate the address of the pte in the vlpt.

I don't think doing a scan for each set bit is going to compare
unfavourably to walking a linked list. Certainly not once the memory
allocator gets involved.
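
For illustration, B) over a bitmap could look something like this
(assuming a per-domain dirty_bitmap allocated when log-dirty is
enabled, with one bit per guest page starting at ram_base;
find_next_bit() is the usual bitops helper):

    static void sync_and_clear_dirty(struct domain *d,
                                     unsigned long *dirty_bitmap,
                                     unsigned long nr_pages,
                                     vaddr_t ram_base)
    {
        unsigned long i = find_next_bit(dirty_bitmap, nr_pages, 0);

        while ( i < nr_pages )
        {
            /* The bit offset is the guest pfn, which gives us the
             * vlpt pte address directly -- no tree walk needed. */
            paddr_t gma = ram_base + ((paddr_t)i << PAGE_SHIFT);
            lpae_t *ptep = get_vlpt_3lvl_pte(gma);
            lpae_t pte = *ptep;

            pte.p2m.write = 0;          /* re-arm dirty tracking */
            write_pte(ptep, pte);

            clear_bit(i, dirty_bitmap);
            i = find_next_bit(dirty_bitmap, nr_pages, i + 1);
        }

        flush_tlb_local(); /* one flush at the end, as the fault path does */
    }

Handing the bitmap to the toolstack before clearing it is then a
straightforward nr_pages/8-byte copy.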

> > > +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr
> > > + * and also clear the mapped vaddrs linked list */
> > > +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[])
> > > +{
> > > +    struct page_info *page_head;
> > > +    struct mapped_va_node *mvn_head;
> > > +    struct page_info *tmp_page;
> > > +    struct mapped_va_node *tmp_mvn;
> > > +    vaddr_t gma_start = 0;
> > > +    vaddr_t gma_end = 0;
> > > +
> > > +    /* this hypercall is called from domain 0, and we don't know which
> > > +     * guest's vlpt is mapped in xen_second, so, to be sure, we restore
> > > +     * vlpt here */
> > > +    restore_vlpt(d);
> > 
> > AFAICT you don't actually use the vlpt here, just the domains list of
> > dirty addresses (which as above could become a bitmap)
> 
> I think the vlpt is still needed because at this stage we have to reset
> the write permission of the dirtied pages. Maybe there are some tricks for
> doing this efficiently?

Ah, of course.

Do you need to be careful about the context switch of a dom0 vcpu which
has a foreign vlpt loaded? I suppose you unload it at the end of this
function so it is safe because Xen doesn't preempt vcpus in hypervisor
mode.

> > 
> > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > +    vaddr_t ram_base;
> > > +    int i1, i2, i3;
> > > +    int first_index, second_index, third_index;
> > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > +
> > > +    get_gma_start_end(d, &ram_base, NULL);
> > > +
> > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > +
> > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > +
> > > +    spin_lock(&p2m->lock);
> > > +
> > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > +    {
> > > +        lpae_walk_t first_pte = first[i1].walk;
> > > +        if ( !first_pte.valid || !first_pte.table )
> > > +            goto out;
> > > +
> > > +        second = map_domain_page(first_pte.base);
> > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > +        {
> > > +            lpae_walk_t second_pte = second[i2].walk;
> > > +
> > > +            /* a nasty hack for vlpt due to the difference
> > > +             * of page table entry layout between p2m and pt */
> > > +            second[i2].pt.ro = 0;
> > 
> > What is the hack here?
> > 
> > The issue is that the p2m second level, which is a table entry in the
> > p2m, is installed into the Xen page tables, where it ends up at the
> > third level, treated as a block entry.
> > 
> > That's OK, because for both PT and P2M bits 2..10 are ignored for table
> > entries. So I think rather than this hack here we should simply ensure
> > that our non-leaf p2m's have the correct bits in them to be used as p2m
> > table entries.
> 
> Oh, I see. So, should I put something like this in the create_p2m_entries
> function?

Probably p2m_create_table or maybe mfn_to_p2m_entry would be the right
place.
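
E.g. something like this (sketch only, modelled on the current
mfn_to_p2m_entry(); the helper name is made up): give non-leaf entries
their own constructor that leaves bits 2..10 clear, so the same
descriptor is also well-formed when the vlpt reinterprets it through
the PT layout:

    /* Hypothetical constructor for non-leaf p2m entries.  The
     * hardware ignores bits 2..10 in table descriptors, so leaving
     * them zero costs nothing for the p2m walk, and it means
     * pt.ro == 0 when the same bits are read through the PT layout,
     * which removes the need for the second[i2].pt.ro = 0 hack. */
    static lpae_t mfn_to_p2m_table_entry(unsigned long mfn)
    {
        paddr_t pa = ((paddr_t)mfn) << PAGE_SHIFT;
        lpae_t e = (lpae_t) {
            .p2m.valid = 1,
            .p2m.table = 1,
        };

        ASSERT(!(pa & ~PADDR_MASK));
        e.bits |= pa;

        return e;
    }

p2m_create_table() would then use this instead of the leaf-entry
helper when installing the freshly allocated table.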

Ian.

