Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing
On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:
> > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > >  }
> > >
> > > +/* routine for dirty-page tracing
> > > + *
> > > + * On first write, it page faults, its entry is changed to read-write,
> > > + * and on retry the write succeeds.
> > > + *
> > > + * for locating p2m of the faulting entry, we use virtual-linear page table.
> > > + */
> > > +void handle_page_fault(struct domain *d, paddr_t addr)
> > > +{
> > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> >
> > I think you need to distinguish p2m entries which are r/o due to log dirty
> > from those which are r/o due to wanting them to be r/o.
> >
> > Maybe we don't have the latter right now? We will eventually need p2m
> > types I think.
>
> Yes, that should be distinguished. Actually, that was in my mind and
> apparently I failed to remember. To do this, I'm thinking of using an
> unused field in the pte as an indicator of 'wanting to be r/o' and
> checking this indicator on a permission fault.

Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
discussing this before, but perhaps it wasn't with you!
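A rough sketch of that idea, where the p2m_type_t values, the avail field
name and mark_dirty() are illustrative assumptions rather than anything the
final patch has to use:

typedef enum {
    p2m_invalid = 0,
    p2m_ram_rw,        /* ordinary read/write RAM */
    p2m_ram_logdirty,  /* write-protected only for dirty-page tracing */
    p2m_ram_ro,        /* genuinely read-only */
} p2m_type_t;

void handle_page_fault(struct domain *d, paddr_t addr)
{
    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
    lpae_t pte = *vlp2m_pte;

    /* Only relax the permission when the entry is read-only because of
     * log-dirty; a page which is meant to be r/o keeps faulting. */
    if ( pte.p2m.valid && pte.p2m.write == 0 &&
         pte.p2m.avail == p2m_ram_logdirty )
    {
        pte.p2m.write = 1;
        write_pte(vlp2m_pte, pte);
        flush_tlb_local();

        mark_dirty(d, addr);   /* e.g. set a bit in a per-domain bitmap */
    }
}

That way the permission fault path only touches pages which log-dirty itself
write-protected, and everything else keeps trapping as before.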
> > > +    {
> > > +        lpae_t pte = *vlp2m_pte;
> > > +        pte.p2m.write = 1;
> > > +        write_pte(vlp2m_pte, pte);
> > > +        flush_tlb_local();
> > > +
> > > +        /* in order to remove mappings in cleanup stage */
> > > +        add_mapped_vaddr(d, addr);
> >
> > No locks or atomic operations here? How are races with the tools reading
> > the dirty bitmap addressed? If it is by clever ordering of the checks and
> > pte writes then I think a comment would be in order.
>
> Actually, the lock is inside the add_mapped_vaddr function.

But that only covers the bitmap update, not the pte frobbing.

> > > +    }
> > > +}
> > > +
> > >  /*
> > >   * Local variables:
> > >   * mode: C
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > index 2d09fef..3b2b11a 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -6,6 +6,8 @@
> > >  #include <xen/bitops.h>
> > >  #include <asm/flushtlb.h>
> > >  #include <asm/gic.h>
> > > +#include <xen/guest_access.h>
> > > +#include <xen/pfn.h>
> > >
> > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)
> > >  {
> > > @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
> > >      return p >> PAGE_SHIFT;
> > >  }
> > >
> > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> > > +                        / sizeof(unsigned long)
> > > +
> > > +/* An array-based linked list for storing virtual addresses (i.e., the 3rd
> > > + * level table mapping) that should be destroyed after live migration */
> > > +struct mapped_va_node
> > > +{
> > > +    struct page_info *next;
> > > +    struct mapped_va_node *mvn_next;
> > > +    int items;
> > > +    unsigned long vaddrs[MAX_VA_PER_NODE];
> > > +};
> > > +
> > > +int add_mapped_vaddr(struct domain *d, unsigned long va)
> > > +{
> >
> > This function seems awfully expensive (contains allocations etc) for a
> > function called on every page fault during a migration.
> >
> > Why are you remembering the full va of every fault when a single bit per
> > page would do?
> >
> > I think you can allocate a bitmap when logdirty is enabled. At a bit per
> > page it shouldn't be too huge.
> >
> > You might be able to encode this in the p2m which you walk when the tools
> > ask for the bitmap (i.e. p2m.type==writeable p2m.write==0 => dirty), but
> > I think a bitmap would do the trick and be easier to implement.
>
> There are three candidates:
>
> 1) bitmap
> 2) encoding into p2m
> 3) linked list of VAs
>
> And, two functions for dirty-page tracing:
>
> A) dirty-page handling (at trap)
> B) getting the dirty bitmap (for the toolstack)
>
> 1) and 2) have an advantage for A) but have to search the full range of
> pages at B) to find out which pages are dirtied and set those pages as
> read-only again for the next dirty detection. On the other hand, 3) does
> not need to search the full range at B). I prefer 3) because of the
> 'non-full-range search', but I understand your concern. Maybe I can do
> some measurements for each method to see which one is better.

If you have a bitmap then you can do a scan for each set bit, and the
offset of each bit corresponds to the guest pfn, which allows you to
directly calculate the address of the pte in the vlpt.

I don't think doing a scan for each set bit is going to compare
unfavourably to walking a linked list. Certainly not once the memory
allocator gets involved.

> > > +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr
> > > + * and also clear the mapped vaddrs linked list */
> > > +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[])
> > > +{
> > > +    struct page_info *page_head;
> > > +    struct mapped_va_node *mvn_head;
> > > +    struct page_info *tmp_page;
> > > +    struct mapped_va_node *tmp_mvn;
> > > +    vaddr_t gma_start = 0;
> > > +    vaddr_t gma_end = 0;
> > > +
> > > +    /* this hypercall is called from domain 0, and we don't know which guest's
> > > +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
> > > +    restore_vlpt(d);
> >
> > AFAICT you don't actually use the vlpt here, just the domain's list of
> > dirty addresses (which as above could become a bitmap)
>
> I think the vlpt is still needed because at this stage we have to reset
> the write permission of the dirtied pages. Maybe there are some tricks
> for doing this efficiently?

Ah, of course.

Do you need to be careful about the context switch of a dom0 vcpu which
has a foreign vlpt loaded? I suppose you unload it at the end of this
function, so it is safe because Xen doesn't preempt vcpus in hypervisor
mode.
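For instance, a minimal sketch of get_dirty_bitmap() driven by a bitmap
rather than the linked list (d->arch.dirty.bitmap, .nr_bits, .lock and the
tools_bitmap argument are made-up names for illustration; restore_vlpt(),
get_gma_start_end(), get_vlpt_3lvl_pte() and write_pte() are the helpers
already used by the patch):

static void get_dirty_bitmap(struct domain *d, unsigned long *tools_bitmap)
{
    vaddr_t ram_base;
    unsigned long i, nr_bits = d->arch.dirty.nr_bits;

    get_gma_start_end(d, &ram_base, NULL);
    restore_vlpt(d);                 /* map this guest's p2m via the vlpt */

    spin_lock(&d->arch.dirty.lock);

    for ( i = find_first_bit(d->arch.dirty.bitmap, nr_bits);
          i < nr_bits;
          i = find_next_bit(d->arch.dirty.bitmap, nr_bits, i + 1) )
    {
        /* bit offset -> guest address -> 3rd-level pte through the vlpt */
        vaddr_t va = ram_base + ((vaddr_t)i << PAGE_SHIFT);
        lpae_t *ptep = get_vlpt_3lvl_pte(va);
        lpae_t pte = *ptep;

        /* re-protect the page so that the next write faults again */
        pte.p2m.write = 0;
        write_pte(ptep, pte);

        set_bit(i, tools_bitmap);            /* report the page as dirty */
        clear_bit(i, d->arch.dirty.bitmap);  /* reset for the next round */
    }

    spin_unlock(&d->arch.dirty.lock);

    /* one flush after re-protecting everything; which flush is needed
     * (local or domain-wide) depends on the calling context */
    flush_tlb_local();
}

The scan only visits pages which actually faulted, so B) stays cheap, and
the same pass re-arms the write traps, which was the attraction of the
linked list approach.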
> > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > +    vaddr_t ram_base;
> > > +    int i1, i2, i3;
> > > +    int first_index, second_index, third_index;
> > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > +
> > > +    get_gma_start_end(d, &ram_base, NULL);
> > > +
> > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > +
> > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > +
> > > +    spin_lock(&p2m->lock);
> > > +
> > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > +    {
> > > +        lpae_walk_t first_pte = first[i1].walk;
> > > +        if ( !first_pte.valid || !first_pte.table )
> > > +            goto out;
> > > +
> > > +        second = map_domain_page(first_pte.base);
> > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > +        {
> > > +            lpae_walk_t second_pte = second[i2].walk;
> > > +
> > > +            /* a nasty hack for vlpt due to the difference
> > > +             * of page table entry layout between p2m and pt */
> > > +            second[i2].pt.ro = 0;
> >
> > What is the hack here?
> >
> > The issue is that the p2m second level, which is a table entry in the
> > p2m, is installed into the Xen page tables where it ends up at the third
> > level, treated as a block entry.
> >
> > That's OK, because for both PT and P2M bits 2..10 are ignored for table
> > entries. So I think rather than this hack here we should simply ensure
> > that our non-leaf p2m's have the correct bits in them to be used as p2m
> > table entries.
>
> Oh, I see. So, should I put something like this in the create_p2m_entries
> function?

Probably p2m_create_table, or maybe mfn_to_p2m_entry, would be the right
place.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel