
Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for dirty page tracing



> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxx [mailto:xen-devel-
> bounces@xxxxxxxxxxxxx] On Behalf Of Ian Campbell
> Sent: Thursday, October 17, 2013 7:30 PM
> To: Jaeyong Yoo
> Cc: 'Stefano Stabellini'; xen-devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [PATCH v4 8/9] xen/arm: Implement hypercall for
> dirty page tracing
> 
> On Thu, 2013-10-17 at 19:12 +0900, Jaeyong Yoo wrote:
> 
> > > > @@ -1460,6 +1459,28 @@ void cleanup_vlpt(struct domain *d)
> > > >      unmap_domain_page_global(d->arch.dirty.second_lvl);
> > > >  }
> > > >
> > > > +/* routine for dirty-page tracing
> > > > + *
> > > > + * On first write, it page faults, its entry is changed to read-write,
> > > > + * and on retry the write succeeds.
> > > > + *
> > > > + * for locating p2m of the faulting entry, we use virtual-linear page table.
> > > > + */
> > > > +void handle_page_fault(struct domain *d, paddr_t addr)
> > > > +{
> > > > +    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);
> > > > +    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
> > >
> > > I think you need to distinguish p2m entries which are r/o due to log
> > > dirty from those which are r/o due to wanting them to be r/o.
> > >
> > > Maybe we don't have the latter right now? We will eventually need
> > > p2m types I think.
> >
> > Yes, that should be distinguished. Actually, that was in my mind and
> > apparently I failed to remember. For doing this, I'm thinking of using an
> > unused field in the pte as an indicator of 'wanting to be r/o' and checking
> > this indicator on the permission fault.
> 
> Yes, I think we can use the 4 avail bits to store a p2m type. ISTR
> discussing this before, but perhaps it wasn't with you!
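Right. To make sure we mean the same thing, this is roughly what I have in
mind (the enum and the use of the 'avail' field below are only illustrative,
nothing like this exists in the tree yet):

/* Illustrative sketch only: the 4 software-available bits of the p2m entry
 * hold a small type, and the permission fault handler only relaxes entries
 * that are read-only because of log-dirty. */
enum p2m_entry_type {
    p2m_entry_rw = 0,        /* normal RAM, read-write */
    p2m_entry_ro,            /* meant to stay read-only */
    p2m_entry_logdirty,      /* read-only purely for dirty tracing */
};

void handle_page_fault(struct domain *d, paddr_t addr)
{
    lpae_t *vlp2m_pte = get_vlpt_3lvl_pte(addr);

    if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 &&
         vlp2m_pte->p2m.avail == p2m_entry_logdirty )
    {
        lpae_t pte = *vlp2m_pte;
        pte.p2m.write = 1;
        write_pte(vlp2m_pte, pte);
        flush_tlb_local();

        /* in order to remove mappings in cleanup stage */
        add_mapped_vaddr(d, addr);
    }
    /* a genuinely read-only page falls through and is handled as a
     * normal permission fault */
}
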
> 
> >
> > >
> > > > +    {
> > > > +        lpae_t pte = *vlp2m_pte;
> > > > +        pte.p2m.write = 1;
> > > > +        write_pte(vlp2m_pte, pte);
> > > > +        flush_tlb_local();
> > > > +
> > > > +        /* in order to remove mappings in cleanup stage */
> > > > +        add_mapped_vaddr(d, addr);
> > >
> > > No locks or atomic operations here? How are races with the tools
> > > reading the dirty bitmap addressed?  If it is by clever ordering of
> > > the checks and pte writes then I think a comment would be in order.
> >
> > Actually, the lock is inside the add_mapped_vaddr function.
> 
> But that only covers the bitmap update, not the pte frobbing.

I also take the same lock in get_dirty_bitmap(), which is the function that
reads the bitmap.
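
If that is not enough, I can widen the critical section so that the pte
update is covered as well; roughly like this (the lock name below is
invented, it stands for whatever lock add_mapped_vaddr() and
get_dirty_bitmap() share today):

/* fault side (inside handle_page_fault()): keep the pte upgrade inside the
 * same critical section that get_dirty_bitmap() takes, so the tools can
 * never re-protect pages while a permission fault is half-way through
 * upgrading an entry */
spin_lock(&d->arch.dirty.lock);            /* illustrative name */

if ( vlp2m_pte->p2m.valid && vlp2m_pte->p2m.write == 0 )
{
    lpae_t pte = *vlp2m_pte;
    pte.p2m.write = 1;
    write_pte(vlp2m_pte, pte);
    flush_tlb_local();

    add_mapped_vaddr(d, addr);             /* would drop its internal lock */
}

spin_unlock(&d->arch.dirty.lock);
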

> 
> >
> > >
> > > > +    }
> > > > +}
> > > > +
> > > >  /*
> > > >   * Local variables:
> > > >   * mode: C
> > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > > index 2d09fef..3b2b11a 100644
> > > > --- a/xen/arch/arm/p2m.c
> > > > +++ b/xen/arch/arm/p2m.c
> > > > @@ -6,6 +6,8 @@
> > > >  #include <xen/bitops.h>
> > > >  #include <asm/flushtlb.h>
> > > >  #include <asm/gic.h>
> > > > +#include <xen/guest_access.h>
> > > > +#include <xen/pfn.h>
> > > >
> > > >  void dump_p2m_lookup(struct domain *d, paddr_t addr)
> > > >  {
> > > > @@ -408,6 +410,304 @@ unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn)
> > > >      return p >> PAGE_SHIFT;
> > > >  }
> > > >
> > > > +#define MAX_VA_PER_NODE (PAGE_SIZE - 2 * sizeof(void *) - sizeof(int))\
> > > > +                        / sizeof(unsigned long)
> > > > +
> > > > +/* An array-based linked list for storing virtual addresses (i.e., the 3rd
> > > > + * level table mapping) that should be destroyed after live migration */
> > > > +struct mapped_va_node {
> > > > +    struct page_info *next;
> > > > +    struct mapped_va_node *mvn_next;
> > > > +    int items;
> > > > +    unsigned long vaddrs[MAX_VA_PER_NODE];
> > > > +};
> > > > +
> > > > +int add_mapped_vaddr(struct domain *d, unsigned long va)
> > > > +{
> > >
> > > This function seems awfully expensive (contains allocations etc.) for
> > > a function called on every page fault during a migration.
> > >
> > > Why are you remembering the full va of every fault when a single bit
> > > per page would do?
> > >
> > > I think you can allocate a bitmap when logdirty is enabled. At a bit
> > > per page it shouldn't be too huge.
> > >
> > > You might be able to encode this in the p2m which you walk when the
> > > tools ask for the bit map (i.e. p2m.type==writeable p2m.write==0 =>
> > > dirty), but I think a bitmap would do the trick and be easier to
> > > implement.
> >
> > There are three candidates:
> >
> > 1) bitmap
> > 2) encoding into p2m
> > 3) linked list of VAs.
> >
> > And, two functions for dirty-page tracing:
> >
> > A) dirty-page handling (At trap)
> > B) getting dirty bitmap (for toolstack)
> >
> > 1) and 2) have an advantage for A) but have to search the full range of
> > pages at B) to find out which pages are dirtied and set them back to
> > read-only for the next round of dirty detection. On the other hand, 3)
> > does not need to search the full range at B).
> > I prefer 3) because of the 'non-full-range search', but I understand your
> > concern. Maybe I can do some measurements for each method to see which
> > one is better.
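> >
> > For reference, the bitmap variant 1) I would measure against is roughly
> > the following (all the dirty.* field names are invented for the sketch):
> >
> > /* one bit per guest page, allocated when log-dirty is enabled */
> > static int dirty_bitmap_init(struct domain *d)
> > {
> >     vaddr_t gma_start = 0, gma_end = 0;
> >     unsigned long pages;
> >
> >     get_gma_start_end(d, &gma_start, &gma_end);
> >     pages = (gma_end - gma_start) >> PAGE_SHIFT;
> >
> >     d->arch.dirty.bitmap_pages = pages;                  /* invented */
> >     d->arch.dirty.bitmap = xzalloc_array(unsigned long,
> >                                          BITS_TO_LONGS(pages));
> >     return d->arch.dirty.bitmap ? 0 : -ENOMEM;
> > }
> >
> > /* in handle_page_fault(), instead of add_mapped_vaddr(d, addr) */
> > static inline void mark_dirty(struct domain *d, paddr_t addr,
> >                               vaddr_t gma_start)
> > {
> >     /* atomic, so concurrent faults on different vcpus are fine */
> >     set_bit((addr - gma_start) >> PAGE_SHIFT, d->arch.dirty.bitmap);
> > }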
> 
> If you have a bitmap then you can do a scan for each set bit, and the
> offset of each bit corresponds to the guest pfn, which allows you to
> directly calculate the address of the pte in the vlpt.
> 
> I don't think doing a scan for each set bit is going to compare
> unfavourably to walking a linked list. Certainly not once the memory
> allocator gets involved.

Oh, do you mean using something like "find_next_zero_bit"? At first I
thought this would mean searching bit by bit, but I have just looked at
findbit.S and it is quite well optimized.
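
So the reader side would become roughly this (find_next_bit() in this case,
since the dirty bits are the set ones; the dirty.* names are again just
illustrative), with each set bit re-protecting its page through the vlpt
directly:

/* the bit offset is the pfn offset from the start of guest RAM, which
 * indexes the vlpt directly, so each dirtied page can be re-protected
 * without walking the p2m */
static void harvest_dirty_pages(struct domain *d)
{
    vaddr_t gma_start = 0;
    unsigned long nr = d->arch.dirty.bitmap_pages;         /* invented */
    unsigned long i;

    get_gma_start_end(d, &gma_start, NULL);

    for ( i = find_next_bit(d->arch.dirty.bitmap, nr, 0);
          i < nr;
          i = find_next_bit(d->arch.dirty.bitmap, nr, i + 1) )
    {
        lpae_t *ptep = get_vlpt_3lvl_pte(gma_start +
                                         ((paddr_t)i << PAGE_SHIFT));
        lpae_t pte = *ptep;

        pte.p2m.write = 0;          /* trap the next write again */
        write_pte(ptep, pte);
    }
    flush_tlb_local();              /* something broader may be needed */

    /* ... copy the bitmap out to the tools and clear it ... */
}
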

> 
> > > > +/* get the dirty bitmap from the linked list stored by add_mapped_vaddr
> > > > + * and also clear the mapped vaddrs linked list */
> > > > +static void get_dirty_bitmap(struct domain *d, uint8_t *bitmap[])
> > > > +{
> > > > +    struct page_info *page_head;
> > > > +    struct mapped_va_node *mvn_head;
> > > > +    struct page_info *tmp_page;
> > > > +    struct mapped_va_node *tmp_mvn;
> > > > +    vaddr_t gma_start = 0;
> > > > +    vaddr_t gma_end = 0;
> > > > +
> > > > +    /* this hypercall is called from domain 0, and we don't know which guest's
> > > > +     * vlpt is mapped in xen_second, so, to be sure, we restore vlpt here */
> > > > +    restore_vlpt(d);
> > >
> > > AFAICT you don't actually use the vlpt here, just the domain's list
> > > of dirty addresses (which, as above, could become a bitmap).
> >
> > I think the vlpt is still needed because at this stage we have to reset
> > the write permission of the dirtied pages. Maybe there is some trick for
> > doing this efficiently?
> 
> Ah, of course.
> 
> Do you need to be careful about the context switch of a dom0 vcpu which
> has a foreign vlpt loaded? I suppose you unload it at the end of this
> function so it is safe because Xen doesn't preempt vcpus in hypervisor
> mode.


That is my concern now. For safety, unloading is necessary, but it requires
flushing the vlpt, which is not a cheap operation.

Actually, there is one more concern: since we only context switch the vlpt
area for 'migrating domains', when two domains (one migrating and the other
not) are context switched, the non-migrating domain has access to the
foreign vlpt. In order to prevent access to the foreign vlpt area, do you
think unloading the vlpt when switching into a non-migrating domain is also
necessary?
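
In other words, something like this in the context switch path
(is_dirty_tracking() and vlpt_unload() do not exist; they are just shorthand
for the check and the unmap+flush we are discussing, and d_p/d_n stand for
the previous and next domains):

/* sketch only: when leaving a migrating domain for a non-migrating one,
 * drop the vlpt mapping so the next domain can never touch a foreign vlpt */
if ( is_dirty_tracking(d_n) )
    restore_vlpt(d_n);           /* map the next domain's own vlpt */
else if ( is_dirty_tracking(d_p) )
    vlpt_unload();               /* unmap + flush the foreign vlpt */
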

> 
> > >
> > > > +    struct p2m_domain *p2m = &d->arch.p2m;
> > > > +    vaddr_t ram_base;
> > > > +    int i1, i2, i3;
> > > > +    int first_index, second_index, third_index;
> > > > +    lpae_t *first = __map_domain_page(p2m->first_level);
> > > > +    lpae_t pte, *second = NULL, *third = NULL;
> > > > +
> > > > +    get_gma_start_end(d, &ram_base, NULL);
> > > > +
> > > > +    first_index = first_table_offset((uint64_t)ram_base);
> > > > +    second_index = second_table_offset((uint64_t)ram_base);
> > > > +    third_index = third_table_offset((uint64_t)ram_base);
> > > > +
> > > > +    BUG_ON( !first && "Can't map first level p2m." );
> > > > +
> > > > +    spin_lock(&p2m->lock);
> > > > +
> > > > +    for ( i1 = first_index; i1 < LPAE_ENTRIES*2; ++i1 )
> > > > +    {
> > > > +        lpae_walk_t first_pte = first[i1].walk;
> > > > +        if ( !first_pte.valid || !first_pte.table )
> > > > +            goto out;
> > > > +
> > > > +        second = map_domain_page(first_pte.base);
> > > > +        BUG_ON( !second && "Can't map second level p2m.");
> > > > +        for ( i2 = second_index; i2 < LPAE_ENTRIES; ++i2 )
> > > > +        {
> > > > +            lpae_walk_t second_pte = second[i2].walk;
> > > > +
> > > > +            /* a nasty hack for vlpt due to the difference
> > > > +             * of page table entry layout between p2m and pt */
> > > > +            second[i2].pt.ro = 0;
> > >
> > > What is the hack here?
> > >
> > > The issue is that the p2m second level, which is a table entry in
> > > the p2m, is installed into the Xen page tables, where it ends up at the
> > > third level, treated as a block entry.
> > >
> > > That's OK, because for both PT and P2M bits 2..10 are ignored for
> > > table entries. So I think rather than this hack here we should
> > > simply ensure that our non-leaf p2m's have the correct bits in them
> > > to be used as p2m table entries.
> >
> > Oh, I see. So, should I put something like this in the create_p2m_entries
> > function?
> 
> Probably p2m_create_table() or maybe mfn_to_p2m_entry() would be the right
> place.

OK.
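
Something like this in p2m_create_table(), I guess (a sketch against the
current function, untested; the added comment spells out why that bit is the
one to clear):

static int p2m_create_table(struct domain *d, lpae_t *entry)
{
    struct p2m_domain *p2m = &d->arch.p2m;
    struct page_info *page;
    void *p;
    lpae_t pte;

    BUG_ON(entry->p2m.valid);

    page = alloc_domheap_page(NULL, 0);
    if ( page == NULL )
        return -ENOMEM;

    page_list_add(page, &p2m->pages);

    p = __map_domain_page(page);
    clear_page(p);
    unmap_domain_page(p);

    pte = mfn_to_p2m_entry(page_to_mfn(page), MATTR_MEM);

    /* bit 7 is ignored by the stage-2 walk for table entries, but it is
     * AP[2] (read-only) when the same entry is reused as a stage-1 page
     * entry by the vlpt, so clear it here instead of patching it up in
     * get_dirty_bitmap() */
    pte.pt.ro = 0;

    write_pte(entry, pte);

    return 0;
}
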

> 
> Ian.
> 
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

