Xen project Mailing List

[Xen-devel] Re: RFC Patch for the "x86-64, mm: Put early page table high"

To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>

From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

Date: Fri, 29 Apr 2011 17:33:37 -0400

Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>

Delivery-date: Fri, 29 Apr 2011 14:34:47 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On Fri, Apr 29, 2011 at 01:58:33PM -0700, Jeremy Fitzhardinge wrote: > On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote: > > Without a fix for this 2.6.39-rc5 fails during bootup. > > > > It fails such early, you only Xen telling you that the domain crashed. > > > > There is one patch to fix this, and the last word was here: > > http://marc.info/?i=4DAF0ECB.8060009@xxxxxxxx > > > > But after nothing had been heard from Peter. > > > > So I decided to look at this from a different perspective: why not > > contain the workaround inside Xen early bootup code. > > > > I've tested this code as DomU (1G, 2G, 3G), Dom0 (8G, 4G, 1000M) > > and they all booted fine. Hadn't compile tested it on 32-bit and > > I think it will actually complain about it. Let me fix that. > > > > See attached patch (also present in stable/bug-fixes-for-rc5) which also > > has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top" > > integrated in. > > Well, if we can hide the fix away in our code, then that has obvious > advantages. But I worry that this change is pretty closely dependent on > how the other code works, and would be fragile in the face of further > changes to that code (esp since there's no obvious coupling between the > two, so anyone changing the arch/x86 code wouldn't have any clues to > look at the corresponding Xen code). True. I am hoping to remove this in 2.6.40 though... > > > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c > > index aef7af9..a54c7c2 100644 > > --- a/arch/x86/xen/mmu.c > > +++ b/arch/x86/xen/mmu.c > > @@ -1463,6 +1463,115 @@ static int xen_pgd_alloc(struct mm_struct *mm) > > return ret; > > } > > > > +#ifdef CONFIG_X86_64 > > +static __initdata u64 __last_pgt_set_rw = 0; > > +static __initdata u64 __pgt_buf_start = 0; > > +static __initdata u64 __pgt_buf_end = 0; > > +static __initdata u64 __pgt_buf_top = 0; > > +/* > > + * As a consequence of the commit: > > + * > > + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e > > + * Author: Yinghai Lu <yinghai@xxxxxxxxxx> > > + * Date: Fri Dec 17 16:58:28 2010 -0800 > > + * > > + * x86-64, mm: Put early page table high > > + * > > + * at some point init_memory_mapping is going to reach the pagetable pages > > + * area and map those pages too (mapping them as normal memory that falls > > + * in the range of addresses passed to init_memory_mapping as argument). > > + * Some of those pages are already pagetable pages (they are in the range > > + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and > > + * everything is fine. > > + * Some of these pages are not pagetable pages yet (they fall in the range > > + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they > > + * are going to be mapped RW. When these pages become pagetable pages and > > + * are hooked into the pagetable, xen will find that the guest has already > > + * a RW mapping of them somewhere and fail the operation. > > + * The reason Xen requires pagetables to be RO is that the hypervisor needs > > + * to verify that the pagetables are valid before using them. The > > validation > > + * operations are called "pinning" (more details in arch/x86/xen/mmu.c). > > + * > > + * In order to fix the issue we mark all the pages in the entire range > > + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation > > + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by > > + * init_memory_mapping. Hence the kernel is going to crash as soon as one > > + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those > > + * ranges are RO). > > + * > > + * For this reason, this function is introduced which is called _after_ > > I think name "mark_rw_past_pg" explicitly here, since "this" is somewhat > ambiguous. <nods> > > > + * the init_memory_mapping has completed (in a perfect world we would > > + * call this function from init_memory_mapping, but lets ignore that). > > + * > > + * Because we are called _after_ init_memory_mapping the pgt_buf_[start, > > + * end,top] have all changed to new values (b/c another init_memory_mapping > > + * is called). Hence, the first time we enter this function, we save > > + * away the pgt_buf_start value and update the pgt_buf_[end,top]. > > + * > > + * When we detect that the "old" pgt_buf_start through pgt_buf_end > > + * PFNs have been reserved (so memblock_x86_reserve_range has been called), > > + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top. > > + * > > + * And then we update those "old" pgt_buf_[end|top] with the new ones > > + * so that we can redo this on the next pagetable. > > + */ > > +static __init void mark_rw_past_pgt(unsigned long pfn) { > > + > > + if (pfn && pgt_buf_end > pgt_buf_start) { > > + u64 addr, size; > > + > > + /* Save it away. */ > > + if (!__pgt_buf_start) { > > + __pgt_buf_start = pgt_buf_start; > > + __pgt_buf_end = pgt_buf_end; > > + __pgt_buf_top = pgt_buf_top; > > + return; > > + } > > + /* If we get the range that starts at __pgt_buf_end that means > > + * the range is reserved, and that in 'init_memory_mapping' > > + * the 'memblock_x86_reserve_range' has been called with the > > + * outdated __pgt_buf_start, __pgt_buf_end (the "new" > > + * pgt_buf_[start|end|top] refer now to a new pagetable. > > + * Note: we are called _after_ the pgt_buf_[..] have been > > + * updated.*/ > > + > > + addr = > > memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start), &size, > > PAGE_SIZE); > > + > > + /* Still not reserved, meaning 'memblock_x86_reserve_range' > > + * hasn't been called yet. Update the _end and _top.*/ > > + if (addr == PFN_PHYS(__pgt_buf_start)) { > > + __pgt_buf_end = pgt_buf_end; > > + __pgt_buf_top = pgt_buf_top; > > + return; > > + } > > + > > + /* OK, the area is reserved, meaning it is time for us to > > + * set RW for the old end->top PFNs. */ > > + > > + /* ..unless we had already done this. */ > > + if (__pgt_buf_end == __last_pgt_set_rw) > > + return; > > + > > + addr = PFN_PHYS(__pgt_buf_end); > > + > > + /* set as RW the rest */ > > + printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n", > > + PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top)); > > + > > + while (addr < PFN_PHYS(__pgt_buf_top)) { > > + make_lowmem_page_readwrite(__va(addr)); > > + addr += PAGE_SIZE; > > + } > > + /* And update everything so that we are ready for the next > > + * pagetable (the one created for regions past 4GB) */ > > + __last_pgt_set_rw = __pgt_buf_end; > > + __pgt_buf_start = pgt_buf_start; > > + __pgt_buf_end = pgt_buf_end; > > + __pgt_buf_top = pgt_buf_top; > > + } > > + return; > > +} > > +#endif > > static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd) > > { > > #ifdef CONFIG_X86_64 > > @@ -1488,6 +1597,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t > > pte) > > { > > unsigned long pfn = pte_pfn(pte); > > > > + mark_rw_past_pgt(pfn); > > /* > > * If the new pfn is within the range of the newly allocated > > * kernel pagetable, and it isn't being mapped into an > > @@ -1495,7 +1605,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t > > pte) > > * it is RO. > > */ > > if (((!is_early_ioremap_ptep(ptep) && > > - pfn >= pgt_buf_start && pfn < pgt_buf_end)) || > > + pfn >= pgt_buf_start && pfn < pgt_buf_top)) || > > (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - > > 1))) > > pte = pte_wrprotect(pte); > > > > @@ -1997,6 +2107,8 @@ __init void xen_ident_map_ISA(void) > > > > static __init void xen_post_allocator_init(void) > > { > > + mark_rw_past_pgt(1); > > Why '1'? Perhaps this should be a different function rather than > overloading a single one? The 'mask_rw_pte' gets called quite often during startup. And a lot of times for the pte(0), so that was an optimization to not do the memblock search. But that can as well be expressed in 'mask_rw_pte' by if (pfn) mark_rw_past_pgt(); Let me do that. > > J _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.