[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] Re: [PATCH 1/2] xen/mmu: Add workaround "x86-64, mm: Put early page table high"



On Tue, May 03, 2011 at 02:55:27AM +0200, Daniel Kiper wrote:
> On Mon, May 02, 2011 at 01:22:21PM -0400, Konrad Rzeszutek Wilk wrote:
> > As a consequence of the commit:
> >
> > commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > Author: Yinghai Lu <yinghai@xxxxxxxxxx>
> > Date:   Fri Dec 17 16:58:28 2010 -0800
> >
> >     x86-64, mm: Put early page table high
> >
> > it causes the Linux kernel to crash under Xen:
> >
> > mapping kernel into physical memory
> > Xen: setup ISA identity maps
> > about to get started...
> > (XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) 
> > for mfn b1d89 (pfn bacf7)
> > (XEN) mm.c:3027:d0 Error while pinning mfn b1d89
> > (XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 
> > [ec=0000]
> > (XEN) domain_crash_sync called from entry.S
> > (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> > ...
> 
> I was hit by this bug when I was working on memory hotplug.
> After some investigation I found myself above mentioned patch
> as a guilty and later I discovered that you are working on that
> issue. I have tested your patch and discoverd some issues with it.
> First of all it has compilation issues on gcc version 4.1.2 20061115
> (prerelease) (Debian 4.1.1-21). Details below.
> 
> Additionlly, I think that your patch does not work as you expected.
> I found that git commit 24bdb0b62cc82120924762ae6bc85afc8c3f2b26
> (xen: do not create the extra e820 region at an addr lower than 4G)
> do this work (to some extent). When this patch is removed domU
> is crashing with following error:

Which is "this patch" ? The 24bdb0b62cc82120924762ae6bc85afc8c3f2b26?

> 
> (early) Linux version 2.6.39-rc5-x86_64.xenU.all.r0+ (root@dev-00) (gcc 
> version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #5 SMP Tue May 3 
> 01:43:26 CEST 2011
> (early) Command line: root=/dev/xvda debug earlyprintk=xen noapic nolapic 
> console=hvc0
> (early) ACPI in unprivileged domain disabled
> (early) released 0 pages of unused memory
> (early) Set 0 page(s) to 1-1 mapping.
> (early) BIOS-provided physical RAM map:
> (early)  Xen: 0000000000000000 - 00000000000a0000 (usable)
> (early)  Xen: 00000000000a0000 - 0000000000100000 (reserved)
> (early)  Xen: 0000000000100000 - 0000000026000000 (usable)
> (early) bootconsole [xenboot0] enabled
> (early) NX (Execute Disable) protection: active
> (early) DMI not present or invalid.
> (early) e820 update range: 0000000000000000 - 0000000000010000 (early) 
> (usable)(early)  ==> (early) (reserved)(early)
> (early) e820 remove range: 00000000000a0000 - 0000000000100000 (early) 
> (usable)(early)
> (early) No AGP bridge found
> (early) last_pfn = 0x26000 max_arch_pfn = 0x400000000
> (early) initial memory mapped : 0 - 01693000
> (early) Base memory trampoline at [ffff88000009e000] 9e000 size 8192
> (early) init_memory_mapping: 0000000000000000-0000000026000000
> (early)  0000000000 - 0026000000 page 4k
> (early) kernel direct mapping tables up to 26000000 @ 256ce000-25800000
> (early) BUG: unable to handle kernel (early) NULL pointer dereference(early)  
> at           (null)
> (early) IP:(early)  [<ffffffff814a33f2>] find_range_array+0x4e/0x57
> (early) PGD 0 (early)
> (early) Oops: 0003 [#1] (early) SMP (early)
> (early) last sysfs file:
> (early) CPU 0 (early)
> (early) Modules linked in:(early)
> (early)
> (early) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-x86_64.xenU.all.r0+ 
> #5(early)   (early)
> (early) RIP: e030:[<ffffffff814a33f2>] (early)  [<ffffffff814a33f2>] 
> find_range_array+0x4e/0x57
> (early) RSP: e02b:ffffffff81427e58  EFLAGS: 00010046
> (early) RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 00000000000000e0
> (early) RDX: ffff8800257fff20 RSI: 00000000257fff20 RDI: ffff8800257fff20
> (early) RBP: ffffffff81427e68 R08: 0000000000000005 R09: 0000000000000050
> (early) R10: 0000000000000005 R11: 0000000025800000 R12: ffffffff814bd000
> (early) R13: 0000000000000000 R14: 000000000000000e R15: 0000000000001000
> (early) FS:  0000000000000000(0000) GS:ffffffff8147f000(0000) 
> knlGS:0000000000000000
> (early) CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> (early) CR2: 0000000000000000 CR3: 0000000001441000 CR4: 0000000000002660
> (early) DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> (early) DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> (early) Process swapper (pid: 0, threadinfo ffffffff81426000, task 
> ffffffff81449020)
> (early) Stack:
> (early)  ffffffff81427e88(early)  0000000000000000(early)  
> ffffffff81427ea8(early)  ffffffff814a343d(early)
> (early)  0000000000000100(early)  0000000026000000(early)  
> ffffffff814bd000(early)  ffffffffffffffff(early)
> (early)  0000000000000000(early)  0000000000000000(early)  
> ffffffff81427eb8(early)  ffffffff814a3577(early)
> (early) Call Trace:
> (early)  [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
> (early)  [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
> (early)  [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
> (early)  [<ffffffff81496b50>] setup_arch+0x721/0x7e5
> (early)  [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
> (early)  [<ffffffff81493955>] start_kernel+0x8a/0x2db
> (early)  [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
> (early)  [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8
> (early) Code: (early) 66 (early) 00 (early) 00 (early) 48 (early) 85 (early) 
> c0 (early) 75 (early) 0c (early) 48 (early) c7 (early) c7 (early) 86 (early) 
> 80 (early) 3a (early) 81 (early) e8 (early) 55 (early) 41 (early) b9 (early) 
> ff (early) 48 (early) bf (early) 00 (early) 00 (early) 00 (early) 00 (early) 
> 00 (early) 88 (early) ff (early) ff (early) 48 (early) 89 (early) d9 (early) 
> 48 (early) 8d (early) 14 (early) 38 (early) 31 (early) c0 (early) fc (early) 
> 48 (early) 89 (early) d7 (early) <f3> (early) aa (early) 48 (early) 89 
> (early) d0 (early) 5f (early) 5b (early) c9 (early) c3 (early) 55 (early) 48 
> (early) 89 (early) e5 (early) 41 (early) 57 (early) 49 (early) 89 (early) f7 
> (early) 49 (early) c1 (early) ef (early)
> (early) RIP (early)  [<ffffffff814a33f2>] find_range_array+0x4e/0x57
> (early)  RSP <ffffffff81427e58>
> (early) CR2: 0000000000000000
> (early) ---[ end trace 4eaa2a86a8e2da22 ]---
> (early) Kernel panic - not syncing: Attempted to kill the idle task!
> (early) Pid: 0, comm: swapper Tainted: G      D     
> 2.6.39-rc5-x86_64.xenU.all.r0+ #5
> (early) Call Trace:
> (early)  [<ffffffff810375ed>] panic+0xbd/0x1c7
> (early)  [<ffffffff810383c7>] ? printk+0x67/0x69
> (early)  [<ffffffff811fd7b0>] ? account+0xe1/0xf0
> (early)  [<ffffffff8103a641>] do_exit+0xb4/0x676
> (early)  [<ffffffff81037b53>] ? spin_unlock_irqrestore+0x9/0xb
> (early)  [<ffffffff81038c24>] ? kmsg_dump+0x4a/0xd9
> (early)  [<ffffffff8100d11d>] oops_end+0xc1/0xc9
> (early)  [<ffffffff810252b1>] no_context+0x1f5/0x204
> (early)  [<ffffffff81025448>] __bad_area_nosemaphore+0x188/0x1ab
> (early)  [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early)  [<ffffffff810254e1>] bad_area_nosemaphore+0xe/0x10
> (early)  [<ffffffff81025991>] do_page_fault+0x18c/0x337
> (early)  [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early)  [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early)  [<ffffffff810065df>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
> (early)  [<ffffffff812eebd5>] page_fault+0x25/0x30
> (early)  [<ffffffff814a33f2>] ? find_range_array+0x4e/0x57
> (early)  [<ffffffff814a33ca>] ? find_range_array+0x26/0x57
> (early)  [<ffffffff814a343d>] __memblock_x86_memory_in_range+0x42/0x171
> (early)  [<ffffffff814a3577>] memblock_x86_memory_in_range+0xb/0xd
> (early)  [<ffffffff81497348>] memblock_find_dma_reserve+0x15/0x3b
> (early)  [<ffffffff81496b50>] setup_arch+0x721/0x7e5
> (early)  [<ffffffff810065fd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
> (early)  [<ffffffff81493955>] start_kernel+0x8a/0x2db
> (early)  [<ffffffff81493299>] x86_64_start_reservations+0x84/0x88
> (early)  [<ffffffff814947d9>] xen_start_kernel+0x3e1/0x3e8
> 
> I think that (Stefano please confirm or not) this patch was prepared
> as workaround for similar issues. However, I do not like this patch
> because on systems with small amount of memory it leaves huge (to some
> extent) hole between max_low_pfn and 4G. Additionally, it affects
> memory hotplug a bit because it allocates memory starting from current
> max_mfn. It also breaks memory hotplug on i386 (maybe also others
> thinks, however, I could not confirm that). If it stay for some
> reason it should be amended in follwing way:
> 
> #ifdef CONFIG_X86_32
>         xen_extra_mem_start = mem_end;
> #else
>         xen_extra_mem_start = max((1ULL << 32), mem_end);
> #endif
> 
> Regarding comment for this patch it should be mentioned that without this
> patch e820_end_of_low_ram_pfn() is not broken. It is not called simply.
> 
> Last but least. I found that memory sizes below and including exactly 1 GiB 
> and
> exactly 2 GiB, 3 GiB (maybe higher, i.e. 4 GiB, 5 GiB, ...; I was not able to 
> test
> them because I do not have sufficient memory) are magic. It means that if 
> memory
> is set with those sizes everything is working good (without 
> 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> and 24bdb0b62cc82120924762ae6bc85afc8c3f2b26 applied). It means that domU
> should be tested with sizes which are not power of two nor multiple of that.

Hmm, I thought I did test 1500M.

> > +#ifdef CONFIG_X86_64
> > +static __initdata u64 __last_pgt_set_rw = 0;
> > +static __initdata u64 __pgt_buf_start = 0;
> > +static __initdata u64 __pgt_buf_end = 0;
> > +static __initdata u64 __pgt_buf_top = 0;
> 
> Please look into include/linux/init.h for proper
> usage of __init macros. It should be changed to
> 
> static u64 __last_pgt_set_rw __initdata = 0;
> ...
> ...
> 
> Additionally,
> 
> static const struct pv_mmu_ops xen_mmu_ops __initdata = {
> 
> should be changed to:
> 
> static const struct pv_mmu_ops xen_mmu_ops __initconst = {
> 
> It is not in your patch, however, it conflicts
> with your definitions.

Ok. I am not that worried about the changes this patch brings. I hope
to have it removed in 2.6.40-ish time -frame.
> 
> > +/*
> > + * As a consequence of the commit:
> > + * 
> > + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > + * Author: Yinghai Lu <yinghai@xxxxxxxxxx>
> > + * Date:   Fri Dec 17 16:58:28 2010 -0800
> > + * 
> > + *     x86-64, mm: Put early page table high
> > + * 
> > + * at some point init_memory_mapping is going to reach the pagetable pages
> > + * area and map those pages too (mapping them as normal memory that falls
> > + * in the range of addresses passed to init_memory_mapping as argument).
> > + * Some of those pages are already pagetable pages (they are in the range
> > + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> > + * everything is fine.
> > + * Some of these pages are not pagetable pages yet (they fall in the range
> > + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> > + * are going to be mapped RW.  When these pages become pagetable pages and
> > + * are hooked into the pagetable, xen will find that the guest has already
> > + * a RW mapping of them somewhere and fail the operation.
> > + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> > + * to verify that the pagetables are valid before using them. The 
> > validation
> > + * operations are called "pinning".
> > + * 
> > + * In order to fix the issue we mark all the pages in the entire range
> > + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> > + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> > + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> > + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> > + * ranges are RO).
> > + * 
> > + * For this reason, 'mark_rw_past_pgt' is introduced which is called 
> > _after_
> > + * the init_memory_mapping has completed (in a perfect world we would
> > + * call this function from init_memory_mapping, but lets ignore that).
> > + * 
> > + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> > + * end,top] have all changed to new values (b/c init_memory_mapping
> > + * is called and setting up another new page-table). Hence, the first time
> > + * we enter this function, we save away the pgt_buf_start value and update
> > + * the pgt_buf_[end,top].
> > + * 
> > + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> > + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> > + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> > + * 
> > + * And then we update those "old" pgt_buf_[end|top] with the new ones
> > + * so that we can redo this on the next pagetable.
> > + */
> > +static __init void mark_rw_past_pgt(void) {
> 
> Please look into include/linux/init.h. I found much more similar
> mistakes in current Xen code. I will prepare relevant patch
> shortly.

Excellent. Looking forward to them.
..
> > +   return;
> 
> I think that this return is superfluous.

>nods>
> 
> > +}
> > +#else
> > +static __init void mark_rw_past_pgt(void) { }
> 
> Dito.

Ok. Not that much worried - I think we will get rid of this in 2.6.40 anyhow
(or I hope so).
> 
> > +#endif
> >  static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> >  {
> >  #ifdef CONFIG_X86_64
> > @@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t 
> > pte)
> >     unsigned long pfn = pte_pfn(pte);
> >  
> >     /*
> > +    * A bit of optimization. We do not need to call the workaround
> > +    * when xen_set_pte_init is called with a PTE with 0 as PFN.
> > +    * That is b/c the pagetable at that point are just being populated
> > +    * with empty values and we can save some cycles by not calling
> > +    * the 'memblock' code.*/
> > +   if (pfn)
> > +           mark_rw_past_pgt();
> > +   /*
> >      * If the new pfn is within the range of the newly allocated
> >      * kernel pagetable, and it isn't being mapped into an
> >      * early_ioremap fixmap slot as a freshly allocated page, make sure
> > @@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)
> >  
> >  static __init void xen_post_allocator_init(void)
> 
> Dito.

<nods> That looks like a candidate for another patch.
> 
> Daniel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.