
Re: [Xen-devel] page faults on machines with > 4TB memory



On 23/07/15 17:35, Elena Ufimtseva wrote:
> Hi
>
> While working on boot-time bugs on the large Oracle server x4-8,
> we hit a problem booting Xen on machines with > 4TB of memory,
> such as the Oracle x4-8.
> The page fault initially occurred while loading Xen PM info into the hypervisor
> (you can see it in the attached serial log named 4.4.2_no_mem_override).
> Tracing down the issue shows that the page fault occurs in the timer.c code
> while getting the heap size.
>
> Here is the original call trace:
> rocessor: Uploading Xen processor PM info 
> @ (XEN) ----[ Xen-4.4.3-preOVM  x86_64  debug=n  Tainted:    C ]---- 
> @ (XEN) CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor 
> @ (XEN) rax: ffff82d080513a20   rbx: ffff83808e802300   rcx: 00000000000000e8
> @ (XEN) rdx: 00000000000000e8   rsi: 00000000000000e8   rdi: ffff83808e802300
> @ (XEN) rbp: ffff82d080513a20   rsp: ffff82d0804d7c70   r8:  ffff8840ffdb5010
> @ (XEN) r9:  0000000000000017   r10: ffff83808e802180   r11: 0200200200200200
> @ (XEN) r12: ffff82d080533080   r13: 0000000000000296   r14: 0100100100100100
> @ (XEN) r15: 00000000000000e8   cr0: 0000000080050033   cr4: 00000000001526f0
> @ (XEN) cr3: 00000100818b2000   cr2: ffff8840ffdb5010
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7c70: 
> @ (XEN)    ffff83808e802300 ffff82d080513a20 ffff82d08022f59b ffff82d080533080
> @ (XEN)    ffff82d080532f50 00000000000000e8 ffff83808e802328 0000000000000000
> @ (XEN)    ffff82d080513a20 ffff83808e8022c0 ffff82d080533200 00000000000000e8
> @ (XEN)    00000000000000f0 ffff82d0805331c0 ffff82d0802458e2 0000000000000000
> @ (XEN)    00000000000000e8 ffff83808e802334 ffff8384be7979b0 ffff82d0804d7d78
> @ (XEN)    0000000000000000 ffff8384be77c700 ffff82d0804d7d78 ffff82d080513a20
> @ (XEN)    ffff82d080246207 00000000000000e8 00000000000000e8 ffff8384be7979b0
> @ (XEN)    ffff82d08024518a ffff82d080533080 0000000000000070 ffff82d080533da8
> @ (XEN)    00000001000000e8 ffff8384be797a00 000000e800000001 002ab980002abd68
> @ (XEN)    0000271000124f80 002abd6800124f80 00000000002ab980 ffff82d0803753e0
> @ (XEN)    0000000000010101 0000000000000001 ffff82d0804d7e18 ffff881fb4afbc88
> @ (XEN)    ffff82d0804d0000 ffff881fb28a4400 ffff82d0804fca80 ffffffff819b7080
> @ (XEN)    ffff82d080266c16 ffff83808fb46ba8 ffff82d080208a82 ffff83006bddd190
> @ (XEN)    0000000000000292 0300000100000036 00000001000000f6 000000000000000f
> @ (XEN)    0000007f000c0082 0000000000000000 0000007f000c0082 0000000000000000
> @ (XEN)    000000000000000a ffff881fb28a4400 0000000000000005 0000000000000000
> @ (XEN)    0000000000000000 00000000000000fe 0000000000000001 0000000000000001
> @ (XEN)    0000000000000000 0000000000000000 ffff82d08031f521 0000000000000000
> @ (XEN)    0000000000000246 ffffffff810010ea 0000000000000000 ffffffff810010ea
> @ (XEN)    000000000000e030 0000000000000246 ffff83006bddd000 ffff881fb4afbd48
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN)    [<ffff82d08022f59b>] set_timer+0x10b/0x220 
> @ (XEN)    [<ffff82d0802458e2>] cpufreq_governor_dbs+0x1e2/0x2f0 
> @ (XEN)    [<ffff82d080246207>] __cpufreq_set_policy+0x87/0x120 
> @ (XEN)    [<ffff82d08024518a>] cpufreq_add_cpu+0x24a/0x4f0 
> @ (XEN)    [<ffff82d080266c16>] do_platform_op+0x9c6/0x1650 
> @ (XEN)    [<ffff82d080208a82>] evtchn_check_pollers+0x22/0xb0 
> @ (XEN)    [<ffff82d08031f521>] do_iret+0xc1/0x1a0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb5010: 
> @ (XEN)  L4[0x110] = 00000100818b3067 00000000000018b3 
> @ (XEN)  L3[0x103] = 0000000000000000 ffffffffffffffff 
> @ (XEN) 
> @ (XEN) ****************************************
>
>    0xffff82d08022e720 <add_entry>:      movzwl 0x28(%rdi),%edx
>    0xffff82d08022e724 <add_entry+4>:    push   %rbp
>    0xffff82d08022e725 <add_entry+5>:    lea    0x2e52f4(%rip),%rax        # 0xffff82d080513a20 <__per_cpu_offset>
>    0xffff82d08022e72c <add_entry+12>:   lea    0x30494d(%rip),%r10        # 0xffff82d080533080 <per_cpu__timers>
>    0xffff82d08022e733 <add_entry+19>:   push   %rbx
>    0xffff82d08022e734 <add_entry+20>:   add    (%rax,%rdx,8),%r10
>    0xffff82d08022e738 <add_entry+24>:   movl   $0x0,0x8(%rdi)
>    0xffff82d08022e73f <add_entry+31>:   movb   $0x3,0x2a(%rdi)
>    0xffff82d08022e743 <add_entry+35>:   mov    0x8(%r10),%r8
>    0xffff82d08022e747 <add_entry+39>:   movzwl (%r8),%ecx
>
> This points to
> int sz = GET_HEAP_SIZE(heap);
> in add_to_heap(), inlined into add_entry() in timer.c.
>
> static int add_entry(struct timer *t)
> {
> ffff82d08022cad3:   53                      push   %rbx
>     struct timers *timers = &per_cpu(timers, t->cpu);
> ffff82d08022cad4:   4c 03 14 d0             add    (%rax,%rdx,8),%r10
>     int rc;
>
>     ASSERT(t->status == TIMER_STATUS_invalid);
>
>     /* Try to add to heap. t->heap_offset indicates whether we succeed. */
>     t->heap_offset = 0;
> ffff82d08022cad8:   c7 47 08 00 00 00 00    movl   $0x0,0x8(%rdi)
>     t->status = TIMER_STATUS_in_heap;
> ffff82d08022cadf:   c6 47 2a 03             movb   $0x3,0x2a(%rdi)
>     rc = add_to_heap(timers->heap, t);
> ffff82d08022cae3:   4d 8b 42 08             mov    0x8(%r10),%r8
>
> /* Add new entry @t to @heap. Return TRUE if new top of heap. */
> static int add_to_heap(struct timer **heap, struct timer *t)
> {
>     int sz = GET_HEAP_SIZE(heap);
> ffff82d08022cae7:   41 0f b7 08             movzwl (%r8),%ecx
>
>     /* Fail if the heap is full. */
>     if ( unlikely(sz == GET_HEAP_LIMIT(heap)) )
>
> But checking values for nr_cpumask_bits, nr_cpu_ids and NR_CPUS did not
> provide any clues on why it fails here.
>
> After disabling the Xen cpufreq driver in Linux, the page fault did not appear,
> but creating a new guest caused another fatal page fault:
>
> CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor 
> @ (XEN) rax: 0000000000000000   rbx: 00000000ffdb53c0   rcx: 0000000000000004 
> @ (XEN) rdx: ffff82d080513a20   rsi: 00000000000000f0   rdi: ffff8840ffdb53c0 
> @ (XEN) rbp: 00000000000000e9   rsp: ffff82d0804d7d88   r8:  0000000000000000 
> @ (XEN) r9:  0000000000000000   r10: 0000000000000017   r11: 0000000000000000 
> @ (XEN) r12: ffff8381875ee3e0   r13: ffff82d0804d7e98   r14: 00000000000000e9 
> @ (XEN) r15: 00000000000000f0   cr0: 0000000080050033   cr4: 00000000001526f0 
> @ (XEN) cr3: 0000008174093000   cr2: ffff8840ffdb53c0 
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7d88: 
> @ (XEN)    00000000000000e7 ffff82d080206030 000000cf7d47d0a2 00000000000000e9
> @ (XEN)    00000000000000f0 0000000000000002 ffff83808fb6ffd0 ffff82d080533db8
> @ (XEN)    0000000000000000 ffff82d080532f50 ffff82d0804d0000 ffff82d080533db8
> @ (XEN)    00007fa8c83e5004 ffff82d0804d7e08 ffff82d080533db8 ffff83818b4e5000
> @ (XEN)    000000090000000f 00007fa8c8390001 00007fa800000002 00007fa8ae7f8eb8
> @ (XEN)    0000000000000002 00007fa898004170 000000000159c320 00000034ccc6cffe
> @ (XEN)    00007fa8c83e5000 0000000000000000 000000000159c320 fffffc73ffffffff
> @ (XEN)    00000034ccf6e920 00000034ccf6e920 00000034ccf6e920 00000034ccc94298
> @ (XEN)    00007fa898004170 00000034ccc94220 ffffffffffffffff ffffffffffffffff
> @ (XEN)    ffffffffffffffff 000000ffffffffff 00000034ca0e08c7 0000000000000100
> @ (XEN)    00000034ca0e08c7 0000000000000033 0000000000000246 ffff83006bddd000
> @ (XEN)    ffff8808456f1e98 00007fa8ae7f8d90 ffff88084ad1d900 0000000000000001
> @ (XEN)    00007fa8ae7f8d90 ffff82d0803243a9 00000000ffffffff 0000000001d0085c
> @ (XEN)    00007fa8c84549c0 00007fa898004170 ffff8808456f1e98 00007fa8ae7f8d90
> @ (XEN)    0000000000000282 00000000019c9998 0000000000000003 0000000001d00a49
> @ (XEN)    0000000000000024 ffffffff8100148a 00007fa898004170 00007fa8ae7f8ed0
> @ (XEN)    00007fa8c83e5004 0001010000000000 ffffffff8100148a 000000000000e033
> @ (XEN)    0000000000000282 ffff8808456f1e40 000000000000e02b 0000000000000000
> @ (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> @ (XEN)    ffff83006bddd000 0000000000000000 0000000000000000
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN)    [<ffff82d080206030>] do_domctl+0x12b0/0x13d0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb53c0: 
> @ (XEN)  L4[0x110] = 00000080818b3067 00000000000018b3
>
> Booting upstream Xen on the same server (same command line as in the other
> cases) causes yet another page fault (see attached
> upstream_no_mem_override.log).
>
> We remembered that there is another open bug about a problem when starting
> with more than 4TB of memory. The workaround for that was to override mem on
> the Xen command line. We tried this, and with both upstream Xen and the 4.4.3
> with the cpufreq Linux driver enabled, the problem disappears. See attached
> logs upstream_with_mem_override.log and 4.4.3_with_mem_overrride.log.
>
> Any information on what the issue could be here, or any other pointers, would
> be very helpful.
> I will provide additional info if needed.
>
> Thank you
> Elena

This is an issue we have found in XenServer as well.

Observe that ffff8840ffdb53c0 is actually a pointer in the 64-bit PV
virtual region, because the xenheap allocator has wandered off the top
of the directmap region.  This is a direct result of passing NUMA node
information to alloc_xenheap_page(), which overrides the check that
keeps the allocation inside the directmap region.

I have worked around this in XenServer with:

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);
 
 /*

This causes Xen to ignore any RAM above the top of the directmap region,
which happens to be 5TiB on Xen 4.5.

In some copious free time, I was going to look into segmenting the
directmap region by NUMA node, rather than having it linear from 0, so
xenheap pages can still be properly NUMA-located.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

