
Re: [Xen-devel] page faults on machines with > 4TB memory



On 23/07/15 17:35, Elena Ufimtseva wrote:
> Hi
>
> While working on boot-time bugs on the large Oracle server x4-8,
> we hit a problem booting Xen on machines with > 4TB of memory,
> such as the Oracle x4-8.
> The page fault initially occurred while loading Xen PM info into the hypervisor
> (you can see it in the attached serial log named 4.4.2_no_mem_override).
> Tracing down the issue shows that the page fault occurs in the timer.c code
> while getting the heap size.
>
> Here is the original call trace:
> rocessor: Uploading Xen processor PM info 
> @ (XEN) ----[ Xen-4.4.3-preOVM  x86_64  debug=n  Tainted:    C ]---- 
> @ (XEN) CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor 
> @ (XEN) rax: ffff82d080513a20   rbx: ffff83808e802300   rcx: 00000000000000e8
> @ (XEN) rdx: 00000000000000e8   rsi: 00000000000000e8   rdi: ffff83808e802300
> @ (XEN) rbp: ffff82d080513a20   rsp: ffff82d0804d7c70   r8:  ffff8840ffdb5010
> @ (XEN) r9:  0000000000000017   r10: ffff83808e802180   r11: 0200200200200200
> @ (XEN) r12: ffff82d080533080   r13: 0000000000000296   r14: 0100100100100100
> @ (XEN) r15: 00000000000000e8   cr0: 0000000080050033   cr4: 00000000001526f0
> @ (XEN) cr3: 00000100818b2000   cr2: ffff8840ffdb5010
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7c70: 
> @ (XEN)    ffff83808e802300 ffff82d080513a20 ffff82d08022f59b ffff82d080533080
> @ (XEN)    ffff82d080532f50 00000000000000e8 ffff83808e802328 0000000000000000
> @ (XEN)    ffff82d080513a20 ffff83808e8022c0 ffff82d080533200 00000000000000e8
> @ (XEN)    00000000000000f0 ffff82d0805331c0 ffff82d0802458e2 0000000000000000
> @ (XEN)    00000000000000e8 ffff83808e802334 ffff8384be7979b0 ffff82d0804d7d78
> @ (XEN)    0000000000000000 ffff8384be77c700 ffff82d0804d7d78 ffff82d080513a20
> @ (XEN)    ffff82d080246207 00000000000000e8 00000000000000e8 ffff8384be7979b0
> @ (XEN)    ffff82d08024518a ffff82d080533080 0000000000000070 ffff82d080533da8
> @ (XEN)    00000001000000e8 ffff8384be797a00 000000e800000001 002ab980002abd68
> @ (XEN)    0000271000124f80 002abd6800124f80 00000000002ab980 ffff82d0803753e0
> @ (XEN)    0000000000010101 0000000000000001 ffff82d0804d7e18 ffff881fb4afbc88
> @ (XEN)    ffff82d0804d0000 ffff881fb28a4400 ffff82d0804fca80 ffffffff819b7080
> @ (XEN)    ffff82d080266c16 ffff83808fb46ba8 ffff82d080208a82 ffff83006bddd190
> @ (XEN)    0000000000000292 0300000100000036 00000001000000f6 000000000000000f
> @ (XEN)    0000007f000c0082 0000000000000000 0000007f000c0082 0000000000000000
> @ (XEN)    000000000000000a ffff881fb28a4400 0000000000000005 0000000000000000
> @ (XEN)    0000000000000000 00000000000000fe 0000000000000001 0000000000000001
> @ (XEN)    0000000000000000 0000000000000000 ffff82d08031f521 0000000000000000
> @ (XEN)    0000000000000246 ffffffff810010ea 0000000000000000 ffffffff810010ea
> @ (XEN)    000000000000e030 0000000000000246 ffff83006bddd000 ffff881fb4afbd48
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08022e747>] add_entry+0x27/0x120 
> @ (XEN)    [<ffff82d08022f59b>] set_timer+0x10b/0x220 
> @ (XEN)    [<ffff82d0802458e2>] cpufreq_governor_dbs+0x1e2/0x2f0 
> @ (XEN)    [<ffff82d080246207>] __cpufreq_set_policy+0x87/0x120 
> @ (XEN)    [<ffff82d08024518a>] cpufreq_add_cpu+0x24a/0x4f0 
> @ (XEN)    [<ffff82d080266c16>] do_platform_op+0x9c6/0x1650 
> @ (XEN)    [<ffff82d080208a82>] evtchn_check_pollers+0x22/0xb0 
> @ (XEN)    [<ffff82d08031f521>] do_iret+0xc1/0x1a0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb5010: 
> @ (XEN)  L4[0x110] = 00000100818b3067 00000000000018b3 
> @ (XEN)  L3[0x103] = 0000000000000000 ffffffffffffffff 
> @ (XEN) 
> @ (XEN) ****************************************
>
>    0xffff82d08022e720 <add_entry>:      movzwl 0x28(%rdi),%edx
>    0xffff82d08022e724 <add_entry+4>:    push   %rbp
>    0xffff82d08022e725 <add_entry+5>:    lea    0x2e52f4(%rip),%rax        # 0xffff82d080513a20 <__per_cpu_offset>
>    0xffff82d08022e72c <add_entry+12>:   lea    0x30494d(%rip),%r10        # 0xffff82d080533080 <per_cpu__timers>
>    0xffff82d08022e733 <add_entry+19>:   push   %rbx
>    0xffff82d08022e734 <add_entry+20>:   add    (%rax,%rdx,8),%r10
>    0xffff82d08022e738 <add_entry+24>:   movl   $0x0,0x8(%rdi)
>    0xffff82d08022e73f <add_entry+31>:   movb   $0x3,0x2a(%rdi)
>    0xffff82d08022e743 <add_entry+35>:   mov    0x8(%r10),%r8
>    0xffff82d08022e747 <add_entry+39>:   movzwl (%r8),%ecx
>
> This points to
> int sz = GET_HEAP_SIZE(heap);
> in add_to_heap(), inlined into add_entry() in timer.c.
>
> static int add_entry(struct timer *t)
> {
> ffff82d08022cad3:   53                      push   %rbx
>     struct timers *timers = &per_cpu(timers, t->cpu);
> ffff82d08022cad4:   4c 03 14 d0             add    (%rax,%rdx,8),%r10
>     int rc;
>
>     ASSERT(t->status == TIMER_STATUS_invalid);
>
>     /* Try to add to heap. t->heap_offset indicates whether we succeed. */
>     t->heap_offset = 0;
> ffff82d08022cad8:   c7 47 08 00 00 00 00    movl   $0x0,0x8(%rdi)
>     t->status = TIMER_STATUS_in_heap;
> ffff82d08022cadf:   c6 47 2a 03             movb   $0x3,0x2a(%rdi)
>     rc = add_to_heap(timers->heap, t);
> ffff82d08022cae3:   4d 8b 42 08             mov    0x8(%r10),%r8
>
> /* Add new entry @t to @heap. Return TRUE if new top of heap. */
> static int add_to_heap(struct timer **heap, struct timer *t)
> {
>     int sz = GET_HEAP_SIZE(heap);
> ffff82d08022cae7:   41 0f b7 08             movzwl (%r8),%ecx
>
>     /* Fail if the heap is full. */
>     if ( unlikely(sz == GET_HEAP_LIMIT(heap)) )
>
> But checking values for nr_cpumask_bits, nr_cpu_ids and NR_CPUS did not
> provide any clues on why it fails here.
>
> After disabling the Xen cpufreq driver in Linux, the page fault did not appear,
> but creating a new guest caused another fatal page fault:
>
> CPU:    0 
> @ (XEN) RIP:    e008:[<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor 
> @ (XEN) rax: 0000000000000000   rbx: 00000000ffdb53c0   rcx: 0000000000000004 
> @ (XEN) rdx: ffff82d080513a20   rsi: 00000000000000f0   rdi: ffff8840ffdb53c0 
> @ (XEN) rbp: 00000000000000e9   rsp: ffff82d0804d7d88   r8:  0000000000000000 
> @ (XEN) r9:  0000000000000000   r10: 0000000000000017   r11: 0000000000000000 
> @ (XEN) r12: ffff8381875ee3e0   r13: ffff82d0804d7e98   r14: 00000000000000e9 
> @ (XEN) r15: 00000000000000f0   cr0: 0000000080050033   cr4: 00000000001526f0 
> @ (XEN) cr3: 0000008174093000   cr2: ffff8840ffdb53c0 
> @ (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008 
> @ (XEN) Xen stack trace from rsp=ffff82d0804d7d88: 
> @ (XEN)    00000000000000e7 ffff82d080206030 000000cf7d47d0a2 00000000000000e9
> @ (XEN)    00000000000000f0 0000000000000002 ffff83808fb6ffd0 ffff82d080533db8
> @ (XEN)    0000000000000000 ffff82d080532f50 ffff82d0804d0000 ffff82d080533db8
> @ (XEN)    00007fa8c83e5004 ffff82d0804d7e08 ffff82d080533db8 ffff83818b4e5000
> @ (XEN)    000000090000000f 00007fa8c8390001 00007fa800000002 00007fa8ae7f8eb8
> @ (XEN)    0000000000000002 00007fa898004170 000000000159c320 00000034ccc6cffe
> @ (XEN)    00007fa8c83e5000 0000000000000000 000000000159c320 fffffc73ffffffff
> @ (XEN)    00000034ccf6e920 00000034ccf6e920 00000034ccf6e920 00000034ccc94298
> @ (XEN)    00007fa898004170 00000034ccc94220 ffffffffffffffff ffffffffffffffff
> @ (XEN)    ffffffffffffffff 000000ffffffffff 00000034ca0e08c7 0000000000000100
> @ (XEN)    00000034ca0e08c7 0000000000000033 0000000000000246 ffff83006bddd000
> @ (XEN)    ffff8808456f1e98 00007fa8ae7f8d90 ffff88084ad1d900 0000000000000001
> @ (XEN)    00007fa8ae7f8d90 ffff82d0803243a9 00000000ffffffff 0000000001d0085c
> @ (XEN)    00007fa8c84549c0 00007fa898004170 ffff8808456f1e98 00007fa8ae7f8d90
> @ (XEN)    0000000000000282 00000000019c9998 0000000000000003 0000000001d00a49
> @ (XEN)    0000000000000024 ffffffff8100148a 00007fa898004170 00007fa8ae7f8ed0
> @ (XEN)    00007fa8c83e5004 0001010000000000 ffffffff8100148a 000000000000e033
> @ (XEN)    0000000000000282 ffff8808456f1e40 000000000000e02b 0000000000000000
> @ (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> @ (XEN)    ffff83006bddd000 0000000000000000 0000000000000000
> @ (XEN) Xen call trace: 
> @ (XEN)    [<ffff82d08025d59b>] __find_first_bit+0xb/0x30 
> @ (XEN)    [<ffff82d080206030>] do_domctl+0x12b0/0x13d0 
> @ (XEN)    [<ffff82d0803243a9>] syscall_enter+0xa9/0xae 
> @ (XEN) 
> @ (XEN) Pagetable walk from ffff8840ffdb53c0: 
> @ (XEN)  L4[0x110] = 00000080818b3067 00000000000018b3
>
> Booting upstream Xen on the same server (same command line as in the other
> cases) causes yet another page fault (see attached
> upstream_no_mem_override.log).
>
> We remembered that there is another open bug about a problem when starting
> with more than 4TB of memory. The workaround for that was to override mem on
> the Xen command line. We tried this, and with both upstream Xen and the 4.4.3
> with the cpufreq Linux driver enabled, the problem disappears. See attached
> logs upstream_with_mem_override.log and 4.4.3_with_mem_overrride.log.
>
> Any information on what the issue could be here, or any other pointers, would
> be very helpful.
> I will provide additional info if needed.
>
> Thank you
> Elena

This is an issue we have found in XenServer as well.

Observe that ffff8840ffdb53c0 is actually a pointer in the 64-bit PV
virtual region, because the xenheap allocator has wandered off the top
of the directmap region.  This is a direct result of passing NUMA node
information to alloc_xenheap_page(), which overrides the check that
keeps the allocation inside the directmap region.

I have worked around this in XenServer with:

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 3c64f19..715765a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -15,7 +15,7 @@
  * opt_mem: Limit maximum address of physical RAM.
  *          Any RAM beyond this address limit is ignored.
  */
-static unsigned long long __initdata opt_mem;
+static unsigned long long __initdata opt_mem = GB(5 * 1024);
 size_param("mem", opt_mem);
 
 /*

This causes Xen to ignore any RAM above the top of the directmap region,
which happens to be 5TiB on Xen 4.5.

In some copious free time, I was going to look into segmenting the
directmap region by NUMA node, rather than having it linear from 0, so
xenheap pages can still be properly NUMA-located.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

