
[Xen-users] Kernel BUG in page_alloc.c (mismatched start and end zone) using xl generated e820 map


  • To: "xen-users@xxxxxxxxxxxxxxxxxxxx" <xen-users@xxxxxxxxxxxxxxxxxxxx>
  • From: Simon Waterman <simon.waterman@xxxxxxxxxxx>
  • Date: Wed, 3 Jun 2015 22:57:23 +0000
  • Accept-language: en-GB, en-US
  • Delivery-date: Wed, 03 Jun 2015 22:58:48 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

Hi,

We're hitting the kernel BUG below in one of our VMs running Linux kernel
3.13.0 on Xen 4.4.  We use the xl toolstack with PCI pass-through to pass
network cards and a disk controller.  It happens on several of our hardware
platforms, but not on all servers, and it seems to be related to the e820 map
passed by xl.

The problem occurs when we put the server under heavy load.  The 'dd' command
shown just above the stack trace below seems to be sufficient to trigger it if
run a few times.

We didn't see the problem with previous versions of Xen (we were using 4.2.2),
but at that time we were using xend and, as I understand it, the RAM map that
xend provides to the guest is fabricated rather than based on the real
hardware map.

root@server1:/home/user0# DD_PERF="$(dd if=/dev/zero of=/data/zeros bs=1M \
count=4096 2>&1 | tail -n 1 | cut -d ',' -f '2 3' ; rm -f /data/zeros)"
[  814.365651] ------------[ cut here ]------------
[  814.365668] kernel BUG at /build/ci/git/build/Kernel/kernel-trusty-domu/work/ubuntu-precise/mm/page_alloc.c:955!
[  814.365675] invalid opcode: 0000 [#1] SMP
[  814.365681] Modules linked in: drbd lru_cache libcrc32c xen_blkback xen_netback xt_addrtype xt_multiport xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_tcpudp xt_owner nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack xt_NFLOG nfnetlink_log nfnetlink ipt_ULOG ip6table_filter ip6_tables iptable_filter ip_tables x_tables x86_pkg_temp_thermal dm_multipath coretemp crct10dif_pclmul scsi_dh crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd tmem xenfs xen_privcmd zfs(POF) zunicode(POF) zcommon(POF) znvpair(POF) spl(OF) zavl(POF) dm_mirror dm_region_hash dm_log raid0 multipath linear dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid10 xor igb i2c_algo_bit dca ahci raid6_pq libahci ptp pps_core aufs
[  814.365772] CPU: 0 PID: 9772 Comm: dd Tainted: PF          O 3.13.0-34-trusty-domu #60~precise1
[  814.365779] task: ffff88005d022fc0 ti: ffff880007a22000 task.ti: ffff880007a22000
[  814.365786] RIP: e030:[<ffffffff81145f84>]  [<ffffffff81145f84>] move_freepages+0x104/0x110
[  814.365799] RSP: e02b:ffff880007a23698  EFLAGS: 00010006
[  814.365803] RAX: ffff88010a24f000 RBX: 0000000000000000 RCX: 0000000000000001
[  814.365808] RDX: ffffea000428ffc0 RSI: ffffea0004288000 RDI: ffff88010a24ff00
[  814.365812] RBP: ffff880007a236a0 R08: ffff88010a24ff00 R09: 0000000000000000
[  814.365817] R10: 0000000000000000 R11: ffffea00042880a0 R12: ffffea0004288080
[  814.365821] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000014
[  814.365833] FS:  00007fed6b790740(0000) GS:ffff880109800000(0000) knlGS:ffff88001f800000
[  814.365838] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[  814.365843] CR2: 00007f8ebd683ab0 CR3: 00000000581af000 CR4: 0000000000002660
[  814.365848] Stack:
[  814.365851]  ffffffff81146003 ffff880007a23718 ffffffff811478eb 0000000000017614
[  814.365859]  ffffffff81009ebd ffff88010a24ff88 ffffffff00000000 ffffea00042880a0
[  814.365866]  ffff88010a24ff00 0000000200000000 0000000000000000 0000000000000002
[  814.365874] Call Trace:
[  814.365880]  [<ffffffff81146003>] ? move_freepages_block+0x73/0x80
[  814.365887]  [<ffffffff811478eb>] __rmqueue+0x39b/0x4a0
[  814.365896]  [<ffffffff81009ebd>] ? xen_force_evtchn_callback+0xd/0x10
[  814.365902]  [<ffffffff81149e5c>] get_page_from_freelist+0x68c/0x930
[  814.365909]  [<ffffffff8114a27b>] __alloc_pages_nodemask+0x17b/0xb60
[  814.365915]  [<ffffffff8100a742>] ? check_events+0x12/0x20
[  814.365923]  [<ffffffff811e17fe>] ? __find_get_block+0xbe/0x230
[  814.365932]  [<ffffffff8115ecc9>] ? zone_statistics+0x89/0xa0
[  814.365939]  [<ffffffff81188983>] alloc_pages_current+0xa3/0x160
[  814.365946]  [<ffffffff811913a5>] new_slab+0x295/0x320
[  814.365954]  [<ffffffff8169a9b7>] __slab_alloc+0x2a8/0x459
[  814.365960]  [<ffffffff811e0d11>] ? alloc_buffer_head+0x21/0x70
[  814.365968]  [<ffffffff81277f0d>] ? jbd2_journal_dirty_metadata+0xcd/0x2d0
[  814.365975]  [<ffffffff81193213>] kmem_cache_alloc+0x183/0x1d0
[  814.365982]  [<ffffffff811e0d11>] alloc_buffer_head+0x21/0x70
[  814.365990]  [<ffffffff811a3406>] ? __mem_cgroup_commit_charge+0x156/0x3d0
[  814.365996]  [<ffffffff811e100a>] alloc_page_buffers+0x3a/0xc0
[  814.366002]  [<ffffffff811e1f2e>] create_empty_buffers+0x1e/0xd0
[  814.366009]  [<ffffffff811e2027>] create_page_buffers+0x47/0x50
[  814.366016]  [<ffffffff811e3081>] __block_write_begin+0x71/0x430
[  814.366022]  [<ffffffff81276723>] ? jbd2__journal_start+0xf3/0x1e0
[  814.366030]  [<ffffffff81230430>] ? __ext4_get_inode_loc+0x3e0/0x3e0
[  814.366037]  [<ffffffff81235dbc>] ? ext4_da_write_begin+0xec/0x2e0
[  814.366044]  [<ffffffff8125dfe9>] ? __ext4_journal_start_sb+0x69/0xe0
[  814.366050]  [<ffffffff81235dfe>] ext4_da_write_begin+0x12e/0x2e0
[  814.366057]  [<ffffffff8123684a>] ? ext4_da_write_end+0xba/0x250
[  814.366065]  [<ffffffff81140d68>] generic_file_buffered_write+0xf8/0x250
[  814.366073]  [<ffffffff81142421>] __generic_file_aio_write+0x1c1/0x3d0
[  814.366078]  [<ffffffff81142688>] generic_file_aio_write+0x58/0xa0
[  814.366084]  [<ffffffff8122be59>] ext4_file_write+0x99/0x400
[  814.366092]  [<ffffffff81097f74>] ? arch_vtime_task_switch+0x94/0xa0
[  814.366101]  [<ffffffff816b044e>] ? xen_hypervisor_callback+0x1e/0x30
[  814.366108]  [<ffffffff81009ef0>] ? xen_clocksource_read+0x20/0x30
[  814.366115]  [<ffffffff811ae43a>] do_sync_write+0x5a/0x90
[  814.366120]  [<ffffffff811aebc4>] vfs_write+0xb4/0x1f0
[  814.366126]  [<ffffffff811af5f9>] SyS_write+0x49/0xa0
[  814.366132]  [<ffffffff816aebff>] tracesys+0xe1/0xe6
[  814.366136] Code: de 41 d3 e6 4c 89 66 20 4d 89 48 08 4d 63 c6 4c 89 56 10 44 01 f0 49 c1 e0 06 4c 01 c6 48 39 f2 73 96 5b 41 5c 41 5d 41 5e 5d c3 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 b8 00 00
[  814.366191] RIP  [<ffffffff81145f84>] move_freepages+0x104/0x110
[  814.366197]  RSP <ffff880007a23698>
[  814.366205] ---[ end trace cbb29943cef93713 ]---

We've added some debug output to the code in page_alloc.c, as shown below
together with the log output it produces when the BUG is hit.  It seems to
happen when move_freepages is called with a page range at the top of RAM that
spans the end of usable RAM.

----- Code from page_alloc.c with debug output
#ifndef CONFIG_HOLES_IN_ZONE
        /*
         * page_zone is not safe to call in this context when
         * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
         * anyway as we check zone boundaries in move_freepages_block().
         * Remove at a later date when no bug reports exist related to
         * grouping pages by mobility
         */
        struct zone *zs, *ze;

        if (page_zone(start_page) != page_zone(end_page)) {
            zs = page_zone(start_page);
            ze = page_zone(end_page);
            printk(KERN_ERR "Input Zone = %s\n", zone->name);
            printk(KERN_ERR "Input Zone Start PFN = %lx\n", zone->zone_start_pfn);
            printk(KERN_ERR "Input Zone End PFN = %lx\n", zone_end_pfn(zone));
            printk(KERN_ERR "Start Zone = %s\n", zs->name);
            printk(KERN_ERR "Start PFN = %lx\n", page_to_pfn(start_page));
            printk(KERN_ERR "End Zone = %s\n", ze->name);
            printk(KERN_ERR "End PFN = %lx\n", page_to_pfn(end_page));
        }
        /* BUG_ON(page_zone(start_page) != page_zone(end_page)); */
#endif

----- Debug output when the BUG is hit
May 29 23:04:14 server1 kernel: [ 1212.185507] Input Zone Start PFN = 100000
May 29 23:04:14 server1 kernel: [ 1212.185511] Input Zone End PFN = 118000
May 29 23:04:14 server1 kernel: [ 1212.185514] Start Zone = Normal
May 29 23:04:14 server1 kernel: [ 1212.185516] Start PFN = 10a200
May 29 23:04:14 server1 kernel: [ 1212.185519] End Zone = DMA
May 29 23:04:14 server1 kernel: [ 1212.185522] End PFN = 10a3ff
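
For reference, the arithmetic behind the debug output above can be sketched as
follows.  This is a minimal standalone illustration, not kernel code: it
assumes 4 KiB pages and a pageblock of 512 pages (pageblock_order 9, which I
believe is the usual x86-64 value), and the helper names are made up for this
example.  It rounds a PFN down to the start of its pageblock, as I understand
move_freepages_block() to do, and checks whether the block runs past last_pfn
(0x10a254 in the dmesg output below).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 512 pages per pageblock (2 MiB with 4 KiB pages). */
#define PAGEBLOCK_NR_PAGES 512UL

/* First PFN of the pageblock that contains pfn. */
unsigned long pageblock_start_pfn(unsigned long pfn)
{
    return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

/* Last PFN (inclusive) of the pageblock that contains pfn. */
unsigned long pageblock_end_pfn(unsigned long pfn)
{
    return pageblock_start_pfn(pfn) + PAGEBLOCK_NR_PAGES - 1;
}

/*
 * True if the pageblock containing pfn extends beyond usable RAM.
 * last_pfn is exclusive, i.e. the first PFN past the end of RAM,
 * matching the "last_pfn" value printed by the kernel at boot.
 */
bool pageblock_overruns_ram(unsigned long pfn, unsigned long last_pfn)
{
    return pageblock_end_pfn(pfn) >= last_pfn;
}
```

With the values from our logs, the block starting at PFN 0x10a200 covers
0x10a200-0x10a3ff while usable RAM stops at 0x10a254, so the end page of the
block lies beyond the end of RAM even though it is still below the reported
zone end of 0x118000.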

Output from dmesg is included below, showing the e820 map provided by xl.
If we tweak the e820 sanitize code in libxl_x86.c to align the end of usable
RAM with a 2MB (512 page) boundary, everything seems fine, but I'm not sure
this is a good solution.  I hope someone can help us understand the problem
and find a better solution.
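
To make the workaround concrete, the rounding we apply is roughly the
following.  This is only a sketch of the idea: the struct and function names
are hypothetical and are not the actual libxl definitions, and it assumes the
entry being trimmed is the last usable RAM region.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_2MB (2ULL * 1024 * 1024)

/* Hypothetical RAM entry for illustration only (not the libxl type). */
struct ram_entry {
    uint64_t addr;  /* start address in bytes */
    uint64_t size;  /* length in bytes */
};

/*
 * Shrink the entry so that addr + size falls on a 2 MiB boundary.
 * Returns the number of bytes trimmed off the end.  An entry that
 * does not reach past an aligned boundary is left unchanged.
 */
uint64_t trim_end_to_2mb(struct ram_entry *e)
{
    uint64_t end = e->addr + e->size;
    uint64_t aligned_end = end & ~(ALIGN_2MB - 1);

    if (aligned_end <= e->addr)
        return 0;

    e->size = aligned_end - e->addr;
    return end - aligned_end;
}
```

For the top usable region in the map below (0x100000000-0x10a253fff), this
moves the end down from 0x10a254000 to 0x10a200000, giving up 0x54 pages
(336 KiB) of RAM.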

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
[    0.000000] Linux version 3.13.0-34-trusty-domu (root@zdev-ci-1) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #60~precise1 SMP Fri May 29 00:48:02 BST 2015 (Ubuntu 3.13.0-34.60~precise1-trusty-domu 3.13.11.4)
[    0.000000] Command line: root=/dev/zvol/diskvm/67ec09dd-a0ed-4c51-8b75-cc08efea62fa/bin/1 ro xencons=tty console=tty1 console=hvc0 iommu=soft libata.fua=1 boot=zfs-z rpool=diskvm bootvol=67ec09dd-a0ed-4c51-8b75-cc08efea62fa/bin/1
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000] ACPI in unprivileged domain disabled
[    0.000000] Freeing 75dac-80000 pfn range: 41556 pages freed
[    0.000000] Released 41556 pages of unused memory
[    0.000000] Set 565844 page(s) to 1-1 mapping
[    0.000000] Populating 100000-10a254 pfn range: 41556 pages added
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] Xen: [mem 0x0000000000100000-0x0000000075dabfff] usable
[    0.000000] Xen: [mem 0x0000000075dac000-0x0000000075dbdfff] ACPI data
[    0.000000] Xen: [mem 0x0000000075dde000-0x000000008fffffff] reserved
[    0.000000] Xen: [mem 0x00000000beffe000-0x00000000beffefff] reserved
[    0.000000] Xen: [mem 0x00000000fec00000-0x00000000feefffff] reserved
[    0.000000] Xen: [mem 0x00000000ff800000-0x00000000ffffffff] reserved
[    0.000000] Xen: [mem 0x0000000100000000-0x000000010a253fff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] e820: last_pfn = 0x10a254 max_arch_pfn = 0x400000000
[    0.000000] e820: last_pfn = 0x75dac max_arch_pfn = 0x400000000
[    0.000000] Scanning 1 areas for low memory corruption
[    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[    0.000000] init_memory_mapping: [mem 0x10a000000-0x10a1fffff]
[    0.000000] init_memory_mapping: [mem 0x108000000-0x109ffffff]
[    0.000000] init_memory_mapping: [mem 0x100000000-0x107ffffff]
[    0.000000] init_memory_mapping: [mem 0x00100000-0x75dabfff]
[    0.000000] init_memory_mapping: [mem 0x10a200000-0x10a253fff]
[    0.000000] RAMDISK: [mem 0x023dd000-0x05416fff]
[    0.000000] NUMA turned off
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000010a253fff]
[    0.000000] Initmem setup node 0 [mem 0x00000000-0x10a253fff]
[    0.000000]   NODE_DATA [mem 0x10a24f000-0x10a253fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
[    0.000000]   Normal   [mem 0x100000000-0x10a253fff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x00001000-0x0009ffff]
[    0.000000]   node   0: [mem 0x00100000-0x75dabfff]
[    0.000000]   node   0: [mem 0x100000000-0x10a253fff]
[    0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[    0.000000] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
[    0.000000] e820: [mem 0xbefff000-0xfebfffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on Xen
[    0.000000] Xen version: 4.4.3-pre (preserve-AD)
[    0.000000] setup_percpu: NR_CPUS:256 nr_cpumask_bits:256 nr_cpu_ids:2 nr_node_ids:1
[    0.000000] PERCPU: Embedded 29 pages/cpu @ffff880109800000 s86080 r8192 d24512 u1048576
[    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 515977
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: root=/dev/zvol/diskvm/67ec09dd-a0ed-4c51-8b75-cc08efea62fa/bin/1 ro xencons=tty console=tty1 console=hvc0 iommu=soft libata.fua=1 boot=zfs-z rpool=diskvm bootvol=67ec09dd-a0ed-4c51-8b75-cc08efea62fa/bin/1
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] software IO TLB [mem 0x103400000-0x107400000] (64MB) mapped at [ffff880103400000-ffff8801073fffff]
[    0.000000] Memory: 1921448K/2096764K available (6860K kernel code, 1077K rwdata, 3200K rodata, 1288K init, 1416K bss, 175316K reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[    0.000000] Hierarchical RCU implementation.
[    0.000000]   RCU dyntick-idle grace-period acceleration is enabled.
[    0.000000]   RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=2.
[    0.000000]   Offload RCU callbacks from all CPUs
[    0.000000]   Offload RCU callbacks from CPUs: 0-1.
[    0.000000] NR_IRQS:16640 nr_irqs:288 16

Best wishes,

Simon
