[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] [BUG] kernel panics with drbd


  • To: xen-devel@xxxxxxxxxxxxx
  • From: Matthew Vernon <mcv21@xxxxxxxxx>
  • Date: Tue, 4 Aug 2015 14:52:56 +0100
  • Delivery-date: Tue, 04 Aug 2015 13:53:03 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hi,

I'm getting dom0 kernel panics, relating to moderately heavy use of
drbd. I think this is a Xen bug.

My Xen hosts are Debian jessie amd64 boxes, on slightly elderly Intel
kit. 

Linux ophon 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) 
x86_64 GNU/Linux
Linux opus 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17) 
x86_64 GNU/Linux

Both have the standard jessie versions of Xen - 4.4.1-9+deb8u1 and
xen-tools - 4.5-1

I have disable_sendpage enabled for drbd:
root@opus:~# cat /etc/modprobe.d/drbd.conf 
options drbd disable_sendpage=1
root@opus:~# cat /sys/module/drbd/parameters/disable_sendpage 
Y
root@ophon:~# cat /etc/modprobe.d/drbd.conf 
options drbd disable_sendpage=1
root@ophon:~# cat /sys/module/drbd/parameters/disable_sendpage 
Y

I have a script running on "ophon" that sets up a drbd device (itself
as primary, "opus" as secondary), makes an LVM pv+vg on top of that
drbd device,  and then calls xen-create-image[0]. "opus" typically kernel
panics shortly after xen-create-image starts.

I attach the relevant bit of kern.log from one such crash to this mail
- you can see the drbd operations happening a second or so before the
crash. I also attach the relevant drbd .res file

The bug is not 100% repeatable, but still fairly reliable (for obvious
reasons, extensive testing and hard-rebooting my kit is not a very
joyous prospect). I did once achieve a similar result by running
drbd-overview on opus, which said
kernel:[ 1127.630208] BUG: soft lockup - CPU#2 stuck for 23s!
[xenstored:864]
on console and then panicked much as before.

The "amusing" quirk is that similar code worked a couple of weeks ago
when I last tried it; that code does now also produce kernel panics
AFAICT (with a not-100%-repeatable bug and long reproduction
timescales 'cos of having to power-cycle etc. it's hard to be
completely certain).

The two hosts are part of a pacemaker cluster, and "opus" is otherwise
able to run guests fine.

I hope that's sufficient information; I'm happy to supply other config
files etc. if necessary.

Regards,

Matthew

[0] The code in question is in fact a python script; running on ophon,
it does the following (using ssh to run commands on opus):
--both hosts--
lvcreate -L 20G -nmwsig-mws-02474 guests
drbdadm -- --force create-md
drbdadm up mws-02474
--ophon only--
drbdadm wait-connect
drbdadm new-current-uuid --clear-bitmap minor-4
drbdadm primary mws-02474
pvcreate /dev/drbd4
vgcreate mws-02474-vg /dev/drbd4
xen-create-image ... --lvm mws-02474-vg

resource mws-02474 {
  device /dev/drbd4;
  disk /dev/guests/mwsig-mws-02474;
  meta-disk internal;
  on ophon {
    address 131.111.8.218:7792;
    }
  on opus {
    address 131.111.8.217:7792;
    }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
}

Aug  3 16:03:13 opus kernel: [ 1250.026811] drbd mws-priv-1: Starting worker 
thread (from drbdsetup-84 [12987])
Aug  3 16:03:13 opus kernel: [ 1250.027313] block drbd4: disk( Diskless -> 
Attaching ) 
Aug  3 16:03:13 opus kernel: [ 1250.027409] drbd mws-priv-1: Method to ensure 
write ordering: flush
Aug  3 16:03:13 opus kernel: [ 1250.027413] block drbd4: max BIO size = 4096
Aug  3 16:03:13 opus kernel: [ 1250.027418] block drbd4: drbd_bm_resize called 
with capacity == 41941688
Aug  3 16:03:13 opus kernel: [ 1250.027558] block drbd4: resync bitmap: 
bits=5242711 words=81918 pages=160
Aug  3 16:03:13 opus kernel: [ 1250.027561] block drbd4: size = 20 GB (20970844 
KB)
Aug  3 16:03:13 opus kernel: [ 1250.032268] block drbd4: Writing the whole 
bitmap, size changed
Aug  3 16:03:13 opus kernel: [ 1250.047827] block drbd4: bitmap WRITE of 160 
pages took 4 jiffies
Aug  3 16:03:13 opus kernel: [ 1250.061634] block drbd4: 20 GB (5242711 bits) 
marked out-of-sync by on disk bit-map.
Aug  3 16:03:13 opus kernel: [ 1250.180186] block drbd4: bitmap READ of 160 
pages took 2 jiffies
Aug  3 16:03:13 opus kernel: [ 1250.180291] block drbd4: recounting of set bits 
took additional 0 jiffies
Aug  3 16:03:13 opus kernel: [ 1250.180293] block drbd4: 20 GB (5242711 bits) 
marked out-of-sync by on disk bit-map.
Aug  3 16:03:13 opus kernel: [ 1250.180304] block drbd4: Suspended AL updates
Aug  3 16:03:13 opus kernel: [ 1250.180307] block drbd4: disk( Attaching -> 
Inconsistent ) 
Aug  3 16:03:13 opus kernel: [ 1250.180310] block drbd4: attached to UUIDs 
0000000000000004:0000000000000000:0000000000000000:0000000000000000
Aug  3 16:03:13 opus kernel: [ 1250.191161] drbd mws-priv-1: conn( StandAlone 
-> Unconnected ) 
Aug  3 16:03:13 opus kernel: [ 1250.191183] drbd mws-priv-1: Starting receiver 
thread (from drbd_w_mws-priv [12989])
Aug  3 16:03:13 opus kernel: [ 1250.191345] drbd mws-priv-1: receiver 
(re)started
Aug  3 16:03:13 opus kernel: [ 1250.191360] drbd mws-priv-1: conn( Unconnected 
-> WFConnection ) 
Aug  3 16:03:13 opus kernel: [ 1250.689576] drbd mws-priv-1: Handshake 
successful: Agreed network protocol version 101
Aug  3 16:03:13 opus kernel: [ 1250.689580] drbd mws-priv-1: Agreed to support 
TRIM on protocol level
Aug  3 16:03:13 opus kernel: [ 1250.689616] drbd mws-priv-1: conn( WFConnection 
-> WFReportParams ) 
Aug  3 16:03:13 opus kernel: [ 1250.689631] drbd mws-priv-1: Starting asender 
thread (from drbd_r_mws-priv [12992])
Aug  3 16:03:13 opus kernel: [ 1250.737084] block drbd4: max BIO size = 1048576
Aug  3 16:03:13 opus kernel: [ 1250.737091] block drbd4: drbd_sync_handshake:
Aug  3 16:03:13 opus kernel: [ 1250.737094] block drbd4: self 
0000000000000004:0000000000000000:0000000000000000:0000000000000000 
bits:5242711 flags:0
Aug  3 16:03:13 opus kernel: [ 1250.737096] block drbd4: peer 
0000000000000004:0000000000000000:0000000000000000:0000000000000000 
bits:5242711 flags:0
Aug  3 16:03:13 opus kernel: [ 1250.737098] block drbd4: uuid_compare()=0 by 
rule 10
Aug  3 16:03:13 opus kernel: [ 1250.737100] block drbd4: No resync, but 5242711 
bits in bitmap!
Aug  3 16:03:13 opus kernel: [ 1250.737105] block drbd4: peer( Unknown -> 
Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent 
) 
Aug  3 16:03:13 opus kernel: [ 1250.737109] block drbd4: Resumed AL updates
Aug  3 16:03:14 opus kernel: [ 1250.773903] block drbd4: Accepted new current 
UUID, preparing to skip initial sync
Aug  3 16:03:14 opus kernel: [ 1250.777061] block drbd4: bitmap WRITE of 160 
pages took 1 jiffies
Aug  3 16:03:14 opus kernel: [ 1250.788564] block drbd4: 0 KB (0 bits) marked 
out-of-sync by on disk bit-map.
Aug  3 16:03:14 opus kernel: [ 1250.788573] block drbd4: disk( Inconsistent -> 
UpToDate ) pdsk( Inconsistent -> UpToDate ) 
Aug  3 16:03:14 opus kernel: [ 1250.797104] block drbd4: receiver updated UUIDs 
to 14460554106EF79A:0000000000000000:0000000000000000:0000000000000000
Aug  3 16:03:14 opus kernel: [ 1250.797117] block drbd4: peer( Secondary -> 
Primary ) 
Aug  3 16:03:15 opus kernel: [ 1251.748952] kernel tried to execute 
NX-protected page - exploit attempt? (uid: 0)
Aug  3 16:03:15 opus kernel: [ 1251.748983] BUG: unable to handle kernel paging 
request at ffff8800022f3d88
Aug  3 16:03:15 opus kernel: [ 1251.749016] IP: [<ffff8800022f3d88>] 
0xffff8800022f3d88
Aug  3 16:03:15 opus kernel: [ 1251.749041] PGD 1814067 PUD 1815067 PMD 2f81067 
PTE 80100000022f3067
Aug  3 16:03:15 opus kernel: [ 1251.749082] Oops: 0011 [#1] SMP 
Aug  3 16:03:15 opus kernel: [ 1251.749106] Modules linked in: xt_physdev 
iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn 
xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul 
ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel 
evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid 
drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma 
edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor 
thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod 
raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul 
crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd ehci_hcd 
scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Aug  3 16:03:15 opus kernel: [ 1251.749691] CPU: 0 PID: 12993 Comm: 
drbd_a_mws-priv Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt11-1+deb8u2
Aug  3 16:03:15 opus kernel: [ 1251.749726] Hardware name: Intel Corporation 
S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Aug  3 16:03:15 opus kernel: [ 1251.749765] task: ffff8800171742d0 ti: 
ffff8800022f0000 task.ti: ffff8800022f0000
Aug  3 16:03:15 opus kernel: [ 1251.749847] RIP: e030:[<ffff8800022f3d88>]  
[<ffff8800022f3d88>] 0xffff8800022f3d88
Aug  3 16:03:15 opus kernel: [ 1251.749934] RSP: e02b:ffff8800022f3d90  EFLAGS: 
00010212
Aug  3 16:03:15 opus kernel: [ 1251.749984] RAX: 00000000fffffffc RBX: 
ffffffffffffffff RCX: 0000000000000113
Aug  3 16:03:15 opus kernel: [ 1251.750039] RDX: 0000000000000113 RSI: 
00000000fffffe01 RDI: ffffffff81463f75
Aug  3 16:03:15 opus kernel: [ 1251.750094] RBP: ffff8800171742d0 R08: 
ffff8800022f0000 R09: 0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.750150] R10: ffff88001751b890 R11: 
0000000000000000 R12: 0000000000000001
Aug  3 16:03:15 opus kernel: [ 1251.750205] R13: 0000000000000000 R14: 
0000000000000010 R15: ffff880016c92000
Aug  3 16:03:15 opus kernel: [ 1251.750263] FS:  00007f90c43c0740(0000) 
GS:ffff88001fa00000(0000) knlGS:0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.750348] CS:  e033 DS: 0000 ES: 0000 CR0: 
000000008005003b
Aug  3 16:03:15 opus kernel: [ 1251.750399] CR2: ffff8800022f3d88 CR3: 
000000001073f000 CR4: 0000000000002660
Aug  3 16:03:15 opus kernel: [ 1251.750454] Stack:
Aug  3 16:03:15 opus kernel: [ 1251.750493]  ffff8800022f3d88 0000000000000010 
0000000000000000 0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.750589]  ffff8800022f3d90 0000000000000001 
0000000000000000 0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.750686]  0000000000004100 ffffffffa02577be 
ffff880016c92080 0000001000000000
Aug  3 16:03:15 opus kernel: [ 1251.750783] Call Trace:
Aug  3 16:03:15 opus kernel: [ 1251.750829]  [<ffffffffa02577be>] ? 
drbd_asender+0x27e/0x750 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.750887]  [<ffffffffa0260d00>] ? 
drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.750947]  [<ffffffffa0260d46>] ? 
drbd_thread_setup+0x46/0x130 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.751006]  [<ffffffffa0260d00>] ? 
drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.751065]  [<ffffffff81087fad>] ? 
kthread+0xbd/0xe0
Aug  3 16:03:15 opus kernel: [ 1251.751114]  [<ffffffff81087ef0>] ? 
kthread_create_on_node+0x180/0x180
Aug  3 16:03:15 opus kernel: [ 1251.751170]  [<ffffffff815114d8>] ? 
ret_from_fork+0x58/0x90
Aug  3 16:03:15 opus kernel: [ 1251.751221]  [<ffffffff81087ef0>] ? 
kthread_create_on_node+0x180/0x180
Aug  3 16:03:15 opus kernel: [ 1251.751274] Code: ff ff ff ff ff ff ff ff ff ff 
ff 88 3d 2f 02 00 88 ff ff 30 e0 00 00 00 00 00 00 12 02 01 00 00 00 00 00 90 
3d 2f 02 00 88 ff ff <2b> e0 00 00 00 00 00 00 88 3d 2f 02 00 88 ff ff 10 00 00 
00 00 
Aug  3 16:03:15 opus kernel: [ 1251.751694] RIP  [<ffff8800022f3d88>] 
0xffff8800022f3d88
Aug  3 16:03:15 opus kernel: [ 1251.751747]  RSP <ffff8800022f3d90>
Aug  3 16:03:15 opus kernel: [ 1251.751790] CR2: ffff8800022f3d88
Aug  3 16:03:15 opus kernel: [ 1251.752128] ---[ end trace 975e04f66c2d9004 ]---
Aug  3 16:03:15 opus kernel: [ 1251.835012] BUG: unable to handle kernel NULL 
pointer dereference at 0000000000000008
Aug  3 16:03:15 opus kernel: [ 1251.835235] IP: [<ffffffffa02453bd>] 
drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.835408] PGD 0 
Aug  3 16:03:15 opus kernel: [ 1251.835531] Oops: 0002 [#2] SMP 
Aug  3 16:03:15 opus kernel: [ 1251.835704] Modules linked in: xt_physdev 
iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn 
xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
sunrpc bridge stp llc intel_powerclamp coretemp crc32_pclmul 
ghash_clmulni_intel joydev hid_generic iTCO_wdt iTCO_vendor_support aesni_intel 
evdev aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbhid ttm hid 
drm_kms_helper pcspkr drm i2c_i801 lpc_ich mfd_core i7core_edac ioatdma 
edac_core tpm_tis tpm ipmi_si ipmi_msghandler button shpchp processor 
thermal_sys drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod 
raid1 md_mod sg sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul 
crct10dif_common crc32c_intel ahci libahci libata ehci_pci uhci_hcd ehci_hcd 
scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Aug  3 16:03:15 opus kernel: [ 1251.840379] CPU: 0 PID: 12992 Comm: 
drbd_r_mws-priv Tainted: G      D       3.16.0-4-amd64 #1 Debian 
3.16.7-ckt11-1+deb8u2
Aug  3 16:03:15 opus kernel: [ 1251.840515] Hardware name: Intel Corporation 
S5500WBV/S5500WB, BIOS S5500.86B.01.00.0061.030920121535 03/09/2012
Aug  3 16:03:15 opus kernel: [ 1251.840648] task: ffff880016c1a050 ti: 
ffff8800173e4000 task.ti: ffff8800173e4000
Aug  3 16:03:15 opus kernel: [ 1251.840771] RIP: e030:[<ffffffffa02453bd>]  
[<ffffffffa02453bd>] drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.840951] RSP: e02b:ffff8800173e7ce0  EFLAGS: 
00010097
Aug  3 16:03:15 opus kernel: [ 1251.841041] RAX: 0000000000000000 RBX: 
ffff880017439700 RCX: 000000000000009c
Aug  3 16:03:15 opus kernel: [ 1251.841137] RDX: 0000000000000000 RSI: 
ffff88000c850200 RDI: ffff88000c85bed0
Aug  3 16:03:15 opus kernel: [ 1251.841233] RBP: ffff88000cb22800 R08: 
0000000000000cce R09: ffff88000c850200
Aug  3 16:03:15 opus kernel: [ 1251.841329] R10: 0000000000007ff0 R11: 
0000000000000000 R12: ffff880002b676a0
Aug  3 16:03:15 opus kernel: [ 1251.841424] R13: ffff88001f8463b0 R14: 
ffff88000cb22bb0 R15: ffff88000cb22800
Aug  3 16:03:15 opus kernel: [ 1251.841523] FS:  00007f90c43c0740(0000) 
GS:ffff88001fa00000(0000) knlGS:0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.841648] CS:  e033 DS: 0000 ES: 0000 CR0: 
000000008005003b
Aug  3 16:03:15 opus kernel: [ 1251.841739] CR2: 0000000000000008 CR3: 
000000001073f000 CR4: 0000000000002660
Aug  3 16:03:15 opus kernel: [ 1251.841835] Stack:
Aug  3 16:03:15 opus kernel: [ 1251.841913]  0000000000000000 0000000000030003 
ffff8800174397b8 0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.842216]  0000000000000000 0000000000102800 
0000000000400000 0000000000104800
Aug  3 16:03:15 opus kernel: [ 1251.842518]  0000000000000000 0000000000102800 
0000000000000000 0000000000000000
Aug  3 16:03:15 opus kernel: [ 1251.842820] Call Trace:
Aug  3 16:03:15 opus kernel: [ 1251.842906]  [<ffffffffa0254bb5>] ? 
drbd_submit_peer_request+0x85/0x330 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843033]  [<ffffffffa02556ea>] ? 
receive_Data+0x36a/0xe40 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843130]  [<ffffffffa0257407>] ? 
drbd_receiver+0x117/0x250 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843228]  [<ffffffffa0260d00>] ? 
drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843328]  [<ffffffffa0260d46>] ? 
drbd_thread_setup+0x46/0x130 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843427]  [<ffffffffa0260d00>] ? 
drbd_destroy_connection+0xc0/0xc0 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.843524]  [<ffffffff81087fad>] ? 
kthread+0xbd/0xe0
Aug  3 16:03:15 opus kernel: [ 1251.843614]  [<ffffffff81087ef0>] ? 
kthread_create_on_node+0x180/0x180
Aug  3 16:03:15 opus kernel: [ 1251.843709]  [<ffffffff815114d8>] ? 
ret_from_fork+0x58/0x90
Aug  3 16:03:15 opus kernel: [ 1251.843801]  [<ffffffff81087ef0>] ? 
kthread_create_on_node+0x180/0x180
Aug  3 16:03:15 opus kernel: [ 1251.843894] Code: 04 48 8b 45 00 48 8d b8 d0 00 
00 00 e8 dd bc 2c e1 8b 53 58 49 89 c1 c1 ea 09 01 95 54 02 00 00 49 83 fd ff 
48 8b 13 48 8b 43 08 <48> 89 42 08 48 89 10 48 8d 85 c0 03 00 00 48 8b 95 c8 03 
00 00 
Aug  3 16:03:15 opus kernel: [ 1251.846996] RIP  [<ffffffffa02453bd>] 
drbd_endio_write_sec_final+0x9d/0x480 [drbd]
Aug  3 16:03:15 opus kernel: [ 1251.847168]  RSP <ffff8800173e7ce0>
Aug  3 16:03:15 opus kernel: [ 1251.847252] CR2: 0000000000000008
Aug  3 16:03:15 opus kernel: [ 1251.847335] ---[ end trace 975e04f66c2d9005 ]---
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.