[Xen-devel] mpt3sas bug with Debian jessie kernel only under Xen - "swiotlb buffer is full"
Hi,

I have a Debian jessie server with an LSI SAS controller using the mpt3sas driver. Under the Debian jessie amd64 kernel (linux-image-3.16.0-4-amd64 3.16.36-1+deb8u2) running under Xen, I cannot put the system's storage under heavy load without receiving a bunch of "swiotlb buffer is full" kernel error messages and severely degraded performance. Sometimes the system panics and reboots itself. These problems do not happen if booting the kernel on bare metal.

With a bit of searching I found someone having a similar issue with the Debian jessie kernel (though 686 and several versions back) and the tg3 driver:

https://lists.debian.org/debian-kernel/2015/05/msg00307.html

They mention that suggestions on this list led them to compile a kernel with NEED_DMA_MAP_STATE set. I already seem to have that set:

$ grep NEED_DMA /boot/config-3.16.0-4-amd64
CONFIG_NEED_DMA_MAP_STATE=y

Is there something similar that I could try?

The machine has two SSDs in an md RAID-10 and two spinning disks in another RAID-10. I can induce the situation within a few seconds by telling mdadm to check both of those arrays at the same time. i.e.:

# /usr/share/mdadm/checkarray /dev/md4 # Spinny disks
# /usr/share/mdadm/checkarray /dev/md5 # SSDs

I expect to see 200,000K/sec (my set maximum) checking rate reported in /proc/mdstat for md5, and about 98,000K/sec for md4. This happens on bare metal. Under Xen, it starts off well but then the kernel errors appear within a few seconds; md4's speed drops to ~90,000K/sec and md5's drops right down to just ~100K/sec. If the machine doesn't do a kernel panic and reset itself very soon, it becomes unusably slow anyway.

I can also trigger it with fio if I run jobs against filesystems on both arrays at once.

Some logs appended at the end of this email. Would it be useful for me to show you a "dmesg" and "xl dmesg"? Shall I try a kernel and/or hypervisor from testing?
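For anyone wanting to watch the reproduction described above, here is a minimal sketch of two helper functions. It is only an illustration: the function names mdstat_speeds and count_swiotlb_full are made up, and it assumes /proc/mdstat reports the rate as speed=<number>K/sec with no thousands separators.

```shell
#!/bin/sh
# Hypothetical helpers for watching the md check described above.
# Both read from stdin so they can be fed /proc/mdstat or dmesg output.

# Print "<array> <speed in K/sec>" for every array that is currently
# checking or resyncing. Assumes progress lines contain "speed=<n>K/sec".
mdstat_speeds() {
    awk '
        # An array header looks like "md5 : active raid10 ..."
        /^md[^ ]* :/ { array = $1 }
        # RSTART is where "speed=" begins; skip those 6 characters and
        # drop the trailing "K/sec" (5 characters) to keep only the digits.
        match($0, /speed=[0-9]+K\/sec/) {
            print array, substr($0, RSTART + 6, RLENGTH - 11)
        }
    '
}

# Count kernel log lines that report the swiotlb bounce buffer as full.
count_swiotlb_full() {
    grep -c 'swiotlb buffer is full'
}
```

With both checkarray runs in progress, something like "mdstat_speeds < /proc/mdstat" and "dmesg | count_swiotlb_full" in a loop should show the md5 rate collapsing at about the same time the error count starts to climb.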
Thanks,
Andy

Dec 4 07:06:00 elephant kernel: [22019.373653] mpt3sas 0000:01:00.0: swiotlb buffer is full (sz: 57344 bytes)
Dec 4 07:06:00 elephant kernel: [22019.374707] mpt3sas 0000:01:00.0: swiotlb buffer is full
Dec 4 07:06:00 elephant kernel: [22019.375754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Dec 4 07:06:00 elephant kernel: [22019.376430] IP: [<ffffffffa004e779>] _base_build_sg_scmd_ieee+0x1f9/0x2d0 [mpt3sas]
Dec 4 07:06:00 elephant kernel: [22019.377122] PGD 0
Dec 4 07:06:00 elephant kernel: [22019.377825] Oops: 0000 [#1] SMP
Dec 4 07:06:00 elephant kernel: [22019.378494] Modules linked in: binfmt_misc xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ipt_REJECT xt_LOG xt_limit xt_NFLOG nfnetlink_log nfnetlink xt_multiport xt_tcpudp iptable_filter ip_tables x_tables bonding joydev hid_generic usbhid hid x86_pkg_temp_thermal coretemp crc32_pclmul crc32c_intel iTCO_wdt iTCO_vendor_support evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 ast ttm drm_kms_helper xhci_hcd ehci_pci ehci_hcd drm lpc_ich mfd_core mei_me usbcore mei usb_common igb ptp pps_core dca sg i2c_algo_bit i2c_core shpchp tpm_tis tpm button wmi ipmi_si ipmi_msghandler processor thermal_sys acpi_power_meter fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common ahci libahci libata mpt3sas raid_class scsi_transport_sas scsi_mod
Dec 4 07:06:00 elephant kernel: [22019.384778] CPU: 0 PID: 29516 Comm: md5_resync Not tainted 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Dec 4 07:06:00 elephant kernel: [22019.385574] Hardware name: Supermicro Super Server/X10SRH-CLN4F, BIOS 2.0a 09/20/2016
Dec 4 07:06:00 elephant kernel: [22019.386400] task: ffff8800704ae2d0 ti: ffff88005c410000 task.ti: ffff88005c410000
Dec 4 07:06:00 elephant kernel: [22019.387204] RIP: e030:[<ffffffffa004e779>] [<ffffffffa004e779>] _base_build_sg_scmd_ieee+0x1f9/0x2d0 [mpt3sas]
Dec 4 07:06:00 elephant kernel: [22019.388054] RSP: e02b:ffff88005c413a00 EFLAGS: 00010282
Dec 4 07:06:00 elephant kernel: [22019.388855] RAX: 0000000000000010 RBX: ffff88006fb84070 RCX: ffff88006fb41be0
Dec 4 07:06:00 elephant kernel: [22019.389684] RDX: 0000000000000000 RSI: 00000000ffffff30 RDI: ffff88005c507300
Dec 4 07:06:00 elephant kernel: [22019.390572] RBP: 00000000ffffffff R08: ffff88006f90ae80 R09: 0000000000000000
Dec 4 07:06:00 elephant kernel: [22019.391395] R10: ffff880078eec000 R11: 0000000000000001 R12: ffff880071230720
Dec 4 07:06:00 elephant kernel: [22019.392235] R13: 00000000ffffffeb R14: 00000000fffffff3 R15: 0000000000000000
Dec 4 07:06:00 elephant kernel: [22019.393031] FS: 0000000000000000(0000) GS:ffff880078400000(0000) knlGS:0000000000000000
Dec 4 07:06:00 elephant kernel: [22019.393850] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 4 07:06:00 elephant kernel: [22019.394639] CR2: 0000000000000010 CR3: 000000006ece5000 CR4: 0000000000042660
Dec 4 07:06:00 elephant kernel: [22019.395434] Stack:
Dec 4 07:06:00 elephant kernel: [22019.396253] ffff88006f90ae70 0000008068812a20 ffff88005a1a1500 ffff880071230000
Dec 4 07:06:00 elephant kernel: [22019.397065] ffff88006f880800 ffff88006fcc2800 ffff88006fb84000 000000000000000a
Dec 4 07:06:00 elephant kernel: [22019.397888] ffffffffa0058b9f ffff880071230720 0200000000000080 ffff88005a1a1500
Dec 4 07:06:00 elephant kernel: [22019.398727] Call Trace:
Dec 4 07:06:00 elephant kernel: [22019.399529] [<ffffffffa0058b9f>] ? _scsih_qcmd+0x26f/0x3d0 [mpt3sas]
Dec 4 07:06:00 elephant kernel: [22019.400387] [<ffffffffa00023e4>] ? scsi_dispatch_cmd+0xb4/0x2d0 [scsi_mod]
Dec 4 07:06:00 elephant kernel: [22019.401200] [<ffffffffa000acbd>] ? scsi_request_fn+0x2fd/0x500 [scsi_mod]
Dec 4 07:06:00 elephant kernel: [22019.402002] [<ffffffff8127fe7f>] ? __blk_run_queue+0x2f/0x40
Dec 4 07:06:00 elephant kernel: [22019.402755] [<ffffffff8127ff39>] ? queue_unplugged+0x29/0xc0
Dec 4 07:06:00 elephant kernel: [22019.403509] [<ffffffff81284277>] ? blk_flush_plug_list+0x1f7/0x230
Dec 4 07:06:00 elephant kernel: [22019.404286] [<ffffffff812844ca>] ? blk_queue_bio+0x21a/0x3a0
Dec 4 07:06:00 elephant kernel: [22019.405022] [<ffffffff8127fcb0>] ? generic_make_request+0xb0/0x100
Dec 4 07:06:00 elephant kernel: [22019.405757] [<ffffffffa010908a>] ? sync_request+0x133a/0x18b0 [raid10]
Dec 4 07:06:00 elephant kernel: [22019.406526] [<ffffffffa00dfdf4>] ? md_do_sync+0x944/0xd80 [md_mod]
Dec 4 07:06:00 elephant kernel: [22019.407258] [<ffffffffa00dcb87>] ? md_thread+0x107/0x120 [md_mod]
Dec 4 07:06:00 elephant kernel: [22019.407971] [<ffffffff81514951>] ? __schedule+0x2b1/0x6f0
Dec 4 07:06:00 elephant kernel: [22019.408715] [<ffffffffa00dca80>] ? md_stop+0x40/0x40 [md_mod]
Dec 4 07:06:00 elephant kernel: [22019.409427] [<ffffffff810894bd>] ? kthread+0xbd/0xe0
Dec 4 07:06:00 elephant kernel: [22019.410151] [<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Dec 4 07:06:00 elephant kernel: [22019.410856] [<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Dec 4 07:06:00 elephant kernel: [22019.411539] [<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Dec 4 07:06:00 elephant kernel: [22019.412257] Code: 24 ba 04 00 00 44 89 f6 0f af f0 45 85 f6 0f 85 5e ff ff ff 45 85 ed c6 43 0f 80 c6 43 0e 00 89 73 08 48 89 13 74 48 41 83 fd 01 <49> 8b 47 10 41 8b 57 18 0f 84 a7 00 00 00 41 c6 40 0f 00 41 c6
Dec 4 07:06:00 elephant kernel: [22019.413728] RIP [<ffffffffa004e779>] _base_build_sg_scmd_ieee+0x1f9/0x2d0 [mpt3sas]
Dec 4 07:06:00 elephant kernel: [22019.414469] RSP <ffff88005c413a00>
Dec 4 07:06:00 elephant kernel: [22019.415174] CR2: 0000000000000010
Dec 4 07:06:00 elephant kernel: [22019.415857] ---[ end trace 3fa287bf370969b9 ]---
Dec 4 07:06:31 elephant kernel: [22049.728424] sd 0:0:1:0: attempting task abort! scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.729312] sd 0:0:1:0: [sdb] CDB:
Dec 4 07:06:31 elephant kernel: [22049.729962] Read(10): 28 00 00 04 44 00 00 04 00 00
Dec 4 07:06:31 elephant kernel: [22049.730616] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Dec 4 07:06:31 elephant kernel: [22049.731283] scsi target0:0:1: enclosure_logical_id(0x500304801cb30101), slot(1)
Dec 4 07:06:31 elephant kernel: [22049.760538] sd 0:0:1:0: task abort: FAILED scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.940030] sd 0:0:1:0: attempting device reset! scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.940032] sd 0:0:1:0: [sdb] CDB:
Dec 4 07:06:31 elephant kernel: [22049.940034] Read(10): 28 00 00 04 44 00 00 04 00 00
Dec 4 07:06:31 elephant kernel: [22049.940036] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Dec 4 07:06:31 elephant kernel: [22049.940037] scsi target0:0:1: enclosure_logical_id(0x500304801cb30101), slot(1)
Dec 4 07:06:31 elephant kernel: [22049.967889] sd 0:0:1:0: device reset: FAILED scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.967891] scsi target0:0:1: attempting target reset! scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.967891] sd 0:0:1:0: [sdb] CDB:
Dec 4 07:06:31 elephant kernel: [22049.967893] Read(10): 28 00 00 04 44 00 00 04 00 00
Dec 4 07:06:31 elephant kernel: [22049.967894] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Dec 4 07:06:31 elephant kernel: [22049.967894] scsi target0:0:1: enclosure_logical_id(0x500304801cb30101), slot(1)
Dec 4 07:06:31 elephant kernel: [22049.995732] scsi target0:0:1: target reset: FAILED scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.995733] mpt3sas0: attempting host reset! scmd(ffff88005a1a1500)
Dec 4 07:06:31 elephant kernel: [22049.995733] sd 0:0:1:0: [sdb] CDB:
Dec 4 07:06:31 elephant kernel: [22049.995735] Read(10): 28 00 00 04 44 00 00 04 00 00
Dec 4 07:06:41 elephant kernel: [22059.992654] mpt3sas0: sending diag reset !!
Dec 4 07:06:42 elephant kernel: [22061.013062] mpt3sas0: diag reset: SUCCESS
Dec 4 07:06:43 elephant kernel: [22061.122098] mpt3sas0: LSISAS3008: FWVersion(12.00.02.00), ChipRevision(0x02), BiosVersion(08.29.01.00)
Dec 4 07:06:43 elephant kernel: [22061.122099] mpt3sas0: Protocol=(
Dec 4 07:06:43 elephant kernel: [22061.122099] Initiator
Dec 4 07:06:43 elephant kernel: [22061.122099] ,Target
Dec 4 07:06:43 elephant kernel: [22061.122100] ),
Dec 4 07:06:43 elephant kernel: [22061.122100] Capabilities=(
Dec 4 07:06:43 elephant kernel: [22061.122100] TLR
Dec 4 07:06:43 elephant kernel: [22061.122100] ,EEDP
Dec 4 07:06:43 elephant kernel: [22061.122101] ,Snapshot Buffer
Dec 4 07:06:43 elephant kernel: [22061.122101] ,Diag Trace Buffer
Dec 4 07:06:43 elephant kernel: [22061.122101] ,Task Set Full
Dec 4 07:06:43 elephant kernel: [22061.122101] ,NCQ
Dec 4 07:06:43 elephant kernel: [22061.122101] )
Dec 4 07:06:43 elephant kernel: [22061.122153] mpt3sas0: sending port enable !!
Dec 4 07:06:50 elephant kernel: [22068.799886] mpt3sas0: port enable: SUCCESS
Dec 4 07:06:50 elephant kernel: [22068.800615] mpt3sas0: search for end-devices: start
Dec 4 07:06:50 elephant kernel: [22068.801606] scsi target0:0:0: handle(0x0009), sas_addr(0x4433221100000000), enclosure logical id(0x500304801cb30101), slot(0)
Dec 4 07:06:50 elephant kernel: [22068.802221] scsi target0:0:1: handle(0x000a), sas_addr(0x4433221101000000), enclosure logical id(0x500304801cb30101), slot(1)
Dec 4 07:06:50 elephant kernel: [22068.802875] scsi target0:0:2: handle(0x000b), sas_addr(0x4433221106000000), enclosure logical id(0x500304801cb30101), slot(6)
Dec 4 07:06:50 elephant kernel: [22068.803466] scsi target0:0:3: handle(0x000c), sas_addr(0x4433221107000000), enclosure logical id(0x500304801cb30101), slot(7)
Dec 4 07:06:50 elephant kernel: [22068.804055] mpt3sas0: search for end-devices: complete
Dec 4 07:06:50 elephant kernel: [22068.804588] mpt3sas0: search for expanders: start
Dec 4 07:06:50 elephant kernel: [22068.805143] mpt3sas0: search for expanders: complete
Dec 4 07:06:50 elephant kernel: [22068.805645] mpt3sas0: host reset: SUCCESS scmd(ffff88005a1a1500)
Dec 4 07:07:01 elephant kernel: [22079.849149] mpt3sas0: removing unresponding devices: start
Dec 4 07:07:01 elephant kernel: [22079.849927] mpt3sas0: removing unresponding devices: end-devices
Dec 4 07:07:01 elephant kernel: [22079.850454] mpt3sas0: removing unresponding devices: expanders
Dec 4 07:07:01 elephant kernel: [22079.850971] mpt3sas0: removing unresponding devices: complete
Dec 4 07:07:01 elephant kernel: [22079.851481] mpt3sas0: scan devices: start
Dec 4 07:07:01 elephant kernel: [22079.852242] mpt3sas0: scan devices: expanders start
Dec 4 07:07:01 elephant kernel: [22079.852785] mpt3sas0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
Dec 4 07:07:01 elephant kernel: [22079.853305] mpt3sas0: scan devices: expanders complete
Dec 4 07:07:01 elephant kernel: [22079.853809] mpt3sas0: scan devices: end devices start
Dec 4 07:07:01 elephant kernel: [22079.854930] mpt3sas0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
Dec 4 07:07:01 elephant kernel: [22079.855432] mpt3sas0: scan devices: end devices complete
Dec 4 07:07:01 elephant kernel: [22079.855931] mpt3sas0: scan devices: complete
Dec 4 07:15:20 elephant kernel: [22578.567343] BUG: unable to handle kernel NULL pointer dereference at (null)
Dec 4 07:15:20 elephant kernel: [22578.568618] IP: [<ffffffff8108e6eb>] exit_creds+0x1b/0x60
Dec 4 07:15:20 elephant kernel: [22578.569809] PGD 0
Dec 4 07:15:20 elephant kernel: [22578.571006] Oops: 0002 [#2] SMP
Dec 4 07:15:20 elephant kernel: [22578.572188] Modules linked in: binfmt_misc xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ipt_REJECT xt_LOG xt_limit xt_NFLOG nfnetlink_log nfnetlink xt_multiport xt_tcpudp iptable_filter ip_tables x_tables bonding joydev hid_generic usbhid hid x86_pkg_temp_thermal coretemp crc32_pclmul crc32c_intel iTCO_wdt iTCO_vendor_support evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_i801 ast ttm drm_kms_helper xhci_hcd ehci_pci ehci_hcd drm lpc_ich mfd_core mei_me usbcore mei usb_common igb ptp pps_core dca sg i2c_algo_bit i2c_core shpchp tpm_tis tpm button wmi ipmi_si ipmi_msghandler processor thermal_sys acpi_power_meter fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod raid10 raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common ahci libahci libata mpt3sas raid_class scsi_transport_sas scsi_mod
Dec 4 07:15:20 elephant kernel: [22578.581851] CPU: 1 PID: 29917 Comm: checkarray Tainted: G D 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Dec 4 07:15:20 elephant kernel: [22578.583055] Hardware name: Supermicro Super Server/X10SRH-CLN4F, BIOS 2.0a 09/20/2016
Dec 4 07:15:20 elephant kernel: [22578.584230] task: ffff880065446c60 ti: ffff880058240000 task.ti: ffff880058240000
Dec 4 07:15:20 elephant kernel: [22578.585394] RIP: e030:[<ffffffff8108e6eb>] [<ffffffff8108e6eb>] exit_creds+0x1b/0x60
Dec 4 07:15:20 elephant kernel: [22578.586554] RSP: e02b:ffff880058243e28 EFLAGS: 00010292
Dec 4 07:15:20 elephant kernel: [22578.587690] RAX: ffffffff812380d0 RBX: ffff8800704ae2d0 RCX: 0000000000000000
Dec 4 07:15:20 elephant kernel: [22578.588818] RDX: ffffffff81886ee0 RSI: ffff8800704ae2d0 RDI: 0000000000000000
Dec 4 07:15:20 elephant kernel: [22578.589924] RBP: ffff8800704ae2d0 R08: ffffffff818439c0 R09: 00007f7a067d0670
Dec 4 07:15:20 elephant kernel: [22578.591014] R10: 00007f7a06c157f0 R11: 0000000000000246 R12: 0000000000000000
Dec 4 07:15:20 elephant kernel: [22578.592091] R13: ffff880059b1ac00 R14: 0000000000000005 R15: 0000000000000002
Dec 4 07:15:20 elephant kernel: [22578.593170] FS: 00007f7a06c07700(0000) GS:ffff880078420000(0000) knlGS:0000000000000000
Dec 4 07:15:20 elephant kernel: [22578.594219] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 4 07:15:20 elephant kernel: [22578.595250] CR2: 0000000000000000 CR3: 0000000059f43000 CR4: 0000000000042660
Dec 4 07:15:20 elephant kernel: [22578.596274] Stack:
Dec 4 07:15:20 elephant kernel: [22578.597275] ffff8800704ae2d0 ffffffff8106560d 0000000000000000 ffff8800704ae2d0
Dec 4 07:15:20 elephant kernel: [22578.598280] ffffffff810897b8 ffff88006e418a40 ffff88006ee81150 0000000000000005
Dec 4 07:15:20 elephant kernel: [22578.599272] ffffffffa00dcbe0 ffff88006ee81000 ffff88006ee811b8 ffffffffa00e4645
Dec 4 07:15:20 elephant kernel: [22578.600246] Call Trace:
Dec 4 07:15:20 elephant kernel: [22578.601222] [<ffffffff8106560d>] ? __put_task_struct+0x4d/0x130
Dec 4 07:15:20 elephant kernel: [22578.602166] [<ffffffff810897b8>] ? kthread_stop+0x108/0x110
Dec 4 07:15:20 elephant kernel: [22578.603115] [<ffffffffa00dcbe0>] ? md_unregister_thread+0x40/0x80 [md_mod]
Dec 4 07:15:20 elephant kernel: [22578.604031] [<ffffffffa00e4645>] ? md_reap_sync_thread+0x15/0x150 [md_mod]
Dec 4 07:15:20 elephant kernel: [22578.604929] [<ffffffffa00e47f9>] ? action_store+0x79/0x230 [md_mod]
Dec 4 07:15:20 elephant kernel: [22578.605809] [<ffffffffa00e06f4>] ? md_attr_store+0xb4/0x100 [md_mod]
Dec 4 07:15:20 elephant kernel: [22578.606672] [<ffffffff8121aa0a>] ? kernfs_fop_write+0xda/0x150
Dec 4 07:15:20 elephant kernel: [22578.607515] [<ffffffff811aa872>] ? vfs_write+0xb2/0x1f0
Dec 4 07:15:20 elephant kernel: [22578.608338] [<ffffffff811ab3b2>] ? SyS_write+0x42/0xa0
Dec 4 07:15:20 elephant kernel: [22578.609139] [<ffffffff8151a5a8>] ? page_fault+0x28/0x30
Dec 4 07:15:20 elephant kernel: [22578.609920] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
Dec 4 07:15:20 elephant kernel: [22578.610688] Code: ff 85 c0 0f 84 4f fe ff ff e9 26 fe ff ff 66 90 0f 1f 44 00 00 53 48 89 fb 48 8b bf e0 04 00 00 48 c7 83 e0 04 00 00 00 00 00 00 <f0> ff 0f 74 20 48 8b bb e8 04 00 00 48 c7 83 e8 04 00 00 00 00
Dec 4 07:15:20 elephant kernel: [22578.612291] RIP [<ffffffff8108e6eb>] exit_creds+0x1b/0x60
Dec 4 07:15:20 elephant kernel: [22578.613040] RSP <ffff880058243e28>
Dec 4 07:15:20 elephant kernel: [22578.613781] CR2: 0000000000000000
Dec 4 07:15:20 elephant kernel: [22578.614508] ---[ end trace 3fa287bf370969ba ]---

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel