
System hangs when NVMe is under load



Hello,

I apologize in advance if I am sending this to the wrong people.

We have a strange situation going on here with a couple of our servers. We've been experiencing issues with the combination of Debian+XEN+Samsung NVMe.

Problem:

It all began with https://serverfault.com/questions/1006366/samsung-nvme-disappears-when-server-on-average-to-high-load

The situation is close to the one described above, with some differences. Now it can be reproduced.

  • OS: 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1
  • CPUS: Intel(R) Xeon(R) CPU E5-1650 v4
  • NVMe: Samsung MZ1LB1T9HALS-00007
  • xen_version            : 4.11.4-pre
  • Server: Supermicro Super Server/X10SRW-F, BIOS 3.2

We've gathered some more information: it happens only when XEN is loaded.

The following command triggers the problem, and it does so fast: in this setup it needs only approx. 20 seconds to hang the whole system. I am attaching the call trace that occurs during the hang.

date; echo; fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=256 --runtime=345600 --numjobs=10 --time_based --group_reporting --name=iops-test-job --readonly --output=fio_log.randread4k.log; date

I am currently running the test on one of the nodes, which I have booted without XEN. Keep in mind that all servers are provisioned with Ansible and are identical.
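Since the comparison is between booting with and without XEN, it may help to record the hypervisor state next to each fio log. The following is only a sketch: the function name hypervisor_type is hypothetical, and it assumes the kernel exposes /sys/hypervisor/type (which contains "xen" when running under XEN and is typically absent otherwise).

```shell
#!/bin/sh
# Hypothetical helper: print the hypervisor type reported by the kernel,
# or "none" when the sysfs file is not readable (e.g. booted without XEN).
# The optional argument overrides the path, which is only useful for testing.
hypervisor_type() {
    file=${1:-/sys/hypervisor/type}
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo none
    fi
}

# Example use (not executed here): tag the fio log with the boot mode.
#   fio ... --output="fio_log.randread4k.$(hypervisor_type).log"
```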

What has been tried so far:

Setting the kernel option nvme_core.default_ps_max_latency_us to 5500/200, as suggested in https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe#Samsung_drive_errors_on_Linux_4.10 and https://askubuntu.com/questions/905710/ext4-fs-error-after-ubuntu-17-04-upgrade

Setting the kernel option nvme_core.force_apst=1 in an attempt to force APST, since "nvme id-ctrl /dev/nvme0n1 | grep apst" reports "apsta     : 0".

  • First try - no success.
  • Forcing APST to Y - no success.
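To rule out that the options above simply never reached the kernel, the effective values can be checked after boot. This is a minimal sketch, not something taken from the test runs: the helper names param_from_cmdline and apsta_from_idctrl are hypothetical, and it assumes the options are given on the dom0 Linux kernel command line so that they show up in /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: print the value of NAME=value from a kernel command line.
# Reads /proc/cmdline unless a different file is given as the second argument.
# If the option appears more than once, the last occurrence is printed.
param_from_cmdline() {
    awk -v name="$1" '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == name)
                val = substr($i, n + 1)
        }
    } END { if (val != "") print val }' "${2:-/proc/cmdline}"
}

# Hypothetical helper: extract the apsta field from `nvme id-ctrl` output on stdin.
apsta_from_idctrl() {
    awk -F: '$1 ~ /^apsta[[:space:]]*$/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Example use on the affected host (not executed here):
#   param_from_cmdline nvme_core.default_ps_max_latency_us
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
#   nvme id-ctrl /dev/nvme0n1 | apsta_from_idctrl
```

If the value from /proc/cmdline and the value under /sys/module/nvme_core/parameters/ differ, the option was most likely not applied the way it was intended.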

I have somewhat "overheated" on the subject by now and could be missing something important. Let me know if you need any more information.

NB: We began testing this cluster because it was showing really slow disk-related operations (on the NVMe). For comparison, the other cluster (mentioned in the serverfault question) never showed any performance issues.

Best Regards,

-- 
Stanislav Ivanov
System Administrator
–––––––––––––––––––––––––
Abilix Soft LTD.
Varna, Studentska St. 1A, Office 24B
Support: +359 700 911 44
https://abscloud.eu

Attachment: CallTraceXenNvmeProblem.txt
Description: Text document


 

