
Re: Network stalls on domU under Xen-4.14.x -- solved, -ish




On 15.02.2021 12:02, Håkon Alstadheim wrote:
I have recently been having total network stalls on some domUs. Dmesg on the domU shows a number of lines like:

Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1

Update, for the record.

One piece of information was missing from my previous report: I have been running the backend with voluntary preemption (CONFIG_PREEMPT_VOLUNTARY) enabled in the dom0 kernel. My set-up is basically an organically grown fuzzing machine for Xen, and I have several ill-considered settings turned on. Anyway, changing the kernel config to what is shown below and upgrading to Linux kernel 5.10.17 fixes my issue. The changelog for linux-5.10.17 pointed me to the main culprit, a deadlock in xen_netback, but even with that fix in, I still had issues. I figured that turning off preemption would further reduce the risk of constipation on the backend, and that seems to be true.

----linux config that works: ----

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
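
For anyone wanting to compare their own dom0 kernel against the settings above, here is a minimal Python sketch that reports which preemption option is enabled in a kernel config text. The function name and the sample string are my own illustration, not part of any tool; where the config text comes from (for example a saved .config file) is left to the reader.

```python
import re

# The three mutually exclusive preemption choices listed in the config above.
PREEMPT_OPTIONS = (
    "CONFIG_PREEMPT_NONE",
    "CONFIG_PREEMPT_VOLUNTARY",
    "CONFIG_PREEMPT",
)


def preemption_model(config_text):
    """Return the preemption option set to 'y' in the config text, or None."""
    for line in config_text.splitlines():
        # Only lines of the form CONFIG_FOO=y count as enabled;
        # "# CONFIG_FOO is not set" lines are comments and are skipped.
        m = re.match(r"^(CONFIG_[A-Z0-9_]+)=y$", line.strip())
        if m and m.group(1) in PREEMPT_OPTIONS:
            return m.group(1)
    return None


if __name__ == "__main__":
    sample = (
        "CONFIG_PREEMPT_NONE=y\n"
        "# CONFIG_PREEMPT_VOLUNTARY is not set\n"
        "# CONFIG_PREEMPT is not set\n"
    )
    print(preemption_model(sample))
```

With the working config from this mail, the function returns CONFIG_PREEMPT_NONE.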

[ extra info from previous mail: ]

Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:40 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:42 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:45 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:52 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:13:05 gt kernel: net eth0: rx->offset: 0, size: -1

On occasion, with the longer stalls (~ 5 minutes) I get:

Feb 15 09:29:04 gt kernel: net_ratelimit: 5 callbacks suppressed

I have tried this on Xen 4.14.0, 4.14.1 and 4.14.2-pre, with various guest kernels ranging from linux-4.19.170 to the early 5.10.x kernels. Newer 5.10 kernels give me some other error, to do with interrupts. It seems interrupt vectors point to La-La-Land, or else they are routed to the wrong CPU. I'm fairly certain I did not have this issue running Xen-4.14-staging with the earliest linux-5.10.x, but that had other issues. File-system corruption cost me a week around Christmas, with the whole system down :-( . It did allow me to learn how to use bacula from a grml rescue CD without a catalog database :-) .

The stalls happen under load (network or CPU, I don't know which matters more). I can reliably reproduce them if I run a lot of compilations and network fetches in the domU while simultaneously launching Firefox and Thunderbird. I have home mounted over NFS from the dom0, so there is a lot of traffic when Thunderbird and Firefox launch.

On occasion the stalls are caught by the kernel and I get a stack trace, but I guess those are consequences of the network stall, incidental to the real issue. For example:

Feb 15 09:09:38 gt kernel:     status: r
Feb 15 09:09:38 gt kernel: net_ratelimit: 5 callbacks suppressed
Feb 15 09:09:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:09:38 gt root[45567]: ACPI event unhandled: jack/lineout LINEOUT unplug
Feb 15 09:09:38 gt root[45570]: ACPI event unhandled: jack/videoout VIDEOOUT unplug
Feb 15 09:09:44 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:09:57 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:10:01 gt CROND[45682]: (root) CMD (/usr/lib/sa/sa1 1 1)
Feb 15 09:10:23 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:11:17 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:11:58 gt kernel: INFO: task IndexedDB #3:45442 blocked for more than 122 seconds.
Feb 15 09:11:58 gt kernel:       Not tainted 5.4.80-gentoo-r1-x86_64 #1
Feb 15 09:11:58 gt kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 09:11:58 gt kernel: IndexedDB #3    D    0 45442   3451 0x00000000
Feb 15 09:11:58 gt kernel: Call Trace:
Feb 15 09:11:58 gt kernel:  __schedule+0x2a3/0x7a0
Feb 15 09:11:58 gt kernel:  ? nfs_pageio_complete+0xa8/0xf0
Feb 15 09:11:58 gt kernel:  schedule+0x34/0xa0
Feb 15 09:11:58 gt kernel:  io_schedule+0x3c/0x60
Feb 15 09:11:58 gt kernel:  wait_on_page_bit_common+0x125/0x330
Feb 15 09:11:58 gt kernel:  ? trace_event_raw_event_file_check_and_advance_wb_err+0xf0/0xf0
Feb 15 09:11:58 gt kernel:  __filemap_fdatawait_range+0x7b/0xe0
Feb 15 09:11:58 gt kernel:  file_write_and_wait_range+0x67/0x90
Feb 15 09:11:58 gt kernel:  nfs_file_fsync+0x83/0x190
Feb 15 09:11:58 gt kernel:  __x64_sys_fsync+0x2f/0x60
Feb 15 09:11:58 gt kernel:  do_syscall_64+0x51/0x130
Feb 15 09:11:58 gt kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 15 09:11:58 gt kernel: RIP: 0033:0x7f4db9580e1b
Feb 15 09:11:58 gt kernel: Code: Bad RIP value.
Feb 15 09:11:58 gt kernel: RSP: 002b:00007f4d9b4b4d50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Feb 15 09:11:58 gt kernel: RAX: ffffffffffffffda RBX: 00007f4d9f2abd28 RCX: 00007f4db9580e1b
Feb 15 09:11:58 gt kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000072
Feb 15 09:11:58 gt kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 00007f4d9b4b4d70
Feb 15 09:11:58 gt kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000001f5
Feb 15 09:11:59 gt kernel: R13: 00007f4d9f2abc70 R14: 0000000000000000 R15: 00007f4da63774e0
---------

My xl info just now:

xl info
host                   : gentoo
release                : 5.4.97-gentoo-x86_64
version                : #1 SMP Wed Feb 10 16:43:41 CET 2021
machine                : x86_64
nr_cpus                : 12
max_cpu_id             : 11
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2399.981
hw_caps                : bfebfbff:77fef3ff:2c100800:00000021:00000001:000037ab:00000000:00000100
virt_caps              : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share
total_memory           : 130953
free_memory            : 1551
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 14
xen_extra              : .2-pre
xen_version            : 4.14.2-pre
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit2
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          :
xen_commandline        : xen.cfg xen-marker-51 console_timestamps=date iommu=1 com1=115200,8n1 console=com1 conswitch=lx cpufreq=xen:performance,verbose smt=0 maxcpus=12 core_parking=power nmi=dom0 gnttab_max_frames=512 gnttab_max_maptrack_frames=1024 vcpu_migration_delay=2000 tickle_one_idle_cpu=1 spec-ctrl=no-xen sched=credit2 timer_slop=5000 max_cstate=2 dom0_mem=16G,max:16G dom0_max_vcpus=8 ept=exec_sp=1
cc_compiler            : gcc (Gentoo 9.3.0-r2 p4) 9.3.0
cc_compile_by          : hakon
cc_compile_domain      : alstadheim.priv.no
cc_compile_date        : Sat Feb 13 22:07:40 CET 2021
build_id               : d3fb26987b749da48c2549b12ba9ea4a
xend_config_format     : 4
0:root@gentoo xen-consoles #
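
In case someone wants to compare set-ups, here is a small Python sketch that turns a pasted dump like the one above into a dictionary and splits the xen_commandline value into individual options. Both function names are hypothetical, and the parsing is based only on the "key : value" layout visible in this mail, not on any documented output format.

```python
import re

# Keys in the dump above are lowercase identifiers such as "xen_version".
KEY_RE = re.compile(r"[a-z_][a-z0-9_]*$")


def parse_xl_info(text):
    """Parse 'key : value' lines into a dict, skipping anything else."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as hw_caps contain colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and KEY_RE.match(key):
            info[key] = value.strip()
    return info


def xen_cmdline_options(info):
    """Split the xen_commandline value into a name -> value dict."""
    opts = {}
    for token in info.get("xen_commandline", "").split():
        # Split on the first '=' only, so "ept=exec_sp=1" keeps "exec_sp=1".
        name, _, value = token.partition("=")
        opts[name] = value
    return opts


if __name__ == "__main__":
    sample = (
        "xen_scheduler          : credit2\n"
        "xen_commandline        : smt=0 sched=credit2 dom0_max_vcpus=8\n"
    )
    info = parse_xl_info(sample)
    print(info["xen_scheduler"], xen_cmdline_options(info))
```

Lines without a colon, and lines whose left-hand side is not a plain lowercase key (such as the shell prompt), are ignored.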


P.S.: I know I should do something about my DMARC set-up, so that I can have a separate, unprotected "From:" address for posting to mailing lists. Pointers to a how-to appreciated.

---

Håkon







 

