
Re: [BUG] Core scheduling patches causing deadlock in some situations



On 29.05.20 14:51, Michał Leszczyński wrote:
----- On 29 May 2020 at 14:44, Jürgen Groß jgross@xxxxxxxx wrote:

On 29.05.20 14:30, Michał Leszczyński wrote:
Hello,

I'm running DRAKVUF on a Dell Inc. PowerEdge R640/08HT8T server with an
Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz.
After upgrading from Xen RELEASE 4.12 to 4.13, we noticed some stability
problems, namely freezes of Dom0 (Debian Buster):

---

maj 27 23:17:02 debian kernel: rcu: INFO: rcu_sched self-detected stall on CPU
maj 27 23:17:02 debian kernel: rcu: 0-....: (5250 ticks this GP)
idle=cee/1/0x4000000000000002 softirq=11964/11964 fqs=2515
maj 27 23:17:02 debian kernel: rcu: (t=5251 jiffies g=27237 q=799)
maj 27 23:17:02 debian kernel: NMI backtrace for cpu 0
maj 27 23:17:02 debian kernel: CPU: 0 PID: 643 Comm: z_rd_int_1 Tainted: P OE
4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u2
maj 27 23:17:02 debian kernel: Hardware name: Dell Inc. PowerEdge R640/08HT8T,
BIOS 2.1.8 04/30/2019
maj 27 23:17:02 debian kernel: Call Trace:
maj 27 23:17:02 debian kernel: <IRQ>
maj 27 23:17:02 debian kernel: dump_stack+0x5c/0x80
maj 27 23:17:02 debian kernel: nmi_cpu_backtrace.cold.4+0x13/0x50
maj 27 23:17:02 debian kernel: ? lapic_can_unplug_cpu.cold.29+0x3b/0x3b
maj 27 23:17:02 debian kernel: nmi_trigger_cpumask_backtrace+0xf9/0xfb
maj 27 23:17:02 debian kernel: rcu_dump_cpu_stacks+0x9b/0xcb
maj 27 23:17:02 debian kernel: rcu_check_callbacks.cold.81+0x1db/0x335
maj 27 23:17:02 debian kernel: ? tick_sched_do_timer+0x60/0x60
maj 27 23:17:02 debian kernel: update_process_times+0x28/0x60
maj 27 23:17:02 debian kernel: tick_sched_handle+0x22/0x60

---

This usually results in the machine becoming completely unresponsive and
performing an automated reboot after some time.

I've bisected the commits between 4.12 and 4.13, and it seems this is the
patch which introduced the bug:
https://github.com/xen-project/xen/commit/7c7b407e77724f37c4b448930777a59a479feb21
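
For reference, the bisection was done roughly along these lines (just a
sketch; the exact tags and build/boot steps may have differed):

---

git clone https://github.com/xen-project/xen.git && cd xen
git bisect start
git bisect bad  RELEASE-4.13.0     # freezes reproduce here
git bisect good RELEASE-4.12.0     # known good baseline
# for each revision git checks out: build, install, reboot into it,
# run the DRAKVUF workload, then mark the result:
#   git bisect good    # Dom0 stays responsive
#   git bisect bad     # the RCU stall / freeze reproduces

---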

Enclosed you can find the `xl dmesg` log (attachment: dmesg.txt) from a fresh
boot of the machine on which the bug was reproduced.

I'm also attaching the `xl info` output from this machine:

---

release : 4.19.0-6-amd64
version : #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11)
machine : x86_64
nr_cpus : 14
max_cpu_id : 223
nr_nodes : 1
cores_per_socket : 14
threads_per_core : 1
cpu_mhz : 2593.930
hw_caps :
bfebfbff:77fef3ff:2c100800:00000121:0000000f:d19ffffb:00000008:00000100
virt_caps : pv hvm hvm_directio pv_directio hap shadow
total_memory : 130541
free_memory : 63591
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 13
xen_extra : -unstable
xen_version : 4.13-unstable
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p
hvm-3.0-x86_64
xen_scheduler : credit2
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Wed Oct 2 09:27:27 2019 +0200 git:7c7b407e77-dirty

What was your original Xen base? This output was clearly obtained at the
end of the bisect process.

There have been quite a few corrections since the release of Xen 4.13, so
please make sure you are running the most recent version (4.13.1).
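
You can double-check what is actually booted with something like:

---

xl info | grep -E 'xen_version|xen_extra|xen_changeset'

---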


Juergen

Sure, we have tested both RELEASE 4.13 and RELEASE 4.13.1. Unfortunately these 
corrections didn't help and the bug is still reproducible.

From our testing it turns out that:

Known working revision: 997d6248a9ae932d0dbaac8d8755c2b15fec25dc
Broken revision: 6278553325a9f76d37811923221b21db3882e017
First bad commit: 7c7b407e77724f37c4b448930777a59a479feb21

Would it be possible to test xen-unstable, too?

I could imagine e.g. commits b492c65da5ec5ed or 99266e31832fb4a4 having an
impact here.
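
Something along these lines should work for a quick test (just a sketch;
whether the cherry-picks apply cleanly is not guaranteed):

---

# test current xen-unstable
git clone https://github.com/xen-project/xen.git && cd xen
git checkout staging               # or master
make -j$(nproc) xen && make install-xen

# alternatively, try just the two suspect fixes on top of 4.13.1
git checkout RELEASE-4.13.1
git cherry-pick b492c65da5ec5ed 99266e31832fb4a4

---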


Juergen



 

