Hi all,
We have had an ongoing problem for more than a month.
We have several servers running xcp-ng, and we are hitting kernel oopses that crash the server.
My skills are not enough to debug the issue, so I need someone to point me in the right direction.
The issue is not hardware related:
it occurred on servers with different processors, NICs and even kernel versions (all under 4.19).
The stack trace looks like this:
[2399526.430672] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[2399526.430695] INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
[2399526.430710] WARN: Oops: 0000 [#1] SMP NOPTI
[2399526.430720] WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.19.108 #1
[2399526.430728] WARN: Hardware name: HP ProLiant SL230s Gen8 /, BIOS P75 05/24/2019
[2399526.430745] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.430753] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.430773] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.430780] WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX: 0000000000000000
[2399526.430789] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI: ffff8883de0b9c00
[2399526.430801] WARN: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[2399526.430811] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12: 0000000000000001
[2399526.430823] WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15: ffff8883de0b9c00
[2399526.430852] WARN: FS: 00007ffac43fe700(0000) GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.430868] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.430879] WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4: 0000000000040660
[2399526.430899] WARN: Call Trace:
[2399526.430914] WARN: __qdisc_run+0xa2/0x4f0
[2399526.430928] WARN: ? __switch_to_asm+0x41/0x70
[2399526.430940] WARN: net_tx_action+0x148/0x230
[2399526.430949] WARN: __do_softirq+0xd1/0x28c
[2399526.430966] WARN: run_ksoftirqd+0x26/0x40
[2399526.430980] WARN: smpboot_thread_fn+0x10e/0x160
[2399526.430993] WARN: kthread+0xf8/0x130
[2399526.431004] WARN: ? sort_range+0x20/0x20
[2399526.431010] WARN: ? kthread_bind+0x10/0x10
[2399526.431017] WARN: ret_from_fork+0x35/0x40
[2399526.431027] WARN: Modules linked in: act_police cls_basic sch_ingress sch_tbf tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 dm_multipath xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter sunrpc hid_generic sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_rapl_perf psmouse lpc_ich usbhid hid sg hpilo ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables raid1 md_mod sd_mod serio_raw uhci_hcd ahci libahci igb libata ehci_pci ehci_hcd bnx2x mdio libcrc32c mpt3sas
[2399526.431154] WARN: raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[2399526.431177] WARN: CR2: 0000000000000004
[2399526.431189] WARN: ---[ end trace 32a268c3653eb10c ]---
[2399526.431201] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.431212] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.431238] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.431247] WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX: 0000000000000000
[2399526.431260] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI: ffff8883de0b9c00
[2399526.431270] WARN: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[2399526.431280] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12: 0000000000000001
[2399526.431289] WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15: ffff8883de0b9c00
[2399526.431307] WARN: FS: 00007ffac43fe700(0000) GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.431319] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.431331] WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4: 0000000000040660
[2399526.431355] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
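As a cross-check on the oops above, the faulting instruction can be decoded by hand. The byte marked `<66>` in the `Code:` line starts the sequence `66 83 79 04 00`, which decodes as `cmpw $0x0,0x4(%rcx)`. RCX is 0 in the register dump, so the access lands at address 4, which matches CR2. The two instructions just before it (`8b 88 cc 00 00 00` and `48 03 88 d0 00 00 00`) build RCX by adding the fields at offsets 0xcc and 0xd0 of the object RAX points to, so both of those fields were apparently zero. A minimal Python sketch of that check follows; the helper names are made up for illustration, and the decoder handles only this one instruction form, not x86 in general.

```python
# Copied verbatim from the "Code:" line of the oops above.
CODE = (
    "50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 99 7e "
    "8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 79 04 00 "
    "74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff"
)
# Register values copied from the same oops (only the ones this sketch needs).
REGS = {"rax": 0xFFFF88842087B900, "rcx": 0x0}
CR2 = 0x4

# ModRM r/m field to register name; valid only when no REX.B prefix is present.
RM_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes starting at the one the kernel marked with <..>."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def decode_cmpw_disp8(insn):
    """Decode only `66 83 /7 ib` with an 8-bit displacement and no SIB byte."""
    if insn[0] != 0x66 or insn[1] != 0x83:
        raise ValueError("not a 16-bit group-1 instruction with imm8")
    modrm = insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or reg != 7 or rm == 4:
        raise ValueError("form not handled by this sketch")
    disp = int.from_bytes(insn[3:4], "little", signed=True)
    imm = insn[4]  # raw immediate byte (0 in this oops)
    return RM_NAMES[rm], disp, imm


base, disp, imm = decode_cmpw_disp8(faulting_bytes(CODE))
fault_addr = REGS[base] + disp
print(f"cmpw ${imm:#x},{disp:#x}(%{base}) -> address {fault_addr:#x}, CR2 {CR2:#x}")
```

If the offsets 0xcc and 0xd0 correspond to `end` and `head` of `struct sk_buff` in this build (an assumption that would need checking against the debug info), this would be consistent with the dequeued skb having already been freed or overwritten before its GSO fields were read.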
The xen crash analyzer generates many other files as well:
dmesg.kexec.log
dom0.log (for each dom)
dom0.structures.log (for each dom)
....
lspci-tv.out
lspci-vv.out
lspci-vvxxxx.out
readelf-Wl.out
readelf-Wn.out
time-v.out
xen.log
xen.pcpu0.stack.log (for each pcpu)
...
xen-crashdump-analyser.log
The call trace in the xen.log file looks like this:
Call Trace:
[ffffffff810014aa] xen_hypercall_kexec_op+0xa/0x20
ffffffff81071f85 panic+0x111/0x27c
ffffffff81027a7f oops_end+0xcf/0xd0
ffffffff8105da63 no_context+0x1b3/0x3c0
ffffffff816c0223 inet_gro_receive+0x213/0x2b0
ffffffff8105e32a __do_page_fault+0xaa/0x4f0
ffffffff8162cd44 netif_receive_skb_internal+0x34/0xe0
ffffffff81800f6e page_fault+0x1e/0x30
ffffffff81663ac9 pfifo_fast_dequeue+0xc9/0x140
ffffffff81663f38 __qdisc_run+0xa8/0x4e0
ffffffff816290c8 net_tx_action+0x148/0x220
ffffffff81a000d1 __softirqentry_text_start+0xd1/0x28c
ffffffff81077ff6 run_ksoftirqd+0x26/0x40
ffffffff8109763e smpboot_thread_fn+0x10e/0x160
ffffffff81093b68 kthread+0xf8/0x130
ffffffff81097530 smpboot_thread_fn+0/0x160
ffffffff81093a70 kthread+0/0x130
ffffffff81800215 ret_from_fork+0x35/0x40
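Each frame in this trace has the form `address symbol+offset/size`, so the start address of a function is simply the frame address minus the offset. A small sketch that parses a few of the frames above and computes the function start addresses, which can help when looking the functions up in a disassembly of vmlinux. The names `FRAME_RE` and `parse_frames` are made up for this example.

```python
import re

# A few frames copied from the xen.log call trace above.
TRACE = """\
[ffffffff810014aa] xen_hypercall_kexec_op+0xa/0x20
ffffffff81800f6e page_fault+0x1e/0x30
ffffffff81663ac9 pfifo_fast_dequeue+0xc9/0x140
ffffffff81663f38 __qdisc_run+0xa8/0x4e0
ffffffff81097530 smpboot_thread_fn+0/0x160
"""

# address (optionally in brackets), symbol, offset into symbol, symbol size
FRAME_RE = re.compile(
    r"^\[?([0-9a-f]{16})\]?\s+([\w.]+)\+(0x[0-9a-f]+|0)/(0x[0-9a-f]+)$"
)


def parse_frames(text):
    """Turn trace lines into dicts; lines that do not match are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        addr, symbol, offset, size = match.groups()
        addr, offset, size = int(addr, 16), int(offset, 16), int(size, 16)
        frames.append(
            {
                "symbol": symbol,
                "addr": addr,
                "offset": offset,
                "size": size,
                "start": addr - offset,  # first byte of the function
            }
        )
    return frames


frames = parse_frames(TRACE)
for frame in frames:
    print(f"{frame['symbol']:24s} start={frame['start']:#x} +{frame['offset']:#x}")
```

With the matching debug vmlinux loaded in gdb, a symbol and offset pair from this output can be passed to `list *(pfifo_fast_dequeue+0xc9)` to see the source line for that frame.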
I used a tool to trace the source code location where the issue occurs:
./decode_stacktrace.sh /usr/lib/debug/lib/modules/4.19.108/vmlinux /usr/lib/debug/lib/modules/4.19.108/ < ./trace2 > out3
This is the output:
[ffffffff810014aa] xen_hypercall_kexec_op (arch/x86/kernel/.tmp_head_64.o:?)
ffffffff81071f85 panic (/usr/src/debug/kernel-4.19.19/kernel/panic.c:209)
ffffffff81027a7f oops_end (/usr/src/debug/kernel-4.19.19/arch/x86/kernel/dumpstack.c:352)
ffffffff8105da63 no_context (/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:808)
ffffffff816c0223 inet_gro_receive (/usr/src/debug/kernel-4.19.19/include/linux/skbuff.h:2350 /usr/src/debug/kernel-4.19.19/net/ipv4/af_inet.c:1495)
ffffffff8105e32a __do_page_fault (/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:1435)
ffffffff8162cd44 netif_receive_skb_internal (/usr/src/debug/kernel-4.19.19/net/core/dev.c:5152)
ffffffff81800f6e page_fault (/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:1204)
ffffffff81663ac9 pfifo_fast_dequeue (/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:723 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:740 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:747 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:677)
ffffffff81663f38 __qdisc_run (/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:283 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:385 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:403)
ffffffff816290c8 net_tx_action (/usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:235 /usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:388 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:145 /usr/src/debug/kernel-4.19.19/include/net/pkt_sched.h:121 /usr/src/debug/kernel-4.19.19/net/core/dev.c:4595)
ffffffff81a000d1 __softirqentry_text_start (/usr/src/debug/kernel-4.19.19/kernel/softirq.c:292 /usr/src/debug/kernel-4.19.19/include/linux/jump_label.h:138 /usr/src/debug/kernel-4.19.19/include/trace/events/irq.h:142 /usr/src/debug/kernel-4.19.19/kernel/softirq.c:293)
ffffffff81077ff6 run_ksoftirqd (/usr/src/debug/kernel-4.19.19/arch/x86/include/asm/paravirt.h:799 /usr/src/debug/kernel-4.19.19/kernel/softirq.c:654)
ffffffff8109763e smpboot_thread_fn (/usr/src/debug/kernel-4.19.19/kernel/smpboot.c:164)
ffffffff81093b68 kthread (/usr/src/debug/kernel-4.19.19/kernel/kthread.c:246)
ffffffff81097530 smpboot_thread_fn+0/0x160
ffffffff81093a70 kthread+0/0x130
ffffffff81800215 ret_from_fork (/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:421)
Based on that, the issue occurred when calling
That is as far as I can get. I am not sure how to debug further to find the root cause and fix it.