Xen project Mailing List

On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@xxxxxxxx> wrote:

On 26.07.20 17:47, moftah moftah wrote:
> Hi All,
> We have a problem that has been ongoing for more than a month.
>
> We have several servers running xcp-ng, and we are facing kernel oopses
> that crash the server.
>
> My skills are not enough to debug the issue, so I need someone to point me
> in the right direction.
> The issue is not hardware related:
> it occurred on servers with different processors, NICs, and even
> kernel versions (all under 4.19).
>
> The stack trace looks like this:
>
> [2399526.430672] ALERT: BUG: unable to handle kernel NULL pointer
> dereference at 0000000000000004
> [2399526.430695] INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
> [2399526.430710] WARN: Oops: 0000 [#1] SMP NOPTI
> [2399526.430720] WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted
> 4.19.108 #1
> [2399526.430728] WARN: Hardware name: HP ProLiant SL230s Gen8 /,
> BIOS P75 05/24/2019
> [2399526.430745] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
> [2399526.430753] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b
> 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88
> d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01
> 0a f0 ff
> [2399526.430773] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
> [2399526.430780] WARN: RAX: ffff88842087b900 RBX: 0000000000000001
> RCX: 0000000000000000
> [2399526.430789] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001
> RDI: ffff8883de0b9c00
> [2399526.430801] WARN: RBP: 0000000000000000 R08: 0000000000000000
> R09: 0000000000000020
> [2399526.430811] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40
> R12: 0000000000000001
> [2399526.430823] WARN: R13: ffff8883db210a00 R14: 0000000000000002
> R15: ffff8883de0b9c00
> [2399526.430852] WARN: FS: 00007ffac43fe700(0000)
> GS:ffff888451240000(0000) knlGS:0000000000000000
> [2399526.430868] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2399526.430879] WARN: CR2: 0000000000000004 CR3: 000000044ad58000
> CR4: 0000000000040660
> [2399526.430899] WARN: Call Trace:
> [2399526.430914] WARN: __qdisc_run+0xa2/0x4f0
> [2399526.430928] WARN: ? __switch_to_asm+0x41/0x70
> [2399526.430940] WARN: net_tx_action+0x148/0x230
> [2399526.430949] WARN: __do_softirq+0xd1/0x28c
> [2399526.430966] WARN: run_ksoftirqd+0x26/0x40
> [2399526.430980] WARN: smpboot_thread_fn+0x10e/0x160
> [2399526.430993] WARN: kthread+0xf8/0x130
> [2399526.431004] WARN: ? sort_range+0x20/0x20
> [2399526.431010] WARN: ? kthread_bind+0x10/0x10
> [2399526.431017] WARN: ret_from_fork+0x35/0x40

I wonder whether you are missing all fixes for commit 021a17ed796b,
which went into kernel 4.18. It needs the following fixes on top:

d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).
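
As a rough first check, the kernel version printed in the oops can be compared against the mainline releases named above. The sketch below is only that: the function names are made up for illustration, and it deliberately ignores stable backports, so a fix it reports as absent may still have been backported into a 4.19.y release. The authoritative check is the changelog or git history of the kernel actually running.

```python
import re

# Mainline releases named in the message for each follow-up fix.
# Assumption: only mainline versions are compared; stable backports are ignored.
FIX_RELEASES = {
    "d518d2ed8640": (5, 4),
    "90b2be27bb0e": (5, 5),
}


def kernel_version(text):
    """Return (major, minor, patch) for the first X.Y.Z version found in text."""
    match = re.search(r"\b(\d+)\.(\d+)\.(\d+)\b", text)
    if match is None:
        raise ValueError("no X.Y.Z kernel version found")
    return tuple(int(part) for part in match.groups())


def fixes_not_in_mainline_base(version):
    """List fixes whose mainline release is newer than the kernel's X.Y base."""
    base = version[:2]
    return [commit for commit, release in FIX_RELEASES.items() if release > base]


# The "Not tainted" line from the quoted oops, joined back onto one line.
oops_line = "WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.19.108 #1"
version = kernel_version(oops_line)
print(version, fixes_not_in_mainline_base(version))
```

For the 4.19.108 kernel in the trace this lists both commits, which only says that neither was part of the 4.19 mainline base.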

From the backtrace I really doubt this is a Xen problem, BTW. Maybe
running under Xen makes the problem more likely due to different
timing.
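
The faulting access itself can be cross-checked by hand from the "Code:" line of the quoted oops, where the byte in angle brackets marks the start of the faulting instruction. The sketch below is not a disassembler. It assumes the one instruction form seen here (an optional 0x66 prefix, a single opcode byte, and a ModRM byte with an 8-bit displacement), and its helper names are hypothetical.

```python
# "Code:" bytes copied from the quoted oops; <66> marks the faulting instruction.
CODE = (
    "50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 "
    "99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 "
    "79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff"
)

# Register numbering used by the ModRM r/m field when no REX prefix is present.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code):
    """Return the bytes from the <..> marker to the end of the dump."""
    tokens = code.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_disp8_operand(insn):
    """Decode base register and disp8 for the single form this sketch covers."""
    i = 1 if insn[0] == 0x66 else 0      # skip the operand-size prefix
    modrm = insn[i + 1]                  # byte following the one-byte opcode
    mod, rm = modrm >> 6, modrm & 7
    if mod != 1 or rm == 4:              # not [reg + disp8], or needs a SIB byte
        raise ValueError("instruction form not covered by this sketch")
    disp = insn[i + 2]
    if disp >= 0x80:                     # disp8 is a signed value
        disp -= 0x100
    return REGS[rm], disp


base, disp = decode_disp8_operand(faulting_bytes(CODE))
registers = {"rcx": 0x0}                 # RCX value from the register dump
fault_address = registers[base] + disp
print(base, disp, hex(fault_address))
```

The memory operand decodes to offset 4 from RCX, and RCX is 0 in the register dump, which agrees with the reported NULL pointer dereference at 0000000000000004 and with CR2.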

Juergen

Re: repeated Kernel oops need help to debug