On Tue, Nov 06, 2018 at 03:31:31PM +0530, Rishi wrote:
>
> So after knowing the stack trace, it appears that the CPU was getting stuck
> for xen_hypercall_xen_version
That hypercall is used when a PV kernel (re-)enables interrupts. See
xen_irq_enable. The purpose is to force the kernel to switch to the
hypervisor.
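For reference, this is roughly what that path looks like (a paraphrased sketch of the 4.19-era arch/x86/xen/irq.c, not a verbatim quote): re-enabling interrupts can end in a dummy hypercall whose only job is to trap into Xen so that a pending event channel upcall gets delivered.

    /* Sketch, paraphrased from arch/x86/xen/irq.c (4.19-era). */
    static void xen_force_evtchn_callback(void)
    {
            /* Any cheap hypercall will do: the point is simply to enter
             * the hypervisor, which then injects the pending event channel
             * upcall on the way back into the guest. */
            (void)HYPERVISOR_xen_version(0, NULL);
    }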
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
>
> [30569.582740] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
> [30569.588186] Kernel panic - not syncing: softlockup: hung tasks
> [30569.591307] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.19.1 #1
> [30569.595110] Hardware name: Xen HVM domU, BIOS 4.4.1-xs132257 12/12/2016
> [30569.598356] Call Trace:
> [30569.599597] <IRQ>
> [30569.600920] dump_stack+0x5a/0x73
> [30569.602998] panic+0xe8/0x249
> [30569.604806] watchdog_timer_fn+0x200/0x230
> [30569.607029] ? softlockup_fn+0x40/0x40
> [30569.609246] __hrtimer_run_queues+0x133/0x270
> [30569.611712] hrtimer_interrupt+0xfb/0x260
> [30569.613800] xen_timer_interrupt+0x1b/0x30
> [30569.616972] __handle_irq_event_percpu+0x69/0x1a0
> [30569.619831] handle_irq_event_percpu+0x30/0x70
> [30569.622382] handle_percpu_irq+0x34/0x50
> [30569.625048] generic_handle_irq+0x1e/0x30
> [30569.627216] __evtchn_fifo_handle_events+0x163/0x1a0
> [30569.629955] __xen_evtchn_do_upcall+0x41/0x70
> [30569.632612] xen_evtchn_do_upcall+0x27/0x50
> [30569.635136] xen_do_hypervisor_callback+0x29/0x40
> [30569.638181] RIP: e030:xen_hypercall_xen_version+0xa/0x20
What is the asm code for this RIP?
Wei.
The crash gets resolved by appending "noirqbalance" to the Xen command line. This way all dom0 CPUs stay available, but IRQs are not balanced by Xen.
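For anyone else trying this, one way to set it (a sketch assuming a grub2-based Xen host; on XenServer the boot configuration lives elsewhere, so adjust for your boot loader):

    # /etc/default/grub (variable name may differ per distro)
    GRUB_CMDLINE_XEN_DEFAULT="noirqbalance"
    # then regenerate grub.cfg, e.g. update-grub (Debian/Ubuntu) or
    # grub2-mkconfig -o /boot/grub2/grub.cfg (RHEL/SUSE-style), and reboot.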
Even though I'm running the irqbalance service in dom0, the IRQs do not seem to be moving. <- this is from the dom0 perspective; I do not know yet whether it follows Xen's IRQ placement.
I tried objdump; the function is there in the output, but there is no asm code for it, just "...":
ffffffff81001220 <xen_hypercall_xen_version>:
...
ffffffff81001240 <xen_hypercall_console_io>:
...
All "hypercalls" appear similarly.
How frequent can that hypercall / xen_irq_enable() be? Like n times per second, or only once in a while?
During my tests, the system runs stable unless I'm downloading a large file. Files around 1 GB in size download without a crash, but the crash comes when the file is larger than that. I'm using a 2.1 GB file and wget to download it.
Is there a way I can simulate the PV kernel (re-)enabling of interrupts from a kernel module, in a controlled fashion?
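Something like this is what I have in mind (an untested sketch; the module and parameter names are made up, and it only exercises the Xen path when the kernel is actually using the PV irq ops):

    /* irqtoggle.c -- repeatedly disable/re-enable local interrupts so that,
     * on a kernel using the Xen PV irq ops, each re-enable goes through
     * xen_irq_enable() and potentially xen_force_evtchn_callback().
     * Untested sketch; names are made up for illustration. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/delay.h>
    #include <linux/sched.h>

    static int iterations = 100000;
    module_param(iterations, int, 0444);
    MODULE_PARM_DESC(iterations, "number of irq disable/enable cycles");

    static int __init irqtoggle_init(void)
    {
            unsigned long flags;
            int i;

            for (i = 0; i < iterations; i++) {
                    local_irq_save(flags);
                    udelay(10);                /* small window for events to queue */
                    local_irq_restore(flags);  /* -> xen_irq_enable() on PV */
                    if (!(i % 1000))
                            cond_resched();
            }
            pr_info("irqtoggle: completed %d iterations\n", iterations);
            return 0;
    }

    static void __exit irqtoggle_exit(void)
    {
    }

    module_init(irqtoggle_init);
    module_exit(irqtoggle_exit);
    MODULE_LICENSE("GPL");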
If this is on the right track:
ffffffff8101ab70 <xen_force_evtchn_callback>:
ffffffff8101ab70: 31 ff xor %edi,%edi
ffffffff8101ab72: 31 f6 xor %esi,%esi
ffffffff8101ab74: e8 a7 66 fe ff callq ffffffff81001220 <xen_hypercall_xen_version>
ffffffff8101ab79: c3 retq
ffffffff8101ab7a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
It seems I'm hitting the following code from xen_irq_enable:

    barrier(); /* unmask then check (avoid races) */
    if (unlikely(vcpu->evtchn_upcall_pending))
            xen_force_evtchn_callback();
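For context, the surrounding function looks roughly like this (paraphrased from arch/x86/xen/irq.c, 4.19-era; a sketch, not a verbatim quote). It unmasks upcalls first and only afterwards checks whether anything became pending while they were masked:

    static void xen_irq_enable(void)
    {
            struct vcpu_info *vcpu = this_cpu_read(xen_vcpu);

            vcpu->evtchn_upcall_mask = 0;   /* logically re-enable interrupts */

            barrier(); /* unmask then check (avoid races) */
            if (unlikely(vcpu->evtchn_upcall_pending))
                    xen_force_evtchn_callback();

            /* (the real function also disables preemption around this) */
    }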
The code says unlikely, yet it is being called. And I got the following structure:
struct vcpu_info {
        /*
         * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
         * a pending notification for a particular VCPU. It is then cleared
         * by the guest OS /before/ checking for pending work, thus avoiding
         * a set-and-check race. Note that the mask is only accessed by Xen
         * on the CPU that is currently hosting the VCPU. This means that the
         * pending and mask flags can be updated by the guest without special
         * synchronisation (i.e., no need for the x86 LOCK prefix).
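(For completeness, the fields this comment describes are, in the 4.19-era include/xen/interface/xen.h, roughly:

        uint8_t evtchn_upcall_pending;
        uint8_t evtchn_upcall_mask;

so the flag checked in xen_irq_enable() is a plain per-vCPU byte that Xen sets whenever an event is pending.)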
Let me know if I'm spamming the thread with these intermediate updates.