
Re: [Xen-users] hvm domU crash on linux 4.14.0 with xen 4.8.2





On Thu, Nov 16, 2017 at 7:59 AM, Juergen Gross <jgross@xxxxxxxx> wrote:
On 16/11/17 06:55, Tomas Mozes wrote:
> Hello,
> after upgrading from Linux 4.4.95 to 4.14.0 we've observed a hvm domU
> crash with Xen 4.8.2.
>
> [70418.163271] int3: 0000 [#1] SMP
> [70418.173003] CPU: 1 PID: 11265 Comm: app Not tainted 4.14.0-gentoo #1
> [70418.194535] Hardware name: Xen HVM domU, BIOS 4.8.2 11/14/2017
> [70418.205537] task: ffff9b2d943e8dc0 task.stack: ffffa314c1e80000
> [70418.217757] RIP: 0010:xen_hypercall_event_channel_op+0xb/0x20

Hmm, this looks very strange. RIP points into the hypercall page, but
outside the hypercall code. As the system seems to have been up for
quite a while, we can assume hypercalls had worked millions of times.

Anything you did in the domU right before that crash, e.g. applying a
live patch, doing kernel tracing, ...?

Hello Juergen,
a kernel update was done yesterday on the dom0 and on all 6 domUs (all to 4.14.0). Only one of the machines died, probably the one with the highest CPU/network/disk usage, but it happened during the night, when nobody was working on it. According to the monitoring, nothing special was going on: it was writing about 3 MB/s to disk, network bandwidth was about 4 Mbit/s, and the load was around 0.7 (on a 6-core system). Then it suddenly crashed.


> [70418.231311] RSP: 0018:ffff9b2de5843d00 EFLAGS: 00000002
> [70418.242846] RAX: 0000000000000000 RBX: 0000000000000003 RCX:
> 0000000000000000
> [70418.256365] RDX: 0000000000000000 RSI: ffff9b2de5843d0c RDI:
> 0000000000000004
> [70418.274236] RBP: ffff9b2de5843d10 R08: ffff9b2ddd95ac00 R09:
> ffff9b2dde403230
> [70418.285667] R10: 0000000000000000 R11: 0000000000000040 R12:
> ffff9b2de58db240
> [70418.298853] R13: 0000000000000001 R14: ffff9b2d943e8dc0 R15:
> ffff9b2de5854168
> [70418.311083] FS:  00007f5e8365e700(0000) GS:ffff9b2de5840000(0000)
> knlGS:0000000000000000
> [70418.323074] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [70418.332810] CR2: 00000000023de0e0 CR3: 000000010cad6005 CR4:
> 00000000001606e0
> [70418.435201] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [70418.447396] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [70418.492649] Call Trace:
> [70418.498815]  <IRQ>
> [70418.503783]  ? notify_remote_via_irq+0x24/0x40
> [70418.512792]  xen_send_IPI_one+0x2d/0x70

Doing an event_channel_op from here seems okay, so RIP is somehow
related to the backtrace. This makes a random stack corruption a bit
less likely.
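
When checking a trace like this, it can help to separate the frames the unwinder considers reliable from the ones prefixed with "?", which are addresses found on the stack but not confirmed as part of the call chain. A rough sketch over a few lines copied from the quoted trace follows; the function name parse_frames and the regular expression are only illustrative.

```python
import re

# A few frames copied from the quoted call trace.
TRACE = """\
[70418.503783]  ? notify_remote_via_irq+0x24/0x40
[70418.512792]  xen_send_IPI_one+0x2d/0x70
[70418.521370]  xen_smp_send_reschedule+0xb/0x10
[70418.530176]  trigger_load_balance+0x11e/0x220
[70418.537916]  scheduler_tick+0xa1/0xd0
[70418.545604]  ? tick_sched_do_timer+0x40/0x40
"""

# "[timestamp]  [? ]symbol+0xOFFSET/0xSIZE"
FRAME = re.compile(r"\]\s+(\? )?([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frames(text: str) -> list:
    """Hypothetical helper: split trace lines into structured frames."""
    frames = []
    for line in text.splitlines():
        m = FRAME.search(line)
        if m:
            frames.append({
                "reliable": m.group(1) is None,  # no "?" prefix
                "symbol": m.group(2),
                "offset": int(m.group(3), 16),
                "size": int(m.group(4), 16),
            })
    return frames


if __name__ == "__main__":
    for f in parse_frames(TRACE):
        mark = " " if f["reliable"] else "?"
        print(f"{mark} {f['symbol']}+{f['offset']:#x}/{f['size']:#x}")
```

On the sample above, the reliable chain is scheduler_tick, trigger_load_balance, xen_smp_send_reschedule, xen_send_IPI_one, which is the path that leads to the event channel hypercall.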


Juergen

> [70418.521370]  xen_smp_send_reschedule+0xb/0x10
> [70418.530176]  trigger_load_balance+0x11e/0x220
> [70418.537916]  scheduler_tick+0xa1/0xd0
> [70418.545604]  ? tick_sched_do_timer+0x40/0x40
> [70418.557234]  update_process_times+0x42/0x50
> [70418.565434]  tick_sched_handle+0x34/0x50
> [70418.573146]  tick_sched_timer+0x34/0x80
> [70418.609000]  __hrtimer_run_queues+0xd1/0x200
> [70418.617187]  hrtimer_interrupt+0xac/0x200
> [70418.624093]  xen_timer_interrupt+0x1d/0x30
> [70418.631425]  __handle_irq_event_percpu+0x79/0x190
> [70418.669817]  handle_irq_event_percpu+0x1e/0x50
> [70418.678067]  handle_percpu_irq+0x35/0x50
> [70418.684258]  generic_handle_irq+0x1d/0x30
> [70418.688950]  __evtchn_fifo_handle_events+0x13c/0x150
> [70418.696224]  evtchn_fifo_handle_events+0xb/0x10
> [70418.702332]  __xen_evtchn_do_upcall+0x3e/0x70
> [70418.708984]  xen_evtchn_do_upcall+0x26/0x40
> [70418.714106]  xen_hvm_callback_vector+0x93/0xa0
> [70418.718635]  </IRQ>
> [70418.721253] RIP: 0010:memcpy_erms+0x6/0x10
> [70418.725842] RSP: 0018:ffffa314c1e83ac8 EFLAGS: 00000286 ORIG_RAX:
> ffffffffffffff0c
> [70418.734146] RAX: ffff9b2cff560010 RBX: ffff9b2d656b9180 RCX:
> 0000000000018028
> [70418.742037] RDX: 0000000000019018 RSI: ffff9b2ddda72fdc RDI:
> ffff9b2cff561000
> [70418.746973] RBP: ffffa314c1e83af0 R08: 000000000001d1e0 R09:
> 00000000000c2260
> [70418.752124] R10: 0000000000000000 R11: 0000000000000002 R12:
> ffff9b2ddbbae000
> [70418.757905] R13: ffffa314c1e83d10 R14: 0000000000019018 R15:
> fffffffffffffffe
> [70418.763805]  ? __d_alloc+0x68/0x1d0
> [70418.765764]  d_alloc+0x15/0x80
> [70418.767645]  d_alloc_parallel+0x2b/0x470
> [70418.770036]  ? __follow_mount_rcu.isra.31+0x3f/0x100
> [70418.773607]  ? legitimize_links+0x41/0xb0
> [70418.776073]  ? legitimize_path.isra.32+0x29/0x60
> [70418.779101]  lookup_slow+0x6e/0x130
> [70418.781000]  walk_component+0x1c0/0x310
> [70418.783690]  ? trailing_symlink+0x15d/0x240
> [70418.786197]  path_lookupat+0x62/0x200
> [70418.788575]  filename_lookup+0xa4/0x160
> [70418.790906]  ? current_time+0x33/0x70
> [70418.792964]  ? getname_flags+0x6d/0x1f0
> [70418.795520]  user_path_at_empty+0x31/0x40
> [70418.798264]  ? user_path_at_empty+0x31/0x40
> [70418.801346]  vfs_statx+0x63/0xb0
> [70418.803347]  SYSC_newstat+0x2e/0x50
> [70418.805599]  ? __do_page_fault+0x251/0x420
> [70418.808063]  ? do_page_fault+0x2e/0xe0
> [70418.810253]  SyS_newstat+0x9/0x10
> [70418.811886]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [70418.814436] RIP: 0033:0x7f5e82967549
> [70418.816711] RSP: 002b:00007ffdb89d9700 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000004
> [70418.820899] RAX: ffffffffffffffda RBX: 00000000023de0d0 RCX:
> 00007f5e82967549
> [70418.824408] RDX: 00007ffdb89d9740 RSI: 00007ffdb89d9740 RDI:
> 00007ffdb89d97d0
> [70418.827934] RBP: 00007f5e82c2aae0 R08: 00007f5e82c27100 R09:
> 0000000000000024
> [70418.832181] R10: 0000000000000073 R11: 0000000000000246 R12:
> 0000000000000000
> [70418.836093] R13: 0000000000000000 R14: 0000000000000003 R15:
> 0000000000000000
> [70418.840216] Code: b8 1f 00 00 00 0f 01 c1 c3 cc cc cc cc cc cc cc cc
> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 20 00 00 00 0f 01 c1 c3
> cc cc <cc> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> [70418.855259] RIP: xen_hypercall_event_channel_op+0xb/0x20 RSP:
> ffff9b2de5843d00
> [70418.859524] ---[ end trace 9604f5ac6b837408 ]---
> [70418.863837] Kernel panic - not syncing: Fatal exception in interrupt
> [70418.872681] Kernel Offset: 0x2c600000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
>
> _______________________________________________
> Xen-users mailing list
> Xen-users@xxxxxxxxxxxxx
> https://lists.xen.org/xen-users
>

