[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c


  • To: <jbeulich@xxxxxxxx>
  • From: <Peter.Kurfer@xxxxxxxx>
  • Date: Wed, 29 Jan 2020 13:52:54 +0000
  • Accept-language: de-DE, en-US
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 29 Jan 2020 13:53:00 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHVz4sLNgmZNGFJEEmdW8QDt3MmIqgBW+yF///49wCAAB82fQ==
  • Thread-topic: Host freezing after "fixing" recursive fault starting in multicalls.c

> Right, but the bad news is that there are no helpful hypervisor
> messages at all. Sadly this is partly my fault, because I should
> have asked you to do this log collection with a debug hypervisor.
> Most of the possibly interesting messages would appear only there.

> In any event, problems start quite a bit earlier, and typically
> it's the first instance of a problem that is the most helpful to
> analyze, as later ones may be cascade issues. The first sign of
> problems is an overlapping

To be honest, I was already wondering why there were so few logs. While I had 
found the CMDLINE_XEN options for debug logging, I haven't found any 
documentation on how to build a debug hypervisor so far, and it took me some 
time to work around the fact that I don't have physical access to the server 
to attach an actual serial cable, and so on.

I will try to compile Xen with debug enabled and collect more logs afterwards.
Anything to be aware of?
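[For reference, a sketch of one way to do such a debug build, assuming a 
Kconfig-based Xen source tree (4.8 or later); the exact targets and paths may 
differ on a given tree:]

```shell
# Hedged sketch, not verified against any particular tree: on a
# Kconfig-based Xen, the debug build is controlled by CONFIG_DEBUG
# in the hypervisor's Kconfig.
make -C xen defconfig             # start from the default config
echo "CONFIG_DEBUG=y" >> xen/.config
make -C xen olddefconfig          # resolve any dependent options
make -j"$(nproc)" xen             # build only the hypervisor
# Install the resulting xen/xen.gz in /boot; the hypervisor boot
# banner should then report a debug=y build.
```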


From: Jan Beulich <jbeulich@xxxxxxxx>
Sent: Wednesday, 29 January 2020 09:59
To: Kurfer, Peter
Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
Subject: Re: Host freezing after "fixing" recursive fault starting in 
multicalls.c
    
On 29.01.2020 09:29, Peter.Kurfer@xxxxxxxx wrote:
> As requested I configured one host with:
> 
>> loglvl=all guest_loglvl=all
> 
> and collected one day of logs via serial interface:
> 
>  
> https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing
> 
> Searching for "error" or "multicalls.c" leads to some stack traces that 
> might be interesting.

Right, but the bad news is that there are no helpful hypervisor
messages at all. Sadly this is partly my fault, because I should
have asked you to do this log collection with a debug hypervisor.
Most of the possibly interesting messages would appear only there.

In any event, problems start quite a bit earlier, and typically
it's the first instance of a problem that is the most helpful to
analyze, as later ones may be cascade issues. The first sign of
problems is an overlapping

[14991.827762] BUG: unable to handle page fault for address: ffff888ae2eb6bd8

and

[14991.828172] WARNING: CPU: 5 PID: 2585 at arch/x86/xen/multicalls.c:102 
xen_mc_flush+0x194/0x1c0

on CPUs 8 and 5.

> As far as I know the ACPI errors in the context of IPMI can be ignored.

It looks that way, yes, at least for the purposes here. What I wouldn't
rule out as a possible cause of problems is the significant amount
of temperature-related messages. What I also find at least curious
(but possibly just because I know too little of the respective
aspects of modern kernels) are the recurring __text_poke() instances
on the stack traces. Assuming these are to be expected in the first
place, there might be a race here which is either Xen-specific or
simply has a much better chance of hitting (larger window?) when
running on Xen. But I'm afraid this will need looking into (or at
least commenting on) by a kernel person.

Jan
    
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

