
Re: [Xen-users] Debugging sudden hangs





On Sun, Aug 19, 2018, 8:51 AM Liwei <xieliwei@xxxxxxxxx> wrote:
Hi list,
    We recently updated our system and started experiencing random
hangs. It happens, on average, once every 1.5 days (sometimes taking 2
days to occur, other times happening multiple times a day, somewhat
proportional to IO load).

    Before troubling the developers too much, I'd like to collect more
information. The problem is that the hangs occur without any
symptoms, crashes or panics. I've booted xen and dom0 with
"loglvl=all guest_loglvl=all" and "loglevel=10 debug initcall_debug"
respectively.

    When the hang occurs, all domUs and dom0 just stop responding to
key presses and networking, and there is no IO activity. Nothing
unusual appears in the console or logs beforehand. Even hitting
ctrl+a multiple times in the console does nothing (indicating xen is
dead too). On the video console, we just
have a blinking cursor after the last console log (though my
understanding is that the cursor blink might be generated by the video
card rather than any indication that at least something is still
running). If the hardware WDT is on, the watchdog eventually bites and
reboots the system.
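
    Since nothing is logged before a hang, one way to at least pin down
when each hang happens, and whether it follows a burst of IO, is a small
heartbeat logger running in dom0. The following is only a minimal sketch
under stated assumptions: the function names and the log path are
hypothetical, it reads the standard Linux /proc/diskstats counters, and it
assumes the log file lives on storage that keeps working up to the hang. If
the log sits on the array that stalls, the fsync call may itself block, in
which case the last line written still marks the last moment the system was
known to be alive.

```python
import os
import time


def read_disk_sectors(path="/proc/diskstats"):
    """Return {device: (sectors_read, sectors_written)} from a diskstats file."""
    stats = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 10:
                continue  # skip malformed or truncated lines
            # Columns: major minor name reads merged sectors_read ms_reading
            #          writes merged sectors_written ...
            stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def format_heartbeat(timestamp, prev, cur):
    """Build one log line with per-device sector deltas since the last beat."""
    parts = []
    for dev in sorted(cur):
        if dev not in prev:
            continue
        d_read = cur[dev][0] - prev[dev][0]
        d_written = cur[dev][1] - prev[dev][1]
        if d_read or d_written:
            parts.append(f"{dev}:r{d_read}/w{d_written}")
    return f"{timestamp:.3f} " + (" ".join(parts) or "idle")


def run(log_path, interval=1.0, max_beats=None, stats_path="/proc/diskstats"):
    """Append a heartbeat every `interval` seconds, forcing each one to disk."""
    prev = read_disk_sectors(stats_path)
    beats = 0
    with open(log_path, "a") as log:
        while max_beats is None or beats < max_beats:
            time.sleep(interval)
            cur = read_disk_sectors(stats_path)
            log.write(format_heartbeat(time.time(), prev, cur) + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the kernel to push the line to disk
            prev = cur
            beats += 1
```

    Calling run("/var/log/heartbeat.log") (a hypothetical path) would leave
one line per second. After the watchdog reboots the machine, the final lines
show the time of the last heartbeat and the sector deltas just before the
hang, which can be compared across several hangs.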

    Although I believe it isn't related (since dom0 stalls too, and
we're looking at a completely stalled system rather than just domUs
having issues with disk IO), I added "gnttab_max_frames=256" to the
xen boot arguments anyway. Didn't seem to change anything.

    Then, grasping at straws, I turned off HWPM in the BIOS, which we
had to do on another machine hosting VMware ESX. That didn't seem to
change anything either.

    At this point, I'd like to know the best way to approach this.
Can I enable further levels of debugging so that I can at least
begin to narrow down a culprit? Is there a good way to
determine whether it may be the hardware?

    I've tried running the same kernel without xen and simulating
heavy IO on the disk array without issues, which makes me lean towards
xen being part of the equation. But then again, doing random file
reads/writes isn't a good simulation of the type of workload the domUs
put on the server.

    OS: Debian Buster
    Kernel: 4.17.0-1-amd64
    Xen: 4.8.4-pre (Debian 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9)
    CPU: Xeon E5-2699 v4
    RAM: Samsung 96GB ECC Registered
    MB: Supermicro X10SRi-F

    In case it is relevant, since it might be IO related...
    Net: Chelsio T520-CR (2 x XGB links, shared to domU using VF)
    RAID: LSI SAS3224 with 10 SAS3 drives

Warm regards,
Liwei

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users

In my experience, as a non-Xen user on a nearly identical motherboard (X10SRA), I would suspect the motherboard.

I've purchased 4 of these boards and run various Windows and Linux kernels.  They all have different CPUs (some Retail, some Engineering Samples), different ECC RAM and different storage setups (some using onboard SATA, some using LSI cards, etc).

They all, every single one of them, experience random hard lockups just like you describe: the machine becomes completely unresponsive, the screen freezes, etc.

I don't run Xen on any of them.  I've swapped all sorts of hardware, tried several beta BIOS versions from support, RMA'd 3 of them...  They all continued to lock up.

This went on for about two years until I had enough.  I swapped all boards out for the X10DLA, using the exact same components, and I have had zero issues since.

Again, this is just one user's experience - and I just happened to be on the Xen mailing list and saw this.