[Xen-users] Debugging sudden hangs
Hi list,

We recently updated our system and started experiencing random hangs. They happen, on average, once every 1.5 days (sometimes taking 2 days to occur, other times happening multiple times a day), roughly in proportion to IO load.

Before troubling the developers too much, I'd like to collect more information. The problem is that the hangs occur without any symptoms, crashes or panics.

I've booted xen with "loglvl=all guest_loglvl=all" and dom0 with "loglevel=10 debug initcall_debug".

When the hang occurs, all domUs and dom0 just stop responding to key presses and networking, and there is no IO activity. Nothing gets generated in the console or logs, and there are no earlier symptoms or out-of-the-ordinary log entries either. Even hitting ctrl+a multiple times in the console does nothing (indicating xen is dead too). On the video console, we just have a blinking cursor after the last console log (though my understanding is that the cursor blink might be generated by the video card, so it does not indicate that anything is still running).

If the hardware WDT is on, the watchdog eventually bites and reboots the system.

Although I believe it isn't related (since dom0 stalls too, and we're looking at a completely stalled system rather than just domUs having issues with disk IO), I added "gnttab_max_frames=256" to the xen boot arguments anyway. It didn't seem to change anything.

Then, grasping at straws, I turned off HWPM in the BIOS, which we had to do on another machine hosting VMware ESX. That didn't seem to change anything either.

At this point, I'd like to know the best way to approach this. Can I enable further levels of debugging so that I can at least begin to narrow down a culprit? Is there a good way to determine whether it may be the hardware?

I've tried running the same kernel without xen and simulating heavy IO on the disk array without issues, which inclines me towards xen being part of the equation.
But then again, doing random file read/writes isn't a good simulation of the type of workload the domUs put on the server.

OS: Debian Buster
Kernel: 4.17.0-1-amd64
Xen: 4.8.4-pre (Debian 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9)
CPU: Xeon E5-2699 v4
RAM: Samsung 96GB ECC Registered
MB: Supermicro X10SRi-F

In case it is relevant, since it might be IO related...
Net: Chelsio T520-CR (2 x XGB links, shared to domU using VF)
RAID: LSI SAS3224 with 10 SAS3 drives

Warm regards,
Liwei

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users