Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate
On 09/12/2018 08:55 PM, Sarah Newman wrote:
> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>
>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3
>>> (Debian Stretch) and 4.15.11-1 (Debian Buster).
>>>
>>> [...]
>>
>> So... flash forward *whoosh*:
>>
>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>> dom0 as well as domU) if you want to use live migration, or maybe even
>> in general together with Xen.
>>
>> A few of the things I could cause to happen with recent Linux 4.9 in
>> dom0/domU:
>>
>> 1) blk-mq related Oops
>>
>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>> blk_mq_insert_requests). A related fix might be
>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>> later kernels.
>>
>> Apparently having this happen upsets the dom0 side of it, since any
>> subsequent domU that is live migrated to the same dom0, also using
>> blk-mq will immediately crash with the same Oops, after which it starts
>> raining general protection faults inside. But, at the same time, I can
>> still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
>> that dom0.
>
> Do you see any errors at all on the dom0?

Nope.

> You said you tested with both 4.9 and 4.15 kernels, does this depend only on
> a 4.9 kernel in the domU?

I don't know for sure (about 4.15 and if it has the mentioned patch or
not).

We (exploratory style) tested a few combinations of things some time
ago, when 4.15 was in stretch-backports. At the end of the day the
results were so unpredictable that we put doing testing in a more
structured way on the todo-list (6-dimensional matrix of possibilities
D: ).

What I did recently is again just randomly trying things for a few
hours, and then I started to see the pattern that whenever 4.9 was in
the mix anywhere, bad things happened.
Doing the reverse, eliminating 4.9 in dom0 as well as domU resulted in
not being able to reproduce anything bad any more. So, very pragmatic. :)

>> 2) Dom0 crash on live migration with multiple active nics
>>
>> I actually have to do more testing for specifically this, but at least
>> I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
>> tested a few months ago, Debian Jessie) by live migrating a domU that
>> has multiple network interfaces, actively routing traffic over them, to
>> it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
>> - not rebooting.' *BOOM* everything gone.
>
> Can you post a full backtrace? Did you ever test with anything other than 4.9
> kernel + 4.4 hypervisor?

Did not re-test yet. Ah, I found my notes. It's a bit different.

When just doing live migrate, it would upset the bnx2x driver or network
card itself and I would lose network connectivity to the machine (and
all other domUs). See attached bnx2x-crash.txt for console output while
the poor thing is drowning and gasping for air.

When disabling SR-IOV (which I do not use, but which was listed
somewhere as a workaround for a similar problem, related to HP Shared
Memory blah, so why not try it to see what happens) in the BIOS for the
10G card and then trying the same, the dom0 crashed immediately when the
live migrated domU was resumed. See dom0-crash.txt

No trace or anything, it just disappears.

> What does "actively routing traffic" mean in terms of packet frequency, and
> did you test when there was no network traffic but the interface was up?

A linux domU doing NAT with 1 external and 6 internal interfaces, having
a conntrack table with ~20k entries of active traffic flows. However,
not doing many pps and not using much bandwidth (between 0 and 100
Mbit/s).

Without any traffic it doesn't explode immediately. I think I could live
migrate the inactive router of a stateful (conntrackd) pair.

> A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network
> traffic did not duplicate this.

I'll get around to reproducing this (or not being able to with Xen 4.11+
Linux 4.17+ with maybe newer bnx2x).

Currently the network infra related domUs are still on Jessie (Xen 4.4
Linux 3.16 dom0) hardware, also because of this one:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044

And while speaking of that, we've not seen this happen again with 4.17+
in the dom0, and same openvswitch and Xen 4.11 version.

-- 
Hans van Kranenburg

Attachment: bnx2x-crash.txt
Attachment: dom0-crash.txt

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-users