Re: [Xen-users] Xen 4.10: domU crashes during/after live-migrate
On 09/12/2018 01:21 PM, Hans van Kranenburg wrote:
> On 09/12/2018 08:55 PM, Sarah Newman wrote:
>> On 09/04/2018 08:41 AM, Hans van Kranenburg wrote:
>>
>>>> We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3
>>>> (Debian Stretch) and 4.15.11-1 (Debian Buster).
>>>>
>>>> [...]
>>>
>>> So... flash forward *whoosh*:
>>>
>>> For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
>>> dom0 as well as domU) if you want to use live migration, or maybe even
>>> in general together with Xen.
>>>
>>> A few of the things I could cause to happen with recent Linux 4.9 in
>>> dom0/domU:
>>>
>>> 1) blk-mq related Oops
>>>
>>> Oops in the domU while resuming after live migrate (blkfront_resume ->
>>> blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
>>> blk_mq_insert_requests). A related fix might be
>>> https://patchwork.kernel.org/patch/9462771/ but that's only present in
>>> later kernels.
>>>
>>> Apparently having this happen upsets the dom0 side of it, since any
>>> subsequent domU that is live migrated to the same dom0, also using
>>> blk-mq, will immediately crash with the same Oops, after which it
>>> starts raining general protection faults inside. But, at the same
>>> time, I can still live migrate 3.16 kernels, but also 4.17 domU
>>> kernels, on and off that dom0.
>>
>> Do you see any errors at all on the dom0?
>
> Nope.

What is your storage stack?

>> You said you tested with both 4.9 and 4.15 kernels, does this depend
>> only on a 4.9 kernel in the domU?
>
> I don't know for sure (about 4.15 and whether it has the mentioned
> patch or not). We (exploratory style) tested a few combinations of
> things some time ago, when 4.15 was in stretch-backports. At the end
> of the day the results were so unpredictable that we put doing testing
> in a more structured way on the todo-list (6-dimensional matrix of
> possibilities D: ). What I did recently is again just randomly trying
> things for a few hours, and then I started to see the pattern that
> whenever 4.9 was in the mix anywhere, bad things happened. Doing the
> reverse, eliminating 4.9 in dom0 as well as domU, resulted in not
> being able to reproduce anything bad any more.
>
> So, very pragmatic. :)

So to rephrase: you don't know whether you saw failures with a 4.15
domU and a 4.9 dom0?

The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was
added in 4.10. I assume you think it should be added to 4.9? Why do you
think it is related?
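If anyone following along wants to check whether a given tree already
carries that commit, something like this should do it (just a sketch,
assuming a local clone of the mainline/stable git tree; adjust the tag
to whatever branch you actually run):

    $ cd linux
    $ git merge-base --is-ancestor \
          d1b1cea1e58477dad88ff769f54c0d2dfa56d923 v4.9 \
          && echo "fix present" || echo "fix missing"

For v4.9 this should report the fix as missing, which matches the "only
present in later kernels" observation above.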
>
>>> 2) Dom0 crash on live migration with multiple active nics
>>>
>>> I actually have to do more testing for specifically this, but at
>>> least I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4
>>> (last tested a few months ago, Debian Jessie) by live migrating a
>>> domU that has multiple network interfaces, actively routing traffic
>>> over them, to it. *poof*, hypervisor reporting '(XEN) Domain 0
>>> crashed: 'noreboot' set - not rebooting.' *BOOM* everything gone.
>>
>> Can you post a full backtrace? Did you ever test with anything other
>> than a 4.9 kernel + 4.4 hypervisor?
>
> Did not re-test yet.
>
> Ah, I found my notes. It's a bit different. When just doing live
> migrate, it would upset the bnx2x driver or network card itself and I
> would lose network connectivity to the machine (and all other domUs).
> See attached bnx2x-crash.txt for console output while the poor thing
> is drowning and gasping for air.
>
> When disabling SR-IOV (which I do not use, but which was listed
> somewhere as a workaround for a similar problem, related to HP Shared
> Memory blah, so why not try it to see what happens) in the BIOS for
> the 10G card and then trying the same, the dom0 crashed immediately
> when the live migrated domU was resumed. See dom0-crash.txt. No trace
> or anything, it just disappears.

This shared memory is an HP-only thing, right? I think I saw some
recommendations to the reverse: to disable shared memory and enable
SR-IOV.

>> What does "actively routing traffic" mean in terms of packet
>> frequency, and did you test when there was no network traffic but
>> the interface was up?
>
> A Linux domU doing NAT with 1 external and 6 internal interfaces,
> having a conntrack table with ~20k entries of active traffic flows.
> However, not doing many pps and not using much bandwidth (between 0
> and 100 Mbit/s).
>
> Without any traffic it doesn't explode immediately. I think I could
> live migrate the inactive router of a stateful (conntrackd) pair.
>
>> A quick test with a 4.9 kernel + Xen 4.8 but not terribly heavy
>> network traffic did not duplicate this.
>
> I'll get around to reproducing this (or not being able to, with Xen
> 4.11 + Linux 4.17+ and maybe a newer bnx2x).
>
> Currently the network infra related domUs are still on Jessie (Xen
> 4.4 / Linux 3.16 dom0) hardware, also because of this one:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
>
> And while speaking of that, we've not seen this happen again with
> 4.17+ in the dom0, and the same openvswitch and Xen 4.11 version.

Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC?
I found some errors almost immediately with one of our network drivers
after doing so.
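In case it helps, a minimal config fragment for that (assuming a
reasonably recent kernel; exact option names can differ per version):

    CONFIG_DEBUG_PAGEALLOC=y
    # optional: enable the checks by default, without a boot parameter
    CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y

With only the first option set, you still have to boot with
debug_pagealloc=on to actually activate the checks. There is a real
performance cost, so it's better suited to a test box than a
production dom0.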