Re: [Xen-devel] HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote:
> On 31/05/13 11:54, George Dunlap wrote:
> >On 31/05/13 09:34, Diana Crisan wrote:
> >>George,
> >>
> >>On 30/05/13 17:06, George Dunlap wrote:
> >>>On 05/30/2013 04:55 PM, Diana Crisan wrote:
> >>>>On 30/05/13 16:26, George Dunlap wrote:
> >>>>>On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@xxxxxxxxxxxx> wrote:
> >>>>>>Hi,
> >>>>>>
> >>>>>>On 26/05/13 09:38, Ian Campbell wrote:
> >>>>>>>On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
> >>>>>>>>George,
> >>>>>>>>
> >>>>>>>>--On 24 May 2013 17:16:07 +0100 George Dunlap <George.Dunlap@xxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>>>FWIW it's reproducible on every host h/w platform we've tried (a total of 2).
> >>>>>>>>>Do you see the same effects if you do a local-host migrate?
> >>>>>>>>I hadn't even realised that was possible. That would have made testing live migrate easier!
> >>>>>>>That's basically the whole reason it is supported ;-)
> >>>>>>>
> >>>>>>>>How do you avoid the name clash in xen-store?
> >>>>>>>Most toolstacks receive the incoming migration into a domain named FOO-incoming or some such and then rename it to FOO upon completion. Some also rename the outgoing domain "FOO-migratedaway" towards the end, so that the bits of the final teardown which can safely happen after the target has started can be done then.
> >>>>>>>
> >>>>>>>Ian.
> >>>>>>>
> >>>>>>I am unsure what I am doing wrong, but I cannot seem to do a localhost migrate.
> >>>>>>
> >>>>>>I created a domU using "xl create xl.conf", and once it had fully booted I issued "xl migrate 11 localhost". This fails and gives the output below.
> >>>>>>
> >>>>>>Would you please advise on how to get this working?
> >>>>>>
> >>>>>>Thanks,
> >>>>>>Diana
> >>>>>>
> >>>>>>root@ubuntu:~# xl migrate 11 localhost
> >>>>>>root@localhost's password:
> >>>>>>migration target: Ready to receive domain.
> >>>>>>Saving to migration stream new xl format (info 0x0/0x0/2344)
> >>>>>>Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/2344)
> >>>>>> Savefile contains xl domain config
> >>>>>>xc: progress: Reloading memory pages: 53248/1048575 5%
> >>>>>>xc: progress: Reloading memory pages: 105472/1048575 10%
> >>>>>>libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 device model: spawn failed (rc=-3)
> >>>>>>libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device model did not start: -3
> >>>>>>libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
> >>>>>>migration target: Domain creation failed (code -3).
> >>>>>>libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
> >>>>>>libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [10934] exited with error status 3
> >>>>>>Migration failed, resuming at sender.
> >>>>>>xc: error: Cannot resume uncooperative HVM guests: Internal error
> >>>>>>libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed for domain 11: Success
> >>>>>Aha -- I managed to reproduce this one as well.
> >>>>>
> >>>>>Your problem is the "vncunused=0" -- that's instructing qemu "You must use this exact port for the vnc server". But when you do the migrate, that port is still in use by the "from" domain; so the qemu for the "to" domain can't get it, and fails.
> >>>>>
> >>>>>Obviously this should fail a lot more gracefully, but that's a bit of a lower-priority bug, I think.
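[The config-side workaround George describes can be sketched as the following xl guest-config fragment. This is illustrative, not from the thread; per xl.cfg, `vncunused` defaults to enabled, so simply dropping the `vncunused=0` line has the same effect.]

```
# Let qemu pick the first free VNC port (>= 5900) instead of pinning one,
# so the incoming domain's qemu does not collide with the sender's qemu
# during a localhost migrate.
vnc = 1
vncunused = 1
```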
> >>>>>
> >>>>> -George
> >>>>Yes, I managed to get to the bottom of it too and got VMs migrating on localhost on our end.
> >>>>
> >>>>I can confirm I did get the stuck-clock problem while doing a localhost migrate.
> >>>
> >>>Does the script I posted earlier "work" for you (i.e., does it fail after some number of migrations)?
> >>>
> >>I left your script running throughout the night and it seems that it does not always catch the problem. I see the following:
> >>
> >>1. The vm has its clock stuck.
> >>2. The script is still running, as the vm is still ping-able.
> >>3. Migration fails on the basis that the vm does not ack the suspend request (see below).
> >
> >So I wrote a script to run "date", sleep for 2 seconds, and run "date" a second time -- and eventually the *sleep* hung.
> >
> >The VM is still responsive, and I can log in; if I type "date" manually several times in succession I get an advancing clock, but if I type "sleep 1" it just hangs.
> >
> >If you run "dmesg" in the guest, do you see the following line?
> >
> >CE: Reprogramming failure. Giving up

I think this must be it; on my other box, I got the following messages:

[  224.732083] PM: late freeze of devices complete after 3.787 msecs
[  224.736062] Xen HVM callback vector for event delivery is enabled
[  224.736062] Xen Platform PCI: I/O protocol version 1
[  224.736062] xen: --> irq=8, pirq=16
[  224.736062] xen: --> irq=12, pirq=17
[  224.736062] xen: --> irq=1, pirq=18
[  224.736062] xen: --> irq=6, pirq=19
[  224.736062] xen: --> irq=4, pirq=20
[  224.736062] xen: --> irq=7, pirq=21
[  224.736062] xen: --> irq=28, pirq=22
[  224.736062] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
[  224.736062] PM: early restore of devices complete after 5.854 msecs
[  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
[  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[  224.746900] PM: restore of devices complete after 7.540 msecs
[  224.758612] Setting capacity to 16777216
[  224.758749] Setting capacity to 16777216
[  224.898426] ata2.01: NODEV after polling detection
[  224.900941] ata2.00: configured for MWDMA2
[  231.055978] CE: xen increased min_delta_ns to 150000 nsec
[  231.055986] hrtimer: interrupt took 14460 ns
[  247.893303] PM: freeze of devices complete after 2.168 msecs
[  247.893306] suspending xenstore...
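[George's probe can be sketched as a small shell script. This is a reconstruction from his description above, not his actual script:]

```shell
#!/bin/sh
# Reconstruction (assumption) of the probe described in the thread:
# take a timestamp, sleep, take another, and report the delta.
# On a healthy guest the delta is ~2s; on a guest whose clockevent
# device has given up, the sleep itself never returns.
t1=$(date +%s)
sleep 2
t2=$(date +%s)
echo "slept for $((t2 - t1))s"
```

[Run it in a loop around each migration; the symptom reported here is a hang of the sleep itself rather than a wrong delta, so a watchdog timeout on the probe is what actually catches it.]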
[  247.896977] PM: late freeze of devices complete after 3.666 msecs
[  247.900067] Xen HVM callback vector for event delivery is enabled
[  247.900067] Xen Platform PCI: I/O protocol version 1
[  247.900067] xen: --> irq=8, pirq=16
[  247.900067] xen: --> irq=12, pirq=17
[  247.900067] xen: --> irq=1, pirq=18
[  247.900067] xen: --> irq=6, pirq=19
[  247.900067] xen: --> irq=4, pirq=20
[  247.900067] xen: --> irq=7, pirq=21
[  247.900067] xen: --> irq=28, pirq=22
[  247.900067] ata_piix 0000:00:01.1: restoring config space at offset 0x1 (was 0x2800001, writing 0x2800005)
[  247.900067] PM: early restore of devices complete after 4.612 msecs
[  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
[  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
[  247.914770] PM: restore of devices complete after 8.762 msecs
[  247.926557] Setting capacity to 16777216
[  247.926661] Setting capacity to 16777216
[  248.066661] ata2.01: NODEV after polling detection
[  248.067326] CE: xen increased min_delta_ns to 225000 nsec
[  248.067344] CE: xen increased min_delta_ns to 337500 nsec
[  248.067361] CE: xen increased min_delta_ns to 506250 nsec
[  248.067378] CE: xen increased min_delta_ns to 759375 nsec
[  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
[  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
[  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
[  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
[  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
[  248.067466] CE: Reprogramming failure. Giving up
[  248.068075] ata2.00: configured for MWDMA2

Note the "CE: xen increased min_delta_ns to 150000 nsec" at 231 for the previous suspend; this time it increases min_delta_ns all the way up to 4 milliseconds before giving up.

Konrad, Stefano, any idea what's going on here?

VIRQ_TIMER is not being delivered. Aka, commit bee980d9e9642e96351fa3ca9077b853ecf62f57 ("xen/events: Handle VIRQ_TIMER before any other hardirq in event loop") should be back-ported but hasn't been yet. Let me put that on my TODO list.

>
> -George
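[For anyone who needs to check whether a given kernel tree already carries that commit before attempting a backport, a small helper (hypothetical, not part of any tool in the thread) can ask git directly:]

```shell
# has_virq_timer_fix <tree>: succeeds if the tree's HEAD already contains
# commit bee980d9e964 ("xen/events: Handle VIRQ_TIMER before any other
# hardirq in event loop"); if it fails, a cherry-pick/backport is needed.
FIX=bee980d9e9642e96351fa3ca9077b853ecf62f57
has_virq_timer_fix() {
    git -C "$1" merge-base --is-ancestor "$FIX" HEAD 2>/dev/null
}

# Example usage (the path is an assumption):
# has_virq_timer_fix /usr/src/linux && echo "fix present" || echo "needs backport"
```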