[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI



On Fri, May 31, 2013 at 11:59:22AM +0100, George Dunlap wrote:
> On 31/05/13 11:54, George Dunlap wrote:
> >On 31/05/13 09:34, Diana Crisan wrote:
> >>George,
> >>On 30/05/13 17:06, George Dunlap wrote:
> >>>On 05/30/2013 04:55 PM, Diana Crisan wrote:
> >>>>On 30/05/13 16:26, George Dunlap wrote:
> >>>>>On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@xxxxxxxxxxxx>
> >>>>>wrote:
> >>>>>>Hi,
> >>>>>>
> >>>>>>
> >>>>>>On 26/05/13 09:38, Ian Campbell wrote:
> >>>>>>>On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
> >>>>>>>>George,
> >>>>>>>>
> >>>>>>>>--On 24 May 2013 17:16:07 +0100 George Dunlap
> >>>>>>>><George.Dunlap@xxxxxxxxxxxxx>
> >>>>>>>>wrote:
> >>>>>>>>
> >>>>>>>>>>FWIW it's reproducible on every host h/w platform we've tried
> >>>>>>>>>>(a total of 2).
> >>>>>>>>>Do you see the same effects if you do a local-host migrate?
> >>>>>>>>I hadn't even realised that was possible. That would
> >>>>>>>>have made testing
> >>>>>>>>live
> >>>>>>>>migrate easier!
> >>>>>>>That's basically the whole reason it is supported ;-)
> >>>>>>>
> >>>>>>>>How do you avoid the name clash in xen-store?
> >>>>>>>Most toolstacks receive the incoming migration into a domain named
> >>>>>>>FOO-incoming or some such and then rename to FOO upon
> >>>>>>>completion. Some
> >>>>>>>also rename the outgoing domain "FOO-migratedaway"
> >>>>>>>towards the end so
> >>>>>>>that the bits of the final teardown which can safely
> >>>>>>>happen after the
> >>>>>>>target have start can be done so.
> >>>>>>>
> >>>>>>>Ian.
> >>>>>>>
> >>>>>>>
> >>>>>>I am unsure what I am doing wrong, but I cannot seem to
> >>>>>>be able to do a
> >>>>>>localhost migrate.
> >>>>>>
> >>>>>>I created a domU using "xl create xl.conf" and once it
> >>>>>>fully booted I
> >>>>>>issued
> >>>>>>an "xl migrate 11 localhost". This fails and gives the output below.
> >>>>>>
> >>>>>>Would you please advise on how to get this working?
> >>>>>>
> >>>>>>Thanks,
> >>>>>>Diana
> >>>>>>
> >>>>>>
> >>>>>>root@ubuntu:~# xl migrate 11 localhost
> >>>>>>root@localhost's password:
> >>>>>>migration target: Ready to receive domain.
> >>>>>>Saving to migration stream new xl format (info 0x0/0x0/2344)
> >>>>>>Loading new save file <incoming migration stream> (new xl fmt info
> >>>>>>0x0/0x0/2344)
> >>>>>>  Savefile contains xl domain config
> >>>>>>xc: progress: Reloading memory pages: 53248/1048575 5%
> >>>>>>xc: progress: Reloading memory pages: 105472/1048575 10%
> >>>>>>libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12
> >>>>>>device
> >>>>>>model: spawn failed (rc=-3)
> >>>>>>libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device
> >>>>>>model
> >>>>>>did not start: -3
> >>>>>>libxl: error:
> >>>>>>libxl_dm.c:1311:libxl__destroy_device_model: Device
> >>>>>>Model
> >>>>>>already exited
> >>>>>>migration target: Domain creation failed (code -3).
> >>>>>>libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream
> >>>>>>truncated
> >>>>>>reading ready message from migration receiver stream
> >>>>>>libxl: info:
> >>>>>>libxl_exec.c:118:libxl_report_child_exitstatus:
> >>>>>>migration
> >>>>>>target process [10934] exited with error status 3
> >>>>>>Migration failed, resuming at sender.
> >>>>>>xc: error: Cannot resume uncooperative HVM guests: Internal error
> >>>>>>libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume
> >>>>>>failed for
> >>>>>>domain 11: Success
> >>>>>Aha -- I managed to reproduce this one as well.
> >>>>>
> >>>>>Your problem is the "vncunused=0" -- that's instructing
> >>>>>qemu "You must
> >>>>>use this exact port for the vnc server".  But when you do
> >>>>>the migrate,
> >>>>>that port is still in use by the "from" domain; so the qemu for the
> >>>>>"to" domain can't get it, and fails.
> >>>>>
> >>>>>Obviously this should fail a lot more gracefully, but that's a bit of
> >>>>>a lower-priority bug I think.
> >>>>>
> >>>>>  -George
> >>>>Yes, I managed to get to the bottom of it too and got vms migrating on
> >>>>localhost on our end.
> >>>>
> >>>>I can confirm I did get the clock stuck problem while doing
> >>>>a localhost
> >>>>migrate.
> >>>
> >>>Does the script I posted earlier "work" for you (i.e., does it
> >>>fail after some number of migrations)?
> >>>
> >>
> >>I left your script running throughout the night and it seems
> >>that it does not always catch the problem. I see the following:
> >>
> >>1. vm has the clock stuck
> >>2. script is still running as it seems the vm is still ping-able.
> >>3. migration fails on the basis that the vm is does not ack the
> >>suspend request (see below).
> >
> >So I wrote a script to run "date", sleep for 2 seconds, and run
> >"date" a second time -- and eventually the *sleep* hung.
> >
> >The VM is still responsive, and I can log in; if I type "date"
> >manually successive times then I get an advancing clock, but if I
> >type "sleep 1" it just hangs.
> >
> >If you run "dmesg" in the guest, do you see the following line?
> >
> >CE: Reprogramming failure. Giving up
> 
> I think this must be it; on my other box, I got the following messages:
> 
> [  224.732083] PM: late freeze of devices complete after 3.787 msecs
> [  224.736062] Xen HVM callback vector for event delivery is enabled
> [  224.736062] Xen Platform PCI: I/O protocol version 1
> [  224.736062] xen: --> irq=8, pirq=16
> [  224.736062] xen: --> irq=12, pirq=17
> [  224.736062] xen: --> irq=1, pirq=18
> [  224.736062] xen: --> irq=6, pirq=19
> [  224.736062] xen: --> irq=4, pirq=20
> [  224.736062] xen: --> irq=7, pirq=21
> [  224.736062] xen: --> irq=28, pirq=22
> [  224.736062] ata_piix 0000:00:01.1: restoring config space at
> offset 0x1 (was 0x2800001, writing 0x2800005)
> [  224.736062] PM: early restore of devices complete after 5.854 msecs
> [  224.739692] ata_piix 0000:00:01.1: setting latency timer to 64
> [  224.739782] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
> (level, low) -> IRQ 28
> [  224.746900] PM: restore of devices complete after 7.540 msecs
> [  224.758612] Setting capacity to 16777216
> [  224.758749] Setting capacity to 16777216
> [  224.898426] ata2.01: NODEV after polling detection
> [  224.900941] ata2.00: configured for MWDMA2
> [  231.055978] CE: xen increased min_delta_ns to 150000 nsec
> [  231.055986] hrtimer: interrupt took 14460 ns
> [  247.893303] PM: freeze of devices complete after 2.168 msecs
> [  247.893306] suspending xenstore...
> [  247.896977] PM: late freeze of devices complete after 3.666 msecs
> [  247.900067] Xen HVM callback vector for event delivery is enabled
> [  247.900067] Xen Platform PCI: I/O protocol version 1
> [  247.900067] xen: --> irq=8, pirq=16
> [  247.900067] xen: --> irq=12, pirq=17
> [  247.900067] xen: --> irq=1, pirq=18
> [  247.900067] xen: --> irq=6, pirq=19
> [  247.900067] xen: --> irq=4, pirq=20
> [  247.900067] xen: --> irq=7, pirq=21
> [  247.900067] xen: --> irq=28, pirq=22
> [  247.900067] ata_piix 0000:00:01.1: restoring config space at
> offset 0x1 (was 0x2800001, writing 0x2800005)
> [  247.900067] PM: early restore of devices complete after 4.612 msecs
> [  247.906454] ata_piix 0000:00:01.1: setting latency timer to 64
> [  247.906558] xen-platform-pci 0000:00:03.0: PCI INT A -> GSI 28
> (level, low) -> IRQ 28
> [  247.914770] PM: restore of devices complete after 8.762 msecs
> [  247.926557] Setting capacity to 16777216
> [  247.926661] Setting capacity to 16777216
> [  248.066661] ata2.01: NODEV after polling detection
> [  248.067326] CE: xen increased min_delta_ns to 225000 nsec
> [  248.067344] CE: xen increased min_delta_ns to 337500 nsec
> [  248.067361] CE: xen increased min_delta_ns to 506250 nsec
> [  248.067378] CE: xen increased min_delta_ns to 759375 nsec
> [  248.067396] CE: xen increased min_delta_ns to 1139062 nsec
> [  248.067413] CE: xen increased min_delta_ns to 1708593 nsec
> [  248.067428] CE: xen increased min_delta_ns to 2562889 nsec
> [  248.067441] CE: xen increased min_delta_ns to 3844333 nsec
> [  248.067453] CE: xen increased min_delta_ns to 4000000 nsec
> [  248.067466] CE: Reprogramming failure. Giving up
> [  248.068075] ata2.00: configured for MWDMA2
> 
> Note the "CE: xen increased min_delta_ns to 150000nsec" at 231 for
> the previous suspend, and now it's increasing it up to 4
> milliseconds before giving up for this suspend.
> 
> Konrad, stefano, any idea what's going on here?

VIRQ_TIMER not being delievered. Aka this commit

bee980d9e9642e96351fa3ca9077b853ecf62f57
xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.

should be back-ported but didn't yet. Let me put that
on my TODO list.
> 
>  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.