Re: [Xen-devel] HVM Migration of domU on Qemu-upstream DM causes stuck system clock with ACPI
On 31/05/13 12:41, Diana Crisan wrote:
On 31 May 2013, at 12:36, Diana Crisan <dcrisan@xxxxxxxxxxxx> wrote:
On 31 May 2013, at 11:54, George Dunlap <george.dunlap@xxxxxxxxxxxxx> wrote:
On 31/05/13 09:34, Diana Crisan wrote:
George,
On 30/05/13 17:06, George Dunlap wrote:
On 05/30/2013 04:55 PM, Diana Crisan wrote:
On 30/05/13 16:26, George Dunlap wrote:
On Tue, May 28, 2013 at 4:06 PM, Diana Crisan <dcrisan@xxxxxxxxxxxx> wrote:
Hi,
On 26/05/13 09:38, Ian Campbell wrote:
On Sat, 2013-05-25 at 11:18 +0100, Alex Bligh wrote:
George,
--On 24 May 2013 17:16:07 +0100 George Dunlap <George.Dunlap@xxxxxxxxxxxxx> wrote:

FWIW it's reproducible on every host h/w platform we've tried (a total of 2).

Do you see the same effects if you do a local-host migrate?

I hadn't even realised that was possible. That would have made testing live migrate easier!

That's basically the whole reason it is supported ;-)

How do you avoid the name clash in xen-store?

Most toolstacks receive the incoming migration into a domain named FOO-incoming or some such and then rename to FOO upon completion. Some also rename the outgoing domain "FOO-migratedaway" towards the end so that the bits of the final teardown which can safely happen after the target has started can be done so.

Ian.

I am unsure what I am doing wrong, but I cannot seem to be able to do a localhost migrate. I created a domU using "xl create xl.conf" and once it fully booted I issued an "xl migrate 11 localhost". This fails and gives the output below. Would you please advise on how to get this working?

Thanks,
Diana

root@ubuntu:~# xl migrate 11 localhost
root@localhost's password:
migration target: Ready to receive domain.
Saving to migration stream new xl format (info 0x0/0x0/2344)
Loading new save file <incoming migration stream> (new xl fmt info 0x0/0x0/2344)
Savefile contains xl domain config
xc: progress: Reloading memory pages: 53248/1048575 5%
xc: progress: Reloading memory pages: 105472/1048575 10%
libxl: error: libxl_dm.c:1280:device_model_spawn_outcome: domain 12 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1091:domcreate_devmodel_started: device model did not start: -3
libxl: error: libxl_dm.c:1311:libxl__destroy_device_model: Device Model already exited
migration target: Domain creation failed (code -3).
libxl: error: libxl_utils.c:393:libxl_read_exactly: file/stream truncated reading ready message from migration receiver stream
libxl: info: libxl_exec.c:118:libxl_report_child_exitstatus: migration target process [10934] exited with error status 3
Migration failed, resuming at sender.
xc: error: Cannot resume uncooperative HVM guests: Internal error
libxl: error: libxl.c:404:libxl__domain_resume: xc_domain_resume failed for domain 11: Success

Aha -- I managed to reproduce this one as well. Your problem is the "vncunused=0" -- that's instructing qemu "You must use this exact port for the vnc server". But when you do the migrate, that port is still in use by the "from" domain; so the qemu for the "to" domain can't get it, and fails.

Obviously this should fail a lot more gracefully, but that's a bit of a lower-priority bug I think.

-George

Yes, I managed to get to the bottom of it too and got vms migrating on localhost on our end. I can confirm I did get the clock stuck problem while doing a localhost migrate.

Does the script I posted earlier "work" for you (i.e., does it fail after some number of migrations)?

I left your script running throughout the night and it seems that it does not always catch the problem. I see the following:

1. vm has the clock stuck
2. script is still running as it seems the vm is still ping-able.
3. migration fails on the basis that the vm does not ack the suspend request (see below).

So I wrote a script to run "date", sleep for 2 seconds, and run "date" a second time -- and eventually the *sleep* hung. The VM is still responsive, and I can log in; if I type "date" manually successive times then I get an advancing clock, but if I type "sleep 1" it just hangs.

If you run "dmesg" in the guest, do you see the following line?

CE: Reprogramming failure. Giving up

I do. It is preceded by:

CE: xen increased min_delta_ns to 4000000 nsec

It seems that it is always getting stuck when the min_delta_ns is set to 4mil nsec. Could this be it? Overflow perhaps?

No -- Linux is asking, "Can you give me an alarm in 5ns?" And Xen is saying, "No". So Linux is saying, "OK, how about 5us? 10us? 20us?" By the time it reaches 4ms, Linux has had enough, and says, "If this timer is so bad that it can't give me an event within 4ms I just won't use timers at all, thank you very much."

The problem appears to be that Linux thinks it's asking for something in the future, but is actually asking for something in the past. It must look at its watch just before the final domain pause, and then ask for the timer just after the migration resumes on the other side. So it doesn't realize that 10ms (or something) has already passed, and that it's actually asking for a timer in the past.

The Xen timer driver in Linux specifically asks Xen to return an error for times set in the past. Xen is returning an error because the time is in the past, but Linux thinks it's getting an error because the time is too close in the future, and tries asking a little further away.

Unfortunately I think this is something which needs to be fixed on the Linux side; I don't really see how we can work around it in Xen.

-George
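To make the failure mode above concrete, here is a toy model of the exchange. It is only a sketch of the behaviour described in this thread, not the kernel or Xen code: the names `hypervisor_set_timer` and `program_timer` are made up for illustration, the doubling step follows the "5us? 10us? 20us?" description rather than the kernel's actual growth rule, and the 4ms cap is simply the value from the dmesg line.

```python
# Toy model of the timer negotiation described in this thread.
# All names here are hypothetical; this is not the Linux or Xen source.

GIVE_UP_NS = 4_000_000  # 4 ms, the min_delta_ns value seen in dmesg above


def hypervisor_set_timer(expiry_ns, real_now_ns):
    """Stand-in for Xen: refuse any deadline that is not in the future."""
    return expiry_ns > real_now_ns


def program_timer(guest_now_ns, real_now_ns, first_delta_ns=5_000):
    """Ask for a timer at guest_now + delta, backing off on each refusal.

    guest_now_ns is what the guest believes the time is (read before the
    final pause); real_now_ns is the actual time after the resume.
    Returns (ok, deltas_tried).
    """
    delta = first_delta_ns
    tried = []
    while True:
        tried.append(delta)
        if hypervisor_set_timer(guest_now_ns + delta, real_now_ns):
            return True, tried
        if delta >= GIVE_UP_NS:
            # Corresponds to "CE: Reprogramming failure. Giving up"
            return False, tried
        # Assumed simplification: double the delta, capped at the limit.
        delta = min(delta * 2, GIVE_UP_NS)


if __name__ == "__main__":
    # Guest clock reading is 10 ms stale, as in the "10ms (or something)" case.
    ok, tried = program_timer(guest_now_ns=0, real_now_ns=10_000_000)
    print("stale by 10 ms:", "ok" if ok else "gave up", tried)

    # Guest clock reading is only 15 us stale: the backoff recovers.
    ok, tried = program_timer(guest_now_ns=0, real_now_ns=15_000)
    print("stale by 15 us:", "ok" if ok else "gave up", tried)
```

In this model the backoff only helps when the stale gap is smaller than the cap. Once the gap exceeds 4ms, every request lands in the past no matter how far the delta is stretched, which matches the "increased min_delta_ns to 4000000 nsec" line followed by the give-up message.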